Get the MD5 hash of big files in Python

Introduction

To hash a large file in Python, do not read the entire file into memory. The standard pattern is to create a hash object, read the file in chunks, and update the hash incrementally.

This approach is simple and keeps memory use constant regardless of file size. It works for MD5, SHA-256, and the other hashlib algorithms, since every hash object exposes the same update() method.

Stream the File in Chunks

Here is the core pattern for MD5:

python

import hashlib


def md5_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at path."""
    h = hashlib.md5()
    with open(path, "rb") as f:  # binary mode: hash the raw bytes
        while True:
            chunk = f.read(chunk_size)  # at most chunk_size bytes
            if not chunk:  # empty bytes means end of file
                break
            h.update(chunk)
    return h.hexdigest()


print(md5_file("big.iso"))

This keeps memory usage bounded because only one chunk is held at a time.

The exact chunk size is not critical. Something between a few kilobytes and a few megabytes is usually fine.
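As a quick sanity check that the chunk size affects only memory use and not the result, the sketch below hashes the same temporary file with several chunk sizes. The helper `digests_for_chunk_sizes` is a hypothetical name introduced for illustration, and the chunk sizes are arbitrary choices.

```python
import hashlib
import os
import tempfile


def md5_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading chunk_size bytes at a time."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def digests_for_chunk_sizes(data, chunk_sizes):
    """Write data to a temporary file and hash it once per chunk size.

    Hypothetical helper for illustration; returns {chunk_size: hex digest}.
    """
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return {size: md5_file(path, size) for size in chunk_sizes}
    finally:
        os.remove(path)
```

Calling `digests_for_chunk_sizes(os.urandom(300_000), [7, 4096, 1024 * 1024])` should return the same digest for every chunk size, and that digest should match `hashlib.md5(data).hexdigest()` computed on the bytes in one call.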

Why Chunking Matters

If you do this instead:

python

import hashlib

# Anti-pattern: f.read() with no size loads the whole file at once.
with open("big.iso", "rb") as f:
    data = f.read()
    digest = hashlib.md5(data).hexdigest()

Python tries to load the full file into memory first. That may be fine for small files, but it becomes wasteful or impossible for large ones.

Streaming avoids that problem entirely.

A More Compact Modern Pattern

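You can write the chunk loop more compactly with the two-argument form of iter, which calls a function repeatedly until it returns a sentinel value. Here the sentinel is b"", the empty bytes object that read returns at end of file.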

python

import hashlib


def md5_file(path, chunk_size=1024 * 1024):
    h = hashlib.md5()
    with open(path, "rb") as f:
        # iter() keeps calling f.read(chunk_size) until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

This is functionally the same as the explicit while loop. Use whichever style your codebase finds clearer.
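On Python 3.11 and later, the standard library also provides hashlib.file_digest, which performs the chunked reads internally. The sketch below uses it when available and otherwise falls back to a loop written with the assignment expression (walrus) operator, which needs Python 3.8 or later. Treat it as one possible arrangement rather than the required one; the fallback structure is an assumption made here for portability.

```python
import hashlib


def md5_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file without loading it all into memory."""
    with open(path, "rb") as f:
        if hasattr(hashlib, "file_digest"):
            # Python 3.11+: the standard library handles the chunked reads.
            return hashlib.file_digest(f, "md5").hexdigest()
        # Older versions: explicit chunk loop using the walrus operator.
        h = hashlib.md5()
        while chunk := f.read(chunk_size):
            h.update(chunk)
        return h.hexdigest()
```

Note that hashlib.file_digest expects a file object opened in binary mode, so the "rb" rule still applies.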

Command-Line Style Utility Example

For scripts, it is often useful to hash multiple files.

python

import hashlib
import sys
from pathlib import Path


def md5_file(path, chunk_size=1024 * 1024):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


for arg in sys.argv[1:]:
    p = Path(arg)
    print(md5_file(p), p)

This prints one digest and path per line, much like md5sum and similar checksum tools on Unix-like systems.

Security Note About MD5

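MD5 is still common for detecting accidental corruption and finding duplicate files, but practical collision attacks against it exist. It is not recommended where someone could craft the input on purpose, such as verifying files from untrusted sources or any other cryptographic verification.

If you need a checksum for trust or security, use SHA-256 instead. Password storage is a separate problem: it calls for a dedicated key-derivation function, not a plain file hash of either kind.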

python

import hashlib


def sha256_file(path, chunk_size=1024 * 1024):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

The file-reading pattern is identical. Only the algorithm changes.
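Since only the algorithm changes, one way to avoid duplicating the function is to pass the algorithm name in and let hashlib.new build the hash object. This is a minimal sketch; the name `hash_file` and its `algorithm` parameter are hypothetical, and it assumes a fixed-length algorithm (the SHAKE variants need a length argument to hexdigest and would not work as written).

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file using a fixed-length hashlib algorithm.

    Hypothetical helper: `algorithm` is any name accepted by hashlib.new,
    for example "md5", "sha1", "sha256", or "sha512".
    """
    h = hashlib.new(algorithm)  # raises ValueError for an unsupported name
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

For example, `hash_file("big.iso")` would give the SHA-256 digest and `hash_file("big.iso", "sha512")` the SHA-512 digest of the same file.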

Performance Considerations

For ordinary file hashing, Python's hashlib implementation is usually fast enough. The bottleneck is often disk I/O rather than Python itself.

If you are hashing many files, you can improve throughput by:

avoiding tiny chunk sizes
reading from fast storage
parallelizing across files when storage and CPU allow it

But for one large file, the standard chunked loop is already the correct baseline solution.
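For the many-files case, parallelizing across files can be sketched with a thread pool. Threads are a plausible fit because hashlib releases the GIL while hashing buffers larger than 2047 bytes, so several files can be hashed at once. The function name `md5_many` and the default of four workers are assumptions for illustration; whether this helps at all depends on the storage device and should be measured.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def md5_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of one file using chunked reads."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def md5_many(paths, max_workers=4):
    """Hash several files concurrently and return {path: hex digest}.

    Hypothetical helper; max_workers=4 is an arbitrary starting point.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(paths, pool.map(md5_file, paths)))
```

On a single spinning disk, concurrent reads may be slower than sequential ones because of seek overhead, so it is worth comparing against a plain loop before adopting this.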

Common Pitfalls

The biggest mistake is reading the entire file into memory before hashing it.

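Another common issue is opening the file in text mode instead of binary mode. Text mode decodes the bytes to str and may translate line endings, and in Python 3 update() rejects str with a TypeError. Always use "rb" for checksum work.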
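To see how the text-mode mistake fails in practice, the sketch below deliberately opens a file with "r" and reports which exception results. The function `try_text_mode_hash` is a hypothetical name used only for this demonstration, and UTF-8 is an assumed encoding.

```python
import hashlib


def try_text_mode_hash(path):
    """Try to hash a file opened in text mode; return the exception name.

    Hypothetical demonstration helper. Returns None if no error occurred.
    """
    h = hashlib.md5()
    try:
        with open(path, "r", encoding="utf-8") as f:
            # f.read() returns str here, which update() does not accept.
            h.update(f.read())
    except (TypeError, UnicodeDecodeError) as exc:
        return type(exc).__name__
    return None
```

A file containing valid UTF-8 text fails at update() with a TypeError, while a binary file with bytes that are not valid UTF-8 fails earlier, at read(), with a UnicodeDecodeError.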

People also reach for MD5 in security decisions where they actually need a stronger hash such as SHA-256.

Finally, do not over-optimize chunk size prematurely. Start with a reasonable default and change it only if measurement shows a benefit.
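If you do want to measure, a rough timing harness could look like the following. The names `time_md5` and `compare_chunk_sizes` and the three candidate sizes are assumptions for illustration, and a single pass per size is only a coarse signal.

```python
import hashlib
import time


def time_md5(path, chunk_size):
    """Return (elapsed seconds, hex digest) for one chunked MD5 pass."""
    start = time.perf_counter()
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return time.perf_counter() - start, h.hexdigest()


def compare_chunk_sizes(path, sizes=(4 * 1024, 64 * 1024, 1024 * 1024)):
    """Time one pass per chunk size and return {chunk_size: seconds}.

    Hypothetical helper; the candidate sizes are arbitrary.
    """
    return {size: time_md5(path, size)[0] for size in sizes}
```

The first pass over a file often reads from disk while later passes hit the operating system's page cache, so repeat the runs and vary the order before drawing conclusions.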

Summary

Use hashlib with chunked reads for large files.
Open files in binary mode with "rb".
Update the hash incrementally instead of reading the whole file at once.
The same pattern works for MD5, SHA-256, and similar algorithms.
MD5 is fine for checksums, but not for modern security use.
Disk I/O is often the real bottleneck, not the hash loop itself.