Get the MD5 hash of big files in Python
Introduction
To hash a large file in Python, do not read the entire file into memory. The standard pattern is to create a hash object, read the file in chunks, and update the hash incrementally.
That approach is both memory-efficient and simple. It works for MD5, SHA-256, and the rest of Python's hashlib algorithms.
Stream the File in Chunks
Here is the core pattern for MD5:
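A minimal sketch of that pattern follows. The helper name md5_of_file and the 1 MiB default chunk size are assumptions chosen for illustration, not fixed requirements.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    # Binary mode: hash the raw bytes exactly as they are stored on disk.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # b"" signals end of file
                break
            digest.update(chunk)
    return digest.hexdigest()
```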
This keeps memory usage bounded because only one chunk is held at a time.
The exact chunk size is not critical. Something between a few kilobytes and a few megabytes is usually fine.
Why Chunking Matters
If you do this instead:
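For contrast, the non-streaming version might look like the sketch below. The helper name md5_whole_file is hypothetical.

```python
import hashlib


def md5_whole_file(path):
    """Hash a file by reading it all at once (reasonable for small files only)."""
    with open(path, "rb") as f:
        data = f.read()  # the entire file becomes one bytes object in memory
    return hashlib.md5(data).hexdigest()
```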
Python tries to load the full file into memory first. That may be fine for small files, but it becomes wasteful or impossible for large ones.
Streaming avoids that problem entirely.
A More Compact Modern Pattern
You can write the chunk loop more compactly with iter and a sentinel.
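One way to write it is sketched below. The two-argument form iter(callable, sentinel) keeps calling the callable until it returns the sentinel, which here is the empty bytes object that read returns at end of file. The function name and default chunk size are again illustrative assumptions.

```python
import hashlib
from functools import partial


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file using iter() with a sentinel."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # partial(f.read, chunk_size) is a zero-argument callable;
        # iteration stops as soon as it returns b"".
        for chunk in iter(partial(f.read, chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```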
This is functionally the same as the explicit while loop. Use whichever style your codebase finds clearer.
Command-Line Style Utility Example
For scripts, it is often useful to hash multiple files.
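A small sketch of such a utility is shown below. The function names md5_of_file and main, the two-space output format, and the choice to continue past unreadable files are assumptions made for this example.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def main(paths):
    """Print '<digest>  <path>' for each file; return 1 if any file failed."""
    status = 0
    for path in paths:
        try:
            print(f"{md5_of_file(path)}  {path}")
        except OSError as exc:
            # Report unreadable files on stderr but keep going with the rest.
            print(f"{path}: {exc}", file=sys.stderr)
            status = 1
    return status
```

To use this as a script, call main(sys.argv[1:]) under the usual if __name__ == "__main__": guard and pass its return value to sys.exit.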
This behaves similarly to checksum tools such as md5sum on Unix-like systems.
Security Note About MD5
MD5 is still common for detecting accidental file corruption and for duplicate detection. However, practical collision attacks against it exist, so it is not recommended for security-sensitive uses such as password hashing or cryptographic verification.
If you need a stronger checksum for trust or security, use SHA-256 instead.
The file-reading pattern is identical. Only the algorithm changes.
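As a sketch of how little changes, the helper below takes the algorithm name as a parameter and passes it to hashlib.new. The name hash_file and its defaults are assumptions for illustration.

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file using any algorithm hashlib supports."""
    # hashlib.new raises ValueError if the algorithm name is not available.
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling hash_file(path) gives a SHA-256 digest, while hash_file(path, "md5") reproduces the MD5 result from the same loop.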
Performance Considerations
For ordinary file hashing, Python's hashlib implementation is usually fast enough. The bottleneck is often disk I/O rather than Python itself.
If you are hashing many files, you can improve throughput by:
- avoiding tiny chunk sizes
- reading from fast storage
- parallelizing across files when storage and CPU allow it
But for one large file, the standard chunked loop is already the correct baseline solution.
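For the many-files case, one possible way to parallelize across files is a thread pool, sketched below. Threads can help here because hashlib releases the GIL while hashing larger buffers, but whether this is actually faster depends on the storage and should be measured. The name md5_many and the default of four workers are assumptions.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_many(paths, max_workers=4):
    """Return {path: hex digest}, hashing the files concurrently in threads."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `paths`.
        return dict(zip(paths, pool.map(md5_of_file, paths)))
```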
Common Pitfalls
The biggest mistake is reading the entire file into memory before hashing it.
Another common issue is opening the file in text mode instead of binary mode. Always use "rb" for checksum work.
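The short sketch below illustrates the difference on a throwaway temporary file. In text mode, read returns a str, which hashlib refuses to hash. In binary mode it returns the raw bytes. The variable names are illustrative only.

```python
import hashlib
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line one\r\nline two\r\n")
    path = tmp.name

try:
    # Text mode: read() returns str, which update() rejects.
    with open(path, "r", encoding="utf-8") as f:
        try:
            hashlib.md5().update(f.read())
            text_mode_failed = False
        except TypeError:
            text_mode_failed = True  # hashlib accepts bytes-like objects only

    # Binary mode: read() returns the exact bytes stored on disk.
    with open(path, "rb") as f:
        binary_digest = hashlib.md5(f.read()).hexdigest()
finally:
    os.remove(path)
```

Encoding the text back to bytes is not a reliable workaround, because text mode may already have translated line endings, so the digest would no longer describe the file as stored.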
People also use MD5 for security decisions when they actually need a stronger hash such as SHA-256.
Finally, do not over-optimize chunk size prematurely. Start with a reasonable default and change it only if measurement shows a benefit.
Summary
- Use hashlib with chunked reads for large files.
- Open files in binary mode with "rb".
- Update the hash incrementally instead of reading the whole file at once.
- The same pattern works for MD5, SHA-256, and similar algorithms.
- MD5 is fine for checksums, but not for modern security use.
- Disk I/O is often the real bottleneck, not the hash loop itself.

