Finding the hundred largest numbers in a file of a billion
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
If a file contains a billion numbers, sorting the entire file just to keep the top hundred is wasteful. The standard solution is to stream through the file once while maintaining a min-heap of size 100, which keeps memory use tiny and runtime proportional to the file size.
Why a min-heap is the right data structure
You only need the largest 100 values, not a full sorted copy of all input. A min-heap of size 100 is perfect because:
- the smallest value in the current top
100is always at the heap root - values smaller than that root can be ignored immediately
- each accepted update costs only
O(log 100), which is effectively constant
That makes the full pass over the file O(n log 100), which is effectively O(n) for practical purposes.
Streaming algorithm
The algorithm is:
- read numbers one by one
- push the first
100into the heap - for each later number, compare it with the smallest of the current top
100 - if it is larger, replace the heap root
Here is a Python example:
This reads the file once and never stores more than 100 candidate values at a time.
Why full sorting is the wrong first idea
A naïve solution would read all billion values into memory, sort them, and take the largest 100. That has obvious problems:
- memory usage becomes enormous
- sorting cost is much higher than needed
- you do extra work for values that can never reach the top
100
Even if the file fits on disk comfortably, that does not mean it belongs in RAM all at once.
Handling chunked or malformed input
Real files are not always clean. You may need to skip blanks, report bad rows, or parse multiple values per line.
This preserves the same heap strategy while making the parser more robust.
When external sorting is useful instead
If you need the top million values, global ordering, or a full sorted output file, external sorting becomes more relevant. That typically means:
- read manageable chunks
- sort each chunk
- write temporary sorted files
- merge them
But for "top 100 only," external sort is overkill. The heap approach is simpler and faster because it never keeps irrelevant values.
Parallel and distributed variants
If the input is split across many files or machines, the same idea scales naturally:
- compute local top
100for each partition - merge those partial results with another heap
This works because the global top 100 must come from the union of the local top candidates.
That is the same principle used in many large-scale data systems: push the top-k reduction close to the data, then merge compact summaries.
Common Pitfalls
The biggest mistake is sorting everything. That solves the problem, but it ignores the fact that you only need 100 values and wastes memory and time.
Another issue is using a max-heap instead of a min-heap for the retained top set. With a min-heap, the smallest kept value is always easy to compare against. That is exactly what the algorithm needs.
Developers also forget to convert the final heap into descending order before returning or printing it. A heap is not a sorted list; it only guarantees the heap property.
Finally, be careful with parsing overhead. If the file format is more complex than one integer per line, parsing can dominate runtime, so the top-k logic may not be the real bottleneck.
Summary
- Use a min-heap of size
100and stream through the file once. - The heap root tracks the smallest of the current top
100values. - Ignore any new number that is not larger than that root.
- This uses tiny memory and avoids full-file sorting.
- External sorting is useful for global ordering, not for a small top-k result.

