How can I get the most frequent 100 numbers out of 4,000,000,000 numbers?
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
Finding the 100 most frequent values in 4 billion numbers is a frequency-counting problem, not a sorting problem. The right design depends less on the row count and more on two practical questions: how many distinct numbers exist, and whether the answer must be exact.
If the distinct value count is small enough, one pass with a frequency table is enough. If the distinct count is huge, you need to partition the work so no single process has to hold every counter at once.
Separate Counting from Ranking
The first optimization is conceptual. You do not need to sort 4 billion inputs. You need to count occurrences and then keep only the top 100 counts.
For a manageable number of distinct values, a streaming counter plus a fixed-size heap is exact and efficient:
This works because the expensive part is storing the frequency table, not storing the top 100 results. The heap stays tiny. The counter is what determines whether this strategy fits in memory.
Partition When the Distinct Count Is Too Large
If the number of unique values is too large for one in-memory counter, split the input into buckets first. The same value must always go to the same bucket, so hashing works well.
This design turns one impossible counting task into many smaller exact ones. It also maps cleanly onto distributed systems because each bucket can be processed independently.
Use an Array When the Range Is Small
Sometimes the input volume is huge but the numeric range is tiny. In that case, an array is faster and usually more memory efficient than a hash map.
This works well for ranges such as event codes or bounded identifiers. It is a poor choice when values are sparse across a huge range, because the array would waste memory on values that never appear.
Decide Early Whether Approximation Is Allowed
If the exact top 100 is not required, heavy-hitter algorithms can reduce memory usage dramatically. Techniques such as Misra-Gries or Count-Min Sketch are useful when the goal is to detect the dominant values rather than report exact counts for every candidate.
That is not just an implementation detail. It is a product requirement. If the answer feeds billing, auditing, or a customer-facing report, approximation may be unacceptable. If the answer feeds dashboard exploration or anomaly detection, it may be a perfectly reasonable tradeoff.
Common Pitfalls
The biggest mistake is sorting all 4 billion numbers directly. Sorting solves a different problem and usually costs more CPU, disk, and memory than counting frequencies first.
Another mistake is ignoring distinct cardinality. Four billion rows can still be easy if there are only a few thousand unique values, and still be hard if almost every value is different.
I/O is also a real bottleneck at this scale. A good algorithm can still perform badly if it reads inefficiently, writes too many temporary files, or runs far from the data source.
Finally, do not choose an approximate algorithm by accident. If the result must be exact, partitioning and exact counting are the safer design even when they are operationally heavier.
Summary
- Treat the task as frequency counting followed by top-100 ranking.
- Use a small heap for the final ranking step instead of sorting every candidate.
- Count in memory only if the distinct value set fits comfortably.
- Partition by hash when exact counts do not fit in one process.
- Use dense arrays only when the numeric range is genuinely small.

