Finding the hundred largest numbers in a file of a billion

algorithm

data-processing

large-scale-computation

file-handling

number-sorting

Finding the hundred largest numbers in a file of a billion

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

If a file contains a billion numbers, sorting the entire file just to keep the top hundred is wasteful. The standard solution is to stream through the file once while maintaining a min-heap of size 100, which keeps memory use tiny and runtime proportional to the file size.

Why a min-heap is the right data structure

You only need the largest 100 values, not a full sorted copy of all input. A min-heap of size 100 is perfect because:

the smallest value in the current top 100 is always at the heap root
values smaller than that root can be ignored immediately
each accepted update costs only O(log 100), which is effectively constant

That makes the full pass over the file O(n log 100), which is effectively O(n) for practical purposes.

Streaming algorithm

The algorithm is:

read numbers one by one
push the first 100 into the heap
for each later number, compare it with the smallest of the current top 100
if it is larger, replace the heap root

Here is a Python example:

python

1import heapq
2
3def top_k_numbers(path, k=100):
4    heap = []
5
6    with open(path, "r", encoding="utf-8") as f:
7        for line in f:
8            value = int(line.strip())
9
10            if len(heap) < k:
11                heapq.heappush(heap, value)
12            elif value > heap[0]:
13                heapq.heapreplace(heap, value)
14
15    return sorted(heap, reverse=True)

This reads the file once and never stores more than 100 candidate values at a time.

Why full sorting is the wrong first idea

A naïve solution would read all billion values into memory, sort them, and take the largest 100. That has obvious problems:

memory usage becomes enormous
sorting cost is much higher than needed
you do extra work for values that can never reach the top 100

Even if the file fits on disk comfortably, that does not mean it belongs in RAM all at once.

Handling chunked or malformed input

Real files are not always clean. You may need to skip blanks, report bad rows, or parse multiple values per line.

python

1import heapq
2
3def top_k_safe(path, k=100):
4    heap = []
5
6    with open(path, "r", encoding="utf-8") as f:
7        for raw_line in f:
8            line = raw_line.strip()
9            if not line:
10                continue
11
12            try:
13                value = int(line)
14            except ValueError:
15                continue
16
17            if len(heap) < k:
18                heapq.heappush(heap, value)
19            elif value > heap[0]:
20                heapq.heapreplace(heap, value)
21
22    return sorted(heap, reverse=True)

This preserves the same heap strategy while making the parser more robust.

When external sorting is useful instead

If you need the top million values, global ordering, or a full sorted output file, external sorting becomes more relevant. That typically means:

read manageable chunks
sort each chunk
write temporary sorted files
merge them

But for "top 100 only," external sort is overkill. The heap approach is simpler and faster because it never keeps irrelevant values.

Parallel and distributed variants

If the input is split across many files or machines, the same idea scales naturally:

compute local top 100 for each partition
merge those partial results with another heap

This works because the global top 100 must come from the union of the local top candidates.

That is the same principle used in many large-scale data systems: push the top-k reduction close to the data, then merge compact summaries.

Common Pitfalls

The biggest mistake is sorting everything. That solves the problem, but it ignores the fact that you only need 100 values and wastes memory and time.

Another issue is using a max-heap instead of a min-heap for the retained top set. With a min-heap, the smallest kept value is always easy to compare against. That is exactly what the algorithm needs.

Developers also forget to convert the final heap into descending order before returning or printing it. A heap is not a sorted list; it only guarantees the heap property.

Finally, be careful with parsing overhead. If the file format is more complex than one integer per line, parsing can dominate runtime, so the top-k logic may not be the real bottleneck.

Summary

Use a min-heap of size 100 and stream through the file once.
The heap root tracks the smallest of the current top 100 values.
Ignore any new number that is not larger than that root.
This uses tiny memory and avoids full-file sorting.
External sorting is useful for global ordering, not for a small top-k result.