Interview Question
Median Finding
Large Data Set
Algorithm Design
Data Structures

Interview Question Find Median From Mega Number Of Integers

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

The phrase "mega number of integers" can mean two different interview problems: either a running median over a stream, or an exact median of data so large it does not fit in memory. The best answer depends on which version the interviewer actually means, so the first step is to clarify the model before jumping into code.

Case 1: Streaming Median with Two Heaps

If numbers arrive one by one and you want to report the median at any time, the standard answer is two heaps:

  • a max-heap for the lower half
  • a min-heap for the upper half

The heaps are balanced so that:

  • their sizes differ by at most one
  • every value in the left heap is less than or equal to every value in the right heap

Then the median is:

  • the top of the larger heap if the count is odd
  • the average of both tops if the count is even
python
1import heapq
2
3
4class RunningMedian:
5    def __init__(self):
6        self.left = []   # max-heap via negative values
7        self.right = []  # min-heap
8
9    def add(self, value):
10        if not self.left or value <= -self.left[0]:
11            heapq.heappush(self.left, -value)
12        else:
13            heapq.heappush(self.right, value)
14
15        if len(self.left) > len(self.right) + 1:
16            heapq.heappush(self.right, -heapq.heappop(self.left))
17        elif len(self.right) > len(self.left):
18            heapq.heappush(self.left, -heapq.heappop(self.right))
19
20    def median(self):
21        if len(self.left) > len(self.right):
22            return -self.left[0]
23        return (-self.left[0] + self.right[0]) / 2
24
25
26rm = RunningMedian()
27for n in [5, 2, 10, 4]:
28    rm.add(n)
29    print(rm.median())

Insertion is O(log n), and median lookup is O(1).

Case 2: Exact Median of Data Too Large for Memory

If the dataset is massive and stored on disk, the interview may be testing whether you notice that the two-heap approach still requires storing everything.

Then the question changes. Possible exact strategies include:

  • external sorting
  • repeated partitioning with disk-based passes
  • counting by value range if the integer range is small and known

The most general exact answer is external sort:

  1. sort chunks that fit in memory
  2. write sorted runs to disk
  3. merge the runs
  4. stop when you reach the middle position

This is slower than in-memory heaps, but it is scalable and exact.

When the Integer Range Is Small

If the integers are known to lie in a limited range, such as 0 to 1_000_000, a counting approach can be better than sorting.

You count frequencies by value:

python
1def median_from_counts(values, max_value):
2    counts = [0] * (max_value + 1)
3    total = 0
4
5    for v in values:
6        counts[v] += 1
7        total += 1
8
9    target = total // 2
10    seen = 0
11
12    for value, count in enumerate(counts):
13        seen += count
14        if seen > target:
15            return value

This is effectively linear in the number of items plus the value range. It is a strong answer only when the range is truly bounded and not absurdly large.

What Interviewers Usually Want You to Notice

The best candidates usually say something like:

  • "If this is a streaming median, I would use two heaps."
  • "If the data does not fit in memory and I need the exact final median, I need an external-memory strategy."
  • "If the value range is bounded, counting may beat general sorting."

That shows you are modeling the constraints instead of memorizing one trick.

Approximate Answers Are a Different Problem

Sometimes the interviewer will allow approximation for truly enormous data. Then quantile sketches and streaming summaries become relevant. But unless the interviewer explicitly relaxes the requirement, "median" usually means exact median.

That is why you should not jump to approximate methods without confirming the requirement first.

Space Complexity Matters More Than Usual

Interview problems about huge inputs are rarely about just time complexity. They are usually testing whether you think about:

  • memory limits
  • sequential disk access
  • number of passes over the data
  • whether exactness is required

So a correct but memory-hungry O(n log n) answer may still miss the real point of the question.

Common Pitfalls

  • Giving the two-heaps answer without checking whether the full dataset can fit in memory.
  • Assuming streaming median and external exact median are the same problem.
  • Forgetting that bounded integer ranges can change the optimal solution completely.
  • Offering approximate quantiles when the question asks for the exact median.
  • Focusing only on time complexity and ignoring storage constraints.

Summary

  • For a running median on a stream, the standard answer is two heaps.
  • For an exact median of data larger than memory, use an external-memory strategy such as external sort.
  • If the integer range is small, counting can be even better.
  • The right solution depends on whether the problem is streaming, disk-based, bounded-range, exact, or approximate.
  • In interviews, clarifying the constraints is usually part of the expected answer.

Course illustration
Course illustration

All Rights Reserved.