C++
Running Median
Algorithm
Data Structures
Programming

C Efficiently Calculating a Running Median

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

To efficiently calculate a running median in C++, a solid understanding of data structures and algorithms is essential. The running median is the median of all elements read so far in a sequence of numbers. This requires dynamically calculating the median as each new number is processed, which can be challenging if not approached correctly.

In this article, I will explore the implementation of an efficient method to calculate the running median using a combination of two heaps: a max-heap and a min-heap. This approach optimizes for both time and space complexity, making it suitable for handling large streams of data.

Key Concepts

Before diving into the implementation, let's summarize the core concepts used in the solution:

  • Median: The median is the middle value in a list of numbers. For an odd number of elements, it is the middle element. For an even number, it is the average of the two middle elements.
  • Heaps: These are specialized tree-based data structures that satisfy the heap property. A max-heap allows retrieving the maximum element quickly, while a min-heap allows retrieving the minimum.
  • Data Stream: A sequence of data elements made available over time rather than as a complete set at once.

Strategy

To maintain the ability to calculate the median efficiently as numbers arrive in the data stream, the following strategy is used:

  1. Two-Heaps Approach:
    • Maintain two heaps: a max-heap to store the lower half of the numbers and a min-heap to store the upper half.
    • Ensure that the max-heap contains all numbers less than the median and the min-heap contains all numbers greater than the median.
  2. Balancing Heaps:
    • The heaps need to be balanced such that the max-heap can have at most one more element than the min-heap.
    • When the total number of elements is odd, the extra element is in the max-heap.
  3. Median Calculation:
    • If the heaps are equal in size, the median is the average of the top elements of both heaps.
    • If the max-heap has one more element, the median is the top element of the max-heap.

Implementation

With these concepts in mind, we can efficiently calculate the running median using C++ and its Standard Template Library (STL) for handling priority queues.

  • addNumber(int num): This function adds a new number to one of the two heaps. If the number is less than or equal to the top of the max-heap, it's added to the max-heap. Otherwise, it's added to the min-heap. After each insertion, it checks and balances the sizes of the heaps.
  • getMedian(): This returns the median of all numbers seen so far. If the heaps are balanced, it returns the average of the two top elements. If not, it returns the top of the max-heap, as it contains the extra element.
  • Heap Balancing: The balance is maintained such that no single heap exceeds the other by more than one element. This involves moving the top element from one heap to the other as needed.
  • Time Complexity: Each insertion and balancing operation takes O(logn)O(\log n) time, where nn is the number of elements seen so far. Retrieving the median has constant time complexity, O(1)O(1), due to the direct access to the heap tops.
  • Space Complexity: The algorithm requires O(n)O(n) space, where nn is the total number of elements processed, as all elements need to be stored in the heaps.
  • Handling Large Data: For practical implementations, you may want to handle large datasets by periodically clearing or archiving old entries to manage memory usage.
  • Variations in Median Calculation: Occasionally, a weighted median or other statistical measures may be preferred, which can extend this approach's capabilities by adjusting how elements are organized and retrieved.

Course illustration
Course illustration

All Rights Reserved.