How external merge sort algorithm works?
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
External Merge Sort is a pivotal algorithm in computer science, especially when working with large datasets that exceed the main memory capacity. Unlike internal sorting algorithms that assume the entire dataset can fit in RAM, external merge sort is designed to handle huge amounts of data stored in slower external memory, such as disks. Below, we explore the workings of the external merge sort algorithm, providing technical details, examples, and a summary table.
How External Merge Sort Works
Overview
External merge sort is a two-phase algorithm consisting of the run-creation phase and the merge phase. Here's a breakdown of each phase:
- Run-Creation Phase
- The input data is divided into manageable chunks small enough to fit into memory.
- Each chunk is sorted using an internal sorting algorithm (e.g., QuickSort or HeapSort).
- The sorted chunks, called "runs," are then written back to disk.
- Merge Phase
- Multiple runs are merged in rounds, where each round reduces the number of runs.
- This process continues until a single sorted file remains.
- Efficient multi-way merging techniques and data structures, such as min-heaps, play a crucial role in this phase.
Detailed Explanation
Run-Creation Phase
In the run-creation phase, the dataset is partitioned into segments that can fit into the RAM. For instance, let's assume we have a dataset of 1 GB, and our RAM capacity is 100 MB. The dataset can be split into 10 chunks, each of 100 MB.
- Sorting In-Memory:
- Each 100 MB chunk is loaded into RAM and sorted using a conventional internal memory sorting algorithm.
- Once sorted, these runs are written back to the disk as temporary files.
Consider a simplified run-creation example:
| Run | Data in RAM | Sorted Data | Written to Disk |
| 1 | [6, 5, 3] | [3, 5, 6] | Run1.dat |
| 2 | [9, 8, 7] | [7, 8, 9] | Run2.dat |
| 3 | [4, 2, 1] | [1, 2, 4] | Run3.dat |
Each run is sorted individually and stored back on the disk.
Merge Phase
The core operation in the merge phase is the k-way merge, where signifies the number of runs being merged simultaneously. The principle is akin to the final step of the merge sort algorithm but adapted for external data.
- Multi-Way Merge:
- A suitable structure, like a min-heap of size , is employed to keep track of the smallest elements from each run.
- The smallest element is repeatedly extracted from this heap, appended to the output file, and replaced with the next element from the corresponding run.
Assuming and three runs `Run1.dat`, `Run2.dat`, and `Run3.dat` each contain sorted data, a min-heap helps effectively manage merging:
- Initialize a min-heap with the first element of each run.
- Extract the minimum element from the heap and write it to an output run.
- Replace the extracted element with the next element from its respective run.
- Repeat until all runs are exhausted.
Here's how the k-way merge functions:
| Step | Min-Heap | Output | Remaining in Runs |
| 1 | [1, 3, 7] | 1 | 2 from Run3, [3, 5, 6], [7, 8, 9] |
| 2 | [2, 3, 7] | 1, 2 | [] from Run3, [3, 5, 6], [7, 8, 9] |
| 3 | [3, 5, 7] | 1, 2, 3 | [] from Run3, [5, 6], [7, 8, 9] |
This process continues until all elements from all runs are merged.
Additional Concepts
- Buffering: Instead of processing single elements, blocks of data are moved between disk and memory to improve I/O efficiency. A proper buffering strategy can significantly enhance performance by minimizing disk I/O operations.
- Progressive Merging: In cases where RAM can hold more than two runs, more than a binary merge is feasible—resulting in fewer passes over the data and more efficiency.
- Parallelization: Depending on the system's capabilities, external merge sort can be parallelized. Different parts of the process, such as sorting individual runs and merging runs, can be handled concurrently.
Summary Table
The table below summarizes key aspects of the external merge sort algorithm:
| Aspect | Description |
| Data Handling | Suitable for datasets larger than RAM capacity |
| Process Phases | Run-Creation Phase Merge Phase |
| Sorting Technique | Internal sorting for runs k-way merging for runs |
| Data Structure Used | Min-Heap for efficient merge operation |
| I/O Efficiency | Improved using data buffering |
| Parallelization | Possible in modern systems |
The External Merge Sort algorithm is instrumental for efficient sorting of huge datasets, optimizing both CPU and I/O operations by extending internal algorithms for external data constraints. Understanding these concepts can significantly aid developers and computer scientists in handling large-scale data sorting tasks efficiently.

