Is there a fast algorithm to merge sorted BTrees?
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
The fastest practical algorithm to merge two sorted B-trees (or B+ trees) extracts all leaf entries from both trees in sorted order (O(n + m) via sequential leaf traversal), merges the two sorted sequences, and bulk-loads a new B-tree from the merged result in O((n + m) / B) I/O operations, where B is the block size. This bottom-up bulk-loading approach is significantly faster than inserting elements one-by-one (O((n + m) log(n + m))) because it minimizes random I/O and builds perfectly balanced trees. For B+ trees specifically, the linked leaf nodes make the extraction step trivially efficient.
B-Tree vs B+ Tree Structure
B+ trees are preferred for merging because all data lives in linked leaf nodes. Traversing the leaves is a sequential scan — optimal for disk I/O.
Algorithm 1: Extract, Merge, Bulk-Load (Optimal)
Total time complexity: O(n + m). Total I/O complexity: O((n + m) / B) where B is the B-tree node capacity. This is optimal — you cannot do better than reading all elements once.
Algorithm 2: Bulk-Loading Bottom-Up
Bulk-loading produces a perfectly balanced tree with optimal node utilization (all nodes at least half full, most are completely full).
Algorithm 3: Incremental Merge (Insert from Smaller into Larger)
When one tree has m elements and the other has n elements where m << n, inserting from the small tree into the large tree takes O(m log n) which is better than O(n + m) for the full rebuild. The crossover point depends on the ratio m/n and the cost of I/O vs computation.
Algorithm 4: Merge-Sort Style for Multiple Trees
Complexity Comparison
| Algorithm | Time | I/O | Best When |
| Extract + merge + bulk-load | O(n + m) | O((n+m)/B) | General case |
| Insert small into large | O(m log n) | O(m log_B n) | m much smaller than n |
| k-way merge + bulk-load | O(N log k) | O(N/B + k) | Merging many trees |
| Naive insert all | O((n+m) log(n+m)) | O((n+m) log_B(n+m)) | Never optimal |
Common Pitfalls
- Inserting one-by-one instead of bulk-loading: Inserting n+m elements individually into a new tree takes O((n+m) log(n+m)) with random I/O for each insertion. Bulk-loading from sorted data is O(n+m) with sequential I/O — orders of magnitude faster for large trees.
- Not exploiting B+ tree leaf links: B+ trees have linked leaf nodes allowing O(n) sequential traversal. B-trees require an in-order traversal visiting internal nodes, which is still O(n) but involves more random I/O due to interleaved internal and leaf node accesses.
- Underutilizing node capacity after merge: Merging by alternating insertions from two trees can leave nodes half-empty, wasting disk space and increasing tree height. Bulk-loading fills nodes to capacity (or a controlled fill factor), producing compact trees.
- Ignoring duplicate key handling: If both trees contain the same key, the merge must define a conflict resolution strategy (keep first, keep last, merge values). Without explicit handling, duplicate keys cause incorrect tree structure or data loss.
- Assuming in-memory algorithms apply to disk-based B-trees: B-tree merge performance is dominated by I/O, not computation. An algorithm with lower computational complexity but more random I/O (like binary search insertion) is slower in practice than a computationally heavier but sequential-I/O algorithm (like bulk-loading).
Summary
- The optimal B-tree merge algorithm extracts sorted leaves from both trees, merges the sequences, and bulk-loads a new tree — O(n + m) time
- Bulk-loading bottom-up produces perfectly balanced trees with maximum node utilization
- For B+ trees, linked leaf nodes make sorted extraction a simple sequential scan
- When one tree is much smaller, insert its elements into the larger tree instead — O(m log n)
- For merging k trees, use a priority queue for k-way merge followed by bulk-loading — O(N log k)
- Avoid naive one-by-one insertion — it produces O((n+m) log(n+m)) random I/O operations

