B+ Trees
Algorithms
Data Structures
Merging Algorithms
Computer Science

Is there a fast algorithm to merge sorted BTrees?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

The fastest practical algorithm to merge two sorted B-trees (or B+ trees) extracts all leaf entries from both trees in sorted order (O(n + m) via sequential leaf traversal), merges the two sorted sequences, and bulk-loads a new B-tree from the merged result in O((n + m) / B) I/O operations, where B is the block size. This bottom-up bulk-loading approach is significantly faster than inserting elements one-by-one (O((n + m) log(n + m))) because it minimizes random I/O and builds perfectly balanced trees. For B+ trees specifically, the linked leaf nodes make the extraction step trivially efficient.

B-Tree vs B+ Tree Structure

 
1B-Tree (order 3):
2        [20, 40]
3       /    |    \
4   [10]  [30]  [50, 60]
5   Data in ALL nodes
6
7B+ Tree (order 3):
8        [20, 40]
9       /    |    \
10   [10][20,30][40,50,60]Linked leaf nodes hold ALL data
11   Internal nodes are index-only

B+ trees are preferred for merging because all data lives in linked leaf nodes. Traversing the leaves is a sequential scan — optimal for disk I/O.

Algorithm 1: Extract, Merge, Bulk-Load (Optimal)

 
1Step 1: Extract sorted sequences from both trees
2  Tree A leaves: [1, 3, 5, 7, 9, 11]
3  Tree B leaves: [2, 4, 6, 8, 10, 12]
4
5Step 2: Merge the two sorted sequences (standard merge)
6  Merged: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
7
8Step 3: Bulk-load a new B+ tree bottom-up
9  Fill leaf nodes to capacity:
10    [1,2,3][4,5,6][7,8,9][10,11,12]
11  Build internal nodes:
12    [4, 7, 10]
13  Build root:
14    [7]
python
1def merge_btrees(tree_a, tree_b, order):
2    # Step 1: Extract sorted elements from leaf nodes
3    seq_a = list(tree_a.iterate_leaves())  # O(n)
4    seq_b = list(tree_b.iterate_leaves())  # O(m)
5
6    # Step 2: Merge two sorted sequences
7    merged = merge_sorted(seq_a, seq_b)     # O(n + m)
8
9    # Step 3: Bulk-load a new B-tree
10    return bulk_load_btree(merged, order)    # O((n + m) / B)
11
12def merge_sorted(a, b):
13    """Standard merge of two sorted lists."""
14    result = []
15    i, j = 0, 0
16    while i < len(a) and j < len(b):
17        if a[i] <= b[j]:
18            result.append(a[i])
19            i += 1
20        else:
21            result.append(b[j])
22            j += 1
23    result.extend(a[i:])
24    result.extend(b[j:])
25    return result

Total time complexity: O(n + m). Total I/O complexity: O((n + m) / B) where B is the B-tree node capacity. This is optimal — you cannot do better than reading all elements once.

Algorithm 2: Bulk-Loading Bottom-Up

python
1def bulk_load_btree(sorted_data, order):
2    """Build a B+ tree from sorted data, bottom-up."""
3    max_keys = order - 1
4
5    # Step 1: Fill leaf nodes
6    leaves = []
7    for i in range(0, len(sorted_data), max_keys):
8        node = LeafNode(keys=sorted_data[i:i + max_keys])
9        leaves.append(node)
10
11    # Link leaf nodes
12    for i in range(len(leaves) - 1):
13        leaves[i].next = leaves[i + 1]
14
15    # Step 2: Build internal levels bottom-up
16    current_level = leaves
17    while len(current_level) > 1:
18        next_level = []
19        for i in range(0, len(current_level), order):
20            children = current_level[i:i + order]
21            # Separator keys from first key of each child (except first)
22            keys = [child.keys[0] for child in children[1:]]
23            parent = InternalNode(keys=keys, children=children)
24            next_level.append(parent)
25        current_level = next_level
26
27    return BPlusTree(root=current_level[0])

Bulk-loading produces a perfectly balanced tree with optimal node utilization (all nodes at least half full, most are completely full).

Algorithm 3: Incremental Merge (Insert from Smaller into Larger)

python
1def incremental_merge(large_tree, small_tree):
2    """Insert all elements from smaller tree into larger tree.
3    Faster when one tree is much smaller than the other."""
4    for key, value in small_tree.iterate_leaves():
5        large_tree.insert(key, value)
6    return large_tree
7
8# Time: O(m * log_B(n)) where m = small tree size, n = large tree size
9# Better than extract-merge-rebuild when m << n

When one tree has m elements and the other has n elements where m << n, inserting from the small tree into the large tree takes O(m log n) which is better than O(n + m) for the full rebuild. The crossover point depends on the ratio m/n and the cost of I/O vs computation.

Algorithm 4: Merge-Sort Style for Multiple Trees

python
1def merge_k_btrees(trees, order):
2    """Merge k sorted B-trees using a priority queue."""
3    import heapq
4
5    # Create iterators for each tree's leaf nodes
6    iterators = [tree.iterate_leaves() for tree in trees]
7
8    # Use a min-heap for k-way merge
9    heap = []
10    for i, it in enumerate(iterators):
11        first = next(it, None)
12        if first is not None:
13            heapq.heappush(heap, (first, i))
14
15    merged = []
16    while heap:
17        value, tree_idx = heapq.heappop(heap)
18        merged.append(value)
19        next_val = next(iterators[tree_idx], None)
20        if next_val is not None:
21            heapq.heappush(heap, (next_val, tree_idx))
22
23    return bulk_load_btree(merged, order)
24
25# Time: O(N * log(k)) where N = total elements, k = number of trees

Complexity Comparison

AlgorithmTimeI/OBest When
Extract + merge + bulk-loadO(n + m)O((n+m)/B)General case
Insert small into largeO(m log n)O(m log_B n)m much smaller than n
k-way merge + bulk-loadO(N log k)O(N/B + k)Merging many trees
Naive insert allO((n+m) log(n+m))O((n+m) log_B(n+m))Never optimal

Common Pitfalls

  • Inserting one-by-one instead of bulk-loading: Inserting n+m elements individually into a new tree takes O((n+m) log(n+m)) with random I/O for each insertion. Bulk-loading from sorted data is O(n+m) with sequential I/O — orders of magnitude faster for large trees.
  • Not exploiting B+ tree leaf links: B+ trees have linked leaf nodes allowing O(n) sequential traversal. B-trees require an in-order traversal visiting internal nodes, which is still O(n) but involves more random I/O due to interleaved internal and leaf node accesses.
  • Underutilizing node capacity after merge: Merging by alternating insertions from two trees can leave nodes half-empty, wasting disk space and increasing tree height. Bulk-loading fills nodes to capacity (or a controlled fill factor), producing compact trees.
  • Ignoring duplicate key handling: If both trees contain the same key, the merge must define a conflict resolution strategy (keep first, keep last, merge values). Without explicit handling, duplicate keys cause incorrect tree structure or data loss.
  • Assuming in-memory algorithms apply to disk-based B-trees: B-tree merge performance is dominated by I/O, not computation. An algorithm with lower computational complexity but more random I/O (like binary search insertion) is slower in practice than a computationally heavier but sequential-I/O algorithm (like bulk-loading).

Summary

  • The optimal B-tree merge algorithm extracts sorted leaves from both trees, merges the sequences, and bulk-loads a new tree — O(n + m) time
  • Bulk-loading bottom-up produces perfectly balanced trees with maximum node utilization
  • For B+ trees, linked leaf nodes make sorted extraction a simple sequential scan
  • When one tree is much smaller, insert its elements into the larger tree instead — O(m log n)
  • For merging k trees, use a priority queue for k-way merge followed by bulk-loading — O(N log k)
  • Avoid naive one-by-one insertion — it produces O((n+m) log(n+m)) random I/O operations

Course illustration
Course illustration

All Rights Reserved.