Write Amplification in B-Trees and LSM Trees: The Ratio Math

January 20, 2026

Write amplification is the ratio that storage engineers actually budget against. If your application writes 1 GB and the storage device writes 8 GB, your amplification is 8x. That is bytes you paid for, IOPS you consumed, SSD endurance you burned. Both B-trees and LSMs have a problem here. They have very different problems.

Start with the B-tree. A single PUT of 100 bytes does not write 100 bytes to disk. The engine first appends the operation to the WAL. Then it rewrites the leaf page that holds the key. Pages are typically 8 KB or 16 KB, so even a tiny update eventually flushes a full page. That is 8 KB plus the WAL entry for a 100-byte logical write: roughly 80x amplification with 8 KB pages and about 160x with 16 KB pages, and that is before you split anything. A buffer pool softens this when several updates land on the same page between flushes, but for scattered small writes the page-level figure is the one to budget against.
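The page-level arithmetic can be checked with a few lines of Python. This is a rough sketch, not a model of any particular engine: it assumes the WAL record is just the payload (the hypothetical `wal_overhead_bytes` parameter defaults to zero) and that every update flushes its page on its own, with no coalescing in the buffer pool.

```python
def btree_write_amplification(value_bytes, page_bytes=8 * 1024, wal_overhead_bytes=0):
    """Physical bytes per logical byte for one in-place update with no split.

    Assumes one WAL append plus one full leaf-page flush per update.
    """
    wal_bytes = value_bytes + wal_overhead_bytes  # the log append
    physical_bytes = wal_bytes + page_bytes       # plus the whole leaf page
    return physical_bytes / value_bytes


for page_bytes in (8 * 1024, 16 * 1024):
    ratio = btree_write_amplification(100, page_bytes)
    print(f"{page_bytes // 1024} KB page: {ratio:.1f}x")
# 8 KB page: 82.9x
# 16 KB page: 164.8x
```

Larger values shrink the ratio quickly: the page cost is fixed, so a 4 KB value on an 8 KB page lands near 3x under the same assumptions.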

When the hot leaf fills, it splits. Now you write two leaf pages (the original and its new sibling) and rewrite the parent to point at both. If the parent fills, it splits too. A single logical insert can cascade into three, five, or more page writes on a bad day. B-tree write amplification is bounded but spiky: most writes touch one page, some writes touch the whole height of the tree.
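The split cascade can be counted with a small sketch. It uses a simplified model that is an assumption here, not taken from any specific engine: every level that splits writes both halves, and the first ancestor that does not split is rewritten once to hold the new pointer. The function name and the `split_levels` parameter are illustrative.

```python
def pages_written_by_insert(split_levels):
    """Pages written by one insert when `split_levels` consecutive levels
    split, counting upward from the leaf. 0 means the leaf had room."""
    if split_levels < 0:
        raise ValueError("split_levels must be non-negative")
    if split_levels == 0:
        return 1  # rewrite the leaf in place
    # Two pages per splitting level (both halves), plus one rewrite of the
    # first ancestor that absorbs the new pointer (or a freshly created root).
    return 2 * split_levels + 1


for k in range(4):
    print(f"{k} level(s) split: {pages_written_by_insert(k)} page writes")
# 0 level(s) split: 1 page writes
# 1 level(s) split: 3 page writes
# 2 level(s) split: 5 page writes
# 3 level(s) split: 7 page writes
```

Under this model a leaf-only split costs three page writes and each additional splitting level adds two more, so the worst case grows linearly with tree height.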

LSM amplification looks calm at first and then explodes. The write itself is cheap: a WAL append and an in-memory memtable update. The memtable flushes to an L0 SSTable, which is your second physical write of that data. Then compaction kicks in. To keep reads fast, the engine periodically merges SSTables and pushes them to the next level. A key written once at L0 will be rewritten when L0 compacts into L1, then again into L2, then L3, then L4.
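To make the write path concrete, here is a minimal sketch that lists every point at which one key is physically written on its way down to a given level. It assumes the key moves exactly one level per compaction and is rewritten only once at each step, which is a floor rather than a typical figure. The function name and step labels are illustrative.

```python
def minimum_write_path(deepest_level):
    """List the physical writes one key incurs on its way to `deepest_level`,
    assuming a single rewrite per compaction step."""
    path = ["WAL append", "memtable flush -> L0"]
    path += [f"compaction L{i} -> L{i + 1}" for i in range(deepest_level)]
    return path


steps = minimum_write_path(4)
for step in steps:
    print(step)
print(f"{len(steps)} physical writes for one logical write")
# WAL append
# memtable flush -> L0
# compaction L0 -> L1
# compaction L1 -> L2
# compaction L2 -> L3
# compaction L3 -> L4
# 6 physical writes for one logical write
```

Real amplification is higher than this floor because a compaction also rewrites the overlapping data already sitting in the destination level.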

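The math is roughly the level multiplier times the number of levels, not the multiplier raised to the depth: in the worst case each level rewrites a byte about as many times as the fanout, and those per-level costs add up. RocksDB defaults to a 10x size multiplier per level (max_bytes_for_level_multiplier). Five levels means roughly 10 to 50 physical writes per logical write, depending on workload. Some setups hit 30x or more, especially with skewed key distributions.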
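A back-of-the-envelope version of this estimate, using the common additive approximation for leveled compaction: average rewrites per level times the number of levels. The `rewrites_per_level` values below are illustrative assumptions, not measurements. The fanout of 10 is the ceiling, and real workloads usually land somewhere under it.

```python
def lsm_write_amplification(levels, rewrites_per_level):
    """Rough leveled-compaction estimate: each of `levels` levels rewrites
    a byte `rewrites_per_level` times on average. Ignores the WAL."""
    return levels * rewrites_per_level


FANOUT = 10  # size multiplier between adjacent levels
LEVELS = 5

for rewrites in (2, 6, FANOUT):
    estimate = lsm_write_amplification(LEVELS, rewrites)
    print(f"{rewrites} rewrites/level over {LEVELS} levels: ~{estimate}x")
# 2 rewrites/level over 5 levels: ~10x
# 6 rewrites/level over 5 levels: ~30x
# 10 rewrites/level over 5 levels: ~50x
```

The point of the sketch is the shape of the formula: adding a level adds one more multiple of the per-level cost, so depth hurts linearly rather than exponentially.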

The trade is in when you pay. B-trees pay during the write, synchronously. The user sees the cost in tail latency. LSMs pay during compaction, asynchronously. The user sees the cost in unpredictable background IO that competes with foreground reads and can stall writes if compaction falls behind. A compaction backlog is the LSM equivalent of a B-tree page split storm, and it shows up in production as a sudden drop in read throughput while the engine catches up.

Key takeaway

Write amplification is the gap between what your application wrote and what storage actually wrote. B-trees pay it up front in page rewrites. LSMs defer it to compaction. Both have a multiplier, and both multipliers matter for capacity planning.

Originally posted on LinkedIn.