LSM Compaction Strategies: Size-Tiered, Leveled, and the Workload That Picks for You

January 24, 2026

Every LSM tree carries the same triangle of amplifications: write amp counts how many times a byte gets rewritten, space amp counts how much extra disk you burn holding old versions, and read amp counts how many files a single read has to probe. Compaction strategy is the dial that moves these three against each other. You cannot win on all three.
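To make the three definitions concrete, here is a minimal sketch that turns raw counters into the three ratios. The EngineStats fields and the sample numbers are hypothetical, chosen for illustration; real engines expose these counters under different names.

```python
from dataclasses import dataclass


@dataclass
class EngineStats:
    # All field names are hypothetical; map them to whatever your engine reports.
    user_bytes_written: int   # bytes the application asked to write
    disk_bytes_written: int   # flush bytes plus bytes rewritten by compaction
    live_data_bytes: int      # size of the newest version of every key
    disk_bytes_used: int      # total SSTable bytes currently on disk
    files_probed: int         # SSTables touched across the sampled point reads
    point_reads: int          # number of sampled point reads


def write_amp(s: EngineStats) -> float:
    """How many times each user byte ends up written to disk."""
    return s.disk_bytes_written / s.user_bytes_written


def space_amp(s: EngineStats) -> float:
    """Disk footprint relative to the live data it holds."""
    return s.disk_bytes_used / s.live_data_bytes


def read_amp(s: EngineStats) -> float:
    """Average number of files probed per point read."""
    return s.files_probed / s.point_reads


# Made-up sample, not a measurement from any real system.
sample = EngineStats(
    user_bytes_written=100,
    disk_bytes_written=1200,
    live_data_bytes=80,
    disk_bytes_used=100,
    files_probed=350,
    point_reads=100,
)
print(write_amp(sample), space_amp(sample), read_amp(sample))
```

Tracking all three side by side is the point: a change in compaction settings that improves one ratio usually shows up as a regression in another.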

Size-tiered compaction is the original. When N similarly sized SSTables accumulate in a tier, they merge into one larger file in the next tier. Compactions are rare and large. Write amplification stays low because most bytes get rewritten only when a new tier forms. Space amplification climbs because, while a major compaction runs, the input files and the output coexist, so you can briefly hold two full copies of the dataset on disk. Read amplification climbs too: each tier can hold multiple overlapping files, so a point read may probe many of them, leaning on bloom filters to skip the misses. It was Cassandra's default for years.
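The merge rule above can be sketched as a toy simulation. This is a simplified model under stated assumptions, not how Cassandra implements it: every flush has the same size, no keys are overwritten (so a merged file is exactly the sum of its inputs), and the function name and the fanin parameter are hypothetical.

```python
def simulate_size_tiered(flushes: int, fanin: int = 4, flush_size: int = 1):
    """Toy size-tiered model: merge a tier into the next once it holds `fanin` files.

    Returns (tiers, write_amplification), where tiers[i] lists the file sizes
    currently sitting in tier i.
    """
    tiers: list[list[int]] = [[]]
    bytes_written = 0

    for _ in range(flushes):
        # A memtable flush lands in the smallest tier.
        tiers[0].append(flush_size)
        bytes_written += flush_size

        # Cascade merges upward while a tier is full.
        level = 0
        while len(tiers[level]) >= fanin:
            merged = sum(tiers[level])      # assumes no overwrites or deletes
            tiers[level] = []
            if level + 1 == len(tiers):
                tiers.append([])
            tiers[level + 1].append(merged)
            bytes_written += merged         # every merge rewrites its inputs once
            level += 1

    return tiers, bytes_written / (flushes * flush_size)


tiers, amp = simulate_size_tiered(16, fanin=4)
print(tiers, amp)   # one file of size 16 in the top tier, write amp 3.0

tiers7, _ = simulate_size_tiered(7, fanin=4)
print(sum(len(t) for t in tiers7))   # 4 files a point read might have to probe
```

In this model each byte is rewritten once per tier it passes through, which is why write amplification grows only with the number of tiers, while the count of files a read may touch grows with how full each tier is.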

Leveled compaction flips the bet. Each level has a strict size limit, and within a level, key ranges do not overlap. A read consults at most one file per level (the top level, where fresh flushes land, is the usual exception). Reads are predictable and space amplification is bounded near 1.1x, but write amplification can hit 20x or higher: pushing data down a level means rewriting every file it overlaps in the level below, and the same bytes get rewritten again at each level they pass through. RocksDB defaults here for a reason: mixed workloads want predictable reads.
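The "at most one file per level" property follows directly from the non-overlap rule, and a binary search shows why. The sketch below is illustrative only: the tuple layout, file names, and function names are hypothetical, and it models only the sorted, non-overlapping levels, not a top level where flushed files may overlap.

```python
from bisect import bisect_left

# One level = list of (min_key, max_key, file_name), sorted by min_key,
# with no two key ranges overlapping.
Level = list[tuple[str, str, str]]


def candidate_file(level: Level, key: str):
    """Return the single file in this level that could hold `key`, or None."""
    max_keys = [f[1] for f in level]
    i = bisect_left(max_keys, key)          # first file whose max_key >= key
    if i < len(level) and level[i][0] <= key:
        return level[i][2]
    return None                             # key falls in a gap or past the end


def files_to_probe(levels: list[Level], key: str) -> list[str]:
    """Collect at most one candidate per level, top to bottom."""
    hits = (candidate_file(level, key) for level in levels)
    return [name for name in hits if name is not None]


levels = [
    [("a", "f", "L1-0"), ("g", "m", "L1-1"), ("n", "z", "L1-2")],
    [("a", "c", "L2-0"), ("k", "m", "L2-1")],
]
print(files_to_probe(levels, "h"))   # ['L1-1']
print(files_to_probe(levels, "l"))   # ['L1-1', 'L2-1']
```

The number of files probed can never exceed the number of levels, which is the predictability the paragraph above is paying for with extra write amplification.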

Universal compaction, RocksDB's hybrid, behaves like size-tiered most of the time but triggers a larger cleanup merge when estimated space amplification crosses a configured limit, which bounds space amp. It is the right pick when you have bursty writes and cannot afford the worst case of either pure strategy.
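A rough sketch of what "clean up to bound space amp" can look like: treat the oldest sorted run as the base data, and force a full merge once the newer runs grow too large relative to it. This is a simplified heuristic loosely modeled on how universal compaction is commonly described, not RocksDB's actual code; the function names and the 200 percent threshold are assumptions for illustration.

```python
def estimated_space_amp_percent(run_sizes: list[int]) -> float:
    """run_sizes is ordered newest to oldest; the oldest run is the base data.

    Everything newer than the base is treated as potentially redundant.
    """
    *newer, oldest = run_sizes
    return 100.0 * sum(newer) / oldest


def should_full_compact(run_sizes: list[int], max_percent: float = 200.0) -> bool:
    """Trigger a full merge once the estimate crosses the (assumed) limit."""
    if len(run_sizes) < 2:
        return False
    return estimated_space_amp_percent(run_sizes) >= max_percent


print(should_full_compact([10, 20, 40, 100]))   # False: newer runs are 70% of base
print(should_full_compact([60, 70, 80, 100]))   # True: newer runs are 210% of base
```

Between triggers the engine merges runs size-tiered style, so write amplification stays low; the threshold is what keeps the disk footprint from drifting without bound.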

The production failure I watched: a metrics store on Cassandra ran size-tiered across all tables. One tenant's time series exploded after a product launch. A major compaction kicked in on that table and started building a single 800 GB SSTable. Halfway through, the node's disk hit 95 percent because the old files and the new compaction output both lived on disk at once: the textbook 2x space amp peak. Writes paused, compaction failed to checkpoint, and the node went read-only. Restoring took a manual SSTable cleanup plus a major compaction. The fix was a per-table compaction strategy: leveled on the hot multi-tenant tables to cap space amp, size-tiered only on append-only audit tables. Disk capacity planning was reset to assume 2x peak headroom per node, not steady-state usage.
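The headroom arithmetic behind that failure is simple enough to check before a compaction starts. The sketch below is a back-of-the-envelope guard, not a Cassandra feature: the function names, the 90 percent limit, and the node sizes in the example are all hypothetical and are not the figures from the incident.

```python
def peak_disk_fraction(
    capacity_gb: float,
    used_gb: float,
    compaction_input_gb: float,
    output_ratio: float = 1.0,
) -> float:
    """Worst-case disk usage while a compaction runs, as a fraction of capacity.

    Inputs stay on disk until the output is complete, so the peak is current
    usage plus the full output. output_ratio=1.0 assumes nothing is dropped.
    """
    peak_gb = used_gb + compaction_input_gb * output_ratio
    return peak_gb / capacity_gb


def can_compact_safely(
    capacity_gb: float,
    used_gb: float,
    compaction_input_gb: float,
    limit: float = 0.9,
) -> bool:
    """True if the worst-case peak stays under the (assumed) safety limit."""
    return peak_disk_fraction(capacity_gb, used_gb, compaction_input_gb) <= limit


# Illustrative node: 2000 GB disk, 1200 GB used, 800 GB of compaction inputs.
print(peak_disk_fraction(2000, 1200, 800))   # 1.0, the disk fills completely
print(can_compact_safely(2000, 1200, 800))   # False
print(can_compact_safely(2000, 600, 800))    # True, peak is 70% of capacity
```

Planning against the peak rather than the steady state is exactly the "2x headroom" rule: the largest table's compaction inputs have to fit on disk twice.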

Compaction is not background noise. It is your storage engine's posture under load.

Key takeaway

Compaction policy is the dial that moves write amplification, space amplification, and read amplification against each other. Workload picks the policy: write-heavy logs lean size-tiered, mixed OLTP leans leveled.

Originally posted on LinkedIn.