LSM Trees: Why Writes Are Almost Free and Reads Pay the Bill

February 1, 2026


The trick behind databases that swallow millions of writes per second is that they refuse to do random I/O on the write path. An LSM tree, the engine inside Cassandra, RocksDB, ScyllaDB, and LevelDB, lands every write in two places: a sequential append to the Write-Ahead Log on disk, and an insert into the memtable, an in-memory sorted structure usually built as a skip list or balanced tree. The WAL gives you durability if the process dies. The memtable gives you a sorted view to flush later. Both operations are effectively O(1) and require no random disk seeks.

When the memtable fills, it flushes to disk as an immutable Sorted String Table, an SSTable. New SSTables land at Level 0. They can have overlapping key ranges because each flush is independent. As L0 fills, a compaction process merges overlapping files into Level 1, then L1 into L2, and so on. Each level is roughly 10x larger than the one above it and, in leveled compaction, has non-overlapping key ranges.

Reads pay for the asymmetry. To find a key, the engine checks the memtable first, then walks L0 file by file because L0 can overlap, then descends through L1, L2, L3 picking the one file per level whose range contains the key. Two structures keep this fast. A bloom filter per SSTable lets the read skip files that definitely do not contain the key. A sparse index per SSTable lets it jump to the right block within a file. Cost is roughly O(log n) where n grows with the number of levels.

The production failure I keep watching shows up under heavy ingest. A team ran Cassandra under a write-heavy workload using the default size-tiered compaction. L0 SSTables piled up faster than compaction threads could drain them. The L0 to L1 promotion stalled because L0 had more files than the strategy's overlap budget allowed, which made the situation feed itself: more flushes, more L0 files, more stalling. P99 reads exploded because every lookup had to consult 40-plus L0 files, each with its own bloom filter check and block fetch. The fix had two parts: raise the compaction thread count to match the disk's parallelism, and switch the table to Leveled compaction so L0 stayed shallow. Read p99 dropped 30x within a day.

An LSM tree is a contract. You pay write cost in deferred work, and the bill comes due as compaction pressure under load.

Key takeaway

An LSM tree turns random writes into sequential appends by buffering in memory and flushing sorted files to disk. Reads pay for that win by climbing levels, and compaction quietly does the cleanup that keeps both sides honest.

Originally posted on LinkedIn. View original.


All Rights Reserved.