LSM Reads: Bloom Filters and Sparse Indexes Are What Make GETs Survivable

January 23, 2026

If an LSM tree did the naive thing on every GET, it would scan every SSTable on every level looking for one key. With eight levels and dozens of files per level, that is hundreds of disk reads to answer one lookup. Nobody would use it. The reason it works is that two small in-memory structures filter the work before any disk seek happens.

The first is the bloom filter, one per SSTable. It is a bit array seeded by a handful of hash functions, and it answers one question: is this key definitely not in this file? False negatives never happen. False positives happen at a tuneable rate, usually around 1% with ten bits per key. That ten bits per key is the budget you need to remember. For a billion keys per node, that is roughly 1.25GB of RAM, and that RAM is doing the work of keeping every irrelevant file out of the IO path.

The second is the sparse index, also held in memory per SSTable. Instead of indexing every key, it indexes one key per block, typically every 64th or every 128th. A binary search on the sparse index gives you a starting offset on disk. You read one block, scan a few hundred entries, and you are done. The index is small enough that you can afford to keep it resident, and small enough that the binary search never escapes L1 cache.

Read amplification is the product of levels traversed, bloom false positives, and the per-file binary search cost. With well-tuned blooms, the dominant term collapses to almost nothing, and a GET resolves in one or two block reads.

The production failure here is a team that decided bloom filters were optional. They were running RocksDB on a fleet where each node held 150GB of data, and the blooms consumed 1.5GB of RAM per node. Someone benchmarked memory pressure during a hot week and disabled blooms to claw that back. The dataset still fit, the writes still worked, and the staging suite passed. In production, every GET now walked the sparse index for every SSTable on every level. P99 reads went from 2ms to 80ms. The fix was to put the blooms back and treat that RAM as load-bearing, not as headroom. Rule of thumb: budget roughly 1% of your on-disk dataset as bloom RAM and never let it get traded away.

Bloom filters are not a tuning knob. They are part of the data structure.

Key takeaway

LSM reads are fast because they avoid work, not because disks are. Bloom filters skip whole files, sparse indexes skip whole blocks, and turning either of them off turns a 2ms GET into an 80ms GET.

Originally posted on LinkedIn. View original.