B-Tree vs LSM: Stop Comparing Features, Start Comparing Disk Shapes
January 21, 2026
Most B-tree vs LSM arguments waste energy comparing features. The real distinction is one design decision: in-place mutation versus append-only. Once you see it at the disk layer, every tradeoff falls out for free.
A B-tree stores data in fixed-size pages, typically 8 KB or 16 KB, arranged in a tree where leaves are linked for range scans. The unit of I/O is a page, not a row. An update locates the right leaf page, modifies it, fsyncs the change, and writes a WAL entry first for crash recovery. Writes are random because any key can land in any page. Reads are predictable: descend the tree, hit a leaf, return. Range scans walk the linked leaves sequentially. The shape is symmetric. You pay roughly the same I/O for a read and a write.
An LSM tree refuses random writes. Every insert appends to a WAL and updates an in-memory memtable. When the memtable fills, it flushes to disk as an immutable Sorted String Table. New SSTables stack up, and a background compaction process merges them later. Writes are sequential and cheap because nothing on disk is ever mutated. Reads pay the cost: a key might live in the memtable, in any of several L0 files, or in deeper levels, so the engine probes structures until it finds the answer. Bloom filters and sparse indexes mask this, but they do not eliminate it.
The tradeoff maps cleanly to workload. Write-heavy ingest patterns, things like logs, metrics, social feeds, event streams, fit LSM because sequential I/O scales with disk bandwidth. Read-heavy OLTP with frequent point updates fits B-tree because every read goes through a predictable path and the cache hit rate stays high.
The production failure I keep watching is the workload mismatch. A team picked Cassandra for an internal OLTP service because "we need scale," even though their actual load was 50:1 read-to-write with strict latency budgets. Reads required scatter-gather across replicas plus per-key bloom filter probes plus SSTable block fetches. P99 read latency landed at 5x what a single Postgres instance had delivered in the prototype. They lived with it for six months, debugging every spike, before reverting to a sharded Postgres setup. The lesson was not that Cassandra is slow. It was that an append-only disk shape costs you on reads, and OLTP latency budgets cannot absorb that cost.
You are not picking an algorithm. You are picking how your data lives on disk, and the disk shape decides everything else.
The right way to compare B-trees and LSM trees is at the disk layer: one mutates pages in place, the other appends immutable files and merges later. Every tradeoff falls out of that one choice.
Originally posted on LinkedIn. View original.