Why B-Trees and LSM Trees Both Exist: The Disk Physics Behind the Split

January 21, 2026


The reason two storage structures coexist is older than either of them. It is a fact about hardware. Storage devices do not care how much data you move. They care where you move it from and where you put it next. Once you accept that, B-trees and LSM trees stop looking like rivals and start looking like the two reasonable answers to the same question.

On a spinning disk, a random seek costs about 10 milliseconds. Sequential reads and writes run at 100 MB/s or more. The ratio between them is roughly a million to one. That ratio is what drove the original B-tree design. If random access is that expensive, you want to do as little of it as possible per operation. So you pay once during the write, in place, on a single page, and you keep the tree balanced so that every read is also a tiny number of seeks. The cost lives on the write side, but it is small and predictable, and the read side stays fast forever.

SSDs flattened the random seek penalty, but they introduced a new constraint. NAND flash cannot overwrite a page. To rewrite a 4KB page the controller has to erase a much larger block, sometimes a megabyte, and rewrite everything in it. Random in-place updates work, but they wear out the device and trigger garbage collection that competes with your workload. Sequential writes are still strongly preferred, just for different physical reasons.

That is where LSM trees became the better fit. An LSM does not write where the data belongs. It writes wherever the next sequential byte happens to be. The write-ahead log is one append, the memtable update is in RAM, and the eventual SSTable flush is a streaming write of a sorted chunk. Nothing seeks. Nothing erases a half-full block. The work of putting data in the right place is moved to compaction, which runs in the background and can be batched and rate-limited.

The cost shows up on the read side. To read a key you may have to check the memtable, then several SSTable levels, then maybe a bloom filter to skip levels that do not contain the key. Range scans suffer the most because you are merging sorted runs from different files instead of walking one tree.

So the choice is workload-shaped. Write-heavy ingestion paths like logs, metrics, and event streams pick LSM. Read-heavy transactional workloads with range queries pick B-trees. Neither is better. The disk forced the trade.

Key takeaway

B-trees and LSM trees are not competing structures. They are two different answers to the same hardware question: random IO is expensive, sequential IO is cheap, and a database has to pick which one its write path pays for.

Originally posted on LinkedIn. View original.


All Rights Reserved.