B-Trees vs LSM-Trees: How the Write Path Differs

February 11, 2026

B-trees and LSM-trees are the two storage structures behind almost every database you have ever used, and the cleanest way to understand them is to look at what happens on a single INSERT or UPDATE. The write path is where the two designs diverge, and almost every other property of the engine follows from that one choice.

A B-tree write finds the correct leaf page by walking down from the root, loads that page from disk if it is not in the buffer pool, mutates the row in place, and writes the page back. If the page is full, the engine splits it and may have to update the parent. The dominant cost is the disk I/O to fetch and rewrite that specific page. The location of the page is determined by the key, so consecutive writes to unrelated keys land in unrelated pages. That is random I/O. Postgres, MySQL InnoDB, SQLite, and most traditional relational engines work this way. The point is that the canonical layout on disk is always current and always sorted by key. There is no background work to make a read correct.

An LSM-tree write does none of that. It appends the write to a write-ahead log for durability, then inserts it into an in-memory structure called a memtable, usually a skip list or a balanced tree. The disk is not touched in the hot path beyond the sequential WAL append. When the memtable fills, the engine flushes it to disk as a new immutable sorted file, called an SSTable, written sequentially. New writes go into a fresh memtable. The on-disk state is now a stack of sorted runs, each one immutable. A background compaction process periodically merges overlapping SSTables into larger ones, which is how the engine reclaims space from overwritten keys and keeps the number of files bounded. RocksDB, LevelDB, Cassandra, HBase, and ScyllaDB all use this shape.

The practical consequence is that B-tree writes are roughly one random page write each, while LSM writes are sequential and batched. On a workload of small random writes, an LSM-tree can absorb an order of magnitude more throughput on the same disk. The cost is that reads now have to consult the memtable plus every overlapping SSTable, mitigated by bloom filters and compaction. B-trees pay on the write to keep reads simple. LSMs pay on the read to keep writes cheap. Pick the one whose tradeoff matches the workload.

Key takeaway

B-trees mutate the leaf page on disk and pay random I/O per write. LSM-trees buffer writes in memory, flush sorted runs, and compact in the background. Everything else follows from that choice.

Originally posted on LinkedIn. View original.