B+-Trees, LSM Trees, WAL, and MVCC

Topics Covered

B+-Trees: Traditional Balanced Indexing

Point lookup — finding one key

Range scan — finding many keys

Write amplification and page splits

Other index types worth knowing

When NOT to index

LSM Trees: Log-Structured Merge Trees for Write Optimization

The write path

The read path — and why it is slower

Compaction — the background maintenance

The three amplification factors

Which databases use which engine

Tuning LSM-tree performance in production

Write-Ahead Logging (WAL): Foundation of Durability

Why not write data pages directly?

Crash recovery — redo and undo

Checkpoints — bounding recovery time

WAL size management

WAL and fsync — the durability guarantee

MVCC: Multi-Version Concurrency Control

How it works — version chains

Snapshot isolation in practice

Garbage collection — vacuum

Write-write conflicts

MVCC and indexes

Practical MVCC advice

How B+-trees, LSM-trees, WAL, and MVCC connect

Summary: when to use what

Think of a library where the front desk does not store books — it stores signposts that tell you which room to go to. Inside each room are more signposts, until you reach the shelves that hold the actual books.

That is a B+-tree — the most important data structure in database internals, and the default index type for nearly every relational database.

  • Internal nodes = signposts (contain keys for navigation only, no actual data).
  • Leaf nodes = shelves (sorted keys + pointers to the actual rows on disk).
  • All leaves are at the same depth — the tree is always balanced, regardless of how many keys are inserted.
  • Leaves are linked left-to-right, so walking through consecutive keys is fast (range scans).

PostgreSQL's default "B-tree" index is exactly this. MySQL/InnoDB's primary key is a clustered B+-tree where the leaf nodes store the entire row, not just a pointer. The distinction matters: PostgreSQL's B+-tree leaves store pointers (TIDs) to heap rows, requiring a second lookup. InnoDB's clustered index stores the row data directly in the leaf, eliminating that second lookup for primary key queries.

B Plus Tree Lookup and Range Scan

Point lookup — finding one key

To find key 35 in a tree with root keys [17, 42]:

  1. Root: 35 is between 17 and 42, so follow the middle child pointer.
  2. Internal node has keys [23, 35]. 35 matches, follow the right child.
  3. Leaf node contains the key 35 with a pointer to the row on disk.

Total I/O: 3 page reads (root + internal + leaf). For a table with 100 million rows and a branching factor of 500, the tree is only 3-4 levels deep. Every point lookup costs 3-4 disk reads, regardless of table size. This is O(log_B N) where B is the branching factor — much wider and shallower than a binary tree. Compare this to a binary tree: log_2(100M) is about 27 levels. A B+-tree with branching factor 500: log_500(100M) is about 3 levels. The B+-tree is roughly 9x shallower, meaning 9x fewer disk reads per lookup.

Buffer pool optimization. In practice, the root and internal pages are almost always cached in the database's buffer pool (shared memory). A 100M-row table with a branching factor of 500 has about 200,000 leaf pages, roughly 400 internal pages above them, and 1 root page. The root and internal pages together are only a few MB and fit easily in memory. This means most point lookups actually require only one or two disk reads — typically just the leaf page — because the rest is served from cache.


Range scan — finding many keys

SELECT * FROM orders WHERE amount BETWEEN 100 AND 500

The tree navigates to the leaf containing key 100 (same 3-4 page reads), then follows the leaf-to-leaf linked list forward until it passes key 500. If 1,000 rows match and each leaf page holds 200 keys, this scan reads about 5 sequential leaf pages after the initial lookup. Sequential disk reads are orders of magnitude faster than random reads.
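The point lookup and range scan above can be sketched in a few lines of Python. This is an illustrative in-memory model — the class names and the small two-level tree are invented for the example, not any engine's actual on-disk page format:

```python
import bisect

class Internal:
    """Signpost node: separator keys and child pointers, no row data."""
    def __init__(self, keys, children):
        self.keys, self.children = keys, children

class Leaf:
    """Shelf node: sorted keys, row pointers, and a link to the next leaf."""
    def __init__(self, keys, rows):
        self.keys, self.rows = keys, rows
        self.next = None  # leaf-to-leaf link that makes range scans cheap

def find_leaf(node, key):
    """Descend from the root to the leaf that would hold `key`."""
    while isinstance(node, Internal):
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node

def point_lookup(root, key):
    leaf = find_leaf(root, key)
    i = bisect.bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.rows[i]
    return None  # key not present

def range_scan(root, lo, hi):
    """One descent to the leaf containing `lo`, then walk the leaf chain."""
    leaf = find_leaf(root, lo)
    while leaf is not None:
        for k, row in zip(leaf.keys, leaf.rows):
            if k > hi:
                return
            if k >= lo:
                yield k, row
        leaf = leaf.next
```

Note that `range_scan` touches internal nodes only once; after that it follows sibling pointers, which on disk translates to sequential leaf-page reads.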

Key Insight

B+-trees are optimized for reads, not writes. Every INSERT must find the correct leaf, modify it, and potentially split the leaf and propagate the split up the tree. Under high write loads (100K+ inserts/sec), the random I/O from updating scattered leaf pages becomes the bottleneck. This is exactly the problem LSM-trees solve by converting random writes into sequential writes.


Write amplification and page splits

When a leaf is full and a new key arrives that belongs there, the database splits the leaf into two pages and inserts a new separator key into the parent. If the parent is also full, it splits too — cascading up to the root. Each split writes multiple pages to disk.

Write amplification = the ratio of bytes written to disk versus bytes of data inserted. B+-trees have moderate write amplification (typically 10-30x) because each insert modifies a leaf page, potentially triggers splits, and writes WAL entries. For read-heavy workloads (90%+ reads) this is fine. For write-heavy workloads, LSM-trees offer a better trade-off.
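The split mechanics can be sketched as follows — a toy capacity of 4 keys stands in for a real page's hundreds, and `insert_into_leaf` is a hypothetical helper, not any engine's API:

```python
import bisect

# Sketch of a leaf page split: when a full leaf receives one more key,
# the right half of its entries moves to a new page and the new page's
# first key is pushed up into the parent as a separator.

CAPACITY = 4  # keys per leaf for the example; real pages hold hundreds

def insert_into_leaf(leaf, key):
    """Insert `key` into a sorted leaf. On overflow, split the leaf and
    return (new_leaf, separator_key) for the parent; otherwise None."""
    bisect.insort(leaf, key)
    if len(leaf) <= CAPACITY:
        return None  # fits in place, no split
    mid = len(leaf) // 2
    new_leaf = leaf[mid:]         # right half moves to the new page
    del leaf[mid:]
    return new_leaf, new_leaf[0]  # separator key to insert into the parent
```

If the parent cannot absorb the returned separator, the same move repeats one level up — that is the cascade described above, and each level of the cascade is another page written to disk.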

Partial indexes and covering indexes. Two important B+-tree optimizations:

  • Partial indexes (CREATE INDEX ON orders(customer_id) WHERE status = 'pending') index only rows matching a predicate. If 95% of orders are completed, the index is 20x smaller, uses less memory, and has lower write amplification because most inserts skip it entirely.
  • Covering indexes (CREATE INDEX ON orders(customer_id) INCLUDE (total, created_at)) store additional columns in the leaf nodes. A query that only needs customer_id, total, and created_at can be answered entirely from the index without touching the table — an "index-only scan." This eliminates the heap lookup that normally follows an index lookup.

Other index types worth knowing

B+-trees are the default, but not the only option:

  • Hash indexes — O(1) point lookups by hashing the key. Cannot do range scans (WHERE x > 5). PostgreSQL supports them but they are rarely used because B+-trees handle both point lookups (O(log N), but practically 1-2 disk reads) and range scans.
  • GIN (Generalized Inverted Index) — for full-text search and array containment. Maps each token to a list of row pointers. CREATE INDEX ON articles USING gin(to_tsvector('english', body)).
  • BRIN (Block Range Index) — for naturally ordered data like timestamps. Stores the min/max value for each physical block of rows. Extremely small index size (KB instead of GB) for append-only tables where rows are inserted in timestamp order. A query like WHERE created_at > '2025-01-01' skips entire blocks whose max timestamp is before that date.
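The BRIN idea is small enough to sketch directly — assuming one (min, max) summary per fixed-size block of rows, with invented helper names:

```python
# Sketch of BRIN-style block skipping: store only (min, max) per physical
# block of rows, then prune blocks whose range cannot match the predicate.

def build_brin(values, block_size):
    """Summarize each block of `block_size` values as a (min, max) pair."""
    summaries = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        summaries.append((min(block), max(block)))
    return summaries

def blocks_to_scan(summaries, cutoff):
    """Block numbers that may contain values > cutoff (a WHERE x > cutoff
    query only needs to read these blocks)."""
    return [i for i, (_, blk_max) in enumerate(summaries) if blk_max > cutoff]
```

The index is tiny because it stores two values per block of rows, not one entry per row — which is why BRIN works well only when physical row order correlates with the indexed column.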

When NOT to index

Not every column needs an index. Indexes are write-time costs that pay off at read-time. Before adding an index, check:

  • Selectivity — does the column have enough distinct values? A boolean column (is_active) with 50/50 distribution has low selectivity. A B+-tree index on it would return half the table, and the query planner would choose a full table scan instead. Indexes help most on columns with high selectivity (unique IDs, timestamps, email addresses).
  • Write volume — a table with 100K inserts/sec and 100 reads/sec is write-dominated. Each index adds write overhead. Only add indexes that serve critical read paths.
  • Table size — a table with 1,000 rows does not need indexes. The entire table fits in one or two memory pages. A sequential scan of 1,000 rows is faster than an index traversal (which has fixed overhead from traversing the tree structure).

How to find missing indexes. PostgreSQL's pg_stat_user_tables tracks sequential scans (seq_scan) and index scans (idx_scan) per table. A table with millions of rows and a high seq_scan count is a candidate for indexing. Use EXPLAIN ANALYZE to see the actual query plan — look for Seq Scan nodes with large row estimates. MySQL's slow query log identifies queries exceeding a latency threshold, and EXPLAIN shows whether they use indexes. The combination of slow queries + missing index usage is the most common and highest-impact performance optimization in production databases.

How to find unused indexes. PostgreSQL's pg_stat_user_indexes tracks how often each index is scanned (idx_scan). An index with zero scans over a month is likely unused — it costs write performance (every INSERT/UPDATE/DELETE must maintain it) without providing read benefit. Drop unused indexes to reduce write amplification. Before dropping, verify with pg_stat_user_indexes.idx_scan over a representative time period that covers all workload patterns (daily batches, monthly reports, yearly audits).

LSM-trees flip the B+-tree trade-off: writes are fast (sequential I/O), reads are slower (must check multiple levels). The key insight is that converting random writes into sequential writes gives you 10-100x better write throughput on both spinning disks and SSDs.

Why such a large gap? On a spinning disk, a random read requires a physical head seek (8-10ms) plus rotational latency (4ms), totaling roughly 12ms per operation — about 80 random IOPS. Sequential reads from the same disk can sustain 100MB/sec because the head does not move. Even on SSDs, where random reads are 0.1ms, sequential reads are still 5-10x faster due to internal parallelism and prefetching. LSM-trees exploit this asymmetry ruthlessly.

LSM Tree Write Path

The write path

  1. Write-ahead log (WAL) — every write is first appended to a log file on disk for durability. This is a sequential append — the fastest possible disk operation.
  2. Memtable — the write is inserted into an in-memory balanced tree (red-black tree or skip list), sorted by key.
  3. Flush — when the memtable reaches a size threshold (typically 64MB-256MB), it is written to disk as an immutable, sorted file called an SSTable (Sorted String Table). This is one sequential write.

No random I/O happens during writes. Every write is sequential — first to the WAL, then the in-memory sort, then a bulk flush. This is why Cassandra, RocksDB, LevelDB, and HBase all use LSM-trees.

Memtable details. The memtable is typically a skip list or red-black tree — both provide O(log N) insert and lookup. When the active memtable reaches its size threshold, it becomes immutable (no more writes accepted) and a new empty memtable takes its place. The immutable memtable is flushed to disk in the background while the new memtable accepts writes. This means writes are never blocked by flushes — there is always an active memtable ready to accept data.

SSTable format. Each SSTable on disk is an immutable, sorted file with several components: a data block (sorted key-value pairs), an index block (sparse key-to-offset mappings for binary search), a bloom filter (for quick existence checks), and metadata (compression, creation timestamp, key range). Because SSTables are immutable, they never need locking and can be read concurrently by multiple threads. Deletes are handled by writing a tombstone marker — a special record that says "this key has been deleted." The tombstone is cleaned up during compaction.
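The three-step write path can be sketched as follows. In-memory lists stand in for the WAL file and SSTable files, and the class and method names are invented for illustration:

```python
# Sketch of the LSM write path: append to the WAL, insert into a sorted
# memtable, flush to an immutable sorted SSTable at the size threshold.

TOMBSTONE = object()  # stand-in for a delete marker

class LSMWriter:
    def __init__(self, memtable_limit=4):
        self.wal = []        # stand-in for the append-only log file
        self.memtable = {}   # stand-in for a skip list / red-black tree
        self.sstables = []   # newest last; each is a sorted list of pairs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))   # 1. sequential WAL append
        self.memtable[key] = value      # 2. in-memory sorted insert
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                # 3. one sequential bulk write

    def delete(self, key):
        self.put(key, TOMBSTONE)        # deletes are tombstone writes

    def flush(self):
        sstable = sorted(self.memtable.items())  # immutable, key-sorted
        self.sstables.append(sstable)
        self.memtable = {}
        self.wal = []  # flushed data no longer needs its WAL entries
```

Nothing in `put` touches an existing disk structure in place — every step is an append or a bulk write, which is the whole point of the design.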


The read path — and why it is slower

To read a key, the system must check multiple places in order:

  1. Check the memtable (in memory — fast).
  2. Check each SSTable on disk, starting from the most recent.

In the worst case, the key does not exist, and the system checks every SSTable at every level before returning "not found." Two optimizations make this practical:

Bloom filters — a probabilistic data structure that tells you "definitely not in this SSTable" or "maybe in this SSTable." A bloom filter with a 1% false-positive rate eliminates 99% of unnecessary SSTable reads. Each SSTable has its own bloom filter loaded in memory.

Sparse index — each SSTable stores a sorted index of key ranges. The system binary-searches the sparse index to jump directly to the correct block within the SSTable, rather than scanning from the start.

Read amplification in numbers. An LSM-tree with 5 levels and 10 SSTables per level has 50 potential locations for a key. With bloom filters (1% false positive), a point read checks the memtable (free), then hits roughly 0.5 false-positive SSTables out of 50. Without bloom filters, every read checks all 50 SSTables — catastrophically slow. This is why Cassandra allocates significant memory to bloom filters (default: 10% of heap) and why monitoring bloom filter false-positive rates is critical for production LSM systems.
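A minimal bloom filter sketch, assuming k hash functions derived from SHA-256 (real engines use faster non-cryptographic hashes such as xxHash or MurmurHash):

```python
import hashlib

# Bloom filter sketch: k hashes set k bits per added key. If any of a
# key's bits is clear, the key was definitely never added; if all are
# set, the key is "maybe here" (false positives are possible).

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))
```

In an LSM engine, one such filter per SSTable (kept in memory) lets a point read skip almost every file that cannot contain the key.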


Compaction — the background maintenance

Over time, SSTables accumulate. Multiple SSTables may contain different versions of the same key (old value, updated value, deleted tombstone). Compaction merges SSTables, keeping only the latest version of each key and discarding tombstones.

Size-tiered compaction — when enough SSTables of similar size accumulate, merge them into one larger SSTable. Simple, good write throughput, but temporarily uses 2x disk space during the merge.

Leveled compaction — SSTables are organized into levels (L0, L1, L2...). Each level has a size limit (10x the previous). Compaction merges one SSTable from level N into the overlapping SSTables in level N+1. Better read performance (fewer SSTables per level), but higher write amplification (data gets rewritten as it moves through levels).

Space amplification. LSM-trees can use 2-10x more disk space than the logical data size because multiple versions of the same key exist across SSTables until compaction merges them. Size-tiered compaction temporarily doubles disk usage during a merge (old + new SSTables coexist until the old is deleted). Systems like RocksDB monitor compaction_pending_bytes and will throttle writes if disk space becomes critical.
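The merge at the heart of compaction can be sketched like this — assuming the SSTables are passed newest-to-oldest so the latest version of each key wins, with invented names throughout:

```python
import heapq

TOMBSTONE = None  # stand-in for a delete marker

def compact(sstables):
    """Merge SSTables (newest first) into one sorted SSTable, keeping only
    the newest version of each key and discarding tombstones."""
    merged = []
    seen = set()
    # Tag each entry with its SSTable's age so that for equal keys,
    # heapq.merge yields the newest version first.
    entries = heapq.merge(
        *[[(key, age, value) for key, value in table]
          for age, table in enumerate(sstables)]
    )
    for key, _age, value in entries:
        if key in seen:
            continue              # an older, superseded version
        seen.add(key)
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged
```

Because inputs and output are all sorted, the merge is pure sequential I/O on disk — which is why compaction can keep up with high write rates at all.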


The three amplification factors

Every storage engine lives on a triangle of three amplification factors. Improving one worsens at least one other:

  • Write amplification — bytes written to disk / bytes of data written by application. B+-trees: 10-30x (random page rewrites). LSM-trees with leveled compaction: 10-30x (data rewritten across levels). LSM with size-tiered: 2-4x (lower, but space amplification is worse).
  • Read amplification — disk reads per query. B+-trees: 1-4 reads (single tree traversal). LSM-trees: potentially dozens of reads (checking multiple SSTables), reduced by bloom filters.
  • Space amplification — disk space used / logical data size. B+-trees: 1.5-3x (page fragmentation, dead space from deletes). LSM-trees: 2-10x (multiple versions in different SSTables).

There is no storage engine that wins on all three. The choice depends on your workload: read-heavy favors B+-trees (low read amplification), write-heavy favors LSM-trees (low write amplification), space-constrained favors B+-trees (lower space amplification).


Which databases use which engine

Database               Storage Engine          Index Type           Best For
PostgreSQL             Heap + B+-tree          B+-tree              OLTP, mixed read/write
MySQL/InnoDB           Clustered B+-tree       B+-tree              OLTP, primary key lookups
Cassandra              LSM-tree                SSTable              Write-heavy, time-series
RocksDB                LSM-tree                SSTable              Embedded, write-heavy
MongoDB (WiredTiger)   Both B+-tree and LSM    B+-tree (default)    Document workloads
TiDB                   LSM-tree (TiKV)         SSTable + B+-tree    Distributed SQL

Some databases let you choose. MongoDB's WiredTiger engine supports both B+-tree (default, better for reads) and LSM-tree (better for writes) via configuration. RocksDB is an embeddable LSM-tree engine used as the storage backend for many databases (CockroachDB, TiKV, Kafka Streams state stores).

Hybrid approaches. Some modern systems use both structures. WiredTiger uses a B+-tree for the primary index and an LSM-tree-like write buffer for fast ingestion. TiDB separates the SQL layer (which thinks in B+-tree terms) from the storage layer (TiKV, which uses RocksDB/LSM). The SQL layer generates sorted key ranges; the storage layer handles the write-optimized persistence. This layered architecture allows each component to use the data structure best suited to its role.


Tuning LSM-tree performance in production

If you operate Cassandra, RocksDB, or HBase, these are the knobs that matter most:

  • Memtable size (write_buffer_size in RocksDB, memtable_heap_space in Cassandra) — larger memtables mean fewer flushes (better write throughput) but more memory usage and longer recovery times (more WAL to replay on crash). Typical range: 64MB to 256MB.
  • Level 0 file count trigger (level0_file_num_compaction_trigger) — how many L0 SSTables accumulate before triggering compaction to L1. Higher values mean fewer compactions but worse read performance (more L0 files to check on every read). Lower values mean more frequent compaction but better read latency.
  • Bloom filter bits per key — more bits means lower false-positive rate but more memory. 10 bits/key gives roughly 1% false-positive rate. 15 bits gives 0.1%. For most workloads, 10 bits is sufficient.
  • Compaction concurrency (max_background_compactions) — how many compaction threads run simultaneously. More threads mean faster compaction but compete with query threads for CPU and disk I/O. Monitor compaction pending bytes: if it grows continuously, increase concurrency.
  • Block cache size — in-memory cache for SSTable data blocks. Unlike the memtable (which caches writes), the block cache accelerates reads by keeping frequently-accessed SSTable blocks in memory. Set this to the largest value your memory budget allows after accounting for memtables, bloom filters, and OS page cache.

Monitoring LSM health. Track these metrics in production:

  • Compaction pending bytes — how much data is waiting to be compacted. Continuously rising means compaction cannot keep up with writes.
  • Write stalls — RocksDB will throttle or stop writes entirely if too many L0 files accumulate. This manifests as sudden latency spikes. Monitor rocksdb.is-write-stopped and rocksdb.actual-delayed-write-rate.
  • SSTable count per level — helps identify which level is the bottleneck. A large L0 count means flushes are outpacing L0-to-L1 compaction.
  • Read latency percentiles — P99 read latency is the best indicator of whether compaction is keeping up. If P99 degrades while P50 stays flat, SSTables are accumulating and some reads are hitting many files.

Setting up alerting on these metrics before you need them is critical. A write stall on a production Cassandra cluster during peak traffic is not the time to learn how compaction tuning works. Establish baselines during normal operation and alert on deviations — compaction pending bytes growing for more than 5 minutes, P99 read latency exceeding 2x baseline, or write stall events occurring at all.

Interview Tip

In interviews, the classic comparison question is B+-tree vs. LSM-tree. The one-sentence answer: B+-trees optimize for reads (sorted on disk, one traversal), LSM-trees optimize for writes (sequential I/O, no random writes). Cassandra uses LSM-trees because it is write-heavy (append-only event data). PostgreSQL uses B+-trees because OLTP workloads are read-heavy. Knowing which data stores use which structure signals that you understand the trade-off, not just the theory.

Every database makes a promise: if it said "commit succeeded," the data survives a crash. WAL is how that promise is kept.

The rule is simple — write the change to a durable log before modifying the actual data pages. If the database crashes after the log write but before the data page write, the log has everything needed to reconstruct the change on restart. If the crash happens before the log write, no data pages were touched, so the state is consistent.

WAL Crash Recovery Flow

Why not write data pages directly?

Data pages are scattered across disk. Updating a B+-tree leaf, then its parent, then the free-space map involves random writes to different disk locations. If the database crashes between these writes, the data is in an inconsistent state — one page is updated, the others are not.

WAL solves this by making every change an append to a single sequential file. Sequential appends are the fastest disk operation and are atomic at the OS level (a single fsync). The actual data pages are updated lazily in the background. If a crash happens, the WAL provides a complete record of what happened and in what order.

The key principle is "log first, apply later." The WAL is the source of truth. Data pages are merely a cache of what the WAL describes. Even if every data page were corrupted, the database could reconstruct them entirely from the WAL (starting from a known-good backup). This is why WAL is sometimes called the "journal" — it is the authoritative record of all database activity.


Crash recovery — redo and undo

On startup after a crash, the database reads the WAL from the last checkpoint (a known-good point where all data pages matched the WAL):

  1. Redo phase — replay all committed transactions from the checkpoint forward. These transactions were committed (logged in WAL) but their data pages may not have been written to disk yet. Redo ensures the data pages reflect all committed work.
  2. Undo phase — roll back any in-progress transactions that never committed. These transactions wrote WAL entries (for rollback tracking) but the commit record is missing from the WAL, meaning the transaction was interrupted by the crash.

After both phases, the database is in a consistent state: all committed transactions are applied, all uncommitted transactions are reverted. This process typically takes seconds, even for databases with millions of rows.

Recovery time in numbers. A database with a 5-minute checkpoint interval needs to replay at most 5 minutes of WAL on crash recovery. At a typical WAL write rate of 10MB/sec, that is about 3GB of WAL to replay. Modern SSDs read at 2GB/sec, so recovery takes roughly 2 seconds of I/O plus CPU time for applying changes. Even under heavy write loads, WAL recovery is almost always under 30 seconds — far faster than alternatives like full table scans for consistency checks.
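The two recovery phases can be sketched with a toy record format — an "update" record carrying both the old and new value (for undo and redo respectively) plus a "commit" marker. Real WAL records are binary and page-oriented; this is only the shape of the algorithm:

```python
# Redo/undo recovery sketch. `pages` is the possibly-inconsistent on-disk
# state after a crash; `wal` is the ordered log since the last checkpoint.
# Records: ("update", txid, key, old_value, new_value) or ("commit", txid).

def recover(pages, wal):
    committed = {rec[1] for rec in wal if rec[0] == "commit"}
    # Redo phase: reapply every update from committed transactions, in
    # log order, so data pages reflect all committed work.
    for rec in wal:
        if rec[0] == "update" and rec[1] in committed:
            _, _, key, _old, new = rec
            pages[key] = new
    # Undo phase: revert updates from transactions that never committed,
    # newest first, restoring the pre-update values.
    for rec in reversed(wal):
        if rec[0] == "update" and rec[1] not in committed:
            _, _, key, old, _new = rec
            pages[key] = old
    return pages
```

After both passes, every committed change is present and every uncommitted one is gone — exactly the consistent state the WAL promises.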


Checkpoints — bounding recovery time

Without checkpoints, crash recovery would replay the entire WAL from the beginning of time. Checkpoints flush all dirty (modified) data pages to disk and write a checkpoint record to the WAL. On recovery, the database only replays WAL entries after the last checkpoint.

The trade-off: Frequent checkpoints (every 30 seconds) mean faster recovery but more I/O during normal operation (flushing pages). Infrequent checkpoints (every 5 minutes) mean less I/O overhead but slower recovery. PostgreSQL's checkpoint_timeout defaults to 5 minutes. PostgreSQL also spreads the flush over the interval using checkpoint_completion_target (default 0.9), so 90% of the interval is spent gradually flushing pages rather than all at once.

WAL archiving and replication. WAL is not only for crash recovery. PostgreSQL ships WAL segments to followers for streaming replication — followers are essentially replaying the leader's WAL continuously. WAL archiving stores segments to external storage (S3, NFS) for point-in-time recovery: restore a base backup, then replay WAL up to any target timestamp. This is how production databases achieve both high availability (replication) and disaster recovery (PITR) from the same mechanism.


WAL size management

WAL files accumulate on disk between checkpoints. PostgreSQL manages this with three settings:

  • max_wal_size — the maximum total WAL size before triggering a checkpoint (default 1GB). When WAL reaches this limit, PostgreSQL forces a checkpoint even if checkpoint_timeout has not elapsed, flushing dirty pages and allowing old WAL segments to be recycled.
  • wal_keep_size — the minimum WAL to retain for replication. Even after a checkpoint, WAL segments needed by connected replicas are not deleted. If a replica falls behind and the required WAL has been recycled, the replica must be re-initialized from a full backup.
  • archive_command — a shell command executed for each completed WAL segment to copy it to external storage. This enables point-in-time recovery (PITR) by preserving the complete WAL history.

A write-heavy database producing 100MB of WAL per minute will generate 6GB per hour. Without archiving, only the last max_wal_size is available for recovery. With archiving, the full history is preserved externally, allowing recovery to any point in time at the cost of storage.


WAL and fsync — the durability guarantee

WAL only provides durability if it reaches stable storage before the database acknowledges the commit. This is the role of fsync — it forces the operating system to flush its file cache to the physical disk. Without fsync, the OS can buffer writes in memory and report them as "written" when they are actually still in volatile cache. A power failure loses the buffer.

PostgreSQL's synchronous_commit = on (default) calls fsync on every commit's WAL record. This guarantees durability but adds latency (1-2ms per commit on SSDs, 5-10ms on spinning disks). Setting synchronous_commit = off returns before the fsync completes — faster commits, but a crash in the flush window (at most about three times wal_writer_delay, roughly 600ms at the default settings) can lose recent commits. This is an acceptable trade-off for some workloads (session data, counters) but not for financial transactions.
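The underlying pattern — append, then fsync, then acknowledge — looks like this. A sketch of the principle, not PostgreSQL's actual WAL writer:

```python
import os

def durable_append(path, record):
    """Append `record` (bytes) to the log and fsync before returning.
    The commit may be acknowledged only after this returns."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        os.write(fd, record)
        os.fsync(fd)  # force the OS cache out to the physical device
    finally:
        os.close(fd)
    # Only now is it safe to report "commit succeeded" to the client.
```

Dropping the `os.fsync` call is the moral equivalent of synchronous_commit = off: the write sits in the volatile OS cache, and a power failure in that window loses it.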

The interaction between WAL, fsync, and disk hardware is the foundation of database durability. Battery-backed write caches on storage controllers can reduce fsync latency by acknowledging writes as soon as they reach the battery-protected cache, before the physical disk platters or NAND cells are written. This is why enterprise databases run on hardware with write caches — it eliminates the durability-vs-performance trade-off at the hardware level.

Cloud storage and durability. On AWS, EBS volumes provide built-in replication across availability zones, which means even without fsync, data on EBS is unlikely to be lost due to single-disk failure. However, the OS page cache is still volatile — a crash can lose unflushed writes. Cloud databases still use WAL and fsync to protect against process crashes, OS crashes, and the (rare) case of EBS volume failure. The cloud does not eliminate WAL — it just changes which failure modes you worry about.

Common Pitfall

A common misconception is that WAL is only about crash recovery. WAL is also the mechanism behind replication (followers stream and replay the leader's WAL), point-in-time recovery (restore to any moment by replaying WAL up to a target timestamp), and change data capture (CDC tools like Debezium read the WAL to stream changes to downstream systems). The WAL is the single source of truth for everything that happened to the database.

How does a database let readers and writers operate on the same row simultaneously without blocking each other?

The answer is MVCC — instead of locking a row so only one transaction can access it, the database keeps multiple versions of each row. Readers see a consistent snapshot of the data as it existed when their transaction started. Writers create new versions without disturbing the snapshot that readers are using. The database manages the version chain and garbage collection automatically.

The result: readers never block writers, and writers never block readers. This is the foundation of PostgreSQL, MySQL/InnoDB, Oracle, and every modern OLTP database.

Why not just use locks? Traditional locking (two-phase locking) makes readers wait for writers and writers wait for readers. A long-running analytics query that reads 10 million rows would block every write to those rows for the entire query duration — potentially minutes. MVCC eliminates this contention by giving the analytics query a snapshot that is unaffected by concurrent writes. The writes proceed freely, creating new versions that the analytics query cannot see.


How it works — version chains

When a row is updated, the database does not overwrite the existing version. Instead, it creates a new version with the updated values and links it to the old version. Each version is tagged with the transaction ID (xid) that created it. The old version remains readable by any transaction whose snapshot predates the new version.

In PostgreSQL, the version chain is stored directly in the table heap (old and new versions are adjacent rows in the same file). In MySQL/InnoDB, the latest version is in the primary key B+-tree and old versions live in a separate undo log. The PostgreSQL approach is simpler but causes table bloat. The MySQL approach avoids bloat but makes reading old versions slower (requires traversing undo log pointers).

 
Row: user_id=42
  Version 3 (xid=105): name="Alice", created by UPDATE
  Version 2 (xid=98):  name="Alicia", created by UPDATE
  Version 1 (xid=50):  name="alice", created by INSERT

When Transaction 100 reads user_id=42, it sees Version 2 (the latest version created by a transaction that committed before transaction 100 started). Transaction 105's update is invisible to Transaction 100, even though it may have already been written to disk.


Snapshot isolation in practice

Each transaction gets a snapshot — a consistent view of the database at a point in time. The snapshot is defined by a list of which transactions were committed at the moment the snapshot was taken.

Visibility rule: a version is visible to transaction T if:

  1. The version was created by a transaction that committed before T's snapshot, AND
  2. The version was not deleted by a transaction that committed before T's snapshot.

This means a long-running analytics query (transaction 100) sees a frozen view of the data, even as hundreds of writes happen around it. No locks. No blocking. The analytics query reads old versions while writers create new ones.
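The visibility rule can be sketched directly, assuming each version carries the xid that created it (xmin) and the xid that superseded or deleted it (xmax) — the same bookkeeping PostgreSQL keeps in its tuple headers. A snapshot here is simply the set of xids committed when the transaction started:

```python
# MVCC visibility sketch: a version is visible to a snapshot if its
# creator committed before the snapshot AND no deleter committed before
# the snapshot. `snapshot` is the set of committed xids at snapshot time.

def visible(version, snapshot):
    created_visible = version["xmin"] in snapshot
    deleted_visible = (version["xmax"] is not None
                       and version["xmax"] in snapshot)
    return created_visible and not deleted_visible

def read(version_chain, snapshot):
    """Return the newest visible version's value (chain is newest-first)."""
    for version in version_chain:
        if visible(version, snapshot):
            return version["value"]
    return None  # row did not exist yet for this snapshot
```

Run against the user_id=42 chain above, a snapshot taken before xid 105 committed returns "Alicia", while a later snapshot returns "Alice" — no locks involved in either read.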


Garbage collection — vacuum

Old versions that are no longer visible to any active transaction are dead weight. Vacuum (PostgreSQL) or purge (MySQL) reclaims this space.

PostgreSQL's MVCC stores old versions in the same table file (heap). Without regular vacuuming, the table grows unbounded — a table with 1 million live rows might occupy disk space for 10 million row versions. This is table bloat. Autovacuum runs in the background, but if write-heavy workloads create versions faster than autovacuum can clean them, bloat accumulates.

MySQL/InnoDB stores old versions in a separate undo log. The purge thread cleans the undo log continuously. This avoids table bloat but adds complexity to version chain traversal (the engine must follow pointers from the data page into the undo log to find older versions).

Long-running transactions are the enemy of MVCC. A transaction that stays open for hours prevents the database from reclaiming any version created after that transaction started — those versions might still be needed by the old snapshot. In PostgreSQL, this manifests as table bloat. In MySQL, it manifests as a growing undo log that cannot be purged. Both cause performance degradation. Connection pool idle_in_transaction_session_timeout settings exist specifically to kill forgotten long-running transactions before they cause damage.


Write-write conflicts

MVCC does not eliminate all conflicts. If two concurrent transactions update the same row, the second transaction detects a conflict (the row was modified by a committed transaction after its snapshot). Under snapshot isolation, this typically causes the second transaction to abort with a serialization error. The application must retry the transaction.

This is fundamentally different from locking, where the second transaction would wait for the first to finish. MVCC's approach is optimistic — assume conflicts are rare, and abort only when they happen. For most OLTP workloads with low contention, this means far less time spent waiting.
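The first-committer-wins check and the application-side retry loop can be sketched as follows — `commit_update` and the xid bookkeeping are invented for illustration:

```python
# Optimistic write-write conflict sketch: committing an update fails if
# another transaction committed a newer version of the row after our
# snapshot was taken; the application retries with a fresh snapshot.

class SerializationError(Exception):
    pass

class Row:
    def __init__(self, value, last_committed_xid=0):
        self.value = value
        self.last_committed_xid = last_committed_xid

def commit_update(row, snapshot_xid, new_value, my_xid):
    """First committer wins: abort if the row changed since our snapshot."""
    if row.last_committed_xid > snapshot_xid:
        raise SerializationError("row modified since snapshot; retry")
    row.value = new_value
    row.last_committed_xid = my_xid

def with_retry(fn, attempts=3):
    """The application-side pattern: rerun the transaction on conflict."""
    for _ in range(attempts):
        try:
            return fn()
        except SerializationError:
            continue  # take a fresh snapshot and try again
    raise SerializationError("gave up after retries")
```

Note where the time goes: under locking the loser would have waited inside the database; here it fails fast and redoes its (usually cheap) work outside.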

Isolation levels and MVCC. Different isolation levels change what the snapshot sees:

  • Read Committed (PostgreSQL default) — each statement gets a fresh snapshot. A transaction sees commits made by other transactions between its own statements. Prevents dirty reads.
  • Repeatable Read (MySQL default) — the transaction gets one snapshot at the start. All statements see the same data, even if other transactions commit in between. Prevents dirty reads and non-repeatable reads.
  • Serializable — the strictest level. MVCC plus additional checks (serializable snapshot isolation in PostgreSQL) detect and abort transactions that would have produced anomalies under true serial execution. Highest safety, highest abort rate.

For most applications, Read Committed or Repeatable Read provides sufficient isolation. Serializable is reserved for workloads where even phantom reads (new rows appearing mid-transaction) are unacceptable, such as financial reporting.
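The difference between the first two levels is purely about when the snapshot is taken. A toy model makes this concrete; the names and the representation of committed versions as (timestamp, value) pairs are illustrative, not how any engine stores them:

```python
# Minimal model: snapshot per statement (Read Committed) vs. per transaction
# (Repeatable Read). Committed history of one row, newest version last.

versions = {"balance": [(1, 100)]}
committed_ts = 1

def read(key, snapshot_ts):
    """Return the newest version committed at or before snapshot_ts."""
    visible = [v for ts, v in versions[key] if ts <= snapshot_ts]
    return visible[-1]

def commit_update(key, value):
    global committed_ts
    committed_ts += 1
    versions[key].append((committed_ts, value))

txn_snapshot = committed_ts      # Repeatable Read: snapshot taken once, at BEGIN
commit_update("balance", 50)     # another transaction commits in between

# Read Committed: each statement takes a fresh snapshot, so it sees the commit.
print(read("balance", committed_ts))   # 50
# Repeatable Read: still reads through the BEGIN-time snapshot.
print(read("balance", txn_snapshot))   # 100
```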


MVCC and indexes

MVCC affects index maintenance too. In PostgreSQL, when a row is updated, the new version gets a new physical location in the heap. Every index pointing to the old version must also create a new entry pointing to the new version — even if the indexed columns did not change. This is why updates in PostgreSQL are expensive on heavily-indexed tables.

PostgreSQL mitigates this with HOT updates (Heap-Only Tuples): if the new row version fits on the same heap page and no indexed column changed, the database skips creating new index entries. The old index entries point to the old version, which contains a forwarding pointer to the new version on the same page. HOT updates significantly reduce the write amplification of updates on tables with many indexes.

Monitor HOT update rates via pg_stat_user_tables.n_tup_hot_upd versus n_tup_upd. A low HOT ratio means updates are creating new index entries on every write — either because indexed columns are changing or because heap pages are too full to hold a new version (tune fillfactor to leave room for in-place updates).

MySQL/InnoDB avoids this problem entirely because the clustered index stores the row data. An update modifies the row in-place in the clustered index and writes the old version to the undo log. Secondary indexes still point to the primary key (not a physical address), so they remain valid regardless of row movement — no secondary index update is needed unless a secondary-indexed column changed.


Practical MVCC advice

Monitor for long-running transactions. Set idle_in_transaction_session_timeout in PostgreSQL (default: off, set to 5 minutes for production). In MySQL, monitor information_schema.innodb_trx for transactions running longer than expected. A single forgotten BEGIN in a psql session can hold back vacuum for hours and cause gigabytes of table bloat.

Understand your isolation level. Most application bugs related to MVCC come from misunderstanding the default isolation level. PostgreSQL defaults to Read Committed — each statement within a transaction sees the latest committed data. If you run SELECT balance followed by UPDATE balance = balance - 100 in the same transaction, the UPDATE might see a different balance than the SELECT if another transaction committed in between. Use SELECT ... FOR UPDATE to lock the row for the duration of the transaction when you need read-then-write consistency.

Retry on serialization errors. If you use Repeatable Read or Serializable isolation, your application MUST handle serialization failures (SQLSTATE 40001 in PostgreSQL). These are not bugs — they are the expected cost of optimistic concurrency. Wrap transaction logic in a retry loop with a limited number of attempts. Most serialization failures succeed on the first retry because the conflicting transaction has already committed.
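A minimal retry wrapper might look like the following. The exception class and the transaction function are stand-ins, not any driver's API; in practice you would catch the driver's serialization-failure error (SQLSTATE 40001):

```python
import random
import time

class SerializationFailure(Exception):
    """Stand-in for a driver's SQLSTATE 40001 error."""

def transfer(attempt):
    # Simulated transaction body: fails on the first try, succeeds on retry.
    if attempt == 0:
        raise SerializationFailure
    return "committed"

def run_with_retry(txn, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return txn(attempt)
        except SerializationFailure:
            # Brief backoff with jitter so concurrent retries do not collide again.
            time.sleep(random.uniform(0, 0.01) * (attempt + 1))
    raise RuntimeError("transaction failed after retries")

print(run_with_retry(transfer))  # committed
```

Capping the attempt count matters: a transaction that keeps conflicting is a contention problem to investigate, not something to retry forever.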


How B+-trees, LSM-trees, WAL, and MVCC connect

These four mechanisms are not independent — they work together in every database:

  1. A write arrives. The database writes a WAL entry first (durability guarantee).
  2. The write is applied to the in-memory data structure — a B+-tree buffer page (PostgreSQL) or an LSM memtable (Cassandra).
  3. MVCC creates a new version of the affected row, tagged with the transaction ID. Old versions remain visible to concurrent readers.
  4. Eventually, the in-memory changes are flushed to disk — as B+-tree page writes (checkpoint) or SSTable files (memtable flush).
  5. Vacuum/purge cleans up old MVCC versions that no active transaction can see.
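The pipeline above can be sketched as a toy in-memory engine. Every name is illustrative, and the flush-to-disk step (4) is omitted for brevity:

```python
# Toy write path: WAL append first, then in-memory apply, then a new MVCC version.

wal = []        # append-only log: durability comes from writing here first
memtable = {}   # in-memory structure (B+-tree buffer page or LSM memtable)
versions = {}   # MVCC: key -> list of (txid, value), newest last

def write(txid, key, value):
    wal.append((txid, key, value))                      # 1. log first
    memtable[key] = value                               # 2. apply in memory
    versions.setdefault(key, []).append((txid, value))  # 3. new MVCC version

def read(key, snapshot_txid):
    # A reader sees the newest version committed at or before its snapshot.
    visible = [v for txid, v in versions.get(key, []) if txid <= snapshot_txid]
    return visible[-1] if visible else None

def vacuum(oldest_active_txid):
    # 5. drop versions superseded by one that every active txn can already see.
    for key, chain in versions.items():
        keep = [i for i, (txid, _) in enumerate(chain) if txid <= oldest_active_txid]
        if keep:
            versions[key] = chain[keep[-1]:]

write(1, "k", "v1")
write(2, "k", "v2")
print(read("k", 1))               # v1: the old snapshot still sees the old version
vacuum(oldest_active_txid=2)      # no one needs v1 anymore; reclaim it
print(read("k", 2))               # v2
```

Even this sketch shows why a long-running transaction blocks cleanup: vacuum can only discard versions older than the oldest snapshot still in use.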

Understanding these mechanisms individually is important, but understanding how they compose is what separates a junior engineer ("I know what a B+-tree is") from a senior one ("I know why PostgreSQL's checkpoint causes latency spikes and how to tune it").


Summary: when to use what

| You need... | Use... | Why |
| --- | --- | --- |
| Fast reads, moderate writes | B+-tree (PostgreSQL, MySQL) | O(log_B N) point lookups, sequential range scans |
| Fast writes, tolerable read latency | LSM-tree (Cassandra, RocksDB) | All writes are sequential, no random I/O |
| Crash recovery for any storage engine | WAL | Log-first guarantees atomicity and durability |
| Concurrent readers and writers | MVCC | Readers see snapshots, writers create versions, no blocking |
| Full-text search | GIN index | Inverted index maps tokens to documents |
| Time-series on append-only tables | BRIN index | Min/max per block, tiny index footprint |
| Mixed workload with both reads and writes | B+-tree with tuned buffer pool | Cache hot pages in memory, accept moderate write amplification |
| Write-heavy with periodic batch reads | LSM-tree with leveled compaction | Balance compaction cost against read latency |

In a system design interview, state which storage engine you would choose and why. "We need high write throughput for event ingestion, so I would use Cassandra (LSM-tree) for the event store. For the user profile service, PostgreSQL (B+-tree) because profiles are read-heavy and we need ACID transactions for billing data." This demonstrates that you understand the trade-offs and can match the storage engine to the workload.

The deeper insight is that most production systems use multiple storage engines for different access patterns — a pattern called polyglot persistence. An e-commerce platform might use PostgreSQL for orders (ACID, B+-tree), Cassandra for product activity streams (write-heavy, LSM-tree), Redis for session data (in-memory, no persistence needed), and Elasticsearch for product search (inverted index). Each storage engine excels at exactly one access pattern, and the system architect's job is to route each workload to the right engine.

The cost of polyglot persistence is operational complexity — each storage engine has its own backup strategy, monitoring dashboards, scaling procedures, and failure modes. A team running PostgreSQL, Cassandra, Redis, and Elasticsearch needs expertise in four different systems. This is why startups often start with "PostgreSQL for everything" and only introduce additional storage engines when a specific workload proves that PostgreSQL cannot meet its requirements. Premature polyglot persistence is as dangerous as premature optimization.

The rule of thumb: start with one storage engine that covers 80% of your access patterns (usually PostgreSQL or MySQL). When a specific workload hits a performance wall that cannot be solved with indexing, caching, or query optimization, add a purpose-built storage engine for that workload. Each new engine must justify its operational cost with measurable performance improvement. Document the decision (why this engine, what problem it solves, what workload it handles) in your architecture decision records so future engineers understand the rationale.

In interviews, demonstrating awareness of these internal mechanisms (B+-tree depth calculations, LSM write amplification, WAL fsync behavior, MVCC snapshot visibility) signals deep understanding. You do not need to recite implementation details, but knowing why PostgreSQL uses B+-trees (read-optimized) while Cassandra uses LSM-trees (write-optimized) — and being able to explain the trade-off in terms of I/O patterns — is the level of understanding that distinguishes strong system design candidates.

The strongest signal you can give is connecting the storage engine choice to the system design: "For the notification timeline, I would use Cassandra because timelines are append-heavy (write-optimized, LSM-tree) and reads are sequential (sorted by timestamp within each user's partition). For the user account service, I would use PostgreSQL because account data is read-heavy, requires ACID transactions for balance updates, and benefits from B+-tree indexes on email and phone number for lookup queries."

This connects the abstract knowledge (B+-tree vs. LSM-tree trade-offs) to concrete design decisions (which database for which service), which is exactly what interviewers want to hear. The goal is not to recite textbook definitions but to demonstrate that you can apply storage engine knowledge to real system design problems, choose the right tool for each workload, and explain the trade-offs you are accepting.