System Design Fundamentals
Transaction Isolation Levels and Anomalies
PostgreSQL achieves concurrency through MVCC (Multi-Version Concurrency Control) — readers never block writers, and writers never block readers. Instead of locking rows for reads, PostgreSQL keeps multiple versions of each row and lets each transaction see a consistent snapshot.

How MVCC Works
Every row carries two hidden fields: xmin (the transaction ID that created this version) and xmax (the transaction ID that deleted or replaced it). When you UPDATE a row, PostgreSQL does not overwrite it in-place. It marks the old version with xmax = current_txid and inserts a brand new version with xmin = current_txid. Readers check these fields against their snapshot to decide which version is visible.
Consider a concrete example:
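Suppose an account row with balance 500 is updated (a sketch with a hypothetical table and made-up transaction IDs):

```sql
BEGIN;                                          -- suppose this is txid 105
UPDATE accounts SET balance = 400 WHERE id = 1;
COMMIT;

-- The heap now holds two versions of the row:
--   version 1: xmin = 100, xmax = 105, balance = 500  (old, dead once 105 commits)
--   version 2: xmin = 105, xmax = 0,   balance = 400  (current)
-- A snapshot taken before txid 105 committed still sees version 1.
```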
Both versions coexist on disk until vacuum removes the old one.
This means a long-running report query sees a frozen-in-time snapshot of the database while concurrent transactions continue to insert, update, and delete rows without interference. The trade-off is dead rows — old versions that no transaction needs anymore. PostgreSQL's autovacuum process reclaims these periodically, but long-running transactions prevent cleanup and cause table bloat.
Autovacuum and Dead Tuple Management
Autovacuum runs in the background, scanning tables for dead tuples that are no longer visible to any active snapshot. It marks their space as reusable but does not shrink the table file. Over time, heavy UPDATE/DELETE workloads without adequate vacuuming lead to table bloat — the table file grows even though the live data has not.
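You can watch dead tuples accumulate via the standard `pg_stat_user_tables` statistics view:

```sql
-- Tables with the most dead tuples, and when autovacuum last ran:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```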
If n_dead_tup climbs steadily, autovacuum is not keeping up and you need to tune its frequency or worker count.
Isolation Levels
PostgreSQL supports three isolation levels (it treats Read Uncommitted as Read Committed):
Read Committed (default) — Each statement sees a fresh snapshot. Two SELECTs within the same transaction may return different results if another transaction commits between them. This prevents dirty reads but allows non-repeatable reads and phantoms.
Repeatable Read — The snapshot is taken at the first statement and held for the entire transaction. All reads see the same data regardless of concurrent commits. If a concurrent transaction modifies a row you are about to update, PostgreSQL aborts your transaction with a serialization error rather than silently applying a lost update.
Serializable — Uses Serializable Snapshot Isolation (SSI) to detect conflicts that would violate serial execution order. PostgreSQL tracks read and write dependencies across transactions and aborts one side when it detects a dangerous cycle. SSI works by maintaining two types of locks in memory: SIRead locks (tracking what each transaction has read) and write locks. When PostgreSQL detects a "dangerous structure" — a pattern where T1 reads data that T2 concurrently modifies, and T2 reads data that T1 concurrently modifies — it aborts one transaction. Applications must be prepared to retry aborted transactions.
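A transaction that loses an SSI conflict fails with SQLSTATE 40001. A typical session looks like this (hypothetical accounts table):

```sql
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT SUM(balance) FROM accounts WHERE owner = 'alice';
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;
-- If a concurrent serializable transaction formed a dangerous structure
-- with this one, the COMMIT (or an earlier statement) fails with:
--   ERROR: could not serialize access due to read/write dependencies
--          among transactions   (SQLSTATE 40001)
-- The application should retry the whole transaction from BEGIN.
```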
Write-Ahead Log (WAL)
All changes hit the WAL before reaching data pages. The WAL is an append-only sequential log on disk — far faster than random writes to heap pages. On crash, PostgreSQL replays the WAL to restore committed data. This is why commits are fast (one sequential write) even when the actual table pages have not been flushed yet.
The WAL also serves as the foundation for replication. Streaming replication sends WAL records to standby servers, which replay them to stay in sync. This means the same mechanism that provides crash recovery also provides high availability — a standby can take over with minimal data loss because it has been continuously replaying WAL records.
WAL segment files accumulate on disk until they are archived or recycled. The max_wal_size parameter controls how much WAL accumulates before a checkpoint is triggered. A checkpoint forces all dirty buffers to disk, allowing old WAL segments to be recycled. The tension is clear: frequent checkpoints keep WAL size small but cause I/O spikes. Infrequent checkpoints reduce I/O spikes but increase recovery time after a crash (more WAL to replay).
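The relevant knobs live in postgresql.conf (values here are illustrative, not recommendations):

```
# postgresql.conf — checkpoint tuning
max_wal_size = 4GB                  # WAL allowed to accumulate before a forced checkpoint
checkpoint_timeout = 15min          # also trigger a checkpoint on a timer
checkpoint_completion_target = 0.9  # spread checkpoint I/O across the interval
```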
Practical Implications
Choosing the right isolation level in PostgreSQL depends on your workload pattern:
- OLTP with independent short transactions: Read Committed is sufficient. Each statement sees the latest committed data, and short transactions minimize conflict risk.
- Financial or inventory systems: Repeatable Read prevents lost updates. Two concurrent transactions cannot both read a balance and write a new value — one will be aborted.
- Complex business rules with cross-table invariants: Serializable ensures that concurrent execution produces the same result as some serial ordering. The cost is higher abort rates under contention.
MVCC is fundamentally a space-time trade-off. You trade disk space (multiple row versions, dead tuples, vacuum overhead) for time (zero read locks, no reader-writer blocking). This trade-off explains why long-running transactions are dangerous in PostgreSQL — they pin old row versions alive and prevent vacuum from reclaiming space.
InnoDB's default isolation level is Repeatable Read — one step higher than PostgreSQL's Read Committed default. To achieve this without the trade-offs of full Serializable mode, InnoDB combines MVCC snapshots for reads with next-key locks for writes. This hybrid approach prevents phantoms without the overhead of serializable dependency tracking. Understanding InnoDB's concurrency model matters because MySQL is among the most widely deployed relational databases — chances are high you will encounter it in production.
MVCC and Undo Logs
Like PostgreSQL, InnoDB maintains multiple row versions. But instead of storing old versions inline in the heap, InnoDB keeps them in a separate undo log. Each row in the clustered index contains a pointer to its undo chain. When a reader needs an older version, it follows the chain backward until it finds the version visible to its snapshot.
The snapshot is taken at the first consistent read (SELECT) in the transaction. All subsequent reads within the same transaction see the same snapshot. This is what makes Repeatable Read work — your second SELECT returns the same rows as the first, even if another transaction has committed changes in between.
Next-Key Locks (Gap Locks)
MVCC handles reads, but what about writes? Consider this scenario: Transaction A runs SELECT * FROM orders WHERE price BETWEEN 10 AND 20 FOR UPDATE. Without gap locks, Transaction B could insert a row with price=15 and commit. When A re-reads, a phantom row appears.
InnoDB prevents this with next-key locks — a combination of the row lock and a lock on the gap before the row:
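A minimal two-session sketch (hypothetical orders table with a secondary index on price):

```sql
-- Session A (REPEATABLE READ):
BEGIN;
SELECT * FROM orders WHERE price BETWEEN 10 AND 20 FOR UPDATE;
-- InnoDB takes next-key locks covering the scanned index range.

-- Session B:
INSERT INTO orders (price) VALUES (15);
-- Blocks until session A commits or rolls back: the gap containing 15
-- is locked, so no phantom row can appear if A re-reads.
```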
The gap lock range depends on the index structure, not the query range. If the index has entries at 8, 12, and 22, a range query for 10-20 locks the gaps (8,12], (12,22], which is broader than the query range. This is why gap locks can cause unexpected blocking — a transaction inserting price=21 would also be blocked because it falls within the (12,22] gap.
Gap locks only exist at Repeatable Read and above. Switching to Read Committed disables gap locks, which eliminates phantom prevention but also eliminates the unexpected blocking. This is a common tuning trade-off for high-concurrency systems where phantom reads are acceptable.

Clustered Index Architecture
InnoDB stores rows in primary key order within a B+Tree (the clustered index). Secondary indexes store (secondary_key, primary_key) pairs. A secondary index lookup requires two B+Tree traversals: first the secondary index to find the PK, then the clustered index to find the actual row (a "bookmark lookup"):
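A sketch with a hypothetical users table:

```sql
CREATE TABLE users (
  id    BIGINT PRIMARY KEY,   -- clustered index key; rows stored in id order
  email VARCHAR(255),
  name  VARCHAR(255),
  KEY idx_email (email)       -- secondary index stores (email, id) pairs
);

SELECT name FROM users WHERE email = 'a@example.com';
-- 1) traverse idx_email to find the matching id (the "bookmark")
-- 2) traverse the clustered index by id to fetch the full row

-- A covering index stores name in the index itself and skips step 2:
-- ALTER TABLE users ADD KEY idx_email_name (email, name);
```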
This is why covering indexes (indexes that include all queried columns) are particularly valuable in InnoDB — they avoid the second lookup entirely.
Write Path
Updates follow this path: (1) copy old version to undo log, (2) modify row in buffer pool, (3) append redo log entry, (4) on COMMIT, fsync the redo log. Dirty pages are flushed to disk later by a background thread. The redo log is InnoDB's equivalent of PostgreSQL's WAL — it guarantees durability via sequential writes without waiting for random page flushes.
The undo log serves double duty: it provides old versions for MVCC readers AND enables rollback. If a transaction aborts, InnoDB follows the undo chain to reverse each modification. The purge thread periodically removes undo entries that no active transaction needs. If the purge thread falls behind (due to long-running transactions holding references to old versions), the undo tablespace grows — InnoDB's equivalent of PostgreSQL's table bloat problem.
Buffer Pool
InnoDB's buffer pool is a large in-memory cache of data pages and index pages. Most reads are served from the buffer pool without touching disk. The buffer pool uses an LRU variant (young/old sublists) to balance recently accessed pages against frequently accessed pages. When the buffer pool is properly sized (typically 70-80% of available RAM on a dedicated database server), the vast majority of reads hit memory. Writes also modify pages in the buffer pool first, then flush to disk asynchronously — this is why InnoDB's write performance depends heavily on buffer pool size.
Monitoring Lock Contention
When gap locks cause unexpected blocking or deadlocks, InnoDB provides diagnostic tools:
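Some useful starting points (the sys schema view requires MySQL 5.7+):

```sql
-- Who is waiting on whom right now:
SELECT * FROM sys.innodb_lock_waits;

-- Long-running transactions that may be holding locks:
SELECT trx_id, trx_started, trx_state, trx_query
FROM information_schema.innodb_trx
ORDER BY trx_started;

-- Aggregate counters, deadlock history, and lock wait statistics:
SHOW ENGINE INNODB STATUS\G
```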
High innodb_row_lock_waits in SHOW ENGINE INNODB STATUS indicates contention. If most waits are gap locks, consider whether Read Committed (which disables gap locks) is acceptable for that workload.
In interviews, the key difference between PostgreSQL and InnoDB concurrency is how they prevent phantoms. PostgreSQL's Repeatable Read avoids phantoms through its MVCC snapshot, and its Serializable level uses SSI to detect conflicts after the fact and abort. InnoDB uses gap locks at Repeatable Read to prevent conflicts from happening in the first place. PostgreSQL retries; InnoDB blocks.
MongoDB's concurrency model starts from a fundamentally different assumption: single-document operations are atomic by default, but cross-document consistency is opt-in. This reflects MongoDB's design philosophy of prioritizing write throughput and horizontal scalability over strict multi-document ACID guarantees.
Document-Level Atomicity
Every single-document operation (insert, update, delete) is atomic without any explicit transaction. If you update three fields in one document, either all three change or none do. But if you need to update a document in the orders collection AND a document in the inventory collection as one atomic unit, you need an explicit multi-document transaction (available since MongoDB 4.0 for replica sets, 4.2 for sharded clusters).
This single-document atomicity is more powerful than it first appears. Because MongoDB documents can contain nested subdocuments and arrays, you can embed an entire order with its line items in one document:
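A sketch (hypothetical collection and field names, mongosh syntax):

```js
// One order document embedding its line items and event timeline:
{
  _id: ObjectId("..."),
  status: "processing",
  items: [
    { sku: "A-1", qty: 2, price: 19.99 },
    { sku: "B-7", qty: 1, price: 5.49 }
  ],
  timeline: [{ at: ISODate("..."), event: "created" }],
  shippingAttempts: 0
}

// One atomic update touching three parts of the document:
db.orders.updateOne(
  { _id: orderId },
  {
    $set:  { status: "shipped" },
    $push: { timeline: { at: new Date(), event: "shipped" } },
    $inc:  { shippingAttempts: 1 }
  }
)
```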
Updating the order status, adding a timeline entry, and incrementing the shipping attempt counter are all one atomic operation. The schema design determines whether you need transactions.
WiredTiger MVCC
MongoDB's WiredTiger storage engine uses MVCC with document-level locking. Reads use snapshot timestamps to determine document visibility. Writes hold brief document-level locks (not collection or database locks) and create new versions linked via an undo chain. This means readers and writers on different documents never block each other, and readers on the same document see the version consistent with their snapshot.
WiredTiger maintains an in-memory cache (default 50% of RAM minus 1GB) for frequently accessed documents. Documents are stored in a compressed format on disk and decompressed in cache. The cache uses a combination of eviction policies to manage memory pressure, and checkpoint operations periodically write dirty pages to disk for durability.
Read Concern and Write Concern
Instead of traditional isolation levels, MongoDB uses tunable consistency via readConcern and writeConcern.
readConcern "local" (default) — Returns the most recent data on the node, even if it might be rolled back during a replica set failover. Fast but not durable.
readConcern "majority" — Returns only data acknowledged by a majority of replica set members. Slower but guaranteed to survive failover.
readConcern "snapshot" — Available inside multi-document transactions. Provides snapshot isolation across multiple documents, similar to PostgreSQL's Repeatable Read.
writeConcern controls durability:
w: 1 means acknowledged by the primary. w: "majority" means acknowledged by a majority of replicas. j: true means written to the journal on disk.
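For example, a fully durable write in mongosh (hypothetical collection):

```js
// Acknowledged by a majority of replicas AND journaled to disk:
db.orders.insertOne(
  { sku: "A-1", qty: 2 },
  { writeConcern: { w: "majority", j: true } }
)
```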
The combination of readConcern and writeConcern gives fine-grained control. For example, readConcern: "majority" with writeConcern: { w: "majority" }, used inside a causally consistent session, guarantees you read your own writes even if you are routed to a different replica set member. This is conceptually similar to Cassandra's QUORUM + QUORUM overlap guarantee.
Multi-Document Transactions
Multi-document transactions have a 60-second default time limit and incur performance overhead from snapshot management and conflict detection:
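A transaction sketch in mongosh (database and collection names are hypothetical):

```js
const session = db.getMongo().startSession();
session.startTransaction({
  readConcern:  { level: "snapshot" },   // snapshot isolation across documents
  writeConcern: { w: "majority" }
});
try {
  const orders    = session.getDatabase("shop").orders;
  const inventory = session.getDatabase("shop").inventory;
  orders.insertOne({ sku: "A-1", qty: 2 });
  inventory.updateOne({ sku: "A-1" }, { $inc: { stock: -2 } });
  session.commitTransaction();           // both writes commit, or neither
} catch (e) {
  session.abortTransaction();            // e.g. write conflict: retry the txn
  throw e;
}
```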
If two transactions modify the same document, one is aborted with a write conflict error. Multi-document transactions are not free: beyond the 60-second limit, they consume extra WiredTiger cache for snapshot tracking and abort on write-write conflicts, forcing application-level retries. If you find yourself using transactions frequently, it often signals a schema design issue — MongoDB's own guidance is to embed related data in a single document so that most operations need no transaction at all.
Cassandra throws out the entire isolation level framework. There are no transactions in the traditional sense, no MVCC snapshots, and no row locks. Instead, Cassandra provides partition-scoped atomicity with tunable consistency — a fundamentally different model designed for write-heavy workloads across multiple data centers.
Partition-Scoped Atomicity
A write to a single partition (all rows sharing the same partition key) is atomic. But there is no cross-partition atomicity. If you need to update data in two different partitions as one unit, you cannot — one may succeed while the other fails. This constraint drives Cassandra's data modeling philosophy: design one table per query pattern and denormalize aggressively so each query touches exactly one partition.
Tunable Consistency
Cassandra lets you choose consistency per query via consistency levels (CL):
With a replication factor of 3:
CL ONE — Write acknowledged by one replica, read from one replica. Fastest but may read stale data.
CL QUORUM — Write acknowledged by 2 of 3 replicas, read from 2 of 3. If write CL + read CL > replication factor, you are guaranteed to read your own writes (quorum intersection).
CL ALL — All replicas must respond. Strongest consistency but one unavailable replica blocks the entire operation.
The critical insight is that QUORUM writes plus QUORUM reads guarantee that a read observes the latest acknowledged write to a single partition, without any locking mechanism: the write and read replica sets must overlap, so at least one replica in the read set has the latest version. (This is not full linearizability — for compare-and-swap semantics you need lightweight transactions.)
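The overlap argument can be sketched in a few lines of Python (a toy model of N replicas holding (value, timestamp) pairs — not Cassandra code):

```python
# Toy model: a write at level W updates the first W replicas; a read at
# level R queries the *last* R replicas (the worst case for overlap) and
# keeps the value with the newest timestamp.

def write(replicas, w, value, ts):
    for i in range(w):
        replicas[i] = (value, ts)

def read(replicas, r):
    return max(replicas[-r:], key=lambda vt: vt[1])[0]

N = 3
replicas = [("old", 1) for _ in range(N)]

write(replicas, w=2, value="new", ts=2)  # QUORUM write: 2 of 3
print(read(replicas, r=2))  # "new" — W + R > N forces an overlap
print(read(replicas, r=1))  # "old" — CL ONE may hit the one stale replica
```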
Lightweight Transactions (LWT)
For conditional updates (compare-and-swap), Cassandra offers Lightweight Transactions using Paxos consensus:
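Two canonical CQL sketches (hypothetical tables):

```sql
-- Register a username only if it is not already taken:
INSERT INTO users (username, email)
VALUES ('alice', 'alice@example.com')
IF NOT EXISTS;
-- Returns [applied] = true on success, or false plus the existing row.

-- Compare-and-swap on a non-key column:
UPDATE inventory SET reserved = true
WHERE sku = 'A-1'
IF reserved = false;
```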
This provides serial isolation for one partition but costs roughly 4 round-trips (Paxos prepare/promise, read, propose/accept, commit) instead of 1. Use LWT sparingly — it exists for correctness, not performance.
LWT is useful for exactly one pattern: preventing duplicate writes where idempotency matters. User registration (IF NOT EXISTS), inventory reservation (IF quantity >= requested), and unique constraint enforcement are the canonical use cases. For everything else, the 4x latency penalty is not worth it.
LSM-Tree Write Path
Writes in Cassandra are extremely fast because they are always sequential. The write path: (1) append to the commit log on disk (durability), (2) write to the memtable in memory. When the memtable is full, it flushes to an immutable SSTable on disk. Background compaction merges SSTables and discards tombstones (deleted data markers). This is why Cassandra excels at write-heavy workloads — every write is a sequential append, never a random disk seek.
Read Path
Reads are more expensive. Cassandra must check multiple locations for data:
- Memtable — the in-memory write buffer (always checked first)
- Row cache — optional in-memory cache of frequently read partitions
- Bloom filter — per-SSTable probabilistic filter (eliminates SSTables that definitely do not contain the key)
- Partition index — on-disk index within each remaining SSTable
- Data blocks — actual data from the SSTable
Cassandra merges results from all sources by timestamp (last-write-wins). This is why read performance degrades with many SSTables — compaction is essential to keep read latency low.
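The write and read paths together can be sketched as a toy LSM tree (an illustration of the shape of the mechanism, not Cassandra internals):

```python
# Writes land in an in-memory memtable; a full memtable is frozen into an
# immutable "SSTable"; reads check the memtable first, then SSTables
# newest-first, so the latest version of a key wins.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []          # frozen dicts, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # always an in-memory (sequential) write
        if len(self.memtable) >= self.limit:
            self.sstables.append(dict(self.memtable))  # flush: immutable SSTable
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable wins
            if key in table:
                return table[key]
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2)   # second put fills the memtable -> flush
db.put("a", 99)                  # newer version lives in the memtable
print(db.get("a"))  # 99
```

Real compaction would merge `sstables` and drop shadowed versions; here the newest-first scan plays that role, which is exactly why read latency degrades as SSTables pile up.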
Anti-Entropy and Repair
Since Cassandra uses eventual consistency, replicas can drift apart. Read repair corrects inconsistencies during normal reads — when a coordinator detects that replicas returned different versions, it sends the latest version to the stale replicas. For data that is rarely read, periodic anti-entropy repair (using Merkle trees to compare data across replicas) ensures consistency. Without regular repairs, stale data can persist indefinitely on replicas that missed a write.
The repair process works in three steps: (1) each replica builds a Merkle tree (hash tree) of its data for a given token range, (2) the coordinator compares trees to find partitions where hashes differ, (3) stale replicas stream the correct data from up-to-date replicas. Running repair on a regular schedule (typically weekly) is essential for correctness in production Cassandra clusters.
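The hash-compare step can be illustrated with a toy bucket-hash comparison (a simplification of Merkle trees — real repair compares tree nodes per token range):

```python
import hashlib
import zlib

# Each replica hashes fixed buckets of its key space; buckets whose hashes
# differ mark the ranges that must be streamed from an up-to-date replica.

def bucket_hashes(data, buckets=4):
    hashes = []
    for b in range(buckets):
        items = sorted((k, v) for k, v in data.items()
                       if zlib.crc32(k.encode()) % buckets == b)
        hashes.append(hashlib.sha256(repr(items).encode()).hexdigest())
    return hashes

def stale_buckets(replica_a, replica_b):
    """Bucket indices where the two replicas disagree."""
    return [i for i, (ha, hb)
            in enumerate(zip(bucket_hashes(replica_a), bucket_hashes(replica_b)))
            if ha != hb]

up_to_date = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}
stale      = {"user:1": "alice", "user:2": "bob"}  # missed the user:3 write

print(stale_buckets(up_to_date, stale))  # only the bucket holding user:3 differs
```

Comparing hashes instead of raw data is what makes repair cheap over the network: only the differing ranges are streamed.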

Default Isolation and Concurrency Model
| Database | Default Isolation | Concurrency Mechanism | Phantom Prevention |
| --- | --- | --- | --- |
| PostgreSQL | Read Committed | MVCC (inline versions) | SSI at Serializable |
| MySQL InnoDB | Repeatable Read | MVCC + next-key locks | Gap locks at RR |
| MongoDB | Document-level atomic | MVCC (WiredTiger) | Snapshot readConcern in txns |
| Cassandra | Partition-scoped | Last-write-wins | N/A (no cross-partition reads) |
Write and Recovery Mechanisms
| Database | Write Durability | Recovery Mechanism | Old Version Storage |
| --- | --- | --- | --- |
| PostgreSQL | WAL (append-only log) | WAL replay on crash | Inline (heap) + vacuum |
| MySQL InnoDB | Redo log (append-only) | Redo replay + undo rollback | Undo log (separate tablespace) |
| MongoDB | Journal (WiredTiger) | Journal replay | In-cache undo chain |
| Cassandra | Commit log + memtable | Commit log replay | N/A (last-write-wins) |
Consistency vs Scale Spectrum
PostgreSQL and InnoDB sit on the consistency-first end. Both provide full ACID with rich query capabilities (joins, subqueries, constraints). They scale vertically well and can scale reads horizontally via replicas, but write scaling requires sharding strategies that add complexity.
MongoDB occupies the middle ground. Single-document operations are fast and atomic. Multi-document ACID is available but opt-in and bounded (60-second limit). Horizontal scaling via sharding is built into the platform from the start.
Cassandra sits on the scale-first end. It sacrifices relational features entirely for linear write scalability across data centers. Tunable consistency lets you dial between speed and correctness per query, but you accept eventual consistency as the baseline.
Dead Tuple / Version Management
Every MVCC database must clean up old versions eventually. PostgreSQL uses autovacuum on the heap. InnoDB uses a purge thread on the undo tablespace. MongoDB relies on WiredTiger cache eviction. All three suffer when long-running transactions hold references to old versions — preventing cleanup and increasing memory or disk usage. Cassandra sidesteps this entirely with last-write-wins and compaction of SSTables, but loses the ability to provide snapshot reads.
When to Choose What
PostgreSQL/InnoDB — Correctness is non-negotiable. Financial transactions, inventory management, user authentication. You need joins, constraints, and transactions that span multiple tables.
MongoDB — Your data model is document-shaped (nested, variable-schema). Most operations touch a single document. You need horizontal scalability and can tolerate eventual consistency for non-critical reads with opt-in ACID for critical flows.
Cassandra — Write throughput dominates. Time-series data, activity feeds, IoT sensor data. Multi-datacenter replication is a hard requirement. Each query pattern maps to a single table and you accept no joins.
Polyglot Persistence in Practice
Many production systems use multiple databases, each handling the access pattern it was designed for. A common architecture: PostgreSQL for user accounts and billing (strong consistency, ACID transactions), MongoDB for product catalogs (flexible schemas, document-shaped data), and Cassandra for event logging and analytics (high write throughput, time-series access patterns). The application layer routes each operation to the appropriate database based on its consistency and scalability requirements.
The cost of polyglot persistence is operational complexity — three databases to monitor, backup, upgrade, and tune. Teams must understand the concurrency model of each database to avoid subtle bugs. A developer accustomed to PostgreSQL's strong consistency guarantees may accidentally assume the same guarantees in Cassandra and introduce a race condition that only manifests under high concurrency.
Common Interview Mistakes
Several misconceptions come up repeatedly in system design interviews:
- "MongoDB has no transactions" — This was true before version 4.0. Since 4.0 (replica sets) and 4.2 (sharded clusters), MongoDB supports multi-document ACID transactions with snapshot isolation. The nuance is that they have performance trade-offs and should not be the default.
- "Cassandra is eventually consistent, period" — Cassandra's consistency is tunable per query. QUORUM reads + QUORUM writes guarantee that reads observe the latest acknowledged write for single-partition operations (though not full linearizability without LWT). CL ALL provides the strongest guarantee. The "eventual consistency" label applies when using CL ONE.
- "Higher isolation level = always better" — Higher isolation levels reduce anomalies but increase contention, lock duration, and abort rates. Read Committed handles 90%+ of real-world workloads correctly. Serializable should be reserved for specific invariant-sensitive operations, not applied as a blanket default.
- "PostgreSQL and MySQL handle isolation the same way" — They offer similar isolation level names but the implementations differ fundamentally. PostgreSQL uses SSI for Serializable (optimistic, abort-based). InnoDB uses gap locks for phantom prevention at Repeatable Read (pessimistic, block-based). Comparable phantom protection, different failure modes.
The choice between these databases is not about which is 'better' — it is about which consistency-scalability trade-off matches your workload. PostgreSQL pays for consistency with vertical scaling limits. Cassandra pays for horizontal scale with no cross-partition transactions. MongoDB tries to offer both, with opt-in transactions that add latency and complexity.