System Design Fundamentals
Transaction Isolation Levels and Anomalies
PostgreSQL achieves concurrency through MVCC (Multi-Version Concurrency Control) — readers never block writers, and writers never block readers. Instead of locking rows for reads, PostgreSQL keeps multiple versions of each row and lets each transaction see a consistent snapshot.

How MVCC Works
Every row carries two hidden fields: xmin (the transaction ID that created this version) and xmax (the transaction ID that deleted or replaced it). When you UPDATE a row, PostgreSQL does not overwrite it in-place. It marks the old version with xmax = current_txid and inserts a brand new version with xmin = current_txid. Readers check these fields against their snapshot to decide which version is visible.
Consider a concrete example:
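Suppose an account row with balance 500 is updated (a sketch with a hypothetical table and made-up transaction IDs):

```sql
BEGIN;                                          -- suppose this is txid 105
UPDATE accounts SET balance = 400 WHERE id = 1;
COMMIT;

-- The heap now holds two versions of the row:
--   version 1: xmin = 100, xmax = 105, balance = 500  (old, dead once 105 commits)
--   version 2: xmin = 105, xmax = 0,   balance = 400  (current)
-- A snapshot taken before txid 105 committed still sees version 1.
```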
Both versions coexist on disk until vacuum removes the old one.
This means a long-running report query sees a frozen-in-time snapshot of the database while concurrent transactions continue to insert, update, and delete rows without interference. The trade-off is dead rows — old versions that no transaction needs anymore. PostgreSQL's autovacuum process reclaims these periodically, but long-running transactions prevent cleanup and cause table bloat.
Autovacuum and Dead Tuple Management
Autovacuum runs in the background, scanning tables for dead tuples that are no longer visible to any active snapshot. It marks their space as reusable but does not shrink the table file. Over time, heavy UPDATE/DELETE workloads without adequate vacuuming lead to table bloat — the table file grows even though the live data has not.
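You can watch dead tuples accumulate via the standard `pg_stat_user_tables` statistics view:

```sql
-- Tables with the most dead tuples, and when autovacuum last ran:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```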
If n_dead_tup climbs steadily, autovacuum is not keeping up and you need to tune its frequency or worker count.
Isolation Levels
PostgreSQL supports three isolation levels (it treats Read Uncommitted as Read Committed):
Read Committed (default) — Each statement sees a fresh snapshot. Two SELECTs within the same transaction may return different results if another transaction commits between them. This prevents dirty reads but allows non-repeatable reads and phantoms.
Repeatable Read — The snapshot is taken at the first statement and held for the entire transaction. All reads see the same data regardless of concurrent commits. If a concurrent transaction modifies a row you are about to update, PostgreSQL aborts your transaction with a serialization error rather than silently applying a lost update.
Serializable — Uses Serializable Snapshot Isolation (SSI) to detect conflicts that would violate serial execution order. PostgreSQL tracks read and write dependencies across transactions and aborts one side when it detects a dangerous cycle. SSI works by maintaining two types of locks in memory: SIRead locks (tracking what each transaction has read) and write locks. When PostgreSQL detects a "dangerous structure" — a pattern where T1 reads data that T2 concurrently modifies, and T2 reads data that T1 concurrently modifies — it aborts one transaction. Applications must be prepared to retry aborted transactions.
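A transaction that loses an SSI conflict fails with SQLSTATE 40001. A typical session looks like this (hypothetical accounts table):

```sql
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT SUM(balance) FROM accounts WHERE owner = 'alice';
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;
-- If a concurrent serializable transaction formed a dangerous structure
-- with this one, the COMMIT (or an earlier statement) fails with:
--   ERROR: could not serialize access due to read/write dependencies
--          among transactions   (SQLSTATE 40001)
-- The application should retry the whole transaction from BEGIN.
```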
Write-Ahead Log (WAL)
All changes hit the WAL before reaching data pages. The WAL is an append-only sequential log on disk — far faster than random writes to heap pages. On crash, PostgreSQL replays the WAL to restore committed data. This is why commits are fast (one sequential write) even when the actual table pages have not been flushed yet.
The WAL also serves as the foundation for replication. Streaming replication sends WAL records to standby servers, which replay them to stay in sync. This means the same mechanism that provides crash recovery also provides high availability — a standby can take over with minimal data loss because it has been continuously replaying WAL records.
WAL segment files accumulate on disk until they are archived or recycled. The max_wal_size parameter controls how much WAL accumulates before a checkpoint is triggered. A checkpoint forces all dirty buffers to disk, allowing old WAL segments to be recycled. The tension is clear: frequent checkpoints keep WAL size small but cause I/O spikes. Infrequent checkpoints reduce I/O spikes but increase recovery time after a crash (more WAL to replay).
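The relevant knobs live in postgresql.conf (values here are illustrative, not recommendations):

```
# postgresql.conf — checkpoint tuning
max_wal_size = 4GB                  # WAL allowed to accumulate before a forced checkpoint
checkpoint_timeout = 15min          # also trigger a checkpoint on a timer
checkpoint_completion_target = 0.9  # spread checkpoint I/O across the interval
```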
Practical Implications
Choosing the right isolation level in PostgreSQL depends on your workload pattern:
- OLTP with independent short transactions: Read Committed is sufficient. Each statement sees the latest committed data, and short transactions minimize conflict risk.
- Financial or inventory systems: Repeatable Read prevents lost updates. Two concurrent transactions cannot both read a balance and write a new value — one will be aborted.
- Complex business rules with cross-table invariants: Serializable ensures that concurrent execution produces the same result as some serial ordering. The cost is higher abort rates under contention.
MVCC is fundamentally a space-time trade-off. You trade disk space (multiple row versions, dead tuples, vacuum overhead) for time (zero read locks, no reader-writer blocking). This trade-off explains why long-running transactions are dangerous in PostgreSQL — they pin old row versions alive and prevent vacuum from reclaiming space.
InnoDB's default isolation level is Repeatable Read — one step higher than PostgreSQL's Read Committed default. To achieve this without the trade-offs of full Serializable mode, InnoDB combines MVCC snapshots for reads with next-key locks for writes. This hybrid approach prevents phantoms without the overhead of serializable dependency tracking. Understanding InnoDB's concurrency model matters because MySQL is among the most widely deployed relational databases — chances are high you will encounter it in production.
MVCC and Undo Logs
Like PostgreSQL, InnoDB maintains multiple row versions. But instead of storing old versions inline in the heap, InnoDB keeps them in a separate undo log. Each row in the clustered index contains a pointer to its undo chain. When a reader needs an older version, it follows the chain backward until it finds the version visible to its snapshot.
The snapshot is taken at the first consistent read (SELECT) in the transaction. All subsequent reads within the same transaction see the same snapshot. This is what makes Repeatable Read work — your second SELECT returns the same rows as the first, even if another transaction has committed changes in between.
Next-Key Locks (Gap Locks)
MVCC handles reads, but what about writes? Consider this scenario: Transaction A runs SELECT * FROM orders WHERE price BETWEEN 10 AND 20 FOR UPDATE. Without gap locks, Transaction B could insert a row with price=15 and commit. When A re-reads, a phantom row appears.
InnoDB prevents this with next-key locks — a combination of the row lock and a lock on the gap before the row:
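A minimal two-session sketch (hypothetical orders table with a secondary index on price):

```sql
-- Session A (REPEATABLE READ):
BEGIN;
SELECT * FROM orders WHERE price BETWEEN 10 AND 20 FOR UPDATE;
-- InnoDB takes next-key locks covering the scanned index range.

-- Session B:
INSERT INTO orders (price) VALUES (15);
-- Blocks until session A commits or rolls back: the gap containing 15
-- is locked, so no phantom row can appear if A re-reads.
```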
The gap lock range depends on the index structure, not the query range. If the index has entries at 8, 12, and 22, a range query for 10-20 locks the gaps (8,12], (12,22], which is broader than the query range. This is why gap locks can cause unexpected blocking — a transaction inserting price=21 would also be blocked because it falls within the (12,22] gap.
Gap locks only exist at Repeatable Read and above. Switching to Read Committed disables gap locks, which eliminates phantom prevention but also eliminates the unexpected blocking. This is a common tuning trade-off for high-concurrency systems where phantom reads are acceptable.

Clustered Index Architecture
InnoDB stores rows in primary key order within a B+Tree (the clustered index). Secondary indexes store (secondary_key, primary_key) pairs. A secondary index lookup requires two B+Tree traversals: first the secondary index to find the PK, then the clustered index to find the actual row (a "bookmark lookup"):
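A sketch with a hypothetical users table:

```sql
CREATE TABLE users (
  id    BIGINT PRIMARY KEY,   -- clustered index key; rows stored in id order
  email VARCHAR(255),
  name  VARCHAR(255),
  KEY idx_email (email)       -- secondary index stores (email, id) pairs
);

SELECT name FROM users WHERE email = 'a@example.com';
-- 1) traverse idx_email to find the matching id (the "bookmark")
-- 2) traverse the clustered index by id to fetch the full row

-- A covering index stores name in the index itself and skips step 2:
-- ALTER TABLE users ADD KEY idx_email_name (email, name);
```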
This is why covering indexes (indexes that include all queried columns) are particularly valuable in InnoDB — they avoid the second lookup entirely.
Write Path
Updates follow this path: (1) copy old version to undo log, (2) modify row in buffer pool, (3) append redo log entry, (4) on COMMIT, fsync the redo log. Dirty pages are flushed to disk later by a background thread. The redo log is InnoDB's equivalent of PostgreSQL's WAL — it guarantees durability via sequential writes without waiting for random page flushes.
The undo log serves double duty: it provides old versions for MVCC readers AND enables rollback. If a transaction aborts, InnoDB follows the undo chain to reverse each modification. The purge thread periodically removes undo entries that no active transaction needs. If the purge thread falls behind (due to long-running transactions holding references to old versions), the undo tablespace grows — InnoDB's equivalent of PostgreSQL's table bloat problem.
Buffer Pool
InnoDB's buffer pool is a large in-memory cache of data pages and index pages. Most reads are served from the buffer pool without touching disk. The buffer pool uses an LRU variant (young/old sublists) to balance recently accessed pages against frequently accessed pages. When the buffer pool is properly sized (typically 70-80% of available RAM on a dedicated database server), the vast majority of reads hit memory. Writes also modify pages in the buffer pool first, then flush to disk asynchronously — this is why InnoDB's write performance depends heavily on buffer pool size.
Monitoring Lock Contention
When gap locks cause unexpected blocking or deadlocks, InnoDB provides diagnostic tools:
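Some useful starting points (the sys schema view requires MySQL 5.7+):

```sql
-- Who is waiting on whom right now:
SELECT * FROM sys.innodb_lock_waits;

-- Long-running transactions that may be holding locks:
SELECT trx_id, trx_started, trx_state, trx_query
FROM information_schema.innodb_trx
ORDER BY trx_started;

-- Aggregate counters, deadlock history, and lock wait statistics:
SHOW ENGINE INNODB STATUS\G
```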
High innodb_row_lock_waits in SHOW ENGINE INNODB STATUS indicates contention. If most waits are gap locks, consider whether Read Committed (which disables gap locks) is acceptable for that workload.
In interviews, the key difference between PostgreSQL and InnoDB concurrency is how they prevent phantoms. PostgreSQL's Repeatable Read avoids phantoms through its MVCC snapshot, and its Serializable level uses SSI to detect conflicts after the fact and abort. InnoDB uses gap locks at Repeatable Read to prevent conflicts from happening in the first place. PostgreSQL retries; InnoDB blocks.
MongoDB's concurrency model starts from a fundamentally different assumption: single-document operations are atomic by default, but cross-document consistency is opt-in. This reflects MongoDB's design philosophy of prioritizing write throughput and horizontal scalability over strict multi-document ACID guarantees.
Document-Level Atomicity
Every single-document operation (insert, update, delete) is atomic without any explicit transaction. If you update three fields in one document, either all three change or none do. But if you need to update a document in the orders collection AND a document in the inventory collection as one atomic unit, you need an explicit multi-document transaction (available since MongoDB 4.0 for replica sets, 4.2 for sharded clusters).
This single-document atomicity is more powerful than it first appears. Because MongoDB documents can contain nested subdocuments and arrays, you can embed an entire order with its line items in one document:
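A sketch (hypothetical collection and field names, mongosh syntax):

```js
// One order document embedding its line items and event timeline:
{
  _id: ObjectId("..."),
  status: "processing",
  items: [
    { sku: "A-1", qty: 2, price: 19.99 },
    { sku: "B-7", qty: 1, price: 5.49 }
  ],
  timeline: [{ at: ISODate("..."), event: "created" }],
  shippingAttempts: 0
}

// One atomic update touching three parts of the document:
db.orders.updateOne(
  { _id: orderId },
  {
    $set:  { status: "shipped" },
    $push: { timeline: { at: new Date(), event: "shipped" } },
    $inc:  { shippingAttempts: 1 }
  }
)
```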
Updating the order status, adding a timeline entry, and incrementing the shipping attempt counter are all one atomic operation. The schema design determines whether you need transactions.
WiredTiger MVCC
MongoDB's WiredTiger storage engine uses MVCC with document-level locking. Reads use snapshot timestamps to determine document visibility. Writes hold brief document-level locks (not collection or database locks) and create new versions linked via an undo chain. This means readers and writers on different documents never block each other, and readers on the same document see the version consistent with their snapshot.
WiredTiger maintains an in-memory cache (default 50% of RAM minus 1GB) for frequently accessed documents. Documents are stored in a compressed format on disk and decompressed in cache. The cache uses a combination of eviction policies to manage memory pressure, and checkpoint operations periodically write dirty pages to disk for durability.
Read Concern and Write Concern
Instead of traditional isolation levels, MongoDB uses tunable consistency via readConcern and writeConcern.
readConcern "local" (default) — Returns the most recent data on the node, even if it might be rolled back during a replica set failover. Fast but not durable.
readConcern "majority" — Returns only data acknowledged by a majority of replica set members. Slower but guaranteed to survive failover.
readConcern "snapshot" — Available inside multi-document transactions. Provides snapshot isolation across multiple documents, similar to PostgreSQL's Repeatable Read.
writeConcern controls durability:
w: 1 means acknowledged by the primary. w: "majority" means acknowledged by a majority of replicas. j: true means written to the journal on disk.
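For example, a fully durable write in mongosh (hypothetical collection):

```js
// Acknowledged by a majority of replicas AND journaled to disk:
db.orders.insertOne(
  { sku: "A-1", qty: 2 },
  { writeConcern: { w: "majority", j: true } }
)
```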
The combination of readConcern and writeConcern gives fine-grained control. For example, readConcern: "majority" with writeConcern: { w: "majority" }, used inside a causally consistent session, guarantees you read your own writes even if you are routed to a different replica set member. This is conceptually similar to Cassandra's QUORUM + QUORUM overlap guarantee.
Multi-Document Transactions
Multi-document transactions have a 60-second default time limit and incur performance overhead from snapshot management and conflict detection:
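A transaction sketch in mongosh (database and collection names are hypothetical):

```js
const session = db.getMongo().startSession();
session.startTransaction({
  readConcern:  { level: "snapshot" },   // snapshot isolation across documents
  writeConcern: { w: "majority" }
});
try {
  const orders    = session.getDatabase("shop").orders;
  const inventory = session.getDatabase("shop").inventory;
  orders.insertOne({ sku: "A-1", qty: 2 });
  inventory.updateOne({ sku: "A-1" }, { $inc: { stock: -2 } });
  session.commitTransaction();           // both writes commit, or neither
} catch (e) {
  session.abortTransaction();            // e.g. write conflict: retry the txn
  throw e;
}
```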
If two transactions modify the same document, one is aborted with a write conflict error. Multi-document transactions are not free: beyond the 60-second limit, they consume extra WiredTiger cache for snapshot tracking and abort on write-write conflicts, forcing application-level retries. If you find yourself using transactions frequently, it often signals a schema design issue — MongoDB's own guidance is to embed related data in a single document so that most operations need no transaction at all.
Cassandra throws out the entire isolation level framework. There are no transactions in the traditional sense, no MVCC snapshots, and no row locks. Instead, Cassandra provides partition-scoped atomicity with tunable consistency — a fundamentally different model designed for write-heavy workloads across multiple data centers.
Partition-Scoped Atomicity
A write to a single partition (all rows sharing the same partition key) is atomic. But there is no cross-partition atomicity. If you need to update data in two different partitions as one unit, you cannot — one may succeed while the other fails. This constraint drives Cassandra's data modeling philosophy: design one table per query pattern and denormalize aggressively so each query touches exactly one partition.
Tunable Consistency
Cassandra lets you choose consistency per query via consistency levels (CL):
With a replication factor of 3:
CL ONE — Write acknowledged by one replica, read from one replica. Fastest but may read stale data.
CL QUORUM — Write acknowledged by 2 of 3 replicas, read from 2 of 3. If write CL + read CL > replication factor, you are guaranteed to read your own writes (quorum intersection).
CL ALL — All replicas must respond. Strongest consistency but one unavailable replica blocks the entire operation.
The critical insight is that QUORUM writes plus QUORUM reads guarantee that a read observes the latest acknowledged write to a single partition, without any locking mechanism: the write and read replica sets must overlap, so at least one replica in the read set has the latest version. (This is not full linearizability — for compare-and-swap semantics you need lightweight transactions.)
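The overlap argument can be sketched in a few lines of Python (a toy model of N replicas holding (value, timestamp) pairs — not Cassandra code):

```python
# Toy model: a write at level W updates the first W replicas; a read at
# level R queries the *last* R replicas (the worst case for overlap) and
# keeps the value with the newest timestamp.

def write(replicas, w, value, ts):
    for i in range(w):
        replicas[i] = (value, ts)

def read(replicas, r):
    return max(replicas[-r:], key=lambda vt: vt[1])[0]

N = 3
replicas = [("old", 1) for _ in range(N)]

write(replicas, w=2, value="new", ts=2)  # QUORUM write: 2 of 3
print(read(replicas, r=2))  # "new" — W + R > N forces an overlap
print(read(replicas, r=1))  # "old" — CL ONE may hit the one stale replica
```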
Lightweight Transactions (LWT)
For conditional updates (compare-and-swap), Cassandra offers Lightweight Transactions using Paxos consensus:
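Two canonical CQL sketches (hypothetical tables):

```sql
-- Register a username only if it is not already taken:
INSERT INTO users (username, email)
VALUES ('alice', 'alice@example.com')
IF NOT EXISTS;
-- Returns [applied] = true on success, or false plus the existing row.

-- Compare-and-swap on a non-key column:
UPDATE inventory SET reserved = true
WHERE sku = 'A-1'
IF reserved = false;
```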
This provides serial isolation for one partition but costs roughly 4 round-trips (Paxos prepare/promise, read, propose/accept, commit) instead of 1. Use LWT sparingly — it exists for correctness, not performance.
LWT is useful for exactly one pattern: preventing duplicate writes where idempotency matters. User registration (IF NOT EXISTS), inventory reservation (IF quantity >= requested), and unique constraint enforcement are the canonical use cases. For everything else, the 4x latency penalty is not worth it.
LSM-Tree Write Path
Writes in Cassandra are extremely fast because they are always sequential. The write path: (1) append to the commit log on disk (durability), (2) write to the memtable in memory. When the memtable is full, it flushes to an immutable SSTable on disk. Background compaction merges SSTables and discards tombstones (deleted data markers). This is why Cassandra excels at write-heavy workloads — every write is a sequential append, never a random disk seek.
Read Path
Reads are more expensive. Cassandra must check multiple locations for data:
- Memtable — the in-memory write buffer (always checked first)
- Row cache — optional in-memory cache of frequently read partitions
- Bloom filter — per-SSTable probabilistic filter (eliminates SSTables that definitely do not contain the key)
- Partition index — on-disk index within each remaining SSTable
- Data blocks — actual data from the SSTable
Cassandra merges results from all sources by timestamp (last-write-wins). This is why read performance degrades with many SSTables — compaction is essential to keep read latency low.
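The write and read paths together can be sketched as a toy LSM tree (an illustration of the shape of the mechanism, not Cassandra internals):

```python
# Writes land in an in-memory memtable; a full memtable is frozen into an
# immutable "SSTable"; reads check the memtable first, then SSTables
# newest-first, so the latest version of a key wins.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []          # frozen dicts, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # always an in-memory (sequential) write
        if len(self.memtable) >= self.limit:
            self.sstables.append(dict(self.memtable))  # flush: immutable SSTable
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable wins
            if key in table:
                return table[key]
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2)   # second put fills the memtable -> flush
db.put("a", 99)                  # newer version lives in the memtable
print(db.get("a"))  # 99
```

Real compaction would merge `sstables` and drop shadowed versions; here the newest-first scan plays that role, which is exactly why read latency degrades as SSTables pile up.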
Anti-Entropy and Repair
Since Cassandra uses eventual consistency, replicas can drift apart. Read repair corrects inconsistencies during normal reads — when a coordinator detects that replicas returned different versions, it sends the latest version to the stale replicas. For data that is rarely read, periodic anti-entropy repair (using Merkle trees to compare data across replicas) ensures consistency. Without regular repairs, stale data can persist indefinitely on replicas that missed a write.
The repair process works in three steps: (1) each replica builds a Merkle tree (hash tree) of its data for a given token range, (2) the coordinator compares trees to find partitions where hashes differ, (3) stale replicas stream the correct data from up-to-date replicas. Running repair on a regular schedule (typically weekly) is essential for correctness in production Cassandra clusters.
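The hash-compare step can be illustrated with a toy bucket-hash comparison (a simplification of Merkle trees — real repair compares tree nodes per token range):

```python
import hashlib
import zlib

# Each replica hashes fixed buckets of its key space; buckets whose hashes
# differ mark the ranges that must be streamed from an up-to-date replica.

def bucket_hashes(data, buckets=4):
    hashes = []
    for b in range(buckets):
        items = sorted((k, v) for k, v in data.items()
                       if zlib.crc32(k.encode()) % buckets == b)
        hashes.append(hashlib.sha256(repr(items).encode()).hexdigest())
    return hashes

def stale_buckets(replica_a, replica_b):
    """Bucket indices where the two replicas disagree."""
    return [i for i, (ha, hb)
            in enumerate(zip(bucket_hashes(replica_a), bucket_hashes(replica_b)))
            if ha != hb]

up_to_date = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}
stale      = {"user:1": "alice", "user:2": "bob"}  # missed the user:3 write

print(stale_buckets(up_to_date, stale))  # only the bucket holding user:3 differs
```

Comparing hashes instead of raw data is what makes repair cheap over the network: only the differing ranges are streamed.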

Default Isolation and Concurrency Model
| Database | Default Isolation | Concurrency Mechanism | Phantom Prevention |
| --- | --- | --- | --- |
| PostgreSQL | Read Committed | MVCC (inline versions) | SSI at Serializable |
| MySQL InnoDB | Repeatable Read | MVCC + next-key locks | Gap locks at RR |
| MongoDB | Document-level atomic | MVCC (WiredTiger) | Snapshot readConcern in txns |
| Cassandra | Partition-scoped | Last-write-wins | N/A (no cross-partition reads) |
Write and Recovery Mechanisms
| Database | Write Durability | Recovery Mechanism | Old Version Storage |
| --- | --- | --- | --- |
| PostgreSQL | WAL (append-only log) | WAL replay on crash | Inline (heap) + vacuum |
| MySQL InnoDB | Redo log (append-only) | Redo replay + undo rollback | Undo log (separate tablespace) |
| MongoDB | Journal (WiredTiger) | Journal replay | In-cache undo chain |
| Cassandra | Commit log + memtable | Commit log replay | N/A (last-write-wins) |
Consistency vs Scale Spectrum
PostgreSQL and InnoDB sit on the consistency-first end. Both provide full ACID with rich query capabilities (joins, subqueries, constraints). They scale vertically well and can scale reads horizontally via replicas, but write scaling requires sharding strategies that add complexity.
MongoDB occupies the middle ground. Single-document operations are fast and atomic. Multi-document ACID is available but opt-in and bounded (60-second limit). Horizontal scaling via sharding is built into the platform from the start.
Cassandra sits on the scale-first end. It sacrifices relational features entirely for linear write scalability across data centers. Tunable consistency lets you dial between speed and correctness per query, but you accept eventual consistency as the baseline.
Dead Tuple / Version Management
Every MVCC database must clean up old versions eventually. PostgreSQL uses autovacuum on the heap. InnoDB uses a purge thread on the undo tablespace. MongoDB relies on WiredTiger cache eviction. All three suffer when long-running transactions hold references to old versions — preventing cleanup and increasing memory or disk usage. Cassandra sidesteps this entirely with last-write-wins and compaction of SSTables, but loses the ability to provide snapshot reads.
When to Choose What
PostgreSQL/InnoDB — Correctness is non-negotiable. Financial transactions, inventory management, user authentication. You need joins, constraints, and transactions that span multiple tables.
MongoDB — Your data model is document-shaped (nested, variable-schema). Most operations touch a single document. You need horizontal scalability and can tolerate eventual consistency for non-critical reads with opt-in ACID for critical flows.
Cassandra — Write throughput dominates. Time-series data, activity feeds, IoT sensor data. Multi-datacenter replication is a hard requirement. Each query pattern maps to a single table and you accept no joins.
Polyglot Persistence in Practice
Many production systems use multiple databases, each handling the access pattern it was designed for. A common architecture: PostgreSQL for user accounts and billing (strong consistency, ACID transactions), MongoDB for product catalogs (flexible schemas, document-shaped data), and Cassandra for event logging and analytics (high write throughput, time-series access patterns). The application layer routes each operation to the appropriate database based on its consistency and scalability requirements.
The cost of polyglot persistence is operational complexity — three databases to monitor, backup, upgrade, and tune. Teams must understand the concurrency model of each database to avoid subtle bugs. A developer accustomed to PostgreSQL's strong consistency guarantees may accidentally assume the same guarantees in Cassandra and introduce a race condition that only manifests under high concurrency.
Common Interview Mistakes
Several misconceptions come up repeatedly in system design interviews:
- "MongoDB has no transactions" — This was true before version 4.0. Since 4.0 (replica sets) and 4.2 (sharded clusters), MongoDB supports multi-document ACID transactions with snapshot isolation. The nuance is that they have performance trade-offs and should not be the default.
- "Cassandra is eventually consistent, period" — Cassandra's consistency is tunable per query. QUORUM reads + QUORUM writes guarantee that reads observe the latest acknowledged write for single-partition operations (though not full linearizability without LWT). CL ALL provides the strongest guarantee. The "eventual consistency" label applies when using CL ONE.
- "Higher isolation level = always better" — Higher isolation levels reduce anomalies but increase contention, lock duration, and abort rates. Read Committed handles 90%+ of real-world workloads correctly. Serializable should be reserved for specific invariant-sensitive operations, not applied as a blanket default.
- "PostgreSQL and MySQL handle isolation the same way" — They offer similar isolation level names but the implementations differ fundamentally. PostgreSQL uses SSI for Serializable (optimistic, abort-based). InnoDB uses gap locks for phantom prevention at Repeatable Read (pessimistic, block-based). Comparable phantom protection, different failure modes.
The choice between these databases is not about which is 'better' — it is about which consistency-scalability trade-off matches your workload. PostgreSQL pays for consistency with vertical scaling limits. Cassandra pays for horizontal scale with no cross-partition transactions. MongoDB tries to offer both, with opt-in transactions that add latency and complexity.