System Design Fundamentals
Consistency Models in Distributed Systems
On a single-server database, consistency is invisible. You write a value, you read it back, and you get what you wrote. There is exactly one copy of the data, so there is no ambiguity about which version is "current."
Distributed systems break this simplicity. The moment you replicate data across multiple nodes — for availability, fault tolerance, or read performance — you introduce a fundamental question: when a client reads data, which version does it see? The answer depends on the consistency model the system implements.
A consistency model is a contract between the distributed system and its clients. It defines the rules about when writes become visible to reads. Strong consistency promises that every read sees the most recent write. Eventual consistency promises only that all replicas will converge eventually — but a read right after a write might return stale data.
The consistency model you choose shapes your entire architecture. A banking system that shows different balances on different ATMs has a correctness problem. A social media feed that shows a post 2 seconds late has a user experience problem — but not a correctness problem. Different applications tolerate different levels of staleness, and the consistency model must match the application's requirements.
The Replication Lag Problem
Consider a system with one primary and two replicas. A client writes "balance = $500" to the primary. The primary acknowledges the write and begins replicating to the replicas. If a second client reads from a replica before replication completes, it sees the old balance — say $600. This is replication lag, and it is the root cause of every consistency challenge in distributed systems.
The question is not whether replication lag exists — it always does, even if measured in microseconds. The question is what guarantees the system provides about how reads behave during the lag window.
Why Multiple Models Exist
Strong consistency eliminates the lag problem, but at a cost: every write must wait for replicas (all of them, or at least a quorum) to acknowledge before returning to the client. This adds latency (often 10-100ms per write depending on network distance) and reduces availability (if too many replicas are down, the write blocks or fails).
Eventual consistency eliminates the latency cost but exposes stale reads. Between these extremes, several intermediate models offer different trade-offs: session consistency guarantees that a single client sees its own writes, causal consistency preserves cause-and-effect ordering, and tunable consistency lets you choose per-query.
The Consistency Spectrum
From weakest to strongest, the main consistency models are:
- Eventual consistency — all replicas converge eventually, but reads may return stale data during the convergence window. No ordering guarantees between different clients.
- Monotonic reads — once a client sees a value, subsequent reads never return an older value. Prevents the "time travel" effect where refreshing a page shows older data.
- Read-your-writes — a client always sees its own writes on subsequent reads. Other clients may still see stale data.
- Causal consistency — operations that are causally related are seen in order by all clients. A reply to a post is always seen after the post itself.
- Linearizability — the strongest guarantee. Every read returns the most recent write. The system behaves as if there is a single copy of the data.
Each level adds a guarantee and a cost. The key skill in system design is choosing the weakest model that satisfies your application's correctness requirements — nothing stronger.
Consistency is not a binary choice between 'consistent' and 'inconsistent.' It is a spectrum of guarantees about read/write ordering. The right model depends on what your application can tolerate: a banking system needs strong consistency for balances, but eventual consistency is fine for its transaction history search index.
Strong consistency means the system behaves as if there is a single copy of the data. After a write completes, every subsequent read — from any client, on any node — returns that write's value or a later one. No client ever sees a value that has been superseded.

Linearizability — The Strongest Guarantee
Linearizability is the formal name for the strongest single-object consistency model. It requires that every operation appears to take effect at a single instant between its start and end time. Once a write is acknowledged, no read anywhere in the system can return an older value.
The key constraint is the "real-time ordering" requirement. If operation A completes before operation B starts (by wall-clock time), then B must see A's effect. This eliminates any window where stale data is visible.
How Strong Consistency Works
Two main mechanisms achieve strong consistency:
Synchronous replication — The primary waits for all replicas (or a quorum) to acknowledge the write before responding to the client. This guarantees that any replica can serve a consistent read. The cost is that write latency includes the slowest replica's network round-trip.
Consensus protocols (Raft, Paxos) — A leader proposes a write, and a majority of nodes must accept it before the write is committed. Raft is used by etcd, CockroachDB, and TiKV. Paxos variants power Google Spanner and Chubby.
After the leader receives a majority of ACKs, the write is committed. A follower may hold the entry without yet knowing it is committed, so linearizable reads are typically routed through the leader or guarded by a read-index check.
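The synchronous quorum-replication mechanism above can be sketched in miniature. This is an illustrative in-memory model with invented names, not a real replication or consensus library; real systems add logs, leader election, and failure detection:

```python
class QuorumStore:
    """Toy majority-quorum store: replicas are in-memory dicts, not networked nodes."""

    def __init__(self, n):
        self.majority = n // 2 + 1
        self.replicas = [dict() for _ in range(n)]  # key -> (version, value)
        self.version = 0

    def write(self, key, value, reachable):
        # 'reachable' lists the replica indexes that currently answer.
        # The write commits only if a majority acknowledges it.
        if len(reachable) < self.majority:
            raise RuntimeError("cannot commit: majority unreachable")
        self.version += 1
        for i in reachable:
            self.replicas[i][key] = (self.version, value)

    def read(self, key, reachable):
        # A majority read always overlaps the write majority, so the
        # highest version among the answers is the committed value.
        if len(reachable) < self.majority:
            raise RuntimeError("cannot read: majority unreachable")
        answers = [self.replicas[i].get(key, (0, None)) for i in reachable]
        return max(answers)[1]
```

Note that a write reaching only a minority fails outright rather than returning stale-risk success: this is the consistency-over-availability choice in action.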
Linearizability vs. Serializability
These terms are often confused. They are different guarantees:
Linearizability is a single-object, real-time ordering guarantee. It says "reads and writes to this one key behave as if there is one copy." It does not say anything about transactions spanning multiple keys.
Serializability is a multi-object, transaction-level guarantee. It says "concurrent transactions produce the same result as if they ran one at a time." It does not require real-time ordering — a serializable database can reorder transactions as long as the result is equivalent to some serial order.
Strict serializability combines both: transactions are serializable AND respect real-time ordering. Google Spanner provides it (under the name external consistency). CockroachDB provides serializability plus per-key linearizability, which is close to, but not formally, strict serializability. Strict serializability is the strongest possible guarantee for multi-key operations.
Linearizability Violation Example
Consider an online auction where the current highest bid is stored in a replicated database. Client A reads the highest bid from Replica 1 and sees $100. Client B submits a $150 bid, which is committed to the primary and replicated to Replica 2 but not yet to Replica 1. Client C now reads from Replica 1 and also sees $100. Client C, believing $100 is the current highest bid, submits a $110 bid — which is lower than the actual highest bid of $150.
In a linearizable system, this cannot happen. Client C's read started after Client B's write completed, so Client C must see $150 (or a higher value). The linearizability guarantee ensures that once a write is committed and acknowledged, no subsequent read returns a pre-write value. The auction system needs linearizability to prevent users from making decisions based on outdated information.
Without linearizability, the system might accept Client C's $110 bid as valid (because Replica 1 says the highest bid is $100), then discover later that the actual highest bid was $150. This creates a confusing user experience and potential disputes.
Google Spanner and TrueTime
Google Spanner achieves strict serializability across data centers using TrueTime — a clock API that returns a time interval with bounded uncertainty rather than a single timestamp. Before a transaction's commit becomes visible, Spanner waits out the uncertainty interval, guaranteeing that the commit timestamp is already in the past on every node's clock. This "commit wait" is typically a few milliseconds, on the order of the uncertainty bound (which Google keeps under roughly 7ms). The result: every transaction gets a globally meaningful timestamp, enabling consistent reads across data centers without coordination at read time.
This approach trades write latency (the commit wait) for read efficiency. Once a transaction is committed, any node can serve reads at that timestamp without contacting other nodes. The bounded clock uncertainty is maintained by GPS receivers and atomic clocks in every data center.
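Commit wait can be simulated in a few lines. Here truetime_now is an invented stand-in for the real TrueTime API (which derives its uncertainty bound from GPS receivers and atomic clocks); the logic only illustrates the idea of waiting out a clock-uncertainty interval:

```python
import time

def truetime_now(uncertainty_ms=7.0):
    # Invented TrueTime-style API: returns an [earliest, latest] interval
    # instead of a single instant, with a fixed uncertainty bound.
    now_ms = time.time() * 1000.0
    half = uncertainty_ms / 2.0
    return (now_ms - half, now_ms + half)

def commit_with_wait(uncertainty_ms=7.0):
    # Choose the latest plausible current time as the commit timestamp...
    _, commit_ts = truetime_now(uncertainty_ms)
    # ...then block until every clock in the fleet must agree that the
    # timestamp is in the past. This blocking is the "commit wait".
    while truetime_now(uncertainty_ms)[0] <= commit_ts:
        time.sleep(0.001)
    return commit_ts
```

The smaller the uncertainty bound, the shorter the wait, which is why Spanner invests in specialized clock hardware.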
The Cost of Strong Consistency
Strong consistency has three costs:
- Latency — Every write waits for replica acknowledgment. In a single-region deployment, this adds 1-5ms. Across regions (US East to EU West), it adds 80-150ms per write
- Availability — If a majority of nodes are unreachable (network partition, multi-node failure), the system cannot accept writes. It chooses consistency over availability
- Throughput — Consensus requires coordination on every write, limiting maximum write throughput compared to asynchronous replication
When to Use Strong Consistency
Use strong consistency when the cost of a stale read is higher than the cost of write latency:
- Financial transactions (account balances, transfers)
- Inventory management (prevent overselling)
- Distributed locks and leader election (exactly one leader)
- Configuration stores (all nodes must see the same config)
- User authentication state (logout must be immediate everywhere)
Real-World Latency Benchmarks
Understanding concrete numbers helps calibrate when strong consistency is practical:
- Same rack: 0.1-0.5ms write overhead (etcd cluster on dedicated hardware)
- Same region, different AZs: 1-5ms overhead (CockroachDB multi-AZ)
- US East to US West: 60-80ms overhead (Spanner cross-region)
- US East to EU West: 80-120ms overhead (Spanner cross-continent)
- US East to Asia Pacific: 150-250ms overhead (Global Spanner deployment)
For a single-region deployment, strong consistency adds negligible latency — a 2ms write overhead is invisible to users. The cost becomes significant only for cross-region deployments, where every write pays the speed-of-light tax. This is why many architectures use strong consistency within a region but relax to eventual consistency across regions.
If your deployment is single-region (one cloud region with multiple availability zones), strong consistency is almost free — 1-5ms overhead per write. The 'strong consistency is expensive' argument applies primarily to multi-region deployments where cross-region round-trips dominate. Do not avoid strong consistency based on latency concerns if you are deploying within a single region.
Eventual consistency makes a weaker promise: if no new writes occur, all replicas will eventually converge to the same value. During the convergence window, different replicas may return different values for the same key. The system prioritizes availability and low latency over immediate consistency.

The Convergence Window
The convergence window is the time between a write being acknowledged and all replicas having that write. In a well-functioning system, this window is typically 10-200 milliseconds. During this window, reads from different replicas may return different values.
The convergence window depends on:
- Network latency between replicas (1ms same-rack, 80-150ms cross-region)
- Replication queue depth (how many pending writes are waiting to be replicated)
- Replica load (a busy replica may delay applying replicated writes)
Eventual consistency does not specify a maximum convergence time. In practice, it is bounded by network round-trip time and replica processing speed. But under heavy load or network issues, the window can stretch to seconds or even minutes.
Why Eventual Consistency Enables Higher Availability
Eventual consistency decouples the write acknowledgment from replication. The primary acknowledges the write as soon as it is persisted locally, without waiting for replicas. This means:
- Writes never block on slow replicas — A replica in a remote region with 150ms latency does not slow down local writes. The write returns in under 1ms (local disk), and replication happens in the background.
- Writes succeed even when replicas are down — If a replica is temporarily unreachable (network issue, maintenance, crash), writes continue unaffected. The replica catches up when it recovers via the replication log.
- No single point of failure for writes — In a multi-primary or leaderless eventually consistent system, any node can accept writes. There is no single leader whose failure blocks writes. Cassandra and DynamoDB both allow writes to any replica, making them highly available by design.
The trade-off is that reads during the convergence window may return stale data. But for many workloads, a sub-second window of staleness is a small price to pay for the ability to accept writes at any time, from any node, with sub-millisecond latency.
Conflict Resolution
When multiple replicas accept writes concurrently (multi-primary or leaderless replication), the same key can receive different values on different replicas. These are write conflicts that must be resolved during convergence.
Last-Write-Wins (LWW) — Each write carries a timestamp. When conflicts are detected, the write with the latest timestamp wins. Simple but data-losing: if two clients write different values at nearly the same time, one value is silently discarded. DynamoDB and Cassandra use LWW by default.
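The silent data loss LWW can cause is easy to demonstrate. The timestamps and values below are invented for illustration:

```python
def lww_merge(a, b):
    # a and b are (timestamp_ms, value) pairs; the later timestamp wins
    # and the other value is discarded without any error or warning.
    return max(a, b)

# Two clients update the same profile field ~2ms apart on different replicas.
replica1 = (1_700_000_000_123, "alice@example.com")
replica2 = (1_700_000_000_125, "alice@example.org")
merged = lww_merge(replica1, replica2)
# replica1's write is silently lost, even though its client saw a success.
```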
Version vectors — Each replica maintains a vector of counters tracking how many writes it has seen from every other replica. When two versions are concurrent (neither causally precedes the other), the system detects the conflict and either merges the values or presents both to the application for resolution. Riak uses version vectors.
CRDTs (Conflict-free Replicated Data Types) — Data structures designed to merge automatically without conflicts. A G-Counter (grow-only counter) lets each replica increment independently; the merged value is the sum of all replica counters. CRDTs work for specific data types (counters, sets, registers) but not for arbitrary data.
Each replica only increments its own entry. Merging takes the maximum of each entry. No conflicts are possible because entries never decrease and each replica owns exactly one entry.
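A G-Counter fits in a few lines. This is an illustrative sketch, not a production CRDT library:

```python
class GCounter:
    """Grow-only counter CRDT: one entry per replica; each replica increments
    only its own entry, and merge takes the per-entry maximum."""

    def __init__(self, replica_id, n):
        self.id = replica_id
        self.counts = [0] * n

    def increment(self, amount=1):
        self.counts[self.id] += amount

    def value(self):
        # The true total is the sum of every replica's own count.
        return sum(self.counts)

    def merge(self, other):
        # Merge is commutative, associative, and idempotent, so replicas
        # can exchange state in any order, any number of times, and converge.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]
```

Because merge is idempotent, it is safe to gossip the same state repeatedly; duplicated messages cannot inflate the count.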
CRDTs are not a universal solution. They work well for specific data types (counters, sets, flags, registers) but not for arbitrary operations. A CRDT shopping cart (OR-Set) handles add/remove gracefully, but a CRDT bank account balance does not exist — you cannot decrement a G-Counter, and allowing concurrent decrements on a bank balance requires coordination that CRDTs are designed to avoid. Choose CRDTs when your data structure fits a known CRDT type; use application-level conflict resolution otherwise.
Anti-Entropy Mechanisms
Anti-entropy ensures replicas converge even when normal replication misses updates:
Read repair — When a coordinator reads from multiple replicas and detects a stale value, it sends the latest value back to the stale replica. This passively repairs inconsistencies during normal read operations. Cassandra uses read repair extensively.
Merkle tree comparison — Each replica builds a hash tree (Merkle tree) over its data. Replicas periodically exchange root hashes. If roots differ, they walk down the tree to identify exactly which data ranges are inconsistent, then synchronize only the differing ranges. This is efficient because it only transfers the data that actually differs.
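The core idea, compare cheap digests first and transfer data only for ranges that differ, can be sketched with flat per-range hashes rather than a full tree. Function names are invented; a real Merkle tree nests this comparison recursively:

```python
import hashlib

def range_digests(store, ranges=4):
    # Hash each key range of a replica's data into a short digest.
    digests = []
    for r in range(ranges):
        items = sorted((k, v) for k, v in store.items()
                       if hash(k) % ranges == r)
        digests.append(hashlib.sha256(repr(items).encode()).hexdigest())
    return digests

def ranges_to_sync(store_a, store_b, ranges=4):
    # Compare digests; only ranges whose digests differ need to exchange
    # actual rows, so bandwidth scales with the diff, not the dataset.
    da = range_digests(store_a, ranges)
    db = range_digests(store_b, ranges)
    return [r for r in range(ranges) if da[r] != db[r]]
```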
Hinted handoff — When a write's target replica is temporarily unreachable, another node stores the write as a "hint." When the target recovers, the hint is replayed. This ensures writes are not lost during temporary failures. Hinted handoff has limits: if the target is down for too long (default 3 hours in Cassandra), the hint expires and a full anti-entropy repair is needed to restore consistency.
These three mechanisms work together: hinted handoff handles short outages (minutes), read repair fixes inconsistencies on popular keys during normal reads, and Merkle tree comparison catches everything else during periodic background repair cycles. Together, they ensure that all replicas eventually converge regardless of what failures occur during normal operation.
Tombstones and Deletion Consistency
Deletes in eventually consistent systems are surprisingly tricky. If you simply remove a record from one replica, other replicas still have the record and will re-replicate it during anti-entropy, resurrecting the deleted data. The solution is tombstones: instead of deleting, write a special "this record is deleted" marker. The tombstone replicates like any other write. After a grace period (default 10 days in Cassandra), the tombstone is garbage collected.
The grace period must be longer than the maximum time a replica can be offline. If a replica is down for 15 days and tombstones expire after 10 days, that replica still has the old data and the tombstone is gone — the data is resurrected. This is why Cassandra operations teams monitor replica downtime and run repair before bringing long-offline replicas back online.
Tombstones also accumulate if an application frequently deletes and recreates records (e.g., expiring sessions). Too many tombstones slow down reads because the database must scan through them. Monitoring tombstone count per read and tuning the gc_grace_seconds parameter are essential operational tasks for eventually consistent systems that handle frequent deletes.
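The tombstone lifecycle can be sketched as follows. The gc_grace parameter is modeled on Cassandra's gc_grace_seconds, but this is an invented in-memory model, not Cassandra code:

```python
TOMBSTONE = object()  # sentinel marking "this key was deleted"

class Replica:
    def __init__(self, gc_grace_seconds):
        self.rows = {}  # key -> (write_time, value_or_TOMBSTONE)
        self.gc_grace = gc_grace_seconds

    def put(self, key, value, now):
        self.rows[key] = (now, value)

    def delete(self, key, now):
        # Deletion is a write: the tombstone replicates like any other value,
        # so anti-entropy propagates the delete instead of resurrecting data.
        self.rows[key] = (now, TOMBSTONE)

    def get(self, key):
        row = self.rows.get(key)
        return None if row is None or row[1] is TOMBSTONE else row[1]

    def gc(self, now):
        # Purge tombstones older than the grace period. After this point,
        # nothing stops a long-offline replica from re-introducing the
        # old value, which is why the grace period must exceed the maximum
        # replica downtime.
        self.rows = {k: (t, v) for k, (t, v) in self.rows.items()
                     if v is not TOMBSTONE or now - t <= self.gc_grace}
```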
Real-World Eventual Consistency Examples
DNS — When you update a DNS record (e.g., pointing example.com to a new IP), the change propagates through a hierarchy of caches. Each cache serves the old IP until its TTL expires. A TTL of 3600 seconds means some clients see the old IP for up to an hour after the change. This is eventual consistency with a convergence window equal to the maximum TTL across all caches.
CDN cache invalidation — When a website deploys new content, CDN edge servers worldwide continue serving the cached old version until the cache is invalidated or expires. Cloudflare reports that cache purge propagation takes 30-150ms to most PoPs but can take seconds to remote locations. During this window, users in different regions see different versions of the same page.
DynamoDB Global Tables — DynamoDB replicates tables across AWS regions with a typical replication lag of under 1 second. During this window, a write in us-east-1 is not visible to readers in eu-west-1. Applications that cannot tolerate this lag must use the same region for reads and writes. Applications that can (product catalog, user preferences) benefit from low-latency local reads worldwide.
Cassandra multi-datacenter — Cassandra replicates asynchronously between data centers by default. A write in DC1 is acknowledged locally, then replicated to DC2. The replication lag depends on network latency and cluster load. Under normal conditions, cross-DC replication completes in 50-200ms. Under heavy load, it can stretch to seconds. Read repair and anti-entropy repair ensure eventual convergence even if replication messages are lost.
Eventual consistency does not mean 'inconsistent.' It means 'temporarily stale.' The system guarantees convergence — all replicas will reach the same state. The question is how long it takes and what happens to reads during the window. For many applications, a 100ms convergence window is indistinguishable from strong consistency to the end user.
When to Use Eventual Consistency
Eventual consistency is the right choice when:
- Staleness is invisible to users (social feeds, activity logs, recommendation engines)
- Availability and low latency matter more than perfect accuracy (DNS, CDN caches)
- Write throughput must be maximized (time-series data, IoT sensor streams)
- The data is append-only or naturally conflict-free (event logs, clickstream data)
Between strong and eventual consistency lies a practical middle ground: session guarantees and tunable consistency. These models let you get "good enough" consistency for specific operations without paying the full cost of strong consistency for every read and write.

Read Your Own Writes
Read-your-writes is the most commonly needed session guarantee. It promises that after a client writes a value, that same client will always see the written value (or a newer one) on subsequent reads. Other clients may still see stale data — the guarantee is scoped to the writing client's session.
Implementation approaches:
Route reads to the primary — After a write, the client's subsequent reads are sent to the primary (which always has the latest data) instead of a replica. Simple but creates a hot spot on the primary for write-heavy clients.
Version token tracking — The write response includes a version token (e.g., a log sequence number). The client attaches this token to subsequent reads. The system routes the read to any replica that has caught up to at least that version. This distributes read load while guaranteeing freshness for the writing client.
Sticky sessions — Route all of a client's requests to the same replica. If the client writes and reads from the same replica, it sees its own writes. But if that replica fails, the session breaks and the guarantee is lost until the client is re-routed to another replica.
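The version-token approach described above can be sketched with a log sequence number (LSN) as the token. All names here are invented for illustration:

```python
class Node:
    def __init__(self):
        self.data = {}
        self.applied_lsn = 0  # highest log sequence number applied so far

class SessionRouter:
    """Sketch of token-based read-your-writes routing."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def write(self, key, value):
        # The primary applies the write and hands back its LSN as a token
        # for the client to attach to subsequent reads.
        self.primary.applied_lsn += 1
        self.primary.data[key] = value
        return self.primary.applied_lsn

    def read(self, key, min_lsn=0):
        # Serve from any replica that has caught up to the client's token;
        # otherwise fall back to the primary, which is always fresh.
        for node in self.replicas:
            if node.applied_lsn >= min_lsn:
                return node.data.get(key)
        return self.primary.data.get(key)
```

A read carrying the token is guaranteed fresh for that client, while a token-less read may land on a lagging replica and see stale (here, missing) data.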
Monotonic Reads
Monotonic reads guarantee that once a client sees a value at version N, it never sees a value at version M where M is less than N on subsequent reads. Without this guarantee, a client might read version 5, then read version 3 on the next request (because the second read hit a different, more stale replica). This time-travel effect is confusing and can break application logic that assumes data only moves forward.
Implementation: track the latest version each client has seen (in a session cookie or server-side session store), and route subsequent reads to replicas that are at least as current as that version. This is weaker than read-your-writes — it does not require seeing your own writes, only that reads do not go backward.
Monotonic Writes
Monotonic writes guarantee that writes from a single client are applied in the order they were issued. If a client writes A then writes B, every replica applies A before B. Without this guarantee, a replica might receive B before A (due to network reordering), producing an inconsistent state.
This matters for operations that depend on ordering: a client creates a user account (write 1), then updates the account's email (write 2). If write 2 arrives at a replica before write 1, the update targets a non-existent account. Monotonic writes prevent this by ensuring causal ordering within a single client's write stream.
Causal Consistency
Causal consistency preserves cause-and-effect ordering across all clients. If operation A causally precedes operation B (B was influenced by or depends on A's result), then every client sees A before B. Operations that are not causally related (concurrent operations) can be seen in any order.
Example: User A posts "I got the job!" User B sees the post and replies "Congratulations!" Causal consistency ensures that every other user who sees B's reply also sees A's original post. Without causal consistency, some users might see "Congratulations!" without the post it refers to — a confusing and broken experience.
Causal consistency uses version vectors or logical clocks rather than wall-clock timestamps to track causal dependencies. Each operation carries metadata describing which operations it has observed. This ordering is correct even when physical clocks disagree across machines, which they always do to some degree (clock skew).
Causal consistency is weaker than linearizability (concurrent operations have no ordering guarantee) but stronger than eventual consistency (causally related operations are always ordered). It strikes a practical balance: it prevents the most confusing anomalies (seeing a reply without the original post) without requiring the expensive coordination of strong consistency.
MongoDB 3.6+ supports causal consistency through session-scoped causal tokens that track the logical timestamp of the last operation the session has seen. Each operation returns a cluster time, and subsequent operations in the same session include this time to ensure they read data that is at least as recent as the previous operation's result.
Comparing Session Guarantees
From lowest to highest cost:
- Read-your-writes prevents writing then reading stale data. Low cost — route reads to the primary or track a version token. Solves the most common consistency complaint ("I updated my profile but still see the old version").
- Monotonic reads prevents seeing newer then older data on page refresh. Low cost — track last-seen version per client session. Prevents the confusing "time travel" effect.
- Monotonic writes prevents out-of-order write application on replicas. Low cost — sequence writes per client session. Prevents "update before create" anomalies.
- Causal consistency prevents seeing an effect before its cause (a reply before the original post). Medium cost — requires version vectors or logical clocks to track causal dependencies across clients.
- Linearizability prevents any stale or out-of-order read system-wide. High cost — requires a consensus protocol on every operation.
The first three guarantees are "session-scoped" — they apply to a single client's view. Causal consistency extends ordering across clients. Linearizability is global. Most applications need read-your-writes and monotonic reads. Few need full linearizability.
Tunable Consistency (Cassandra Model)
Cassandra and similar leaderless replication systems let you choose consistency per operation using three parameters:
- N — the number of replicas that store each piece of data (replication factor)
- W — the number of replicas that must acknowledge a write before it is considered successful
- R — the number of replicas that must respond to a read before it returns a result
The key formula: W + R > N guarantees that at least one replica in the read set has the latest write (the read and write quorum sets overlap). This provides read-after-write consistency without requiring all replicas to participate in every operation.
The trade-off is direct: higher W means slower writes but fresher reads. Higher R means slower reads but fresher results. W=1 gives maximum write throughput but minimum durability. R=1 gives minimum read latency but no consistency guarantee without high W.
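The overlap property behind W + R > N can be checked by brute force for small clusters. This is an illustrative helper, not part of any database's API:

```python
from itertools import combinations

def quorums_always_overlap(n, w, r):
    # Every size-W write set must share at least one replica with every
    # size-R read set; by pigeonhole, this holds exactly when W + R > N.
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))
```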
Version Vectors in Practice
Version vectors track causal history across replicas. Each replica maintains a counter, and the vector records the latest counter value seen from each replica.
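A minimal comparison over version vectors, represented as dicts from replica id to counter, looks like this (an invented helper for illustration, not Riak's implementation):

```python
def compare(a, b):
    # a, b: version vectors as dicts of replica_id -> writes seen from it.
    ids = set(a) | set(b)
    a_dominates = all(a.get(i, 0) >= b.get(i, 0) for i in ids)
    b_dominates = all(b.get(i, 0) >= a.get(i, 0) for i in ids)
    if a_dominates and b_dominates:
        return "equal"
    if a_dominates:
        return "a_after_b"   # a causally succeeds b
    if b_dominates:
        return "b_after_a"   # b causally succeeds a
    return "concurrent"      # neither dominates: a genuine write conflict
```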
The key insight is that version vectors detect concurrency without relying on synchronized clocks. If one vector is strictly greater than or equal to another in every component, it causally succeeds it. If neither dominates, the operations are concurrent and the system must resolve the conflict.
Combining Session Guarantees
In practice, you often combine multiple session guarantees. A user profile service might need:
- Read-your-writes — the editing user sees their changes immediately
- Monotonic reads — once a user sees an updated profile, they never see the old version on refresh
- Monotonic writes — profile updates are applied in order (changing email, then changing password based on the new email)
These three together form a "session consistency" package that gives individual users a coherent experience while allowing the system to serve other users from eventually consistent replicas. Most application-level "consistency bugs" (user updates something and does not see the change) are solved by read-your-writes alone, without requiring full strong consistency.
Tunable consistency is not about choosing one setting for the entire database. It is about choosing per-operation. A shopping cart add uses W=1, R=1 for speed (losing a cart item is annoying but recoverable). An order confirmation uses W=2, R=2 for correctness (losing an order is unacceptable). The same Cassandra cluster serves both with different consistency levels on each request.

The CAP theorem, conjectured by Eric Brewer in 2000 and proven by Seth Gilbert and Nancy Lynch in 2002, states that a distributed system can provide at most two of three guarantees during a network partition: Consistency (every read returns the latest write), Availability (every request receives a non-error response), and Partition tolerance (the system continues operating despite network failures between nodes).
The critical insight is that partition tolerance is not optional. In any distributed system, network partitions will happen — switches fail, cables are cut, cloud availability zones lose connectivity. You must tolerate partitions. So the real choice during a partition is between consistency and availability.
CP Systems — Consistency Over Availability
A CP system refuses to serve requests that might return stale data during a partition. If a node cannot confirm it has the latest data (because it is cut off from the primary or a quorum), it returns an error rather than a potentially stale value. The system sacrifices availability to prevent inconsistency.
Examples: etcd uses Raft consensus, so the minority side of a partition cannot elect a leader and refuses all writes. ZooKeeper operates similarly — partitioned nodes become read-only or unavailable entirely. HBase depends on ZooKeeper for region server assignment, making it unavailable if the ZooKeeper quorum is lost. These systems are used for coordination, configuration, and leader election where serving stale data causes correctness failures that are worse than temporary unavailability.
AP Systems — Availability Over Consistency
An AP system continues serving requests during a partition, even if some responses contain stale data. Every reachable node responds to every request using its local data, which may be outdated. When the partition heals, replicas reconcile their divergent state using conflict resolution strategies like LWW, version vectors, or CRDTs.
Examples: Cassandra is tunable but defaults to AP — all nodes accept reads and writes independently. DynamoDB uses eventually consistent reads by default, serving from any replica. DNS serves cached records even when authoritative servers are unreachable. These systems prioritize uptime because unavailability causes more user-visible damage than a brief period of staleness.
The "CA" Misconception
Teams sometimes claim their system is "CA" — both consistent and available. In a truly distributed system, this is impossible. Partitions will occur. A system that has not designed for partition behavior does not avoid partitions — it simply has undefined behavior when they happen. The system might hang indefinitely, corrupt data, or split-brain. A single-node database (PostgreSQL on one server) is technically CA because there is no network between nodes, but it is not distributed and has a single point of failure.
Real-World Partitions and Gray Failures
Clean network partitions (node A cannot reach node B at all) are actually the easy case to handle. More common and more dangerous are gray failures: intermittent packet loss, extreme latency spikes (10x normal), or asymmetric partitions (A can reach B but B cannot reach A). These gray failures cause timeouts that look like partitions but resolve themselves, making it unclear whether to trigger partition-handling logic.
The 2016 Dyn DNS attack caused gray failures across the internet. Many systems experienced intermittent connectivity, not clean partitions. Systems designed only for clean CP-or-AP behavior struggled because they could not determine whether they were partitioned or just experiencing high latency.
Beyond CAP — The PACELC Framework
CAP only describes behavior during partitions, which are rare. What about normal operation? The PACELC framework extends CAP: during a Partition, choose Availability or Consistency; Else (normal operation), choose Latency or Consistency.
This matters because even without a partition, strong consistency adds latency. Every write must wait for replica acknowledgment. Every read may need a quorum check. PACELC classifies systems more precisely:
- PA/EL (Cassandra default, DynamoDB default): Available during partitions, low latency normally. Maximum performance, eventual consistency
- PC/EC (etcd, ZooKeeper, Spanner): Consistent during partitions, consistent normally. Maximum correctness, higher latency on every operation
- PA/EC (MongoDB default config): Available during partitions, but consistent normally (reads go to the primary). A pragmatic middle ground that prioritizes normal-case consistency but accepts inconsistency during rare partition events
PACELC reveals that the daily cost of consistency is latency on every request, not just the rare cost during partitions. For many teams, the Else clause (latency vs. consistency during normal operation) matters far more than the Partition clause because normal operation accounts for 99.99 percent of the time.
CAP is often oversimplified as 'pick two of three.' In reality, you are always partition-tolerant (you must be), so the choice is C or A during partitions. And the choice is not binary — tunable consistency lets you choose per-operation. Furthermore, partitions are not all-or-nothing: gray failures (intermittent packet loss, latency spikes) create ambiguity that clean CP-or-AP models do not address. Design for gray failures, not just clean partitions.
Choosing a consistency model is not a global, one-time decision. Modern systems use different consistency levels for different operations within the same application. The right model depends on what breaks when data is stale and how much latency you can tolerate.

The Decision Framework
Ask two questions about each piece of data:
1. What is the cost of a stale read? If a stale read causes financial loss, data corruption, or safety issues (bank balance, inventory count, distributed lock), use strong consistency. If a stale read is invisible or merely annoying to users (social feed, search index, analytics dashboard), eventual consistency is sufficient.
2. What is the acceptable latency budget? Strong consistency adds write latency proportional to the network distance between replicas. Within a single region (1-5ms added), this is usually acceptable. Across regions (80-150ms added), it may be prohibitive for latency-sensitive operations like search queries or page loads.
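The two questions can be encoded as a small decision helper. This is a hypothetical sketch, not an API of any real system: the function name, parameters, and the latency figures (1-5ms intra-region, 80-150ms cross-region, taken from the text above) are assumptions for illustration.

```python
# Hypothetical decision helper encoding the two questions above.
# All names and thresholds are illustrative assumptions, not a real API.

def choose_consistency(stale_read_is_harmful: bool,
                       cross_region: bool,
                       latency_budget_ms: float) -> str:
    if stale_read_is_harmful:
        # Strong consistency only fits if the latency budget covers the
        # replication round-trip: ~1-5 ms added within a region,
        # ~80-150 ms added across regions.
        added_ms = 150 if cross_region else 5
        if latency_budget_ms >= added_ms:
            return "strong"
        return "rethink-placement"  # e.g. keep replicas in one region
    return "eventual"

print(choose_consistency(True, False, 20))   # bank balance, single region
print(choose_consistency(True, True, 50))    # strong cross-region won't fit
print(choose_consistency(False, True, 50))   # social feed
```

The "rethink-placement" branch captures a real outcome of this analysis: when an operation needs strong consistency but cannot afford cross-region round-trips, the usual answer is to keep its replicas within one region (Pattern 4 below).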
Real-World Patterns
Pattern 1 — Strong for writes, eventual for reads: Write the authoritative record to a strongly consistent primary database. Asynchronously replicate to read replicas, caches, and search indexes. Reads from the primary are consistent; reads from replicas are eventually consistent but much faster and more scalable. This is the most common pattern in production systems at Amazon, Netflix, and Uber.
Pattern 2 — Per-operation tunable consistency: Use a database like Cassandra that supports per-request consistency levels. The same cluster serves an order placement (W=2, R=2 for correctness) and a product review listing (W=1, R=1 for speed). No separate infrastructure is needed — the application chooses the consistency level on each query based on the operation's correctness requirements.
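The correctness condition behind those per-operation levels is quorum overlap: with N replicas, choosing W + R > N guarantees every read quorum intersects the most recent write quorum, so at least one contacted replica holds the newest value. A pure-Python illustration of the rule (not Cassandra code):

```python
# Quorum overlap rule: with N replicas, W + R > N guarantees that any
# read quorum shares at least one replica with the latest write quorum,
# so the read can observe the newest value.
# Pure-Python illustration of the arithmetic; not a Cassandra API.

def quorum_reads_are_strong(n: int, w: int, r: int) -> bool:
    return w + r > n

print(quorum_reads_are_strong(3, 2, 2))  # order placement: True
print(quorum_reads_are_strong(3, 1, 1))  # review listing: False (eventual)
```

With N=3, the W=2/R=2 order placement satisfies 2 + 2 > 3 and reads are guaranteed fresh; the W=1/R=1 review listing does not, and accepts eventual consistency in exchange for one round-trip per operation.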
Pattern 3 — Read-your-writes with eventual for others: After a user writes data, route that user's reads to the primary (or a replica confirmed to have the write) for a session window. All other users read from eventually consistent replicas. This gives the writing user a consistent experience without slowing down reads for the millions of other users. DynamoDB's ConsistentRead=true flag implements this pattern per-request.
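A minimal sketch of this routing, assuming a monotonically increasing version number on the primary and a session that remembers the version of its last write. The class and field names are illustrative, not a real client library:

```python
# Sketch of read-your-writes routing via a session version token.
# The write path records a version in the session; the read path uses a
# replica only if that replica has applied the session's last write.
# All names here are illustrative assumptions, not a real library.

class Replica:
    def __init__(self):
        self.version = 0   # highest write version applied so far
        self.data = {}

class Router:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def write(self, session, key, value):
        self.primary.version += 1
        self.primary.data[key] = value
        session["last_write"] = self.primary.version  # version token

    def read(self, session, key):
        token = session.get("last_write", 0)
        for r in self.replicas:
            if r.version >= token:         # replica has caught up
                return r.data.get(key)
        return self.primary.data.get(key)  # fall back to the primary

primary, replica = Replica(), Replica()
router = Router(primary, [replica])
session = {}
router.write(session, "cart", ["book"])
# The replica lags behind, so this session falls back to the primary:
print(router.read(session, "cart"))
```

Sessions with no recent write (token 0) are always served from replicas, so the expensive primary reads are confined to the brief window after each user's own writes.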
Pattern 4 — Strong within region, eventual across regions: Use synchronous replication within each region for strong local consistency. Use asynchronous replication between regions for eventual cross-region consistency. This bounds write latency to intra-region round-trips (1-5ms) while still providing global data distribution. Users in the local region always see fresh data; users in remote regions may see data that is 100-300ms stale, which is acceptable for most applications.
Decision Table by Use Case
Strong consistency required:
- Account balances and financial transactions
- Inventory counts during checkout
- Distributed locks and leader election
- User authentication state (login/logout)
- Configuration that all nodes must agree on
Session consistency (read-your-writes) sufficient:
- User profile updates (the editing user needs to see changes immediately)
- Shopping cart modifications (the shopper needs to see items they added)
- Draft document saves (the author needs to see the latest draft)
- Notification preferences (the user toggling a setting expects immediate effect)
Eventual consistency sufficient:
- Social media feeds and timelines
- Search indexes and recommendation engines
- Analytics dashboards and reporting
- CDN-cached content and DNS records
- Activity logs and audit trails
How Major Systems Handle Consistency
Understanding how production systems implement consistency helps you make informed choices:
Amazon DynamoDB — Eventually consistent reads by default (cheaper, lower latency, served from any replica). Strongly consistent reads available per-request via the ConsistentRead=true parameter (routed to the partition's leader node, costs 2x the read capacity). Global tables replicate across regions with eventual consistency only — no cross-region strong reads.
Google Cloud Spanner — Strongly consistent (strict serializability) by default, even across regions. Uses TrueTime for globally meaningful timestamps. Bounded-staleness reads available for lower latency when slight staleness is acceptable. Spanner is notable among globally distributed databases for providing external consistency (linearizability + serializability) as the default rather than as an opt-in.
Apache Cassandra — Tunable per-query via consistency levels (ONE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM). Different queries to the same table can use different levels. LOCAL_QUORUM provides strong consistency within a data center while allowing eventual consistency across data centers — a common production configuration.
MongoDB — Write concern and read concern are separate settings. Write concern "majority" ensures durability. Read concern "linearizable" provides linearizable reads but with higher latency. Read concern "majority" provides read-your-writes within a session. The historical defaults (write concern 1, read concern "local") provided no consistency guarantees beyond single-node; since MongoDB 5.0 the default write concern is "majority", though read concern still defaults to "local".
PostgreSQL with streaming replication — Synchronous replication provides strong consistency at the cost of write latency. The synchronous_commit setting can be changed per-transaction, allowing critical writes to wait for replica acknowledgment while non-critical writes proceed asynchronously. This is per-transaction tunable consistency, similar in spirit to Cassandra's per-query levels.
Common Mistakes
Mistake 1 — Using strong consistency everywhere: Strong consistency for every read and write wastes latency and availability budget on operations that do not need it. A product catalog does not need linearizable reads. If your e-commerce site uses strong consistency for browsing catalog pages, you are adding 1-5ms of latency to every page load and reducing your tolerance for replica failures — all for a guarantee that provides no user-visible benefit since catalog data changes infrequently.
Mistake 2 — Using eventual consistency for coordination: Distributed locks, leader election, and sequence number generation require strong consistency. An eventually consistent lock allows two nodes to believe they are both the leader simultaneously. Both nodes proceed with conflicting operations, causing data corruption. This is not a theoretical risk — it is one of the most common causes of distributed system outages. Always use a CP system (etcd, ZooKeeper) for coordination.
Mistake 3 — Ignoring the convergence window size: Saying "we use eventual consistency" without understanding the convergence window is dangerous. If replication lag can reach 30 seconds under load and your SLA requires freshness within 5 seconds, you need monitoring and alerting on replication lag metrics, not just a consistency label. Measure the p99 replication lag in your system and set alerts for when it exceeds your application's tolerance.
Mistake 4 — Confusing consistency levels across the stack: A system might use strong consistency in the database but cache aggressively in the application layer, creating effective eventual consistency. If the cache TTL is 60 seconds, the user may see stale data for up to 60 seconds regardless of database consistency. The effective consistency is the weakest link in the entire read path, not just the database layer.
Monitoring Consistency in Production
Choosing a consistency model is not enough — you must verify it is working as expected in production. Key metrics to monitor:
Replication lag (p50, p95, p99) — The time between a write being committed on the primary and it being applied on each replica. If your application tolerates 1 second of staleness but p99 replication lag is 5 seconds, you have a consistency problem that only affects your busiest periods — exactly when it matters most.
Stale read rate — For systems with read-your-writes guarantees, measure how often a client reads data older than their last write. This catches routing bugs (reads going to the wrong replica) and replication failures. Instrument this by including a version token in write responses and checking it on subsequent reads.
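A sketch of that instrumentation, assuming writes return a monotonically increasing version and reads report the version they observed. The class and method names are illustrative:

```python
# Stale-read-rate instrumentation sketch. Writes return a monotonically
# increasing version; each read reports the version it observed, and any
# read older than the session's last write counts as stale.
# Names are illustrative assumptions, not a real metrics library.

class StaleReadMeter:
    def __init__(self):
        self.reads = 0
        self.stale = 0

    def record(self, last_written_version: int, version_read: int):
        self.reads += 1
        if version_read < last_written_version:
            self.stale += 1

    def stale_rate(self) -> float:
        return self.stale / self.reads if self.reads else 0.0

meter = StaleReadMeter()
meter.record(last_written_version=7, version_read=7)  # fresh read
meter.record(last_written_version=7, version_read=6)  # stale read
print(meter.stale_rate())  # 0.5
```

In production this counter would be exported to your metrics system and broken down by replica, so a single lagging or misrouted replica stands out.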
Conflict rate — For eventually consistent systems with multi-primary writes, track how often conflicts occur and how they are resolved. A rising conflict rate may indicate a design problem: too many clients writing to the same keys, or a partition causing replicas to diverge.
Convergence time — Measure how long it takes for all replicas to agree on a value after a write. This is different from replication lag (which measures one replica). Write a known value, then poll all replicas until all return the new value. The maximum time across all replicas is the convergence time.
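The polling loop can be sketched as follows. The replicas here are simulated with a fixed apply delay so the example is self-contained; in production you would poll real replica endpoints instead.

```python
# Convergence-time measurement sketch: write a known value, then poll
# every replica until all return it. SimReplica is a stand-in that
# applies the write after a fixed delay; replace with real endpoints.

import time

class SimReplica:
    def __init__(self, apply_delay_s: float):
        self.apply_delay_s = apply_delay_s
        self.applied_at = None
        self.value = None

    def apply(self, value, written_at):
        self.value = value
        self.applied_at = written_at + self.apply_delay_s

    def read(self, now):
        if self.applied_at is not None and now >= self.applied_at:
            return self.value
        return None  # write not yet visible on this replica

def measure_convergence(replicas, value, poll_interval_s=0.01, timeout_s=5.0):
    start = time.monotonic()
    for r in replicas:
        r.apply(value, start)
    while time.monotonic() - start < timeout_s:
        now = time.monotonic()
        if all(r.read(now) == value for r in replicas):
            return now - start  # bounded by the slowest replica's lag
        time.sleep(poll_interval_s)
    raise TimeoutError("replicas did not converge within timeout")

replicas = [SimReplica(0.02), SimReplica(0.08)]
elapsed = measure_convergence(replicas, "v42")
print(f"converged in {elapsed * 1000:.0f} ms")
```

The measured time is dominated by the slowest replica, which is exactly the property that distinguishes convergence time from per-replica replication lag.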
Set alerts on these metrics tied to your SLA. If your application promises "updates visible within 5 seconds," alert when p99 replication lag exceeds 3 seconds (giving you a 2-second buffer to investigate before users are affected).
The most scalable architectures are not 'strongly consistent' or 'eventually consistent.' They are selectively consistent — using the minimum consistency level required for each operation. This is why DynamoDB offers both ConsistentRead=true and ConsistentRead=false on the same table, why Cassandra supports per-query consistency levels, and why most production systems route writes to a primary but reads to replicas. Design consistency per-operation, not per-system.