System Design Fundamentals
Replication Topologies
The simplest replication model: one node accepts all writes (the leader), and one or more nodes stream those writes and serve reads (the followers). Every write goes through a single authority, which means there are zero write conflicts by design. This simplicity is why leader-follower is the default for PostgreSQL, MySQL, MongoDB replica sets, and Redis.
Why does replication matter in the first place? Two reasons: durability (if one machine dies, the data survives on another) and performance (distribute read traffic across multiple machines). Leader-follower replication gives you both, with the constraint that writes are bottlenecked on a single node.

Why it works
No write conflicts. One leader orders all writes sequentially. Followers replay the log in the same order. You never have two nodes accepting conflicting writes to the same row. This sequential ordering is the source of consistency — every follower applies exactly the same sequence of changes, guaranteeing all replicas converge to the same state.
Read scalability. Add followers to absorb read traffic. A social media feed with a 100:1 read-to-write ratio can serve reads from 10 followers while the leader handles writes alone. Reads scale nearly linearly — each additional follower handles an independent subset of read queries, and a load balancer distributes traffic across them. For a system serving 10,000 reads per second, 5 followers handle roughly 2,000 reads/sec each.
Simple failover. Followers are hot standbys. If the leader dies, promote the most up-to-date follower. The new leader has the complete write history (minus any unreplicated tail).
In practice: PostgreSQL streaming replication, MySQL/MariaDB binary log replication, MongoDB replica sets, Redis primary/replica. These are all leader-follower by default. The differences are in how they handle promotion, lag monitoring, and synchronous vs. asynchronous modes.
Synchronous vs. asynchronous replication
The critical question: does the leader wait for followers to confirm before acknowledging the write to the client?
Synchronous — the leader waits for at least one follower to persist the write before returning success. The client knows the data is on two machines. Trade-off: every write pays the latency of the slowest required follower. If that follower is across an ocean, writes slow to 100ms+.
Asynchronous — the leader acknowledges immediately after its own write. Followers catch up in the background. Trade-off: if the leader crashes before a follower has the latest write, that data is lost. This is replication lag — the gap between what the leader has written and what followers have received.
Most production deployments use semi-synchronous replication: one follower is synchronous (durability guarantee), the rest are asynchronous (performance). In PostgreSQL this is synchronous_commit = on with a single standby listed in synchronous_standby_names; in MySQL's semi-sync plugin it is rpl_semi_sync_master_wait_for_slave_count = 1.
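As a concrete reference point, here is a minimal sketch of semi-synchronous settings in a PostgreSQL leader's postgresql.conf. The standby names are placeholders; the `ANY n (...)` syntax is available in PostgreSQL 10 and later.

```ini
# postgresql.conf on the leader (standby names are illustrative)
synchronous_commit = on
# Wait for ANY 1 of the listed standbys to confirm each commit;
# the remaining standby replicates asynchronously, giving the
# semi-synchronous pattern described above.
synchronous_standby_names = 'ANY 1 (standby_a, standby_b)'
```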
Replication lag in numbers. On a local network, asynchronous replication lag is typically 1-10 milliseconds under normal load. Under heavy write bursts (batch imports, schema migrations with data backfill), lag can spike to seconds or even minutes. Cross-datacenter replication (leader in US-East, follower in EU-West) adds 80-150ms of network latency to the baseline lag.
What causes lag spikes? Three common causes: (1) the follower's disk is slower than the leader's (writes on the leader are fast, but replaying on the follower is slow due to I/O contention from reads served by that follower), (2) a large transaction on the leader (e.g., a bulk UPDATE of 1 million rows) must be replayed as a single atomic unit on the follower, blocking other replication during replay, (3) the follower is running long queries that conflict with WAL replay. PostgreSQL's hot_standby_feedback setting mitigates case (3) by reporting the follower's oldest running query to the leader, so the leader does not vacuum away row versions that query still needs — avoiding replay conflicts and query cancellations.
Replication lag is not a bug — it is a fundamental property of asynchronous replication. A read from a follower may return stale data. If your application reads its own writes (user updates profile, then immediately views it), route that read to the leader. This pattern is called read-your-writes consistency and it is the most common source of confusion in leader-follower systems.
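One simple way to implement read-your-writes is session-based routing: remember when each user last wrote, and pin their reads to the leader for a window longer than the expected replication lag. A minimal sketch (the class, names, and 5-second window are illustrative, not a standard API):

```python
import time

class ReadRouter:
    """Route reads to the leader for users who wrote recently,
    otherwise to a follower. The lag window is an assumed upper
    bound on replication lag for this deployment."""

    def __init__(self, lag_window_s=5.0):
        self.lag_window_s = lag_window_s
        self.last_write_at = {}  # user_id -> timestamp of last write

    def record_write(self, user_id, now=None):
        self.last_write_at[user_id] = now if now is not None else time.time()

    def pick_replica(self, user_id, now=None):
        now = now if now is not None else time.time()
        # If the user's last write may not have reached followers yet,
        # read from the leader to guarantee read-your-writes.
        wrote_recently = now - self.last_write_at.get(user_id, 0.0) < self.lag_window_s
        return "leader" if wrote_recently else "follower"
```

In practice the window should exceed your observed p99 replication lag, and the last-write timestamps can live in the user's session.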
Where it fits — and where it breaks
Best for: Read-heavy workloads (content APIs, product catalogs), consistency-first systems (financial ledgers, booking systems), any system where the write rate fits on a single node.
Breaks when: Write throughput exceeds what a single leader can handle. You cannot shard writes across followers — they are read-only. If you need 100K writes/sec and one node handles 20K, leader-follower alone is not enough. You need either sharding (split data across multiple leader-follower groups) or multi-leader replication.
Failover mechanics in detail. When the leader fails, the system must: (1) detect the failure (heartbeat timeout, typically 10-30 seconds), (2) elect a new leader from the followers (choose the one with the smallest replication lag), (3) reconfigure all clients to write to the new leader (DNS update, VIP failover, or proxy reconfiguration), (4) reconfigure remaining followers to replicate from the new leader. During this window (typically 10-60 seconds), writes are unavailable. This is the cost of strong consistency — the system refuses to accept writes rather than risk accepting them on two nodes simultaneously (split-brain).
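Step (2), electing the most up-to-date follower, reduces to comparing replicated log positions (LSN in PostgreSQL, binlog offset in MySQL). A sketch under the assumption that each follower reports its last replicated position as a number:

```python
def pick_new_leader(followers):
    """Given follower -> last replicated log position, promote the
    most caught-up follower. Field names are illustrative."""
    if not followers:
        raise RuntimeError("no followers available for promotion")
    # The follower with the highest replicated position has the
    # smallest lag, minimizing the unreplicated tail that is lost.
    return max(followers, key=followers.get)
```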
Split-brain prevention. The most dangerous failover scenario is split-brain: the old leader is not actually dead (just slow or partitioned), and both the old and new leaders accept writes simultaneously. Two leaders accepting writes for the same data means conflicts and data divergence. Prevention mechanisms include: fencing tokens (the new leader gets a monotonically increasing token; the old leader's writes are rejected by storage), STONITH (Shoot The Other Node In The Head — forcibly power off the old leader before promoting), and consensus-based leader election (Raft, Paxos) where a leader cannot serve unless it holds a majority lease.
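The fencing-token mechanism can be sketched in a few lines: the storage layer tracks the highest token it has seen and rejects writes carrying an older one, so a deposed ("zombie") leader's in-flight writes are refused. Class and method names here are illustrative:

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token.
    Each newly elected leader receives a strictly higher token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            # A write from an old leader: refuse it to prevent split-brain.
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data[key] = value
```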
Monitoring replication health. In production, you need to track three metrics continuously:
- Replication lag — how far behind each follower is, measured in bytes or seconds. A lag spike indicates the follower cannot keep up with write volume or is experiencing I/O contention. PostgreSQL exposes this via pg_stat_replication.replay_lag; MySQL via Seconds_Behind_Master.
- WAL send rate — bytes per second from leader to followers. A drop means the network is congested or the follower stopped requesting data.
- Follower connection state — streaming, catchup, or disconnected. A disconnected follower will miss writes and require a full re-sync if the WAL it needs has been recycled.
Set alerts on lag exceeding 10 seconds and on any follower disconnection. Replication lag is the single most important metric for leader-follower systems — it directly indicates whether your read replicas are serving fresh data and whether failover will lose transactions.
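On PostgreSQL, all three metrics are visible in one view. A sketch of the query (columns shown exist in PostgreSQL 10+; exact lag semantics vary by version):

```sql
-- Per-follower replication health, run on the leader.
-- replay_lag estimates how long the follower takes to apply recent WAL.
SELECT client_addr,
       state,        -- e.g. streaming, catchup
       sent_lsn,     -- how much WAL has been sent
       replay_lag    -- how far behind the follower's applied state is
FROM pg_stat_replication;
```

A follower missing from this view entirely is disconnected, which is exactly the alert condition described above.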
What if your users are on two continents? With leader-follower, a user in Tokyo writing to a leader in Virginia pays 150ms round-trip latency on every write. Multi-leader replication solves this by placing a leader in each region. Each leader accepts writes locally (fast) and replicates to other leaders asynchronously.

The benefit is obvious: local-latency writes everywhere. The cost is equally obvious: two leaders can accept conflicting writes to the same data at the same time.
Where you see this in practice: CockroachDB and Spanner use multi-leader-like architectures with Paxos/Raft consensus to avoid conflicts (at the cost of cross-region write latency for consensus). MySQL Group Replication and PostgreSQL BDR (Bi-Directional Replication) offer true multi-leader where each node accepts writes and conflicts are resolved asynchronously. Cassandra's multi-datacenter mode behaves like multi-leader with LWW conflict resolution built into the storage engine.
The conflict problem
User A in Tokyo updates their display name to "Alice" on Leader-Tokyo. Simultaneously, User A (from a different device) in New York updates it to "Alicia" on Leader-Virginia. Both writes succeed locally. When the leaders replicate to each other, they discover two concurrent writes to the same row. This is a write conflict.
Conflicts are the defining challenge of multi-leader replication. You must decide which write wins — and the answer is never obvious.
Conflict detection. How does the system even know a conflict happened? Each write is tagged with a version vector or logical timestamp. When replication delivers a write from another leader, the receiving leader compares it to its own state. If the write updates a row that was also updated locally since the last sync, a conflict is detected. The detection granularity matters: row-level detection catches any concurrent update to the same row, while column-level detection only flags conflicts when the same column was modified by both leaders.
Conflict resolution strategies
Last-writer-wins (LWW) — attach a timestamp to each write. The write with the higher timestamp wins. Simple, but dangerous: clocks across datacenters are never perfectly synchronized (NTP drift can be milliseconds to seconds). A "later" write by wall-clock time might actually have happened first in real time. LWW silently discards one write without telling the user.
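The danger is easy to demonstrate. In this sketch, Tokyo's clock runs fast, so its write gets the higher timestamp even though Virginia's write happened later in real time; LWW keeps the wrong value and discards the other without any error (values and timestamps are illustrative):

```python
def lww_resolve(a, b):
    """Last-writer-wins over (timestamp, value) pairs: the higher
    timestamp wins and the other write is silently discarded."""
    return a if a[0] >= b[0] else b

# Tokyo's clock is 2s fast, so "Alice" is stamped 102 even though
# Virginia wrote "Alicia" (stamped 101) later in real time.
winner = lww_resolve((102, "Alice"), (101, "Alicia"))
# "Alicia" is lost with no error surfaced to either user.
```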
Application-level resolution — surface the conflict to the application. For a collaborative document editor, this means merging both edits (CRDTs). For a shopping cart, it means taking the union of items. For a user profile, it might mean showing the user both versions and asking them to pick. This is what Amazon's Dynamo-style systems do.
Conflict-free replicated data types (CRDTs) — data structures designed so concurrent updates always converge to the same state without coordination. G-Counters, OR-Sets, and LWW-Registers are examples. CRDTs eliminate conflicts by construction, but only work for specific data types — you cannot build a general-purpose relational database on CRDTs alone.
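A G-Counter makes the "converge by construction" property concrete: each node increments only its own slot, and merge takes the per-node maximum, so merges are commutative, associative, and idempotent regardless of delivery order. A minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT. Each node increments its own slot;
    merging takes the per-node max, so replicas always converge."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> count contributed by that node

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: applying a merge twice changes nothing.
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)
```

Note that G-Counters only grow; supporting decrements requires a PN-Counter (two G-Counters, one for increments and one for decrements).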
Multi-leader topologies
All-to-all — every leader replicates to every other leader. Most resilient (any leader failing does not break replication between the remaining leaders) but generates the most network traffic (N*(N-1) replication streams for N leaders).
Circular — each leader replicates to the next in a ring. Simpler, less traffic, but a single leader failure breaks the ring and requires reconfiguration.
Star (hub-and-spoke) — one central leader receives from all others and forwards to all others. Easy to add new leaders, but the hub is a single point of failure.
In practice, most multi-leader deployments use all-to-all with 2-3 leaders (one per region). Beyond 3 leaders, the conflict resolution complexity grows faster than the latency benefits.
In interviews, if you propose multi-leader replication, the interviewer will immediately ask about conflict resolution. Have a concrete strategy ready: LWW for non-critical data (user preferences, analytics), application-level merge for critical data (shopping carts, collaborative docs). Saying 'we will handle conflicts' without specifying how is a red flag.
Where it fits — and where it does not
Good fit: Globally distributed systems where users write from multiple regions and local-latency writes are a hard requirement — multi-region SaaS products, collaborative editing tools, offline-first mobile apps that sync later.
Bad fit: Systems that require strong consistency (banking, inventory). If two ATMs dispense cash for the same account simultaneously in different regions, no conflict resolution strategy can un-dispense the money. For these, use leader-follower with a single leader per data partition and accept the write latency.
Operational complexity. Multi-leader replication is significantly harder to operate than leader-follower. Schema migrations must be coordinated across all leaders simultaneously (rolling schema changes can cause replication failures if leaders disagree on the schema). Auto-increment primary keys cause conflicts (two leaders generate the same ID). Most multi-leader systems use UUIDs or per-region ID prefixes (US-000001, EU-000001) to avoid this. Monitoring must track replication lag between every pair of leaders, not just leader-to-follower. Debugging data inconsistencies requires comparing state across leaders and understanding which conflict resolution path was taken.
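The per-region ID prefix scheme mentioned above can be sketched in a few lines. The `US-000001` format comes from the text; the factory function itself is an illustrative construction, and a real deployment would persist the counter durably rather than keep it in memory:

```python
import itertools

def region_id_factory(region):
    """Return a generator of per-region primary keys like 'US-000001',
    so two leaders can never mint the same ID."""
    counter = itertools.count(1)
    return lambda: f"{region}-{next(counter):06d}"

next_us_id = region_id_factory("US")
next_eu_id = region_id_factory("EU")
# next_us_id() -> "US-000001", next_eu_id() -> "EU-000001":
# disjoint keyspaces, no coordination between leaders required.
```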
Testing is hard. You cannot easily reproduce multi-leader conflicts in a development environment. A conflict requires two writes to the same key arriving at different leaders within the replication lag window — milliseconds in the same datacenter, seconds across regions. Teams often discover conflict resolution bugs only in production. This operational risk is the biggest reason to avoid multi-leader unless the latency requirements truly demand it.
When multi-leader is worth the pain. Despite the complexity, multi-leader is the right choice when users in multiple regions need sub-50ms write latency and the data model supports conflict resolution. Google Docs uses a form of multi-leader (operational transforms) where every client can write locally and conflicts are merged automatically. Notion, Figma, and other collaborative tools use similar approaches with CRDTs. The common thread: the data model is designed from the start to be merge-friendly, not retrofitted after the fact.
What if you removed the leader entirely? In leaderless replication, the client writes to multiple nodes simultaneously and reads from multiple nodes simultaneously. There is no single point of authority. Consistency comes from overlapping write and read sets — the quorum.
This design was popularized by Amazon's Dynamo paper (2007) and implemented in Cassandra, Riak, and Voldemort. DynamoDB (the managed AWS service) uses similar principles internally, though it abstracts the quorum knobs away from the user.

The quorum formula
For a cluster of N replicas, a write succeeds when W nodes acknowledge, and a read succeeds when R nodes respond. As long as W + R > N, at least one node in every read set has the latest write. This overlap guarantees the reader sees the most recent value.
Example: N=3, W=2, R=2. A write to nodes A, B, C succeeds when any 2 acknowledge. A read queries any 2 nodes. Since W+R = 4 > 3, at least one node in the read set participated in the write — so the reader gets the latest value (it picks the one with the higher version number).
Tuning the knobs:
- W=1, R=N — fast writes, slow reads (write to any one node, read from all to find latest)
- W=N, R=1 — slow writes, fast reads (write to all, read from any one)
- W=N/2+1, R=N/2+1 — balanced (majority quorum on both sides)
Why W+R > N works. If you write to W nodes and read from R nodes, and W+R exceeds N, then the write set and the read set must overlap by at least one node (pigeonhole principle). That overlapping node has the latest value, so the reader can identify it by comparing version numbers. If W+R equals N or less, there is no guaranteed overlap, and the reader might miss the latest write.
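The pigeonhole argument can be checked exhaustively for small clusters: enumerate every possible W-node write set and R-node read set and verify they intersect. A sketch:

```python
from itertools import combinations

def quorum_overlaps(n, w, r):
    """True iff every W-node write set intersects every R-node read
    set among N replicas -- i.e. quorum reads always see the write."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

# W + R > N guarantees overlap (N=3, W=2, R=2 -> True);
# W + R <= N does not (N=3, W=2, R=1 -> False).
```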
How reads resolve stale data
Even with quorum, individual nodes can be stale. The reader queries R nodes, gets back potentially different versions, and picks the one with the highest version number or vector clock. Two mechanisms bring stale nodes up to date:
Read repair — when a read detects a stale response from one node, the client (or coordinator) writes the latest value back to that node. Stale data gets fixed lazily, as reads happen. This works well for frequently-read keys but leaves rarely-read keys stale indefinitely.
Anti-entropy — a background process continuously compares data across nodes and copies missing updates. Uses Merkle trees to efficiently identify which key ranges differ without comparing every record. This catches stale data that read repair misses, but consumes background I/O and network bandwidth.
Version vectors (vector clocks) — each node maintains a vector of counters, one per node. When node A writes, it increments its own counter. When nodes exchange data, they compare vectors to determine which version is newer, or whether two versions are concurrent (neither is strictly newer). Concurrent versions indicate a conflict that must be resolved by the application.
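The comparison logic is the heart of conflict detection: version A dominates B only if it is greater than or equal in every slot; if each vector is ahead in some slot, the versions are concurrent. A sketch (the string return values are illustrative):

```python
def compare_versions(va, vb):
    """Compare two version vectors (dicts of node -> counter).
    Returns 'a_newer', 'b_newer', 'equal', or 'concurrent'
    (a conflict the application must resolve)."""
    nodes = set(va) | set(vb)
    a_ge = all(va.get(n, 0) >= vb.get(n, 0) for n in nodes)
    b_ge = all(vb.get(n, 0) >= va.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a_newer"
    if b_ge:
        return "b_newer"
    # Each version saw writes the other did not: concurrent.
    return "concurrent"
```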
The quorum formula W + R > N is not about consistency — it is about overlap. As long as the write set and read set overlap by at least one node, you are guaranteed to see the latest write. The formula lets you tune the trade-off: make writes cheap (low W) and reads expensive (high R), or vice versa. In practice, most systems use W=R=majority because it balances latency and fault tolerance equally on both paths.
Sloppy quorums and hinted handoff
What happens when a required node is temporarily unreachable? Strict quorum would reject the write. Sloppy quorum writes to any W available nodes, even if they are not the "home" nodes for that key. The substitute node holds the data temporarily and forwards it to the correct node when it comes back online — this is hinted handoff.
Sloppy quorums increase availability (writes succeed even during partial outages) but weaken the consistency guarantee. The "temporary" node is not part of the normal read quorum, so reads during the outage might miss the write.
Cassandra and DynamoDB both use sloppy quorums by default. Cassandra's consistency_level=LOCAL_QUORUM uses strict quorum within a datacenter but sloppy across datacenters. DynamoDB abstracts this entirely — you choose between "eventual" and "strong" consistency per read, and AWS handles the quorum logic internally.
Tunable consistency per query. One of the most powerful features of leaderless systems is per-query consistency tuning. Cassandra allows you to set consistency_level per CQL statement. A write to a financial ledger might use ALL (every replica must acknowledge), while a read of a user's recent activity feed might use ONE (fastest response from any replica). This lets you optimize each query path independently rather than choosing a single consistency level for the entire database.
The trade-off is cognitive complexity. Developers must understand what each consistency level means and choose correctly for each query. A single ONE write to a ledger table creates a data integrity risk. Code reviews must check consistency levels as carefully as they check SQL correctness.
Common Cassandra consistency patterns:
- LOCAL_QUORUM for writes, LOCAL_QUORUM for reads — strong consistency within a datacenter, eventual across datacenters. Most common production pattern.
- ONE for writes, ONE for reads — maximum speed, eventual consistency. Good for time-series data where slight staleness is acceptable.
- ALL for writes, ONE for reads — slow writes but fast reads with guaranteed freshness. Good for infrequently-written configuration data.
- EACH_QUORUM for writes — quorum in every datacenter. Ensures data is durable in all regions before acknowledging. Slowest but most durable option for cross-region consistency.
Choosing per-query consistency is one of the most powerful tools in the distributed database toolkit, but it requires discipline. Document your consistency choices per table and per query path, and enforce them in code reviews. Consider wrapping consistency level selection in a repository layer so individual developers do not need to remember the correct setting for each table — centralize the decision where it can be audited and tested.
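The repository-layer idea can be sketched as a small lookup table that centralizes the choice per table and operation. Table names and levels here are illustrative; in a real deployment the returned level would be passed to the driver (e.g. the Cassandra driver's per-statement consistency level) rather than used as a string:

```python
# Audited, code-reviewed consistency policy: one place to inspect
# and test, instead of scattered per-query settings.
CONSISTENCY_POLICY = {
    ("ledger", "write"): "ALL",
    ("ledger", "read"): "LOCAL_QUORUM",
    ("activity_feed", "write"): "ONE",
    ("activity_feed", "read"): "ONE",
}

def consistency_for(table, op):
    """Look up the policy; default to LOCAL_QUORUM so an unlisted
    table fails safe rather than fast-and-loose."""
    return CONSISTENCY_POLICY.get((table, op), "LOCAL_QUORUM")
```

A unit test over this table is also a cheap way to catch a dangerous change (say, ledger writes dropping to ONE) in code review.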
Where it fits
Good fit: High-availability systems where writes must never be rejected — shopping carts, session stores, DNS, and other systems where AP (availability + partition tolerance) is preferred over CP (consistency + partition tolerance).
Bad fit: Systems requiring strong consistency or transactions across multiple keys. Leaderless replication provides per-key consistency at best. If you need "transfer $100 from account A to B" as an atomic operation, quorum-based replication alone is not sufficient — you need a consensus protocol (Paxos, Raft) or a distributed transaction coordinator.
Comparing all three topologies
| Property | Leader-Follower | Multi-Leader | Leaderless |
| --- | --- | --- | --- |
| Write target | Single leader | Any leader | Any node |
| Write conflicts | Impossible | Possible | Possible |
| Read scalability | Add followers | Limited | Add replicas |
| Write latency | Network to leader | Local | Quorum RTT |
| Failover | Promote follower | Automatic (other leaders) | No single point of failure |
| Complexity | Low | High (conflicts) | Medium (quorum tuning) |
| Best for | OLTP, consistency | Multi-region writes | High availability, AP |
The choice is not always one or the other. Many real systems combine topologies: a leader-follower setup per shard (for consistency), with shards distributed across regions. Within each shard, the leader handles writes; across shards, the system scales horizontally. This hybrid approach gives you the consistency of leader-follower with the scalability of distributed systems.
Decision framework for interviews
When asked about replication in a system design interview, follow this decision tree:
- Is the workload read-heavy with moderate writes? Start with leader-follower. Most OLTP systems fall here.
- Do writes need to happen in multiple regions with local latency? Consider multi-leader, but only if you can explain your conflict resolution strategy.
- Is availability more important than consistency? Consider leaderless with quorum.
- Do you need both high availability AND strong consistency? Use consensus-based replication (Raft, Paxos). Systems like CockroachDB, etcd, and Consul use this approach — it is effectively leader-follower with automatic, safe leader election built into the protocol.
In most interviews, leader-follower with read replicas is the correct starting point. Only escalate to multi-leader or leaderless if the interviewer pushes on specific requirements (multi-region writes, extreme availability) that leader-follower cannot meet.
Real-world examples by topology:
- Leader-follower: PostgreSQL with streaming replication (most SaaS backends), MySQL with read replicas (WordPress, Shopify), MongoDB replica sets (document-heavy apps), Redis primary/replica (caching layers).
- Multi-leader: CockroachDB (distributed SQL with Raft per range), PostgreSQL BDR (multi-master PostgreSQL for geo-distributed writes), MySQL Group Replication, Cassandra multi-datacenter (LWW-based).
- Leaderless: Amazon DynamoDB (managed, abstracts quorum), Apache Cassandra (configurable consistency levels per query), Riak (Dynamo-style with CRDTs), Voldemort (LinkedIn's internal KV store).
Notice that some systems span categories depending on configuration. Cassandra can behave as leaderless (within a datacenter with tunable consistency) or multi-leader (across datacenters where each DC has its own coordinator). The topology is not always a fixed property of the database — it can be a deployment choice.
The CAP theorem connection. The replication topology you choose directly maps to your CAP preference:
- CP (consistency + partition tolerance): Leader-follower with synchronous replication. During a partition, the minority side refuses writes (unavailable) rather than risk divergence. Examples: etcd, ZooKeeper, PostgreSQL with synchronous standbys.
- AP (availability + partition tolerance): Multi-leader or leaderless. During a partition, all sides continue accepting writes (available) and reconcile conflicts after the partition heals. Examples: Cassandra, DynamoDB, Riak.
No replication topology gives you all three (CA is only achievable without network partitions, which is unrealistic in distributed systems). Understanding this trade-off is essential for system design interviews — it connects the theoretical CAP theorem to the practical replication decision.
In practice, most systems are "mostly CP" or "mostly AP" depending on configuration. Cassandra with LOCAL_QUORUM consistency behaves like CP within a datacenter but AP across datacenters. PostgreSQL with synchronous replication is CP but becomes unavailable if the synchronous standby goes offline. The binary CP vs. AP classification is a simplification — real systems exist on a spectrum.