Leader-Follower Replication: The Default That Hides a Cliff

March 22, 2026

The most popular replication model in production databases is also the simplest: one leader takes every write, and followers replicate the same log in the same order. Postgres streaming replication ships the WAL. MySQL ships row-based or statement-based binlog events. MongoDB ships the oplog. Redis ships a replication stream. Different formats, same shape.

This is a different design from Kafka's leader and ISR model. Kafka replicates partitions of an append-only log between brokers, and the followers exist mostly for durability. Database leader-follower replication ships state mutations and is built around two goals: durability if a node dies, and read scaling by spreading queries across followers.

Why is the model so popular? Because it eliminates write conflicts entirely. There is one authority for write ordering. Followers replay events in the same sequence the leader committed them. Failover is also straightforward in principle: pick the most up-to-date follower and promote it.
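The promotion rule can be sketched in a few lines of Python. This is only an illustration, assuming each follower reports the last log position it has replayed; the Follower type and its field names are hypothetical and not any database's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Follower:
    name: str
    applied_position: int  # hypothetical: last log position this follower has replayed


def pick_promotion_candidate(followers: list[Follower]) -> Follower:
    """Return the follower that has replayed the most of the leader's log."""
    if not followers:
        raise ValueError("no followers available to promote")
    return max(followers, key=lambda f: f.applied_position)


def lost_entries(leader_position: int, candidate: Follower) -> int:
    """Count log entries the leader committed that the candidate never replayed."""
    return max(0, leader_position - candidate.applied_position)
```

Even the best candidate can be behind the leader's last committed position, so lost_entries is the number a failover decision should look at before promoting.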

The pivotal design choice is synchronous versus asynchronous replication.

Synchronous: the leader waits for at least one follower to confirm before acknowledging the write. Stronger durability. Higher tail latency, because every write pays a round trip to a follower, and when the leader must hear from all of its synchronous followers, the slowest one sets the floor.

Asynchronous: the leader acknowledges immediately and ships changes in the background. Lower latency. Followers lag behind. If the leader crashes before the lag is closed, the unreplicated tail is gone.

Semi-synchronous is the compromise most teams reach for, with one synchronous replica and the rest async.
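The difference between the two acknowledgement paths is easier to see in a toy model. The sketch below is an in-memory simulation, not how any particular database implements replication; the Leader and Replica classes and the ship_async method are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    log: list[str] = field(default_factory=list)

    def apply(self, entry: str) -> None:
        self.log.append(entry)


@dataclass
class Leader:
    sync_replicas: list[Replica]
    async_replicas: list[Replica]
    log: list[str] = field(default_factory=list)
    pending: list[str] = field(default_factory=list)  # entries not yet shipped to async replicas

    def write(self, entry: str) -> str:
        self.log.append(entry)
        # Synchronous path: the write is not acknowledged until each
        # synchronous replica has the entry.
        for replica in self.sync_replicas:
            replica.apply(entry)
        # Asynchronous path: queue the entry and acknowledge without waiting.
        self.pending.append(entry)
        return "ack"

    def ship_async(self) -> None:
        """Background shipping. Entries still pending at a crash never reach async replicas."""
        for entry in self.pending:
            for replica in self.async_replicas:
                replica.apply(entry)
        self.pending.clear()
```

With one synchronous and one asynchronous replica, a write that has been acknowledged but not yet shipped exists on the leader and the synchronous replica only. If the leader dies at that moment, the asynchronous replica has no record of it.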

The production failure I keep watching teams hit: an async setup with a single primary and three read replicas. During a peak event the primary's IO saturates, replication lag drifts from 200 ms to 8 seconds, and nobody notices because the dashboard alert was set at 30 seconds. The primary then segfaults. Failover promotes the most up-to-date replica, which is still 8 seconds behind. Refunds that customers were issued in that window vanish. The reconciliation job runs the next morning and re-issues them, but only after support takes a few hundred angry tickets.

Lag is not a bug. It is the cost of asynchronous replication. The fix is not "make lag zero." It is alerting on lag at a threshold that maps to your acceptable data loss budget, and writing the failover runbook around that number. Followers improve scale. They do not improve freshness, and they only improve durability if you understand the window you are choosing to lose.
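One way to tie the alert to the budget is to derive both the warning and the paging threshold from a single number. This is a minimal sketch: the function name, the 50% warning fraction, and the 5-second budget used in the note that follows are assumptions for illustration, not values from the incident.

```python
def lag_alert_level(
    lag_seconds: float,
    loss_budget_seconds: float,
    warn_fraction: float = 0.5,  # assumed: warn once half the budget is consumed
) -> str:
    """Classify replication lag against an acceptable data loss budget."""
    if loss_budget_seconds <= 0:
        raise ValueError("loss budget must be positive")
    if lag_seconds >= loss_budget_seconds:
        return "page"
    if lag_seconds >= warn_fraction * loss_budget_seconds:
        return "warn"
    return "ok"
```

With a 30-second threshold, 8 seconds of lag classifies as "ok", which is exactly how the incident above went unnoticed. With a hypothetical 5-second budget, the same 8 seconds returns "page".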

The cleanest guardrails I have seen are quiet and boring. Page on replication lag in bytes, not seconds, because seconds lie when the workload is bursty. Require at least one synchronous replica for any data you cannot replay from an upstream system. Run failover drills monthly so the runbook is muscle memory, not a wiki page nobody has opened in a year.
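For Postgres, lag in bytes is the distance between two WAL positions. An LSN is printed as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit position, and the server can compute the difference itself with pg_wal_lsn_diff. The sketch below does the same arithmetic client-side. The function names and the 64 MiB paging threshold are assumptions for illustration.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN such as '0/3000060' into a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(leader_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica has not yet reached."""
    return max(0, parse_lsn(leader_lsn) - parse_lsn(replica_lsn))


def should_page(
    leader_lsn: str,
    replica_lsn: str,
    threshold_bytes: int = 64 * 1024 * 1024,  # assumed threshold, tune to your loss budget
) -> bool:
    """True when the replica is at least threshold_bytes behind the leader."""
    return lag_bytes(leader_lsn, replica_lsn) >= threshold_bytes
```

The leader position would typically come from pg_current_wal_lsn() and the replica position from the replay_lsn column of pg_stat_replication.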

Key takeaway

Leader-follower is the most popular replication model because it makes write ordering trivial. The price is replication lag, and lag becomes data loss the moment you fail over.

Originally posted on LinkedIn.