Raft Consensus: Leader Election, Log Replication, and a Slow-Disk Failure

March 5, 2026

Without consensus you get split brain. Two nodes both think they are in charge, both accept writes, and you spend the next week reconciling a database against itself. Raft exists so that does not happen.

Raft decomposes consensus into three problems and solves each on its own.

Leader election. Every server is a follower, candidate, or leader. Followers expect heartbeats from a leader within a randomized election timeout, typically 150 to 300 ms. If no heartbeat arrives in that window, the follower becomes a candidate: it bumps its term, votes for itself, and asks the others for votes. A server grants at most one vote per term, and grants it to the first candidate it hears from whose log is at least as up-to-date as its own. A candidate that collects votes from a majority becomes leader. Randomized timeouts make split votes rare.
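A minimal sketch of the voting rule described above, in Python. The names Voter, handle_request_vote, pick_election_timeout, and the log_ok flag are hypothetical and chosen for illustration; a real implementation would also persist the term and vote to disk before replying.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Timeout range quoted in the text, in milliseconds.
ELECTION_TIMEOUT_MS = (150, 300)


def pick_election_timeout(rng: random.Random) -> float:
    """Each follower draws its own timeout so they rarely expire together."""
    return rng.uniform(*ELECTION_TIMEOUT_MS)


@dataclass
class Voter:
    current_term: int = 0
    voted_for: Optional[str] = None

    def handle_request_vote(self, candidate_id: str, term: int, log_ok: bool) -> bool:
        """Return True if the vote is granted.

        log_ok is assumed to be computed by the caller: True when the
        candidate's log is at least as up-to-date as this voter's.
        """
        if term < self.current_term:
            return False  # stale candidate
        if term > self.current_term:
            # A newer term resets the vote for that term.
            self.current_term = term
            self.voted_for = None
        if log_ok and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False  # already voted for someone else this term
```

Under these assumptions, a second candidate asking in the same term is refused, while the same candidate retrying gets the same answer it got the first time.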

Log replication. The leader is the only node that accepts client writes. It appends each command to its local log, then sends AppendEntries to every follower. Once a majority of the cluster, counting the leader itself, has stored the entry, the leader commits it and replies to the client. Followers apply committed entries to their state machines in the same order. If a follower's log conflicts with the leader's, the follower's conflicting entries are overwritten.
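One way to picture the commit rule is as a calculation over how far each voter's log has been replicated. The sketch below is illustrative only: commit_index and its parameters are hypothetical names, log indexes are 1-based, and it includes Raft's additional rule that a leader only commits by counting replicas for entries from its own term.

```python
from typing import Sequence


def commit_index(
    match_index: Sequence[int],
    log_terms: Sequence[int],
    current_term: int,
    prev_commit: int,
) -> int:
    """Return the highest log index the leader may treat as committed.

    match_index: highest index known to be stored on each voter,
                 including the leader itself (0 means nothing stored).
    log_terms:   log_terms[i - 1] is the term of entry i in the leader's log.
    """
    majority = len(match_index) // 2 + 1
    # After sorting high to low, the value at position majority - 1 is
    # stored on at least `majority` voters.
    ranked = sorted(match_index, reverse=True)
    candidate = ranked[majority - 1]
    if candidate > prev_commit and log_terms[candidate - 1] == current_term:
        return candidate
    return prev_commit
```

In a three-node cluster with match indexes 5, 5, and 2, the slow third node does not hold back the commit index, which is why the cluster in the failure story later in this post looked healthy.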

Safety. Leader completeness says any entry committed in a previous term is present in the log of every future leader. Raft enforces this by only granting votes to candidates whose logs are at least as up-to-date as the voter's, so a stale candidate cannot win.
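The "at least as up-to-date" comparison is small enough to write out. In Raft it compares the term of the last log entry first and falls back to log length on a tie. The function name and parameters below are hypothetical, a sketch rather than any particular library's API.

```python
def candidate_log_is_up_to_date(
    cand_last_term: int,
    cand_last_index: int,
    voter_last_term: int,
    voter_last_index: int,
) -> bool:
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if cand_last_term != voter_last_term:
        # A later last term wins regardless of log length.
        return cand_last_term > voter_last_term
    # Same last term: the longer (or equal) log wins.
    return cand_last_index >= voter_last_index
```

Note that a short log ending in a newer term beats a long log ending in an older one; length only breaks ties.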

Majority is the magic number. Two of three, three of five. Any two majorities of the same cluster must intersect in at least one node, so every winning candidate got a vote from at least one node that holds every committed entry, and the vote restriction means the winner's log is at least as up-to-date as that node's. Paxos has the same property, but Raft was deliberately designed to be teachable.
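The intersection claim can be checked by brute force for small clusters. This is a sanity check under the usual definition of majority (more than half the nodes), not a proof; quorum_size and all_majorities_intersect are names made up for this sketch.

```python
from itertools import combinations


def quorum_size(n: int) -> int:
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


def all_majorities_intersect(n: int) -> bool:
    """Check that every pair of minimum-size majorities shares a node.

    Larger majorities are supersets of minimum-size ones, so checking
    the minimum size is enough.
    """
    quorums = [set(c) for c in combinations(range(n), quorum_size(n))]
    return all(a & b for a in quorums for b in quorums)
```

The arithmetic behind it: two groups of size n // 2 + 1 add up to more than n nodes, so they cannot be disjoint.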

The production failure worth remembering. A 3-node Raft cluster had one node on a degraded SSD that fell perpetually behind in AppendEntries throughput. The leader and the healthy follower formed a 2/3 majority and the cluster looked fine. On a routine graceful leader handoff, the new leader had to wait for the slow node to catch up to the committed index before accepting writes. The slow node never caught up before the election timeout expired. Writes paused for 90 seconds.

Two fixes shipped. Enable the pre-vote extension so the slow node could not disrupt elections by bumping terms. And demote the slow-disk node from voter to non-voting learner, so it no longer counted toward quorum or held up leader handoff. Same physical cluster, far less coupled availability.
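To make the learner change concrete, here is a rough sketch of quorum counting that ignores non-voting members. Member and has_quorum are hypothetical names, and the node names are placeholders; real Raft libraries handle membership changes through their own configuration-change mechanisms.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Member:
    name: str
    voter: bool  # False means non-voting learner


def has_quorum(members: Iterable[Member], acked: Set[str]) -> bool:
    """True if a majority of voters are in `acked`; learners are ignored."""
    voters = [m.name for m in members if m.voter]
    votes = sum(1 for name in voters if name in acked)
    return votes >= len(voters) // 2 + 1


# Before the fix: the slow node is a full voter.
before = [Member("a", True), Member("b", True), Member("slow", True)]
# After the fix: the slow node still receives the log but does not vote.
after = [Member("a", True), Member("b", True), Member("slow", False)]
```

With the after configuration, an acknowledgement from the slow node no longer counts for anything, so it cannot sit on the critical path. One consequence visible in the sketch: with only two voters left, both must acknowledge, so this trades the slow node's influence for a tighter dependency on the two healthy nodes.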

Key takeaway

Raft is consensus you can actually reason about. Majority quorum keeps a three-node cluster live under a single failure, but a permanently slow voter can still kill availability during a leader handoff. Make it a learner.

Originally posted on LinkedIn.