Three-Phase Commit vs Two-Phase Commit: Why Nobody Actually Runs 3PC

December 23, 2025


Three-phase commit exists because someone looked at 2PC, noticed that a dead coordinator strands every participant, and asked the obvious question. Can we add a phase that lets participants decide on their own. The answer is yes, on paper, under assumptions that do not hold on real networks.

3PC splits the verdict step of 2PC into two. After participants vote yes in the CanCommit phase, the coordinator sends a PreCommit message that announces, in effect, the decision has been made and it is commit. Only after participants acknowledge PreCommit does the coordinator send the final DoCommit. The trick is that if the coordinator dies after PreCommit reaches everyone, surviving participants can run an election among themselves, observe that every live node is in the PreCommit state, and safely finish the commit on their own. No more indefinite locks.

The proof depends on a synchronous network model: every message either arrives within a known bound or is treated as lost. In that model, if you have not seen a PreCommit by the timeout, you can assume nobody else did either, and you abort. The real internet breaks this assumption casually. Messages reorder, get delayed by half a second behind a congested switch, then arrive after the recipient has already moved on. A single late PreCommit arriving at one participant after another participant has timed out and aborted is enough to produce an inconsistent decision.

A research team I crossed paths with implemented 3PC across two regions over the public internet, both for the academic interest and because their consultants had recommended it. Most days it worked. One afternoon a routing flap pushed traffic onto a longer path, and a PreCommit to one participant was delayed by about 200 milliseconds while another participant timed out and aborted locally. The slow participant eventually received the PreCommit, saw no abort, and committed. One node committed, the other aborted, and reconciliation took the rest of the week. The protocol did exactly what the spec said. The spec assumed a network that did not exist.

The lesson modern systems took from 3PC is not the third phase. It is that the coordinator is the problem. Replace the single coordinator with a Raft or Paxos group, and you get a coordinator role that survives node failure without giving up safety on an asynchronous network. For anything where you can model the work as steps with compensations, sagas remove the global transaction entirely. 3PC stays in textbooks. That is where it belongs.

Key takeaway

3PC fixes the 2PC blocking problem only if you assume bounded message delay and no partitions, which the internet does not give you. In production, use Raft or Paxos to elect the coordinator, or escape to a saga.

Originally posted on LinkedIn. View original.


All Rights Reserved.