Two-Phase Commit Walkthrough: The Blocking Problem Nobody Wants to Own
December 22, 2025
Two-phase commit is the textbook answer for atomic writes across services that do not share a database. It is also the protocol most senior engineers will quietly refuse to run in production. Both things are true at once. Worth understanding why.
Phase one is prepare. The coordinator asks every participant: can you commit this transaction? Each participant does the real work locally: validates inputs, acquires row locks, writes the intent to its own durable log, and only then replies VOTE-YES or VOTE-NO. The durable log is load-bearing. If a participant crashes after voting yes, it must be able to come back up, read its log, and honor the vote it already cast.
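Here is a minimal sketch of that participant-side ordering, assuming a JSON-lines file as the durable log. The Participant class, its method names, and the record format are all hypothetical, invented for illustration. The only point is the sequence: lock, fsync the intent, then vote.

```python
import json
import os


class Participant:
    """Hypothetical participant that logs its vote durably before replying."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.locked = set()    # rows whose locks are currently held
        self.prepared = {}     # txid -> row, voted yes and awaiting a verdict

    def _append(self, record):
        # Flush and fsync so the record survives a crash before we reply.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def prepare(self, txid, row, amount):
        # Validate inputs and check the lock before promising anything.
        if amount <= 0 or row in self.locked:
            return "VOTE-NO"
        self.locked.add(row)
        self._append({"txid": txid, "row": row, "state": "PREPARED"})
        self.prepared[txid] = row
        return "VOTE-YES"

    def finish(self, txid, verdict):
        # Verdict is "COMMITTED" or "ABORTED"; either one releases the lock.
        self._append({"txid": txid, "state": verdict})
        self.locked.discard(self.prepared.pop(txid, None))

    @classmethod
    def recover(cls, log_path):
        # Replay the log: anything PREPARED without a verdict is still owed.
        p = cls(log_path)
        if not os.path.exists(log_path):
            return p
        with open(log_path) as f:
            for line in f:
                rec = json.loads(line)
                if rec["state"] == "PREPARED":
                    p.prepared[rec["txid"]] = rec["row"]
                    p.locked.add(rec["row"])
                else:
                    p.locked.discard(p.prepared.pop(rec["txid"], None))
        return p
```

After a restart, recover() comes back with the lock re-acquired for every transaction it voted yes on. That is the vote being honored.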
Phase two is the verdict. If every vote was yes, the coordinator writes COMMIT to its own log and broadcasts commit to all participants. Each participant applies the change and releases its locks. If any vote was no, or any participant timed out, the coordinator broadcasts abort and everyone rolls back their prepared state.
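The coordinator's side of both phases fits in a few lines. This is a sketch under assumptions, not a production coordinator: StubParticipant and run_two_phase_commit are hypothetical names, the decision log is a plain list standing in for a durable write, and a timeout is modeled as a raised TimeoutError. The post only mentions logging COMMIT; recording ABORT as well is a choice made here for simplicity.

```python
class StubParticipant:
    """Hypothetical in-memory participant; vote=None simulates a timeout."""

    def __init__(self, vote):
        self.vote = vote
        self.outcome = None

    def prepare(self, txid):
        if self.vote is None:
            raise TimeoutError(f"no reply for {txid}")
        return self.vote

    def commit(self, txid):
        self.outcome = "COMMITTED"

    def abort(self, txid):
        self.outcome = "ABORTED"


def run_two_phase_commit(txid, participants, decision_log):
    # Phase one: collect votes. A timeout counts as a no.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txid))
        except TimeoutError:
            votes.append("VOTE-NO")

    # Commit only on unanimous yes; record the decision before broadcasting.
    decision = "COMMIT" if all(v == "VOTE-YES" for v in votes) else "ABORT"
    decision_log.append((txid, decision))  # stands in for a durable log write

    # Phase two: broadcast the verdict to every participant.
    for p in participants:
        if decision == "COMMIT":
            p.commit(txid)
        else:
            p.abort(txid)
    return decision
```

Notice the gap between the vote loop and the decision_log.append line. Everything that follows in this post is about what happens when the process dies there.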
The protocol is correct. It is also blocking. After a participant votes yes, it cannot unilaterally decide what to do. It is contractually obligated to wait for the coordinator's verdict, with its locks held. If the coordinator dies after collecting all the yes votes but before every participant has heard its decision, the participants cannot tell whether the outcome was commit or abort. Every one of them sits there holding locks on rows, waiting for instructions that never come.
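The blocking window is easy to see as a tiny state machine. A rough sketch, with a hypothetical BlockedParticipant class and made-up state names: a timeout before voting lets the participant abort on its own, but a timeout after voting yes changes nothing, because the coordinator may already have decided to commit.

```python
class BlockedParticipant:
    """Hypothetical participant state machine: INIT, PREPARED, then a verdict."""

    def __init__(self):
        self.state = "INIT"
        self.locks = set()

    def prepare(self, row):
        self.locks.add(row)
        self.state = "PREPARED"
        return "VOTE-YES"

    def on_timeout(self):
        # Before voting, giving up is safe: nobody was promised anything.
        if self.state == "INIT":
            self.state = "ABORTED"
        # After voting yes there is no safe unilateral move, so nothing changes.
        return self.state

    def on_verdict(self, verdict):
        # Only the coordinator's verdict releases the locks.
        self.state = "COMMITTED" if verdict == "COMMIT" else "ABORTED"
        self.locks.clear()

    def can_lock(self, row):
        return row not in self.locks


card, ledger = BlockedParticipant(), BlockedParticipant()
votes = [card.prepare("acct-42"), ledger.prepare("acct-42")]
# The coordinator crashes here, so on_verdict() is never called.
card.on_timeout()
ledger.on_timeout()
stuck = not card.can_lock("acct-42") and not ledger.can_lock("acct-42")
```

Both participants stay PREPARED no matter how many timeouts fire, and any new transaction touching acct-42 queues behind them.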
A payment platform I worked alongside ran 2PC across a card service and a ledger service so that charges could not exist on one side without the other. One afternoon the coordinator process was killed by the OOM killer right after collecting yes votes from both participants. Both services held row locks on the affected customer accounts. For roughly four hours, no charges and no refunds could go through for the impacted shard, because every new transaction queued behind locks that were waiting on a dead coordinator. The on-call eventually walked through participant logs by hand to resolve each pending transaction.
The fix was not a better coordinator. It was getting out of 2PC for the hot path entirely. The team moved checkout to a saga with explicit compensating actions, and reserved 2PC for offline reconciliation jobs where a stall is recoverable. XA and distributed SQL engines still use 2PC under the hood, but they hide it behind features like presumed abort and aggressive coordinator failover. If availability matters in your call path, 2PC is the wrong tool.
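For contrast, here is roughly what a saga with compensating actions looks like. This is an illustrative sketch, not the team's actual checkout code: run_saga, the step functions, and the balances dict are all hypothetical. Each step commits locally right away, and a failure triggers the compensations for completed steps in reverse order.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; undo completed steps on failure."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Compensate in reverse order. No step ever waits on a coordinator.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


# Hypothetical in-memory stand-ins for the card and ledger services.
balances = {"card_charged": 0, "ledger_posted": 0}


def charge_card():
    balances["card_charged"] += 100


def refund_card():
    balances["card_charged"] -= 100


def post_ledger():
    raise RuntimeError("ledger unavailable")


def reverse_ledger():
    balances["ledger_posted"] -= 100


ok = run_saga([(charge_card, refund_card), (post_ledger, reverse_ledger)])
```

The trade is visible in the code: between charge_card and refund_card the charge really exists on one side without the other. Nothing holds a lock while waiting, which is the whole point, but you give up strict atomicity to get it.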
2PC is safe on paper and dangerous in production. After participants vote yes, a crashed coordinator strands every participant with locks held. Reach for sagas anywhere availability matters more than strict atomicity.
Originally posted on LinkedIn.