Saga Pattern for Distributed Transactions: Compensations Need Idempotency Keys Too

March 4, 2026

Two-phase commit looks elegant on a whiteboard and falls apart in production. The transaction coordinator is a single point of failure. If it crashes mid-protocol, every participant sits there holding locks, waiting for a verdict that never arrives. At scale, the coordinator itself becomes a hot path, and you end up running a distributed lock manager you did not budget for.

Sagas take a different bet. Instead of asking every service to agree atomically, you accept that the work happens in steps, each of which is a normal local transaction. If step five fails, you do not roll back the database. You run compensating actions for steps one through four, in reverse order. Reserve inventory and charge a card on the way forward; release inventory and refund the card on the way back.
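To make the forward and backward paths concrete, here is a minimal sketch of a saga runner in Python. The names Step and run_saga, and the failing book_shipping step, are illustrative assumptions and do not come from any particular framework. Each step is treated as an in-process callable standing in for a local transaction.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # forward local transaction
    compensate: Callable[[], None]  # undoes the action's business effect


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_shipping() -> None:
    # Hypothetical third step that always fails, to trigger compensation.
    raise RuntimeError("shipping service unavailable")


steps = [
    Step("reserve_inventory", lambda: log.append("reserve"), lambda: log.append("release")),
    Step("charge_card", lambda: log.append("charge"), lambda: log.append("refund")),
    Step("book_shipping", fail_shipping, lambda: log.append("cancel_shipping")),
]

ok = run_saga(steps)
print(ok, log)  # False ['reserve', 'charge', 'refund', 'release']
```

Note that this sketch only compensates steps that completed. It assumes a failed step had no effect, which is exactly the assumption that breaks when a remote call fails after doing its work.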

There are two ways to coordinate a saga. Orchestration puts a workflow engine like Temporal or Cadence at the center, calling each service in order, tracking state, and triggering compensations when something fails. Choreography is event driven: services react to messages, and there is no central brain. Orchestration is easier to debug and easier to reason about. Choreography removes the coordinator but spreads the saga logic across every service involved.
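For contrast with an orchestrator, here is a rough sketch of choreography using an in-memory event bus. EventBus, the event names, and the always-failing payment handler are assumptions made up for illustration; a real system would use a message broker and separate services.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]


class EventBus:
    """In-memory stand-in for a message broker (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict[str, str]) -> None:
        self.history.append(event_type)
        for handler in self._handlers[event_type]:
            handler(payload)


bus = EventBus()

# Inventory service: reserves on new orders, releases when payment fails.
bus.subscribe("OrderPlaced", lambda e: bus.publish("InventoryReserved", e))
bus.subscribe("PaymentFailed", lambda e: bus.publish("InventoryReleased", e))

# Payment service: reacts to reserved inventory. Here the charge always fails.
bus.subscribe("InventoryReserved", lambda e: bus.publish("PaymentFailed", e))

bus.publish("OrderPlaced", {"order_id": "order-1"})
print(bus.history)
# ['OrderPlaced', 'InventoryReserved', 'PaymentFailed', 'InventoryReleased']
```

No single function describes the whole saga here. The compensation rule lives inside the inventory service's subscription, which is the spreading of logic described above.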

Either way, the rule that gets people in trouble is this: every step must be idempotent, and every compensation must be idempotent. A saga retries. A workflow engine will retry a step that timed out, even when the actual side effect went through. If the call is not idempotent, you do the work twice.
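One common way to make a step safe to retry is to record the result under a key and return the stored result when the same key shows up again. The sketch below assumes a hypothetical IdempotentExecutor backed by an in-memory dict; the key format and the reserve function are also made up for illustration.

```python
from typing import Callable, Dict, List


class IdempotentExecutor:
    """Runs a side effect at most once per key; retries get the stored result."""

    def __init__(self) -> None:
        # Assumption: a dict stands in for a durable store.
        self._results: Dict[str, str] = {}

    def execute(self, key: str, side_effect: Callable[[], str]) -> str:
        if key in self._results:
            return self._results[key]  # retry: skip the side effect
        result = side_effect()
        self._results[key] = result
        return result


reservations: List[str] = []


def reserve() -> str:
    reservations.append("sku-42")
    return f"reservation-{len(reservations)}"


executor = IdempotentExecutor()
first = executor.execute("saga-123:reserve_inventory", reserve)
retry = executor.execute("saga-123:reserve_inventory", reserve)
print(first, retry, len(reservations))  # reservation-1 reservation-1 1
```

In a real service the side effect and the key record would need to commit together, for example in the same local database transaction. Otherwise a crash between the two reopens the duplicate window.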

Here is a production failure I have seen up close. An e-commerce checkout saga ran three steps in order: reserve inventory, charge card, send confirmation. About once every three hundred orders, the payment provider returned a 5xx after actually processing the charge. The saga interpreted that as a failure and ran the compensation: refund the card. The refund call itself sometimes hit the provider's 5xx path, even though the provider's bank integration had already processed the refund. The saga retried it. The retry was treated as a fresh request, and a second refund went out. About one customer a day got refunded twice for a single failed order, and finance noticed only after a month.

The fix applied to both directions of the saga. Every external call, forward and compensating, got an idempotency key derived from the saga ID and step name. The payment provider already supported idempotency tokens. We had simply never sent them on the compensation path. Once we did, duplicate refunds went to zero and stayed there.
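Here is a small simulation of that bug and the fix. Everything in it is an assumption for illustration: FakePaymentProvider is not a real provider's API, the "saga_id:step_name" hashing scheme is one possible derivation, and the amounts and IDs are invented. The fake provider processes the first refund but reports an error, which is the failure mode described above.

```python
import hashlib
from typing import Dict, List, Optional


def idempotency_key(saga_id: str, step_name: str) -> str:
    """Deterministic key: the same saga step always yields the same key."""
    return hashlib.sha256(f"{saga_id}:{step_name}".encode("utf-8")).hexdigest()


class FakePaymentProvider:
    """Hypothetical provider whose first call errors after processing."""

    def __init__(self) -> None:
        self.refunds: List[int] = []
        self._seen: Dict[str, str] = {}
        self._calls = 0

    def refund(self, amount_cents: int, key: Optional[str] = None) -> str:
        self._calls += 1
        if key is not None and key in self._seen:
            return self._seen[key]  # duplicate request: no new refund
        self.refunds.append(amount_cents)  # the refund is processed...
        refund_id = f"refund-{len(self.refunds)}"
        if key is not None:
            self._seen[key] = refund_id
        if self._calls == 1:
            # ...but the caller sees an error on the first call.
            raise ConnectionError("5xx after processing")
        return refund_id


def refund_with_retry(
    provider: FakePaymentProvider,
    amount_cents: int,
    key: Optional[str] = None,
    attempts: int = 3,
) -> str:
    for attempt in range(attempts):
        try:
            return provider.refund(amount_cents, key)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
    raise RuntimeError("unreachable when attempts >= 1")


without_key = FakePaymentProvider()
refund_with_retry(without_key, 4999)

with_key = FakePaymentProvider()
refund_with_retry(with_key, 4999, idempotency_key("saga-123", "refund_card"))

print(len(without_key.refunds), len(with_key.refunds))  # 2 1
```

Because the key is derived from the saga ID and step name, no state has to be passed between attempts: any retry of the same compensation computes the same key. The compensation uses its own step name, so its key never collides with the forward charge.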

Sagas are not a free upgrade over 2PC. They trade blocking for bookkeeping. The bookkeeping is where the production bugs live.

Key takeaway

Sagas replace global atomicity with per-step compensations, but they only work if every forward call and every compensation is idempotent. Without idempotency keys in both directions, your refund logic can double-refund customers under retries.

Originally posted on LinkedIn.