Sagas: Distributed Transactions Without the Global Lock
December 26, 2025
When a business operation spans three services and four databases, the classic answer is a distributed transaction. The classic answer does not work.
Two-phase commit holds locks across participants until the coordinator finishes. If the coordinator dies after participants have voted to commit but before they learn the outcome, every participant is stuck waiting, unable to commit or release. Three-phase commit removes that blocking only under assumptions (bounded message delays, no network partitions) that the public internet violates daily. In practice few teams use either for cross-service flows.
What real microservice architectures use instead is a saga. The idea is small and the consequences are large. You break the big operation into a sequence of local transactions, one per service, each on its own database. Each step also defines a compensating action that semantically undoes it if a later step fails.
Picture a booking flow: reserve a seat, charge the card, issue the ticket, send the email. If the charge fails after the seat is reserved, you do not roll back the database row. You publish a ReleaseSeat command. The seat service runs a normal local transaction that marks the seat available again. The state moves forward through compensation, never backward through rollback.
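To make "forward through compensation" concrete, here is a minimal sketch of the seat service using Python's built-in sqlite3 module. The table names, column names, and the seat_events log are assumptions made for illustration; the post does not prescribe a schema. The point is that release_seat is an ordinary new transaction, and the history keeps both events.

```python
import sqlite3

# Hypothetical schema for the seat service; all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (seat_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE seat_events (id INTEGER PRIMARY KEY AUTOINCREMENT, seat_id TEXT, event TEXT)"
)
conn.execute("INSERT INTO seats VALUES ('12A', 'available')")
conn.commit()


def reserve_seat(seat_id: str) -> bool:
    """Forward step: a local transaction that commits on its own."""
    with conn:  # commits on normal exit, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE seats SET status = 'reserved' "
            "WHERE seat_id = ? AND status = 'available'",
            (seat_id,),
        )
        if cur.rowcount == 0:
            return False  # seat missing or already taken
        conn.execute(
            "INSERT INTO seat_events (seat_id, event) VALUES (?, 'SeatReserved')",
            (seat_id,),
        )
        return True


def release_seat(seat_id: str) -> None:
    """Compensation: a new local transaction, not a rollback of the old one."""
    with conn:
        cur = conn.execute(
            "UPDATE seats SET status = 'available' "
            "WHERE seat_id = ? AND status = 'reserved'",
            (seat_id,),
        )
        if cur.rowcount:  # only log a release that actually happened
            conn.execute(
                "INSERT INTO seat_events (seat_id, event) VALUES (?, 'SeatReleased')",
                (seat_id,),
            )


reserve_seat("12A")  # step 1 commits
release_seat("12A")  # the charge failed later, so we compensate
history = [row[0] for row in conn.execute("SELECT event FROM seat_events ORDER BY id")]
status = conn.execute("SELECT status FROM seats WHERE seat_id = '12A'").fetchone()[0]
```

After both calls the seat is available again, and the event log still shows that it was reserved and then released. A rollback would have erased that history. Compensation adds to it.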
There are two ways to drive a saga.
- Orchestration. A single saga orchestrator service holds the workflow definition. It calls each step, waits for the response, and decides what to do next. On failure it explicitly issues compensations in reverse order. Easy to reason about because the flow is in one place. The orchestrator is a critical service you have to operate.
- Choreography. No central brain. Each service subscribes to events and reacts. Payment listens for SeatReserved and emits PaymentCharged or PaymentFailed. The seat service listens for PaymentFailed and releases the seat on its own. More decoupled, but the workflow is invisible. Debugging means stitching it together from logs across services.
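The two styles can be sketched side by side. What follows is a toy, in-process illustration, not a production design: Step, run_saga, and EventBus are names invented for this example, and a real system would use a durable workflow store and a real message broker.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]


def run_saga(steps: list[Step]) -> bool:
    """Orchestration: one function calls each step and compensates in reverse."""
    done: list[Step] = []
    for step in steps:
        try:
            step.action()
            done.append(step)
        except Exception:
            # Simplification: the failed step itself is not compensated here.
            for finished in reversed(done):
                finished.compensate()
            return False
    return True


class EventBus:
    """Choreography: an in-process stand-in for a message broker."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.log: list[str] = []

    def subscribe(self, event: str, handler: Callable[[dict], None]) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.log.append(event)
        for handler in list(self.handlers[event]):
            handler(payload)


# Orchestrated run: the charge fails, so the seat reservation is compensated.
trace: list[str] = []


def charge_card() -> None:
    raise RuntimeError("card declined")


ok = run_saga([
    Step("reserve_seat", lambda: trace.append("reserve"), lambda: trace.append("release")),
    Step("charge_card", charge_card, lambda: trace.append("refund")),
    Step("issue_ticket", lambda: trace.append("issue"), lambda: trace.append("void")),
])

# Choreographed run: same outcome, but no component knows the whole flow.
bus = EventBus()
seats = {"12A": "reserved"}
# Payment service: here it always fails, to exercise the compensation path.
bus.subscribe("SeatReserved", lambda e: bus.publish("PaymentFailed", e))
# Seat service: reacts to the failure by releasing the seat.
bus.subscribe("PaymentFailed", lambda e: seats.update({e["seat"]: "available"}))
bus.publish("SeatReserved", {"seat": "12A"})
```

In the orchestrated half, the whole flow is readable in the list passed to run_saga. In the choreographed half, the only record of what happened is bus.log, which is the in-memory equivalent of stitching the story together from logs.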
Either way, compensating actions are the load-bearing piece, and they have a sharp edge. They must be idempotent, because the orchestrator or event bus will retry them. They must handle the case where the action they are compensating already partially completed. And they have to encode business logic: a refund is not a delete, an apology email is not an unsent message. There is no generic undo.
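As a rough sketch of those three requirements, consider a hypothetical in-memory PaymentService keyed by a saga ID. The class, its fields, and the return strings are all assumptions for illustration. A real service would persist this state in the same local transaction as the refund.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Hypothetical in-memory payment service; real state would be persisted."""

    charges: dict[str, int] = field(default_factory=dict)  # saga_id -> cents
    compensated: set[str] = field(default_factory=set)     # saga_ids already undone
    refunds_issued: int = 0

    def charge(self, saga_id: str, amount: int) -> None:
        # A charge that arrives after its own compensation must not go through.
        if saga_id in self.compensated:
            raise RuntimeError("saga already compensated")
        self.charges[saga_id] = amount

    def compensate_charge(self, saga_id: str) -> str:
        # Idempotency: a retried compensation is a no-op that reports success.
        if saga_id in self.compensated:
            return "already_compensated"
        # Partial completion: the charge never landed, so there is nothing to
        # refund, but we remember the decision in case the charge shows up late.
        if saga_id not in self.charges:
            self.compensated.add(saga_id)
            return "nothing_to_refund"
        # Business logic: issue a refund and keep the original charge on record.
        self.refunds_issued += 1
        self.compensated.add(saga_id)
        return "refunded"


svc = PaymentService()
svc.charge("saga-1", 4200)
results = [svc.compensate_charge("saga-1") for _ in range(3)]  # bus retries
empty = svc.compensate_charge("saga-2")  # forward step never ran
```

Three deliveries of the same compensation produce exactly one refund, and the charge record is kept rather than deleted.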
The production failure mode that surprises teams is the semi-permanent stuck saga. A compensation step itself fails: the downstream service is unreachable, or the data has moved on. The orchestrator now owns a half-undone state. You need a dead-letter queue for failed compensations, alerts, and a manual repair tool. Sagas trade atomicity and isolation for availability. What you get back is eventual consistency, and the eventual part needs an operator.
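One way to keep a failed compensation from vanishing silently is bounded retries followed by a dead-letter record. The sketch below is an assumption-laden illustration: DeadLetter and compensate_with_dlq are invented names, the queue is a plain list, and there is no backoff between attempts, which a real implementation would want.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DeadLetter:
    saga_id: str
    step: str
    error: str
    attempts: int


def compensate_with_dlq(
    saga_id: str,
    compensations: list[tuple[str, Callable[[], None]]],
    dead_letters: list[DeadLetter],
    max_attempts: int = 3,
) -> bool:
    """Run each compensation; park the ones that keep failing for manual repair."""
    all_ok = True
    for step, compensate in compensations:
        for attempt in range(1, max_attempts + 1):
            try:
                compensate()
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Out of retries: keep enough context to alert and repair.
                    dead_letters.append(DeadLetter(saga_id, step, str(exc), attempt))
                    all_ok = False
    return all_ok


dlq: list[DeadLetter] = []
calls = {"refund": 0}
released: list[str] = []


def flaky_refund() -> None:
    calls["refund"] += 1
    raise ConnectionError("payment service unreachable")


ok = compensate_with_dlq(
    "saga-1",
    [("refund_charge", flaky_refund), ("release_seat", lambda: released.append("12A"))],
    dlq,
)
```

This version keeps going after a compensation is dead-lettered, so the seat is still released even though the refund is stuck. Whether to continue or halt at the first failure is a business decision; either way, the dead-letter entry is what the alert and the repair tool work from.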
Pick orchestration when the flow is complex. Pick choreography when teams are independent. Reach for sagas instead of 2PC whenever the alternative is downtime.
A saga is a sequence of local transactions, each with a compensating action that semantically undoes it. You do not roll back across services. You publish an apology.
Originally posted on LinkedIn.