Idempotency and Retries: The Pattern Every State-Changing API Owes You
December 27, 2025
Retries are not optional in any system that talks to a network. The network drops connections, load balancers reset, gateways time out, and any RPC that does not retry will fail in production within a week. The catch is that a retry without idempotency is a duplicate side effect waiting to happen, and the duplicate is almost always worse than the original failure.
The pattern is small and well understood. The client generates a unique idempotency key per logical request, usually a UUID v4 stored alongside the request so retries reuse it. The server, on receipt, looks up the key in a dedup store. If the key exists with a completed response, the server returns the stored response and skips the side effect entirely. If the key exists with an in-flight marker, the server returns 409 or waits, depending on your contract. If the key is new, the server processes the request, stores the response keyed by the idempotency key, and returns. The TTL on the dedup entry has to exceed your worst-case retry window plus clock skew, and 24 hours is the number Stripe picked for a reason.
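The server side of that flow fits in a few dozen lines. What follows is a minimal in-memory sketch, not a production store: `IdempotencyStore`, `handle`, and the `(status, body)` response tuple are names and shapes I am assuming for illustration, and dropping the in-flight marker when the side effect raises is one possible contract among several.

```python
import threading
import time

IN_FLIGHT = object()  # sentinel marking a key whose request is still being processed


class IdempotencyStore:
    """In-memory dedup store (illustrative; a real one must be shared by every replica)."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}  # key -> (expires_at, stored response or IN_FLIGHT)

    def handle(self, key, side_effect):
        """Process a new key, replay a completed one, reject an in-flight one."""
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and entry[0] <= self._clock():
                entry = None  # TTL elapsed, treat the key as new
            if entry is not None:
                if entry[1] is IN_FLIGHT:
                    return (409, {"error": "request with this key is still in flight"})
                return entry[1]  # completed earlier: replay it, skip the side effect
            self._entries[key] = (self._clock() + self._ttl, IN_FLIGHT)

        try:
            response = side_effect()  # runs outside the lock
        except Exception:
            with self._lock:
                self._entries.pop(key, None)  # assumption: a failed attempt frees the key
            raise

        with self._lock:
            self._entries[key] = (self._clock() + self._ttl, response)
        return response
```

Calling `store.handle(key, charge_card)` twice with the same key runs `charge_card` once and returns the same response both times. A dict inside one process only protects that process; with more than one replica, the store has to be shared and the "insert if absent" step has to be atomic.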
Retry only on transient errors. Connection timeouts, 502s, 503s, 504s: retry with exponential backoff and jitter. 400, 401, 403, 422: do not retry, the request is broken and a retry will not fix it. 500 is ambiguous and is worth one cautious retry. Distinguishing transient from permanent in client code is half the battle.
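Those rules translate into a small classifier plus a backoff function. This is a sketch under assumptions: `send` is a hypothetical callable that returns `(status, body)` or raises `TimeoutError`, it is assumed to reuse the same idempotency key on every call, and the base delay, cap, and retry budget are placeholder numbers rather than recommendations.

```python
import random
import time

RETRYABLE_STATUSES = {502, 503, 504}


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Exponential backoff with full jitter: a random delay below min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def should_retry(status, retries_so_far, max_retries=4):
    """status is None when the call timed out before any response arrived."""
    if retries_so_far >= max_retries:
        return False
    if status is None or status in RETRYABLE_STATUSES:
        return True
    if status == 500:
        return retries_so_far == 0  # ambiguous: retry only if nothing was retried yet
    return False  # 2xx needs no retry; 400, 401, 403, 422 will not be fixed by one


def call_with_retries(send, max_retries=4, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or the retry budget runs out."""
    retries = 0
    while True:
        try:
            status, body = send()
        except TimeoutError:
            status, body = None, None
        if not should_retry(status, retries, max_retries):
            return status, body  # (None, None) means every try timed out
        sleep(backoff_delay(retries))
        retries += 1
```

Keeping the classification in one function means the list of retryable statuses lives in exactly one place, which makes it easy to review when the contract with the server changes.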
The production failure that taught me this pattern: a checkout flow called the payment provider with a 5-second HTTP timeout. The provider charged the card in 4.8 seconds, but the response packet was lost during a load balancer rotation. The client treated the timeout as a transient failure and retried. The second call hit a different replica behind the provider, which had no record of the first, and charged the card a second time. The provider returned 200 for both calls. The customer's statement showed two charges for one order, the support team filed a chargeback, and the postmortem was a long conversation about why "the network is reliable in this region" had ever been an acceptable defense. The fix took an afternoon: the client started sending an Idempotency-Key header generated once per checkout attempt, and the provider's API began returning the stored response on the retry instead of charging again.
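The client half of that fix is mostly about where the key is created. In the sketch below, `CheckoutAttempt`, `charge`, the `/charges` path, and the `post` transport callable are all hypothetical names for illustration. The point is that the key is generated when the attempt object is created, not inside the retry loop, so every retry carries the same header.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CheckoutAttempt:
    order_id: str
    amount_cents: int
    # Generated once, when the attempt is created, and reused by every retry.
    idempotency_key: str = field(default_factory=lambda: str(uuid.uuid4()))

    def headers(self):
        return {
            "Idempotency-Key": self.idempotency_key,
            "Content-Type": "application/json",
        }


def charge(attempt, post, max_tries=3):
    """Send the charge, retrying on timeout with the same key each time.

    post(path, headers, body) is an assumed transport callable that raises
    TimeoutError when no response arrives. Backoff is omitted for brevity.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    body = {"order_id": attempt.order_id, "amount_cents": attempt.amount_cents}
    for tries_left in range(max_tries - 1, -1, -1):
        try:
            return post("/charges", attempt.headers(), body)
        except TimeoutError:
            if tries_left == 0:
                raise  # out of tries: surface the ambiguous failure to the caller
```

If the key were generated inside `charge` on each loop iteration, the retry would look like a brand-new request to the server, which is exactly the double-charge scenario above.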
This pattern generalizes. Any HTTP call, any RPC, any queue handler, any database write that could be retried after an ambiguous failure needs the same shape. Idempotency keys are the contract that makes at-least-once delivery safe everywhere, not just in Kafka.
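For a queue handler that writes to a database, one way to get the same shape is to record the key in the same transaction as the write and let a uniqueness constraint reject the redelivery. The sketch below uses Python's built-in `sqlite3` with an in-memory database; the table names, the message fields, and `handle_message` are assumptions for illustration, and TTL-based pruning of the `processed` table is left out.

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE ledger (order_id TEXT, amount_cents INTEGER)")
    return conn


def handle_message(conn, message):
    """Apply the message once. Returns False when it is a redelivered duplicate."""
    try:
        with conn:  # one transaction: commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO processed (idempotency_key) VALUES (?)",
                (message["idempotency_key"],),
            )
            conn.execute(
                "INSERT INTO ledger (order_id, amount_cents) VALUES (?, ?)",
                (message["order_id"], message["amount_cents"]),
            )
    except sqlite3.IntegrityError:
        return False  # key already recorded: the side effect happened earlier
    return True
```

Because the key and the ledger row commit together, a crash between the two cannot leave a recorded key without its write, or a write without its key.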
Every state-changing API needs a client-supplied idempotency key and a server-side dedup table with a TTL longer than the worst retry window. Without both, a lost response on a successful call becomes a double charge.
Originally posted on LinkedIn.