System Design Fundamentals
Idempotency, Retries, Timeouts, Backoff, and Rate Limits
Why does idempotency matter? Because networks lie. A client sends a payment request, the server processes it, and the response packet vanishes somewhere between a load balancer and the client's socket. The client sees a timeout. Did the charge go through? Without idempotency, the client has two bad options: give up and risk that the user's payment was silently eaten, or retry and risk charging the user twice. Idempotency eliminates this dilemma entirely. An idempotent operation produces the same result whether you execute it once or ten times, so retrying is always safe.
HTTP method idempotency by design
The HTTP specification already classifies certain methods as idempotent:
- GET is safe and idempotent. Reading the same resource 100 times changes nothing on the server.
- PUT is idempotent because it replaces the entire resource. Sending `PUT /users/42` with `{"name": "Alice"}` twice leaves the resource in the same state both times.
- DELETE is idempotent. Deleting `/orders/99` twice results in the same final state: the order is gone. The first call might return `200 OK`, the second might return `404 Not Found`, but the server-side effect is identical.
- POST is not idempotent by default. Sending `POST /charges` twice can create two separate charges. This is the method that causes the most trouble in distributed systems, and it is the reason idempotency keys exist.
Idempotency keys: making POST safe to retry
The pattern works like this. The client generates a unique identifier (typically a UUID v4) before sending the request. It attaches this key as a header. The server, before processing the request, checks a dedupe store (Redis, a database table, or an in-memory cache) for that key. If the key does not exist, the server processes the request normally, stores the result keyed by that identifier, and returns the response. If the key already exists, the server skips processing entirely and returns the stored result.
On first receipt the server processes the charge and stores the result. On retry with the same key, the server returns the cached 201 Created response without touching the payment processor again.
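A minimal client-side sketch of the pattern, using only the standard library (the endpoint URL and payload are placeholders): generate the key once per logical operation and reuse it verbatim on every retry.

```python
import json
import urllib.request
import uuid

# Generate the idempotency key ONCE for this logical operation.
# Any retry of the same payment must reuse this exact value.
key = str(uuid.uuid4())

req = urllib.request.Request(
    "https://api.example.com/charges",  # hypothetical endpoint
    data=json.dumps({"amount": 1000, "currency": "usd"}).encode(),
    headers={
        "Idempotency-Key": key,
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req)  # on timeout, resend `req` with the same key
```

The key lives with the logical operation, not with the HTTP attempt: a fresh attempt of the same payment reuses the key, while a genuinely new payment gets a new one.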

Handling race conditions with concurrent requests
What happens if two requests arrive with the same idempotency key at the same instant? Without protection, both could pass the "key not found" check and both could execute the operation. The standard solution is to acquire a lock on the idempotency key before processing. If a second request arrives while the first is still in flight, the server returns 409 Conflict or blocks until the first request completes and then returns the stored result.
Stripe's idempotency design as a reference model
Stripe's API is one of the most widely studied implementations of idempotency keys. Their design choices are worth understanding. Keys expire after 24 hours, which keeps storage bounded. The server fingerprints requests by comparing the method, URL, and a hash of the request body against the stored key. If you reuse the same idempotency key with a different request body, Stripe rejects it with a 400 error rather than silently returning the old result. This prevents a subtle bug where a developer accidentally reuses a key across different operations.
Idempotency is what makes retries safe. Without it, every retry is a gamble that could create duplicate charges, duplicate orders, or duplicate messages. Build idempotency first, then layer retries on top. The order matters.
Server-side implementation pattern
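A minimal in-process sketch of the lock-then-double-check flow. In a real deployment the dedupe store and the locks would live in Redis or a database so they are shared across instances; here `process` stands in for the actual charge logic and a single process-wide lock is used for brevity.

```python
import threading

_results = {}             # idempotency key -> stored response
_lock = threading.Lock()  # guards the store (per-key locks in production)

def handle(key, process):
    # Fast path: a completed earlier attempt already stored the result.
    if key in _results:
        return _results[key]
    with _lock:
        # Double-check after acquiring the lock: another request may have
        # finished processing between the lookup above and this point.
        if key in _results:
            return _results[key]
        result = process()      # execute the side effect exactly once
        _results[key] = result
        return result
```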
The double-check after acquiring the lock is critical. Between the initial lookup and the lock acquisition, another request could have completed processing. Skipping the double-check is a common bug in idempotency implementations.
Design rules to remember
Choose the right scope for keys: one key per operation, not per resource. Expire stored results after a bounded window (24 hours is a reasonable default). Include request fingerprinting so that the same key with a different payload is rejected rather than returning stale data. For updates, prefer PUT (full replacement) or include a business-level request ID in the body so clients can retry confidently without needing a separate header.
Why do you need timeouts? Because without them, a single slow dependency can silently kill your entire service. Imagine Service A calls Service B, which calls a database. The database is experiencing lock contention and takes 45 seconds to respond instead of the usual 50 milliseconds. Without a timeout, Service A's thread sits blocked for 45 seconds. Ten such requests and the thread pool is exhausted. Now Service A cannot handle any requests at all, even ones that do not touch Service B. A missing timeout on one downstream call has cascaded into a full outage of the upstream service.
Connection timeout vs read timeout
These are two different failure modes that need separate configuration. A connection timeout limits how long the client waits to establish a TCP connection. If the remote host is unreachable or the port is not listening, you want to fail in 1-3 seconds, not the OS default of 60-120 seconds. A read timeout (sometimes called socket timeout) limits how long the client waits for data after the connection is established. This is where slow queries and overloaded services show up. Most HTTP client libraries let you set these independently:
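For illustration, here is the distinction at the raw socket level using Python's standard library (host, port, and payload are whatever your protocol requires). Higher-level HTTP clients expose the same two knobs; Python's requests library, for example, accepts a `timeout=(connect, read)` tuple.

```python
import socket

def fetch_with_timeouts(host, port, request,
                        connect_timeout=3.0, read_timeout=10.0):
    # Connection timeout: how long to wait for the TCP handshake.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Read timeout: how long to wait for each recv() once connected.
        sock.settimeout(read_timeout)
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # peer closed the connection
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        sock.close()
```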
Setting timeouts from data, not guesses
The most common mistake is choosing timeouts by intuition ("2 seconds feels right"). Instead, measure your actual latency distribution. If the p50 (median) response time is 45ms and the p99 is 180ms, a timeout of 500ms (roughly 2.5x the p99) gives you enough headroom for legitimate slow responses while catching genuinely stuck requests. If p95 is 200ms, setting the timeout at 150ms means you are killing 5% or more of your healthy requests. Setting it at 30 seconds means a stuck request burns resources for 30 seconds before anyone notices.
Review timeouts regularly. Latency profiles shift as traffic patterns change, new features are deployed, and data grows. A timeout that was perfect six months ago might be too short after a database migration.
Deadline propagation across service chains
In a microservices architecture, a single user request often fans out through multiple services. Without deadline propagation, each service sets its own independent timeout, and the total can far exceed what the user is willing to wait.
Consider this scenario. A user's browser has a 5-second timeout. The API gateway gives itself 4 seconds. Service A sets a 3-second timeout to Service B. Service B sets a 3-second timeout to Service C. If Service C takes 2.8 seconds, Service B returns to A after 2.8 seconds, and A has used 2.8 seconds plus its own processing. The user might still be waiting. But if the gateway already timed out at 4 seconds, Services A through C are still burning CPU on work that no one will ever see.
Deadline propagation solves this. The originating service starts a deadline (for example, "this request must complete by T+800ms"). Each downstream call subtracts the time already spent and passes the remaining budget:
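A sketch of the budget arithmetic for REST-style hops (function names are illustrative): the caller computes an absolute deadline once, and every hop derives its per-call timeout from the remaining budget instead of picking an arbitrary number.

```python
import time

def make_deadline(budget_seconds: float) -> float:
    # Absolute deadline on a monotonic clock, set once at the edge.
    return time.monotonic() + budget_seconds

def remaining(deadline: float) -> float:
    return deadline - time.monotonic()

def call_next_hop(deadline: float, own_processing: float = 0.0) -> float:
    # Subtract time already spent; what's left is the downstream timeout.
    budget = remaining(deadline) - own_processing
    if budget <= 0:
        # The caller has already given up; do no further work.
        raise TimeoutError("deadline exceeded before downstream call")
    return budget  # use as the timeout for the downstream request
```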
gRPC has this built in. When you set a deadline on a gRPC call, it automatically propagates through the entire call chain. Every downstream service knows exactly how much time remains and can abort early if the budget is exhausted.
Set timeouts from p95 or p99 latency data, not guesses. A good starting point is 2-3x your p99. And always propagate deadline budgets through service chains. If Service A has 600ms left, it should tell Service B so B does not waste 3 seconds on work A will discard.
Too short vs too long: both are dangerous
A timeout that is too short causes false failures. The downstream service was about to respond successfully, but the caller gave up too early. This wastes the work the downstream already did and triggers a retry, doubling the load. A timeout that is too long wastes resources. The calling thread is blocked, the connection is occupied, and other requests are queued behind it. In extreme cases, long timeouts cause thread pool exhaustion and cascading failures.
The right timeout is one that lets legitimate slow responses through while quickly releasing resources from genuinely stuck requests. This is a moving target that requires monitoring and periodic adjustment.
Combining timeouts with retries
A short per-attempt timeout combined with a small number of retries is almost always better than one long timeout. For example, three attempts with 500ms timeouts each gives the request 1.5 seconds total exposure (plus backoff delays), but each attempt frees its resources quickly if the downstream is stuck. One attempt with a 5-second timeout blocks a thread for 5 seconds on a stuck call.
A retry is straightforward: if a request fails, try again. The rationale is that many failures in distributed systems are transient. A server restarts and is unavailable for 2 seconds. A network switch drops a packet. A load balancer routes to an instance that just started garbage collection. These problems disappear on their own within milliseconds to seconds. A well-timed retry turns a user-visible error into a seamless experience.
When to retry: transient server errors and network failures
Retry on failures that are likely to resolve themselves: HTTP 500 (server crashed mid-request), 502 (bad gateway, often a restarting instance), 503 (service temporarily overloaded), 504 (gateway timeout). Also retry on connection resets, DNS resolution failures, and socket timeouts. These all suggest the infrastructure had a momentary problem, not a permanent one.
When NOT to retry: client errors
Never retry 4xx responses. A 400 Bad Request means the payload is malformed. A 401 means authentication failed. A 403 means the caller lacks permission. A 422 means the request violates business rules. Sending the exact same request again will produce the exact same error. Retrying client errors wastes resources and can trigger rate limiting. The one nuanced exception is 429 Too Many Requests, which is a 4xx but explicitly asks the client to retry after a delay.
Safety: only retry idempotent operations
This is the most important rule. If you retry a non-idempotent POST without an idempotency key, you might create two orders, send two emails, or charge a credit card twice. Before implementing retries, ask: "If this operation executes twice, is the result the same as executing once?" If the answer is no, either make the operation idempotent first (via idempotency keys) or do not retry it.
The retry amplification problem

This is where retries become dangerous. Consider a three-tier architecture: Client calls Gateway, Gateway calls Service A, Service A calls Database. Each layer is configured with 3 attempts. The database goes down. For each call it receives, Service A tries the database 3 times. For each of its own attempts, the Gateway calls Service A 3 times. The Client, in turn, calls the Gateway 3 times. One user action has now generated 9 requests hitting Service A and 27 attempts against the database. Multiply that by thousands of concurrent users and a temporary database hiccup becomes a catastrophic retry storm that prevents the database from recovering.
Retry amplification is one of the most common causes of outage escalation. If every layer in a 3-tier system retries 3 times, a single failure generates 27 requests. With 4 layers and 3 retries each, it becomes 81. Always designate ONE layer to own retries, usually the edge client or API gateway.
Retry budgets: capping aggregate retry load
Google's SRE practices introduced the concept of a retry budget. Instead of each request deciding independently whether to retry, the service tracks the ratio of retries to original requests over a sliding window. If retries exceed 10% of total traffic, new retries are suppressed. This prevents the "everyone retries at once" problem while still allowing retries during isolated, small-scale failures.
Circuit breaker integration
When a downstream service is completely down (not just occasionally flaky), retries only add to the problem. A circuit breaker monitors failure rates and, once the rate exceeds a threshold (for example, 50% of requests failing over 30 seconds), it "opens" the circuit and immediately fails all requests without even attempting the call. After a cooldown period, it allows a small number of probe requests through. If those succeed, the circuit closes and normal traffic resumes. Circuit breakers and retries complement each other: retries handle transient blips, circuit breakers handle sustained outages.
Retry implementation with safety checks
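A sketch of such a retry wrapper (`send` is a stand-in for the actual HTTP call and returns a status code; the method sets and retryable statuses follow the rules above). Fixed exponential spacing is used here for brevity; jitter is covered in the backoff section.

```python
import time

RETRYABLE_STATUSES = {429, 500, 502, 503, 504}
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}

def call_with_retries(send, method, has_idempotency_key=False,
                      max_attempts=3, delay=0.1):
    # Safety check: never retry a non-idempotent call without a key.
    retriable = method in IDEMPOTENT_METHODS or has_idempotency_key
    status = send()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE_STATUSES or not retriable:
            break
        time.sleep(delay * 2 ** (attempt - 1))  # exponential spacing
        status = send()
    return status
```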
Why not retry immediately? Because if 1,000 clients all hit a server at the same time and the server fails, all 1,000 clients retry simultaneously. The server, which was already struggling, now receives 1,000 requests in the same millisecond. This is the thundering herd problem, and it is one of the most common ways that retries make outages worse instead of better. Backoff solves this by spreading retry attempts over time.
Exponential backoff: the standard approach
The idea is simple: each successive retry waits longer than the last. The formula is `delay = min(cap, base * 2^attempt)`, where `attempt` is the zero-based retry count.
With a base of 100ms and a cap of 10 seconds:
- Attempt 1: wait 100ms
- Attempt 2: wait 200ms
- Attempt 3: wait 400ms
- Attempt 4: wait 800ms
- Attempt 5: wait 1600ms
The exponential growth means the first retry is fast (giving the best chance of succeeding on a brief glitch) while later retries back off increasingly to give the server breathing room. The cap prevents absurdly long waits: without it, the tenth attempt would wait over 51 seconds.
Why jitter is essential, not optional
Exponential backoff alone has a subtle problem. If 1,000 clients all fail at time T=0, they all retry at T=100ms (attempt 1), then all retry at T=300ms (attempt 2), then all at T=700ms (attempt 3). The retries are synchronized into waves that are just as harmful as the original thundering herd, just spaced further apart.
Jitter adds randomness to break this synchronization. There are three common strategies:
- Full jitter: `delay = random(0, min(cap, base * 2^attempt))`. Each client picks a completely random delay between zero and the exponential ceiling. This provides maximum spread but occasionally produces very short delays.
- Equal jitter: `delay = min(cap, base * 2^attempt) / 2 + random(0, min(cap, base * 2^attempt) / 2)`. The delay is at least half the exponential value, with randomness in the upper half. This guarantees a minimum wait while still spreading traffic.
- Decorrelated jitter: `delay = min(cap, random(base, previous_delay * 3))`. Each delay is random between the base and 3x the previous delay. This approach, recommended in Amazon's architecture blog, produces good spread without requiring the attempt counter.
AWS's analysis showed that full jitter produces the best results in terms of total completion time and server load across most scenarios.

Implementation with full jitter
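A minimal sketch of the full-jitter calculation (zero-based `attempt`; the base and cap defaults match the example above):

```python
import random

def full_jitter_delay(attempt: int, base: float = 0.1,
                      cap: float = 10.0) -> float:
    # Uniform random delay in [0, min(cap, base * 2^attempt)].
    ceiling = min(cap, base * 2 ** attempt)
    return random.uniform(0.0, ceiling)
```

A retry loop would call `time.sleep(full_jitter_delay(attempt))` between attempts; because every client draws independently, the synchronized waves disappear.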
Honoring the Retry-After header
Sometimes the server knows exactly how long the client should wait. The Retry-After header appears on 429 (Too Many Requests) and 503 (Service Unavailable) responses. It can specify seconds or an absolute HTTP-date:
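Both forms look like this (the values here are illustrative):

```http
Retry-After: 120
Retry-After: Wed, 08 Jul 2026 09:00:00 GMT
```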
When Retry-After is present, always honor it instead of your own backoff calculation. The server has information the client does not (current load, maintenance window duration, rate limit reset time). Ignoring this header and retrying sooner will likely result in another 429 or 503, wasting both client and server resources.
Backoff in context
Backoff is never used alone. It works together with a retry limit (cap at 3-5 attempts), per-attempt timeouts (do not wait forever on any single attempt), and idempotency (so that retried operations are safe). Think of backoff as the spacing strategy within a retry policy: retries decide "should I try again?", backoff decides "when should I try again?", and timeouts decide "how long should I wait for this particular attempt?"
Everything discussed so far (idempotency, timeouts, retries, backoff) happens on the client side. Rate limiting is the server's defense. While backoff is a polite request for clients to slow down, rate limiting is the server enforcing that slowdown whether the client cooperates or not. Without rate limits, a single misbehaving client, a bot, a buggy retry loop, or a DDoS attack can consume all available capacity and starve every other user.
The token bucket algorithm
The most widely used rate limiting algorithm is the token bucket. The concept is intuitive: imagine a bucket that holds up to b tokens. Tokens are added to the bucket at a steady rate of r tokens per second. Each incoming request must remove one token from the bucket. If the bucket is empty, the request is rejected (or queued). This design has an important property: it allows bursts. If the bucket holds 100 tokens and the refill rate is 10 per second, a client can send 100 requests in a single burst, then must wait for tokens to refill at 10 per second. This matches real traffic patterns where users often send clusters of requests (loading a page, scrolling a feed) followed by quiet periods.
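A minimal single-process sketch of the algorithm (a shared deployment would keep the bucket state in Redis or at the gateway):

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second (r)
        self.capacity = capacity    # maximum tokens, i.e. burst size (b)
        self.tokens = capacity      # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, clamped to capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0      # spend one token for this request
            return True
        return False                # bucket empty: reject (or queue)
```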

Fixed window vs sliding window
A simpler approach is the fixed window counter: divide time into 1-minute windows and allow 100 requests per window. The problem is boundary spikes. A client sends 100 requests at 12:00:59 and another 100 at 12:01:01. Both are within their respective windows, but the server received 200 requests in 2 seconds. The sliding window approach fixes this by weighting the previous window's count. If the current window is 30% elapsed, the effective count is (0.7 * previous window count) + current window count. This smooths the boundary problem without the memory overhead of tracking individual request timestamps.
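The weighted count from the example can be sketched as (names are illustrative):

```python
def effective_count(prev_window: int, curr_window: int,
                    elapsed_fraction: float) -> float:
    # Weight the previous window by the fraction of it still covered
    # by the sliding window, then add the current window's count.
    return (1.0 - elapsed_fraction) * prev_window + curr_window
```

With 100 requests in the previous window, 20 so far in the current one, and the current window 30% elapsed, the effective count is 0.7 * 100 + 20 = 90, so a limit of 100 would still admit the request.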
Rate limit response headers
Well-designed APIs communicate their rate limits through response headers so clients can self-regulate:
- `X-RateLimit-Limit`: the maximum number of requests allowed in the current window
- `X-RateLimit-Remaining`: how many requests the client has left
- `X-RateLimit-Reset`: the Unix timestamp when the window resets
When the limit is exceeded, the server returns 429 with the Retry-After header:
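An illustrative 429 response (the values are made up):

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1767700800
```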
Client behavior on 429
The correct response to a 429 is to honor the Retry-After header, apply backoff with jitter, and optionally reduce concurrency. The wrong response is to retry immediately in a tight loop, which guarantees another 429 and may escalate to the server banning the client entirely. Sophisticated clients track the X-RateLimit-Remaining header proactively and throttle their own request rate before hitting the limit.
Granularity: per-user, per-IP, per-endpoint, global
Rate limits can be applied at different levels depending on the threat model:
- Per-user / per-API-key: the most common for authenticated APIs. Each user gets their own quota. Stripe limits to 100 requests per second per API key.
- Per-IP: useful for unauthenticated endpoints like login pages. Prevents brute-force attacks. But be careful: many users behind a corporate NAT or VPN share one IP.
- Per-endpoint: different endpoints have different costs. A search query that fans out to 20 shards should have a tighter limit than a simple key-value lookup.
- Global: a hard ceiling on total requests per second to protect infrastructure. This is the last line of defense.
Load shedding as extreme rate limiting
When a service is so overloaded that even well-behaved clients are at risk, load shedding drops requests proactively. The service monitors its CPU, memory, or queue depth and starts returning 503 immediately (with Retry-After) once a threshold is crossed. This is conceptually the same as rate limiting but triggered by server health rather than client identity. The key principle is the same: reject a few requests quickly and cleanly so the rest can be served successfully, rather than accepting everything and failing slowly for everyone.
Implement rate limiting at the API gateway, not in individual services. The gateway sees all traffic in one place, can enforce per-tenant quotas centrally, and protects every downstream service without each one reimplementing the logic.
Each mechanism solves one piece of the reliability puzzle. Idempotency makes operations safe to repeat. Timeouts free resources from stuck calls. Retries give failed operations another chance. Backoff prevents retries from becoming a stampede. Rate limits protect the server from being overwhelmed. Used in isolation, each helps. Used together with a coherent design, they create a system that bends under stress without breaking.
The end-to-end flow
Here is how a single request travels through all five mechanisms working together. A client wants to create a payment. Before sending, it generates an idempotency key (UUID) and attaches it to the request. The client sets a per-attempt timeout of 500ms and allows up to 3 total attempts.
The request arrives at the API gateway, which checks the rate limiter. If the client has exceeded their quota, the gateway immediately returns 429 with a Retry-After header. The client honors that header, waits the specified duration (plus jitter), and retries with the same idempotency key.
If the rate limit check passes, the gateway forwards the request to the payment service with a propagated deadline (remaining budget minus gateway processing time). The payment service looks up the idempotency key in its dedupe store. On a first attempt, it processes the charge, stores the result, and returns 201. If the response is lost in transit and the client times out after 500ms, the client applies exponential backoff with jitter (say 120ms for the first retry) and sends the same request again with the same idempotency key. This time the payment service finds the cached result and returns it without processing the charge again.
The principle: one retry layer
The most important design decision is where retries live. If the client, gateway, and service all retry independently, you get the amplification problem discussed earlier. The cleanest pattern is to have exactly one layer own retries: typically the edge client (mobile app, browser, or CLI) or the API gateway. All other layers attempt each call once and propagate errors upward. The retrying layer is responsible for idempotency keys, backoff timing, retry budgets, and honoring server hints.
Pick ONE layer to own retries, usually the edge client or the API gateway. Every other layer in the stack should make a single attempt and propagate the error. If you need retries at multiple layers, coordinate them with a shared retry budget to cap total amplification.
Propagate deadlines, not independent timeouts
Each service in a call chain should receive the remaining time budget from its caller, not set its own arbitrary timeout. If the gateway has 800ms left and passes that to Service A, and A spends 200ms on its own logic, it passes 600ms to Service B. Service B will not waste time on work that A has already abandoned. gRPC does this automatically. For REST-based systems, pass the deadline as a request header (for example, X-Request-Deadline: 2026-03-06T12:00:00.800Z) and have each service check the remaining budget before starting expensive work.
Honor server signals
Clients should treat server responses as authoritative instructions, not suggestions. When a 429 includes Retry-After: 60, wait 60 seconds. When a 503 includes Retry-After: 120, wait 120 seconds. When rate limit headers show X-RateLimit-Remaining: 2, throttle proactively rather than burning the last tokens and getting throttled. This is a cooperative system: the server knows its capacity, and clients that respect its signals get better overall throughput than clients that fight the limits.
Trade-offs summary
| Mechanism | You invest | You gain |
|---|---|---|
| Idempotency | Storage for dedupe store, lock coordination, key management | Correctness under repeats, safe retries, no duplicate side effects |
| Timeouts | Some legitimate slow requests are killed prematurely | Resource protection, fast failure detection, no zombie threads |
| Retries | Extra attempts increase latency and downstream load | Higher success rate for transient failures, better user experience |
| Backoff + Jitter | Delayed retry means higher latency for individual requests | System-wide stability, no thundering herds, server recovery time |
| Rate Limits | Some legitimate traffic is rejected during spikes | Sustainable throughput, fairness across tenants, abuse protection |
Design checklist for resilient systems
Wire these mechanisms together by following this checklist. First, make all state-changing operations idempotent (via HTTP method semantics or explicit idempotency keys). Second, set per-attempt timeouts derived from measured latency percentiles and propagate deadline budgets through service chains. Third, implement retries at exactly one layer with a cap of 2-3 attempts and a retry budget that limits aggregate retry traffic to 10% of total load. Fourth, use exponential backoff with full jitter between retry attempts and always honor Retry-After headers. Fifth, enforce rate limits at the API gateway with per-tenant token bucket quotas and clear response headers. Sixth, add circuit breakers on outbound calls so sustained failures trigger fast-fail rather than endless retries.
Each piece reinforces the others. Idempotency makes retries safe. Timeouts make retries bounded. Backoff makes retries gentle. Rate limits make the whole system sustainable. Remove any one piece and the others lose their effectiveness: retries without idempotency create duplicates, retries without backoff create stampedes, retries without timeouts create zombie threads, and none of it matters if the server has no rate limits to protect itself.