Resilience Patterns: Circuit Breakers, Bulkheads, and Load Shedding

Topics Covered

Resilience Patterns for Modern Services

Why Slow Dependencies Are Worse Than Dead Ones

Real-World Cascade: Amazon's 2015 DynamoDB Outage

How the Three Patterns Work Together

Circuit Breakers

The Three States

Implementation Pattern

Configuration Decisions

Fallback Strategies

Monitoring Circuit Breaker Health

Bulkheads

Thread Pool Isolation

Semaphore Isolation

Sizing Bulkheads

Bulkheads Beyond Thread Pools

Monitoring Bulkhead Utilization

Load Shedding

When to Shed Load

Priority Classification

Implementation Approaches

Client-Side Handling of Shed Requests

Combining Resilience Patterns

Layered Defense: Outside In

A Concrete Production Stack

The Testing Imperative

Observability for Resilience

Common Anti-Patterns

Distributed systems do not fail cleanly. A single slow dependency can consume all your threads, a noisy neighbor can exhaust shared resources, and a traffic spike can turn a healthy service into a cascading failure that takes down everything behind it. Resilience patterns exist to contain these failures before they spread.

Three patterns form the foundation of production resilience:

  • Circuit Breakers — stop calling a failing dependency before it drags your service down with it. If the payment gateway is returning errors, stop sending requests and return a fallback instead of queueing up thousands of timeouts.
  • Bulkheads — isolate resources so that one failing component cannot consume everything. If the recommendation service is slow, it exhausts its own thread pool while payments and checkout continue unaffected.
  • Load Shedding — when you receive more traffic than you can handle, reject the excess gracefully rather than slowing down for everyone. Return 503 to low-priority requests so high-priority ones still get fast responses.

These patterns share a philosophy: it is better to serve some requests well than all requests poorly. A system that degrades gracefully under stress is more valuable than one that works perfectly at normal load but collapses under pressure.

Why Slow Dependencies Are Worse Than Dead Ones

A dead service fails fast. Your request gets a connection refused or a timeout in milliseconds, freeing the thread and the connection immediately. The caller retries or returns a fallback, and life goes on.

A slow service is far more dangerous. Each pending request holds a thread (or connection) for the duration of the timeout — typically 30 seconds. If you make 100 requests/second to a service that has become slow, within 30 seconds you have 3,000 threads blocked waiting for responses. Your thread pool is exhausted, and now your service cannot handle any requests — even those going to perfectly healthy dependencies.

This is the cascade failure pattern: Service C becomes slow, Service B runs out of threads calling C, and Service A runs out of threads calling B. A single slow service can take down an entire microservice graph.

Real-World Cascade: Amazon's 2015 DynamoDB Outage

In September 2015, a metadata partition in Amazon DynamoDB became overloaded. Services reading from that partition experienced elevated latency — not failures, just slowness. The slow reads held threads and connections in upstream services. Those upstream services started timing out, and their callers started timing out in turn. Within minutes, dozens of AWS services were degraded because of a single slow metadata partition.

The lesson was clear: slow dependencies propagate failure faster than dead ones. AWS responded by adding circuit breakers and bulkheads at every service boundary, and by implementing strict timeouts on all internal calls. The specific changes were reducing default timeouts from 30 seconds to 1-3 seconds and isolating dependency call pools so that one slow partition could not exhaust threads needed for other operations.

This incident also demonstrated that cascade failures are not theoretical — they happen regularly in production microservice architectures. The three resilience patterns exist specifically to contain these failures before they propagate across service boundaries.

How the Three Patterns Work Together

Each pattern protects a different boundary:

  • Load shedding protects you from callers (upstream traffic). It sits at the front door of your service.
  • Bulkheads protect your internal resources. They sit inside your service, partitioning thread pools, connection pools, and memory.
  • Circuit breakers protect you from dependencies (downstream services). They sit on every outbound call.

A production service needs all three because failures come from all directions — too much inbound traffic, internal resource exhaustion, and downstream outages can all happen simultaneously. During a real incident, these failure modes often co-occur: a downstream outage causes retries, retries amplify traffic, amplified traffic exhausts internal resources, and the entire system cascades. The three patterns together break this chain at every link.

Key Insight

Resilience patterns protect different boundaries. Circuit breakers protect you from downstream failures (services you call). Bulkheads protect you from internal resource exhaustion (your own threads and connections). Load shedding protects you from upstream overload (traffic you receive). A production system needs all three because failures come from all directions.

Circuit Breakers

A circuit breaker monitors calls to a dependency and stops sending requests when the failure rate crosses a threshold. Instead of waiting for timeouts on a service that is clearly broken, the circuit breaker returns a fallback response immediately — freeing threads and preserving the caller's health.

[Figure: Circuit breaker transitioning through closed, open, and half-open states]

The Three States

Closed (normal). All requests pass through to the dependency. The breaker counts recent failures within a sliding window. If failures stay below the threshold, nothing changes. This is the normal operating state.

Open (tripped). Failures exceeded the threshold. All requests are immediately rejected with a fallback response — cached data, a default value, or an error message. No traffic reaches the failing dependency. A timer starts counting down.

Half-open (probing). The timer expires. The breaker allows a small number of test requests through to the dependency. If they succeed, the breaker transitions back to Closed and resumes normal traffic. If they fail, it returns to Open and resets the timer.

Implementation Pattern

 
def call_with_breaker(request):
    if breaker.state == OPEN:
        if not breaker.timer_expired():
            return fallback_response()
        breaker.state = HALF_OPEN

    try:
        response = downstream.call(request, timeout=3.0)  # seconds
        breaker.record_success()
        if breaker.state == HALF_OPEN:
            breaker.state = CLOSED
        return response
    except (Timeout, Error):
        breaker.record_failure()
        if breaker.failure_count >= THRESHOLD:
            breaker.state = OPEN
            breaker.start_timer()
        return fallback_response()

The key detail is the 3-second timeout on the downstream call. Without an aggressive timeout, a slow dependency consumes the caller's thread for the full default timeout (often 30 seconds). With a 3-second timeout, the thread is freed quickly, and the circuit breaker gets its failure signal faster.

Configuration Decisions

  • Failure threshold — too low and the breaker trips on transient blips. Too high and it takes too long to detect real outages. A common starting point: 5 failures within a 10-second sliding window.
  • Open-state timeout — how long to wait before probing the dependency. Too short and you hammer a recovering service. Too long and you unnecessarily delay recovery. Start with 30-60 seconds.
  • Half-open probe count — how many test requests to send before closing the circuit. One is fragile (a single success might be a fluke). Three to five gives more confidence that the dependency is genuinely recovered.
  • What counts as a failure — timeouts and 5xx errors should trip the breaker. 4xx errors (client mistakes) should NOT — the dependency is working correctly, the request was invalid. Including 4xx would cause false trips due to normal client behavior.
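
These knobs can be collected into a single configuration object. The sketch below is illustrative Python, not any particular library's API; the class name and defaults are assumptions matching the starting points above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BreakerConfig:
    # Trip after this many failures inside the sliding window.
    failure_threshold: int = 5
    # Length of the sliding window used to count failures.
    window_seconds: float = 10.0
    # How long to stay Open before probing the dependency.
    open_timeout_seconds: float = 30.0
    # Consecutive successful probes required to close again.
    half_open_probes: int = 3

    def is_failure(self, status_code: int, timed_out: bool) -> bool:
        """Timeouts and 5xx responses trip the breaker; 4xx do not."""
        return timed_out or 500 <= status_code < 600
```

Centralizing the failure predicate alongside the thresholds keeps the "4xx is not a failure" rule in one place instead of scattered across call sites.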

Fallback Strategies

When the circuit is open, what do you return?

  • Cached response — return the last known good value. A product catalog that is 5 minutes stale is better than no catalog at all.
  • Default value — return a sensible default. If the recommendation engine is down, show trending products instead of personalized ones.
  • Degraded response — skip the failing feature entirely. Show the checkout page without recommendations rather than blocking checkout.
  • Error with guidance — return a clear error message that tells the client to retry later. Include a Retry-After header so clients know when to try again.

The best fallback depends on the operation. Read operations (product catalog, user profile) work well with cached or default values. Write operations (payments, order submissions) usually cannot use fallbacks — the user needs to know the operation did not complete. Never silently swallow a write failure behind a cached success response.

Monitoring Circuit Breaker Health

A circuit breaker is only useful if you know it is working. Instrument every breaker with metrics:

  • State transitions — emit an event whenever the breaker moves between Closed, Open, and Half-Open. A breaker that cycles rapidly between states indicates an unstable dependency that partially recovers but cannot sustain load.
  • Fallback invocation rate — track how often the fallback path is triggered. A sudden spike in fallback rate is an early warning that a dependency is degrading even if the breaker has not tripped yet.
  • Success/failure rate per dependency — separate metrics for each downstream call. Aggregate failure rates hide which specific dependency is problematic.
  • Open duration — how long the breaker stays open before recovering. Long open durations suggest the dependency has persistent issues. Short, frequent open-close cycles suggest the breaker configuration needs tuning (threshold too sensitive, or the dependency is flapping).

Dashboard these metrics per dependency. Alert when a breaker opens (severity: warning) and when it stays open for more than 5 minutes (severity: critical). The goal is to detect dependency health issues before they escalate to user-facing impact.
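
Instrumentation can be as simple as counters keyed by dependency. This sketch uses an in-process Counter standing in for a real metrics client such as statsd or Prometheus; the metric names are illustrative:

```python
from collections import Counter

# In-process counters standing in for a real metrics client.
METRICS = Counter()

def record_transition(dependency, old_state, new_state):
    # Emit one event per state change so flapping is visible.
    METRICS[f"breaker.{dependency}.transition.{old_state}->{new_state}"] += 1

def record_call(dependency, ok, used_fallback):
    # Per-dependency success/failure keeps aggregate rates from hiding
    # which downstream service is actually the problem.
    METRICS[f"breaker.{dependency}." + ("success" if ok else "failure")] += 1
    if used_fallback:
        METRICS[f"breaker.{dependency}.fallback"] += 1
```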

Interview Tip

In interviews, always mention the half-open state. Many candidates describe circuit breakers as just open/closed, missing the recovery mechanism. The half-open state is what makes circuit breakers self-healing. Without it, an open circuit never recovers and requires manual intervention to close.

Bulkheads

The term comes from ship design — a ship is divided into watertight compartments so that a breach in one does not flood the entire vessel. In software, a bulkhead isolates resources so that one failing component cannot exhaust the resources needed by others.

[Figure: Bulkhead pattern showing isolated thread pools preventing cascade failures]

Thread Pool Isolation

The most common bulkhead implementation assigns a separate thread pool to each downstream dependency. If your service calls three backends — payment, inventory, and recommendations — each gets its own pool:

  • Payment pool — 20 threads
  • Inventory pool — 15 threads
  • Recommendation pool — 10 threads

If the recommendation service becomes slow and all 10 of its threads are blocked, only recommendation calls are affected. Payment and inventory continue using their own pools with no contention. Without bulkheads, all three would share a single pool of 45 threads, and a slow recommendation service could consume all 45.

The isolation guarantee is absolute: no matter how badly the recommendation service behaves — whether it responds in 30 seconds, returns garbage, or hangs indefinitely — it can only consume its 10 allocated threads. The other 35 threads are physically inaccessible to recommendation calls. This is what makes bulkheads a hard boundary rather than a soft limit.
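
A minimal version of per-dependency pools, using Python's standard concurrent.futures. The pool names and sizes mirror the example above; call_isolated is a hypothetical helper, not a library function:

```python
from concurrent.futures import ThreadPoolExecutor

# One pool per downstream dependency, sized independently. A slow
# recommendation call can block at most 10 threads; the payment and
# inventory pools are untouched.
POOLS = {
    "payment": ThreadPoolExecutor(max_workers=20),
    "inventory": ThreadPoolExecutor(max_workers=15),
    "recommendation": ThreadPoolExecutor(max_workers=10),
}

def call_isolated(dependency, fn, *args, timeout=3.0):
    """Run fn on the dependency's dedicated pool with a hard timeout."""
    future = POOLS[dependency].submit(fn, *args)
    # result() raises TimeoutError if the call exceeds the budget,
    # freeing the caller even though the worker may still be blocked.
    return future.result(timeout=timeout)
```

One caveat: the timeout frees the caller's thread, but the worker thread stays busy until the underlying call actually returns. That is precisely why the pool bound matters: a hung dependency can strand at most max_workers threads.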

Semaphore Isolation

An alternative to thread pools is semaphore isolation, which limits the number of concurrent in-flight requests to a dependency without dedicating separate threads. The calling thread itself makes the request, but a semaphore with a count of N ensures that at most N requests are in-flight simultaneously. If the semaphore is full, the request is immediately rejected.

Semaphore isolation has lower overhead (no thread context switching) but less control — it cannot enforce timeouts independently because the calling thread is the one waiting. Thread pool isolation allows the pool to enforce a timeout and interrupt the calling code. In practice, combine semaphore isolation with explicit timeouts on the HTTP client for the best balance of efficiency and safety.

Sizing Bulkheads

The size of each bulkhead should match the expected concurrency for that dependency. Use Little's Law:

Pool size = Request rate x Average response time

If recommendations receive 100 requests/second and average 10ms, the pool needs 1 concurrent thread (100 x 0.01). Add headroom for variance: 5-10 threads. If payments receive 50 requests/second at 200ms average, the pool needs 10 threads (50 x 0.2). Add headroom: 15-20 threads.

Undersized bulkheads reject requests unnecessarily for a healthy dependency. Oversized bulkheads defeat the purpose — if the recommendation pool is 200 threads, a slow recommendation service can consume 200 threads before the bulkhead kicks in, which might be most of your capacity.
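
The sizing rule can be written down directly. In this sketch, the 2x multiplier and the floor of 5 threads are assumptions standing in for the "headroom" judgment above:

```python
import math

def pool_size(requests_per_second, avg_response_seconds,
              multiplier=2.0, floor=5):
    """Little's Law sizing: concurrency = arrival rate x time in system.
    The multiplier and floor are headroom assumptions for bursts."""
    base = requests_per_second * avg_response_seconds
    return max(floor, math.ceil(base * multiplier))

# Examples from the text:
#   recommendations: 100 req/s x 10 ms  -> base 1,  sized 5
#   payments:        50 req/s  x 200 ms -> base 10, sized 20
```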

Bulkheads Beyond Thread Pools

The principle extends to any shared resource:

  • Connection pools — separate database connection pools for transactional queries and analytics queries. A slow analytics query does not block checkout transactions.
  • Memory — container memory limits prevent one service from consuming all memory on a host. Kubernetes resource limits are a form of bulkhead.
  • CPU — CPU quotas (cgroups) prevent one container from starving others on the same host.
  • Queue partitions — separate message queues for different priorities. High-priority messages are not blocked behind a backlog of low-priority work.

Monitoring Bulkhead Utilization

Each bulkhead needs utilization metrics to detect both undersizing and approaching saturation:

  • Active thread count — how many threads in each pool are currently busy. Track the peak and average. If a pool consistently peaks at 90% utilization, it needs more headroom for traffic bursts.
  • Rejection count — how many requests were rejected because the pool was full. Rejections in a healthy pool indicate undersizing. Rejections during a dependency outage indicate the bulkhead is working correctly — it is containing the failure.
  • Queue wait time — if the pool has a bounded queue, track how long requests wait before a thread becomes available. Rising queue times indicate the pool is approaching saturation.

Set alerts on rejection rate per pool. A healthy pool should have near-zero rejections under normal traffic. Rejections during incidents are expected and confirm that the bulkhead is isolating the failure correctly.

Load Shedding

When traffic exceeds your service's capacity, you have two choices: serve everyone slowly (degrading the experience for all users) or reject some requests fast so the rest get normal service. Load shedding chooses the second approach — it drops lower-priority requests with a fast 503 response so that higher-priority requests continue receiving fast, correct responses.

[Figure: Server rejecting low-priority requests with 503 while processing high-priority requests normally]

When to Shed Load

Load shedding activates when the service detects it is at or near capacity. Common trigger signals:

  • Request queue depth — if the pending request queue exceeds a threshold, start rejecting. This is the simplest and most common trigger.
  • CPU utilization — if CPU exceeds 80%, start shedding. Useful but laggy — CPU metrics are sampled periodically and may react too slowly to sudden spikes.
  • Response latency — if P99 latency exceeds the SLO, start shedding. This directly measures user impact and is the most meaningful signal.
  • In-flight request count — if concurrent requests exceed the server's capacity (measured during load testing), reject new ones. Similar to queue depth but measures active processing.

Priority Classification

Not all requests are equal. A flash sale generates traffic from browsing, searching, cart operations, and payment processing. If you must shed load, the priority order is clear:

  • Critical — payment processing, order submission. Never shed these unless the service is truly dying.
  • High — cart operations, inventory checks. Important but briefly deferrable.
  • Medium — search, product pages. Important for user experience but the user can retry.
  • Low — recommendations, personalization, analytics tracking. Nice to have but dispensable under pressure.

Priority is typically determined from request metadata available immediately: the URL path, an API key tier, or a custom header set by the API gateway. The classification must be fast — sub-millisecond — because the whole point of shedding is to free resources quickly.

Some teams also distinguish between new requests and in-progress operations. A checkout that has already validated the cart and reserved inventory should not be shed at the payment step — shedding mid-flow wastes all the work already done. Priority should account for how far along the request is in a multi-step workflow.
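
Classification from metadata alone can be a simple prefix table. The routes and tiers below are illustrative, echoing the priority list above:

```python
# Priority from request metadata alone: a pure prefix lookup, no I/O,
# comfortably sub-millisecond.
PRIORITY_BY_PREFIX = [
    ("/checkout", "critical"),
    ("/orders", "critical"),
    ("/cart", "high"),
    ("/inventory", "high"),
    ("/search", "medium"),
    ("/products", "medium"),
]

def classify(path, headers=None):
    headers = headers or {}
    # An API gateway may already have classified the request for us.
    if "x-priority" in headers:
        return headers["x-priority"]
    for prefix, tier in PRIORITY_BY_PREFIX:
        if path.startswith(prefix):
            return tier
    return "low"  # recommendations, analytics, anything unrecognized
```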

Implementation Approaches

Queue-based shedding: Requests enter a bounded queue. When the queue is full, new requests are rejected with 503. The queue can be priority-ordered so low-priority requests are at the back and get shed first when the queue fills.

Token bucket: The service maintains a token bucket that refills at the service's sustainable rate (determined by load testing). Each request consumes a token. When tokens are exhausted, requests are rejected. Different priority levels can have reserved token allocations.

Adaptive shedding: The service continuously monitors its own latency and throughput. When latency degrades beyond the SLO, it starts rejecting the lowest priority tier. If latency does not improve, it escalates to the next tier. This approach self-tunes but is more complex to implement.
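
A minimal token-bucket shedder with a slice reserved for critical traffic might look like this; the refill rate, capacity, and 20% reservation are all assumptions a real service would derive from load testing:

```python
import time

class TokenBucket:
    """Token-bucket load shedder with a reserved slice for critical
    traffic, so low-priority requests are shed first as it drains."""

    def __init__(self, refill_rate, capacity, critical_reserve=0.2):
        self.refill_rate = refill_rate  # sustainable requests/second
        self.capacity = capacity
        self.critical_reserve = critical_reserve
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_acquire(self, priority="low"):
        # Refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        # Non-critical requests cannot dip into the reserved slice.
        floor = 0.0 if priority == "critical" else self.capacity * self.critical_reserve
        if self.tokens - 1 >= floor:
            self.tokens -= 1
            return True
        return False  # caller responds 503 with a Retry-After header
```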

Client-Side Handling of Shed Requests

When a service returns 503, the client needs to handle it correctly. A well-behaved client does three things:

  • Respect the Retry-After header. The server includes a Retry-After header indicating how many seconds the client should wait before retrying. Ignoring this header and retrying immediately adds load to an already overloaded service.
  • Add jitter to retries. If 1,000 clients all receive 503 at the same time and all retry after exactly 5 seconds, the service gets another 1,000 requests simultaneously. Adding random jitter (e.g., 5 plus 0-5 seconds of random delay) spreads retries across a time window and avoids thundering herd effects.
  • Back off exponentially. If the second attempt also returns 503, wait longer before the third attempt. Exponential backoff with jitter (1s, 2s, 4s, 8s with random variation) naturally reduces retry pressure as the overload persists.
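
Put together, a well-behaved client loop looks like the following sketch, where the send callable and its (status, headers) return shape are assumptions:

```python
import random
import time

def retry_with_backoff(send, max_attempts=4, base_delay=1.0):
    """Retry loop that honors Retry-After and backs off exponentially
    with full jitter. `send` returns (status, headers); 503 means the
    server shed this request."""
    status, headers = 503, {}
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 503:
            return status
        # Prefer the server's hint; otherwise back off exponentially.
        delay = float(headers.get("Retry-After", base_delay * (2 ** attempt)))
        # Full jitter: uniform in [0, delay], so clients shed at the
        # same moment do not all retry at the same moment.
        time.sleep(random.uniform(0, delay))
    return status  # attempts exhausted; surface the 503 to the caller
```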

Services should also communicate shed status to upstream callers. If your service is shedding load because a downstream dependency is overloaded, return 503 upstream with a Retry-After header so the entire call chain adjusts instead of continuing to generate traffic that will be rejected.

Common Pitfall

Load shedding must return 503 fast — in under 10ms. If the shedding logic itself is slow (e.g., querying a database to determine request priority), it defeats the purpose. Priority classification should use request metadata that is available immediately: the URL path, an API key tier, or a request header. Never do expensive computation to decide whether to shed.

Combining Resilience Patterns

No single resilience pattern is sufficient. A production service faces failures from multiple directions simultaneously — upstream traffic spikes, internal resource exhaustion, and downstream outages can all happen at the same time during an incident. The patterns compose as layers of defense.

Layered Defense: Outside In

Think of the patterns in layers from the outside of your service inward:

Layer 1: Load shedding (front door). Incoming traffic enters the service. If the service is at capacity, load shedding rejects low-priority requests immediately with 503 responses. This prevents the service from being overwhelmed by traffic it cannot handle.

Layer 2: Bulkheads (internal resources). Accepted requests are routed through isolated resource pools. Each downstream dependency gets its own thread pool or connection pool. If one dependency becomes slow, only its dedicated resources are consumed.

Layer 3: Circuit breakers (outbound calls). Each call to a downstream dependency passes through a circuit breaker. If the dependency is failing, the breaker returns a fallback immediately rather than waiting for timeouts.

A Concrete Production Stack

Kubernetes plus a service mesh provides many of these patterns automatically:

  • Load shedding — Envoy proxy (the Istio sidecar) provides rate limiting at the mesh level. Kubernetes Horizontal Pod Autoscaler scales replicas, and pod resource limits prevent individual pods from consuming all node resources.
  • Bulkheads — separate Kubernetes deployments for different service components. Container resource limits (CPU, memory) prevent one container from starving others. Separate connection pools in the application code.
  • Circuit breakers — application-level libraries (Resilience4j for Java, Polly for .NET, or the now-legacy Hystrix) or Envoy circuit breaking at the mesh level. Application-level breakers give finer control over fallback behavior.

The Testing Imperative

Resilience patterns are useless if they are never tested. Chaos engineering validates that the patterns work under real failure conditions:

  • Inject slow responses — verify circuit breakers trip and fallbacks activate.
  • Kill downstream instances — verify the service continues with degraded functionality.
  • Generate traffic spikes — verify load shedding activates at the right threshold and sheds the right priority levels.
  • Exhaust thread pools — verify bulkheads contain the failure to the affected pool.

Netflix's Chaos Monkey kills random instances in production. Gremlin and LitmusChaos provide controlled fault injection. These tools prove that resilience patterns work before real incidents test them.

The key principle of chaos engineering is to start small and expand scope gradually. Begin by killing a single instance in a staging environment. Once you are confident in your resilience patterns, move to production with blast radius limits (affect only 1% of traffic). Only after validating at small scale should you test larger failure scenarios like full availability zone outages or complete dependency failures.

Observability for Resilience

Resilience patterns without observability are blind defenses. You need to know when patterns activate, how often they trigger, and whether they are configured correctly. Key metrics to track:

  • Circuit breaker state per dependency — dashboard showing current state (Closed/Open/Half-Open) for every downstream call. Alert on any breaker opening.
  • Bulkhead utilization per pool — percentage of threads or connections in use for each isolated pool. Alert when any pool exceeds 80% sustained utilization.
  • Load shedding rate by priority tier — how many requests per second are being shed at each priority level. This tells you whether the shedding thresholds are calibrated correctly and whether high-priority traffic is being protected.
  • Fallback invocation rate — how often cached or default responses are served instead of live data. A sustained high fallback rate indicates a dependency issue even if the circuit breaker has recovered.

Common Anti-Patterns

  • Retry storms — retries without circuit breakers amplify load on a failing dependency. Always pair retries with circuit breakers and exponential backoff.
  • Infinite queues — unbounded request queues convert traffic spikes into memory exhaustion. Always use bounded queues with explicit rejection when full.
  • Shared everything — a single thread pool, a single connection pool, a single queue for all request types. Any slow path can consume everything. Isolate by dependency and by priority.
  • Generous timeouts — 30-second default timeouts on downstream calls mean 30 seconds of thread consumption per failing request. Use aggressive timeouts (1-5 seconds) combined with circuit breakers.
  • Testing only in staging — staging environments rarely replicate production traffic patterns, data volumes, or dependency behavior. Resilience patterns must be tested in production with controlled blast radius (chaos engineering) to validate they work under real conditions.
  • Set and forget — resilience configuration needs regular tuning as traffic patterns, dependency behavior, and system capacity change. Review circuit breaker thresholds, bulkhead sizes, and load shedding tiers quarterly or after significant architecture changes.