Hot Key Mitigation, Request Coalescing, Stampede Protection
Distributed caches like Redis Cluster or Memcached spread keys across nodes using consistent hashing. The assumption is that traffic distributes roughly evenly. A hot key breaks that assumption — one key receives so many requests that the single node responsible for it becomes the bottleneck while other nodes sit idle.

Why Hot Keys Happen
Real-world access patterns follow power-law distributions (Zipf's law). A small number of keys receive a disproportionately large share of traffic. Examples:
- Celebrity profile: A user with 50M followers goes live. Their profile key (user:9999) gets millions of reads per second. Consistent hashing routes all of them to one node.
- Viral content: A tweet or video goes viral overnight. The cache key for that content receives 10,000x the traffic of an average key.
- Flash sale product: An e-commerce flash sale starts and every visitor hits product:12345 for price and stock level.
- Config/feature flags: A global configuration key read by every request on every server becomes a hot key by default.
The Cascading Failure Risk
When a hot key overloads its node, the consequences escalate:
- Latency spike — Requests for the hot key queue up. Response times increase for that key and every other key on the same node (noisy neighbor effect).
- Node failure — If the load exceeds what the node can handle, it crashes or becomes unresponsive.
- Cascade — The cluster redistributes the hot key to another node. That node immediately gets the same traffic and may also crash. This can cascade across the entire cluster.
The key insight is that hot keys are not just a performance problem — they are a reliability problem. A single hot key can take down an entire cache cluster if unmitigated.
Detecting Hot Keys
Before you can fix hot keys, you need to find them:
Redis 4.0+ provides redis-cli --hotkeys, which approximates per-key access frequency using the LFU eviction policy's counters (the server must be running with an LFU maxmemory-policy such as allkeys-lfu). For production systems, instrument your cache client to sample request counts per key and alert when any key exceeds a threshold (e.g., 10,000 requests/minute); this catches hot keys before they cause outages.
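Client-side sampling can be sketched as follows. The HotKeyDetector class, its defaults, and the threshold scaling are illustrative assumptions, not a standard API:

```python
import random
import time
from collections import Counter

class HotKeyDetector:
    """Samples cache requests client-side and flags keys whose sampled
    request rate exceeds a per-window threshold (names are illustrative)."""

    def __init__(self, sample_rate=0.01, threshold_per_window=10_000, window_secs=60):
        self.sample_rate = sample_rate
        # Scale the alert threshold down to the sampled count.
        self.threshold = threshold_per_window * sample_rate
        self.window_secs = window_secs
        self.counts = Counter()
        self.window_start = time.monotonic()

    def record(self, key):
        """Call on every cache request; returns True when the key looks hot."""
        if random.random() >= self.sample_rate:
            return False  # not sampled: near-zero overhead on the hot path
        now = time.monotonic()
        if now - self.window_start > self.window_secs:
            self.counts.clear()  # roll over to a fresh window
            self.window_start = now
        self.counts[key] += 1
        return self.counts[key] > self.threshold
```

Sampling at 1% keeps the hot-path overhead negligible while still surfacing any key that dominates traffic; a flagged key would feed the alerting and mitigation steps described below.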
Hot keys are inevitable in any system with user-generated content. You cannot predict which tweet goes viral or which product a celebrity endorses. The question is not whether hot keys will occur, but whether your system detects and mitigates them automatically before they cascade into outages.
There is no single solution to hot keys — each mitigation trades something (consistency, complexity, memory) for load distribution. The right approach depends on whether the hot key is predictable or sudden.
Key Splitting (Sharded Replicas)
The most direct approach: split one logical key into N physical keys across different nodes.
With 4 replicas, the hot key's traffic splits across 4 nodes instead of 1. The trade-off is write amplification — every update must propagate to all N replicas. This works well for read-heavy hot keys (the common case) where writes are infrequent.
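A minimal sketch of the idea, assuming a dict-like cache client (the replica_keys helper and the #shardN suffix scheme are illustrative, not a standard convention):

```python
import random

N_REPLICAS = 4  # number of physical copies per hot key (illustrative)

def replica_keys(key, n=N_REPLICAS):
    """Derive the N physical key names behind one logical hot key."""
    return [f"{key}#shard{i}" for i in range(n)]

def write_hot_key(cache, key, value):
    """Write amplification: every update propagates to all N replicas."""
    for physical_key in replica_keys(key):
        cache[physical_key] = value

def read_hot_key(cache, key):
    """Each reader picks one replica at random, spreading read load ~evenly."""
    return cache[random.choice(replica_keys(key))]
```

Because the suffixed keys hash to different slots, consistent hashing places them on different nodes, which is what splits the read load.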
L1 Local Caching
Push hot data into each application server's process memory. The distributed cache becomes L2, checked only on local miss.
With 50 application servers and a 2-second L1 TTL, the hot key goes from 200,000 reads/sec on one Redis node to roughly 25 reads/sec (each server refreshes once every 2 seconds). The trade-off is staleness — data can be up to 2 seconds old. L1 local caching is the fastest emergency response to a hot key incident because it requires no infrastructure changes — just a code deploy. Keep the TTL short (1-5 seconds) to limit staleness. For truly critical real-time data, consider pub/sub invalidation instead of TTL-based expiration.
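One possible shape for such an L1 layer, assuming the distributed cache is abstracted behind a fetch_l2 callable (both names are illustrative):

```python
import time

class L1Cache:
    """Tiny per-process cache in front of a distributed L2; fetch_l2 stands
    in for the network lookup to Redis/Memcached."""

    def __init__(self, fetch_l2, ttl_secs=2.0):
        self.fetch_l2 = fetch_l2
        self.ttl = ttl_secs
        self.store = {}  # key -> (value, expires_at)

    def get(self, key):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]              # L1 hit: no network round trip
        value = self.fetch_l2(key)       # L1 miss: fall through to shared cache
        self.store[key] = (value, now + self.ttl)
        return value
```

With a 2-second TTL, each process refreshes a hot key at most once every 2 seconds regardless of request volume, which is where the 200,000-to-25 reads/sec reduction comes from.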
Read Replicas
Some cache systems support read replicas natively. Redis Cluster allows configuring replica nodes for read traffic with READONLY mode. Clients read from replicas, spreading load across multiple nodes. This is transparent to application code but requires cluster configuration and adds replication lag.
Monitoring and Auto-Detection
Production hot key mitigation should be automatic, not manual:
- Detect: Instrument cache clients to track per-key request rates. Alert when any key exceeds a threshold.
- React: Automatically enable L1 caching or key splitting for detected hot keys.
- Recover: Remove mitigations when traffic normalizes to avoid unnecessary complexity.
The detection threshold depends on your node capacity. If each Redis node handles 100,000 ops/sec comfortably, alert when any single key exceeds 50,000 ops/sec (50% of capacity).
When a cache miss occurs, the naive approach is for every request to independently fetch from the database. If 100 requests arrive simultaneously for the same missing key, 100 identical database queries execute. Request coalescing ensures only one query runs while the others wait for its result.

Single-Process Coalescing (Singleflight)
The simplest form: within one application server, track in-flight operations and have duplicate requests wait for the first one to complete.
In Go, the golang.org/x/sync/singleflight package provides singleflight.Group as a drop-in solution. In Java, use ConcurrentHashMap.computeIfAbsent with a Future value.
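A minimal Python equivalent can be hand-rolled with threading primitives (Singleflight here is an illustrative sketch, not a library class):

```python
import threading

class Singleflight:
    """Per-process request coalescing: concurrent callers for the same key
    share one execution of fn; only the first caller actually runs it."""

    def __init__(self):
        self._mu = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def do(self, key, fn):
        with self._mu:
            entry = self._inflight.get(key)
            if entry is None:
                done, holder = threading.Event(), {}
                self._inflight[key] = (done, holder)
            else:
                done, holder = entry
        if entry is None:                    # leader: run fn exactly once
            try:
                holder["value"] = fn()
            except BaseException as exc:     # propagate failures to waiters
                holder["error"] = exc
            finally:
                with self._mu:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()                      # follower: wait for the leader
        if "error" in holder:
            raise holder["error"]
        return holder["value"]
```

Deleting the in-flight entry before signaling ensures a request arriving after completion triggers a fresh load rather than receiving a stale result.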
Distributed Coalescing (Cross-Node)
Single-process coalescing solves the problem within one server, but 50 application servers hitting a cache miss simultaneously still produce 50 database queries. Distributed coalescing uses a shared lock:
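A sketch of the pattern, assuming a client with redis-py-style get/set/delete semantics, where set(..., nx=True, ex=ttl) succeeds only for the lock winner (load_with_lock and the lock:{key} naming are illustrative):

```python
import time

LOCK_TTL_SECS = 5    # lock auto-expires so a crashed loader cannot block others
VALUE_TTL_SECS = 300

def load_with_lock(cache, key, loader, retry_delay=0.01, max_wait=10.0):
    """Cross-node coalescing: the instance that wins lock:{key} queries the
    database; everyone else polls the cache until the value appears."""
    lock_key = f"lock:{key}"
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        value = cache.get(key)
        if value is not None:
            return value                 # someone else finished loading
        if cache.set(lock_key, "1", nx=True, ex=LOCK_TTL_SECS):
            try:
                value = loader()         # the single database query
                cache.set(key, value, ex=VALUE_TTL_SECS)
                return value
            finally:
                cache.delete(lock_key)   # release promptly; TTL is the backstop
        time.sleep(retry_delay)          # lost the race: wait, then re-check
    return loader()                      # fallback: query the DB directly
```

The explicit fallback on timeout matters: if the lock holder hangs, waiters eventually load the value themselves instead of blocking forever.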
The lock TTL is critical. If the lock holder crashes or hangs, a lock without a TTL blocks all other instances from loading that key indefinitely; with a TTL, the lock auto-expires and another instance can take over. Set the TTL longer than the expected database query time but short enough that a failure is detected quickly, typically 2-10 seconds.
A cache stampede (also called thundering herd) occurs when a popular cache key expires and many concurrent requests simultaneously discover the miss, all independently querying the database to recompute the value. The database gets overwhelmed, queries time out, and the cache remains empty because no query completes successfully — a self-reinforcing failure cycle.

The Failure Cycle
The stampede becomes self-reinforcing:
- Popular key expires (TTL reached)
- Hundreds of concurrent requests discover cache miss
- All independently query the database
- Database overwhelmed — queries slow down or time out
- No query completes successfully to repopulate the cache
- New requests arrive, still find cache empty, send more queries
- Database load increases further — cascading failure
Mutex Lock Pattern (Dogpile Lock)
The most common prevention: use a distributed lock so only one requester recomputes the value while others wait.
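The pattern, sketched with an in-process lock and a dict-like cache for brevity (the same double-check applies unchanged when the lock is a distributed one):

```python
import threading

# Production code would use one lock per key; a single lock suffices here.
_recompute_lock = threading.Lock()

def get_or_recompute(cache, key, compute):
    """Dogpile lock: only one caller recomputes an expired/missing value."""
    value = cache.get(key)
    if value is not None:
        return value                  # fast path: cache hit, no lock taken
    with _recompute_lock:
        value = cache.get(key)        # double-check: another thread may have
        if value is not None:         # repopulated the key while we waited
            return value
        value = compute()             # exactly one expensive recomputation
        cache[key] = value
        return value
```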
The double-check after acquiring the lock prevents redundant work when multiple threads race past the initial cache check.
Stale-While-Revalidate
Instead of blocking requests while recomputing, serve the stale (expired) value immediately and refresh in the background:
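One possible sketch, assuming recompute(key) is the expensive source-of-truth fetch (SWRCache and its internals are illustrative names):

```python
import threading
import time

class SWRCache:
    """Stale-while-revalidate: expired entries are served immediately while
    a single background thread refreshes them."""

    def __init__(self, recompute, ttl_secs=300.0):
        self.recompute = recompute
        self.ttl = ttl_secs
        self.store = {}          # key -> (value, expires_at)
        self.refreshing = set()  # keys with an in-flight background refresh
        self.lock = threading.Lock()

    def get(self, key):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry is None:
            value = self.recompute(key)   # only the very first request blocks
            self.store[key] = (value, now + self.ttl)
            return value
        value, expires_at = entry
        if expires_at <= now:
            with self.lock:
                should_start = key not in self.refreshing
                if should_start:
                    self.refreshing.add(key)
            if should_start:
                threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
        return value                      # stale or fresh, never blocks here

    def _refresh(self, key):
        try:
            self.store[key] = (self.recompute(key), time.monotonic() + self.ttl)
        finally:
            with self.lock:
                self.refreshing.discard(key)
```

The refreshing set doubles as a coalescer: many requests may see the same stale entry, but only one background refresh runs per key at a time.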
This eliminates user-facing latency spikes entirely. Users always get a response immediately — either fresh data or slightly stale data while the background refresh runs.
TTL Jitter
If all keys are set with the same TTL (e.g., 300 seconds), keys created at the same time expire at the same time, causing a synchronized stampede. Adding random jitter spreads expirations:
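For example (function name and constants are illustrative):

```python
import random

BASE_TTL_SECS = 300
JITTER = 0.10  # ±10%

def jittered_ttl(base=BASE_TTL_SECS, jitter=JITTER):
    """Return a TTL drawn uniformly from [base*(1-jitter), base*(1+jitter)]."""
    return base * random.uniform(1.0 - jitter, 1.0 + jitter)

# Usage with a redis-py-style client:
# cache.set(key, value, ex=int(jittered_ttl()))
```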
A 10% jitter on a 300-second TTL spreads expirations across a 60-second window (270-330s). This prevents the "thundering herd at TTL boundary" scenario where thousands of keys expire simultaneously.
Probabilistic Early Expiration (XFetch)
A more sophisticated approach: instead of waiting for TTL to expire, each request has a small probability of triggering a refresh before expiration. As the key approaches its TTL, the probability increases:
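A sketch of the XFetch decision. The check refreshes when now - compute_time * beta * log(rand) >= expires_at; since -log(rand) is an exponential sample, expensive values refresh earlier on average (the function name and parameters here are illustrative):

```python
import math
import random
import time

BETA = 1.0  # >1 refreshes earlier (more aggressive), <1 later

def should_refresh_early(expires_at, compute_time, beta=BETA, now=None):
    """XFetch check: probabilistically trigger a refresh before TTL expiry."""
    if now is None:
        now = time.monotonic()
    # 1 - random() lies in (0, 1], avoiding log(0).
    offset = -compute_time * beta * math.log(1.0 - random.random())
    return now + offset >= expires_at
```

With beta = 1, the expected early-refresh lead time equals compute_time, so a value that takes 10 seconds to compute starts refreshing about 10 seconds before expiry on average.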
XFetch naturally adapts: keys with expensive compute functions (high compute_time) are refreshed earlier because the cost of a stampede is higher. The beta parameter controls aggressiveness.
Start with the simplest approach that solves your problem. TTL jitter prevents synchronized expirations (nearly free to implement). Mutex locks prevent stampedes on individual keys. Stale-while-revalidate eliminates latency spikes entirely. Probabilistic early expiration (XFetch) is for extreme scale where even a single synchronous recomputation is too expensive.
Each mitigation technique trades one property for another. Choosing the right combination depends on your specific access patterns, consistency requirements, and failure tolerance.
Trade-off Summary
| Technique | Prevents | Cost | Best For |
| --- | --- | --- | --- |
| Key splitting | Hot key overload | Write amplification, eventual consistency | Read-heavy hot keys |
| L1 local cache | Hot key overload | Staleness (TTL-bounded) | Global config, feature flags |
| Singleflight | Duplicate DB queries (per-node) | Memory for in-flight map | All cache miss scenarios |
| Distributed lock | Duplicate DB queries (cross-node) | Lock contention, complexity | High-concurrency cache misses |
| TTL jitter | Synchronized expirations | Slightly unpredictable TTLs | All cached data (default practice) |
| Stale-while-revalidate | User-facing latency spikes | Serving stale data briefly | Latency-sensitive applications |
| XFetch | Stampede via early refresh | Background compute overhead | Expensive recomputations |
Layered Defense Strategy
In production, these techniques are combined in layers:
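One way the layers compose, sketched with a dict for L1 and illustrative callables for L2 and the database (coalescing locks around db_load are omitted for brevity):

```python
import random
import time

def layered_get(key, l1, l2_get, l2_set, db_load, l1_ttl=2.0, base_ttl=300):
    """Layered lookup: process-local L1, then shared L2, then the database,
    with TTL jitter applied on the L2 write."""
    now = time.monotonic()
    entry = l1.get(key)
    if entry is not None and entry[1] > now:
        return entry[0]                      # layer 1: local hit, no network
    value = l2_get(key)
    if value is None:
        value = db_load(key)                 # layer 3: source of truth
        l2_set(key, value, int(base_ttl * random.uniform(0.9, 1.1)))  # jitter
    l1[key] = (value, now + l1_ttl)          # refill the short-lived L1
    return value
```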
Not every system needs all layers. Start with TTL jitter and singleflight (cheap, easy). Add L1 caching and distributed locks when hot keys or stampedes are observed. Add stale-while-revalidate and XFetch only for specific high-traffic, latency-sensitive paths.
Common Mistakes
Mistake 1: Lock without TTL — A distributed recomputation lock without a TTL creates a permanent block if the lock holder crashes. Always set a TTL longer than expected compute time.
Mistake 2: L1 TTL too long — Longer L1 TTLs improve the hit rate but increase staleness. A 60-second L1 TTL on frequently changing data means users see data that is up to a minute old, which is unacceptable for things like inventory counts or prices.
Mistake 3: Ignoring write amplification — Key splitting with 8 replicas means every write goes to 8 nodes. For write-heavy keys, this multiplies write load by 8x. Only split read-heavy keys.
Mistake 4: No fallback path — If your coalescing lock holder fails and waiters have no fallback, they hang indefinitely. Always include a fallback (retry with direct DB query after timeout).
The most effective hot key mitigation is often the simplest. L1 local caching with a 2-second TTL reduces Redis load by 99.9% for hot keys with zero infrastructure changes. TTL jitter prevents synchronized stampedes with 3 lines of code. Save the complex solutions (XFetch, dynamic key splitting) for the rare cases where simple approaches are insufficient.