Hot Key Mitigation, Request Coalescing, Stampede Protection

Topics Covered

Hot Keys: Skewed Access Patterns in Caches

Why Hot Keys Happen

The Cascading Failure Risk

Detecting Hot Keys

Mitigating Hot Keys

Key Splitting (Sharded Replicas)

L1 Local Caching

Read Replicas

Monitoring and Auto-Detection

Request Coalescing: Avoiding Duplicate Work

Single-Process Coalescing (Singleflight)

Distributed Coalescing (Cross-Node)

Cache Stampede and Thundering Herd Problems

The Failure Cycle

Mutex Lock Pattern (Dogpile Lock)

Stale-While-Revalidate

TTL Jitter

Probabilistic Early Expiration (XFetch)

Design Trade-offs and Best Practices

Trade-off Summary

Layered Defense Strategy

Common Mistakes

Distributed caches like Redis Cluster or Memcached spread keys across nodes using consistent hashing. The assumption is that traffic distributes roughly evenly. A hot key breaks that assumption — one key receives so many requests that the single node responsible for it becomes the bottleneck while other nodes sit idle.

[Figure: Hot key causing single shard overload in a distributed cache cluster]

Why Hot Keys Happen

Real-world access patterns follow power-law distributions (Zipf's law). A small number of keys receive a disproportionately large share of traffic. Examples:

  • Celebrity profile: A user with 50M followers goes live. Their profile key (user:9999) gets millions of reads per second. Consistent hashing routes all of them to one node.
  • Viral content: A tweet or video goes viral overnight. The cache key for that content receives 10,000x the traffic of an average key.
  • Flash sale product: An e-commerce flash sale starts and every visitor hits product:12345 for price and stock level.
  • Config/feature flags: A global configuration key read by every request on every server becomes a hot key by default.

The Cascading Failure Risk

When a hot key overloads its node, the consequences escalate:

  1. Latency spike — Requests for the hot key queue up. Response times increase for that key and every other key on the same node (noisy neighbor effect).
  2. Node failure — If the load exceeds what the node can handle, it crashes or becomes unresponsive.
  3. Cascade — The cluster redistributes the hot key to another node. That node immediately gets the same traffic and may also crash. This can cascade across the entire cluster.

The key insight is that hot keys are not just a performance problem — they are a reliability problem. A single hot key can take down an entire cache cluster if unmitigated.

Detecting Hot Keys

Before you can fix hot keys, you need to find them:

```python
# Redis: sample the keyspace for hot keys
# redis-cli --hotkeys (uses LFU approximation)
# Or track programmatically:

from collections import Counter
import time

class HotKeyDetector:
    def __init__(self, window_sec=60, threshold=10000):
        self.counts = Counter()
        self.window_sec = window_sec
        self.threshold = threshold
        self.last_reset = time.time()

    def record(self, key):
        now = time.time()
        if now - self.last_reset > self.window_sec:
            self.counts.clear()
            self.last_reset = now
        self.counts[key] += 1

    def get_hot_keys(self):
        return {k: v for k, v in self.counts.items()
                if v > self.threshold}
```

Redis 4.0+ supports redis-cli --hotkeys, which approximates per-key access frequency using the LFU eviction counters (it requires an LFU maxmemory-policy). For production systems, instrument your cache client to sample request counts per key and alert when any key exceeds a threshold (e.g., 10,000 requests/minute); this catches hot keys before they cause outages.

Key Insight

Hot keys are inevitable in any system with user-generated content. You cannot predict which tweet goes viral or which product a celebrity endorses. The question is not whether hot keys will occur, but whether your system detects and mitigates them automatically before they cascade into outages.

There is no single solution to hot keys — each mitigation trades something (consistency, complexity, memory) for load distribution. The right approach depends on whether the hot key is predictable or sudden.

Key Splitting (Sharded Replicas)

The most direct approach: split one logical key into N physical keys across different nodes.

```python
import hashlib

NUM_REPLICAS = 4

def get_split_key(base_key, request_id):
    """Route each request to one of N replica keys."""
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % NUM_REPLICAS
    return f"{base_key}:{bucket}"

# Read: pick a replica deterministically from the request ID
physical_key = get_split_key("product:12345", request_id)
value = cache.get(physical_key)

# Write: update ALL replicas
def update_split_key(base_key, value, ttl=300):
    for i in range(NUM_REPLICAS):
        cache.set(f"{base_key}:{i}", value, ttl=ttl)
```

With 4 replicas, the hot key's traffic splits across 4 nodes instead of 1. The trade-off is write amplification — every update must propagate to all N replicas. This works well for read-heavy hot keys (the common case) where writes are infrequent.

L1 Local Caching

Push hot data into each application server's process memory. The distributed cache becomes L2, checked only on local miss.

```python
import time

class L1Cache:
    def __init__(self, max_size=1000, ttl_sec=2):
        self._cache = {}
        self._max_size = max_size
        self._ttl = ttl_sec

    def get(self, key):
        entry = self._cache.get(key)
        if entry and time.time() - entry["ts"] < self._ttl:
            return entry["value"]
        return None

    def set(self, key, value):
        if len(self._cache) >= self._max_size and key not in self._cache:
            # Evict the oldest entry to bound memory use
            oldest = min(self._cache, key=lambda k: self._cache[k]["ts"])
            del self._cache[oldest]
        self._cache[key] = {"value": value, "ts": time.time()}

l1 = L1Cache(ttl_sec=2)

def get_with_l1(key):
    # Check L1 first (in-process memory)
    val = l1.get(key)
    if val is not None:
        return val  # L1 hit, no network call

    # L1 miss -> check L2 (Redis)
    val = redis.get(key)
    if val is not None:
        l1.set(key, val)
        return val

    # L2 miss -> database
    val = db.fetch(key)
    redis.set(key, val, ex=300)
    l1.set(key, val)
    return val
```

With 50 application servers and a 2-second L1 TTL, the hot key goes from 200,000 reads/sec on one Redis node to roughly 25 reads/sec (each server refreshes once every 2 seconds). The trade-off is staleness — data can be up to 2 seconds old. L1 local caching is the fastest emergency response to a hot key incident because it requires no infrastructure changes — just a code deploy. Keep the TTL short (1-5 seconds) to limit staleness. For truly critical real-time data, consider pub/sub invalidation instead of TTL-based expiration.
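The pub/sub invalidation mentioned above can be sketched as follows. This is an illustrative pattern, not any particular library's API: writers publish the changed key on an invalidation channel, and each server evicts that key from its L1 on receipt. The `InvalidatingL1` class and the channel name are assumptions for the sketch; with a real Redis client, the subscriber loop in the trailing comment would run in a background thread.

```python
import time

class InvalidatingL1:
    """L1 cache whose entries can be evicted by invalidation messages."""
    def __init__(self, ttl_sec=2):
        self._cache = {}
        self._ttl = ttl_sec

    def get(self, key):
        entry = self._cache.get(key)
        if entry and time.time() - entry["ts"] < self._ttl:
            return entry["value"]
        return None

    def set(self, key, value):
        self._cache[key] = {"value": value, "ts": time.time()}

    def handle_invalidation(self, key):
        # Called for each message received on the invalidation channel
        self._cache.pop(key, None)

# With a real Redis client, a background thread would drive this loop:
#   pubsub = redis.pubsub()
#   pubsub.subscribe("l1-invalidate")
#   for msg in pubsub.listen():
#       if msg["type"] == "message":
#           l1.handle_invalidation(msg["data"].decode())
# and writers call redis.publish("l1-invalidate", key) after each update.
```

This keeps the short L1 TTL as a safety net while letting writes propagate in well under a second.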

Read Replicas

Some cache systems support read replicas natively. Redis Cluster allows configuring replica nodes for read traffic with READONLY mode. Clients read from replicas, spreading load across multiple nodes. This is transparent to application code but requires cluster configuration and adds replication lag.

Monitoring and Auto-Detection

Production hot key mitigation should be automatic, not manual:

  1. Detect: Instrument cache clients to track per-key request rates. Alert when any key exceeds a threshold.
  2. React: Automatically enable L1 caching or key splitting for detected hot keys.
  3. Recover: Remove mitigations when traffic normalizes to avoid unnecessary complexity.

The detection threshold depends on your node capacity. If each Redis node handles 100,000 ops/sec comfortably, alert when any single key exceeds 50,000 ops/sec (50% of capacity).
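The detect/react/recover loop can be sketched in one class. The names here are illustrative assumptions, not a library API: counting mirrors the HotKeyDetector shown earlier, and "reacting" is modeled as flagging a key so the read path routes it through L1.

```python
import time
from collections import Counter

class AutoMitigator:
    """Detect hot keys and flag them for short-TTL local caching."""
    def __init__(self, threshold=10000, window_sec=60):
        self.counts = Counter()
        self.window_sec = window_sec
        self.threshold = threshold
        self.last_reset = time.time()
        self.mitigated = set()  # keys currently served from L1

    def record(self, key):
        now = time.time()
        if now - self.last_reset > self.window_sec:
            # Recover: keep mitigations only for keys still hot last window
            self.mitigated = {k for k in self.mitigated
                              if self.counts[k] > self.threshold}
            self.counts.clear()
            self.last_reset = now
        self.counts[key] += 1
        if self.counts[key] > self.threshold:
            self.mitigated.add(key)  # React: enable L1 for this key

    def use_l1(self, key):
        return key in self.mitigated
```

The read path would then consult `use_l1(key)` to decide whether to check the in-process cache before Redis.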

When a cache miss occurs, the naive approach is for every request to independently fetch from the database. If 100 requests arrive simultaneously for the same missing key, 100 identical database queries execute. Request coalescing ensures only one query runs while the others wait for its result.

[Figure: Request coalescing (singleflight) with one loader and multiple waiters]

Single-Process Coalescing (Singleflight)

The simplest form: within one application server, track in-flight operations and have duplicate requests wait for the first one to complete.

```python
import threading

class _Call:
    """One in-flight fetch; waiters hold a reference to read the result."""
    def __init__(self):
        self.event = threading.Event()
        self.result = None

class Singleflight:
    def __init__(self):
        self._in_flight = {}  # key -> _Call
        self._lock = threading.Lock()

    def do(self, key, fetch_fn):
        with self._lock:
            call = self._in_flight.get(key)
            if call is not None:
                is_loader = False
            else:
                # This thread becomes the loader
                call = _Call()
                self._in_flight[key] = call
                is_loader = True

        if not is_loader:
            # Wait (outside the lock) for the loader to finish
            call.event.wait()
            return call.result  # None if the loader's fetch raised

        try:
            call.result = fetch_fn()
            return call.result
        finally:
            # Remove the entry first, then wake waiters; they still hold
            # a reference to `call`, so the result is never lost
            with self._lock:
                self._in_flight.pop(key, None)
            call.event.set()

# Usage
sf = Singleflight()

def get_user(user_id):
    cached = cache.get(f"user:{user_id}")
    if cached:
        return cached

    def fetch():
        data = db.query("SELECT * FROM users WHERE id = ?", user_id)
        cache.set(f"user:{user_id}", data, ttl=300)
        return data

    return sf.do(f"user:{user_id}", fetch)
```

In Go, the golang.org/x/sync/singleflight package (an official extension of the standard library) provides singleflight.Group as a drop-in solution. In Java, use ConcurrentHashMap.computeIfAbsent with a Future value.

Distributed Coalescing (Cross-Node)

Single-process coalescing solves the problem within one server, but 50 application servers hitting a cache miss simultaneously still produce 50 database queries. Distributed coalescing uses a shared lock:

```python
import time

def get_or_load_distributed(key):
    # 1. Check cache
    val = cache.get(key)
    if val is not None:
        return val

    # 2. Try to acquire a distributed lock
    lock_key = f"lock:{key}"
    if cache.set(lock_key, "1", nx=True, ex=5):  # SET NX with 5s TTL
        try:
            # This instance is the loader
            data = db.fetch(key)
            cache.set(key, data, ex=300)
            return data
        finally:
            cache.delete(lock_key)
    else:
        # Another instance is loading -- wait and retry
        time.sleep(0.05)  # 50ms backoff
        val = cache.get(key)
        if val is not None:
            return val
        # Fallback: load from DB (lock holder may have failed)
        return db.fetch(key)
```

The lock TTL (5 seconds) is critical — if the lock holder crashes, the lock auto-expires and another instance can take over. Without the TTL, a crashed loader permanently blocks all other instances from loading that key.

Common Pitfall

Always set a TTL on distributed coalescing locks. If the lock holder crashes or hangs, a lock without TTL blocks all other instances from loading that key indefinitely. The TTL should be longer than the expected database query time but short enough that a failure is detected quickly — typically 2-10 seconds.
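A related hazard: releasing the lock with a plain DELETE can remove another instance's lock if your own lock expired mid-load. A common fix, sketched below, is to store a unique token as the lock value and delete only when the token still matches. The `FakeCache` class is a dict-backed stand-in for a real Redis client, used here only so the sketch is self-contained; with real Redis, the check-and-delete in `release_lock` should be a single Lua script so it executes atomically.

```python
import uuid

class FakeCache:
    """Minimal dict-backed stand-in for a Redis client (illustration only)."""
    def __init__(self):
        self._data = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

def acquire_lock(cache, lock_key, ttl=5):
    """Return a unique token on success, None if someone else holds the lock."""
    token = uuid.uuid4().hex
    if cache.set(lock_key, token, nx=True, ex=ttl):
        return token
    return None

def release_lock(cache, lock_key, token):
    """Delete the lock only if we still own it (atomic via Lua in real Redis)."""
    if cache.get(lock_key) == token:
        cache.delete(lock_key)
        return True
    return False  # Our lock expired and was re-acquired by another instance
```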

A cache stampede (also called thundering herd) occurs when a popular cache key expires and many concurrent requests simultaneously discover the miss, all independently querying the database to recompute the value. The database gets overwhelmed, queries time out, and the cache remains empty because no query completes successfully — a self-reinforcing failure cycle.

[Figure: Cache stampede (thundering herd): expired TTL leads to database overload]

The Failure Cycle

The stampede becomes self-reinforcing:

  1. Popular key expires (TTL reached)
  2. Hundreds of concurrent requests discover cache miss
  3. All independently query the database
  4. Database overwhelmed — queries slow down or timeout
  5. No query completes successfully to repopulate the cache
  6. New requests arrive, still find cache empty, send more queries
  7. Database load increases further — cascading failure

Mutex Lock Pattern (Dogpile Lock)

The most common prevention: use a distributed lock so only one requester recomputes the value while others wait.

```python
import time

def get_with_stampede_protection(key, compute_fn, ttl=300):
    # 1. Try cache
    val = cache.get(key)
    if val is not None:
        return val

    # 2. Try to acquire the recomputation lock
    lock_key = f"recompute:{key}"
    if cache.set(lock_key, "1", nx=True, ex=10):
        try:
            # Double-check cache (another thread may have filled it)
            val = cache.get(key)
            if val is not None:
                return val

            # 3. Recompute the value
            data = compute_fn()
            cache.set(key, data, ex=ttl)
            return data
        finally:
            cache.delete(lock_key)
    else:
        # 4. Another requester is recomputing -- wait briefly
        time.sleep(0.1)
        val = cache.get(key)
        if val is not None:
            return val
        # Fallback: compute directly (lock holder may have failed)
        return compute_fn()
```

The double-check after acquiring the lock prevents redundant work when multiple threads race past the initial cache check.

Stale-While-Revalidate

Instead of blocking requests while recomputing, serve the stale (expired) value immediately and refresh in the background:

```python
import time

def get_with_stale_while_revalidate(key, compute_fn, ttl=300, stale_ttl=600):
    """
    Store the value with two TTLs:
    - ttl: how long the value is considered 'fresh'
    - stale_ttl: when the value is actually deleted from the cache
    Between ttl and stale_ttl, serve stale data while refreshing.
    """
    entry = cache.get(key)  # Returns {"value": ..., "created_at": ...}
    if entry is not None:
        age = time.time() - entry["created_at"]
        if age < ttl:
            return entry["value"]  # Fresh -- return immediately

        if age < stale_ttl:
            # Stale but usable -- trigger a background refresh
            # (trigger_async_refresh is defined elsewhere, e.g. it
            # submits the recompute to a background worker)
            trigger_async_refresh(key, compute_fn, ttl, stale_ttl)
            return entry["value"]  # Return stale data now

    # Cache miss or truly expired -- synchronous fetch
    data = compute_fn()
    cache.set(key, {"value": data, "created_at": time.time()}, ex=stale_ttl)
    return data
```

This eliminates user-facing latency spikes entirely. Users always get a response immediately — either fresh data or slightly stale data while the background refresh runs.

TTL Jitter

If all keys are set with the same TTL (e.g., 300 seconds), keys created at the same time expire at the same time, causing a synchronized stampede. Adding random jitter spreads expirations:

```python
import random

def set_with_jitter(key, value, base_ttl=300, jitter_pct=0.1):
    """Add random jitter to prevent synchronized expirations."""
    jitter = random.uniform(-jitter_pct, jitter_pct) * base_ttl
    actual_ttl = int(base_ttl + jitter)  # 270-330 seconds
    cache.set(key, value, ex=actual_ttl)
```

A 10% jitter on a 300-second TTL spreads expirations across a 60-second window (270-330s). This prevents the "thundering herd at TTL boundary" scenario where thousands of keys expire simultaneously.

Probabilistic Early Expiration (XFetch)

A more sophisticated approach: instead of waiting for TTL to expire, each request has a small probability of triggering a refresh before expiration. As the key approaches its TTL, the probability increases:

```python
import math
import random
import time

def xfetch(key, compute_fn, ttl=300, beta=1.0):
    entry = cache.get(key)  # {"value", "created_at", "compute_time"}
    if entry is not None:
        age = time.time() - entry["created_at"]
        remaining = ttl - age
        # math.log(random.random()) is negative, so -threshold is positive;
        # the closer `remaining` gets to zero, the likelier an early refresh
        threshold = entry["compute_time"] * beta * math.log(random.random())
        if remaining > -threshold:
            return entry["value"]  # Not yet time to refresh

    # Refresh needed (expired or probabilistic early refresh)
    start = time.time()
    data = compute_fn()
    compute_time = time.time() - start
    cache.set(key, {
        "value": data,
        "created_at": time.time(),
        "compute_time": compute_time
    }, ex=ttl)
    return data
```

XFetch naturally adapts: keys with expensive compute functions (high compute_time) are refreshed earlier because the cost of a stampede is higher. The beta parameter controls aggressiveness.

Interview Tip

Start with the simplest approach that solves your problem. TTL jitter prevents synchronized expirations (nearly free to implement). Mutex locks prevent stampedes on individual keys. Stale-while-revalidate eliminates latency spikes entirely. Probabilistic early expiration (XFetch) is for extreme scale where even a single synchronous recomputation is too expensive.

Each mitigation technique trades one property for another. Choosing the right combination depends on your specific access patterns, consistency requirements, and failure tolerance.

Trade-off Summary

| Technique | Prevents | Cost | Best For |
| --- | --- | --- | --- |
| Key splitting | Hot key overload | Write amplification, eventual consistency | Read-heavy hot keys |
| L1 local cache | Hot key overload | Staleness (TTL-bounded) | Global config, feature flags |
| Singleflight | Duplicate DB queries (per-node) | Memory for in-flight map | All cache miss scenarios |
| Distributed lock | Duplicate DB queries (cross-node) | Lock contention, complexity | High-concurrency cache misses |
| TTL jitter | Synchronized expirations | Slightly unpredictable TTLs | All cached data (default practice) |
| Stale-while-revalidate | User-facing latency spikes | Serving stale data briefly | Latency-sensitive applications |
| XFetch | Stampede via early refresh | Background compute overhead | Expensive recomputations |

Layered Defense Strategy

In production, these techniques are combined in layers:

 
Layer 1: TTL jitter on all keys (prevents synchronized expirations)
Layer 2: L1 local cache for known hot keys (absorbs read traffic)
Layer 3: Singleflight per application server (prevents per-node stampede)
Layer 4: Distributed lock for cache misses (prevents cross-node stampede)
Layer 5: Stale-while-revalidate for latency-critical paths
Layer 6: Auto-detection + dynamic key splitting for surprise hot keys

Not every system needs all layers. Start with TTL jitter and singleflight (cheap, easy). Add L1 caching and distributed locks when hot keys or stampedes are observed. Add stale-while-revalidate and XFetch only for specific high-traffic, latency-sensitive paths.

Common Mistakes

Mistake 1: Lock without TTL — A distributed recomputation lock without a TTL creates a permanent block if the lock holder crashes. Always set a TTL longer than expected compute time.

Mistake 2: L1 TTL too long — Long L1 TTLs maximize cache hit rate but increase staleness. A 60-second L1 TTL on frequently changing data means users see data that is up to a minute old, which is unacceptable for things like inventory counts or prices.

Mistake 3: Ignoring write amplification — Key splitting with 8 replicas means every write goes to 8 nodes. For write-heavy keys, this multiplies write load by 8x. Only split read-heavy keys.

Mistake 4: No fallback path — If your coalescing lock holder fails and waiters have no fallback, they hang indefinitely. Always include a fallback (retry with direct DB query after timeout).

Key Insight

The most effective hot key mitigation is often the simplest. L1 local caching with a 2-second TTL reduces Redis load by 99.9% for hot keys with zero infrastructure changes. TTL jitter prevents synchronized stampedes with 3 lines of code. Save the complex solutions (XFetch, dynamic key splitting) for the rare cases where simple approaches are insufficient.