Cache Stampede: When TTL Expiry Becomes a Self-Inflicted DDoS
January 10, 2026
The cache is doing its job for 99.9 percent of requests, and then one TTL ticks to zero and the database faceplants. That is a cache stampede, also called a dogpile or thundering herd, and it is one of the most common ways a healthy-looking system goes down on a busy afternoon.
The mechanics are simple enough to draw on a napkin. A hot key sits in Redis with a five minute TTL. Two thousand requests per second are hitting that key. At minute five, the key expires. Every request that arrives before the key is rebuilt misses, and at this rate that is two thousand of them in the first second. All of them fall through to the database. The database, which was sized for the cache hit rate, suddenly sees two thousand concurrent queries for the same row. The connection pool is exhausted. Queries queue. The clients time out, retry, and now the database sees four thousand. The application tier piles up threads waiting on connections. The service returns 503. The cache key, by the way, never gets repopulated, because every rebuild attempt times out.
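The failure above is easy to reproduce in miniature. The sketch below is only an illustration under stated assumptions: CountingDatabase, naive_get and simulate_expiry are hypothetical names, a plain dict stands in for Redis, and a sleep stands in for query latency. It shows a plain cache-aside read where every caller that misses goes to the database.

```python
import threading
import time


class CountingDatabase:
    """Hypothetical stand-in for a slow database that counts the queries it sees."""

    def __init__(self, latency_s=0.2):
        self.latency_s = latency_s
        self.queries = 0
        self._lock = threading.Lock()

    def load(self, key):
        with self._lock:
            self.queries += 1
        time.sleep(self.latency_s)  # pretend the query takes a while
        return f"value-for-{key}"


def naive_get(cache, db, key):
    """Plain cache-aside read: every caller that misses queries the database."""
    value = cache.get(key)
    if value is None:
        value = db.load(key)
        cache[key] = value
    return value


def simulate_expiry(concurrent_requests=50):
    """Return how many database queries one expiry causes under concurrent load."""
    cache = {}  # an empty dict models the instant after the TTL hits zero
    db = CountingDatabase()
    start = threading.Barrier(concurrent_requests)

    def worker():
        start.wait()  # release every request at the same moment
        naive_get(cache, db, "catalog")

    threads = [threading.Thread(target=worker) for _ in range(concurrent_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return db.queries
```

Calling simulate_expiry() returns the number of queries a single expiry produced. Because the first query is still in flight when the other requests check the cache, that number lands well above one, usually close to one query per request.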
I watched this exact failure mode take a checkout page offline for 18 minutes. The cached object was the product catalog snapshot. TTL was 10 minutes, traffic was steady, and the stampede happened on the dot of every tenth minute until we shipped a fix. Three patterns, used together, made it go away.
Singleflight, also known as request coalescing or lock-and-populate. When a miss happens, the first request acquires a short lock in the cache itself: SET stampede:<key> 1 NX PX 2000. The winner queries the database and writes the result back. Every other request that misses during those two seconds either waits briefly and re-reads, or returns the previous value if you kept one around. As long as the rebuild finishes before the lock expires, that is one database query per expiry, regardless of how much traffic is fighting for the key.
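Here is a minimal sketch of the lock-and-populate flow, not a production implementation. FakeCache is a hypothetical in-memory stand-in for Redis that models only GET and SET with NX and PX, so the example runs without a server. The function name, the default TTLs and the 10 ms polling interval are all assumptions made for illustration.

```python
import threading
import time


class FakeCache:
    """In-memory stand-in for Redis (hypothetical), modelling GET and SET with NX/PX."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry on the monotonic clock, or None)
        self._mutex = threading.Lock()

    def _live(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        if entry[1] is not None and time.monotonic() >= entry[1]:
            del self._data[key]  # lazily drop expired entries
            return None
        return entry

    def get(self, key):
        with self._mutex:
            entry = self._live(key)
            return None if entry is None else entry[0]

    def set(self, key, value, px=None, nx=False):
        with self._mutex:
            if nx and self._live(key) is not None:
                return False  # NX: only set when the key does not exist
            expires = None if px is None else time.monotonic() + px / 1000.0
            self._data[key] = (value, expires)
            return True


def get_with_singleflight(cache, key, load, ttl_ms=300_000, lock_ms=2000, poll_s=0.01):
    """Read-through get where only the lock holder queries the database.

    Assumes cached values are never None, since None means "miss" here.
    """
    lock_key = f"stampede:{key}"
    while True:
        value = cache.get(key)
        if value is not None:
            return value
        # Same idea as: SET stampede:<key> 1 NX PX 2000
        if cache.set(lock_key, "1", px=lock_ms, nx=True):
            value = cache.get(key)  # re-check in case a rebuild just finished
            if value is None:
                value = load(key)
                cache.set(key, value, px=ttl_ms)
            return value
        time.sleep(poll_s)  # lost the race: wait briefly, then re-read
```

The sketch never deletes the lock. It lets the lock expire on its own, which keeps the code short and avoids one request removing a lock that another request now holds. The trade-off is that if the winner's load fails, the waiters keep polling until the lock TTL runs out and one of them takes over the rebuild.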
Stale-while-revalidate, borrowed from HTTP semantics. Store two timestamps per entry: a fresh-until and an expires-at, with a gap between them. Inside the gap, serve the stale value immediately and fire a background refresh. Users never block on a rebuild. Tail latency stays flat across the expiry boundary. The cost is staleness of up to the gap window, which is fine for product listings and a problem for account balances.
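The two-timestamp idea can be sketched as a small in-process cache. This is an illustration under assumptions, not a drop-in component: SwrCache, Entry, the fresh_s and stale_s parameters and the injectable clock are hypothetical names chosen for the example, and the refresh runs on a plain background thread.

```python
import threading
import time
from dataclasses import dataclass


@dataclass
class Entry:
    value: object
    fresh_until: float  # before this: serve without any refresh
    expires_at: float   # between fresh_until and this: serve stale, refresh in background


class SwrCache:
    def __init__(self, load, fresh_s=240.0, stale_s=60.0, clock=time.monotonic):
        self._load = load
        self._fresh_s = fresh_s
        self._stale_s = stale_s
        self._clock = clock
        self._entries = {}
        self._refreshing = set()  # keys with a background refresh in flight
        self._mutex = threading.Lock()

    def _store(self, key, value):
        now = self._clock()
        entry = Entry(value, now + self._fresh_s, now + self._fresh_s + self._stale_s)
        with self._mutex:
            self._entries[key] = entry

    def _refresh(self, key):
        try:
            self._store(key, self._load(key))
        finally:
            with self._mutex:
                self._refreshing.discard(key)

    def get(self, key):
        now = self._clock()
        with self._mutex:
            entry = self._entries.get(key)
            if entry is not None and now < entry.fresh_until:
                return entry.value  # fresh hit
            if entry is not None and now < entry.expires_at:
                if key not in self._refreshing:  # at most one refresh per key
                    self._refreshing.add(key)
                    threading.Thread(
                        target=self._refresh, args=(key,), daemon=True
                    ).start()
                return entry.value  # stale hit: the caller does not wait
        # Hard miss (never cached, or past expires_at): the caller pays for the load.
        value = self._load(key)
        self._store(key, value)
        return value
```

Only the hard-miss path blocks the caller, and in this sketch that path has no coalescing of its own. Pairing it with a singleflight lock would be the natural next step.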
Jitter. Never set the same TTL on a batch of keys you populated together, because they will expire together and stampede together. Add plus or minus 10 to 20 percent of random jitter on write. For very hot keys, refresh probabilistically before expiry, with the probability rising as the TTL approaches zero. The cache rebuilds itself in the background under low contention.
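Both halves of this pattern fit in a few lines. The sketch below makes its own assumptions: the function names are hypothetical, the default spread of 15 percent sits inside the 10 to 20 percent range above, and the linear ramp over the last 20 percent of the TTL is just one simple way to make the refresh probability rise toward expiry.

```python
import random


def jittered_ttl(base_ttl_s, spread=0.15, rng=random):
    """Scale base_ttl_s by a random factor in [1 - spread, 1 + spread]."""
    return base_ttl_s * (1.0 + rng.uniform(-spread, spread))


def early_refresh_probability(remaining_s, ttl_s, window=0.2):
    """Chance that a read should trigger a refresh before expiry.

    Zero until the last `window` fraction of the TTL, then a linear ramp
    that reaches 1.0 at expiry. The ramp shape is an illustrative choice.
    """
    if remaining_s <= 0:
        return 1.0
    window_s = ttl_s * window
    if remaining_s >= window_s:
        return 0.0
    return 1.0 - remaining_s / window_s


def should_refresh_early(remaining_s, ttl_s, window=0.2, rng=random):
    """Roll the dice for one read of a hot key."""
    return rng.random() < early_refresh_probability(remaining_s, ttl_s, window)
```

On write, each key in a batch gets its own jittered_ttl(600) instead of a shared 600 seconds, so the keys drift apart rather than expiring together. On read of a hot key, a caller that gets True from should_refresh_early rebuilds the entry while everyone else keeps being served from the cache.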
Use the three together. Jitter spreads the misses out. Singleflight makes sure each miss costs exactly one query. Stale-while-revalidate keeps users out of the rebuild path entirely. Together they turn cache expiry from a synchronized faceplant into a non-event.
Cache stampedes happen at expiry, under load, and never show up in local testing. Singleflight protects the database by letting one request rebuild while the rest wait. Stale-while-revalidate protects tail latency by serving the old value during the rebuild. TTL jitter prevents synchronized misses. Most production systems need all three.
Originally posted on LinkedIn.