Cache Stampedes: When One Expired Key Takes Down Your Database

March 27, 2026

A cache stampede is the moment your caching layer turns into an amplifier instead of a shield. A popular key expires. The next millisecond, a thousand requests notice the miss. None of them wait for the others. Each one runs the same expensive query against the same database row, returns the same result, and tries to write it back to the cache. The database, which was sized for one of those queries per minute, gets a thousand at once.

The pattern is self reinforcing. The first wave slows the database, which slows the recomputes, which keeps the cache empty longer, which lets the second wave pile on. Latency rises, queries time out, retries multiply, and the cache stays cold for minutes. The root cause is not the expiration. The root cause is that nobody coordinated.

There are three fixes worth knowing.

The first is a per key mutex, sometimes called single flight. The first miss takes a lock, recomputes the value, writes it to the cache, and releases. Every other miss for that key waits on the lock and reads the freshly populated value when it wakes up. One database query instead of a thousand.

The second is probabilistic early refresh, often written as XFetch. Instead of expiring at a hard boundary, each request rolls a small probability of refreshing the value before its TTL ends. As the key approaches expiration, the probability rises. By the time it would have expired, some lucky request has already rebuilt it, and the rest sail through on a fresh entry.

The third is stale while revalidate. When the TTL ends, the cache keeps serving the stale value to readers and kicks off a single background refresh. No reader ever sees a miss. The freshness window widens slightly, which is almost always an acceptable trade.

The production failure is the one teams hit when they think they have already fixed this. A per key lock in process memory works beautifully on a single instance. Then you deploy to a fleet of 200 pods. Each pod has its own lock. Each pod runs exactly one recompute. You now get 200 concurrent rebuilds instead of 1,000, which still flattens the database. The lock has to live in shared infrastructure, usually Redis with a TTL safety net, so the entire fleet coordinates on a single recompute per key.

Cache design is not just storing data. It is governing what happens when the data disappears.

Key takeaway

A cache stampede is a coordination failure, not a capacity failure. The cache did not break. Your fleet decided independently to rebuild the same value at the same moment.

Originally posted on LinkedIn. View original.