The Cache Hierarchy: Performance Is Which Layer You Actually Hit

January 27, 2026

Most performance work is not really about algorithms. It is about which cache you actually hit. The hierarchy is fixed, and the gaps between layers are larger than people remember.

CPU L1 cache responds in about 1 nanosecond. L2 is around 4 nanoseconds. L3 is roughly 10 nanoseconds. Main memory is about 100 nanoseconds. A local Redis call over a unix socket or loopback is around 1 millisecond. A Redis call across the network in the same availability zone is closer to 5 milliseconds. A primary database call with an index hit is around 50 milliseconds. Cross-region anything is 100 milliseconds and up. These are rough, order-of-magnitude figures that shift with hardware and load, but the ordering holds.

The pattern is consistent: every layer is slower than the one above it, often by 10x or more, and the jump from main memory to even a local Redis call is about four orders of magnitude on these numbers. A miss at any layer becomes a lookup at the next one, at that layer's price. That is the whole game. You are not trying to make individual layers faster. You are trying to keep requests served high in the stack.
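To make the fall-through cost concrete, here is a minimal sketch of the expected latency of one lookup through a three-layer stack. The latencies are the rough figures from above; the hit rates, the layer names, and the helper functions are assumptions made up for illustration, not measurements from a real system.

```python
def expected_latency_ms(layers):
    """Expected latency of one lookup through a stack of layers.

    layers: list of (name, latency_ms, hit_rate), fastest first.
    The last layer is the source of truth and should have hit_rate 1.0.
    """
    expected = 0.0
    reach = 1.0  # fraction of requests that get as far as this layer
    for _name, latency_ms, hit_rate in layers:
        expected += reach * latency_ms  # everyone who reaches a layer pays for it
        reach *= 1.0 - hit_rate         # only the misses fall through
    return expected


def stack(redis_ms, redis_hit_rate):
    # Hypothetical stack: hit rates here are illustrative assumptions.
    return [
        ("in_process", 0.0001, 0.50),  # about 100 ns, a main-memory read
        ("redis", redis_ms, redis_hit_rate),
        ("postgres", 50.0, 1.00),
    ]


baseline = expected_latency_ms(stack(1.0, 0.90))         # about 3.0 ms
faster_redis = expected_latency_ms(stack(0.5, 0.90))     # about 2.75 ms
higher_hit_rate = expected_latency_ms(stack(1.0, 0.99))  # about 0.75 ms
```

Under these assumed hit rates, halving Redis latency trims the average by well under 10 percent, while keeping more requests out of Postgres cuts it to a quarter.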

This is the strategic view of caching, and it sits above the tactical concerns. Cache stampede prevention, TTL tuning, and hot key mitigation are all problems inside the Redis layer. The hierarchy question is different: which layer should hold this data in the first place, and what is the cost when it falls through to the next one?

The production failure that taught me this lesson was painful in its simplicity. A team had spent two sprints tuning their Redis cluster, chasing p99 latency on cache reads from 1 millisecond down to 0.5. They were proud of the work, and the cache layer dashboards looked beautiful. The user-facing p99 did not move. When I traced a single request, the answer was obvious: every request that hit their cache also made an uncached 50-millisecond Postgres call for a small lookup table. The cache layer they had been polishing worked fine. The cache key for the table that mattered did not exist at all.
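The arithmetic behind that flat p99 is worth spelling out. This is a simplified sketch that assumes the two calls run one after the other and that their latencies simply add, which real percentiles do not strictly do; the component names are hypothetical.

```python
def latency_share(components_ms):
    """Fraction of the total request time spent in each component."""
    total = sum(components_ms.values())
    return {name: ms / total for name, ms in components_ms.items()}


# Simplified request budget from the story, in milliseconds.
before_tuning = {"redis_read": 1.0, "postgres_lookup": 50.0}
after_tuning = {"redis_read": 0.5, "postgres_lookup": 50.0}

saved_ms = sum(before_tuning.values()) - sum(after_tuning.values())
relative_gain = saved_ms / sum(before_tuning.values())  # roughly 1 percent
postgres_share = latency_share(before_tuning)["postgres_lookup"]  # roughly 98 percent
```

Two sprints bought about 1 percent of the request, because roughly 98 percent of the time was being spent one layer down.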

The fix took an afternoon. Add a Redis key for the lookup table, populate it on the application side with a 5-minute TTL, and stop hitting Postgres for data that changed twice a day. The user-facing p99 dropped by 45 milliseconds. The Redis tuning work had been real engineering. It was just aimed at the wrong layer.
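A fix along these lines is a plain cache-aside read. The sketch below is not the team's actual code: the key name, the JSON encoding, and the load_from_db callable are assumptions for illustration. The cache argument can be any client with get(key) and set(key, value, ex=seconds), which is the shape redis-py exposes; the in-memory stand-in only exists so the example runs without a server.

```python
import json
import time

LOOKUP_KEY = "lookup_table:v1"  # hypothetical key name
TTL_SECONDS = 300               # the 5-minute TTL from the story


def get_lookup_table(cache, load_from_db):
    """Cache-aside read: try the cache, fall back to the database, repopulate."""
    cached = cache.get(LOOKUP_KEY)
    if cached is not None:
        return json.loads(cached)  # hit: no database call
    table = load_from_db()         # miss: pay the slow call once
    cache.set(LOOKUP_KEY, json.dumps(table), ex=TTL_SECONDS)
    return table


class InMemoryCache:
    """Stand-in with the same get / set(..., ex=) shape, for running the sketch locally."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:  # expired: behave like a miss
            del self._data[key]
            return None
        return value

    def set(self, key, value, ex):
        self._data[key] = (value, self._clock() + ex)
```

For data that changes twice a day, a 5-minute TTL means the database sees at most one read per expiry window per key, at the price of serving values up to five minutes stale.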

The discipline is to instrument every level: app cache hit rate, page cache stats, database query latency, cross-region call counts. Then add caching where the next-layer cost is large and the hit rate gain will actually show up at the top of the funnel. Optimizing the fast layer when the slow layer is doing the work is the most common form of wasted performance effort.
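Here is one way that per-layer instrumentation might look inside the application. It is a minimal sketch: the LayerStats class and the layer names are hypothetical, and a real service would export these numbers to its metrics system instead of keeping them in process.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LayerStats:
    """Counts calls, hits, and wall-clock time per layer."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.hits = defaultdict(int)
        self.total_seconds = defaultdict(float)

    @contextmanager
    def timed(self, layer):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the call even if the wrapped code raised.
            self.calls[layer] += 1
            self.total_seconds[layer] += time.perf_counter() - start

    def record_hit(self, layer):
        self.hits[layer] += 1

    def hit_rate(self, layer):
        calls = self.calls[layer]
        return self.hits[layer] / calls if calls else 0.0

    def slowest_layer(self):
        """Layer with the most accumulated time, or None if nothing was recorded."""
        return max(self.total_seconds, key=self.total_seconds.get, default=None)


stats = LayerStats()
with stats.timed("redis"):
    stats.record_hit("redis")  # pretend this lookup was a hit
with stats.timed("redis"):
    pass                       # pretend this one missed
with stats.timed("postgres"):
    time.sleep(0.02)           # stand-in for a slow query
```

Sorting layers by accumulated time, not by how tunable they look, is what points the next caching effort at the layer that is actually doing the work.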

Key takeaway

Latency is a hierarchy problem before it is a tuning problem. Cutting Redis from 1ms to 0.5ms is invisible if there is an uncached 50ms database call on the same request. Instrument every layer, then cache where it actually moves the number.

Originally posted on LinkedIn.