Tail Latency Is What Your Users Actually Feel
May 12, 2026
The average latency on your dashboard is probably lying to you. Not because the metric is wrong, but because no real user experiences the average. They experience whatever the slowest blocking path on their request happens to be.
A single user-facing request rarely lives on one machine. Open any product page and the API gateway is already fanning out: profile, pricing, inventory, recommendations, reviews, feature flags, search. Ten parallel calls is a low estimate for anything modern. The page renders when the last one comes back. So the user-visible latency is not the average of those ten calls. It is the max.
This is where fan-out amplification bites. If each backend has a p99 of 200 ms and a median of 20 ms, you might assume that 200 ms is a rare event. It is, for one call. But if the calls are independent, the probability that at least one of ten parallel calls lands in its top 1% is 1 - 0.99^10, about 9.6%. So nearly one in ten requests will experience what was supposed to be a one-in-a-hundred event. Push the fan-out to around 70 calls and the odds pass 50%: the tail of a single backend becomes the median of the composite request. Average latency on the gateway looks fine. Users see lag.
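The arithmetic is easy to check. Here is a minimal Python sketch, assuming the calls are independent and using a deliberately crude two-point latency model (200 ms with probability 1%, 20 ms otherwise) for the simulation. The function names are mine, not from any library.

```python
import math
import random


def p_any_slow(fan_out: int, quantile: float = 0.99) -> float:
    """Probability that at least one of `fan_out` independent calls
    lands above its `quantile` latency."""
    return 1.0 - quantile ** fan_out


def fan_out_for(target: float, quantile: float = 0.99) -> int:
    """Smallest fan-out at which p_any_slow() reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(quantile))


def simulate(fan_out: int, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo check under an assumed two-point model: each call takes
    200 ms with probability 1% and 20 ms otherwise. Returns the fraction
    of requests whose slowest call took 200 ms."""
    rng = random.Random(seed)
    slow_requests = 0
    for _ in range(trials):
        # The request finishes when its slowest call finishes.
        latency = max(200 if rng.random() < 0.01 else 20 for _ in range(fan_out))
        if latency == 200:
            slow_requests += 1
    return slow_requests / trials


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(f"fan-out {n:3d}: analytic {p_any_slow(n):.3f}, simulated {simulate(n):.3f}")
    print("fan-out where half of requests hit a p99 call:", fan_out_for(0.5))
```

Real backends are rarely independent (a shared GC pause or a noisy neighbor slows several calls at once), so treat the numbers as a rough guide, not a prediction.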
It helps to know where tails come from, because the levers depend on the source.
- Garbage collection pauses. A young-gen pause is fine. An old-gen pause on a heap that has grown is not.
- JIT compilation. The first few requests through a hot code path on a fresh JVM are slow. Rolling deploys make this a permanent fixture.
- Lock contention and queueing. A request waiting for a thread is not running. Saturation makes the tail explode long before the average notices.
- Cold caches. The unlucky request is the one that has to fill the cache for everyone else.
- Disk and network jitter. SSDs are fast on average and very occasionally not.
Tail control is a separate engineering discipline from "make it fast." A few tools that pay off:
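- Aggressive timeouts on every outbound call, each set below the latency SLO of the request it serves, so one slow dependency cannot consume the whole budget.
- Hedged requests for idempotent reads: if the first copy has not returned by the backend's p95 latency, send a second copy and take whichever returns first.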
- Request coalescing so a cache stampede does not multiply the slow path.
- Bounded fan-out and partial responses when something does not come back in time.
- Pre-warming JIT and caches before a node takes real traffic.
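Two of these tools, hedged requests and bounded fan-out with partial responses, fit in a few lines of asyncio. This is a sketch under stated assumptions, not a production client: `hedged`, `fan_out`, and `fake_backend` are hypothetical names, the backend is simulated with `asyncio.sleep`, and the delays are made up for the demo.

```python
import asyncio
from typing import Any, Callable, Coroutine, Dict, List, Optional

# A "call" is any zero-argument function that returns a coroutine.
Call = Callable[[], Coroutine[Any, Any, str]]


async def hedged(call: Call, hedge_after: float) -> str:
    """Start call(); if it has not finished after `hedge_after` seconds,
    start a second copy and return whichever finishes first.
    Only safe for idempotent reads."""
    first = asyncio.create_task(call())
    done, _ = await asyncio.wait({first}, timeout=hedge_after)
    if done:
        return first.result()
    second = asyncio.create_task(call())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the loser so it stops consuming resources
    return next(iter(done)).result()


async def fan_out(calls: Dict[str, Call], deadline: float) -> Dict[str, Optional[str]]:
    """Run all calls in parallel and wait at most `deadline` seconds.
    Anything that is late or failed comes back as None (a partial response)."""
    tasks = {name: asyncio.create_task(call()) for name, call in calls.items()}
    done, pending = await asyncio.wait(set(tasks.values()), timeout=deadline)
    for task in pending:
        task.cancel()
    results: Dict[str, Optional[str]] = {}
    for name, task in tasks.items():
        ok = task in done and not task.cancelled() and task.exception() is None
        results[name] = task.result() if ok else None
    return results


def fake_backend(delays: List[float]) -> Call:
    """Hypothetical backend for the demo: each call sleeps for the next
    delay in `delays`, in the order the calls start."""
    remaining = iter(delays)

    async def call() -> str:
        delay = next(remaining)
        await asyncio.sleep(delay)
        return f"ok after {delay}s"

    return call


async def main() -> None:
    # First copy is slow (0.5 s); the hedge fires at 0.05 s and wins.
    print(await hedged(fake_backend([0.5, 0.01]), hedge_after=0.05))
    # "reviews" misses the 0.1 s deadline, so the page renders without it.
    page = await fan_out(
        {"inventory": fake_backend([0.01]), "reviews": fake_backend([0.5])},
        deadline=0.1,
    )
    print(page)


if __name__ == "__main__":
    asyncio.run(main())
```

Hedging at p95 means roughly 5% of reads send a second copy, which is the extra load you pay for the shorter tail. If the winning copy raises, this sketch propagates the error instead of waiting for the other copy; a real client would probably want to handle that case.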
The mental model that sticks: one slow dependency makes many fast ones irrelevant. Optimize the tail, not the average.
Users live on p99, not on the mean. With enough fan-out, a backend's tail becomes the request's median, so the work is bounding the slowest path with timeouts, hedging, request coalescing, and ruthless attention to GC and JIT warmup.
Originally posted on LinkedIn.