Gray Failures: When the System Is Up and Broken at the Same Time

April 28, 2026


The cleanest outage to debug is the one where a server is completely dead. Health checks fail, the load balancer ejects the node, traffic shifts, and pagers stay quiet. The hardest outage to debug is the one where the server is technically alive and quietly wrong.

This is a gray failure. The process is running. The health endpoint returns 200. CPU is normal. From every monitor's point of view the node is healthy. And yet a slice of real requests is timing out, or returning stale data, or taking five seconds to do something that usually takes fifty milliseconds. The node is up. The node is also broken. Both are true.

The causes are mundane and varied: a disk that is slowly failing and serves reads at a tenth of normal throughput, a network card that drops one packet in a hundred so TCP keeps retransmitting, a JVM caught in a long garbage collection pause, a connection pool that leaked and now blocks new acquisitions, an availability zone that can reach half the fleet but not the other half. Each of these leaves the process alive and the health check passing while real work suffers.

The damage is rarely contained to the broken node. Clients retry, which multiplies load on the survivors. Queues grow upstream. Connection pools across the fleet fill with requests waiting on the slow node. Healthy instances start looking unhealthy because they are wedged behind a sick peer. By the time anyone identifies the source, the symptoms are everywhere and the root cause looks like nothing.
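To put a rough number on the retry effect, here is a minimal sketch. It assumes each attempt fails independently with probability `p_fail` and that clients retry up to `max_retries` times with no backoff or retry budget. The function name, both parameters, and the independence model are illustrative assumptions, not measurements from a real system.

```python
def expected_attempts(p_fail: float, max_retries: int) -> float:
    """Expected number of attempts per logical request.

    Attempt k (0-based) is only sent if the previous k attempts all failed,
    so the expectation is the sum 1 + p + p^2 + ... + p^max_retries.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    return sum(p_fail ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    # Hypothetical failure rates, with up to three retries per request.
    for p in (0.01, 0.25, 0.5, 1.0):
        attempts = expected_attempts(p, max_retries=3)
        print(f"p_fail={p:.2f}: {attempts:.2f} attempts per request")
```

In this model a single retry can at most double the attempts, and a client allowed three retries against a node that fails every request sends four times the traffic. Real clients with backoff, jitter, or retry budgets will amplify less.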

Detection has to abandon the binary model. A health endpoint that does in-process bookkeeping cannot diagnose a sick disk or an asymmetric network, because it never exercises them. The signals that actually work come from real traffic and synthetic traffic together: p99 latency per instance, error rate per instance, success rate of synthetic probes that traverse the same code path as production requests. Alert on outliers, not on absolutes. A node whose p99 is ten times the fleet median is unhealthy even if nothing it owns is technically down.
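The outlier rule can be sketched in a few lines. This is a minimal illustration, assuming latency samples are already grouped per instance and that "fleet median" means the median of the per-instance p99 values. The function names, the nearest-rank percentile method, and the `ratio` parameter are assumptions made for the example.

```python
import math
from statistics import median


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def latency_outliers(latencies_ms_by_instance, ratio=10.0):
    """Return {instance: p99} for instances whose p99 is at least `ratio` times the fleet median p99."""
    p99 = {
        name: percentile(samples, 99)
        for name, samples in latencies_ms_by_instance.items()
        if samples  # skip instances that served no traffic in this window
    }
    if not p99:
        return {}
    fleet_median = median(p99.values())
    return {name: value for name, value in p99.items() if value >= ratio * fleet_median}


if __name__ == "__main__":
    # Hypothetical window of 100 requests per instance, latencies in milliseconds.
    healthy = [50] * 98 + [60] * 2
    fleet = {
        "node-a": healthy,
        "node-b": healthy,
        "node-c": healthy,
        "node-d": [50] * 95 + [5000] * 5,  # mostly fine, with a slow tail
    }
    print(latency_outliers(fleet))
```

Comparing against the fleet median rather than a fixed threshold means the rule keeps working when the whole fleet slows down for a legitimate reason, such as a traffic spike. In that case every p99 rises together and no single node stands out.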

The production failure mode that catches teams off guard is the gray failure that survives a deploy. Restarting the process clears most transient corruption, so the standard runbook of "rotate the bad node" usually works. When the bad node comes back bad, the underlying cause is hardware or kernel state, and the fleet's automation will keep happily routing traffic to it because the health check still passes. You need outlier ejection at the load balancer layer, configured on latency and error rate, not just on liveness. Envoy calls this outlier detection, a form of passive health checking, and modern service meshes offer comparable features. It is the single biggest defense against partial degradation, and most teams discover they need it during the outage that taught them the word "gray."
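The idea behind outlier ejection can be shown with a toy balancer. This is a sketch of the general technique, not Envoy's actual algorithm or configuration. The class name, every parameter, and the choice to count slow responses as failures are assumptions made for the example.

```python
import random
import time


class OutlierEjector:
    """Passive health checking: judge hosts by the outcome of real requests."""

    def __init__(self, hosts, max_consecutive_failures=5, ejection_seconds=30.0,
                 max_ejected_fraction=0.5, slow_threshold_ms=1000.0,
                 clock=time.monotonic):
        self.hosts = list(hosts)
        self.max_consecutive_failures = max_consecutive_failures
        self.ejection_seconds = ejection_seconds
        self.max_ejected_fraction = max_ejected_fraction
        self.slow_threshold_ms = slow_threshold_ms
        self.clock = clock
        self.failures = {host: 0 for host in self.hosts}
        self.ejected_until = {}  # host -> time at which it may serve again

    def _expire(self):
        """Return hosts to rotation once their ejection period has passed."""
        now = self.clock()
        for host in [h for h, until in self.ejected_until.items() if until <= now]:
            del self.ejected_until[host]
            self.failures[host] = 0

    def available(self):
        self._expire()
        return [h for h in self.hosts if h not in self.ejected_until]

    def pick(self):
        # Fall back to the full host list if everything has been ejected.
        return random.choice(self.available() or self.hosts)

    def record(self, host, ok, latency_ms):
        """Report one real request. A slow success counts as a failure."""
        if ok and latency_ms <= self.slow_threshold_ms:
            self.failures[host] = 0
            return
        self.failures[host] += 1
        if self.failures[host] < self.max_consecutive_failures:
            return
        self._expire()
        if host in self.ejected_until:
            return
        # Cap ejections so the survivors are not left carrying all the load.
        if (len(self.ejected_until) + 1) / len(self.hosts) > self.max_ejected_fraction:
            return
        self.ejected_until[host] = self.clock() + self.ejection_seconds
```

The caller would invoke `pick()` before each request and `record()` after it. The ejection cap matters as much as the ejection itself: without it, a fleet-wide slowdown could eject every host and turn a partial degradation into a full outage.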

Key takeaway

The binary up-down model misses the failures that hurt most. Detection requires real-traffic signals, percentile latency, and outlier ejection, not just a health endpoint returning OK.

Originally posted on LinkedIn.


All Rights Reserved.