Health Checks That Don't Take Down Your Whole Fleet
March 17, 2026
A load balancer is only as smart as the health signal it trusts. Send traffic to a dead node and users see errors. Eject healthy nodes because a shared dependency stuttered for a second and you may have just shed half your fleet for no good reason.
There are two families of checks, and good systems run both.
Active checks are the load balancer probing each backend on a schedule. Open a TCP socket. Call /healthz. Run a synthetic request that touches the real code path. The advantage is predictability. You define what healthy means, you control the cadence, you get a clean binary signal. The disadvantage is that probes are periodic. If you check every ten seconds, a node that dies one second after a successful probe will keep receiving live traffic for at least nine more seconds, until the next probe runs. On a busy service, that is plenty of time to fail thousands of requests.
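As a rough sketch of the simplest active check, the Python below opens a TCP socket with a timeout and treats any connection error as unhealthy. It also puts a number on the detection gap. The names tcp_probe and worst_case_exposure, and the fall parameter (failed probes required before ejection), are assumptions for illustration and not taken from any particular load balancer.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Shallow active check: True if a TCP connection opens within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, and timed-out connections all count as unhealthy.
        return False


def worst_case_exposure(interval: float, fall: int = 1, timeout: float = 0.0) -> float:
    """Upper bound on seconds a dead node keeps receiving traffic.

    Assumes the node dies right after a successful probe, probes run every
    `interval` seconds, and `fall` failed probes are needed to eject it.
    """
    return interval * fall + timeout
```

With a ten second interval and ejection on the first failure, worst_case_exposure(10) comes out at 10 seconds. Requiring three failures with a one second probe timeout stretches that to 31 seconds, which is the gap a second kind of signal has to cover.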
Passive checks watch the real traffic. Connection resets, timeouts, bursts of 5xx, p99 latency spikes. The load balancer infers health from what users are actually experiencing. The reaction is fast and it catches gray failures that a shallow /healthz would never notice, like a node that accepts connections but returns garbage. The cost is that your users are the canaries. The first handful of failing requests is how the system learns the node is sick.
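One way to picture a passive check is a sliding window over recent request outcomes per backend. The sketch below is a minimal version under assumed defaults: the class name PassiveMonitor, the window size, the minimum sample count, and the 50% failure threshold are all illustrative choices, not values from a specific proxy.

```python
from collections import deque


class PassiveMonitor:
    """Infers one backend's health from the outcomes of real requests."""

    def __init__(self, window: int = 20, min_samples: int = 10,
                 max_failure_ratio: float = 0.5) -> None:
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.min_samples = min_samples
        self.max_failure_ratio = max_failure_ratio

    def record(self, ok: bool) -> None:
        """Call once per proxied request (a reset, timeout, or 5xx is a failure)."""
        self.outcomes.append(ok)

    def failure_ratio(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def is_healthy(self) -> bool:
        # Too few samples: do not judge, so one early error cannot eject a node.
        if len(self.outcomes) < self.min_samples:
            return True
        return self.failure_ratio() <= self.max_failure_ratio
```

The minimum sample count is the knob that decides how many users get to be canaries. Lower it and you react faster but eject on noise. Raise it and more real requests fail before the node is pulled.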
The pairing is the point. Active checks give you a baseline before traffic ever arrives. Passive checks pull a node out the moment its real success rate craters.
The next decision is shallow versus deep. A shallow check confirms the process is listening on a port. A deep check confirms the process can actually serve a real request, often by hitting its dependencies.
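The difference is easy to see in code. This is a sketch only: shallow_check, deep_check, and the idea of passing dependency probes in as plain callables are assumptions made for illustration, not a real framework's API.

```python
from typing import Callable, Dict, List, Tuple

DependencyProbe = Callable[[], bool]


def shallow_check() -> bool:
    """If this handler runs at all, the process is up and accepting requests."""
    return True


def deep_check(dependencies: Dict[str, DependencyProbe]) -> Tuple[bool, List[str]]:
    """Run every dependency probe; return (healthy, names of failed dependencies)."""
    failed = []
    for name, probe in dependencies.items():
        try:
            ok = probe()
        except Exception:
            ok = False  # a probe that raises counts as a failed dependency
        if not ok:
            failed.append(name)
    return (not failed, failed)
```

Returning the names of the failed dependencies, and not just a boolean, makes it possible to tell a node-local problem apart from a dependency that every node shares.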
Deep checks are seductive and dangerous. Here is the failure mode I have watched twice. Every backend's /healthz ran a SELECT 1 against the shared primary database, plus a PING against Redis. The Redis cluster took 4 seconds to elect a new leader during a routine failover. Every backend's check failed at the same instant. The load balancer ejected the entire fleet. The site returned 503 until an operator manually disabled health checks. A tiny hiccup became a full outage because the health signal coupled every node's fate to a shared resource.
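One guard against this kind of correlated failure is a cap on how much of the fleet can be ejected at once. The sketch below is a minimal version of that idea. The function name backends_to_eject, the max_eject_fraction parameter, and the 30% default are hypothetical. Picking victims in sorted name order is an arbitrary choice made only to keep the example deterministic.

```python
from typing import Dict, Set


def backends_to_eject(health: Dict[str, bool],
                      max_eject_fraction: float = 0.3) -> Set[str]:
    """Pick which failing backends to eject without exceeding the cap.

    `health` maps backend name to its latest check result. At most
    `max_eject_fraction` of the fleet is ejected; the remaining failing
    backends stay in rotation even though their checks failed.
    """
    failing = sorted(name for name, ok in health.items() if not ok)
    limit = int(len(health) * max_eject_fraction)
    return set(failing[:limit])
```

With ten backends all failing at the same instant, only three are ejected and the other seven keep serving. If the shared dependency recovers a few seconds later, most of the fleet never left rotation. The trade-off is that very small fleets round the limit down to zero, so the cap needs a sensible floor in practice.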
A few protections that pay for themselves:
- Require N consecutive failures before marking a node unhealthy, and N consecutive successes before re-admitting it.
- Use slow start so a recovered node ramps back up over 30 to 60 seconds.
- Drain in-flight connections before removal.
- Separate readiness from liveness. Readiness controls traffic. Liveness controls restarts.
- Never let a deep check fail the entire fleet for one shared dependency. Cap the fraction of the fleet that can be ejected at once.
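The first two protections fit in a few lines each. This is a sketch under assumptions: HealthState, slow_start_weight, the fall and rise thresholds, the linear ramp, and the 10% starting weight are illustrative names and defaults, not any load balancer's actual behavior.

```python
class HealthState:
    """Tracks one backend with separate fall and rise thresholds."""

    def __init__(self, fall: int = 3, rise: int = 2) -> None:
        self.fall = fall      # consecutive failures needed to mark unhealthy
        self.rise = rise      # consecutive successes needed to re-admit
        self.healthy = True
        self._streak = 0      # consecutive results that disagree with the state

    def observe(self, ok: bool) -> bool:
        """Feed one check result; return the (possibly updated) health state."""
        if ok == self.healthy:
            self._streak = 0  # result agrees with the state: reset the streak
            return self.healthy
        self._streak += 1
        needed = self.rise if ok else self.fall
        if self._streak >= needed:
            self.healthy = ok
            self._streak = 0
        return self.healthy


def slow_start_weight(seconds_since_readmit: float, ramp_seconds: float = 30.0,
                      floor: float = 0.1) -> float:
    """Linear ramp from `floor` to 1.0 of normal weight over `ramp_seconds`."""
    if seconds_since_readmit >= ramp_seconds:
        return 1.0
    progress = max(seconds_since_readmit, 0.0) / ramp_seconds
    return floor + (1.0 - floor) * progress
```

A single failed probe never flips the state, and a single success in the middle of a failure streak resets the count. Halfway through a 30 second ramp the recovered node carries 55% of its normal weight, so cold caches and empty connection pools get time to warm up.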
Health checks are a recovery system, not a binary switch. Treat them that way.
Combine active probes for baseline readiness with passive signals from real traffic for fast detection. Require consecutive failures before ejection, and never let a shared dependency mark every backend unhealthy at once.
Originally posted on LinkedIn.