Load Balancer Health Checks: The Difference Between Healthy and Ready
March 17, 2026
A load balancer routes traffic based on a single bit per backend: healthy or not. Get the bit wrong and every other decision the system makes is wrong with it.
There are two ways to compute that bit. Active checks probe on a schedule: open a TCP connection, call /healthz, run a synthetic request. They are predictable and easy to reason about, but they only fire every few seconds, so a backend that dies right after a probe keeps receiving traffic until the next probe fails. Passive checks read the real traffic: connection resets, request timeouts, 5xx bursts, latency spikes. They react in milliseconds, but the first failing users are the canaries. Production systems run both: active checks for baseline readiness, passive checks for fast detection of real serving problems.
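To make the split concrete, here is a minimal sketch of the passive side. PassiveCheck, in_rotation, the window size and the error-rate threshold are names and numbers I made up for illustration, not any particular load balancer's API.

```python
from collections import deque
from typing import Optional


class PassiveCheck:
    """Tracks recent real-request outcomes for one backend (hypothetical helper)."""

    def __init__(self, window: int = 20, max_error_rate: float = 0.5, min_samples: int = 5):
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, status_code: Optional[int], timed_out: bool = False) -> None:
        # A reset or timeout has no status code; count it as a failure alongside 5xx.
        failed = timed_out or status_code is None or status_code >= 500
        self.outcomes.append(failed)

    def healthy(self) -> bool:
        # Too few samples: do not judge, leave the decision to the active probe.
        if len(self.outcomes) < self.min_samples:
            return True
        return sum(self.outcomes) / len(self.outcomes) < self.max_error_rate


def in_rotation(active_ok: bool, passive: PassiveCheck) -> bool:
    # A backend receives traffic only if both signals say it is fine.
    return active_ok and passive.healthy()
```

The min_samples guard is a design choice: with almost no traffic, one failed request should not eject a node, so the active probe stays the only signal until enough real requests have been seen.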
The deeper question is what the probe actually checks. A TCP-port probe says the kernel is listening. An HTTP /health that returns 200 unconditionally says the web server is up. Neither says the app can serve a request. A real readiness check exercises the things traffic depends on: a quick DB ping, a cache touch, a downstream dependency call.
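A rough sketch of what such a readiness check might look like. The check_ready and make_db_ping helpers are hypothetical, and sqlite3 stands in for whatever database the service really uses; the point is only that the probe touches a dependency instead of returning 200 unconditionally.

```python
import sqlite3
from typing import Callable, Dict, Tuple


def check_ready(probes: Dict[str, Callable[[], None]]) -> Tuple[int, Dict[str, str]]:
    """Run each dependency probe; return an HTTP status and a per-dependency report."""
    report = {}
    for name, probe in probes.items():
        try:
            probe()  # a probe signals failure by raising
            report[name] = "ok"
        except Exception as exc:
            report[name] = f"failed: {exc}"
    status = 200 if all(v == "ok" for v in report.values()) else 503
    return status, report


def make_db_ping(conn: sqlite3.Connection) -> Callable[[], None]:
    # SELECT 1 exercises the connection without touching application tables.
    def ping() -> None:
        conn.execute("SELECT 1").fetchone()
    return ping


conn = sqlite3.connect(":memory:")
status, report = check_ready({"db": make_db_ping(conn)})
```

In a real service each probe would also need a short timeout of its own, so that a hung dependency makes the check fail quickly instead of hanging the health endpoint.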
That depth has a sharp edge. Make /ready too dependency-aware and a brief Redis pause ejects the entire fleet in unison. The cure becomes worse than the disease. The right framing: can this specific node serve the traffic I want to send to it, right now? Not: is every downstream in perfect condition?
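One way to encode that framing, as a sketch: split dependencies into critical and optional, and let only critical failures take the node out of rotation. Which dependencies count as critical is a per-service judgment; treating the database as critical and Redis as optional below is my assumption for the example, and node_readiness is a hypothetical helper.

```python
from typing import Dict, Set, Tuple


def node_readiness(results: Dict[str, bool], critical: Set[str]) -> Tuple[int, str]:
    """Map per-dependency probe results to a readiness answer for this node."""
    failed = {name for name, ok in results.items() if not ok}
    if failed & critical:
        return 503, "not ready"  # the node cannot serve the traffic it would receive
    if failed:
        return 200, "degraded"   # keep serving; surface the problem through alerting
    return 200, "ready"


# Assumed split: the service can run without its cache but not without its database.
print(node_readiness({"db": True, "redis": False}, critical={"db"}))
print(node_readiness({"db": False, "redis": True}, critical={"db"}))
```

With this split, a brief Redis pause marks every node degraded but leaves the whole fleet in rotation.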
Tuning matters as much as the check itself. Interval too short and you DDoS yourself with probes, because every load balancer instance probes every backend. Failure threshold too low and any GC pause flaps a node out of rotation. The standard pattern is consecutive failures before ejection, consecutive successes before re-admission, slow start on recovery, and connection draining on removal. These four turn a brittle binary switch into a system that recovers without thrashing.
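The first three of those can be sketched in a few lines. The class name, the thresholds (3 failures, 2 successes) and the 30-second linear ramp are illustrative assumptions, not recommendations. Connection draining is left out because it lives in the connection handling, not in the health state.

```python
class BackendState:
    """Hysteresis around a raw probe result (thresholds are illustrative)."""

    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall = fall          # consecutive failures before ejection
        self.rise = rise          # consecutive successes before re-admission
        self.in_rotation = True
        self.streak = 0           # run of results that disagree with the current state

    def observe(self, probe_ok: bool) -> bool:
        if probe_ok == self.in_rotation:
            self.streak = 0       # result agrees with the current state; reset the run
            return self.in_rotation
        self.streak += 1
        threshold = self.rise if probe_ok else self.fall
        if self.streak >= threshold:
            self.in_rotation = probe_ok
            self.streak = 0
        return self.in_rotation


def slow_start_weight(seconds_since_readmit: float,
                      ramp_seconds: float = 30.0,
                      floor: float = 0.1) -> float:
    # Linearly ramp the node's traffic share from `floor` to 1.0 over the ramp window.
    if seconds_since_readmit >= ramp_seconds:
        return 1.0
    fraction = max(seconds_since_readmit, 0.0) / ramp_seconds
    return floor + (1.0 - floor) * fraction
```

A single failed probe between two good ones resets the streak, so one GC pause does not move the node. Only an unbroken run of failures ejects it, and only an unbroken run of successes brings it back.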
Here is the production failure I have watched twice. A team writes a /health endpoint that returns 200 if the process is up. They never wire it to the database. One afternoon the DB goes unreachable. Every backend keeps passing health checks because nothing in the probe touches the DB. The LB happily routes 100% of traffic to nodes that immediately return 500. The on-call sees green dashboards and red user errors at the same time.
The fix is two-sided. Add a dependency-aware /ready that probes the DB shallowly. Keep passive checks on so a 5xx spike trips the LB even if /ready lies. Health checks do not just detect failure. They are the contract between the LB and the fleet, and contracts have to be written carefully.
A load balancer is only as smart as its health signal. Active probes set a baseline, passive checks catch fast failures, and readiness must reflect the dependencies that traffic actually exercises.
Originally posted on LinkedIn.