The Rolling Deploy That Returns 5xx: Why Readiness Probes Decide When Traffic Moves

April 10, 2026

A deploy can be technically green and still hurt users. The control plane reports success because every old pod was replaced, every new pod is running, and no rollback was triggered. Meanwhile your error rate spiked for two minutes during the rollout and nobody knows why.

The cause is almost always the same. Traffic moved to new pods before the new pods were ready to serve. Kubernetes does not know what "ready" means for your app. It only knows what you tell it through the readiness probe. The default behavior, if you do not configure one, is "ready as soon as the container starts." For most real services, that is a lie.

Here is what an app actually does between process start and serving correctly. It parses config, opens a database connection pool, warms a local cache, establishes connections to downstream services, and warms up the JIT on its hot paths. Until that work is done, the process may be listening on port 8080, but a request will hit a half-built service and either time out or fail.

The readiness probe is the contract that fixes this. The pod's Ready condition stays false until the probe passes. The endpoints controller leaves pods that are not ready out of the Service's ready endpoints, so traffic keeps hitting the old pods. Only when the new pod answers a real health check does it join rotation. Define the probe to actually exercise the dependencies you care about. A handler that returns 200 because the HTTP server is up tells you nothing. One that confirms the DB pool has a live connection and the downstream auth service is reachable is real.
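A minimal sketch of such a handler, using only the Python standard library. The check functions, the /readyz path, and the serve helper are hypothetical names chosen for illustration. A real check would run something cheap like SELECT 1 on a pooled connection or call the auth service's own health endpoint.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


# Hypothetical dependency checks; these stubs always pass.
def db_pool_has_live_connection() -> bool:
    return True


def auth_service_reachable() -> bool:
    return True


CHECKS: Dict[str, Callable[[], bool]] = {
    "db": db_pool_has_live_connection,
    "auth": auth_service_reachable,
}


def evaluate_readiness(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[int, Dict[str, bool]]:
    """Run every check; return 200 only if all pass, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/readyz":
            self.send_error(404)
            return
        status, results = evaluate_readiness(CHECKS)
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the probe endpoint; blocks until the process is stopped."""
    HTTPServer(("0.0.0.0", port), ReadinessHandler).serve_forever()
```

Calling serve() exposes the endpoint, and the readiness probe would then point at /readyz on that port. Returning 503 rather than 200 with an error body matters, because an HTTP probe judges success by status code.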

The rollout knobs control the pace. maxSurge is how many extra pods can exist above the desired count during the deploy. maxUnavailable is how many pods can be unavailable, relative to the desired count, at any point in the rollout. Both default to 25%. For latency-sensitive workloads with slow warmup, set maxUnavailable to zero and keep maxSurge at one or more, so capacity never dips below baseline.
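To make the arithmetic concrete, here is a small sketch that turns the two knobs into pod-count bounds. The function names are made up for this example. The rounding follows the documented behavior: a percentage maxSurge rounds up, a percentage maxUnavailable rounds down. Rejecting the case where both resolve to zero is a simplification in this sketch.

```python
def resolve(value, replicas, round_up):
    """Turn an absolute number or a percentage string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * replicas
        # Integer arithmetic avoids float rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the most pods that may exist and the fewest that stay available."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # With both at zero the rollout could neither add nor remove a pod.
        raise ValueError("maxSurge and maxUnavailable cannot both be zero")
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


# 10 replicas with the defaults: up to 13 pods, as few as 8 available.
print(rollout_bounds(10))
# 10 replicas, maxSurge=1, maxUnavailable=0: up to 11 pods, never fewer than 10.
print(rollout_bounds(10, max_surge=1, max_unavailable=0))
```

The second configuration trades rollout speed for capacity: pods are replaced one at a time, and each old pod is removed only after its replacement is ready.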

The other half is graceful shutdown. When Kubernetes terminates an old pod, it starts the terminationGracePeriodSeconds clock, runs the preStop hook if there is one, and then sends SIGTERM. If the app has not exited when the clock runs out, it gets SIGKILL and any in-flight request dies. Removal from the endpoints list happens in parallel and takes a moment to reach every proxy, so new requests can still arrive after termination begins. A preStop hook with a short sleep covers that gap, and an app that stops accepting new connections on SIGTERM while draining existing ones removes that whole class of error.
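The app side of that can be sketched as an in-flight request tracker plus a SIGTERM hook. InflightTracker and run_until_sigterm are hypothetical names, and the 20 second drain timeout is an assumed value that would need to fit inside terminationGracePeriodSeconds minus any preStop sleep. A real server would call try_begin and end around each request and reject or close connections when try_begin returns False.

```python
import signal
import threading
import time


class InflightTracker:
    """Counts in-flight requests and refuses new ones once draining starts."""

    def __init__(self):
        self._cond = threading.Condition()
        self._inflight = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Register a new request; False means the app is shutting down."""
        with self._cond:
            if self._draining:
                return False
            self._inflight += 1
            return True

    def end(self) -> None:
        """Mark one request as finished."""
        with self._cond:
            self._inflight -= 1
            if self._inflight == 0:
                self._cond.notify_all()

    def start_draining(self) -> None:
        with self._cond:
            self._draining = True

    def wait_drained(self, timeout: float) -> bool:
        """Block until no requests remain; False if the timeout expired first."""
        with self._cond:
            return self._cond.wait_for(lambda: self._inflight == 0, timeout=timeout)


def run_until_sigterm(tracker: InflightTracker, drain_timeout: float = 20.0) -> bool:
    """Wait for SIGTERM on the main thread, then drain in-flight requests."""
    state = {"terminating": False}

    def on_sigterm(signum, frame):
        # Only flip a flag here; the real work happens outside the handler.
        state["terminating"] = True

    signal.signal(signal.SIGTERM, on_sigterm)
    while not state["terminating"]:
        time.sleep(0.1)
    tracker.start_draining()
    return tracker.wait_drained(drain_timeout)
```

Keeping the signal handler down to a flag assignment is a deliberate choice: taking locks inside a handler risks deadlocking against the code it interrupted.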

The failure I have actually shipped: an app that marked itself ready as soon as the port was listening, while the DB pool initialized lazily on first query. For the first thirty seconds of every new pod's life, requests hit cold connections and came back as 500s. The readiness probe was happy. Users were not. The fix was a one-line change to warm a connection inside the readiness handler.
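A stripped-down reconstruction of that bug and its fix, not the original code. LazyPool, ready_on_listen, and ready_when_warm are illustrative names, and connect stands in for whatever actually opens a database connection.

```python
class LazyPool:
    """A pool that opens its connection on first use, like the one in the story."""

    def __init__(self, connect):
        self._connect = connect  # hypothetical factory that opens a connection
        self._conn = None

    def is_warm(self) -> bool:
        return self._conn is not None

    def get(self):
        if self._conn is None:
            # The first caller pays the connection cost, or sees the failure.
            self._conn = self._connect()
        return self._conn


def ready_on_listen(pool: LazyPool) -> bool:
    """The bug: reports ready without ever touching the pool."""
    return True


def ready_when_warm(pool: LazyPool) -> bool:
    """The fix: force the lazy connection inside the readiness check."""
    try:
        pool.get()
    except Exception:
        return False
    return True
```

With the second check, the probe itself absorbs the cold connection, so the pod joins rotation only after a connection has been opened successfully.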

A rolling update is not pod movement. It is traffic movement. Readiness is what gates the traffic.

Key takeaway

A successful rolling update from the control plane's perspective can still be a user-visible outage. Readiness probes, maxSurge/maxUnavailable, and preStop hooks are what separate a clean deploy from a 5xx burst.

Originally posted on LinkedIn.