What Actually Happens When a Kubernetes Node Dies
April 15, 2026
Kubernetes will recover from a dead node without you doing anything. That sentence is technically true and operationally misleading. The recovery is automatic. It is also slow enough that your callers will notice, and the defaults are not tuned for your latency budget.
Here is the actual timeline. A node loses connectivity, or the kubelet panics, or someone pulls the plug. The first thing that stops happening is the node's heartbeat to the API server. By default, the control plane will wait node-monitor-grace-period (40 seconds) before deciding the node is gone and marking it NotReady. During those 40 seconds, the Endpoints object still lists pods on that node as ready. Services keep routing traffic to dead pods. Callers see timeouts, not fast failures.
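To put a rough number on that window, here is a back-of-the-envelope sketch. It assumes traffic is spread evenly across every endpoint still listed as ready and that callers do not retry. The function name and the example figures are illustrative, not taken from any real system.

```python
def requests_sent_to_dead_pods(rps: float, dead_pods: int, total_pods: int,
                               detection_window_s: float = 40.0) -> float:
    """Estimate requests routed to dead pods before the node is marked NotReady.

    Assumes even load balancing across every endpoint still listed as ready
    and no client-side retries (both are simplifying assumptions).
    """
    if total_pods <= 0 or not 0 <= dead_pods <= total_pods:
        raise ValueError("need total_pods > 0 and 0 <= dead_pods <= total_pods")
    dead_fraction = dead_pods / total_pods
    return rps * dead_fraction * detection_window_s


# Hypothetical service: 300 requests/s, 3 replicas, 1 of them on the dead node.
print(requests_sent_to_dead_pods(rps=300, dead_pods=1, total_pods=3))
```

For that hypothetical service, about 4,000 requests would be sent to a pod that cannot answer before the control plane reacts at all.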
Once the node is NotReady, eviction does not happen immediately either. The controller manager waits pod-eviction-timeout (5 minutes by default in older versions, replaced by taint-based eviction with tolerationSeconds of 300 in newer ones) before it evicts pods from the unreachable node. Only then does the ReplicaSet controller notice that desired replicas exceed running replicas and create replacements for the scheduler to place on healthy nodes. Those new pods then go through their own image pull, startup probe, and readiness probe. End to end, with the defaults, that is about 40 seconds of detection plus 300 seconds of waiting before a replacement is even created, so expect several minutes of degraded capacity per pod that was on the dead node unless you have shortened those timers.
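With taint-based eviction, the five-minute wait can be tuned per pod through tolerations. The sketch below builds the tolerations stanza as a plain Python structure and does the timeline arithmetic. The taint keys are the ones Kubernetes applies to unhealthy nodes. The helper names and the 30-second value are illustrative assumptions, not a recommendation.

```python
import json


def eviction_tolerations(toleration_seconds: int = 300) -> list:
    """Build the tolerations stanza for a pod spec.

    Kubernetes adds these two tolerations with tolerationSeconds=300 when a
    pod does not declare them; declaring them explicitly overrides the wait.
    """
    return [
        {
            "key": key,
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": toleration_seconds,
        }
        for key in ("node.kubernetes.io/not-ready",
                    "node.kubernetes.io/unreachable")
    ]


def seconds_until_eviction(grace_period_s: int = 40,
                           toleration_seconds: int = 300) -> int:
    """Heartbeat loss to eviction: detection wait plus toleration wait."""
    return grace_period_s + toleration_seconds


print(seconds_until_eviction())                       # defaults from the text
print(seconds_until_eviction(toleration_seconds=30))  # hypothetical tuned value
print(json.dumps(eviction_tolerations(30), indent=2))
```

The trade-off with a short tolerationSeconds is that a brief network blip can evict pods that would have come back on their own, so the value is worth choosing against how flaky your nodes actually are.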
This is why "Kubernetes is self-healing" needs an asterisk. Self-healing happens. It just happens on Kubernetes timescales, not user-facing timescales.
The production failure: a payments team ran a single deployment with three replicas, no PodDisruptionBudget, and readiness probes that returned 200 the instant the process bound to a port. A node went down on a Saturday morning. The two surviving pods got hit with all the traffic and saturated their thread pools. The third replica came up on a new node within 90 seconds, but its readiness check passed before the JVM had finished its JIT warmup and before the connection pool was sized. The new pod accepted traffic, immediately timed out, and looked unhealthy to clients. The "fast recovery" actually extended the incident.
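The saturation in that story is mostly arithmetic. A minimal sketch, assuming traffic is spread evenly and total load stays constant; the 70% starting utilization is a hypothetical figure, not a number from the incident.

```python
def utilization_after_failure(replicas: int, failed: int,
                              utilization_before: float) -> float:
    """Per-pod utilization once the survivors absorb the failed pods' share.

    Assumes traffic is spread evenly and total load stays constant.
    """
    survivors = replicas - failed
    if survivors <= 0:
        raise ValueError("no surviving replicas")
    return utilization_before * replicas / survivors


# Three replicas, one lost: each survivor carries 1.5x its previous load.
# The 70% starting utilization is a hypothetical figure.
after = utilization_after_failure(replicas=3, failed=1, utilization_before=0.7)
print(f"{after:.0%}")
```

Any pod running above roughly 67% before the failure ends up over 100% afterwards in a three-replica setup, which is why two survivors can saturate even when the dashboard looked comfortable the day before.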
Two changes make this graceful. First, set a PodDisruptionBudget so voluntary disruptions (drains, upgrades) cannot stack on top of an involuntary one. Second, write a readiness probe that actually proves the pod can serve work: warmed connection pool, primed caches, one successful synthetic request. The right time for traffic is not "the process started." It is "the process can do its job."
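Both changes can be sketched briefly. The first helper builds a policy/v1 PodDisruptionBudget as a plain dict (kubectl accepts JSON manifests). The second tracks warmup checks so that a readiness endpoint only reports success once all of them have passed. The names payments-pdb, the app=payments label, minAvailable of 2, and the three check names are hypothetical, and wiring the class to a real HTTP handler is left out.

```python
import json


def pod_disruption_budget(name: str, app_label: str, min_available: int) -> dict:
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


class Readiness:
    """Tracks warmup checks; a /ready handler would return status_code()."""

    def __init__(self, required_checks):
        self._pending = set(required_checks)

    def mark_done(self, check: str) -> None:
        # Called by startup code once a given warmup step has succeeded.
        self._pending.discard(check)

    def status_code(self) -> int:
        # 503 keeps the pod out of Service endpoints until every check passes.
        return 200 if not self._pending else 503


checks = ("connection_pool", "cache_primed", "synthetic_request")
readiness = Readiness(checks)
print(readiness.status_code())  # still warming up
for check in checks:
    readiness.mark_done(check)
print(readiness.status_code())  # all checks passed
print(json.dumps(pod_disruption_budget("payments-pdb", "payments", 2), indent=2))
```

With three replicas and minAvailable set to 2, a drain is refused while one pod is already down, which is exactly the stacking the budget is meant to prevent.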
Node failure is a solved problem in Kubernetes. The solution just takes longer than you think.
Node failure recovery in Kubernetes is automatic but not instant. The control plane waits 40 seconds before marking a node NotReady, then another five minutes by default before evicting its pods. PDBs and honest readiness probes make the rescheduling clean instead of chaotic.
Originally posted on LinkedIn.