Kubernetes HTTP liveness probe fails when the pod is under heavy load

Introduction

An HTTP liveness probe that fails only under heavy load is often a sign that the probe endpoint is too expensive, the timeout is too aggressive, or liveness is being used for a readiness problem. Restarting a busy but still recoverable pod can make the outage worse by adding more churn at exactly the wrong time.

Understand What Liveness Should Mean

A liveness probe answers one narrow question: is the process stuck badly enough that restarting it is the right action? It is not a general service-quality check, and it should not fail just because the application is slow.

If the application is overloaded but still functioning, that is usually a readiness problem. In that case, the pod should stop receiving new traffic until it recovers, not be restarted automatically.

Keep the Liveness Endpoint Cheap

A common cause of false failures is using an endpoint that hits the database, checks multiple downstream services, or performs real application work. Under load, those dependencies slow down and the liveness probe times out.

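A better endpoint only verifies that the process itself is still responsive: the event loop or web server can answer a trivial request, or a core health flag is still set. It should not call the database or any downstream service.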

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 2
```

In this setup, /health/live should be lightweight, while /health/ready can reflect whether the pod is ready to serve traffic.

Tune Probe Timing for Real Load

The Kubernetes defaults (timeoutSeconds: 1, periodSeconds: 10, failureThreshold: 3) are often too strict for CPU-bound services or services with high tail latency. If a pod occasionally needs several seconds to respond during spikes, a timeoutSeconds value of 1 is unrealistic.

A more forgiving probe might look like this:

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 5
```

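This does not hide true failures forever. With a periodSeconds of 10 and a failureThreshold of 5, a container that is genuinely hung is still restarted after five consecutive failed probes, roughly 50 seconds. The looser settings simply give the process a reasonable chance to survive transient pressure.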

Add a Startup Probe for Slow Warmup

If the application takes a long time to warm caches, finish JIT compilation, or establish connections, a startupProbe can protect the pod from early liveness failures.

```yaml
startupProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 30
```

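Until the startup probe succeeds, Kubernetes does not run the liveness or readiness checks. With a periodSeconds of 10 and a failureThreshold of 30, the application has up to 30 × 10 = 300 seconds to finish starting before the container is restarted. That is helpful for applications that are slow only during boot.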
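To show where these probes sit relative to each other, here is a minimal sketch of a complete Pod manifest that combines a startup probe with a liveness probe. The pod name, container name, and image are hypothetical placeholders, and the timing values are illustrative rather than recommended.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-start-app               # hypothetical name
spec:
  containers:
    - name: app                      # hypothetical container name
      image: example.com/app:1.0     # hypothetical image
      ports:
        - containerPort: 8080
      # Runs first; liveness checks do not start until this succeeds.
      startupProbe:
        httpGet:
          path: /health/live
          port: 8080
        periodSeconds: 10
        failureThreshold: 30         # up to 30 x 10s = 300s to start
      # Takes over once the startup probe has succeeded.
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        periodSeconds: 10
        timeoutSeconds: 2
        failureThreshold: 3
```

Because the startup probe already covers the boot phase, the liveness probe in this sketch does not set initialDelaySeconds. Whether that suits a given service depends on how it starts.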

Fix the Real Resource Problem Too

Probe tuning helps, but it should not hide real capacity issues. If the pod saturates CPU or memory under normal traffic, investigate resource requests, limits, concurrency settings, and scaling behavior.

For example, if the application server has too few worker threads, the health endpoint can be starved behind user requests. In that case, even a cheap endpoint may time out because no worker is available to serve it.

Autoscaling, higher CPU requests, or isolating health handling onto a lightweight path can all improve probe reliability.
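The options above can be sketched in configuration. The fragment below assumes a Deployment named app whose process serves its health endpoint on a separate port 8081, so that health requests do not queue behind user traffic on 8080. The names, port numbers, resource values, and scaling targets are all hypothetical, and the application itself must be written to listen on that second port.

```yaml
# Container fragment: goes under spec.template.spec.containers in a Deployment.
- name: app                          # hypothetical container name
  image: example.com/app:1.0         # hypothetical image
  resources:
    requests:
      cpu: "500m"                    # illustrative; size from observed usage
      memory: "512Mi"
    limits:
      memory: "1Gi"                  # no CPU limit set here (an assumption);
                                     # CPU limits are what trigger throttling
  ports:
    - name: http
      containerPort: 8080
    - name: health                   # assumed dedicated health listener
      containerPort: 8081
  livenessProbe:
    httpGet:
      path: /health/live
      port: health                   # refers to the named port above
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 5
---
# Scale out on CPU before individual pods saturate.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

A separate health port only helps if the health handler also runs on its own thread or worker. If it shares the saturated worker pool, it is starved in the same way.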

Consider Probe Type Alternatives

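HTTP is convenient, but it is not always the best liveness check. If the service can get stuck in a way the HTTP layer does not reveal, for example a deadlocked background worker behind a web server that still responds, an exec probe can check that condition directly. A TCP probe only confirms that the port accepts connections, so it mainly suits services with no HTTP endpoint. The right choice depends on what “stuck” means for the application.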

Still, probe type is secondary to semantics. The main question is whether failure should trigger a restart or merely temporary traffic removal.
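As one illustration of a non-HTTP check, the sketch below uses an exec probe that fails when a heartbeat file has not been updated recently. It assumes the application touches /tmp/heartbeat on every iteration of its main work loop, and that the container image includes sh, date, and a stat that supports -c %Y (GNU coreutils or BusyBox). The file path and the 30-second threshold are hypothetical.

```yaml
livenessProbe:
  exec:
    command:
      - sh
      - -c
      # Succeeds (exit 0) only if the heartbeat file was modified less than
      # 30 seconds ago; a missing file also makes the command fail.
      - '[ $(( $(date +%s) - $(stat -c %Y /tmp/heartbeat) )) -lt 30 ]'
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
```

This ties liveness to actual progress in the work loop rather than to whether the web server answers. The trade-off is that a loop that is legitimately busy for longer than the threshold looks stuck, so the threshold has to exceed the longest expected iteration.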

Common Pitfalls

The most common mistake is using the same heavy endpoint for both readiness and liveness. That couples restart decisions to load conditions and makes cascading failures more likely.

Another issue is tuning only the probe while ignoring actual pod saturation. If CPU throttling or memory pressure is severe, the health checks are only exposing a deeper problem.

A final mistake is assuming every timeout is proof the process is dead. Under heavy load, slow does not automatically mean unrecoverable.

Summary

Liveness should detect stuck processes, not overloaded but recoverable ones.
Keep the liveness endpoint cheap and separate from readiness checks.
Increase timeout and failure thresholds when the current settings are unrealistically strict.
Use a startup probe for slow initialization.
Fix underlying capacity and concurrency bottlenecks instead of relying only on probe tuning.