kafka readiness probes failing
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
A Kafka broker can open its port long before it is truly ready to serve clients. That is why Kafka readiness probes often fail in noisy or confusing ways in Kubernetes. Sometimes the probe is too strict and keeps the broker unready during long log recovery. Sometimes it is too weak and marks the broker ready just because a TCP socket opened. The fix is to probe the right thing and give startup enough time.
Why Kafka Readiness Is Tricky
Kafka startup is not the same as a simple web service startup. A broker may need to:
- recover logs
- replay metadata
- rejoin the cluster
- elect or learn controller state
- open listeners and advertise correct addresses
During that time, a naive probe can misread the broker state. A TCP probe only confirms that the process is listening. It does not prove the broker can answer metadata requests from clients.
Common Reasons Probes Fail
The most common causes are:
- the probe starts too early for a slow broker
- the probe checks the wrong port or listener
- the broker is listening, but the cluster is not yet usable
- DNS or advertised listener settings are wrong
- the container is CPU- or disk-starved during startup
Large partition counts and unclean previous shutdowns can make this much worse because the broker spends longer on recovery before it can genuinely serve traffic.
Prefer a Real Broker Check Over a Bare TCP Check
A TCP readiness probe is acceptable only as a weak baseline. For Kafka, a better readiness test is one that verifies the broker can answer a real client request.
If the image includes Kafka CLI tools, an exec probe is usually more meaningful:
This checks more than "is the port open." It checks whether the broker can respond to a metadata-level client request.
If the image does not include the CLI tools, a TCP probe is still possible, but accept that it is less accurate:
Use a Startup Probe for Slow Recovery
Kubernetes explicitly recommends startupProbe for slow-starting containers. That advice fits Kafka well because brokers often need extra time before readiness checks should matter.
A solid pattern is:
- '
startupProbegives the broker time to recover' - '
readinessProbedecides when traffic is safe' - '
livenessProbestays conservative so the pod does not restart during normal recovery'
Example:
This gives the broker up to 300 seconds to complete startup before Kubernetes treats it as failed.
Validate Listener Configuration
Probe failures are often symptoms of listener mistakes rather than probe mistakes. Check these values carefully:
- '
listeners' - '
advertised.listeners' - '
inter.broker.listener.name' - service DNS names and ports
A broker might respond locally on 127.0.0.1:9092 but advertise a hostname that clients inside the cluster cannot resolve. In that case the readiness probe may pass while actual applications still fail, or the reverse if the probe itself targets the wrong listener.
A simplified environment example:
Probe behavior only makes sense when the listener configuration is correct.
Watch the Pod Events and Broker Logs Together
Do not debug probes by staring only at the probe YAML. Look at both Kubernetes and Kafka evidence.
Useful commands:
If kubectl describe shows repeated readiness failures while the Kafka logs still say the broker is recovering logs or waiting on cluster metadata, the probe is probably just too aggressive. If the logs show address binding or listener errors, the probe is only revealing a real startup problem.
Common Pitfalls
The biggest mistake is using a TCP readiness probe and assuming it proves the broker is fully usable. It only proves a socket is open.
Another mistake is using readiness and liveness with the same aggressive settings. Kafka often needs longer startup windows than stateless services, and a missing startupProbe can turn slow recovery into repeated restarts.
Teams also often forget the storage angle. If persistent volumes are slow or nearly full, brokers can miss probe windows even when the YAML is otherwise fine.
Finally, do not treat probe tuning as a substitute for fixing bad listener configuration. If advertised.listeners is wrong, no amount of initialDelaySeconds will make the cluster healthy.
Summary
- Kafka readiness is harder than "port open equals ready."
- Prefer an
execreadiness probe that performs a real broker-level check. - Add a
startupProbefor slow recovery and log replay. - Verify listener and advertised listener settings before blaming probe timing.
- Read pod events and Kafka logs together to distinguish slow startup from real configuration errors.

