kafka readiness probes failing

Kafka

readiness probes

troubleshooting

error resolution

microservices monitoring

kafka readiness probes failing

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

A Kafka broker can open its port long before it is truly ready to serve clients. That is why Kafka readiness probes often fail in noisy or confusing ways in Kubernetes. Sometimes the probe is too strict and keeps the broker unready during long log recovery. Sometimes it is too weak and marks the broker ready just because a TCP socket opened. The fix is to probe the right thing and give startup enough time.

Why Kafka Readiness Is Tricky

Kafka startup is not the same as a simple web service startup. A broker may need to:

recover logs
replay metadata
rejoin the cluster
elect or learn controller state
open listeners and advertise correct addresses

During that time, a naive probe can misread the broker state. A TCP probe only confirms that the process is listening. It does not prove the broker can answer metadata requests from clients.

Common Reasons Probes Fail

The most common causes are:

the probe starts too early for a slow broker
the probe checks the wrong port or listener
the broker is listening, but the cluster is not yet usable
DNS or advertised listener settings are wrong
the container is CPU- or disk-starved during startup

Large partition counts and unclean previous shutdowns can make this much worse because the broker spends longer on recovery before it can genuinely serve traffic.

Prefer a Real Broker Check Over a Bare TCP Check

A TCP readiness probe is acceptable only as a weak baseline. For Kafka, a better readiness test is one that verifies the broker can answer a real client request.

If the image includes Kafka CLI tools, an exec probe is usually more meaningful:

yaml

1readinessProbe:
2  exec:
3    command:
4      - /bin/sh
5      - -ec
6      - kafka-broker-api-versions.sh --bootstrap-server 127.0.0.1:9092 >/dev/null 2>&1
7  periodSeconds: 10
8  timeoutSeconds: 5
9  failureThreshold: 6

This checks more than "is the port open." It checks whether the broker can respond to a metadata-level client request.

If the image does not include the CLI tools, a TCP probe is still possible, but accept that it is less accurate:

yaml

1readinessProbe:
2  tcpSocket:
3    port: 9092
4  periodSeconds: 10
5  timeoutSeconds: 3
6  failureThreshold: 6

Use a Startup Probe for Slow Recovery

Kubernetes explicitly recommends startupProbe for slow-starting containers. That advice fits Kafka well because brokers often need extra time before readiness checks should matter.

A solid pattern is:

'startupProbe gives the broker time to recover'
'readinessProbe decides when traffic is safe'
'livenessProbe stays conservative so the pod does not restart during normal recovery'

Example:

yaml

1startupProbe:
2  exec:
3    command:
4      - /bin/sh
5      - -ec
6      - kafka-broker-api-versions.sh --bootstrap-server 127.0.0.1:9092 >/dev/null 2>&1
7  periodSeconds: 10
8  failureThreshold: 30
9
10readinessProbe:
11  exec:
12    command:
13      - /bin/sh
14      - -ec
15      - kafka-broker-api-versions.sh --bootstrap-server 127.0.0.1:9092 >/dev/null 2>&1
16  periodSeconds: 10
17  failureThreshold: 6
18
19livenessProbe:
20  tcpSocket:
21    port: 9092
22  periodSeconds: 20
23  failureThreshold: 6

This gives the broker up to 300 seconds to complete startup before Kubernetes treats it as failed.

Validate Listener Configuration

Probe failures are often symptoms of listener mistakes rather than probe mistakes. Check these values carefully:

'listeners'
'advertised.listeners'
'inter.broker.listener.name'
service DNS names and ports

A broker might respond locally on 127.0.0.1:9092 but advertise a hostname that clients inside the cluster cannot resolve. In that case the readiness probe may pass while actual applications still fail, or the reverse if the probe itself targets the wrong listener.

A simplified environment example:

properties

listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://kafka-0.kafka-headless.default.svc.cluster.local:9092
inter.broker.listener.name=PLAINTEXT

Probe behavior only makes sense when the listener configuration is correct.

Watch the Pod Events and Broker Logs Together

Do not debug probes by staring only at the probe YAML. Look at both Kubernetes and Kafka evidence.

Useful commands:

bash

kubectl describe pod kafka-0
kubectl logs kafka-0

If kubectl describe shows repeated readiness failures while the Kafka logs still say the broker is recovering logs or waiting on cluster metadata, the probe is probably just too aggressive. If the logs show address binding or listener errors, the probe is only revealing a real startup problem.

Common Pitfalls

The biggest mistake is using a TCP readiness probe and assuming it proves the broker is fully usable. It only proves a socket is open.

Another mistake is using readiness and liveness with the same aggressive settings. Kafka often needs longer startup windows than stateless services, and a missing startupProbe can turn slow recovery into repeated restarts.

Teams also often forget the storage angle. If persistent volumes are slow or nearly full, brokers can miss probe windows even when the YAML is otherwise fine.

Finally, do not treat probe tuning as a substitute for fixing bad listener configuration. If advertised.listeners is wrong, no amount of initialDelaySeconds will make the cluster healthy.

Summary

Kafka readiness is harder than "port open equals ready."
Prefer an exec readiness probe that performs a real broker-level check.
Add a startupProbe for slow recovery and log replay.
Verify listener and advertised listener settings before blaming probe timing.
Read pod events and Kafka logs together to distinguish slow startup from real configuration errors.