Kafka Consumer Lag: When Your System Is Up but No Longer Real Time

April 5, 2026

The hardest Kafka incidents are the ones where nothing is broken. Producers are producing. Consumers are consuming. Every dashboard is green. And the data is forty minutes old.

That is consumer lag.

The definition is mechanical. For each partition, lag is the log-end offset (the next offset a producer would write) minus the committed offset (the offset of the next record the consumer group will read, one past the last record it has acknowledged). Sum that across all partitions a group owns and you get total group lag. A lag of zero means consumers are caught up. A lag of fifty thousand means fifty thousand records have been produced but not yet acknowledged as processed.
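As a minimal sketch of that arithmetic, assume you already have two maps of partition to offset, for example from the LOG-END-OFFSET and CURRENT-OFFSET columns that `kafka-consumer-groups --describe` prints. The function name and the sample numbers are made up for illustration.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Return {partition: lag} from two {partition: offset} maps."""
    lags = {}
    for partition, end in log_end_offsets.items():
        # Assumption of this sketch: a partition with no commit yet counts
        # as unread from offset 0. A real group would instead fall back to
        # its auto.offset.reset policy.
        committed = committed_offsets.get(partition, 0)
        # The two offsets are sampled at slightly different moments, so a
        # naive subtraction can briefly go negative. Clamp at zero.
        lags[partition] = max(end - committed, 0)
    return lags


# Hypothetical snapshot of a three-partition topic.
log_end = {0: 120_000, 1: 98_500, 2: 101_000}
committed = {0: 120_000, 1: 58_500, 2: 91_000}

lags = partition_lag(log_end, committed)
total_lag = sum(lags.values())
print(lags)       # {0: 0, 1: 40000, 2: 10000}
print(total_lag)  # 50000
```

Partition 0 is caught up, and the group as a whole is fifty thousand records behind.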

The trap is that lag is not a binary up-or-down signal. A healthy system has lag all the time. Right after a deploy, consumers are warming up and lag spikes briefly. During a replay from an older offset, lag is huge by design. A sudden producer burst pushes lag up until consumers drain it. None of those are incidents.

The incident is when lag stops draining.

This happens when consumer throughput drops below producer throughput. A bad deploy adds a slow database write inside the message handler. A downstream API gets slower. A new message format is twice the size and the deserializer is now the bottleneck. The consumer is still alive, still polling, still committing offsets. Just slower than the producers.

Once lag starts growing it tends to keep growing. Processing falls behind, downstream consumers see stale data, alerts fire late, recommendations use old signals. If the slowdown is bad enough, the consumer falls further behind than the topic's retention window, and Kafka deletes records before they are ever read. By the time you notice, you cannot replay them because they are gone.
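A rough way to put a number on that risk, assuming steady produce and consume rates and treating retention as a hard age cutoff (both simplifications), is to ask how fast the oldest unread record is aging. The function and its parameters are hypothetical names for this sketch.

```python
def seconds_until_data_loss(oldest_unread_age_s, retention_s,
                            produce_rate, consume_rate):
    """Estimate seconds until the oldest unread record ages out.

    Rates are in records per second. Returns None when the consumer is
    keeping up or catching up, since the backlog is not getting older.
    """
    if produce_rate <= 0 or consume_rate >= produce_rate:
        return None
    # Each wall-clock second the consumer works through only
    # consume_rate / produce_rate seconds' worth of produced data, so the
    # age of its oldest unread record grows by the remainder.
    age_growth_per_s = 1 - consume_rate / produce_rate
    headroom_s = max(retention_s - oldest_unread_age_s, 0)
    return headroom_s / age_growth_per_s


# Data is 40 minutes old, retention is 24 hours, and consumers handle
# 800 records/s against 1,000 records/s produced.
eta = seconds_until_data_loss(40 * 60, 24 * 3600, 1000, 800)
print(round(eta / 3600, 1))  # 116.7 hours of headroom
```

In this example there are days of headroom. With a one-hour retention and the same rates, the same forty-minute backlog leaves under two hours.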

The diagnosis questions are: is the lag growing on every partition or only some, is the handler slow or is downstream slow, and is the consumer group churning through rebalances. Per-partition lag matters because a single hot key on one partition can overload the consumer that owns it while every other partition looks fine.
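The "every partition or only some" question can be sketched as a comparison against the group's median lag. The `factor` and `floor` thresholds here are arbitrary illustrative values, not recommendations.

```python
from statistics import median


def hot_partitions(lags, factor=10, floor=1_000):
    """Return partitions whose lag is far above the group's median.

    lags is a {partition: lag} map. A partition is flagged when its lag
    exceeds both `floor` and `factor` times the median lag.
    """
    if not lags:
        return []
    typical = median(lags.values())
    threshold = max(typical * factor, floor)
    return sorted(p for p, lag in lags.items() if lag > threshold)


# One partition is badly behind: likely a hot key or one slow consumer.
print(hot_partitions({0: 120, 1: 95, 2: 48_000, 3: 110}))  # [2]

# Every partition is behind by a similar amount: nothing stands out,
# which points at a group-wide slowdown instead.
print(hot_partitions({0: 40_000, 1: 42_000, 2: 39_000}))   # []
```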

The fixes follow the bottleneck. If handlers are slow, move blocking work onto a worker pool so the poll loop only fetches records and commits offsets, and commit an offset only after the work for it has finished, otherwise a crash loses the in-flight records. If downstream is slow, add backpressure or batching. If you are CPU bound on the consumer side, scale the group up by adding consumers. The hard limit is the partition count: a group with more consumers than partitions just has idle consumers. If parallelism is your bottleneck and you are already at one consumer per partition, you have to add partitions, which requires planning because Kafka does not let you reduce partition count later.
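Here is one possible shape for the worker-pool idea, using only the standard library and no Kafka client. It assumes records from a single partition arrive as `(offset, value)` pairs, that the handler is safe to run out of order, and that it is idempotent, since records after a failure may be processed again. `process_batch` and `slow_handler` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(records, handler, pool):
    """Run handler over (offset, value) records on a worker pool.

    Returns the offset that is safe to commit: one past the last record
    of the longest fully successful prefix, or None if the very first
    record failed.
    """
    futures = [(offset, pool.submit(handler, value))
               for offset, value in records]
    safe_to_commit = None
    for offset, future in futures:
        # exception() blocks until that task is done, then returns the
        # raised exception or None.
        if future.exception() is not None:
            break  # never commit past a record that failed
        safe_to_commit = offset + 1
    return safe_to_commit


def slow_handler(value):
    # Stand-in for a blocking database write or API call.
    if value == "bad":
        raise ValueError("cannot process")
    return value.upper()


records = [(100, "a"), (101, "b"), (102, "bad"), (103, "c")]
with ThreadPoolExecutor(max_workers=4) as pool:
    print(process_batch(records, slow_handler, pool))  # 102
```

The poll loop would commit the returned offset and nothing beyond it, so a restart resumes at the failed record instead of skipping it.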

Lag is the cheapest streaming SLO you can monitor. Watch it per partition, alert on growth rate rather than absolute value, and your real-time system stays real-time.
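Alerting on growth rate could look something like the sketch below: fit a line through recent `(timestamp, lag)` samples and alert on the slope. The 50 records/s threshold and the sample windows are invented for illustration.

```python
def lag_growth_rate(samples):
    """Least-squares slope of (timestamp_s, lag) samples, in records/s."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    num = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


def should_alert(samples, max_growth=50.0):
    """Alert when lag is growing faster than max_growth records/s."""
    return lag_growth_rate(samples) > max_growth


# A large backlog after a producer burst, but draining: not an incident.
draining = [(0, 50_000), (60, 38_000), (120, 26_000), (180, 14_000)]
# A smaller backlog that keeps climbing: this is the one to page on.
growing = [(0, 2_000), (60, 8_000), (120, 14_000), (180, 20_000)]

print(should_alert(draining))  # False
print(should_alert(growing))   # True
```

An absolute threshold of, say, 30,000 would have paged on the draining burst and stayed silent on the lag that is actually growing.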

Key takeaway

Consumer lag is `log-end offset minus committed offset`, per partition. The dangerous case is when consumers are alive but slower than producers. Scale consumers up to the partition count, never past it, and move slow work out of the poll loop.

Originally posted on LinkedIn.