Spring Kafka consumer doesn't commit to Kafka server after leader change
Apache Kafka is a popular distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Spring Kafka, in turn, provides a high-level abstraction for Kafka-based messaging solutions.
A common issue with Kafka consumers is that offsets fail to be committed back to the Kafka server after a leader change in the cluster. Understanding how consumer commits interact with broker behavior during leader transitions is crucial for keeping message processing reliable and accurate.
Understanding Kafka Consumers and Offset Commit
Kafka consumers fetch data from a topic and are responsible for keeping track of the offsets (positions) they have consumed to ensure that in case of consumer failure or reboot, they can resume processing from the last committed offset. Commits can be manual or automatic, configurable through consumer properties:
- enable.auto.commit=true (default): the consumer's offset is committed automatically at intervals defined by auto.commit.interval.ms.
- enable.auto.commit=false: the consumer needs to manually commit the offsets, either by calling commitSync() or commitAsync().
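As a rough illustration of the two settings, the sketch below builds consumer properties with auto-commit switched off. The class name ManualCommitConfig and its method are hypothetical helpers, not part of Kafka or Spring Kafka; only the property keys are real Kafka consumer settings, and the values are example assumptions.

```java
import java.util.Properties;

// Hypothetical helper (not a Kafka or Spring Kafka class): builds consumer
// properties with auto-commit disabled so the application commits explicitly.
final class ManualCommitConfig {
    private ManualCommitConfig() {}

    static Properties consumerProperties(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Disable periodic auto-commit; the application must then call
        // commitSync() or commitAsync() itself.
        props.setProperty("enable.auto.commit", "false");
        // Only consulted when enable.auto.commit=true; kept here for comparison.
        // 5000 ms is the Kafka client default.
        props.setProperty("auto.commit.interval.ms", "5000");
        return props;
    }
}
```

These properties would typically be passed to the consumer constructor, together with the key and value deserializer settings that are omitted here.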
Issue with Consumer Commit During Leader Change
A Kafka cluster consists of several brokers, and each partition has one leader and zero or more follower replicas. By default, the leader handles all read and write requests for the partition, while followers replicate the leader. If the leader broker goes down or becomes unreachable, one of the in-sync followers is automatically promoted to be the new leader.
When a leader changes, the consumer may encounter issues such as:
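- Failed Commit Attempts: Offset commits are sent to the group coordinator, the broker that leads the relevant partition of the internal __consumer_offsets topic. If the consumer tries to commit its offset while a leader election for that partition is in progress, the commit can fail because the new leader has not been elected yet, or because the consumer is not yet aware of the new leader and still sends its commit to the old one.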
- Offset Commit Delays: Post leader election, it may take some time for the new leader to process offset commit requests, causing delays.
- Duplicate Processing: If the consumer does not notice that the commit failed and goes on to fetch new records, the same messages may be processed a second time when the consumer restarts or when another consumer in the group takes over the partition.
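The duplicate-processing case can be made concrete with a small simulation. This is only a sketch: DuplicateProcessingDemo and its parameter names are hypothetical, and it assumes the usual Kafka convention that a committed offset is the position of the next record to read.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation, not Kafka client code.
final class DuplicateProcessingDemo {
    private DuplicateProcessingDemo() {}

    // committedPosition: next offset to read according to the last successful commit.
    // localPosition: next offset the consumer would have read when it stopped.
    // Returns the offsets a restarted (or replacement) consumer processes again.
    static List<Long> reprocessedOffsets(long committedPosition, long localPosition) {
        List<Long> duplicates = new ArrayList<>();
        for (long offset = committedPosition; offset < localPosition; offset++) {
            duplicates.add(offset);
        }
        return duplicates;
    }
}
```

For example, if the last successful commit was position 7 but the consumer had already processed records up to offset 9 (local position 10), reprocessedOffsets(7, 10) yields offsets 7, 8 and 9, which is why downstream processing should tolerate redelivery.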
Solutions and Best Practices
- Handling Commit Failures: Ensure that the consumer handles commit exceptions properly. Use retries for committing the offsets. For instance, you could catch specific exceptions like CommitFailedException and retry committing after a brief pause. Note that CommitFailedException usually signals that the group has rebalanced, in which case the consumer has to call poll() and rejoin the group before a commit can succeed.
- Tuning Consumer Configurations: Adjust retry.backoff.ms to manage the duration between retries. Increase session.timeout.ms and request.timeout.ms to provide a broader window for leader elections and offset commits during such changes.
- Monitoring and Alerts: Implement monitoring on the Kafka cluster and set up alerts for leader changes so the DevOps team is aware of them and can intervene if the automated processes fail.
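The retry advice above could be sketched as a small helper with a bounded number of attempts and a fixed pause. Everything here is an assumption for illustration: CommitRetry, commitWithRetry, and the backoffMs parameter are hypothetical names (backoffMs is separate from the client's retry.backoff.ms setting), and the commit itself is passed in as a Runnable so the snippet does not depend on the Kafka client library.

```java
import java.util.function.Predicate;

// Hypothetical helper, not part of Kafka or Spring Kafka.
final class CommitRetry {
    private CommitRetry() {}

    // Runs the commit action until it succeeds, a non-retriable failure occurs,
    // or maxAttempts is reached. Returns the number of attempts that were made.
    static int commitWithRetry(Runnable commit, int maxAttempts, long backoffMs,
                               Predicate<RuntimeException> isRetriable) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                commit.run();
                return attempt;
            } catch (RuntimeException e) {
                // Give up when attempts are exhausted or the caller says a retry cannot help.
                if (attempt >= maxAttempts || !isRetriable.test(e)) {
                    throw e;
                }
                try {
                    Thread.sleep(backoffMs); // brief pause before the next attempt
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw e; // stop retrying when interrupted
                }
            }
        }
    }
}
```

With the real client, the action might be something like () -> consumer.commitSync(), and the predicate would accept only failures that a retry can fix, such as timeouts, while letting a rebalance-related CommitFailedException propagate.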
Example Scenario and Recovery Strategy
Below is an example scenario highlighting the issue and a strategy to handle it:
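The following is a minimal, hypothetical sketch rather than Spring Kafka code: OffsetCommitter and CommitFailure are stand-ins for the consumer's commitSync() call and for exceptions such as CommitFailedException, defined locally so the snippet runs without the Kafka client library. BatchProcessor is likewise an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the consumer's commit call (for example KafkaConsumer.commitSync()).
interface OffsetCommitter {
    void commitSync();
}

// Stand-in for a commit failure such as CommitFailedException.
class CommitFailure extends RuntimeException {
    CommitFailure(String message) {
        super(message);
    }
}

// Hypothetical processing loop body: handle a polled batch, then commit.
final class BatchProcessor {
    private final OffsetCommitter committer;
    private final List<String> processed = new ArrayList<>();

    BatchProcessor(OffsetCommitter committer) {
        this.committer = committer;
    }

    // Processes one batch and tries to commit; returns true if the commit succeeded.
    boolean handleBatch(List<String> records) {
        for (String record : records) {
            processed.add(record); // placeholder for real business logic
        }
        try {
            committer.commitSync();
            return true;
        } catch (CommitFailure e) {
            // This is where a commit failure around a leader change surfaces.
            // A real application would log, alert, or retry here; the records in
            // this batch may be redelivered, so processing should tolerate duplicates.
            System.err.println("Offset commit failed: " + e.getMessage());
            return false;
        }
    }

    List<String> processed() {
        return processed;
    }
}
```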
This basic example does not implement a full retry mechanism but highlights where you would handle such an exception.
Summary Table
| Issue | Consequence | Solution Suggestion |
| --- | --- | --- |
| Failed Commit | Offsets not committed, messages might be reprocessed | Catch CommitFailedException, retry commit |
| Offset Commit Delay | Slower performance, delayed processing | Increase session.timeout.ms, request.timeout.ms |
| Leader Change Alert | Unexpected behavior without prior warning | Monitor and set alerts for leader changes |
Understanding and handling Kafka consumer behavior during leader changes helps keep the consumer application available and its data consistent, which is crucial for systems that rely on timely and reliable data processing.

