Difference between session.timeout.ms and max.poll.interval.ms for Kafka >= 0.10.1

Introduction

session.timeout.ms and max.poll.interval.ms both affect consumer liveness in Kafka, but they protect against different failure modes. The short version is that session.timeout.ms is about staying alive in the group through heartbeats, while max.poll.interval.ms is about making progress by calling poll() often enough.

session.timeout.ms Watches Heartbeats

Kafka consumers in a group must keep sending heartbeats to the group coordinator, which is one of the brokers. If the coordinator receives no heartbeat from a consumer within session.timeout.ms, it considers that consumer dead, removes it from the group, and triggers a rebalance.

That means session.timeout.ms is mainly about membership liveness:

  • process crashed
  • machine disconnected
  • network stalled long enough to miss heartbeats
  • heartbeat thread stalled, for example during a long GC pause

It is typically paired with heartbeat.interval.ms, which must be lower than session.timeout.ms and is usually set to no more than one third of it, so the consumer sends several heartbeats during one session window.

properties
session.timeout.ms=10000
heartbeat.interval.ms=3000

A smaller session timeout detects dead consumers faster, but it also makes the group more sensitive to transient pauses or network jitter.
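The relationship between the two heartbeat settings comes down to simple arithmetic. The sketch below is only an illustration and not part of the Kafka client: HeartbeatConfigCheck and its methods are hypothetical names, and the one-third rule it applies is the guideline from the Kafka consumer configuration documentation.

```java
import java.util.Properties;

final class HeartbeatConfigCheck {
    /** Reads a millisecond setting from the given properties. */
    static long millis(Properties props, String key) {
        return Long.parseLong(props.getProperty(key));
    }

    /** Number of whole heartbeat intervals that fit into one session window. */
    static long heartbeatsPerSession(Properties props) {
        return millis(props, "session.timeout.ms") / millis(props, "heartbeat.interval.ms");
    }

    /** True if the heartbeat interval is at most one third of the session timeout. */
    static boolean followsOneThirdGuideline(Properties props) {
        return millis(props, "heartbeat.interval.ms") * 3 <= millis(props, "session.timeout.ms");
    }

    /** Builds the example values used in the text above. */
    static Properties example() {
        Properties props = new Properties();
        props.setProperty("session.timeout.ms", "10000");
        props.setProperty("heartbeat.interval.ms", "3000");
        return props;
    }
}
```

With the values above, three full heartbeat intervals fit into one session window, so a single delayed heartbeat does not by itself expire the session.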

max.poll.interval.ms Watches Application Progress

Kafka 0.10.1 (KIP-62) moved heartbeats to a background thread, so heartbeating no longer depends on the application calling poll(). As a result, a consumer can still look alive from a heartbeat perspective while the application is stuck processing a batch and never calls poll() again.

That is what max.poll.interval.ms is for. It places an upper bound on the time between poll() calls.

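java
// Assumes `consumer` is an already-subscribed KafkaConsumer<String, String>
// and process(...) is the application's own record handler.
while (true) {
    ConsumerRecords<String, String> records =
            consumer.poll(Duration.ofMillis(500));

    // Everything in this loop runs between two poll() calls.
    for (ConsumerRecord<String, String> record : records) {
        process(record);
    }
}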

If process(record) takes so long that the next poll() does not happen before max.poll.interval.ms expires, the consumer is considered failed: its background thread stops heartbeating and leaves the group, and the coordinator reassigns its partitions to the remaining members. In other words, the consumer may still be alive as a process but dead in terms of useful progress.
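One way to see how close a poll loop comes to this limit is to measure the gap between consecutive poll() calls. The following is a minimal sketch under assumptions: PollGapTracker is a hypothetical helper, not a Kafka class, and the clock is injected so the logic can be exercised without waiting in real time.

```java
import java.util.function.LongSupplier;

final class PollGapTracker {
    private final LongSupplier clockMillis;
    private final long maxPollIntervalMillis;
    private long lastPollMillis;
    private long worstGapMillis;

    PollGapTracker(LongSupplier clockMillis, long maxPollIntervalMillis) {
        this.clockMillis = clockMillis;
        this.maxPollIntervalMillis = maxPollIntervalMillis;
        this.lastPollMillis = clockMillis.getAsLong();
    }

    /** Call immediately before each poll(); returns the gap since the previous call. */
    long beforePoll() {
        long now = clockMillis.getAsLong();
        long gap = now - lastPollMillis;
        lastPollMillis = now;
        worstGapMillis = Math.max(worstGapMillis, gap);
        return gap;
    }

    /** True once any observed gap has used more than the given fraction of the interval. */
    boolean worstGapExceeds(double fractionOfInterval) {
        return worstGapMillis > fractionOfInterval * maxPollIntervalMillis;
    }

    long worstGapMillis() {
        return worstGapMillis;
    }
}
```

In a real loop the tracker could be created with System::currentTimeMillis as the clock and beforePoll() called at the top of each iteration; logging when worstGapExceeds(0.8) turns true gives a warning before the consumer is actually removed.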

Why They Are Not Interchangeable

These two settings solve different problems:

  • session.timeout.ms answers "is this consumer still heartbeating?"
  • max.poll.interval.ms answers "is this consumer still returning to poll() and participating normally?"

That difference matters when processing is slow. Raising only session.timeout.ms does not fix a consumer that spends too long between polls. Likewise, raising only max.poll.interval.ms does not help if the process or network stops heartbeats entirely.
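The independence of the two checks can be made concrete with a toy model. This is an illustrative sketch only: LivenessModel and its Verdict enum are hypothetical, and the real decision is split between the group coordinator and the consumer's background thread rather than made in one place.

```java
final class LivenessModel {
    enum Verdict { HEALTHY, SESSION_TIMEOUT, POLL_INTERVAL_EXCEEDED }

    /**
     * Decides which timeout, if any, has expired.
     * The heartbeat check comes first because a consumer whose heartbeats have
     * stopped is gone from the coordinator's point of view regardless of polling.
     */
    static Verdict classify(long millisSinceHeartbeat, long millisSincePoll,
                            long sessionTimeoutMillis, long maxPollIntervalMillis) {
        if (millisSinceHeartbeat > sessionTimeoutMillis) {
            return Verdict.SESSION_TIMEOUT;
        }
        if (millisSincePoll > maxPollIntervalMillis) {
            return Verdict.POLL_INTERVAL_EXCEEDED;
        }
        return Verdict.HEALTHY;
    }
}
```

For a consumer that heartbeated 3 seconds ago but last polled 400 seconds ago, classify(3000, 400000, 15000, 300000) and classify(3000, 400000, 60000, 300000) both return POLL_INTERVAL_EXCEEDED: quadrupling the session timeout changes nothing, because the heartbeat check was never the one failing.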

A typical configuration for heavy processing might look like this:

properties
session.timeout.ms=15000
heartbeat.interval.ms=5000
max.poll.interval.ms=300000
max.poll.records=100

If processing still exceeds five minutes, you should consider reducing max.poll.records, moving long work off the poll thread, or decoupling ingestion from downstream processing rather than simply increasing timeouts forever.
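Whether a batch fits inside the poll interval is a worst-case multiplication: records per batch times time per record. The sketch below works through that arithmetic; PollBudget is a hypothetical helper, and the per-record time and safety fraction are assumed inputs that would have to come from measurements of the actual workload.

```java
final class PollBudget {
    /** Worst-case time to work through one full batch on the poll thread. */
    static long batchMillis(int maxPollRecords, long perRecordMillis) {
        return (long) maxPollRecords * perRecordMillis;
    }

    /**
     * Largest max.poll.records value whose full batch stays within the given
     * fraction of max.poll.interval.ms. Never returns less than 1.
     */
    static int largestSafeBatch(long maxPollIntervalMillis, long perRecordMillis,
                                double safetyFraction) {
        if (perRecordMillis <= 0) {
            throw new IllegalArgumentException("perRecordMillis must be positive");
        }
        long budgetMillis = (long) (maxPollIntervalMillis * safetyFraction);
        return (int) Math.max(1, budgetMillis / perRecordMillis);
    }
}
```

With max.poll.records=100 and an assumed 4 seconds per record, a full batch takes 400000 ms, which exceeds the 300000 ms interval. Keeping a full batch within half the interval at that rate means a batch size of at most 37.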

Practical Tuning Advice

If rebalances happen because the consumer crashes or loses connectivity, look at session.timeout.ms. If rebalances happen during long-running processing, look at max.poll.interval.ms.

In many systems, the better fix is architectural:

  • poll smaller batches
  • hand work to another thread pool
  • store records durably and process them outside the poll loop

Timeout tuning can help, but it should not hide a design that blocks the poll loop for too long.
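Handing work to another thread can be sketched without a broker. The example below is a simulation under assumptions, not production code: OffloadedPollLoop and process are hypothetical, Thread.sleep stands in for both the slow work and the poll() call, and offset commits and rebalance handling are left out entirely. With a real KafkaConsumer, the loop would typically pause() the assigned partitions, keep calling poll() while the batch is in flight, and resume() them afterwards.

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class OffloadedPollLoop {
    /** Hypothetical stand-in for slow per-record work. */
    static void process(String record) throws InterruptedException {
        Thread.sleep(20);
    }

    /**
     * Processes one batch on a worker thread while the calling thread keeps
     * "polling". Returns how many poll iterations ran while the batch was in flight.
     */
    static int runBatch(List<String> batch, long pollTimeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<?> inFlight = worker.submit(() -> {
                for (String record : batch) {
                    process(record);
                }
                return null;
            });
            int polls = 0;
            while (!inFlight.isDone()) {
                // Stand-in for consumer.poll(...) on paused partitions.
                Thread.sleep(pollTimeoutMillis);
                polls++;
            }
            inFlight.get(); // surfaces any exception thrown by process(...)
            return polls;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the batch", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("processing failed", e.getCause());
        } finally {
            worker.shutdown();
        }
    }
}
```

The point of the design is that the gap between poll iterations is bounded by the poll timeout rather than by how long the batch takes, so slow processing no longer counts against max.poll.interval.ms.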

Common Pitfalls

  • Assuming both settings control the same timeout.
  • Increasing session.timeout.ms when the real problem is slow processing between polls.
  • Increasing max.poll.interval.ms while still leaving huge batches on the poll thread.
  • Forgetting to tune max.poll.records alongside processing time.
  • Treating rebalances as only a broker-side issue instead of a consumer design issue.

Summary

  • session.timeout.ms is the heartbeat-based group membership timeout.
  • max.poll.interval.ms is the maximum allowed delay between poll() calls.
  • One protects against dead consumers, the other against stuck consumers.
  • Slow processing is usually a max.poll.interval.ms problem, not a session.timeout.ms problem.
  • Good tuning often includes smaller batches or moving work off the poll thread.
