Kafka
Structured Streaming
Consumer Group
group.id
Data Source Configuration

How to set group.id for consumer group in kafka data source in Structured Streaming?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

When Spark Structured Streaming reads from Kafka, it does not behave exactly like a hand-written Kafka consumer. That is why setting a consumer group id can be confusing. In practice, you usually should not try to force a normal Kafka group.id into the source configuration unless you understand how Spark manages offsets and query isolation.

Why This Is Different From a Plain Kafka Consumer

A normal Kafka consumer uses group.id to join a consumer group and coordinate partition assignment and committed offsets. Structured Streaming is different because Spark manages progress through its own checkpointing and source logic.

That means the question is not simply “where do I put group.id,” but “do I actually want Spark queries sharing a Kafka group identity at all?” In many cases, the answer is no.

The Important Option Name

For Spark’s Kafka source, the Kafka-style configuration keys are usually prefixed with kafka.. If you want to supply a custom Kafka group id, the option to look for is kafka.group.id, not bare group.id.

A minimal Scala example looks like this:

scala
1val df = spark.readStream
2  .format("kafka")
3  .option("kafka.bootstrap.servers", "localhost:9092")
4  .option("subscribe", "events")
5  .option("kafka.group.id", "reporting-stream")
6  .load()

That is the mechanical answer, but it is only the beginning.

Why You Usually Should Not Force It

By default, Structured Streaming generates its own query-specific consumer group identity. That is useful because:

  • each streaming query stays isolated
  • separate queries do not steal partitions from each other
  • Spark checkpointing remains the main source of truth for progress

If two streaming queries use the same explicit Kafka group id, they can interfere with each other. One query may rebalance partitions away from the other, which is usually not what you want.

So the safe rule is:

  • if you do not have a strong reason, let Spark manage the group identity
  • if you do set kafka.group.id, make it unique per query

Checkpointing Still Matters More

Even when a custom Kafka group id is provided, Structured Streaming still relies heavily on its checkpoint directory to track query state and progress.

scala
1val query = df.writeStream
2  .format("console")
3  .option("checkpointLocation", "/tmp/checkpoints/reporting-stream")
4  .start()

If you restart the query without the same checkpoint location, you should not expect consumer-group behavior alone to reproduce the exact same semantics as a plain Kafka consumer.

That is the core conceptual mistake many teams make: they expect Kafka consumer-group offset commits to be the main recovery mechanism, while Structured Streaming is designed around query checkpointing.

When a Custom Group Id Can Be Useful

There are limited cases where a custom kafka.group.id makes sense:

  • brokers enforce access controls based on consumer groups
  • you need consistent client identification for operations or auditing
  • you are integrating with existing cluster policies that require a known group name

Even then, avoid sharing that group id across multiple concurrent streaming queries unless you want them to behave like competing members of one Kafka group.

A Small PySpark Example

python
1stream_df = (spark.readStream
2    .format("kafka")
3    .option("kafka.bootstrap.servers", "localhost:9092")
4    .option("subscribe", "events")
5    .option("kafka.group.id", "audit-stream")
6    .load())
7
8query = (stream_df.writeStream
9    .format("console")
10    .option("checkpointLocation", "/tmp/checkpoints/audit-stream")
11    .start())

The important thing is not just setting the option, but pairing it with a stable checkpoint strategy.

Common Pitfalls

  • Using bare group.id instead of kafka.group.id in the Spark source options.
  • Assuming Structured Streaming uses Kafka group offsets the same way a plain Kafka consumer does.
  • Reusing the same custom group id across multiple concurrent queries.
  • Forgetting that the checkpoint location is central to query recovery.
  • Forcing a custom group id when no real operational requirement exists.

Summary

  • In Structured Streaming, the relevant option is kafka.group.id, not bare group.id.
  • Spark usually works best when it manages a unique consumer-group identity per query.
  • Checkpointing is the primary recovery mechanism, not ordinary Kafka consumer-group commits alone.
  • Use a custom group id only when you have a concrete requirement.
  • If you do set one, keep it unique per streaming query to avoid interference.

Course illustration
Course illustration

All Rights Reserved.