How to set group.id for consumer group in kafka data source in Structured Streaming?

Kafka

Structured Streaming

Consumer Group

group.id

Data Source Configuration

How to set group.id for consumer group in kafka data source in Structured Streaming?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

When Spark Structured Streaming reads from Kafka, it does not behave exactly like a hand-written Kafka consumer. That is why setting a consumer group id can be confusing. In practice, you usually should not try to force a normal Kafka group.id into the source configuration unless you understand how Spark manages offsets and query isolation.

Why This Is Different From a Plain Kafka Consumer

A normal Kafka consumer uses group.id to join a consumer group and coordinate partition assignment and committed offsets. Structured Streaming is different because Spark manages progress through its own checkpointing and source logic.

That means the question is not simply “where do I put group.id,” but “do I actually want Spark queries sharing a Kafka group identity at all?” In many cases, the answer is no.

The Important Option Name

For Spark’s Kafka source, the Kafka-style configuration keys are usually prefixed with kafka.. If you want to supply a custom Kafka group id, the option to look for is kafka.group.id, not bare group.id.

A minimal Scala example looks like this:

scala

1val df = spark.readStream
2  .format("kafka")
3  .option("kafka.bootstrap.servers", "localhost:9092")
4  .option("subscribe", "events")
5  .option("kafka.group.id", "reporting-stream")
6  .load()

That is the mechanical answer, but it is only the beginning.

Why You Usually Should Not Force It

By default, Structured Streaming generates its own query-specific consumer group identity. That is useful because:

each streaming query stays isolated
separate queries do not steal partitions from each other
Spark checkpointing remains the main source of truth for progress

If two streaming queries use the same explicit Kafka group id, they can interfere with each other. One query may rebalance partitions away from the other, which is usually not what you want.

So the safe rule is:

if you do not have a strong reason, let Spark manage the group identity
if you do set kafka.group.id, make it unique per query

Checkpointing Still Matters More

Even when a custom Kafka group id is provided, Structured Streaming still relies heavily on its checkpoint directory to track query state and progress.

scala

1val query = df.writeStream
2  .format("console")
3  .option("checkpointLocation", "/tmp/checkpoints/reporting-stream")
4  .start()

If you restart the query without the same checkpoint location, you should not expect consumer-group behavior alone to reproduce the exact same semantics as a plain Kafka consumer.

That is the core conceptual mistake many teams make: they expect Kafka consumer-group offset commits to be the main recovery mechanism, while Structured Streaming is designed around query checkpointing.

When a Custom Group Id Can Be Useful

There are limited cases where a custom kafka.group.id makes sense:

brokers enforce access controls based on consumer groups
you need consistent client identification for operations or auditing
you are integrating with existing cluster policies that require a known group name

Even then, avoid sharing that group id across multiple concurrent streaming queries unless you want them to behave like competing members of one Kafka group.

A Small PySpark Example

python

1stream_df = (spark.readStream
2    .format("kafka")
3    .option("kafka.bootstrap.servers", "localhost:9092")
4    .option("subscribe", "events")
5    .option("kafka.group.id", "audit-stream")
6    .load())
7
8query = (stream_df.writeStream
9    .format("console")
10    .option("checkpointLocation", "/tmp/checkpoints/audit-stream")
11    .start())

The important thing is not just setting the option, but pairing it with a stable checkpoint strategy.

Common Pitfalls

Using bare group.id instead of kafka.group.id in the Spark source options.
Assuming Structured Streaming uses Kafka group offsets the same way a plain Kafka consumer does.
Reusing the same custom group id across multiple concurrent queries.
Forgetting that the checkpoint location is central to query recovery.
Forcing a custom group id when no real operational requirement exists.

Summary

In Structured Streaming, the relevant option is kafka.group.id, not bare group.id.
Spark usually works best when it manages a unique consumer-group identity per query.
Checkpointing is the primary recovery mechanism, not ordinary Kafka consumer-group commits alone.
Use a custom group id only when you have a concrete requirement.
If you do set one, keep it unique per streaming query to avoid interference.