Spark Structured Streaming
Kafka Offsets
Group.id
Data Processing
Commit Offsets

How to manually set group.id and commit kafka offsets in spark structured streaming?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Spark Structured Streaming does not treat Kafka offsets the same way a plain Kafka consumer does. In most cases, you should not manually commit offsets back to Kafka at all. Structured Streaming tracks progress through its checkpoint metadata, and that checkpoint is the mechanism that gives you restart safety.

How Structured Streaming Really Tracks Offsets

When you read from Kafka with Structured Streaming, Spark acts as the streaming engine, not as a hand-managed consumer loop. It records source progress in the query checkpoint location.

That means the usual recovery story is:

  • Spark reads records from Kafka
  • Spark processes a micro-batch
  • Spark stores progress in the checkpoint
  • on restart, Spark resumes from the checkpointed offsets

This is why manual commitSync() logic is usually the wrong mental model for Structured Streaming.

The Normal Way to Configure a Kafka Source

A standard Kafka source uses topic subscription, bootstrap servers, and a checkpoint location on the write side.

scala
1import org.apache.spark.sql.SparkSession
2
3val spark = SparkSession.builder()
4  .appName("KafkaStructuredStreaming")
5  .getOrCreate()
6
7val kafkaDf = spark.readStream
8  .format("kafka")
9  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
10  .option("subscribe", "orders")
11  .option("startingOffsets", "earliest")
12  .load()
13
14val query = kafkaDf.writeStream
15  .format("console")
16  .option("checkpointLocation", "/tmp/checkpoints/orders-stream")
17  .start()

The checkpoint location is the important part. If you delete it, Spark loses its notion of progress and may reread data depending on your source settings.

What About group.id

In Structured Streaming, the consumer-group story is different from ordinary Kafka application code. Spark can derive its own consumer-group identity, and newer Spark Kafka integration options expose groupIdPrefix plus a kafka.group.id escape hatch for special cases.

You should reach for an explicit group id only when you have a strong reason, such as Kafka ACL rules that require a particular group identity. Even then, it needs caution because overlapping queries with the same group id can interfere with each other.

scala
1val kafkaDf = spark.readStream
2  .format("kafka")
3  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
4  .option("subscribe", "orders")
5  .option("groupIdPrefix", "orders-job")
6  .load()

For most applications, checkpointing matters far more than forcing a custom Kafka group id.

Why Manual Offset Commits Are Usually the Wrong Tool

Structured Streaming does not expect you to call the Kafka consumer API and commit offsets yourself after foreachBatch. Doing that creates two separate progress systems:

  • Spark checkpoint offsets
  • Kafka consumer-group offsets

Those systems can drift apart, which makes recovery logic harder instead of safer. If your job restarts from the Spark checkpoint but your manual Kafka commits point somewhere else, the result is confusion rather than control.

If you need "process then write exactly once" behavior, design around Spark sinks, idempotent writes, deduplication keys, and stable checkpoint storage rather than external manual commits.

What You Can Control Safely

You still have useful levers:

  • 'startingOffsets controls where the very first run begins'
  • checkpoint location controls restart progress
  • trigger configuration controls batch timing
  • sink design controls end-to-end correctness

Those are the knobs that Structured Streaming actually uses reliably.

Common Pitfalls

The biggest mistake is treating Structured Streaming like a plain Kafka consumer loop. It is not. Trying to bolt commitSync() onto foreachBatch usually makes the system less predictable.

Another mistake is setting a fixed group.id or kafka.group.id for multiple concurrent streaming queries. That can cause consumer-group collisions and hard-to-debug rebalancing behavior.

Be careful with checkpoint cleanup. Deleting or moving the checkpoint directory changes the recovery contract and can make Spark reread old data.

Finally, do not rely on enable.auto.commit semantics here. Structured Streaming offset progress is a Spark concern first, not a normal Kafka auto-commit workflow.

Summary

  • Structured Streaming normally manages Kafka offsets through checkpointing, not manual Kafka commits.
  • Use a stable checkpointLocation if you care about restart safety.
  • Set a custom group identity only when you have a specific operational reason.
  • Avoid mixing Spark checkpoint progress with hand-written Kafka commitSync() logic.
  • Think in terms of Spark query recovery, not plain consumer-group offset management.

Course illustration
Course illustration

All Rights Reserved.