Spark - Get earliest and latest offset of Kafka without opening stream
Introduction
If all you need is Kafka offset metadata, opening a Spark streaming query is unnecessary overhead. The cleaner solution is to query Kafka directly for beginning and end offsets, because this is really a Kafka metadata problem, not a Spark processing problem.
What Earliest and Latest Offsets Mean
For each partition:
- earliest offset tells you the first retained record position
- latest offset tells you the log end offset, which is the next position to be written
Together they tell you the range of retained data. They do not tell you a consumer group's current progress by themselves.
That distinction matters because people often mix up:
- raw earliest/latest offsets
- consumer lag
- committed offsets
They are related, but not the same question.
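To make the two numbers concrete, here is a small Python sketch using made-up offsets for a single partition. The function name `retained_range` is hypothetical, and the span it reports is only an upper bound on the record count for compacted or transactional topics, where some offsets hold no readable record.

```python
def retained_range(earliest: int, latest: int):
    """Return (first_offset, last_possible_offset, span) for one partition.

    earliest: first retained offset
    latest:   log end offset, i.e. the next position to be written
    span:     number of retained offsets (latest - earliest)
    """
    span = latest - earliest
    # latest itself is not readable yet; the highest readable offset is latest - 1.
    last_possible = latest - 1 if span > 0 else None
    return earliest, last_possible, span


# Hypothetical partition: earliest=5, latest=8 means offsets 5, 6, 7 are retained.
print(retained_range(5, 8))  # (5, 7, 3)
# Empty partition: earliest equals latest, so there is nothing to read.
print(retained_range(8, 8))  # (8, None, 0)
```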
Query Kafka Directly with AdminClient
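The practical answer is to use Kafka's own client API, even if the surrounding application also uses Spark. In the Java client, AdminClient.listOffsets takes an OffsetSpec (OffsetSpec.earliest() or OffsetSpec.latest()) for each partition and returns the matching offset.
A direct metadata call like this gives you the offsets without creating a Spark stream, reading payloads, or managing a checkpointed query.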
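As one possible illustration in Python, the sketch below assumes the kafka-python package and uses its KafkaConsumer metadata calls (`partitions_for_topic`, `beginning_offsets`, `end_offsets`) rather than the Java AdminClient. The function names, the broker address, and the topic name are placeholders, not values from this article.

```python
def fetch_offset_bounds(consumer, topic):
    """Return {partition: (earliest, latest)} for every partition of `topic`.

    `consumer` is expected to behave like kafka-python's KafkaConsumer.
    """
    # Imported lazily so this module still loads where kafka-python is absent.
    from kafka import TopicPartition

    partition_ids = consumer.partitions_for_topic(topic)
    if not partition_ids:
        raise ValueError(f"topic {topic!r} not found or has no partitions")

    tps = [TopicPartition(topic, p) for p in sorted(partition_ids)]
    earliest = consumer.beginning_offsets(tps)  # first retained offset
    latest = consumer.end_offsets(tps)          # next offset to be written
    return {tp.partition: (earliest[tp], latest[tp]) for tp in tps}


def print_offset_bounds(bootstrap_servers, topic):
    """Connect, print one line per partition, and always close the client."""
    from kafka import KafkaConsumer

    # No group_id and no subscribe(): the consumer is used for metadata only.
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers,
                             enable_auto_commit=False)
    try:
        for partition, (lo, hi) in fetch_offset_bounds(consumer, topic).items():
            print(f"partition={partition} earliest={lo} latest={hi}")
    finally:
        consumer.close()
```

Calling `print_offset_bounds("localhost:9092", "events")` against a reachable broker would print the bounds for each partition. No records are polled, so no payloads are read and no offsets are committed.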
Why Not Open a Spark Stream
Spark Structured Streaming is built for continuous processing. Even if you only want offset boundaries, starting a stream introduces extra concerns:
- query lifecycle
- trigger behavior
- checkpointing choices
- streaming execution overhead
That is a lot of machinery for a question Kafka can answer with lightweight metadata calls.
Partition Awareness Is Essential
Offsets are tracked per partition, not just per topic. If your topic has multiple partitions, you need to inspect each one.
First list the topic's partitions, then fetch earliest and latest for each partition and aggregate if needed.
This matters because a topic-level summary like "latest offset is X" is incomplete without partition context.
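The aggregation step could look like the following sketch. It assumes the per-partition bounds are already available as a plain dictionary; the `summarize` helper and the sample numbers are hypothetical.

```python
def summarize(bounds):
    """Turn {partition: (earliest, latest)} into a topic-level summary."""
    spans = {p: hi - lo for p, (lo, hi) in bounds.items()}
    return {
        "partitions": len(bounds),
        "total_retained_offsets": sum(spans.values()),
        "empty_partitions": sorted(p for p, s in spans.items() if s == 0),
        # The largest log end offset across partitions; None for no partitions.
        "max_latest": max((hi for _, hi in bounds.values()), default=None),
    }


# Made-up bounds for a three-partition topic.
bounds = {0: (0, 120), 1: (40, 40), 2: (15, 90)}
summary = summarize(bounds)
print(summary["total_retained_offsets"])  # 195
print(summary["empty_partitions"])        # [1]
print(summary["max_latest"])              # 120
```

Here the statement "latest offset is 120" is true only for partition 0. Partition 1 ends at 40 and is empty, which a single topic-wide number would hide.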
If You Already Use Spark for Downstream Processing
Spark still belongs in the architecture when you actually need to read and process the data. But for a lightweight offset-boundary check, Kafka's own admin or consumer APIs are the correct tool.
That separation keeps the code honest:
- Kafka clients for metadata inspection
- Spark for distributed computation
Using the right boundary simplifies both operations and debugging.
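When the data does need processing later, the queried bounds can be handed to Spark rather than rediscovered by it. Spark's Kafka source accepts per-partition offsets as a JSON string in its startingOffsets and endingOffsets options. The sketch below only builds those strings; the helper name, topic name, and offsets are illustrative assumptions.

```python
import json


def spark_offsets_json(topic, offsets):
    """Build the JSON shape used by startingOffsets / endingOffsets:
    {"<topic>": {"<partition>": <offset>, ...}}
    """
    per_partition = {str(p): int(o) for p, o in sorted(offsets.items())}
    return json.dumps({topic: per_partition})


# Hypothetical bounds, e.g. as returned by a direct Kafka metadata query.
bounds = {0: (0, 5), 1: (10, 15)}
starting = spark_offsets_json("events", {p: lo for p, (lo, _) in bounds.items()})
ending = spark_offsets_json("events", {p: hi for p, (_, hi) in bounds.items()})
print(starting)  # {"events": {"0": 0, "1": 10}}
print(ending)    # {"events": {"0": 5, "1": 15}}
```

A bounded batch read would then pass these strings as `.option("startingOffsets", starting)` and `.option("endingOffsets", ending)` on `spark.read.format("kafka")`, with no streaming query involved.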
Related Question: Consumer Lag
If you later want lag, earliest and latest offsets are still not enough. Lag usually involves:
- latest offset
- committed consumer-group offset
So do not confuse "what range of data exists" with "how far behind is a specific consumer group."
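Lag is usually computed per partition as the latest offset minus the group's committed offset. A minimal sketch, assuming both sets of offsets have already been fetched into dictionaries (the `consumer_lag` helper and the numbers are hypothetical):

```python
def consumer_lag(latest, committed):
    """Per-partition lag = latest offset - committed offset.

    latest:    {partition: log end offset}
    committed: {partition: committed offset, or None if the group never committed}
    Partitions without a commit map to None instead of a guessed number.
    """
    lag = {}
    for partition, end in latest.items():
        offset = committed.get(partition)
        # Clamp at zero in case the two values were read at slightly different times.
        lag[partition] = None if offset is None else max(end - offset, 0)
    return lag


latest = {0: 120, 1: 40, 2: 90}
committed = {0: 100, 1: 40, 2: None}
print(consumer_lag(latest, committed))  # {0: 20, 1: 0, 2: None}
```

Note that the earliest offset does not appear in the calculation at all, which is why offset bounds alone cannot answer the lag question.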
Common Pitfalls
Starting a Spark stream just to ask Kafka for metadata adds unnecessary complexity.
Treating offsets as topic-wide single numbers instead of partition-specific values leads to incomplete or misleading conclusions.
Forgetting that latest offset is the next write position, not the last consumed record, causes off-by-one misunderstandings.
Ignoring cluster security configuration such as SASL or TLS will make even correct metadata code fail in production.
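For the security pitfall, the metadata client needs the same connection settings as any other Kafka client. The sketch below assumes kafka-python's keyword arguments and a cluster using SASL/PLAIN over TLS; the environment variable names are invented for illustration, and the right mechanism depends on how your cluster is configured.

```python
import os


def secure_client_config(bootstrap_servers):
    """Keyword arguments for a kafka-python client on a SASL_SSL cluster."""
    return {
        "bootstrap_servers": bootstrap_servers,
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "PLAIN",
        # Hypothetical variable names; read credentials from wherever you keep them.
        "sasl_plain_username": os.environ.get("KAFKA_USERNAME", ""),
        "sasl_plain_password": os.environ.get("KAFKA_PASSWORD", ""),
        # Path to the CA bundle; None falls back to the library default.
        "ssl_cafile": os.environ.get("KAFKA_CA_FILE"),
    }
```

These arguments would be passed as `KafkaConsumer(**secure_client_config("broker:9093"))` before making any offset queries.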
Summary
- Earliest and latest offsets are Kafka metadata and are best queried through Kafka's own client APIs.
- AdminClient is a clean way to fetch them directly.
- Query offsets per partition, then aggregate if you need a topic-level picture.
- Avoid opening a Spark stream when you only need metadata.
- Keep Spark for processing and Kafka clients for lightweight offset inspection.

