Spark - Get earliest and latest offset of Kafka without opening stream

Introduction

If all you need is Kafka offset metadata, opening a Spark streaming query is unnecessary overhead. The cleaner solution is to query Kafka directly for beginning and end offsets, because this is really a Kafka metadata problem, not a Spark processing problem.

What Earliest and Latest Offsets Mean

For each partition:

earliest offset tells you the first retained record position
latest offset tells you the log end offset, which is the next position to be written

Together they tell you the range of retained data. They do not tell you a consumer group's current progress by themselves.

That distinction matters because people often mix up:

raw earliest/latest offsets
consumer lag
committed offsets

They are related, but each answers a different question.
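As a small illustration of how the two boundaries combine for a single partition, here is a minimal sketch. The `PartitionRange` class and its member names are hypothetical, introduced only for this example. The width it computes is an upper bound on the number of records, since offsets are not guaranteed to be dense (log compaction and transaction markers can leave gaps).

```scala
// Hypothetical helper for illustration; not part of any Kafka API.
final case class PartitionRange(earliest: Long, latest: Long) {
  // Number of offsets in the retained range [earliest, latest).
  // An upper bound on record count, because offsets may have gaps.
  def width: Long = latest - earliest

  // Highest offset already assigned in the range, if the range is non-empty.
  // `latest` itself is one past it: the next position to be written.
  def lastAssignedOffset: Option[Long] =
    if (latest > earliest) Some(latest - 1) else None
}

object RangeDemo extends App {
  val range = PartitionRange(earliest = 100L, latest = 250L)
  println(s"width=${range.width} lastAssigned=${range.lastAssignedOffset}")
}
```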

Query Kafka Directly with `AdminClient`

The practical answer is to use Kafka's own client API, even if the surrounding application also uses Spark.

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, OffsetSpec}
import org.apache.kafka.common.TopicPartition

object OffsetLookup extends App {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val admin = AdminClient.create(props)
  try {
    val tp = new TopicPartition("orders", 0)

    // First retained offset in the partition.
    val earliest = admin
      .listOffsets(Map(tp -> OffsetSpec.earliest()).asJava)
      .partitionResult(tp)
      .get()
      .offset()

    // Log end offset: the position the next record will be written to.
    val latest = admin
      .listOffsets(Map(tp -> OffsetSpec.latest()).asJava)
      .partitionResult(tp)
      .get()
      .offset()

    println(s"earliest=$earliest latest=$latest")
  } finally {
    // Release network resources even if a lookup fails.
    admin.close()
  }
}
```

This gives you the offsets directly without creating a Spark stream, reading payloads, or managing a checkpointed query.
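The same boundaries are also exposed by the consumer API, which can be useful when the client library predates `AdminClient.listOffsets`. The following is a sketch under the same assumptions as the example above (a broker at `localhost:9092` and a topic named `orders`). The object name `ConsumerOffsetLookup` is made up for illustration. The consumer never subscribes or polls, so no record payloads are fetched.

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.ByteArrayDeserializer

object ConsumerOffsetLookup extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  // Deserializers are required by the constructor even though nothing is read.
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[ByteArrayDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[ByteArrayDeserializer].getName)

  val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
  try {
    val tps = List(new TopicPartition("orders", 0)).asJava

    // Metadata-only lookups: no subscribe(), assign(), or poll() involved.
    val earliest = consumer.beginningOffsets(tps).asScala
    val latest = consumer.endOffsets(tps).asScala

    earliest.foreach { case (tp, offset) =>
      println(s"$tp earliest=$offset latest=${latest(tp)}")
    }
  } finally {
    consumer.close()
  }
}
```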

Why Not Open a Spark Stream

Spark Structured Streaming is built for continuous processing. Even if you only want offset boundaries, starting a stream introduces extra concerns:

query lifecycle
trigger behavior
checkpointing choices
streaming execution overhead

That is a lot of machinery for a question Kafka can answer with lightweight metadata calls.

Partition Awareness Is Essential

Offsets are tracked per partition, not just per topic. If your topic has multiple partitions, you need to inspect each one.

```scala
// Hardcoded for illustration: assumes "orders" has exactly three partitions (0 to 2).
val partitions = (0 until 3).map(p => new TopicPartition("orders", p))
```

Then fetch earliest and latest for each partition and aggregate if needed.

This matters because a topic-level summary like "latest offset is X" is incomplete without partition context.
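Rather than hardcoding the partition count, the partitions can be discovered from the cluster and queried together. This is one possible sketch, not the only way to do it. It assumes a reasonably recent `kafka-clients` version (`allTopicNames()` is available from 3.1; older clients expose `all()` instead), the same placeholder broker address and topic name used earlier, and a made-up object name `TopicOffsetLookup`.

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, OffsetSpec}
import org.apache.kafka.common.TopicPartition

object TopicOffsetLookup extends App {
  val topic = "orders"
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val admin = AdminClient.create(props)
  try {
    // Ask the cluster which partitions the topic has.
    val partitions = admin
      .describeTopics(List(topic).asJava)
      .allTopicNames()
      .get()
      .get(topic)
      .partitions()
      .asScala
      .map(info => new TopicPartition(topic, info.partition()))
      .toList

    // One listOffsets request per boundary covers every partition at once.
    def fetch(spec: OffsetSpec): Map[TopicPartition, Long] =
      admin
        .listOffsets(partitions.map(tp => tp -> spec).toMap.asJava)
        .all()
        .get()
        .asScala
        .map { case (tp, info) => tp -> info.offset() }
        .toMap

    val earliest = fetch(OffsetSpec.earliest())
    val latest = fetch(OffsetSpec.latest())

    partitions.sortBy(_.partition()).foreach { tp =>
      println(s"$tp earliest=${earliest(tp)} latest=${latest(tp)}")
    }

    // Topic-level aggregate: sum of the retained offset ranges of all partitions.
    val total = partitions.map(tp => latest(tp) - earliest(tp)).sum
    println(s"total retained offset range: $total")
  } finally {
    admin.close()
  }
}
```

Keeping the per-partition maps around, and aggregating only at the end, preserves the partition context that a single topic-level number would hide.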

If You Already Use Spark for Downstream Processing

Spark still belongs in the architecture when you actually need to read and process the data. But for a lightweight offset-boundary check, Kafka's own admin or consumer APIs are the correct tool.

That separation keeps the code honest:

Kafka clients for metadata inspection
Spark for distributed computation

Using the right boundary simplifies both operations and debugging.
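If the offsets obtained from Kafka are later handed to a bounded Spark batch read, Spark's Kafka source accepts per-partition offsets as a JSON string in its `startingOffsets` and `endingOffsets` options. The helper below is a hypothetical sketch (the `SparkOffsetJson` object and `render` function are not part of Spark or Kafka) that only builds that string. It has no Spark dependency, and the wiring into a Spark job is left out.

```scala
// Hypothetical helper: renders per-partition offsets in the JSON shape used by
// the startingOffsets / endingOffsets options of Spark's Kafka source.
object SparkOffsetJson {
  // Kafka topic names are restricted to letters, digits, '.', '_' and '-',
  // so no JSON escaping is applied to the topic name here.
  def render(topic: String, offsets: Map[Int, Long]): String = {
    val body = offsets.toSeq
      .sortBy(_._1) // stable, partition-ordered output
      .map { case (partition, offset) => s""""$partition":$offset""" }
      .mkString(",")
    s"""{"$topic":{$body}}"""
  }
}

object SparkOffsetJsonDemo extends App {
  println(SparkOffsetJson.render("orders", Map(1 -> 42L, 0 -> 100L)))
}
```

The resulting string could then be passed through `.option("startingOffsets", ...)` or `.option("endingOffsets", ...)` on a batch `spark.read.format("kafka")` query.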

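Related Question: Consumer Lag

If you later want lag, earliest and latest offsets alone are not enough. Lag is computed per partition as the latest offset minus the consumer group's committed offset, so it needs: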

latest offset
committed consumer-group offset

So do not confuse "what range of data exists" with "how far behind is a specific consumer group."
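To make the difference concrete, here is a minimal sketch of the lag arithmetic. The `LagCalc` object is hypothetical and works on plain maps keyed by partition number. In a real application the committed offsets might come from `AdminClient.listConsumerGroupOffsets` for a given group, which is not shown here.

```scala
// Hypothetical helper for illustration: lag per partition from two offset maps.
object LagCalc {
  // latest:    partition -> log end offset (next position to be written)
  // committed: partition -> committed offset of one consumer group
  // Partitions without a committed offset are left out rather than guessed.
  def lag(latest: Map[Int, Long], committed: Map[Int, Long]): Map[Int, Long] =
    latest.collect {
      case (partition, logEnd) if committed.contains(partition) =>
        // Clamp at zero in case the committed offset is ahead of the log end.
        partition -> math.max(0L, logEnd - committed(partition))
    }
}

object LagDemo extends App {
  val latest = Map(0 -> 250L, 1 -> 80L)
  val committed = Map(0 -> 200L) // the group has never committed on partition 1
  println(LagCalc.lag(latest, committed))
}
```

Note that the earliest offset does not appear in the calculation at all, which is why a boundary lookup by itself cannot answer the lag question.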

Common Pitfalls

Starting a Spark stream just to ask Kafka for metadata adds unnecessary complexity.

Treating offsets as topic-wide single numbers instead of partition-specific values leads to incomplete or misleading conclusions.

Forgetting that the latest offset is the next write position, not the offset of the last written record, causes off-by-one misunderstandings: the most recently assigned offset is latest - 1.

Ignoring cluster security configuration such as SASL or TLS will make even correct metadata code fail in production.
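For the security pitfall, the client properties typically need a few extra entries before the `AdminClient` is created. The sketch below assumes a cluster using SASL/PLAIN over TLS, which is only one of several possible setups. Every hostname, path, and credential in it is a placeholder, and the `SecureAdminProps` object is a hypothetical name.

```scala
import java.util.Properties

object SecureAdminProps {
  // All values below are placeholders (assumptions), not real endpoints or credentials.
  def build(): Properties = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker.example.com:9093")
    // Authenticate with SASL and encrypt the connection with TLS.
    props.put("security.protocol", "SASL_SSL")
    props.put("sasl.mechanism", "PLAIN")
    props.put(
      "sasl.jaas.config",
      "org.apache.kafka.common.security.plain.PlainLoginModule required " +
        "username=\"<user>\" password=\"<password>\";"
    )
    // Truststore holding the CA certificate that signed the broker certificates.
    props.put("ssl.truststore.location", "/path/to/truststore.jks")
    props.put("ssl.truststore.password", "<truststore-password>")
    props
  }
}
```

The resulting `Properties` object would be passed to `AdminClient.create` in place of the plain bootstrap-only configuration.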

Summary

Earliest and latest offsets are Kafka metadata and are best queried through Kafka's own client APIs.
`AdminClient` is a clean way to fetch them directly.
Query offsets per partition, then aggregate if you need a topic-level picture.
Avoid opening a Spark stream when you only need metadata.
Keep Spark for processing and Kafka clients for lightweight offset inspection.
