Java: How to Get the Number of Messages in a Topic in Apache Kafka
Introduction
Kafka does not expose a single built-in field called "message count" for a topic in the way a traditional queue might. In Java, the usual approach is to inspect offsets for each partition and compute an approximate retained-record count, but that number only answers a specific question and should be explained carefully.
Decide What Count You Actually Mean
Before writing code, decide what you want to measure. Different teams use the phrase "number of messages in a topic" to mean very different things:
- all records ever produced since the topic was created
- records currently retained on disk
- records not yet consumed by one consumer group
- records visible under read_committed
Kafka offsets help most with the second interpretation: how many records are currently retained in each partition. For a non-compacted topic, a practical estimate is:
latest offset - earliest offset
If you sum that value across all partitions, you get an approximate count of retained records. That is often good enough for diagnostics, dashboards, and rough capacity checks.
Offset-Based Approximation with AdminClient
The Java AdminClient API can describe the topic, list its partitions, and fetch the earliest and latest offsets for each partition. The following example prints a retained-record estimate for one topic.
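A minimal sketch of this approach, not a definitive implementation. It assumes the kafka-clients library (3.1 or newer, for allTopicNames()) is on the classpath. The broker address localhost:9092, the topic name orders, and the class name TopicRecordEstimate are placeholders chosen for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartition;

public class TopicRecordEstimate {
    public static void main(String[] args) throws Exception {
        String bootstrapServers = "localhost:9092"; // assumption: local broker
        String topic = "orders";                    // assumption: placeholder topic name

        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);

        try (AdminClient admin = AdminClient.create(props)) {
            // 1. Read topic metadata to discover every partition.
            TopicDescription description = admin
                    .describeTopics(Collections.singletonList(topic))
                    .allTopicNames()
                    .get()
                    .get(topic);

            Map<TopicPartition, OffsetSpec> earliestRequest = new HashMap<>();
            Map<TopicPartition, OffsetSpec> latestRequest = new HashMap<>();
            description.partitions().forEach(info -> {
                TopicPartition tp = new TopicPartition(topic, info.partition());
                earliestRequest.put(tp, OffsetSpec.earliest());
                latestRequest.put(tp, OffsetSpec.latest());
            });

            // 2. Request the earliest and latest offsets for each partition.
            Map<TopicPartition, ListOffsetsResultInfo> earliest =
                    admin.listOffsets(earliestRequest).all().get();
            Map<TopicPartition, ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestRequest).all().get();

            // 3. Sum the offset differences.
            long total = 0;
            for (TopicPartition tp : latest.keySet()) {
                long retained = latest.get(tp).offset() - earliest.get(tp).offset();
                System.out.printf("partition %d: ~%d records%n", tp.partition(), retained);
                total += retained;
            }
            System.out.printf("topic %s: ~%d retained records%n", topic, total);
        }
    }
}
```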
The code does three things:
- Reads topic metadata to discover every partition.
- Requests the earliest and latest offsets for each partition.
- Sums the offset differences.
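If one partition has earliest offset 100 and latest offset 140, Kafka currently retains about 40 records in that partition: offsets 100 through 139, because the latest offset is the position of the next record to be written. Adding all partition totals gives a topic-level estimate.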
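The arithmetic can be checked without a broker. The sketch below uses plain Java maps with made-up offsets for three partitions. The class name RetainedEstimate and the sample values are illustrative only; in a real program the maps would be filled from AdminClient results.

```java
import java.util.Map;

class RetainedEstimate {
    // Sums (latest - earliest) over every partition that appears in both maps.
    static long estimate(Map<Integer, Long> earliest, Map<Integer, Long> latest) {
        long total = 0;
        for (Map.Entry<Integer, Long> entry : latest.entrySet()) {
            Long start = earliest.get(entry.getKey());
            if (start == null) {
                continue; // partition missing from the earliest-offset snapshot
            }
            // Guard against a negative value if the two snapshots were taken at different times.
            total += Math.max(0L, entry.getValue() - start);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical offsets: partition 2 is empty (earliest == latest).
        Map<Integer, Long> earliest = Map.of(0, 100L, 1, 0L, 2, 250L);
        Map<Integer, Long> latest = Map.of(0, 140L, 1, 25L, 2, 250L);
        System.out.println(estimate(earliest, latest)); // prints 65 (40 + 25 + 0)
    }
}
```

Partition 0 contributes the 40 records from the example above, and the empty partition contributes nothing.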
What the Number Means
This offset-based total is useful, but it is not a universal truth about the topic.
On a normal topic with time-based or size-based retention, the estimate describes records still retained in the log at the moment you ask. It does not tell you how many records have been produced over the full lifetime of the topic, because older data may already have expired.
Compacted topics need even more caution. Log compaction can remove older records with the same key while keeping newer ones, so offset gaps do not map cleanly to currently materialized key-value entries. You are still looking at offset movement in the log, not a clean count of unique logical records.
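A toy model can make the gap concrete. The sketch below does not talk to Kafka and greatly simplifies real compaction, which works segment by segment and leaves the active segment alone. It only keeps the highest offset per key, then compares the offset span with the number of surviving records. The class name CompactionToy and the key sequence are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CompactionToy {
    // keysByOffset.get(i) is the key of the record written at offset i.
    // Returns the offsets that remain if only the newest record per key is kept.
    static List<Long> survivingOffsets(List<String> keysByOffset) {
        Map<String, Long> lastOffsetForKey = new HashMap<>();
        for (int offset = 0; offset < keysByOffset.size(); offset++) {
            lastOffsetForKey.put(keysByOffset.get(offset), (long) offset);
        }
        List<Long> offsets = new ArrayList<>(lastOffsetForKey.values());
        Collections.sort(offsets);
        return offsets;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "c", "a", "a", "a");
        List<Long> kept = survivingOffsets(keys);

        long earliest = kept.get(0);  // first offset still present
        long latest = keys.size();    // next offset to be written
        System.out.println("offset span: " + (latest - earliest)); // prints 5
        System.out.println("surviving records: " + kept.size());   // prints 3
    }
}
```

Here the offset difference reports 5 while only 3 records (offsets 1, 2, and 5) are still in the log.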
Transactional workloads also matter. If your question is about records visible to consumers using read_committed, raw latest offsets may not match what an application can actually read at a given moment. In that case, the business metric you want may need a consumer-based view instead of pure topic metadata.
When to Use a Different Metric
If you really want consumer lag, compare the consumer group's committed offsets to the topic's end offsets. If you want total processed events for analytics or billing, Kafka itself is usually the wrong source of truth. That kind of number is better written to a durable metrics store or counted in your stream-processing pipeline.
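The lag comparison itself is simple subtraction per partition. The following sketch works on plain maps so it runs without a cluster. In practice the committed offsets could come from AdminClient.listConsumerGroupOffsets and the end offsets from AdminClient.listOffsets. The class name LagSketch, the sample values, and the choice to skip partitions without a committed offset are assumptions made for this example.

```java
import java.util.Map;
import java.util.TreeMap;

class LagSketch {
    // Lag for a partition = end offset - committed offset of the consumer group.
    static Map<Integer, Long> lagPerPartition(Map<Integer, Long> committed,
                                              Map<Integer, Long> endOffsets) {
        Map<Integer, Long> lag = new TreeMap<>();
        for (Map.Entry<Integer, Long> entry : endOffsets.entrySet()) {
            Long position = committed.get(entry.getKey());
            if (position == null) {
                // Assumption: with no committed offset, where the group would start
                // depends on auto.offset.reset, so this sketch reports no lag value.
                continue;
            }
            lag.put(entry.getKey(), Math.max(0L, entry.getValue() - position));
        }
        return lag;
    }

    public static void main(String[] args) {
        // Hypothetical snapshot: the group has never committed on partition 2.
        Map<Integer, Long> committed = Map.of(0, 120L, 1, 25L);
        Map<Integer, Long> endOffsets = Map.of(0, 140L, 1, 25L, 2, 10L);

        Map<Integer, Long> lag = lagPerPartition(committed, endOffsets);
        System.out.println(lag); // prints {0=20, 1=0}
        long total = lag.values().stream().mapToLong(Long::longValue).sum();
        System.out.println(total); // prints 20
    }
}
```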
A good rule is:
- use offset differences for rough retained-message estimates
- use consumer lag APIs for backlog
- use application metrics for exact business counts
That keeps the operational meaning of the number clear.
Common Pitfalls
- Assuming latest - earliest means total lifetime messages produced. Retention can remove old data.
- Treating the count as exact on compacted topics. Compaction changes the relationship between offsets and logical records.
- Using topic-level counts as a business metric for billing or reporting. Kafka metadata is usually too low-level for that purpose.
- Forgetting that the topic may have multiple partitions. A single-partition calculation is incomplete for real deployments.
- Ignoring visibility semantics such as transactions and committed reads when your consumers depend on them.
Summary
- Kafka does not provide one universal "message count" field for a topic.
- In Java, a common approach is summing latest offset - earliest offset across partitions.
- That number usually estimates currently retained records, not lifetime produced records.
- Retention, compaction, and transactional semantics can change what the count means.
- For exact backlog or business metrics, use consumer offsets or application-level counters instead of raw topic metadata.

