See size of Kafka Topics in Bytes

Apache Kafka is a distributed streaming platform capable of handling trillions of events a day. One fundamental component of Kafka is the concept of topics, which are categories or feeds to which records are published. Monitoring the size of these topics in bytes is crucial for effective capacity planning, performance tuning, and ensuring the health of the Kafka ecosystem.

Understanding Kafka Topic Storage

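Kafka stores records in topics. Each topic is split into partitions for scalability, and each partition can be replicated across brokers for fault tolerance. A partition is an ordered, immutable sequence of records that is continually appended to, in other words a commit log. On disk, each partition replica is a directory named <topic>-<partition> under one of the broker's log directories (log.dirs), and its records are stored there as a set of segment files.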

Calculating the Size of a Kafka Topic

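The size of a Kafka topic in bytes is essentially the total disk space used by the topic's partitions. This covers every segment still on disk: the active segment that is being written to, plus older closed segments that retention or log compaction has not yet removed. To determine the size of a Kafka topic, sum the sizes of its partition logs. Keep in mind that each partition exists once per replica, so the total disk footprint across the cluster is roughly the size of one copy multiplied by the replication factor.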
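The summation described above can be sketched directly against a broker's file system. The following is a minimal illustration, not an official tool: it assumes it runs on the broker host, that partition directories follow the <topic>-<partition> naming scheme, and that /tmp/kafka-logs (a placeholder) is one of the broker's log.dirs entries. The class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class TopicDiskUsage {

    /** Sums the sizes of all regular files under one directory tree. */
    static long directorySize(Path dir) {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> {
                            try {
                                return Files.size(p);
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        })
                        .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Adds up every "<topic>-<partition>" directory found directly under logDir. */
    static long topicSizeBytes(Path logDir, String topic) throws IOException {
        Pattern partitionDir = Pattern.compile(Pattern.quote(topic) + "-\\d+");
        try (Stream<Path> children = Files.list(logDir)) {
            return children.filter(Files::isDirectory)
                           .filter(p -> partitionDir.matcher(p.getFileName().toString()).matches())
                           .mapToLong(TopicDiskUsage::directorySize)
                           .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder defaults; pass the broker's log directory and the topic name instead.
        Path logDir = Paths.get(args.length > 0 ? args[0] : "/tmp/kafka-logs");
        String topic = args.length > 1 ? args[1] : "my-topic";
        System.out.println(topic + ": " + topicSizeBytes(logDir, topic) + " bytes on this broker");
    }
}
```

This counts index files as well as log segments, so the result may come out somewhat larger than the sizes Kafka itself reports, and it only covers the replicas stored on the broker where it runs. On a live broker, segments can also be deleted while the directory is being walked, which this sketch does not handle.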

Using Kafka Bin Scripts

Kafka ships with a set of command-line scripts in its bin directory for interacting with a cluster. Among these, kafka-log-dirs.sh is particularly useful: it describes the log directories on Kafka brokers, including the size of each partition replica they hold.

To check the size of topic partitions across all brokers, run:

bash
kafka-log-dirs.sh --describe --bootstrap-server localhost:9092 --topic-list my-topic

Replace localhost:9092 with your cluster's broker address and my-topic with your topic name.

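The script prints a JSON document that lists, for each broker and log directory, the partition replicas stored there and their sizes in bytes. Summing all of these sizes gives the total disk space the topic occupies across the cluster, replicas included; summing one replica per partition gives the size of a single copy of the data.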
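As a rough illustration of the aggregation step, the sketch below adds up the sizes for one topic from the script's JSON output. It assumes each partition entry looks like {"partition":"my-topic-0","size":1000,...}; that shape, the sample data, and the class name are assumptions for this example, so check them against the output of your Kafka version. A regular expression keeps the example dependency-free, though a proper JSON parser would be more robust.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogDirsOutputSum {

    // Matches entries such as {"partition":"my-topic-0","size":1000,...} (assumed shape).
    private static final Pattern PARTITION_ENTRY = Pattern.compile(
        "\"partition\"\\s*:\\s*\"([^\"]+)-(\\d+)\"\\s*,\\s*\"size\"\\s*:\\s*(\\d+)");

    /** Sums the "size" of every replica of the given topic found in the JSON text. */
    static long totalReplicaBytes(String json, String topic) {
        long total = 0;
        Matcher m = PARTITION_ENTRY.matcher(json);
        while (m.find()) {
            // group(1) is the topic name, group(2) the partition number, group(3) the size.
            if (m.group(1).equals(topic)) {
                total += Long.parseLong(m.group(3));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hand-written sample in the assumed output shape, not captured from a real cluster.
        String sample =
            "{\"version\":1,\"brokers\":[{\"broker\":0,\"logDirs\":[{\"logDir\":\"/tmp/kafka-logs\","
            + "\"error\":null,\"partitions\":["
            + "{\"partition\":\"my-topic-0\",\"size\":1000,\"offsetLag\":0,\"isFuture\":false},"
            + "{\"partition\":\"my-topic-1\",\"size\":500,\"offsetLag\":0,\"isFuture\":false}]}]}]}";
        System.out.println(totalReplicaBytes(sample, "my-topic") + " bytes");
    }
}
```

Because every replica on every broker is counted, the result is the replicated footprint rather than the size of a single copy of the topic.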

Programmatically Using AdminClient API

For a more dynamic approach, you could use Kafka's AdminClient API. Here is an example in Java to fetch the sizes:

java
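import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.LogDirDescription;

// Fragment: place inside a method that handles InterruptedException and ExecutionException.
Properties properties = new Properties();
properties.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
try (AdminClient adminClient = AdminClient.create(properties)) {
    // Queries broker ID 0 only. allDescriptions() needs kafka-clients 2.7 or newer;
    // older clients expose the same data through the now-deprecated all().
    Map<Integer, Map<String, LogDirDescription>> logDirs =
        adminClient.describeLogDirs(Collections.singletonList(0)).allDescriptions().get();
    logDirs.forEach((broker, dirs) ->
        dirs.forEach((dir, description) ->
            description.replicaInfos().forEach((tp, replica) -> {
                if (tp.topic().equals("my-topic")) {
                    System.out.println("Partition " + tp.partition() + " Size: " + replica.size());
                }
            })));
}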

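In this example, my-topic is the name of your Kafka topic. The code prints the size in bytes of each replica of that topic hosted on broker 0, the only broker ID passed to describeLogDirs. To cover the whole topic, pass the IDs of all brokers that host its partitions.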
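Once replica sizes have been collected from several brokers, they still need to be combined into a single figure per topic. The sketch below shows one way to do that with plain collections. ReplicaSize is a hypothetical holder class, not part of the Kafka API, and treating the largest replica of each partition as the size of "one copy" is an assumption that stands in for looking up the leader replica.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopicSizeAggregator {

    /** Hypothetical holder for one replica's size, e.g. filled from describeLogDirs results. */
    static final class ReplicaSize {
        final int broker;
        final int partition;
        final long bytes;

        ReplicaSize(int broker, int partition, long bytes) {
            this.broker = broker;
            this.partition = partition;
            this.bytes = bytes;
        }
    }

    /** Disk footprint across the cluster: every replica on every broker counts. */
    static long totalFootprintBytes(List<ReplicaSize> replicas) {
        return replicas.stream().mapToLong(r -> r.bytes).sum();
    }

    /** Size of one copy of the topic: the largest replica of each partition, summed. */
    static long singleCopyBytes(List<ReplicaSize> replicas) {
        Map<Integer, Long> largestPerPartition = new HashMap<>();
        for (ReplicaSize r : replicas) {
            largestPerPartition.merge(r.partition, r.bytes, Math::max);
        }
        return largestPerPartition.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Made-up numbers: two partitions, each replicated on brokers 0 and 1.
        List<ReplicaSize> replicas = Arrays.asList(
            new ReplicaSize(0, 0, 1000L), new ReplicaSize(1, 0, 990L),
            new ReplicaSize(0, 1, 500L), new ReplicaSize(1, 1, 500L));
        System.out.println("All replicas: " + totalFootprintBytes(replicas) + " bytes");
        System.out.println("One copy: " + singleCopyBytes(replicas) + " bytes");
    }
}
```

The first figure is the one that matters for disk capacity planning; the second is closer to the amount of data a consumer reading the whole topic would see.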

Key Metrics to Monitor

| Metric | Description | Relevance |
| --- | --- | --- |
| Topic Size | Total size of the topic across all partitions. | Useful for monitoring the disk space utilization. |
| Partition Count | Number of partitions in the topic. | Important for load balancing and parallelism. |
| Replication Factor | Number of replicas per partition. | Critical for fault tolerance and durability. |
| Broker Storage Utilization | Percentage of storage used in each broker. | Prevents brokers from getting overwhelmed. |
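Broker storage utilization is usually collected by a monitoring agent, but the underlying calculation is simple. The following sketch, meant to run on a broker host, reads the capacity of the volume that holds a log directory using the standard java.nio.file API. The class name and the default path are placeholders chosen for this example.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BrokerDiskUtilization {

    /** Percentage of a volume in use, given its total and still-usable bytes. */
    static double usedPercent(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return 0.0; // avoid dividing by zero on volumes that report no capacity
        }
        return 100.0 * (totalBytes - usableBytes) / totalBytes;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory; pass the broker's log directory instead.
        Path logDir = Paths.get(args.length > 0 ? args[0] : ".");
        FileStore store = Files.getFileStore(logDir);
        System.out.printf("%s is %.1f%% full%n",
            store.name(), usedPercent(store.getTotalSpace(), store.getUsableSpace()));
    }
}
```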

Additional Considerations

  • Retention Policies: Retention settings affect how long data is stored and thus the size of a topic. Ensure that your retention settings align with your storage capacity.
  • Compaction: If a topic is configured to use log compaction, Kafka periodically removes older records and retains at least the latest value for each key. Because compaction runs in the background, the size on disk can sit well above the fully compacted size at any given moment, especially for topics with frequent updates to a small set of keys.
  • Performance Impacts: Larger topics take longer to move or restore, for example during partition reassignment or when a replica recovers after a broker failure. Monitoring and managing topic size helps keep these operations fast.
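To check whether retention settings fit the available storage, a back-of-the-envelope estimate is often enough. The sketch below multiplies an assumed produce rate by the retention period and the replication factor. The 1 MB/s rate and the replication factor of 3 are illustrative values, and the class name is made up for this example; the 7-day retention matches Kafka's default retention.ms.

```java
public class RetentionEstimate {

    /**
     * Rough steady-state size: bytes written per second x retention period x replicas.
     * Ignores compression, index files, and the active segment that retention
     * cannot delete yet, and does not apply to compacted topics.
     */
    static long estimatedBytes(long bytesPerSecond, long retentionMs, int replicationFactor) {
        long retentionSeconds = retentionMs / 1000;
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(
            Math.multiplyExact(bytesPerSecond, retentionSeconds), (long) replicationFactor);
    }

    public static void main(String[] args) {
        long bytesPerSecond = 1_000_000L;             // assumed average produce rate (1 MB/s)
        long retentionMs = 7L * 24 * 60 * 60 * 1000;  // 7 days
        int replicationFactor = 3;                    // assumed
        System.out.println(
            estimatedBytes(bytesPerSecond, retentionMs, replicationFactor) + " bytes");
    }
}
```

Comparing this estimate with the measured topic size is a quick way to spot a topic whose retention settings are not taking effect as expected.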

In summary, understanding and monitoring the size of Kafka topics is essential for maintaining a robust and efficient streaming platform. By employing tools and APIs provided by Kafka, alongside careful consideration of topic configuration and system architecture, administrators can effectively manage their data’s footprint in a Kafka cluster.