How much data can a Kafka topic store?

Apache Kafka is a distributed streaming platform known for its high throughput, reliability, and horizontal scalability. It is widely used for building real-time streaming data pipelines and applications that handle large volumes of data. A common question about Kafka concerns its storage capacity: specifically, how much data can a Kafka topic store?

Understanding Kafka Topics and Partitions

A Kafka topic is a category or feed name to which records are published. Topics in Kafka are multi-subscriber; that is, they can be consumed by multiple clients. Each topic is split into one or more partitions, where each partition is an ordered, immutable sequence of records that is continually appended to. The partitions of a topic are distributed across different servers (brokers) in the Kafka cluster to balance load.
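
To make the partition model concrete, here is a minimal in-memory sketch. It is a toy model rather than Kafka's storage code: the ToyTopic class and its methods are hypothetical names, and CRC32 stands in for the hash that Kafka's own partitioner uses.

```python
import zlib
from typing import List, Tuple


class ToyTopic:
    """Toy in-memory model of a topic; not how Kafka stores data on disk."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        # Each partition is an append-only list; a record's offset is its index.
        self.partitions: List[List[Tuple[bytes, bytes]]] = [
            [] for _ in range(num_partitions)
        ]

    def partition_for(self, key: bytes) -> int:
        # Assumption: CRC32 is a stand-in hash chosen for this sketch only.
        return zlib.crc32(key) % self.num_partitions

    def append(self, key: bytes, value: bytes) -> Tuple[int, int]:
        """Append a record and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


topic = ToyTopic(num_partitions=3)
first = topic.append(b"user-1", b"login")
second = topic.append(b"user-1", b"logout")
print(first, second)
```

Records with the same key land in the same partition and receive increasing offsets, which is what makes ordering a per-partition property rather than a per-topic one.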

Storage Capacity

The storage capacity of a Kafka topic primarily depends on:

  1. The number of partitions in a topic: Each partition is stored on a single broker (plus its replicas), so more partitions allow a topic's data to be spread across more brokers and disks.
  2. Configuration settings: Kafka’s configuration allows administrators to control storage limits via retention policies.
  3. Physical storage: The aggregate storage across a Kafka cluster limits the total data a topic can store.

Retention Policies

Kafka provides two primary configuration settings that determine how long data is stored on a topic:

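  • Log retention time (log.retention.hours on the broker, retention.ms on a topic): This determines how long records are kept in a partition before they become eligible for deletion. The broker default is 168 hours (7 days).
  • Log retention size (log.retention.bytes on the broker, retention.bytes on a topic): This dictates the maximum size in bytes that a partition can reach before its oldest data is discarded. The limit applies to each partition, not to the topic as a whole, and is unlimited (-1) by default.

The log.retention.* settings are broker-wide defaults, and the topic-level settings override them for an individual topic. If both retention time and size are set, Kafka applies a "whichever comes first" policy: data is deleted as soon as either limit is exceeded. Deletion works on whole log segments, so a partition can temporarily hold somewhat more data than the configured limits.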
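
The "whichever comes first" behavior can be sketched as a small simulation. This is a simplified model under stated assumptions, not the broker's actual code: Segment, age_ms, and segments_to_delete are hypothetical names, and the limits mirror the time and size settings described above.

```python
from dataclasses import dataclass
from typing import List

DAY_MS = 24 * 60 * 60 * 1000


@dataclass
class Segment:
    size_bytes: int
    age_ms: int  # hypothetical field: time since the segment's newest record


def segments_to_delete(segments: List[Segment], retention_ms: int,
                       retention_bytes: int) -> List[Segment]:
    """Pick the oldest closed segments that breach either limit.

    `segments` is ordered oldest to newest; the last entry stands for the
    active segment and is never selected. A limit of -1 disables that check.
    """
    total = sum(s.size_bytes for s in segments)
    doomed: List[Segment] = []
    for seg in segments[:-1]:
        too_old = retention_ms >= 0 and seg.age_ms > retention_ms
        # Size check: drop a segment only if what remains still holds
        # at least retention_bytes.
        too_big = retention_bytes >= 0 and total - seg.size_bytes >= retention_bytes
        if not (too_old or too_big):
            break  # stop at the first segment that breaches neither limit
        doomed.append(seg)
        total -= seg.size_bytes
    return doomed


log = [Segment(100, 10 * DAY_MS), Segment(100, 5 * DAY_MS),
       Segment(100, 2 * DAY_MS), Segment(100, 0)]

by_time = segments_to_delete(log, retention_ms=7 * DAY_MS, retention_bytes=-1)
by_both = segments_to_delete(log, retention_ms=7 * DAY_MS, retention_bytes=200)
print(len(by_time), len(by_both))
```

With only the 7-day limit, just the 10-day-old segment is selected. Adding a 200-byte size limit also selects the 5-day-old segment, even though it has not yet expired by time.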

Example: Estimating Storage Requirements

Suppose you have a topic with 10 partitions and an average record size of 1 KB, and you want to estimate the maximum storage implied by your retention policy:

  • Retention policy: 7 days
  • Write rate: 1000 records per second to each partition

The maximum storage for one partition can be calculated as follows:

Data stored per partition = Records per second × Record size × Number of seconds per day × Retention days

Plugging in the numbers:

Data stored per partition = 1000 records/sec × 1 KB × 86,400 sec/day × 7 days = 604,800,000 KB ≈ 576.8 GiB (taking 1 GiB = 1,048,576 KB)

For 10 partitions:

Total data = 576.8 GiB × 10 ≈ 5,768 GiB ≈ 5.63 TiB

This is the size of a single copy of the data. With replication, the on-disk footprint across the cluster is multiplied by the topic's replication factor.
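
The same estimate can be written as a short function so the inputs are easy to vary. This is a rough sketch, assuming the write rate applies to each partition, that 1 KB means 1024 bytes, and ignoring replication, compression, and per-record overhead; estimate_storage_bytes is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400


def estimate_storage_bytes(records_per_sec_per_partition: int,
                           record_bytes: int,
                           retention_days: int,
                           partitions: int):
    """Return (bytes per partition, bytes for the whole topic), one copy only."""
    per_partition = (records_per_sec_per_partition * record_bytes
                     * SECONDS_PER_DAY * retention_days)
    return per_partition, per_partition * partitions


per_partition, total = estimate_storage_bytes(
    records_per_sec_per_partition=1000,
    record_bytes=1024,
    retention_days=7,
    partitions=10,
)
print(f"per partition: {per_partition / 2**30:.1f} GiB")
print(f"topic total:   {total / 2**40:.2f} TiB")
```

Keeping the result in bytes and converting only when printing avoids mixing decimal and binary units.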

Table: Kafka Storage Variables

Parameter            | Description                              | Impact on Storage Capacity
---------------------|------------------------------------------|----------------------------------------
Number of partitions | More partitions distribute data and load | Increases storage capacity
log.retention.hours  | Time before data is deleted              | Directly affects data retention
log.retention.bytes  | Maximum size of a partition              | Limits data based on size
Record size          | Size of each record in bytes             | Larger records consume more space
Throughput           | Records written per second               | Higher throughput increases storage use

Physical and Practical Limits

While Kafka itself doesn’t impose a strict maximum on the amount of data a topic can hold, practical limits are dictated by the underlying hardware, network capacity, and operational constraints such as maintenance overhead and cost. Performance can also degrade if a single partition grows excessively large, so it is important to balance the number of partitions against the size of each partition.
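
As a rough way to check an estimate against the hardware, the sketch below compares a topic's footprint with the disk available per broker. Everything here is an assumption made for illustration: fits_cluster is a hypothetical helper, data is assumed to spread evenly across brokers, the replication factor multiplies the stored bytes, and the 70% fill ceiling is an arbitrary safety margin rather than a Kafka recommendation.

```python
def fits_cluster(topic_bytes: int,
                 replication_factor: int,
                 brokers: int,
                 disk_bytes_per_broker: int,
                 max_fill: float = 0.7) -> bool:
    """Return True if the replicated topic fits within the per-broker budget.

    Assumes an even spread of partitions across brokers, which real
    clusters only approximate.
    """
    replicated_bytes = topic_bytes * replication_factor
    per_broker_bytes = replicated_bytes / brokers
    return per_broker_bytes <= disk_bytes_per_broker * max_fill


topic_bytes = 6_193_152_000_000   # example figure: about 5.63 TiB, one copy
disk = 4 * 10**12                 # assumed 4 TB of disk per broker

six_brokers = fits_cluster(topic_bytes, 3, brokers=6, disk_bytes_per_broker=disk)
eight_brokers = fits_cluster(topic_bytes, 3, brokers=8, disk_bytes_per_broker=disk)
print(six_brokers, eight_brokers)
```

Under these assumed numbers, six brokers fall short of the budget while eight fit, which illustrates how hardware rather than Kafka sets the practical ceiling.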

Conclusion

In practice, how much data a Kafka topic can store is influenced by the configuration of the topic's retention settings, the number of partitions, and the physical infrastructure of the Kafka cluster. Kafka’s architecture allows handling large datasets effectively, but planning and monitoring are crucial to maintain performance and manage storage efficiently. Kafka administrators need to carefully design topics considering future data growth to ensure scalability and resilience.

