How much data can a Kafka topic store?
Apache Kafka is a distributed streaming platform known for its high throughput, reliability, and horizontal scalability. It is widely used for building real-time streaming data pipelines and applications that handle large volumes of data. One question that often arises concerns its storage capability: specifically, how much data can a Kafka topic store?
Understanding Kafka Topics and Partitions
A Kafka topic is a category or feed name to which records are published. Topics in Kafka are multi-subscriber; that is, they can be consumed by multiple clients. Each topic is split into one or more partitions, where each partition is an ordered, immutable sequence of records that is continually appended to. Partitions in a topic are distributed across different servers in the Kafka cluster to ensure load balancing.
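As a rough illustration of the topic and partition model described above, the following toy sketch treats each partition as an append-only list and routes keyed records to a partition by hashing the key. It is not how a Kafka client is implemented: the class name ToyTopic is hypothetical, and zlib.crc32 is used as a stand-in for the murmur2 hash that Kafka's default partitioner applies to keyed records.

```python
import zlib


class ToyTopic:
    """Toy in-memory model of a topic: a fixed list of append-only partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: list[list[bytes]] = [[] for _ in range(num_partitions)]

    def append(self, key: bytes, value: bytes) -> tuple[int, int]:
        """Append a record and return (partition index, offset within partition)."""
        # Stand-in hash (assumption): real Kafka clients use murmur2, not crc32.
        partition = zlib.crc32(key) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1


topic = ToyTopic(num_partitions=3)
first = topic.append(b"user-1", b"login")
second = topic.append(b"user-1", b"logout")
print(first, second)  # same partition both times, offsets 0 and then 1
```

Records with the same key land in the same partition and receive increasing offsets, which is what gives each partition its ordered, append-only character.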
Storage Capacity
The storage capacity of a Kafka topic primarily depends on:
- The number of partitions in a topic: Each partition replica is stored on a single broker, so more partitions let a topic spread its data across more brokers and disks in the cluster.
- Configuration settings: Kafka’s configuration allows administrators to control storage limits via retention policies.
- Physical storage: The aggregate storage across a Kafka cluster limits the total data a topic can store.
Retention Policies
Kafka provides two primary configuration settings that determine how long, and how much, data is kept in a topic:
- Log retention time (log.retention.hours): This determines how long records are kept in a partition before they are deleted. The default is 168 hours (7 days).
- Log retention size (log.retention.bytes): This dictates the maximum size in bytes of a partition before old data is discarded. The default is -1, meaning no size limit.
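These settings are broker-wide defaults; each topic can override them with the topic-level settings retention.ms and retention.bytes, offering flexibility based on specific needs. If both a time limit and a size limit are set, Kafka deletes old data as soon as either limit is reached (a "whichever comes first" policy). Retention is enforced by deleting whole log segments, so a partition can temporarily hold somewhat more than the configured limit.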
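To make the "whichever comes first" behaviour concrete, here is a simplified sketch of how a time limit and a size limit might be applied together to one partition. It is a toy model, not Kafka's actual log cleaner: the Segment class and apply_retention function are hypothetical names, segments are assumed to be ordered oldest first, and the last (active) segment is assumed never to be deleted.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    age_hours: float  # age of the newest record in the segment
    size_bytes: int


def apply_retention(segments, retention_hours=168, retention_bytes=-1):
    """Return the segments that survive, given segments ordered oldest first."""
    if not segments:
        return []
    closed, active = list(segments[:-1]), segments[-1]

    # Time limit: drop closed segments whose newest record is too old.
    closed = [s for s in closed if s.age_hours <= retention_hours]

    # Size limit (-1 means unlimited): drop the oldest closed segments while
    # the partition would still hold at least retention_bytes without them.
    if retention_bytes >= 0:
        total = sum(s.size_bytes for s in closed) + active.size_bytes
        while closed and total - closed[0].size_bytes >= retention_bytes:
            total -= closed.pop(0).size_bytes

    return closed + [active]


segments = [Segment(200, 100), Segment(100, 100), Segment(50, 100), Segment(1, 40)]
print(len(apply_retention(segments, 168, 200)))  # 3: only the 200-hour segment goes
print(len(apply_retention(segments, 168, 140)))  # 2: the size limit removes one more
print(len(apply_retention(segments, 24, -1)))    # 1: only the active segment is left
```

In the first call the partition still holds 240 bytes against a 200-byte limit, because removing another whole segment would drop it below the limit. This is why actual disk usage can sit a little above the configured retention size.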
Example: Estimating Storage Requirements
Suppose you have a topic with 10 partitions, and you want to calculate the maximum storage based on your retention policy and an average record size of 1 KB:
- Retention policy: 7 days
- Write rate: 1000 records per second per partition
The maximum storage for one partition can be calculated as follows:
Data stored per partition = Records per second × Record size × Number of seconds per day × Retention days
Plugging in the numbers:
Data stored per partition = 1000 records/sec × 1 KB × 86400 sec/day × 7 days = 604,800,000 KB ≈ 577 GB (taking 1 GB = 1024 × 1024 KB)
For 10 partitions:
Total Data = 577 GB × 10 = 5770 GB ≈ 5.63 TB (taking 1 TB = 1024 GB)
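The same estimate can be reproduced with a small helper, which makes it easy to try other rates and retention periods. This is a minimal sketch: the function name is hypothetical, the write rate is taken per partition as in the calculation above, sizes use binary units (1 KB = 1024 bytes), and the replication_factor parameter is an added assumption that the worked example does not cover (it defaults to 1, meaning a single copy of the data).

```python
SECONDS_PER_DAY = 86_400


def estimate_topic_storage_bytes(
    records_per_sec_per_partition: int,
    record_size_bytes: int,
    retention_days: int,
    partitions: int,
    replication_factor: int = 1,  # assumption: not part of the worked example
) -> tuple[int, int]:
    """Return (bytes per partition, bytes for the whole topic) at steady state."""
    per_partition = (
        records_per_sec_per_partition
        * record_size_bytes
        * SECONDS_PER_DAY
        * retention_days
    )
    total = per_partition * partitions * replication_factor
    return per_partition, total


per_partition, total = estimate_topic_storage_bytes(1000, 1024, 7, 10)
print(f"per partition: {per_partition / 1024**3:.1f} GiB")  # per partition: 576.8 GiB
print(f"whole topic:   {total / 1024**4:.2f} TiB")          # whole topic:   5.63 TiB
```

Passing replication_factor=3 would triple the total, since every replica stores its own copy of each partition. The estimate also ignores compression and per-record overhead, so it is a planning figure rather than an exact prediction.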
Table: Kafka Storage Variables
| Parameter | Description | Impact on Storage Capacity |
| --- | --- | --- |
| Number of partitions | More partitions distribute data and load across brokers | Raises the amount of data the topic can hold |
| log.retention.hours | Time before data is deleted | Longer retention keeps more data on disk |
| log.retention.bytes | Maximum size of a partition, in bytes | Caps the data kept per partition |
| Record size | Size of each record in bytes | Larger records consume more space |
| Throughput | Records written per second | Higher throughput increases storage use |
Physical and Practical Limits
While Kafka itself doesn’t impose a strict maximum on the amount of data a topic can hold, practical limits are dictated by the underlying hardware, network capacity, and operational constraints such as maintenance overhead and cost. Performance and recovery times can suffer if a single partition grows excessively large, so it is important to balance the number of partitions against the size of each partition.
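One way to reason about that balance is to pick a target size per partition and derive a partition count from the expected retained data. The sketch below does only that arithmetic. The function name and the 50 GiB target are purely illustrative assumptions, not Kafka recommendations; a suitable target depends on your disks, brokers, and recovery requirements.

```python
import math


def partitions_for_target_size(total_bytes: int, target_partition_bytes: int) -> int:
    """Smallest partition count that keeps each partition at or under the target size."""
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    return max(1, math.ceil(total_bytes / target_partition_bytes))


retained = 5 * 1024**4   # 5 TiB of retained data (illustrative)
target = 50 * 1024**3    # hypothetical target of 50 GiB per partition
print(partitions_for_target_size(retained, target))  # 103
```

The calculation assumes data is spread evenly across partitions. With skewed keys some partitions will grow larger than the target, so it is worth leaving headroom.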
Conclusion
In practice, how much data a Kafka topic can store is influenced by the configuration of the topic's retention settings, the number of partitions, and the physical infrastructure of the Kafka cluster. Kafka’s architecture allows handling large datasets effectively, but planning and monitoring are crucial to maintain performance and manage storage efficiently. Kafka administrators need to carefully design topics considering future data growth to ensure scalability and resilience.

