Mystery about Kafka's retention period
Apache Kafka is a powerful, open-source stream processing platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on the abstraction of a distributed commit log. Since it deals with streams of records, the ability to retain and manage these records efficiently is critical. One core aspect of Kafka's capabilities is its retention policy, which plays a vital role in managing storage and ensuring that data is available for a defined period.
Understanding Kafka Retention Policies
Kafka's data retention policy determines how long records in a topic are kept before they become eligible for deletion. Brokers define cluster-wide defaults, individual topics can override them, and the limit can be based on time, on size, or on both.
- Time-based Retention: This is the default, where messages are retained for a set period (seven days unless configured otherwise). Records older than the retention period become eligible for deletion.
- Size-based Retention: Here, retention is determined by the size of the log, measured per partition rather than per topic. Once a partition's log exceeds the configured size, its oldest segments are deleted to make room for newer records.
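The two policies can be sketched as a small simulation. This is a simplified model rather than Kafka's actual implementation: the `Segment` class and both function names are hypothetical, and the sketch ignores details such as the periodic retention check and the special handling of the segment currently being written.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    max_timestamp_ms: int  # timestamp of the newest record in the segment


def expired_by_time(segments, now_ms, retention_ms):
    """Segments whose newest record is older than the retention window."""
    if retention_ms < 0:  # a negative value is treated as "no time limit"
        return []
    return [s for s in segments if now_ms - s.max_timestamp_ms > retention_ms]


def expired_by_size(segments, retention_bytes):
    """Oldest segments that can be dropped while at least retention_bytes remain."""
    if retention_bytes < 0:  # a negative value is treated as "no size limit"
        return []
    excess = sum(s.size_bytes for s in segments) - retention_bytes
    doomed = []
    for s in segments:  # segments are assumed to be ordered oldest first
        if excess - s.size_bytes < 0:
            break
        excess -= s.size_bytes
        doomed.append(s)
    return doomed
```

Both functions work on whole segments, which reflects the fact that Kafka deletes data one segment file at a time rather than record by record.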
Configurations Affecting Retention
Kafka offers several configurations to handle data retention:
- log.retention.hours, log.retention.minutes, log.retention.ms: These settings control the maximum time Kafka retains a log segment before it is eligible for deletion. If more than one is set, the finer-grained unit takes precedence (ms over minutes over hours).
- log.retention.bytes: This setting determines the maximum size a partition's log may reach before its oldest segments are eligible for deletion.
- log.segment.bytes: This setting controls the size of each log segment file. When the size is reached, a new segment is started.
- log.roll.ms or log.roll.hours (segment.ms at the topic level): Controls the period after which Kafka forces the log to roll even if the segment file size has not been reached.
- cleanup.policy (log.cleanup.policy at the broker level): Determines the method used for log cleanup. Options include "delete" (default) and "compact". With "delete", segments beyond the retention limits are deleted. With "compact", Kafka keeps at least the latest record for each key.
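Because the three time settings overlap, it helps to see how they resolve to a single value. The helper below is a hypothetical sketch (the function name and the plain-dict input are assumptions), based on the documented precedence of `log.retention.ms` over `log.retention.minutes` over `log.retention.hours`.

```python
def effective_retention_ms(broker_config):
    """Resolve the time-based retention in milliseconds from broker settings.

    broker_config is assumed to be a dict of setting name to string or int value.
    A result of -1 means no time limit is applied.
    """
    if "log.retention.ms" in broker_config:
        return int(broker_config["log.retention.ms"])
    if "log.retention.minutes" in broker_config:
        return int(broker_config["log.retention.minutes"]) * 60 * 1000
    # Fall back to hours, whose default is 168 (7 days).
    return int(broker_config.get("log.retention.hours", 168)) * 60 * 60 * 1000
```

With an empty configuration the function returns the 7-day default expressed in milliseconds.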
Technical Example
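Consider a Kafka topic whose effective settings are a 7-day time limit (for example log.retention.hours=168), a 1 GB size limit (log.retention.bytes=1073741824), and 512 MB segments (log.segment.bytes=536870912).
In this case, Kafka retains each partition's log for up to 7 days or 1 GB, whichever limit is reached first. Segment files roll once they reach 512 MB. Because retention removes whole segments, a record is deleted only when the segment that contains it becomes eligible. A segment that never reaches 512 MB is still subject to the time-based limit once its newest record is older than 7 days.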
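Which of the two limits applies first depends on how fast data arrives. The following sketch, which assumes a steady write rate into a single partition and uses a hypothetical function name, compares the time needed to write 1 GB with the 7-day window from the scenario above.

```python
GIB = 1024 ** 3
DAY_MS = 24 * 60 * 60 * 1000


def binding_limit(ingest_bytes_per_sec, retention_ms=7 * DAY_MS, retention_bytes=1 * GIB):
    """Return which limit a steadily written partition reaches first."""
    ms_to_fill = retention_bytes / ingest_bytes_per_sec * 1000
    return "size" if ms_to_fill < retention_ms else "time"
```

Under these assumptions the break-even rate is about 1.8 KB/s per partition. Anything faster hits the size limit first, so records may be deleted well before they are 7 days old.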
Impact of Retention on Performance
Retention settings can significantly affect Kafka's resource usage. Longer retention periods or larger size limits increase disk usage on every broker that hosts a replica. It is therefore important to balance performance, cost, and how long the data must remain available.
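A rough capacity estimate makes this trade-off concrete. The function below is a back-of-the-envelope sketch with hypothetical names. It assumes a steady ingest rate and ignores compression, index files, and the fact that data is removed one segment at a time.

```python
def estimate_topic_disk_bytes(ingest_bytes_per_sec, retention_seconds,
                              replication_factor, partitions=1, retention_bytes=-1):
    """Approximate cluster-wide disk footprint of one topic."""
    retained = ingest_bytes_per_sec * retention_seconds
    if retention_bytes >= 0:
        # The size limit applies to each partition separately.
        retained = min(retained, retention_bytes * partitions)
    # Every replica stores its own copy of the retained data.
    return retained * replication_factor
```

For example, 1 MB/s retained for one day with a replication factor of 3 comes to roughly 259 GB before any overhead.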
Best Practices
- Monitor Disk Usage: Continuously monitor disk usage to ensure you do not run out of space, which could affect Kafka's stability and performance.
- Adjust Accordingly: Adjust retention settings based on your usage patterns and requirements. For example, consider longer retention periods for critical data.
- Use Topic-Level Retention: Customize settings per topic if different data streams have different requirements.
- Manage Log Cleanup: Set cleanup.policy according to whether you need the full history of records within the retention window ("delete") or only the latest value for each key ("compact").
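To illustrate what "compact" keeps, here is a minimal in-memory sketch. It is not how the broker's log cleaner is implemented, and the `compact` function is hypothetical. It assumes records arrive as (key, value) pairs in offset order and that a `None` value marks a key for removal.

```python
def compact(records):
    """Keep only the latest value per key and drop keys whose latest value is None."""
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = [(offset, key, value)
                 for key, (offset, value) in latest.items()
                 if value is not None]
    return sorted(survivors)  # restore offset order
```

With the "delete" policy all records in the input would remain until they age out or the size limit is exceeded. With "compact", older values for a key disappear regardless of age.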
Summary Table
| Configuration | Description | Default Value |
|---|---|---|
| log.retention.hours | Time Kafka retains log segments, in hours. | 168 (7 days) |
| log.retention.bytes | Maximum size of a partition's log before its oldest segments are eligible for deletion. | -1 (unlimited) |
| log.segment.bytes | Size threshold of each log segment, after which a new segment is created. | 1073741824 (1 GiB) |
| log.roll.ms | Time after which a new log segment is forced. | None (falls back to log.roll.hours, 168) |
| cleanup.policy | Method used to manage log deletion or compaction. | "delete" |
Understanding and effectively configuring Kafka's retention policy is essential for managing data storage efficiently and ensuring that the system performs optimally while meeting all data availability and compliance requirements.

