Mystery about Kafka's retention period

Apache Kafka is a powerful, open-source stream processing platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on the abstraction of a distributed commit log. Since it deals with streams of records, the ability to retain and manage these records efficiently is critical. One core aspect of Kafka's capabilities is its retention policy, which plays a vital role in managing storage and ensuring that data is available for a defined period.

Understanding Kafka Retention Policies

Kafka's retention policy determines how long records in a topic are kept before they become eligible for deletion. Retention can be set as a broker-wide default (the log.retention.* properties) or overridden per topic (retention.ms, retention.bytes), and it can be based on time, size, or both.

Time-based Retention: This is the default behavior. Records are retained for a fixed period, seven days unless configured otherwise. Deletion works on whole log segments, so a segment becomes eligible for removal once its newest record is older than the retention period.

Size-based Retention: Here, retention is determined by the size of the log. The limit applies to each partition, not to the topic as a whole. Once a partition's log exceeds the specified size, its oldest segments are deleted to make room for newer records.
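The two policies can be illustrated with a small model of one partition's log. This is a simplified sketch, not Kafka's implementation: the `Segment` class and `deletable_segments` function are hypothetical names, and the logic assumes that deletion only ever removes the oldest closed segments and never the active one.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A closed log segment (hypothetical model, not a Kafka class)."""
    size_bytes: int
    newest_record_ms: int  # timestamp of the newest record in the segment


def deletable_segments(closed, now_ms, retention_ms=-1, retention_bytes=-1,
                       active_bytes=0):
    """Return the oldest-first run of closed segments that retention would delete.

    closed must be ordered oldest to newest and must not include the active
    segment. A limit of -1 disables the corresponding policy.
    """
    total = sum(s.size_bytes for s in closed) + active_bytes
    excess = total - retention_bytes  # only meaningful when the size limit is on
    doomed = []
    for seg in closed:
        expired = retention_ms >= 0 and now_ms - seg.newest_record_ms > retention_ms
        # Size-based: delete only if the log still meets the limit afterwards.
        oversized = retention_bytes >= 0 and excess - seg.size_bytes >= 0
        if not (expired or oversized):
            break  # stop at the first segment that must be kept
        excess -= seg.size_bytes
        doomed.append(seg)
    return doomed
```

With three 400-byte segments and a size limit of 800 bytes, only the oldest segment is returned; with a limit of 1000 bytes nothing is deleted, because removing a whole segment would drop the log below the limit.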

Configurations Affecting Retention

Kafka offers several configurations to handle data retention:

log.retention.hours, log.retention.minutes, log.retention.ms: These broker settings control the maximum time Kafka retains a log segment before it is eligible for deletion. If more than one is set, the finer-grained unit wins: log.retention.ms takes precedence over log.retention.minutes, which takes precedence over log.retention.hours.
log.retention.bytes: This setting determines the maximum size a partition's log may reach before its oldest segments become eligible for deletion.
log.segment.bytes: This setting controls the maximum size of each log segment file. When the size is reached, a new segment is started.
log.roll.ms, log.roll.hours: Control the period after which Kafka forces the log to roll to a new segment even if the segment size limit has not been reached (segment.ms at the topic level).
cleanup.policy (log.cleanup.policy at the broker level): Determines how old data is cleaned up. Options are "delete" (default), "compact", or both together. With "delete", segments beyond the retention time or size limit are removed. With "compact", Kafka keeps at least the latest value for each record key.
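The precedence among the time settings can be sketched as a small lookup. This is an illustrative model only: `effective_retention_ms` is a hypothetical helper, and the plain dictionaries stand in for broker properties and topic-level overrides.

```python
def effective_retention_ms(broker, topic=None):
    """Resolve the retention time in milliseconds for one topic.

    broker: dict of broker properties (log.retention.*).
    topic:  dict of topic-level overrides (retention.ms), or None.
    """
    topic = topic or {}
    # A topic-level override beats every broker default.
    if "retention.ms" in topic:
        return int(topic["retention.ms"])
    # Among broker settings, the finer-grained unit takes precedence.
    if "log.retention.ms" in broker:
        return int(broker["log.retention.ms"])
    if "log.retention.minutes" in broker:
        return int(broker["log.retention.minutes"]) * 60 * 1000
    # Documented default for log.retention.hours is 168 (7 days).
    return int(broker.get("log.retention.hours", 168)) * 60 * 60 * 1000
```

For example, a broker with both log.retention.hours=24 and log.retention.minutes=30 resolves to 30 minutes, and a topic that sets retention.ms ignores both.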

Technical Example

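Consider a broker configured with the following defaults, which apply to every topic that does not override them:

properties

# Delete old segments rather than compacting them
log.cleanup.policy=delete
# 7 days
log.retention.hours=168
# 1 GiB per partition
log.retention.bytes=1073741824
# 512 MiB per segment
log.segment.bytes=536870912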

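In this case, Kafka retains each partition's log for a maximum of 7 days or roughly 1 GiB, whichever limit is reached first. A new segment file is started whenever the current one reaches 512 MiB. Because deletion works on whole segments, a partition can temporarily hold somewhat more than 1 GiB. A segment that does not reach 512 MiB is not kept forever: the broker also rolls segments by time (log.roll.hours, 168 by default), after which the segment becomes eligible for deletion under the time-based policy.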
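A quick back-of-the-envelope calculation shows which limit governs for a given write rate. The sketch below assumes a steady write rate into a single partition and ignores segment granularity; `first_limit_reached` is a hypothetical helper, not part of Kafka.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def first_limit_reached(bytes_per_hour, retention_hours=168, retention_bytes=1 * GIB):
    """Return (limit_name, hours) for whichever retention limit a partition hits first.

    Assumes a constant write rate into one partition.
    """
    hours_to_fill = retention_bytes / bytes_per_hour
    if hours_to_fill < retention_hours:
        return "size", hours_to_fill
    return "time", float(retention_hours)
```

At 128 MiB per hour the partition reaches 1 GiB after 8 hours, so the size limit governs long before the 7-day limit. At 1 MiB per hour it would take 1024 hours to fill, so the time limit governs instead.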

Impact of Retention on Performance

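Retention settings directly affect disk usage and, indirectly, performance. Longer retention periods or larger size limits keep more data on disk, which raises storage cost and lengthens operations that copy whole partitions, such as replica recovery and partition reassignment. Hence, it is crucial to balance performance, cost, and the need for data availability.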
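Under size-based retention, disk usage for a topic can be roughly estimated from the limits themselves. The sketch below is an approximation under stated assumptions: it allows one extra segment per partition because deletion is segment-granular, it multiplies by the replication factor, and it ignores index files. The function name is hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def estimated_disk_bytes(partitions, replication_factor, retention_bytes, segment_bytes):
    """Rough cluster-wide disk estimate for one topic under size-based retention."""
    # Each partition may hold about one segment more than the retention limit.
    per_partition = retention_bytes + segment_bytes
    return partitions * replication_factor * per_partition
```

For example, 12 partitions with replication factor 3, a 1 GiB retention limit, and 512 MiB segments come to about 54 GiB across the cluster.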

Best Practices

Monitor Disk Usage: Continuously monitor disk usage to ensure you do not run out of space, which could affect Kafka's stability and performance.
Adjust Accordingly: Adjust retention settings based on your usage patterns and requirements. For example, consider longer retention periods for critical data that consumers may need to replay.
Use Topic-Level Retention: Override the broker defaults per topic (retention.ms, retention.bytes) when different data streams have different requirements.
Manage Log Cleanup: Choose cleanup.policy according to how the data is used. Use "delete" when consumers need the full history within a bounded window, and "compact" when only the latest value for each key matters.
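The effect of the "compact" policy can be illustrated with a short sketch. It is a simplified model, not Kafka's log cleaner: records are plain (offset, key, value) tuples, and a value of None stands in for a tombstone, which real Kafka keeps for a grace period before removing it.

```python
def compact(records):
    """Keep only the latest record per key, in offset order.

    records: iterable of (offset, key, value) tuples in offset order.
    A value of None models a tombstone that removes the key (simplified).
    """
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    survivors = [r for r in latest.values() if r[2] is not None]
    return sorted(survivors, key=lambda r: r[0])
```

Given two updates for key "a" and a write followed by a tombstone for key "b", only the second update for "a" survives. Offsets are preserved, so the compacted log has gaps rather than renumbered records.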

Summary Table

Configuration	Description	Default Value
`log.retention.hours`	Sets the time Kafka retains logs in hours.	168 (7 days)
`log.retention.bytes`	Maximum size of a partition's log before its oldest segments are eligible for deletion.	-1 (unlimited)
`log.segment.bytes`	Size threshold of each log segment after which a new segment is created.	1073741824 (1 GiB)
`log.roll.ms`	Time after which a new log segment is forced.	None (falls back to `log.roll.hours`, 168)
`log.cleanup.policy`	Whether old data is deleted, compacted, or both.	"delete"

Understanding and effectively configuring Kafka's retention policy is essential for managing data storage efficiently and ensuring that the system performs optimally while meeting all data availability and compliance requirements.