Retaining data in Apache Kafka

Apache Kafka is a distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since it deals with streams of data, the way it retains data is fundamental to its operation and performance. Data retention in Kafka is influenced by a combination of factors including topic configurations, broker settings, and retention policies.

How Kafka Stores Data

Kafka stores records in topics. Topics are divided into partitions, where each partition is an ordered, immutable sequence of records that is continually appended to. Partitions are distributed across a cluster of brokers for fault tolerance and increased performance.

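Data within each partition is split into chunks called segments. A segment file contains a contiguous run of the partition's records on disk, and each segment has associated index files that give quick access to data at read time. Kafka appends only to the newest segment, called the active segment, and removes data from disk one whole segment at a time, which is why the retention settings below act on segments rather than on individual records.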
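As a rough illustration of how segments make reads cheap, the sketch below models a partition as a sorted list of segment base offsets and finds the segment that would hold a given offset. The function names are hypothetical and this is a toy model, not Kafka's code; the one detail taken from Kafka itself is that a segment's log file is named after its base offset, zero-padded to 20 digits.

```python
from bisect import bisect_right


def segment_file_name(base_offset):
    """Segment log files are named after their base offset, padded to 20 digits."""
    return f"{base_offset:020d}.log"


def find_segment(base_offsets, offset):
    """Return the base offset of the segment that would contain `offset`.

    `base_offsets` must be sorted ascending (hypothetical in-memory stand-in
    for the list of segment files in a partition directory).
    """
    i = bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise ValueError(f"offset {offset} is before the first retained segment")
    return base_offsets[i]


# Three segments starting at offsets 0, 1000 and 2500 (made-up values).
base_offsets = [0, 1000, 2500]
print(segment_file_name(find_segment(base_offsets, 1200)))
```

An offset below the first base offset raises an error here, which corresponds to asking for data that retention has already removed.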

Retention Policies

Kafka offers several configurations to manage how long data is retained in a topic:

Time-based Retention (log.retention.hours)

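The default retention policy deletes data from a topic once it is older than a specified period of time. This period is configured by the log.retention.hours, log.retention.minutes, or log.retention.ms setting, depending on the level of granularity required. If more than one is set, log.retention.ms takes precedence over log.retention.minutes, which in turn takes precedence over log.retention.hours. Deletion happens a whole segment at a time, so a segment becomes eligible only when its newest record is older than the retention period, and individual records can outlive the configured time.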
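The following is a minimal sketch of those two ideas: resolving which of the three settings wins, and deciding which segments have aged out. It is a simplified model with hypothetical function names, and it assumes segments are checked oldest first and that checking stops at the first segment that is still within the retention period.

```python
def effective_retention_ms(hours=168, minutes=None, ms=None):
    """Resolve the retention period: ms beats minutes, minutes beats hours."""
    if ms is not None:
        return ms
    if minutes is not None:
        return minutes * 60 * 1000
    return hours * 60 * 60 * 1000


def expired_segments(segments, now_ms, retention_ms):
    """Return base offsets of the leading run of segments that have aged out.

    `segments` is a list of (base_offset, newest_record_timestamp_ms) tuples,
    oldest segment first. A segment counts as expired only when its newest
    record is older than the retention period (toy model, see lead-in).
    """
    expired = []
    for base_offset, newest_ts in segments:
        if now_ms - newest_ts <= retention_ms:
            break
        expired.append(base_offset)
    return expired


DAY_MS = 24 * 60 * 60 * 1000
segments = [(0, 0), (1000, 2 * DAY_MS), (2500, 9 * DAY_MS)]
print(expired_segments(segments, now_ms=10 * DAY_MS,
                       retention_ms=effective_retention_ms()))
```

With the default of 168 hours, the first two segments (newest records 10 and 8 days old) are reported as expired, while the third is kept because its newest record is only 1 day old.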

Size-based Retention (log.retention.bytes)

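Size-based retention involves setting a maximum size in bytes for a log. Once the log grows beyond this size, the oldest segments are deleted to bring it back toward the limit. The limit applies to each partition rather than to the topic as a whole, and because deletion works on whole segments, the size on disk only approximates the configured value.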
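A small sketch can make the "whole segments" caveat concrete. The model below is an assumption-laden simplification with a hypothetical function name: it removes the oldest segment only while the log would still be at least as large as the limit afterwards, so the remaining log can sit somewhat above the configured size.

```python
def segments_to_delete_by_size(segment_sizes, retention_bytes):
    """Return how many of the oldest segments a size limit would remove.

    `segment_sizes` lists segment sizes in bytes, oldest first. A negative
    limit means "unlimited", matching the -1 default of log.retention.bytes.
    Assumption of this toy model: a segment is dropped only if the log still
    holds at least `retention_bytes` after dropping it.
    """
    if retention_bytes < 0:
        return 0
    excess = sum(segment_sizes) - retention_bytes
    count = 0
    for size in segment_sizes:
        if excess - size < 0:
            break
        excess -= size
        count += 1
    return count


sizes = [100, 100, 100, 50]  # made-up sizes, oldest first
n = segments_to_delete_by_size(sizes, retention_bytes=200)
print(n, sum(sizes[n:]))
```

In this example one segment is removed and 250 bytes remain against a limit of 200, which shows why the setting is best treated as approximate when planning disk capacity.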

Compaction (log.cleanup.policy=compact)

Log compaction is a process by which Kafka ensures that a topic retains at least the last known value for each record key. This is particularly useful for topics that act as a record store or are used to restore state within a system. Time- and size-based retention apply to a compacted topic only when the cleanup policy combines both modes (log.cleanup.policy=compact,delete); with compact alone, the latest value for each key is kept indefinitely.
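The guarantee described above can be sketched in a few lines. This is a toy model over an in-memory list with a hypothetical function name, not how the broker's log cleaner is implemented; it only shows the outcome: for every key, the record with the highest offset survives and offsets are left unchanged.

```python
def compact(records):
    """Keep only the latest record per key, preserving offset order.

    `records` is a list of (offset, key, value) tuples in offset order.
    """
    latest_offset = {}
    for offset, key, _value in records:
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [r for r in records if latest_offset[r[1]] == r[0]]


log = [
    (0, "user-1", "alice@example.com"),
    (1, "user-2", "bob@example.com"),
    (2, "user-1", "alice@new.example.com"),
]
for record in compact(log):
    print(record)
```

After compaction only offsets 1 and 2 remain, so a consumer replaying the topic from the start still rebuilds the current value for both keys.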

Additional Configurations Impacting Data Retention

Minimum Cleanable Dirty Ratio (log.cleaner.min.cleanable.ratio)

This configuration controls what fraction of the log must be dirty (i.e., written since the last compaction, and so possibly holding superseded or deleted keys) before the log cleaner will consider it for compaction. Lower values cause more frequent cleaning cycles, which wastes less space on stale records at the cost of more I/O.
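As a worked illustration, the ratio can be thought of as the dirty bytes divided by the total bytes of the log. The helper names below are hypothetical, and the strict "greater than" comparison is an assumption of this sketch rather than a statement about the broker's exact check.

```python
def dirty_ratio(clean_bytes, dirty_bytes):
    """Fraction of the log that has not been compacted yet."""
    total = clean_bytes + dirty_bytes
    return dirty_bytes / total if total else 0.0


def should_clean(clean_bytes, dirty_bytes, min_cleanable_ratio=0.5):
    """True when the dirty fraction exceeds the configured threshold."""
    return dirty_ratio(clean_bytes, dirty_bytes) > min_cleanable_ratio


# Made-up sizes: 300 MB already compacted, 700 MB written since then.
print(dirty_ratio(300, 700), should_clean(300, 700))
# Same log with a mostly clean tail: 900 MB clean, 100 MB dirty.
print(dirty_ratio(900, 100), should_clean(900, 100))
```

With the default threshold of 0.5, the first log qualifies for cleaning and the second does not.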

Delete Retention (log.cleaner.delete.retention.ms)

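This setting specifies how long delete markers are retained in the log. Delete markers (also known as tombstones) are records with a null value, used in compacted topics to indicate that the record with a particular key has been deleted. Keeping them for a while gives consumers that are reading the topic time to observe the deletion before the marker itself is removed.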
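To show how tombstones interact with compaction, here is a toy model in which a tombstone is a record whose value is None. The function name is hypothetical, and the timing is deliberately simplified: the sketch measures a tombstone's age from its own timestamp, which is an assumption made for illustration and not a description of the broker's bookkeeping.

```python
def compact_with_tombstones(records, now_ms, delete_retention_ms=86_400_000):
    """Compact a log and drop tombstones older than the delete retention.

    `records` is a list of (offset, key, value, timestamp_ms) tuples in
    offset order; value None marks a tombstone.
    """
    latest_offset = {}
    for offset, key, _value, _ts in records:
        latest_offset[key] = offset

    kept = []
    for record in records:
        offset, key, value, ts = record
        if latest_offset[key] != offset:
            continue  # superseded by a later record for the same key
        if value is None and now_ms - ts > delete_retention_ms:
            continue  # tombstone has outlived the delete retention
        kept.append(record)
    return kept


HOUR_MS = 60 * 60 * 1000
log = [
    (0, "user-1", "alice", 0),
    (1, "user-2", "bob", 0),
    (2, "user-1", None, 1 * HOUR_MS),  # delete user-1
]
print(compact_with_tombstones(log, now_ms=2 * HOUR_MS))
print(compact_with_tombstones(log, now_ms=30 * HOUR_MS))
```

Two hours in, the tombstone for user-1 is still present alongside user-2; thirty hours in, it has passed the one-day default and only the user-2 record remains.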

Summary of Key Configuration Parameters

The following table summarizes the key Kafka configuration parameters that impact data retention:

| Parameter | Description | Default Value |
| --- | --- | --- |
| log.retention.hours | Maximum time to retain a log segment before it is deleted | 168 (7 days) |
| log.retention.bytes | Maximum size of a partition's log before old segments are deleted | -1 (unlimited) |
| log.cleanup.policy | Default cleanup policy for logs (delete, compact, or both) | delete |
| log.cleaner.min.cleanable.ratio | Minimum "dirty" ratio to trigger log compaction | 0.5 |
| log.cleaner.delete.retention.ms | Time to retain delete markers in compacted topics | 86400000 (1 day) |

Practical Example

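Consider a scenario where data retention is critical for compliance reasons. If a Kafka cluster must retain data for at least 4 weeks, you would set the broker-wide default as shown below. Because Kafka deletes whole segments, data may remain on disk somewhat longer than 4 weeks; the setting is a lower bound, not an exact expiry. To apply the same period to a single topic only, set the topic-level retention.ms override (2419200000 for 4 weeks) instead.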

```properties
# Set the log retention to 4 weeks (4 * 7 * 24 = 672 hours)
log.retention.hours=672
```
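If the required period changes, the values can be derived rather than computed by hand. The helper below is a hypothetical convenience, not part of any Kafka tooling; it converts a period into the equivalent hour and millisecond values, truncating the hour value if the period is not a whole number of hours.

```python
from datetime import timedelta


def retention_settings(period):
    """Return retention values equivalent to `period` (a timedelta)."""
    total_ms = int(period.total_seconds() * 1000)
    return {
        "log.retention.hours": total_ms // 3_600_000,
        "log.retention.ms": total_ms,
    }


for name, value in retention_settings(timedelta(weeks=4)).items():
    print(f"{name}={value}")
```

Only one of the two printed values is needed in practice; the millisecond form is the one that wins if both are present.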

Conclusion

Understanding and configuring the data retention settings appropriately is essential to managing storage, ensuring data availability, and complying with data governance standards in Kafka. Whether to use time-based, size-based, or compaction strategies will depend on the specific requirements of the application and the nature of the data being handled.

