Retaining data in Apache Kafka

Apache Kafka is a distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since it deals with streams of data, the way it retains data is fundamental to its operation and performance. Data retention in Kafka is influenced by a combination of factors including topic configurations, broker settings, and retention policies.

How Kafka Stores Data

Kafka stores records in topics. Topics are divided into partitions, where each partition is an ordered, immutable sequence of records that is continually appended to. Partitions are distributed across a cluster of brokers for fault tolerance and increased performance.

Data within each partition is split into chunks called segments. A segment file in Kafka contains a set of records that are written to disk. Each segment file has an associated index file to provide quick access to data at read time.
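The partition and segment layout described above can be illustrated with a small in-memory model. This is a sketch only: the `PartitionLog` and `Segment` classes and the `max_records` roll threshold are hypothetical names invented for illustration. A real broker rolls segments by size or age and keeps them on disk.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    base_offset: int                      # offset of the first record in this segment
    records: list = field(default_factory=list)


class PartitionLog:
    """Toy append-only log that rolls a new segment every `max_records` records."""

    def __init__(self, max_records: int = 3):
        self.max_records = max_records    # hypothetical roll threshold
        self.segments = [Segment(base_offset=0)]
        self.next_offset = 0

    def append(self, value) -> int:
        active = self.segments[-1]
        if len(active.records) >= self.max_records:
            # Roll: the full segment is closed and a new active segment starts.
            active = Segment(base_offset=self.next_offset)
            self.segments.append(active)
        active.records.append(value)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read(self, offset: int):
        if not 0 <= offset < self.next_offset:
            raise KeyError(offset)
        # Pick the last segment whose base offset is <= the requested offset.
        # Narrowing a lookup like this is the job the index files speed up on disk.
        for segment in reversed(self.segments):
            if segment.base_offset <= offset:
                return segment.records[offset - segment.base_offset]
        raise KeyError(offset)


log = PartitionLog(max_records=3)
offsets = [log.append(f"event-{i}") for i in range(7)]
print([s.base_offset for s in log.segments])  # [0, 3, 6]
print(log.read(4))                            # event-4
```

Because records are only ever appended, every segment except the last one is closed, which is what lets retention work by dropping whole segments.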

Retention Policies

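Kafka offers several configurations to manage how long data is retained in a topic. The `log.*` settings discussed here are broker-wide defaults; each can be overridden per topic with the corresponding topic-level setting (`retention.ms`, `retention.bytes`, `cleanup.policy`):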

Time-based Retention (`log.retention.hours`)

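The default time-based retention policy deletes data from a topic once it is older than a specified period. The period is configured with `log.retention.hours`, `log.retention.minutes`, or `log.retention.ms`, depending on the granularity required. If more than one is set, the finest-grained setting wins: `log.retention.ms` takes precedence over `log.retention.minutes`, which takes precedence over `log.retention.hours`. Retention is applied to whole log segments rather than to individual records: a segment becomes eligible for deletion once its newest record is older than the configured period.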
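To make the time-based rule concrete, here is a minimal sketch of how the three settings resolve to one value and how that value is applied per segment. The function names and the `(name, newest_timestamp_ms)` tuples are assumptions made for illustration, not broker APIs.

```python
HOUR_MS = 60 * 60 * 1000


def effective_retention_ms(ms=None, minutes=None, hours=168) -> int:
    """Resolve the retention period: ms beats minutes, minutes beats hours."""
    if ms is not None:
        return ms
    if minutes is not None:
        return minutes * 60 * 1000
    return hours * HOUR_MS


def expired_segments(segments, now_ms, retention_ms):
    """Return names of segments whose newest record is older than retention_ms.

    `segments` is a list of (name, newest_record_timestamp_ms) tuples.
    A negative retention_ms means no time limit applies.
    """
    if retention_ms < 0:
        return []
    return [name for name, newest_ts in segments if now_ms - newest_ts > retention_ms]


retention = effective_retention_ms()           # default of 168 hours
now = 240 * HOUR_MS                            # pretend "now" is hour 240
segments = [("seg-0", 24 * HOUR_MS), ("seg-1", 60 * HOUR_MS), ("seg-2", 96 * HOUR_MS)]
print(retention)                               # 604800000
print(expired_segments(segments, now, retention))  # ['seg-0', 'seg-1']
```

Note that `seg-2` survives even if it contains some records older than seven days, because its newest record is only 144 hours old.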

Size-based Retention (`log.retention.bytes`)

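Size-based retention sets a maximum size in bytes for a log. The limit applies to each partition individually, not to the topic as a whole, so a topic with N partitions can hold up to roughly N times this value. Once a partition's log grows beyond the limit, the oldest segments are deleted, a whole segment at a time, to bring it back toward the limit.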
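The following sketch models size-based deletion for a single partition. It assumes, as a simplification of broker behaviour, that the oldest segment is removed only if the remaining log would still hold at least the configured number of bytes, so the log can sit somewhat above the limit after cleanup. `size_breached_segments` is a hypothetical helper, not a Kafka API.

```python
def size_breached_segments(segment_sizes, retention_bytes):
    """Return indexes of the oldest segments to delete.

    `segment_sizes` lists segment sizes in bytes, oldest first.
    A negative retention_bytes means the size is unlimited.
    """
    if retention_bytes < 0:
        return []
    excess = sum(segment_sizes) - retention_bytes
    doomed = []
    for index, size in enumerate(segment_sizes):
        # Assumption: stop as soon as deleting this segment would leave
        # less than retention_bytes in the log.
        if excess - size < 0:
            break
        excess -= size
        doomed.append(index)
    return doomed


sizes = [100, 100, 100, 100, 40]               # 440 bytes in total
print(size_breached_segments(sizes, 250))      # [0]
print(size_breached_segments(sizes, 200))      # [0, 1]
print(size_breached_segments(sizes, -1))       # []
```

With a limit of 250 bytes only the oldest segment goes, leaving 340 bytes, because dropping a second segment would leave 240 bytes, which is below the limit.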

Compaction (`log.cleanup.policy=compact`)

Log compaction is a process by which Kafka ensures that a topic retains at least the last known value for each record key. This is particularly useful for topics that act as a record store or are used to restore state within a system. With `log.cleanup.policy=compact` on its own, the time- and size-based retention settings are not applied; to compact a log and also delete old segments, set the policy to `compact,delete`.
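The effect of compaction on a log can be sketched in a few lines. This is an illustrative model, not the cleaner's actual algorithm: records are `(offset, key, value)` tuples, and `compact` is a hypothetical function that keeps only the most recent record for each key.

```python
def compact(records):
    """Keep the latest record per key, preserving offsets and order."""
    latest_offset = {}
    for offset, key, _value in records:
        latest_offset[key] = offset            # later records overwrite earlier ones
    return [r for r in records if latest_offset[r[1]] == r[0]]


records = [
    (0, "user-1", "a"),
    (1, "user-2", "b"),
    (2, "user-1", "c"),      # supersedes offset 0
    (3, "user-3", "d"),
    (4, "user-2", None),     # null value: a delete marker for user-2
]
for record in compact(records):
    print(record)
# (2, 'user-1', 'c')
# (3, 'user-3', 'd')
# (4, 'user-2', None)
```

Surviving records keep their original offsets, so the compacted log has gaps in its offset sequence rather than being renumbered.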

Additional Configurations Impacting Data Retention

Minimum Cleanable Dirty Ratio (`log.cleaner.min.cleanable.ratio`)

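This configuration controls how much of the log has to be dirty before the log cleaner will begin compacting it. The dirty portion is the part written since the last compaction pass, which may therefore contain updates or deletions for existing keys. The setting is expressed as the ratio of dirty bytes to total log bytes. Lower values cause more frequent cleaning cycles, at the cost of more cleaner I/O.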
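As a rough illustration of the threshold check, the sketch below computes the dirty ratio from two byte counts. The function names are hypothetical, and the strict greater-than comparison is an assumption about how the threshold is applied.

```python
def dirty_ratio(clean_bytes: int, dirty_bytes: int) -> float:
    """Fraction of the log that has not been compacted yet."""
    total = clean_bytes + dirty_bytes
    return dirty_bytes / total if total else 0.0


def should_clean(clean_bytes: int, dirty_bytes: int, min_cleanable_ratio: float = 0.5) -> bool:
    # Assumption: the log qualifies once the ratio exceeds the threshold.
    return dirty_ratio(clean_bytes, dirty_bytes) > min_cleanable_ratio


print(should_clean(600, 400))         # False: ratio is 0.4
print(should_clean(300, 700))         # True: ratio is 0.7
print(should_clean(600, 400, 0.3))    # True: a lower threshold cleans sooner
```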

Delete Retention (`log.cleaner.delete.retention.ms`)

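This setting specifies how long delete markers are retained in the log. Delete markers (also known as tombstones) are records with a null value; in compacted topics they indicate that the record with that key has been deleted. Keeping them for a while gives consumers that are reading the log from the beginning time to observe the deletion before the marker itself is removed.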
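A simplified view of tombstone expiry is sketched below. It measures a tombstone's age from its record timestamp, which is an assumption made to keep the example short; the broker's own bookkeeping for when the retention clock starts is more involved. `drop_expired_tombstones` is a hypothetical helper.

```python
DAY_MS = 86_400_000


def drop_expired_tombstones(records, now_ms, delete_retention_ms=DAY_MS):
    """Remove tombstones (value is None) older than delete_retention_ms.

    `records` is a list of (key, value, timestamp_ms) tuples.
    Ordinary records are always kept by this function.
    """
    return [
        (key, value, ts)
        for key, value, ts in records
        if value is not None or now_ms - ts <= delete_retention_ms
    ]


now = 3 * DAY_MS
records = [
    ("user-1", "c", 0),                    # ordinary record, kept
    ("user-2", None, DAY_MS // 2),         # tombstone, 2.5 days old, dropped
    ("user-3", None, 5 * DAY_MS // 2),     # tombstone, 0.5 days old, kept
]
print(drop_expired_tombstones(records, now))
# [('user-1', 'c', 0), ('user-3', None, 216000000)]
```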

Summary of Key Configuration Parameters

The following table summarizes the key Kafka configuration parameters that impact data retention:

Parameter	Description	Default Value
`log.retention.hours`	Time to keep a log segment before it becomes eligible for deletion (overridden by `log.retention.minutes` or `log.retention.ms` if set)	`168` (7 days)
`log.retention.bytes`	Maximum size of a partition's log before old segments are deleted	`-1` (unlimited)
`log.cleanup.policy`	Default cleanup policy for logs (`delete`, `compact`, or both as `compact,delete`)	`delete`
`log.cleaner.min.cleanable.ratio`	Minimum ratio of dirty log to total log required to trigger compaction	`0.5`
`log.cleaner.delete.retention.ms`	Time to retain delete markers in compacted topics	`86400000` (1 day)

Practical Example

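Consider a scenario where data retention is critical for compliance reasons. If a Kafka topic must retain data for at least 4 weeks, you could configure the broker default as shown below. Retention sets a lower bound, not an exact lifetime: records are removed only when their whole segment is deleted, so they can remain readable for some time after the 4 weeks have passed.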

```properties
# Set the log retention for 4 weeks
log.retention.hours=672
```
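The value 672 is simply 4 weeks expressed in hours. A small helper like the one below (hypothetical, for illustration) can derive both the hour and millisecond forms from a duration, which helps avoid arithmetic slips when the period changes.

```python
from datetime import timedelta


def retention_settings(period: timedelta) -> dict:
    """Express a retention period as hours and as milliseconds."""
    return {
        "log.retention.hours": period // timedelta(hours=1),
        "log.retention.ms": period // timedelta(milliseconds=1),
    }


settings = retention_settings(timedelta(weeks=4))
for name, value in settings.items():
    print(f"{name}={value}")
# log.retention.hours=672
# log.retention.ms=2419200000
```

Only one of the two needs to be set. If both are present, the millisecond setting is the one the broker uses.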

Conclusion

Understanding and configuring the data retention settings appropriately is essential to managing storage, ensuring data availability, and complying with data governance standards in Kafka. Whether to use time-based, size-based, or compaction strategies will depend on the specific requirements of the application and the nature of the data being handled.
