Choosing the right cleanup policy in Kafka configuration

Apache Kafka is a distributed streaming platform whose topic-level settings let you tune performance and storage for different use cases. One of the most important is the cleanup policy, which determines how Kafka discards old data from a topic's log. Choosing the right policy keeps storage under control, maintains performance, and ensures that data retention matches what the application needs.

Cleanup Policies in Kafka

Kafka offers two primary cleanup policies:

  1. Delete: Old log segments are deleted once a retention period or size threshold is reached.
  2. Compact: An older record is removed once a newer record with the same key exists.

Each policy suits different scenarios, and the two are sometimes used together. Administrators can set the policy per topic, which gives fine-grained control over data management.

Delete Policy

The delete policy is straightforward: Kafka removes old log segments once they exceed a configured age or the log exceeds a configured size. This policy suits scenarios where data loses relevance over time, such as log aggregation or time-sensitive data.

To configure the delete policy, set the following properties in the topic configuration:

  • cleanup.policy=delete: This setting enables the delete policy.
  • retention.ms: How long, in milliseconds, records are retained in the log before they become eligible for deletion.
  • retention.bytes: The maximum size, in bytes, that each partition's log may reach before older segments are deleted. The limit applies per partition, not to the topic as a whole.

Example:

plaintext
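# Create a topic with a retention period of 24 hours.
# --bootstrap-server replaces the --zookeeper flag, which was removed from
# kafka-topics.sh in Kafka 3.0; localhost:9092 is an assumed broker address.
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic example-topic --config cleanup.policy=delete --config retention.ms=86400000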
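
Both retention settings take raw numbers (milliseconds and bytes), which are easy to mistype. The following is a minimal sketch of how the values might be derived from more readable units. The 24-hour window, the 1 GiB limit, and the partition count of 6 are illustrative assumptions, not recommendations.

```shell
#!/bin/sh
# Illustrative inputs (assumptions; adjust for your workload).
retention_hours=24
retention_gib=1
partitions=6

# retention.ms is expressed in milliseconds.
retention_ms=$((retention_hours * 60 * 60 * 1000))

# retention.bytes is expressed in bytes and applies to each partition.
retention_bytes=$((retention_gib * 1024 * 1024 * 1024))

# Rough upper bound for the whole topic on one replica.
topic_bytes=$((retention_bytes * partitions))

echo "retention.ms=${retention_ms}"
echo "retention.bytes=${retention_bytes}"
echo "approximate topic size per replica: ${topic_bytes} bytes"
```

With these inputs the script prints retention.ms=86400000 and retention.bytes=1073741824, the same values used elsewhere in this article.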

Compact Policy

The compact policy is key-based: Kafka retains at least the last known value for each key. It is particularly useful where only the latest state matters, such as databases or caches.

Compaction does not delete every old record, and it is not driven by time or log size. It removes a record only when a newer record with the same key exists, so the compacted part of the log holds roughly one record per key and keeps growing only if new keys keep arriving.
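
To make the key-based behavior concrete, here is a toy model of compaction written with awk. It is only a sketch of the idea (keep the newest record per key, in offset order), not how the Kafka log cleaner is implemented, and the sample records are made up for illustration.

```shell
#!/bin/sh
# Toy log: each line is "offset key value" (hypothetical sample data).
log='0 user1 alice@old.example
1 user2 bob@old.example
2 user1 alice@new.example
3 user3 carol@example.com
4 user2 bob@new.example'

# Remember the line number of the newest record for each key, then
# print only those records, preserving their original order.
compacted=$(printf '%s\n' "$log" | awk '
  { latest[$2] = NR; rec[NR] = $0; key[NR] = $2 }
  END {
    for (i = 1; i <= NR; i++)
      if (latest[key[i]] == i) print rec[i]
  }')

printf '%s\n' "$compacted"
```

The output keeps offsets 2, 3, and 4: the older values for user1 and user2 are dropped, while the surviving records keep their original offsets and order.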

Key properties to configure the compact policy include:

  • cleanup.policy=compact: Enables log compaction.
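  • min.cleanable.dirty.ratio: The fraction of the log that must be "dirty" (not yet compacted) before the log becomes eligible for a compaction pass.
  • segment.ms and segment.bytes: Control when a new log segment file is rolled. Compaction works on closed segments and never touches the active segment, so these settings affect how soon records can be compacted.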

Example:

plaintext
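# Create a topic with log compaction enabled.
# --bootstrap-server replaces the --zookeeper flag, which was removed from
# kafka-topics.sh in Kafka 3.0; localhost:9092 is an assumed broker address.
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic state-topic --config cleanup.policy=compact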
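
The effect of min.cleanable.dirty.ratio can be illustrated with a small calculation. The sketch below uses a simplified model in which the dirty ratio is the dirty bytes divided by the total of clean and dirty bytes. The is_eligible helper and the byte counts are hypothetical, and the real log cleaner considers more than this single ratio.

```shell
#!/bin/sh
# is_eligible CLEAN_BYTES DIRTY_BYTES MIN_RATIO
# Simplified model (an assumption, not Kafka's own code):
#   dirty ratio = dirty bytes / (clean bytes + dirty bytes)
# Prints "yes" when the ratio exceeds MIN_RATIO, otherwise "no".
is_eligible() {
  awk -v c="$1" -v d="$2" -v t="$3" \
    'BEGIN { print ((d / (c + d) > t) ? "yes" : "no") }'
}

# 400 dirty out of 1000 total gives a ratio of 0.4, below the 0.5 threshold.
is_eligible 600 400 0.5

# 600 dirty out of 1000 total gives a ratio of 0.6, above the threshold.
is_eligible 400 600 0.5
```

A lower threshold makes compaction run more often, which keeps the log smaller at the cost of more cleaner I/O.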

Choosing the Right Policy

The choice between delete and compact depends on your specific application needs:

  • High-volume logging or event streaming: If data is only relevant for a limited time, use the delete policy.
  • Stateful applications like stores or databases: If you must preserve the latest state per key, use the compact policy.

The two policies can also be combined by setting cleanup.policy to delete,compact. Kafka then compacts the log by key and also deletes segments that exceed the retention time or size, so even the latest value for a key is removed once it ages out.
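
The decision above can be sketched as a small helper that maps two yes/no answers to a policy value and then prints a kafka-configs.sh command for review. The choose_policy function, the broker address, and the topic name are hypothetical. The command is printed rather than executed so the sketch runs without a cluster.

```shell
#!/bin/sh
# choose_policy KEYED EXPIRES
#   KEYED   = yes if consumers need the latest value per key
#   EXPIRES = yes if old data should also age out by time or size
choose_policy() {
  case "$1:$2" in
    yes:yes) echo "delete,compact" ;;
    yes:no)  echo "compact" ;;
    *)       echo "delete" ;;
  esac
}

policy=$(choose_policy yes yes)

# Hypothetical broker address and topic name; adjust for your cluster.
bootstrap="localhost:9092"
topic="state-topic"

# Square brackets group the comma-separated list inside --add-config.
# The single quotes stop the shell from treating the brackets as a glob.
cmd="bin/kafka-configs.sh --bootstrap-server ${bootstrap} --entity-type topics --entity-name ${topic} --alter --add-config 'cleanup.policy=[${policy}]'"

echo "$cmd"
```

Changing the policy on an existing topic this way avoids recreating it, but it is worth reviewing the printed command before running it against a real cluster.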

Summary Table

Policy   Use Case                       Configuration Key          Value Example
Delete   Time-sensitive data            cleanup.policy             delete
                                        retention.ms               86400000 (1 day)
                                        retention.bytes            1073741824 (1 GiB)
Compact  Maintain latest state per key  cleanup.policy             compact
                                        min.cleanable.dirty.ratio  0.5
                                        segment.ms                 604800000 (1 week)
                                        segment.bytes              104857600 (100 MiB)

Understanding and properly configuring the cleanup policy in Kafka is critical for optimizing storage, system performance, and data relevancy. It's important to carefully consider your specific use case and configure appropriately to ensure that Kafka efficiently manages your data.