Can't set Kafka retention policy to both compact and delete

In Apache Kafka, a popular distributed streaming platform, how messages are retained or deleted directly affects both the performance and the storage utilization of a cluster. The two cleanup policies are delete and compact, selected per topic through the cleanup.policy setting. Because they have different design goals, combining them on a single topic is not straightforward: brokers older than 0.10.1 rejected the combination, and newer ones accept cleanup.policy=compact,delete but apply semantics that differ from either policy on its own. This article explores where the two policies conflict and details each one to help users make informed decisions based on their use cases.

Understanding Kafka Retention Policies

Delete Policy

The delete policy (cleanup.policy=delete, the default) removes old records from a log once a threshold is crossed, either time-based (retention.ms) or size-based (retention.bytes, applied per partition). This policy is straightforward and useful in scenarios where the freshness of data is essential and there is a defined window of relevance for the data being stored.

Example:

For a topic configured with retention.ms set to 2 days (172800000 ms), messages older than this period become eligible for deletion and are purged automatically once the log segment that contains them is removed.
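The time-based rule can be illustrated with a small Python sketch. It is a simplified model rather than broker code: Segment and apply_time_retention are hypothetical names, and the sketch assumes that deletion happens one whole segment at a time, based on the newest record in the segment, and that the last (active) segment is always kept.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical stand-in for a log segment: only its record timestamps (ms)."""
    timestamps_ms: List[int]


def apply_time_retention(segments: List[Segment], now_ms: int, retention_ms: int) -> List[Segment]:
    """Drop closed, non-empty segments whose newest record is older than retention_ms.

    Assumption: the last segment is the active one and is always kept.
    """
    if not segments:
        return []
    *closed, active = segments
    kept = [s for s in closed if now_ms - max(s.timestamps_ms) <= retention_ms]
    return kept + [active]


HOUR_MS = 60 * 60 * 1000
TWO_DAYS_MS = 2 * 24 * HOUR_MS  # the retention.ms value from the example

now = 100 * HOUR_MS
segments = [
    Segment([now - 72 * HOUR_MS, now - 60 * HOUR_MS]),  # newest record is 60 h old
    Segment([now - 50 * HOUR_MS, now - 40 * HOUR_MS]),  # newest record is 40 h old
    Segment([now - 1 * HOUR_MS]),                       # active segment
]
remaining = apply_time_retention(segments, now, TWO_DAYS_MS)
print(len(remaining))  # 2
```

In this model the second segment survives even though it holds a record that is 50 hours old, because its newest record is still inside the two-day window.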

Compact Policy

The compact policy, on the other hand, is designed for scenarios where it's essential to retain at least the latest version of each record. It works by removing older records with the same key in the background, so that the log eventually contains only the most recent update for each unique key; records therefore need a key to be compacted. This policy is especially useful in event sourcing or in systems where maintaining a full history of record changes is unnecessary.

Example:

If a topic is configured with the log compaction feature, and messages with keys A, A, and B are stored with the values 1, 2, and 1, respectively, log compaction would eventually reduce this to just A -> 2 and B -> 1.
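The same example can be written as a minimal Python sketch. The compact function is a hypothetical helper that models only the end result (the last record per key, in log order); it ignores details of real compaction such as segment boundaries, offsets, and tombstones.

```python
from typing import List, Tuple

Record = Tuple[str, int]  # (key, value)


def compact(log: List[Record]) -> List[Record]:
    """Keep only the last record for each key, preserving log order."""
    # Later entries overwrite earlier ones, so each key maps to its last index.
    last_index = {key: i for i, (key, _) in enumerate(log)}
    return [rec for i, rec in enumerate(log) if last_index[rec[0]] == i]


log = [("A", 1), ("A", 2), ("B", 1)]
compacted = compact(log)
print(compacted)  # [('A', 2), ('B', 1)]
```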

Why Not Both?

At first glance, it might appear beneficial to combine these policies, so that storage isn't overused because only a recent set of data is retained, while the latest update for each key is still kept. However, the combination was not supported before Kafka 0.10.1, and even where cleanup.policy=compact,delete is accepted there are several reasons to be cautious:

  1. Conflicting Goals: Compaction is key-oriented and, on its own, keeps the latest message for each key regardless of age. The delete policy expunges data beyond a specific age or size, regardless of its content or importance. When both apply, some rule has to decide which takes precedence; with cleanup.policy=compact,delete, retention wins, so the latest value for a key is also removed once it ages out, which can look like data loss to consumers that expect a complete latest-state view.
  2. Technical Complexity: Applying both policies to the same topic increases the complexity of the broker's storage management. The broker has to track not only the age and size of log segments but also the state of keys across segments that may be at different stages of deletion and compaction.
  3. Performance Impact: Running both policies adds work for the broker. Log compaction is more resource-intensive than simple deletion because it reads and rewrites segments, whereas deletion drops whole segments. Combining them subjects the same log to both kinds of background work, increasing I/O and CPU usage.
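The conflict between the two goals can be sketched with a toy Python model. It is an illustration under stated assumptions, not broker behavior: compact and expire are hypothetical helpers, and expiry is applied per record here, whereas a real broker works on whole segments.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    key: str
    value: int
    timestamp_ms: int


def compact(log: List[Record]) -> List[Record]:
    """Keep only the last record for each key, preserving log order."""
    last_index = {rec.key: i for i, rec in enumerate(log)}
    return [rec for i, rec in enumerate(log) if last_index[rec.key] == i]


def expire(log: List[Record], now_ms: int, retention_ms: int) -> List[Record]:
    """Drop every record older than retention_ms (per record, for simplicity)."""
    return [rec for rec in log if now_ms - rec.timestamp_ms <= retention_ms]


log = [
    Record("A", 1, 0),
    Record("B", 1, 1_000),
    Record("A", 2, 9_000),
]
now_ms, retention_ms = 10_000, 5_000

compact_only = compact(log)
compact_and_delete = expire(compact(log), now_ms, retention_ms)

print([(r.key, r.value) for r in compact_only])        # [('B', 1), ('A', 2)]
print([(r.key, r.value) for r in compact_and_delete])  # [('A', 2)]
```

With compaction alone, key B keeps its latest value however old it is. Once age-based deletion is layered on top, B disappears entirely, which is the precedence question raised in the first point.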

Recommendations for Users

Depending on your data requirements, you should carefully choose between these policies:

  • Use delete when data relevance is time-bound.
  • Use compaction when a complete record history isn't necessary but maintaining the latest state per key is crucial.
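As a rough illustration of these two choices, the following Python sketch builds the topic-level overrides for each case. The retention_config helper and its parameters are hypothetical; only the configuration keys cleanup.policy and retention.ms are Kafka's own.

```python
from typing import Dict, Optional


def retention_config(latest_state_per_key: bool, max_age_ms: Optional[int] = None) -> Dict[str, str]:
    """Return topic-level overrides for the chosen cleanup policy (hypothetical helper)."""
    if latest_state_per_key:
        # Latest state per key matters more than age: compact.
        return {"cleanup.policy": "compact"}
    # Data relevance is time-bound: delete, optionally with an explicit window.
    config = {"cleanup.policy": "delete"}
    if max_age_ms is not None:
        config["retention.ms"] = str(max_age_ms)
    return config


metrics_topic = retention_config(latest_state_per_key=False, max_age_ms=2 * 24 * 60 * 60 * 1000)
profiles_topic = retention_config(latest_state_per_key=True)
print(metrics_topic)   # {'cleanup.policy': 'delete', 'retention.ms': '172800000'}
print(profiles_topic)  # {'cleanup.policy': 'compact'}
```

The resulting dictionaries would still have to be applied through whatever admin tooling is in use; that step is left out of the sketch.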

Table: Key Characteristics of Retention Policies

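Policy  | Use case                         | Configuration                 | Data stored
--------+----------------------------------+-------------------------------+-----------------------------------------------
Delete  | Time-bound data relevance        | retention.ms, retention.bytes | Only messages within the specified age/size
Compact | Maintaining latest state per key | cleanup.policy=compact        | Latest message per key, regardless of age/size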

Conclusion

While combining delete and compact policies in Kafka might seem appealing for optimizing storage and retaining critical data, each policy is designed for a different scenario, and combining them gives up compaction's guarantee that the latest value for every key is kept. Understanding these nuances and applying the policy that matches your system's requirements helps maintain good Kafka performance and data integrity. Selecting the correct strategy often involves a trade-off between performance, storage utilization, and data availability, and should be aligned with specific application needs.

