Cleanup Policy: Compact/Delete and log.retention
Cleanup policies and log retention are crucial aspects of managing data storage and ensuring efficiency in systems that handle large volumes of data, such as databases or messaging systems like Apache Kafka. In this context, compact and delete policies play pivotal roles in controlling how long data persists and how space is utilized.
Cleanup Policies
Delete Policy
The delete policy is straightforward: data older than a specified age is deleted. This is the simplest form of data management policy, typically used where it is desirable to retain data only for a fixed period for reasons like regulatory compliance or storage optimization.
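For example, in Apache Kafka the broker-level log.retention.hours parameter sets how long messages are retained before they become eligible for deletion; the per-topic equivalent is retention.ms. Kafka deletes whole log segments rather than individual messages, so some data can outlive the configured period until its segment is removed. This parameter matters when disk space is a concern or data is only relevant for a limited time.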
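The age-based rule can be sketched as a small Python model. This is a simplified illustration, not Kafka's implementation: the `Segment` class and the `expired_segments` function are hypothetical names, and the model assumes the last segment in the list is the active one and is never deleted.

```python
from dataclasses import dataclass

HOUR_MS = 60 * 60 * 1000


@dataclass
class Segment:
    base_offset: int       # offset of the first record in the segment
    max_timestamp_ms: int  # timestamp of the newest record in the segment


def expired_segments(segments, now_ms, retention_ms):
    """Return closed segments whose newest record is older than retention_ms.

    `segments` is ordered oldest to newest. The last one is treated as the
    active segment and is never returned (an assumption of this model).
    """
    closed = segments[:-1]
    return [s for s in closed if now_ms - s.max_timestamp_ms > retention_ms]


segments = [
    Segment(base_offset=0, max_timestamp_ms=0),
    Segment(base_offset=100, max_timestamp_ms=100 * HOUR_MS),
    Segment(base_offset=200, max_timestamp_ms=190 * HOUR_MS),
]

# Retention of 168 hours, evaluated 200 hours after the first record.
doomed = expired_segments(segments, now_ms=200 * HOUR_MS, retention_ms=168 * HOUR_MS)
print([s.base_offset for s in doomed])  # [0]
```

Only the first segment is dropped: its newest record is 200 hours old, while the second segment is within the retention window and the third is still active.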
Compact Policy
The compact policy is more nuanced. For each key, it keeps the most recent record and removes older records with the same key. This policy suits scenarios where the dataset as a whole is valuable but should stay minimal and up to date. It is especially useful for log-like data where state changes over time and only the latest state is needed for most purposes.
Kafka uses this policy in topics where downstream consumers only care about the most recent value for a specific key. It keeps the log footprint small while guaranteeing that at least the latest value for every key is retained. Earlier values for a key are eventually discarded, so the full change history is not preserved.
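The idea behind compaction can be illustrated with a short Python sketch. It is a toy model under stated assumptions, not Kafka's log cleaner: the `compact` function and the sample keys and values are hypothetical, and the log is represented as a list of `(offset, key, value)` tuples.

```python
def compact(log):
    """Keep only the latest record per key, preserving original offsets.

    `log` is a list of (offset, key, value) tuples in offset order.
    """
    latest_offset = {}
    for offset, key, _value in log:
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [record for record in log if latest_offset[record[1]] == record[0]]


log = [
    (0, "user-1", "a@example.com"),
    (1, "user-2", "b@example.com"),
    (2, "user-1", "c@example.com"),
    (3, "user-3", "d@example.com"),
    (4, "user-2", "e@example.com"),
]

compacted = compact(log)
for record in compacted:
    print(record)
# (2, 'user-1', 'c@example.com')
# (3, 'user-3', 'd@example.com')
# (4, 'user-2', 'e@example.com')
```

Surviving records keep their original offsets and relative order. The earlier values for user-1 and user-2 are gone, which is why a compacted topic cannot be used to replay a key's full history.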
Log Retention
Log retention mechanisms dictate how long logs are kept before being deleted or compacted, affecting how much historical data a system can query. Here, system administrators set parameters controlling the size and age of logs, balancing between performance, cost, and compliance requirements.
Technical Parameters in Kafka
- log.retention.hours sets the maximum time to retain log data before it is deleted.
- log.retention.bytes limits the size of each partition's log before old entries are deleted.
- log.cleanup.policy can be set to "delete", "compact", or the combination "compact,delete", which compacts the log and also deletes old data as it ages out or exceeds the size limit.
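A size limit such as log.retention.bytes can be pictured as dropping the oldest segments until the log fits under the cap. The Python sketch below is a simplified model only: `segments_to_delete` is a hypothetical helper, sizes are in arbitrary units, the newest segment is assumed to be active and never dropped, and Kafka's own size accounting differs in detail.

```python
def segments_to_delete(segment_sizes, retention_bytes):
    """Return indices of the oldest segments to drop so the log fits the cap.

    `segment_sizes` is ordered oldest to newest. A negative cap is treated
    as "no size limit". The newest segment is never dropped in this model.
    """
    if retention_bytes < 0:
        return []
    total = sum(segment_sizes)
    doomed = []
    for index, size in enumerate(segment_sizes[:-1]):
        if total <= retention_bytes:
            break
        doomed.append(index)
        total -= size
    return doomed


sizes = [200, 200, 200, 100]  # oldest to newest, arbitrary units

print(segments_to_delete(sizes, retention_bytes=500))  # [0]
print(segments_to_delete(sizes, retention_bytes=300))  # [0, 1]
print(segments_to_delete(sizes, retention_bytes=-1))   # []
```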
Consider this example in Kafka configuration:
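A minimal sketch of such a broker configuration, with values inferred from the surrounding description (combined compaction and deletion, a one-week window, and a 500 MB cap expressed in bytes):

```properties
# Compact by key, and also delete data that ages out or exceeds the size cap
log.cleanup.policy=compact,delete
# Retain data for one week (7 * 24 hours)
log.retention.hours=168
# Cap each partition's log at 500 MB (500 * 1024 * 1024 bytes)
log.retention.bytes=524288000
```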
This setup compacts the log to remove superseded records and also deletes data that is more than one week old or that pushes a partition's log beyond 500 MB, balancing minimal storage against keeping up-to-date records.
Summary Table
| Parameter | Purpose | Common Settings |
| --- | --- | --- |
| log.cleanup.policy | Determines how data is discarded. | delete, compact, or both |
| log.retention.hours | Time before old data is discarded. | 168 hours (7 days) |
| log.retention.bytes | Maximum size before old data is discarded. | 500 MB |
| log.segment.bytes | Size of a log segment before a new one is created. | 1 GB |
Conclusion
Understanding and properly configuring cleanup policies and log retention settings are crucial for efficient system performance and compliance with data governance standards. Each setting needs to be tailored to specific needs and the nature of the data being handled, such as whether it's log data in Kafka or transaction records in a database.
By managing these settings effectively, organizations can ensure they make the most out of their storage infrastructure while keeping their data accessible and compliant with regulations.

