Cleanup Policy: Compact/Delete and log.retention

Cleanup policies and log retention are crucial aspects of managing data storage and ensuring efficiency in systems that handle large volumes of data, such as databases or messaging systems like Apache Kafka. In this context, compact and delete policies play pivotal roles in controlling how long data persists and how space is utilized.

Cleanup Policies

Delete Policy

The delete policy is straightforward: data older than a specified age is deleted. This is the simplest form of data management policy, typically used where it is desirable to retain data only for a fixed period for reasons like regulatory compliance or storage optimization.

For example, in Apache Kafka, the broker setting log.retention.hours determines how long messages are retained before they become eligible for deletion. It acts as the default for every topic on the broker; an individual topic can override it with the topic-level retention.ms setting. This parameter matters most where disk space is a concern or data is only relevant for a limited time.
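The following is a minimal Python sketch of age-based deletion, not Kafka's actual implementation. It assumes a simplified model in which a partition is a list of segments ordered oldest first, the last segment is the active one and is never deleted, and a closed segment expires once its newest record is older than the retention period. The names Segment and expire_by_age are hypothetical.

```python
from dataclasses import dataclass

HOUR_MS = 60 * 60 * 1000


@dataclass
class Segment:
    base_offset: int
    newest_timestamp_ms: int  # timestamp of the newest record in the segment


def expire_by_age(segments, now_ms, retention_ms):
    """Return the segments that survive age-based retention.

    Assumption: `segments` is ordered oldest first and the last entry is
    the active segment, which this simplified model always keeps.
    """
    if not segments:
        return []
    *closed, active = segments
    first_kept = 0
    for segment in closed:
        # Stop at the first segment that is still within the retention window.
        if now_ms - segment.newest_timestamp_ms <= retention_ms:
            break
        first_kept += 1
    return closed[first_kept:] + [active]


segments = [
    Segment(0, 0),
    Segment(100, 100 * HOUR_MS),
    Segment(200, 160 * HOUR_MS),
    Segment(300, 170 * HOUR_MS),
]
survivors = expire_by_age(segments, now_ms=200 * HOUR_MS, retention_ms=168 * HOUR_MS)
print([s.base_offset for s in survivors])
```

The point of the sketch is that deletion works on whole segments rather than individual messages, so a message can outlive the retention period until the segment that contains it expires.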

Compact Policy

The compact policy is more nuanced. It keeps the latest record for each key and removes older records with the same key. This policy is key in scenarios where the total dataset is valuable but needs to be kept minimal and up-to-date. It’s especially useful for log-like data where the state changes over time and only the latest state is necessary for most purposes.

Kafka uses this policy in topics where downstream consumers only care about the most recent value for a specific key. It keeps the log footprint small while guaranteeing that at least the last known value for every key is retained. Earlier values for a key are discarded, so a compacted topic holds the current state per key rather than the full history of changes.
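To make the effect of compaction concrete, here is a small Python sketch under simplifying assumptions: records are (offset, key, value) tuples in offset order, a value of None stands in for a delete marker (tombstone), and the hypothetical function compact processes the whole log at once. Kafka's real log cleaner works on segments in the background and keeps tombstones for a configurable period before dropping them, which the sketch only approximates with a flag.

```python
def compact(records, drop_tombstones=False):
    """Keep only the newest record for each key, preserving offset order.

    `records` is a list of (offset, key, value) tuples in offset order.
    A value of None is treated as a tombstone (delete marker).
    """
    latest_offset = {}
    for offset, key, _value in records:
        latest_offset[key] = offset  # later records overwrite earlier ones

    compacted = [r for r in records if latest_offset[r[1]] == r[0]]
    if drop_tombstones:
        compacted = [r for r in compacted if r[2] is not None]
    return compacted


log = [
    (0, "user-1", "alice@old.example"),
    (1, "user-2", "bob@example"),
    (2, "user-1", "alice@new.example"),
    (3, "user-2", None),  # tombstone: user-2 was deleted
    (4, "user-3", "carol@example"),
]
print(compact(log))
print(compact(log, drop_tombstones=True))
```

Note that surviving records keep their original offsets; compaction leaves gaps in the offset sequence rather than renumbering.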

Log Retention

Log retention mechanisms dictate how long logs are kept before being deleted or compacted, affecting how much historical data a system can query. Here, system administrators set parameters controlling the size and age of logs, balancing between performance, cost, and compliance requirements.

Technical Parameters in Kafka

  • log.retention.hours sets the maximum time to retain log data before it becomes eligible for deletion.
  • log.retention.bytes limits the size of each partition's log before the oldest data is deleted. The value is a number of bytes, and -1 means no size limit.
  • log.cleanup.policy can be set to "delete", "compact", or the combination "compact,delete", which compacts the log and also deletes old data as it ages out or exceeds the size limit.
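Size-based retention can be sketched in Python in the same spirit. This is a simplified model, not Kafka's code: segment sizes are given oldest first, the last entry is the active segment and is always kept, and an old segment is deleted only if the log would still be at least the configured limit afterwards. The function name expire_by_size is hypothetical.

```python
def expire_by_size(segment_sizes, retention_bytes):
    """Return the segment sizes that survive size-based retention.

    `segment_sizes` lists sizes in bytes, oldest segment first; the last
    entry is treated as the active segment. A negative `retention_bytes`
    means there is no size limit.
    """
    if retention_bytes < 0:
        return list(segment_sizes)

    excess = sum(segment_sizes) - retention_bytes
    first_kept = 0
    for size in segment_sizes[:-1]:
        # Delete a segment only if the remaining log still covers the limit.
        if excess - size < 0:
            break
        excess -= size
        first_kept += 1
    return list(segment_sizes[first_kept:])


print(expire_by_size([100, 100, 100, 100], retention_bytes=250))
print(expire_by_size([100, 100, 100, 100], retention_bytes=200))
```

Under this model the limit behaves as a floor for what is kept, not a hard cap: a partition can sit somewhat above the configured size because only whole segments are removed.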

Consider this example in Kafka configuration:

```properties
# Set the cleanup policy to compact and delete
log.cleanup.policy=compact,delete

# Retain logs for 168 hours (one week)
log.retention.hours=168

# Allow roughly 500 MB of log per partition.
# The value must be a plain number of bytes: 500 * 1024 * 1024 = 524288000
log.retention.bytes=524288000
```

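With this setup, the log cleaner compacts each partition to remove superseded values, and retention additionally deletes data that is more than one week old or that pushes a partition past the configured size limit (about 500 MB here). Deletion removes whole log segments rather than individual messages, so the limits are approximate. The result balances minimal storage against keeping up-to-date records.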
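Size settings such as log.retention.bytes take a plain integer number of bytes, so a human-readable size like 500 MB has to be converted before it goes into the configuration. The helper below is a hypothetical convenience, not part of Kafka, and it assumes binary units (1 MB = 1024 * 1024 bytes).

```python
import re

# Assumption: binary multiples, so "MB" means 1024 * 1024 bytes.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def to_bytes(text):
    """Convert a size such as '500MB' or '1 GB' to an integer byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT]?B)\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


print(f"log.retention.bytes={to_bytes('500MB')}")
print(f"log.segment.bytes={to_bytes('1GB')}")
```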

Summary Table

| Parameter           | Purpose                                                     | Common Settings          |
| log.cleanup.policy  | Determines how data is discarded.                           | delete, compact, or both |
| log.retention.hours | Time before old data is discarded.                          | 168 (7 days)             |
| log.retention.bytes | Maximum size per partition before old data is discarded.    | 524288000 (500 MB)       |
| log.segment.bytes   | Size of a log segment before a new one is created.          | 1073741824 (1 GB)        |

Conclusion

Understanding and properly configuring cleanup policies and log retention settings are crucial for efficient system performance and compliance with data governance standards. Each setting needs to be tailored to specific needs and the nature of the data being handled, such as whether it's log data in Kafka or transaction records in a database.

By managing these settings effectively, organizations can ensure they make the most out of their storage infrastructure while keeping their data accessible and compliant with regulations.

