Delete Messages from a Topic in Apache Kafka

Apache Kafka is a distributed event streaming platform capable of handling trillions of events a day. It is designed to allow applications to publish and subscribe to streams of records in a fault-tolerant and durable manner. In many use cases, Kafka is employed as a kind of write-ahead log where data is durably recorded, allowing numerous consumers to read from it without impacting each other's read progress.

Understanding How Kafka Manages Data

Kafka stores streams of records in categories called topics. Data within a topic is organized into partitions, where each partition is an ordered, immutable sequence of records. Each record in a partition is assigned a sequential ID number called the offset, which uniquely identifies the record within that partition.

Kafka does not allow for the direct deletion of a specific message once it is written to a topic. Instead, data is deleted using two primary mechanisms:

  1. Retention Policies: Records in a topic can be purged based on age or size limits set on the topic.
  2. Log Compaction: Kafka retains at least the latest value for each key in the log and discards older values for that key, however many updates the key has received.

Retention Policies

Retention policies in Kafka are configured at the topic level through two main settings:

  • retention.ms: The maximum time Kafka retains records before they become eligible for deletion. The default is 7 days.
  • retention.bytes: The maximum size a partition's log may reach before old data is discarded. The limit applies per partition, not to the topic as a whole. When it is exceeded, the oldest log segments are deleted to make room for newer ones. The default is -1, meaning no size limit.

Broker-level defaults (log.retention.ms, log.retention.hours, and log.retention.bytes) apply to every topic unless they are overridden per topic. For example:

bash
# Set a retention policy on a specific topic (1680000 ms = 28 minutes)
kafka-configs --bootstrap-server localhost:9092 --entity-type topics --entity-name my_topic --alter --add-config retention.ms=1680000
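
To make the effect of the two settings more concrete, here is a toy model in plain Java. It is a sketch, not Kafka's implementation. It assumes that retention removes whole log segments rather than individual records and that the newest (active) segment is never deleted. The class name RetentionSketch, the Segment record, and the applyRetention method are hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of Kafka-style retention; all names here are hypothetical. */
final class RetentionSketch {

    /** A log segment: its first offset, its size, and the timestamp of its newest record. */
    record Segment(long baseOffset, long sizeBytes, long maxTimestampMs) {}

    /**
     * Returns the segments that survive one retention pass.
     * A retentionMs or retentionBytes of -1 disables that limit.
     * The last segment in the list is treated as the active segment and is always kept.
     */
    static List<Segment> applyRetention(List<Segment> segments, long nowMs,
                                        long retentionMs, long retentionBytes) {
        List<Segment> kept = new ArrayList<>(segments);

        // Time-based: drop oldest segments whose newest record is older than retentionMs.
        while (kept.size() > 1 && retentionMs >= 0
                && nowMs - kept.get(0).maxTimestampMs() > retentionMs) {
            kept.remove(0);
        }

        // Size-based: drop oldest segments while the rest still meets or exceeds the limit.
        long total = kept.stream().mapToLong(Segment::sizeBytes).sum();
        while (kept.size() > 1 && retentionBytes >= 0
                && total - kept.get(0).sizeBytes() >= retentionBytes) {
            total -= kept.get(0).sizeBytes();
            kept.remove(0);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Segment> log = List.of(
                new Segment(0, 100, 1_000),
                new Segment(50, 100, 5_000),
                new Segment(90, 40, 9_000)); // active segment
        // At t=10_000 with retentionMs=6_000, only the first segment has expired.
        System.out.println(applyRetention(log, 10_000, 6_000, -1));
    }
}
```

Because whole segments are removed, a record can outlive its nominal retention period until every record in its segment has expired, and a partition can temporarily exceed retention.bytes by up to one segment.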

Log Compaction

Log compaction is a feature targeted at scenarios where the same key may be updated multiple times. Instead of retaining all records for a key, Kafka will compact the log to ensure that it only retains the latest update for each key. This is particularly useful for restoring state in systems like databases or cache layers.
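
The idea can be illustrated with a small, self-contained Java sketch. This is a toy model rather than Kafka's log cleaner: it works on an in-memory list, and the names CompactionSketch, LogRecord, and compact are hypothetical. It loosely follows a two-pass approach, first recording the latest offset for each key and then keeping only the records at those offsets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of log compaction; not Kafka's log cleaner. Names are hypothetical. */
final class CompactionSketch {

    record LogRecord(long offset, String key, String value) {}

    /** Keeps only the highest-offset record for each key, preserving offset order. */
    static List<LogRecord> compact(List<LogRecord> log) {
        // First pass: remember the last offset seen for every key.
        Map<String, Long> latestOffset = new HashMap<>();
        for (LogRecord r : log) {
            latestOffset.put(r.key(), r.offset());
        }
        // Second pass: keep a record only if it is the latest one for its key.
        List<LogRecord> out = new ArrayList<>();
        for (LogRecord r : log) {
            if (latestOffset.get(r.key()) == r.offset()) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<LogRecord> log = List.of(
                new LogRecord(0, "k1", "a"),
                new LogRecord(1, "k2", "b"),
                new LogRecord(2, "k1", "c"),
                new LogRecord(3, "k3", "d"),
                new LogRecord(4, "k2", "e"));
        // Offsets 0 and 1 are superseded by later records for k1 and k2.
        compact(log).forEach(System.out::println);
    }
}
```

Surviving records keep their original offsets, so the compacted log has gaps in its offset sequence but the relative order of records is unchanged.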

Configuring log compaction involves setting the following properties on a topic:

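  • cleanup.policy: Set this to compact to enable log compaction. The value compact,delete combines compaction with time- and size-based retention.
  • min.cleanable.dirty.ratio: The minimum fraction of the log that must be uncompacted ("dirty") before the log cleaner considers the log for compaction. The default is 0.5.
  • delete.retention.ms: How long tombstone (delete) markers are kept in a compacted topic before they are removed, which gives consumers time to observe the deletion. The default is 24 hours.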

Example configuration:

bash
kafka-configs --bootstrap-server localhost:9092 --entity-type topics --entity-name config-topic --alter --add-config cleanup.policy=compact
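
As a rough illustration of how min.cleanable.dirty.ratio gates compaction, the sketch below computes the dirty ratio as the uncompacted bytes divided by the total bytes of the log and compares it with the threshold. It is a simplification under stated assumptions: the real log cleaner also takes settings such as min.compaction.lag.ms and max.compaction.lag.ms into account, and the names DirtyRatioSketch, dirtyRatio, and eligibleForCleaning are hypothetical.

```java
/** Toy eligibility check for compaction; names are hypothetical. */
final class DirtyRatioSketch {

    /** Fraction of the log that has not been compacted yet. */
    static double dirtyRatio(long cleanBytes, long dirtyBytes) {
        long total = cleanBytes + dirtyBytes;
        return total == 0 ? 0.0 : (double) dirtyBytes / total;
    }

    /** True when the dirty fraction exceeds the configured minimum ratio. */
    static boolean eligibleForCleaning(long cleanBytes, long dirtyBytes,
                                       double minCleanableDirtyRatio) {
        return dirtyRatio(cleanBytes, dirtyBytes) > minCleanableDirtyRatio;
    }

    public static void main(String[] args) {
        // 400 of 1000 bytes are dirty: below a 0.5 threshold, so not yet eligible.
        System.out.println(eligibleForCleaning(600, 400, 0.5));
        // 700 of 1000 bytes are dirty: above the threshold, so eligible.
        System.out.println(eligibleForCleaning(300, 700, 0.5));
    }
}
```

A lower ratio makes compaction run more often, trading extra cleaner I/O for less duplicated data on disk.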

Deleting Records Directly: Tombstone Messages

On a compacted topic, an individual key can be removed by producing a tombstone message: a record with that key and a null value. When Kafka's log cleaner processes the tombstone, it discards all earlier records for that key. The tombstone itself is kept for delete.retention.ms so that consumers can observe the deletion, and is then removed as well. Tombstones only have this effect when log compaction is enabled; on a topic whose cleanup.policy is delete, a null-valued record is treated like any other record.

Example of producing a tombstone message:

java
producer.send(new ProducerRecord<>("my-topic", key, null));
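
The following toy model sketches what happens to a tombstone during compaction. It is not Kafka's log cleaner, and it makes one loudly simplifying assumption: the tombstone's age is measured from its own timestamp, whereas Kafka tracks when the tombstone became eligible for removal. The names TombstoneSketch, LogRecord, and compact are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of tombstone handling during compaction; names are hypothetical. */
final class TombstoneSketch {

    /** A null value marks the record as a tombstone for its key. */
    record LogRecord(long offset, String key, String value, long timestampMs) {
        boolean isTombstone() {
            return value == null;
        }
    }

    /** Keeps the latest record per key, then drops tombstones older than deleteRetentionMs. */
    static List<LogRecord> compact(List<LogRecord> log, long nowMs, long deleteRetentionMs) {
        Map<String, Long> latestOffset = new HashMap<>();
        for (LogRecord r : log) {
            latestOffset.put(r.key(), r.offset());
        }
        List<LogRecord> out = new ArrayList<>();
        for (LogRecord r : log) {
            boolean isLatest = latestOffset.get(r.key()) == r.offset();
            // Simplification: age is measured from the tombstone's own timestamp.
            boolean expiredTombstone =
                    r.isTombstone() && nowMs - r.timestampMs() > deleteRetentionMs;
            if (isLatest && !expiredTombstone) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<LogRecord> log = List.of(
                new LogRecord(0, "user-1", "a", 1_000),
                new LogRecord(1, "user-2", "b", 2_000),
                new LogRecord(2, "user-1", null, 3_000)); // tombstone for user-1
        // Shortly after the tombstone is written, it replaces the old user-1 value.
        System.out.println(compact(log, 4_000, 5_000));
        // Once the retention window has passed, the tombstone disappears too.
        System.out.println(compact(log, 10_000, 5_000));
    }
}
```

In the first pass the old value for user-1 is gone but the tombstone is still visible to consumers; in the second pass no trace of user-1 remains.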

Summary Table

| Term | Description |
|------|-------------|
| retention.ms | Time after which data becomes eligible for deletion from the topic. |
| retention.bytes | Maximum size of each partition's log before the oldest segments are deleted. |
| cleanup.policy | Policy for handling old entries: delete, compact, or both (compact,delete). |
| min.cleanable.dirty.ratio | Minimum ratio of dirty (uncompacted) log to total log that makes a compacted log eligible for cleaning. |
| delete.retention.ms | Time a tombstone is retained in a compacted topic before it is itself removed. |

Conclusion

Kafka does not support the per-record delete operation found in databases and some other messaging systems. Instead, it manages the data lifecycle through retention policies, log compaction, and, on compacted topics, tombstone messages. Configuring these features correctly lets you balance performance, storage, and consistency according to the needs of the application.

