Kafka
Data Compression
Codec
Message Processing
Data Decompression

Kafka message codec - compress and decompress

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Apache Kafka, a popular distributed streaming platform, is designed for handling real-time data feeds with high-throughput and low-latency characteristics. Effective data processing in Kafka often requires the use of compression techniques to optimize network usage and increase performance. This article covers the ways Kafka supports message compression and decompression, technical details, and examples of implementation.

Why Compress Messages in Kafka?

Compression in Kafka has a dual benefit: it reduces the size of the data being transmitted through the network between producers, brokers, and consumers which saves bandwidth and decreases storage on Kafka brokers. Moreover, Kafka's data structure inherently supports batch compression which can compress a set of messages together, further optimizing the data size.

Supported Compression Codecs

Kafka currently supports multiple compression codecs:

  • None (no compression)
  • GZIP
  • Snappy
  • LZ4
  • ZSTD (since Kafka 2.1)

Compression Implementation

When Kafka producers send messages to brokers, they have the option to apply compression at the message batch level. Here is an outline of how data is typically compressed and decompressed:

  1. Compression at Producer:
    • Producers collect messages that are destined for the same partition into a message set, known in newer versions of Kafka as a "record batch."
    • The entire record batch is compressed using the specified codec, reducing the size of the data sent to the Kafka brokers.
  2. Storage at Broker:
    • The Kafka broker stores the compressed message batch as is, without decompressing it. This helps in reducing I/O operations and saves storage.
  3. Decompression at Consumer:
    • Consumers receive the compressed batch of messages and decompress it upon receipt.

Example: Using Compression in Kafka Producer

Let's see how to configure a Kafka producer with compression using Java:

java
1import org.apache.kafka.clients.producer.KafkaProducer;
2import org.apache.kafka.clients.producer.ProducerRecord;
3import org.apache.kafka.clients.producer.ProducerConfig;
4import java.util.Properties;
5
6public class CompressedKafkaProducer {
7    public static void main(String[] args) {
8        Properties props = new Properties();
9        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
10        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
11        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
12        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip"); // Setting compression type
13
14        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
15
16        try {
17            for (int i = 0; i < 100; i++) {
18                producer.send(new ProducerRecord<>("my-topic", "key-" + i, "value-" + i));
19            }
20        } finally {
21            producer.close();
22        }
23    }
24}

In the code above, the COMPRESSION_TYPE_CONFIG property is set to "gzip". You can replace "gzip" with "snappy", "lz4", or "zstd" depending on your requirements.

Performance Implications

Using compression can significantly increase Kafka’s throughput and reduce the data footprint at the cost of increased CPU usage for compression and decompression processes. The choice of compression codec can impact both the compression ratio and the computational requirements.

Compression Codec Comparison

CodecCompression RatioSpeedCPU UsageCompatibility
None1:1FastestVery LowAll
GZIPHighSlowHighAll
SnappyMediumFastLowAll
LZ4Medium-HighFastMediumAll
ZSTDVery HighMediumMedium-HighKafka 2.1+

Conclusion

Effective use of compression in Kafka can lead to better utilization of network and storage resources, although it requires careful consideration of the trade-offs between speed, CPU usage, and compression effectiveness. By analyzing the specific needs and constraints of an application, users can choose the appropriate compression codec and optimize their Kafka deployment for both performance and cost.


Course illustration
Course illustration

All Rights Reserved.