Apache Kafka
Stream Join
Data Streaming
Real-time Processing
Kafka Tutorials

Stream join example with Apache Kafka?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Apache Kafka is a distributed event streaming platform capable of handling trillions of events a day. Among its many capabilities, stream processing is a significant feature, where data streams are continuously consumed, processed, and produced. Stream joins are an integral part of this process, allowing for the merging of two streams based on certain conditions. This can be immensely useful in scenarios where you need to correlate data coming from different sources in real-time.

Understanding Stream Joins in Kafka

In Apache Kafka, stream joins are facilitated through Kafka Streams, a client library for building applications and microservices where the input and output data are stored in Kafka clusters. Kafka Streams supports various kinds of joins including:

  • Inner Joins
  • Left Joins
  • Outer Joins

These joins can be between KStream-KStream, KTable-KTable, and KStream-KTable.

Technical Example: KStream-KStream Join

Let's consider a practical example of a KStream-KStream join using Kafka Streams. Suppose we have two streams of data, orders and payments. We want to join these streams based on order ID to correlate orders with their corresponding payments.

Prerequisites

  1. Apache Kafka and ZooKeeper instances running.
  2. Topics created for 'orders' and 'payments'.

Sample Code

First, define the model classes for Order and Payment:

java
1public class Order {
2    public long orderId;
3    public String item;
4    public double amount;
5
6    // Constructor, getters and setters
7}
8
9public class Payment {
10    public long paymentId;
11    public long orderId;
12    public double paidAmount;
13
14    // Constructor, getters and setters
15}

Here, both Order and Payment classes include an orderId field which will be used as the join key.

Next, set up the streams configuration:

java
1Properties props = new Properties();
2props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-join-example");
3props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
4props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.Long().getClass());
5props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

Let's define the logic for streaming and joining the data:

java
1StreamsBuilder builder = new StreamsBuilder();
2
3// Define the order and payment streams based on the topics
4KStream<Long, Order> orderStream = builder.stream("orders");
5KStream<Long, Payment> paymentStream = builder.stream("payments");
6
7// Perform the join operation
8KStream<Long, String> joinedStream = orderStream.join(paymentStream,
9    (order, payment) -> "Order ID: " + order.orderId + ", Item: " + order.item + ", Paid: " + payment.paidAmount,
10    JoinWindows.of(Duration.ofMinutes(5)), // joining within a 5 minutes window 
11    StreamJoined.with(
12        Serdes.Long(), /* key serde */
13        Serdes.serdeFrom(Order.class), /* order serde */
14        Serdes.serdeFrom(Payment.class)) /* payment serde */
15);
16
17// Write the results to a topic
18joinedStream.to("joined-orders-payments");
19
20// Start the stream
21KafkaStreams streams = new KafkaStreams(builder.build(), props);
22streams.start();

In this example, the join operation is a windowed join; we are considering records that arrive within a 5-minute window. The result of the join is then sent to another Kafka topic.

Summary Table

FeatureDescription
Stream-Stream JoinJoin two KStreams based on a key. Requires both records to be present during the join window.
Time WindowsDefine the time span in which to perform the join operation. Illustrated with a 5-minute window in the example.
Processing GuaranteesKafka Streams supports at-least-once and exactly-once processing guarantees for joins.
Use CaseCommonly used for real-time data enrichment, correlation of events arriving from different sources.

Conclusion

Stream joining in Apache Kafka provides powerful capabilities for real-time, context-rich data processing and analytics. By leveraging Kafka's robust architecture and Kafka Streams API, developers can implement complex stream processing applications that perform sophisticated data association, aggregation, and transformation on the fly.


Course illustration
Course illustration

All Rights Reserved.