How to get a windowed aggregation from a Kafka stream?

Apache Kafka is a distributed event streaming platform capable of handling trillions of events a day. Kafka Streams is a client library for building applications and microservices where the input and output data is stored in Kafka clusters. It provides a straightforward way to do stream processing using a functional style API.

Understanding Windowed Aggregation in Kafka Streams

Windowed aggregation is a powerful concept in stream processing, which involves grouping events that are close in time to compute some aggregate result. This is particularly useful for applications that need to analyze data over specific periods, such as calculating hourly averages or counting events within 30-second windows.

Kafka Streams supports various types of windows:

  • Tumbling windows: These are fixed-sized, non-overlapping and contiguous intervals.
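  • Hopping windows: These are fixed-sized windows that start at every advance interval ("hop"). When the advance is smaller than the window size, the windows overlap and a record can belong to several of them.
  • Sliding windows: These are fixed-duration windows aligned to record timestamps rather than to the clock, so the window moves with each incoming event. Two records share a window if their timestamps differ by no more than the window duration.
  • Session windows: These have no fixed size. They group records of the same key into periods of activity separated by a configured inactivity gap.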
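To make the difference between tumbling and hopping windows concrete, here is a minimal plain-Java sketch (no Kafka dependency) of how a record timestamp could be mapped to epoch-aligned windows. The class and method names are illustrative assumptions, not part of the Kafka Streams API, and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: maps a timestamp to epoch-aligned windows.
class WindowAssignment {

    // Start of the single tumbling window [start, start + sizeMs) containing the timestamp.
    static long tumblingWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Starts of all hopping windows [start, start + sizeMs) containing the timestamp.
    static List<Long> hoppingWindowStarts(long timestampMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        // Earliest aligned start whose window still covers the timestamp.
        long first = Math.max(0, timestampMs - sizeMs + advanceMs) / advanceMs * advanceMs;
        for (long start = first; start <= timestampMs; start += advanceMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long ts = 7 * 60_000L; // a record at minute 7
        System.out.println(tumblingWindowStart(ts, 5 * 60_000L));
        System.out.println(hoppingWindowStarts(ts, 5 * 60_000L, 60_000L));
    }
}
```

For a record at minute 7 with 5-minute windows, the tumbling window starts at minute 5, while a 1-minute advance places the same record in the five hopping windows starting at minutes 3 through 7.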

Key Concepts and Components

Before diving into examples, it's crucial to understand some key components:

  • KStream and KTable: A KStream is a record stream in which each record is a self-contained datum in an unbounded dataset. A KTable is a changelog stream in which each record is an update to the current value for its key. A windowed aggregation on a KStream produces a KTable keyed by Windowed<K>.
  • Time windows: Define the span of time over which the aggregation is applied.
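The stream and table views can be illustrated without Kafka at all. The sketch below, in plain Java with hypothetical names, replays the same sequence of records twice: as a stream every record is kept, while as a table only the latest value per key survives.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the same records seen as a stream and as a table.
class StreamVsTable {

    // Table view: replay the records in order and keep the latest value per key.
    static Map<String, Long> toTable(List<Map.Entry<String, Long>> stream) {
        Map<String, Long> table = new LinkedHashMap<>();
        for (Map.Entry<String, Long> record : stream) {
            table.put(record.getKey(), record.getValue());
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> stream = List.of(
            Map.entry("alice", 1L), Map.entry("bob", 2L), Map.entry("alice", 3L));
        System.out.println("stream has " + stream.size() + " records");
        System.out.println("table holds " + toTable(stream));
    }
}
```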

Step-by-Step: Implementing Windowed Aggregation

To implement windowed aggregation, you would typically follow the steps outlined below:

  1. Create an input Kafka Streams topology to read from a source Kafka topic.
  2. Define the window operation according to the specific requirements (tumbling, hopping, sliding, or session windows).
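  3. Apply an aggregation such as count(), reduce(), or aggregate(). Kafka Streams has no built-in sum or average, so those are built on top of reduce() or aggregate().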
  4. Write the results back to a Kafka topic or another sink.
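```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Long> source =
    builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.Long()));

// Tumbling window of 5 minutes (Kafka Streams 3.0+; older releases use TimeWindows.of)
KTable<Windowed<String>, Long> aggregatedStream = source
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .count();

// Flatten the windowed key to "key@windowStartMs" so plain String/Long serdes apply
aggregatedStream.toStream()
    .map((windowedKey, count) ->
        KeyValue.pair(windowedKey.key() + "@" + windowedKey.window().start(), count))
    .to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));
```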

In this example, data is read from input-topic, grouped by key, and a count is computed over each 5-minute window. The resulting counts are written to output-topic.
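To see what such a topology computes, the following plain-Java sketch simulates a per-key count over 5-minute tumbling windows on a small in-memory batch. It is only a model of the result, not how Kafka Streams executes it; the class name and the "key@windowStart" output format are assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: per-key counts over 5-minute tumbling windows.
class TumblingCountSketch {
    static final long WINDOW_MS = 5 * 60_000L;

    // keys[i] is the key of the record with timestamp timestampsMs[i].
    // Returns one count per "key@windowStartMs".
    static Map<String, Long> countPerWindow(String[] keys, long[] timestampsMs) {
        Map<String, Long> counts = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            long windowStart = timestampsMs[i] - (timestampsMs[i] % WINDOW_MS);
            counts.merge(keys[i] + "@" + windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "a", "b", "a"};
        long[] timestamps = {60_000L, 240_000L, 120_000L, 360_000L};
        System.out.println(countPerWindow(keys, timestamps));
    }
}
```

Here key "a" is counted twice in the window starting at 0 and once in the window starting at 300000, because its third record arrives after the first window has ended.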

Advanced Topics in Windowed Aggregation

Late Events Handling

Handling late events, i.e., events that arrive after the end time of their window has passed, is crucial. Kafka Streams manages this with a grace period: a window keeps accepting records for a configured time after its end, and records that arrive later than that are dropped. The grace period is set when the window is defined.

```java
// Kafka Streams 3.0+; older releases used TimeWindows.of(...).grace(...)
.windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(1)))
```

In this snippet, a 1-minute grace period is added, allowing late events to be included in the respective window up to one minute after the window has closed.
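The accept-or-drop decision can be sketched in plain Java. This is a simplified model that assumes "stream time" is the highest record timestamp seen so far and that a window closes once stream time reaches the window end plus the grace period; the names are hypothetical.

```java
// Illustrative sketch only: is a late record still accepted into its window?
class GracePeriodSketch {

    // A record is accepted while stream time is before (window end + grace).
    static boolean isAccepted(long recordTsMs, long streamTimeMs, long sizeMs, long graceMs) {
        long windowStart = recordTsMs - (recordTsMs % sizeMs);
        long windowEnd = windowStart + sizeMs;
        return streamTimeMs < windowEnd + graceMs;
    }

    public static void main(String[] args) {
        long size = 5 * 60_000L;
        long grace = 60_000L;
        long lateRecord = 290_000L; // belongs to the window [0, 300000)
        // Stream time 50 s past the window end: still within the 1-minute grace period.
        System.out.println(isAccepted(lateRecord, 350_000L, size, grace));
        // Stream time 60 s past the window end: the window is closed.
        System.out.println(isAccepted(lateRecord, 360_000L, size, grace));
    }
}
```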

Interactive Queries

Kafka Streams also supports interactive queries. This feature allows applications to query the current state of a stream, including windowed aggregates, directly from the instance managing that part of the state.
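As a rough mental model, a windowed state store can be thought of as a map from key to window start time to aggregate, queried by key and time range. The plain-Java sketch below models only that lookup, with hypothetical names; in a real application the store would be obtained from the running KafkaStreams instance (via its store method) and queried through ReadOnlyWindowStore.fetch, assuming the aggregation was materialized under a store name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative model only: a windowed store as key -> (window start -> aggregate).
class WindowStoreModel {
    private final Map<String, NavigableMap<Long, Long>> state = new HashMap<>();

    void put(String key, long windowStartMs, long count) {
        state.computeIfAbsent(key, k -> new TreeMap<>()).put(windowStartMs, count);
    }

    // All windows for the key whose start lies in [fromMs, toMs], both ends inclusive.
    NavigableMap<Long, Long> fetch(String key, long fromMs, long toMs) {
        NavigableMap<Long, Long> windows = state.getOrDefault(key, new TreeMap<>());
        return windows.subMap(fromMs, true, toMs, true);
    }

    public static void main(String[] args) {
        WindowStoreModel store = new WindowStoreModel();
        store.put("a", 0L, 3L);
        store.put("a", 300_000L, 5L);
        store.put("a", 600_000L, 1L);
        System.out.println(store.fetch("a", 300_000L, 600_000L));
    }
}
```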

Summary Table

Here is a summary of the different types of windows and their key characteristics:

| Window Type | Characteristics | Use Case |
|-------------|-----------------|----------|
| Tumbling | Fixed size, non-overlapping, no gaps. | Aggregations that reset at each interval. |
| Hopping | Fixed size; windows overlap when the advance interval is smaller than the size. | Moving aggregates, e.g. a 5-minute count updated every minute. |
| Sliding | Fixed duration; the window slides with each event. | Precise control, but computationally intensive. |
| Session | Dynamically sized windows based on event arrival and an inactivity gap. | Activity sessions or bursts of events. |
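Session windows are the one type whose boundaries depend entirely on the data, so a small example may help. The plain-Java sketch below groups the sorted timestamps of a single key into sessions, starting a new session whenever the distance to the previous record exceeds the inactivity gap. It is a simplified model with hypothetical names and assumes the records arrive in timestamp order.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: splitting one key's records into session windows.
class SessionWindowSketch {

    // Input: timestamps for one key, sorted ascending.
    // Output: one {startMs, endMs} pair per session.
    static List<long[]> sessions(long[] sortedTimestampsMs, long inactivityGapMs) {
        List<long[]> result = new ArrayList<>();
        for (long ts : sortedTimestampsMs) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && ts - last[1] <= inactivityGapMs) {
                last[1] = ts; // within the gap: extend the current session
            } else {
                result.add(new long[] {ts, ts}); // gap exceeded: open a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] timestamps = {0L, 10_000L, 20_000L, 100_000L, 105_000L};
        for (long[] s : sessions(timestamps, 30_000L)) {
            System.out.println("[" + s[0] + ", " + s[1] + "]");
        }
    }
}
```

With a 30-second inactivity gap, the five records above form two sessions: one from 0 to 20000 and one from 100000 to 105000.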

Conclusion

Windowed aggregation with Kafka Streams is a powerful way to perform real-time analytics and monitoring on event data grouped by specific time frames. It provides a versatile set of windowing options suitable for various use cases, from simple counting to complex sessionization. By properly configuring window types and understanding how to handle late events, developers can build robust, scalable stream-processing applications using Kafka Streams.

