Kafka Streams - what is stored in memory and disk in Streams App
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Apache Kafka Streams is a client library for building applications and microservices where the inputs and outputs are stored in Kafka clusters. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology.
What is Stored in Memory and Disk in a Kafka Streams Application?
Kafka Streams manages its state using a combination of in-memory data structures and local storage. This management is crucial for ensuring both fast data access and data safety in cases of failures.
1. In-Memory Storage
Kafka Streams utilizes in-memory data structures primarily for caching and for storing the current state of windowed computations. The use of RAM allows for rapid access and manipulation of data, which is ideal for real-time processing requirements.
Example:
- Caching: Kafka Streams might use an in-memory cache for each instance to store recently accessed or computed records temporarily before flushing them to Kafka or to local state stores.
2. Disk Storage
Disk storage in Kafka Streams is predominantly used through state stores backed by change-log topics in Kafka. This set up helps in maintaining a durable state that can survive application restarts or failures.
State stores can be either:
- Persistent State Stores: These stores persist data on the disk. If a Streams application fails or is restarted, the state is restored from these persistent stores.
- Non-Persistent State Stores: These only keep their data in memory and are generally faster but do not survive a failure.
Kafka Streams uses RocksDB, a high-performance embedded database for key-value data, as a default persistent state store. It automatically manages the storage and provides efficient access to the state data.
Example:
- Local KV Store: A local RocksDB instance that might be storing aggregated counts per key required for a real-time dashboard.
Operations Involving Memory and Disk
When data is processed, Kafka Streams generally follows these steps:
- Consumes data from the input Kafka topics.
- Processes the data in-memory which may involve stateful operations.
- Updates the state store based on the processing results (either in-memory or persistent storage).
- Produces process results to output Kafka topics, which might involve flushing state store changes to the log-compacted Kafka topics.
Table: Kafka Streams Storage Summary
| Storage Type | Description | Usage Examples | Persistence | Performance Impact |
| In-Memory | Temporary, rapid access storage. | Caching, windowed computations. | No | High – Fast data access and processing. |
| Disk (Persistent State Stores) | Long-term, durable storage. Typically using RocksDB. | Snapshots of the current state, such as aggregated results over time. | Yes | Medium – Slower than RAM, but resilient. |
| Disk (Change-log Topics) | Kafka topics that store changes to the state store. | Backup for state recovery. | Yes | Depends on Kafka’s throughput and disk performance. |
Additional Topics and Considerations
High Availability and Fault Tolerance
Kafka Streams applications are designed to be fault-tolerant. By using Kafka’s change-log topics to back the state stores, any lost data due to application failure can be recovered. Kafka partitions, coupled with Streams tasks, can also be reassigned to different application instances during failures, providing high availability.
Scaling and Performance Optimization
Applications can be scaled by increasing the number of instances, each consuming a partition of Kafka topics. This scalability feature is vital for dealing with large volumes of data. Performance optimizations may include tuning the RocksDB settings, adjusting in-memory cache sizes, and configuring the frequency of state store snapshots.
Conclusion
Kafka Streams offers a robust platform for stateful and stateless processing of stream data directly sourced from Apache Kafka. It leverages a combination of in-memory data structures and disk-based storage mechanisms to balance performance with durability and fault tolerance. Understanding the interaction between these storage solutions is essential for designing efficient, scalable Kafka Streams applications.

