Kafka Streams: output to a topic first or persist directly?
Apache Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in Kafka topics. Kafka Streams combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology. When developing applications with Kafka Streams, a common question is whether to output data to another Kafka topic or to persist it directly to a database or another storage system. Each approach has its advantages and use cases.
Kafka Streams Output to Topic
Implementation
When choosing to output to a Kafka topic, the main goal is typically to propagate processed streams for further consumption by other applications or services, or for durable storage and re-consumption.
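A minimal sketch of this pattern with the Kafka Streams DSL is given here. The topic name `source-topic`, the class name `TopicOutputSketch`, and the upper-casing transformation are assumptions made for illustration; only `sourceStream`, `processedStream`, and `output-topic` come from the description. The sketch only builds the topology. Running it would additionally require a `KafkaStreams` instance configured with at least `application.id` and `bootstrap.servers`.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class TopicOutputSketch {

    // Builds the processing topology; no broker connection is made here.
    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Read String keys and values from the (assumed) source topic.
        KStream<String, String> sourceStream =
                builder.stream("source-topic", Consumed.with(Serdes.String(), Serdes.String()));

        // Placeholder transformation: upper-case each value (assumes non-null values).
        KStream<String, String> processedStream =
                sourceStream.mapValues(value -> value.toUpperCase());

        // Write the processed records to the output topic.
        processedStream.to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }

    public static void main(String[] args) {
        // Print the source -> processor -> sink structure of the topology.
        System.out.println(buildTopology().describe());
    }
}
```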
In this simple example, sourceStream reads from a source topic, a transformation modifies the data to produce processedStream, and processedStream writes the results to 'output-topic'.
Advantages
- Decoupling of Components: Writing to a topic decouples the processing application from consumers. Decoupled systems are easier to manage and scale.
- Reusability of Stream Data: Data in Kafka topics can be replayed or consumed by multiple downstream systems for different purposes without affecting the source system.
- Leveraging Kafka Ecosystem: Kafka’s durability, replication, and fault tolerance handle data persistence, making it unnecessary to manage these aspects in your application code.
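The reusability point can be sketched with plain consumer settings: each downstream system uses its own `group.id`, so each receives the full stream independently, and `auto.offset.reset=earliest` lets a new group start from the oldest retained record. The configuration keys are standard Kafka consumer settings; the broker address, the group names, and the class name `ReplayConsumerConfigSketch` are hypothetical.

```java
import java.util.Properties;

public class ReplayConsumerConfigSketch {

    // Builds consumer settings for one downstream system reading the output topic.
    public static Properties consumerProperties(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        // A distinct group id gives this system its own offsets on the topic.
        props.put("group.id", groupId);
        // With no committed offsets for the group, start from the earliest retained record.
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        // Two hypothetical downstream systems, each consuming the same topic in full.
        Properties search = consumerProperties("search-indexer");
        Properties audit = consumerProperties("audit-archiver");
        System.out.println(search.getProperty("group.id") + " / " + audit.getProperty("group.id"));
    }
}
```

These properties would be passed to a `KafkaConsumer` that subscribes to the output topic. How far back a replay can reach depends on the topic's retention settings.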
Kafka Streams Persistence to a Database
Implementation
Direct persistence to databases is usually implemented when the processed results are considered final and are needed for immediate, random read/write access, such as in an online transaction processing system.
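A minimal sketch of direct persistence follows. The `RecordStore` interface, the class name `DirectPersistSketch`, the topic name `source-topic`, and the trimming transformation are hypothetical; a real store might wrap JDBC or a database client. One caveat worth stating: side effects inside `foreach` are outside Kafka's own transactions, so a record may be written again after a failure and restart. Keying the write on the record key (an upsert) is one way to keep such retries harmless.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;

public class DirectPersistSketch {

    /** Hypothetical storage abstraction standing in for a database client. */
    public interface RecordStore {
        void upsert(String key, String value);
    }

    // Builds the topology; the store is only invoked once the stream is running.
    public static Topology buildTopology(RecordStore store) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> sourceStream =
                builder.stream("source-topic", Consumed.with(Serdes.String(), Serdes.String()));

        // Placeholder transformation: trim whitespace (assumes non-null values).
        KStream<String, String> processedStream =
                sourceStream.mapValues(value -> value.trim());

        // foreach is a terminal operation: each record is handed to the store
        // and nothing is written back to Kafka.
        processedStream.foreach((key, value) -> store.upsert(key, value));

        return builder.build();
    }

    public static void main(String[] args) {
        // In-memory stand-in for a database table, keyed by record key.
        Map<String, String> table = new ConcurrentHashMap<>();
        Topology topology = buildTopology(table::put);
        System.out.println(topology.describe());
    }
}
```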
Here, each record is processed and saved directly to a database within the foreach method.
Advantages
- Immediate Consistency: Results are written to the database as part of processing, so they are available for querying as soon as the write commits, with no intermediate topic and consumer in between.
- Transactional Support: Many relational databases support transactions, which are crucial for certain types of business applications.
- Complex Queries: Databases typically provide more sophisticated querying capabilities, which are essential for complex data access patterns.
Comparative Table: Output to Topic vs. Persist Directly
| Feature or Factor | Output to Topic | Persist Directly |
| --- | --- | --- |
| Coupling | Low (decoupled systems) | High (tightly coupled to storage) |
| Data Reusability | High (data can be replayed and consumed by multiple systems) | Lower (no replayable log; other systems must query the database) |
| Management of Persistence | Handled by Kafka | Must be managed by the application |
| Scalability | High (scales with topic partitions and application instances) | Depends on database scalability |
| Read/Write Access | Asynchronous read/write | Synchronous, immediate read/write |
| Transaction Support | Kafka transactions (stream processing) | Database transactions (ACID properties) |
| Suitability | High-throughput, multiple consumers | Transactional systems, complex queries |
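For the Transaction Support row, Kafka-side transactions in a Kafka Streams application are enabled through the `processing.guarantee` setting. The sketch below uses the standard configuration keys as plain strings so that it needs no Kafka dependency; the application id, the broker address, and the class name `StreamsConfigSketch` are hypothetical. Note that this guarantee covers reads from and writes to Kafka topics (and Kafka Streams state), not writes made to an external database.

```java
import java.util.Properties;

public class StreamsConfigSketch {

    // Builds Kafka Streams settings with exactly-once processing enabled.
    public static Properties streamsProperties() {
        Properties props = new Properties();
        props.put("application.id", "topic-vs-db-demo");  // hypothetical application id
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        // exactly_once_v2 requires Kafka Streams 3.0+ and brokers 2.5+.
        props.put("processing.guarantee", "exactly_once_v2");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsProperties());
    }
}
```

These properties would be passed, together with a topology, to the `KafkaStreams` constructor.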
Additional Considerations
- Complex Event Processing: Kafka Streams supports stateful operations such as windowing, joins, and aggregations over live streams. These are harder to achieve directly in a database because of the computational overhead and limited real-time processing capabilities.
- Cost and Overhead: Outputting to a topic adds an extra hop and extra topic storage, which may be unnecessary overhead when no other consumer needs the data. Direct persistence, in turn, puts the database write on the processing path, so database latency, availability, and cost directly affect the stream application.
- Data Volume and Velocity: High-volume or high-velocity data is usually better handled by Kafka, because the log buffers records durably and consumers pull at their own pace, which absorbs bursts that could overwhelm a database written to directly.
Conclusion
The choice between outputting to another Kafka topic or persisting directly to a database largely depends on the application's specific needs for data availability, consistency, processing complexity, and the architectural decisions surrounding system coupling and scalability. Each method has its trade-offs, and a hybrid approach may even be necessary in certain complex applications.

