Scaling out with 200+ Kafka topics
Apache Kafka is a distributed streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is built on the abstraction of a distributed commit log. Running a cluster with 200+ topics takes careful planning, because topic count interacts with data throughput, partitioning, and consumer coordination.
Understanding Kafka Topics and Partitions
Kafka topics are categories or feeds to which records are published. Each topic is split into partitions, and each record is assigned to one partition by the partitioning strategy: typically a hash of the record key, or, when no key is specified, a scheme that spreads records across partitions (round-robin in older clients, sticky batching in clients since Kafka 2.4). Partitions let Kafka parallelize processing by distributing data across multiple brokers in the cluster.
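The two assignment strategies can be sketched in a few lines. This is a toy model rather than Kafka's implementation: the class name `SketchPartitioner` is hypothetical, and `zlib.crc32` stands in for the murmur2 hash that the Java client applies to keys.

```python
import zlib
from itertools import count
from typing import Optional


class SketchPartitioner:
    """Toy partitioner: hash keyed records, rotate unkeyed ones."""

    def __init__(self, num_partitions: int) -> None:
        if num_partitions <= 0:
            raise ValueError("num_partitions must be positive")
        self.num_partitions = num_partitions
        self._next_unkeyed = count()  # rotating counter for records without a key

    def partition_for(self, key: Optional[bytes]) -> int:
        if key is None:
            # No key: spread records evenly across partitions.
            return next(self._next_unkeyed) % self.num_partitions
        # Keyed: a stable hash keeps every record for this key on one partition.
        # crc32 is a stand-in here; it is NOT the hash Kafka uses.
        return zlib.crc32(key) % self.num_partitions


partitioner = SketchPartitioner(num_partitions=6)
same_key = {partitioner.partition_for(b"order-42") for _ in range(5)}
unkeyed = [partitioner.partition_for(None) for _ in range(6)]
print(same_key)  # a single partition number
print(unkeyed)   # each partition visited once
```

The property that matters for design is the first one: all records sharing a key land on one partition, which preserves per-key ordering but also means a hot key cannot be spread out by adding partitions.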
Challenges with Large Numbers of Topics
- Broker Overhead: Each topic and partition consumes memory, file descriptors, and CPU resources on the broker.
- ZooKeeper Load: In ZooKeeper-based deployments, Kafka stores cluster metadata in ZooKeeper, and a large number of topics and partitions increases the load on it. (Clusters running in KRaft mode keep this metadata in Kafka's own controller quorum instead, but the metadata volume still grows with partition count.)
- Replication Traffic: More partitions mean more replication traffic, which can reduce Kafka’s performance and increase the load on the network.
- Consumer and Producer Connections: If the number of consumers and producers grows along with the number of topics, connection overhead on the brokers grows with it.
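A rough back-of-the-envelope calculation makes these costs concrete. The sketch below assumes replicas are spread evenly across brokers and that every topic has the same partition count and replication factor; the `ClusterPlan` name and all numbers are illustrative, not recommendations.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterPlan:
    topics: int
    partitions_per_topic: int
    replication_factor: int
    brokers: int
    ingress_mb_per_s: float  # total producer write rate into the cluster

    def partition_replicas(self) -> int:
        # Every partition exists once per replica.
        return self.topics * self.partitions_per_topic * self.replication_factor

    def replicas_per_broker(self) -> int:
        # Assumes an even spread; real clusters are often skewed.
        return math.ceil(self.partition_replicas() / self.brokers)

    def replication_mb_per_s(self) -> float:
        # Each byte written to a leader is copied to (RF - 1) followers.
        return self.ingress_mb_per_s * (self.replication_factor - 1)


plan = ClusterPlan(
    topics=200,
    partitions_per_topic=12,
    replication_factor=3,
    brokers=6,
    ingress_mb_per_s=50.0,
)
print(plan.partition_replicas())    # 7200
print(plan.replicas_per_broker())   # 1200
print(plan.replication_mb_per_s())  # 100.0
```

Even a modest 12 partitions per topic puts over a thousand replicas on each broker in this example, and inter-broker replication traffic is double the producer ingress.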
Best Practices for Scaling with 200+ Topics
1. Efficient Use of Partitions:
Partitions are Kafka's unit of parallelism, but too many of them degrades performance: each one adds open file handles, replication work, and leader-election time after a broker failure. Monitor the partition count per broker, and avoid over-partitioning by giving each partition enough traffic to justify its overhead. This keeps broker storage manageable and throughput steady.
2. Topic and Partition Design:
When designing topics and their corresponding partitions, consider factors such as expected throughput, data retention policies, and consumer configurations. A common pattern is to categorize topics broadly and use fewer topics with multiple partitions.
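One widely used sizing heuristic derives the partition count from throughput: measure what a single partition sustains on the producer side and on the consumer side, then take enough partitions to cover the target on both. The function name and the figures below are assumptions for illustration; the per-partition rates have to come from measurements on your own hardware.

```python
import math


def partitions_for_throughput(
    target_mb_s: float,
    producer_mb_s_per_partition: float,
    consumer_mb_s_per_partition: float,
) -> int:
    """Smallest partition count that meets the target on both sides."""
    rates = (target_mb_s, producer_mb_s_per_partition, consumer_mb_s_per_partition)
    if min(rates) <= 0:
        raise ValueError("all rates must be positive")
    needed_for_producers = math.ceil(target_mb_s / producer_mb_s_per_partition)
    needed_for_consumers = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(needed_for_producers, needed_for_consumers)


# Producer-bound case: 100 MB/s target, 10 MB/s per partition when producing.
print(partitions_for_throughput(100, 10, 20))  # 10
# Consumer-bound case: slow consumers dominate the count.
print(partitions_for_throughput(30, 10, 4))    # 8
```

Applied per topic, this gives a defensible lower bound; summing the results across all topics shows quickly whether a design with hundreds of topics stays within what the brokers can host.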
3. Resource Allocation:
Allocate sufficient CPU, memory, disk, and network capacity to Kafka brokers to handle the load from many topics and partitions. Broker settings such as num.network.threads, num.io.threads, and queued.max.requests can then be tuned to match that capacity.
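As a small illustration, the helper below renders overrides for these three settings as `server.properties` lines, skipping any value that equals the broker default (3, 8, and 500 respectively). The helper `render_overrides` and the override values are hypothetical; appropriate numbers depend on core count and observed request queue metrics.

```python
# Broker defaults for the three settings named above.
BROKER_DEFAULTS = {
    "num.network.threads": 3,
    "num.io.threads": 8,
    "queued.max.requests": 500,
}


def render_overrides(overrides: dict) -> str:
    """Return server.properties lines for values that differ from the defaults."""
    unknown = set(overrides) - set(BROKER_DEFAULTS)
    if unknown:
        raise KeyError(f"unsupported settings in this sketch: {sorted(unknown)}")
    lines = [
        f"{name}={value}"
        for name, value in sorted(overrides.items())
        if value != BROKER_DEFAULTS[name]
    ]
    return "\n".join(lines)


# Illustrative values only, e.g. for a broker with more cores than the defaults assume.
properties_text = render_overrides(
    {"num.network.threads": 6, "num.io.threads": 16, "queued.max.requests": 500}
)
print(properties_text)
```

Keeping overrides as an explicit diff against the defaults makes it easier to review why each broker setting was changed.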
4. Monitoring and Operations:
Implement robust monitoring to track Kafka’s health and performance metrics. Use tools such as LinkedIn’s Cruise Control to automate workload rebalancing and anomaly detection.
5. Use of Compact Topics:
Consider using Kafka's log compaction for topics that serve as a persistent store or back an event-sourced system. Compaction guarantees that the log retains at least the last known value for each record key, so older values for the same key can be discarded and the data size shrinks.
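The retention rule can be modelled in a few lines. This is a simplified sketch, not the broker's cleaner: it compacts the whole log at once, whereas Kafka compacts only older segments in the background, and it does not model tombstones (records with a null value that mark a key for deletion). The function name `compact` and the sample keys are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (key, value)


def compact(log: List[Record]) -> List[Tuple[int, str, str]]:
    """Keep only the latest record per key, preserving original offsets and order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


log = [
    ("user-1", "a"),
    ("user-2", "b"),
    ("user-1", "c"),
    ("user-3", "d"),
    ("user-2", "e"),
]
compacted = compact(log)
print(compacted)  # [(2, 'user-1', 'c'), (3, 'user-3', 'd'), (4, 'user-2', 'e')]
```

Surviving records keep their original offsets, which is why consumers of a compacted topic see gaps in the offset sequence.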
Technical Example
Consider a scenario where you manage a system that aggregates logs from multiple microservices into Kafka for real-time processing and analysis. Assuming each microservice logs into its own Kafka topic, the setup can rapidly scale to hundreds of topics. Each topic could be partitioned based on the severity level or another key metric, leading to an extensive partitioning scheme.
Example Configuration:
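The sketch below shows one hypothetical shape for such a setup: one log topic per microservice, each with the same partition count, replication factor, and retention. The service names, the `logs.<service>` naming scheme, the bootstrap address, and every number are assumptions for illustration. The generated commands use the standard `kafka-topics.sh` options.

```python
from dataclasses import dataclass, field


@dataclass
class TopicSpec:
    name: str
    partitions: int
    replication_factor: int = 3
    configs: dict = field(default_factory=dict)

    def create_command(self, bootstrap: str = "localhost:9092") -> str:
        """Render the kafka-topics.sh invocation that would create this topic."""
        parts = [
            "kafka-topics.sh", "--bootstrap-server", bootstrap,
            "--create", "--topic", self.name,
            "--partitions", str(self.partitions),
            "--replication-factor", str(self.replication_factor),
        ]
        for key, value in sorted(self.configs.items()):
            parts += ["--config", f"{key}={value}"]
        return " ".join(parts)


SERVICES = ["checkout", "inventory", "payments"]  # hypothetical microservices
THREE_DAYS_MS = 3 * 24 * 60 * 60 * 1000

specs = [
    TopicSpec(
        name=f"logs.{service}",
        partitions=6,
        configs={"cleanup.policy": "delete", "retention.ms": THREE_DAYS_MS},
    )
    for service in SERVICES
]

for spec in specs:
    print(spec.create_command())

# Total replicas the cluster must host for these topics alone.
total_replicas = sum(s.partitions * s.replication_factor for s in specs)
print(total_replicas)  # 54
```

Three services already account for 54 partition replicas; the same template applied to 200 services yields 3,600, which is the number to check against broker capacity.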
These configurations must be balanced thoughtfully to prevent excessive load on the Kafka cluster and ensure optimal performance at scale.
Summary Table
| Aspect | Consideration | Action Point |
| --- | --- | --- |
| Partitions & Topics | Avoid excessive partitions per topic | Size partitions from measured throughput |
| Resource Management | Adequate broker resources for the load | Scale out brokers, adjust configurations |
| Performance Monitoring | Track performance and issues | Use tools like Cruise Control |
| Log Compaction | Useful for keyed state and event-sourced topics | Enable on appropriate topics |
In conclusion, scaling out Kafka with 200+ topics requires careful design consideration, efficient resource management, and rigorous monitoring to maintain a performant system. Each increase in topic count should be justified based on the use case, balancing the management overhead against the architectural benefits.

