Apache Pulsar vs Kafka - do consumers pull (poll) messages off the topics?

Apache Pulsar and Apache Kafka are both prominent distributed messaging systems widely used in modern data architectures for streaming analytics, data integration, and mission-critical applications. They offer robust capabilities to handle high-throughput streams of data efficiently. However, there are fundamental differences in their architectures and message consumption patterns. These variations can significantly influence the choice between the two, depending on specific use cases.

Architectural Overview

Apache Kafka is built around the concept of the log: an ordered, append-only sequence of messages persisted on disk. A Kafka cluster consists of multiple brokers, and each broker can hold terabytes of stored messages. Kafka provides a partitioned, replicated log service in which each log corresponds to a topic partition. Producers append to partitions, and consumers read from them by offset.
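The partitioned-log idea can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `PartitionedLog`, `append`, and `fetch` are hypothetical names, not Kafka APIs, and the CRC32-based partitioner is a stand-in for whatever hash a real client uses.

```python
import zlib
from typing import List, Tuple


class PartitionedLog:
    """Toy model of a topic split into append-only partitions (not a Kafka API)."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[bytes]] = [[] for _ in range(num_partitions)]

    def append(self, key: bytes, value: bytes) -> Tuple[int, int]:
        # Assumption: pick the partition from a CRC32 of the key.
        # Real clients use their own partitioner; only the idea matters here.
        partition = zlib.crc32(key) % len(self.partitions)
        self.partitions[partition].append(value)
        # The offset is the record's position within its partition.
        return partition, len(self.partitions[partition]) - 1

    def fetch(self, partition: int, offset: int, max_records: int) -> List[bytes]:
        # The reader chooses where to start; the log itself is never mutated.
        return self.partitions[partition][offset:offset + max_records]


log = PartitionedLog(num_partitions=3)
p, first = log.append(b"user-1", b"a")
_, second = log.append(b"user-1", b"b")
records = log.fetch(p, first, max_records=10)
```

Because both records share a key, they land in the same partition with consecutive offsets, which is what gives per-key ordering in this model.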

Apache Pulsar, on the other hand, uses a two-layer architecture: brokers serve producers and consumers, while bookies (Apache BookKeeper storage nodes) persist the data. Separating the serving and storage layers lets compute and storage scale independently and makes storage offloading easier, which can make Pulsar more scalable than Kafka in certain scenarios. Pulsar uses topics and namespaces as its fundamental organizing principles and, like Kafka, persists messages in an append-only log.

Message Consumption: Pull vs. Push Models

In terms of how messages are consumed, Kafka and Pulsar use fundamentally different models:

  • Kafka uses a pull-based model in which consumers poll the broker to fetch messages. When no new data is available, a fetch request can wait on the broker for a bounded time (long polling, controlled by consumer settings such as fetch.max.wait.ms and fetch.min.bytes) instead of returning empty immediately, so some added latency is possible depending on configuration. In return, consumers control their own consumption rate and can manage their workloads.
  • Pulsar combines push and pull. Consumers connect to a broker and send flow-control permits that state how many messages they are ready to receive; the broker then pushes up to that many messages, which the client buffers until the application reads them (for example with receive()). This can lower delivery latency, although whether it outperforms Kafka depends on workload and configuration.
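The permit-based flow described above can be sketched as a toy simulation. It is not the Pulsar client or wire protocol: `PermitBroker`, `grant_permits`, and `publish` are hypothetical names, and the `delivered` list merely stands in for a consumer-side receive buffer.

```python
from collections import deque
from typing import Deque, List


class PermitBroker:
    """Toy broker that pushes only as many messages as the consumer has permitted."""

    def __init__(self) -> None:
        self.backlog: Deque[str] = deque()  # messages waiting for permits
        self.permits = 0                    # how many messages the consumer will accept
        self.delivered: List[str] = []      # stand-in for the consumer's receive buffer

    def publish(self, message: str) -> None:
        self.backlog.append(message)
        self._dispatch()

    def grant_permits(self, count: int) -> None:
        # The consumer signals it can take `count` more messages.
        self.permits += count
        self._dispatch()

    def _dispatch(self) -> None:
        # Push while there is both something to send and permission to send it.
        while self.permits > 0 and self.backlog:
            self.delivered.append(self.backlog.popleft())
            self.permits -= 1


broker = PermitBroker()
broker.grant_permits(2)
for message in ("m1", "m2", "m3"):
    broker.publish(message)
held_back = list(broker.backlog)  # the third message waits for more permits
broker.grant_permits(1)
```

In a pure pull model the consumer would instead ask for records from a given offset each time; here the broker initiates delivery, but never beyond what the consumer has allowed.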

Scalability and Fault Tolerance

Both platforms offer robust scalability and fault tolerance features, but they do so in slightly different ways:

  • Kafka achieves redundancy and fault tolerance through replication: each partition is copied across multiple brokers. It scales horizontally by spreading partitions across brokers. However, every broker handles both traffic and storage, so absorbing rapid data growth typically means adding brokers and rebalancing partitions onto them.
  • Pulsar achieves scalability through its segmented architecture. By separating brokers from bookies (which handle data storage), it allows for independent scaling of traffic handling and storage, potentially providing better performance in environments with variable workload patterns.
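One way to picture why separating storage helps is a toy placement model in which a topic's log is cut into segments and each new segment is written to a small set of storage nodes. This is a deliberately simplified sketch, not BookKeeper's actual placement policy; `SegmentStore`, `roll_segment`, and the round-robin choice are assumptions made for illustration, and `ensemble_size` is assumed not to exceed the number of bookies.

```python
from typing import Dict, List


class SegmentStore:
    """Toy model: each new log segment is placed on a rotating set of storage nodes."""

    def __init__(self, bookies: List[str], ensemble_size: int) -> None:
        self.bookies = list(bookies)
        self.ensemble_size = ensemble_size
        self.placement: Dict[int, List[str]] = {}
        self._cursor = 0  # where the next round-robin selection starts

    def add_bookie(self, name: str) -> None:
        # Adding storage capacity does not move any existing segment.
        self.bookies.append(name)

    def roll_segment(self) -> int:
        segment_id = len(self.placement)
        count = len(self.bookies)
        self.placement[segment_id] = [
            self.bookies[(self._cursor + i) % count] for i in range(self.ensemble_size)
        ]
        self._cursor = (self._cursor + 1) % count
        return segment_id


store = SegmentStore(["bookie-1", "bookie-2", "bookie-3"], ensemble_size=2)
store.roll_segment()
store.roll_segment()
before = dict(store.placement)
store.add_bookie("bookie-4")
new_segment = store.roll_segment()
```

In this model the new node starts taking writes as soon as the next segment is created, while older segments stay where they were. A design that pins a whole partition to a broker would instead need to copy existing data to use the new capacity.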

Performance Considerations

Performance in messaging systems can be critical, especially in real-time applications. Kafka is known for high throughput and durability, especially with larger message sizes. Pulsar is often credited with better performance for mixed workloads and smaller messages, owing to lower publish latency and better end-to-end latency at scale, though results depend heavily on workload, hardware, and configuration.

Use Cases

  • Kafka is exceptionally well-suited for traditional event-streaming applications where massive amounts of data need to be ingested and processed in order.
  • Pulsar offers advantages in scenarios requiring real-time performance with strict latency requirements. It is also more suitable for cloud-native applications due to its decoupled architecture.

Summary Table

Feature | Apache Kafka | Apache Pulsar
Architecture | Single-layer (Brokers manage everything) | Two-layer (Separate brokers and bookies)
Message Consumption | Pull-based (Consumers poll messages) | Push-Pull based (Broker pushes messages)
Scalability | Good (Limited by broker capacity) | Very good (Independent scaling of components)
Fault Tolerance | High (Replication of data) | High (Replicated and segmented storage)
Best Use Cases | High-throughput logging and streaming | Real-time messaging, Cloud-native apps

Conclusion

Choosing between Apache Pulsar and Kafka depends significantly on the specific requirements of the application, such as latency sensitivities, throughput needs, and operational complexities. Kafka might be preferable for systems with high data ingestion rates and durable message storage, whereas Pulsar could be a better match for dynamic environments with stringent performance demands.

