In previous blog posts, we described several reasons why Apache Pulsar is an enterprise-grade streaming and messaging system that you should consider for your real-time use cases. We also took a deep dive into enterprise features like real-time durable storage to prevent data loss, multi-tenancy, geo-replication, encryption, and security. These blog posts helped develop an understanding of Pulsar and generated a lot of interest in the community. One question we're frequently asked is how Apache Pulsar compares as an alternative to Apache Kafka.
In this series of Pulsar and Kafka comparison posts, I'll walk you through a few areas that I believe are critical when choosing a robust, highly available, high-performance streaming messaging platform.
The messaging model is the first thing users should consider when choosing a streaming messaging system. A messaging model should cover the following three areas:
- Message consumption - How are messages dispatched and consumed?
- Message acknowledgement - How are messages acknowledged?
- Message retention - How long are messages retained, what triggers their removal, and how are they removed?
In a modern real-time streaming architecture, messaging use cases can be separated into two categories: queuing and streaming.
Queuing is unordered or shared messaging. With queuing messaging, multiple consumers are created to receive messages from a single point-to-point messaging channel. When the channel delivers a message, any of the consumers could potentially receive it. The messaging system's implementation determines which consumer actually receives the message.
Queuing use cases are usually found in conjunction with stateless applications. Stateless applications don’t care about ordering but they do require the ability to acknowledge or remove individual messages as well as the ability to scale consumption parallelism as much as possible. Typical queuing-based messaging systems include RabbitMQ and RocketMQ.
In contrast, streaming is strictly ordered or exclusive messaging. With streaming messaging, there is always only one consumer consuming the messaging channel. The consumer receives the messages dispatched from the channel in the exact order in which they were written.
Streaming use cases are usually associated with stateful applications. Stateful applications care about ordering and their state, because the order in which messages arrive determines the state of the application. If messages are consumed out of order, the processing logic the application applies can produce incorrect results.
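To see why ordering matters for a stateful consumer, here is a minimal sketch in plain Python. It is not Pulsar code; the event names and the `apply_events` helper are hypothetical and exist only for this illustration. The same three events produce a different final state when they are consumed in a different order.

```python
# Toy illustration only: a stateful consumer folds events into a balance.
# The event kinds ("deposit", "double") and apply_events are made up.

def apply_events(events):
    """Apply events in the order given and return the final balance."""
    balance = 0
    for kind, amount in events:
        if kind == "deposit":
            balance += amount
        elif kind == "double":
            balance *= 2
    return balance


in_order = [("deposit", 10), ("double", 0), ("deposit", 5)]
out_of_order = [("deposit", 10), ("deposit", 5), ("double", 0)]

print(apply_events(in_order))      # 25
print(apply_events(out_of_order))  # 30
```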
Both streaming and queuing are necessary in a microservices-oriented or event-driven architecture.
Apache Pulsar unifies queuing and streaming into a single messaging model: producer-topic-subscription-consumer. A topic (partition) is a named channel for sending messages. Each topic partition is backed by a distributed log stored in Apache BookKeeper. Each message published by a producer is stored only once on a topic partition, replicated across multiple bookies (BookKeeper servers), and can be consumed as many times as necessary by consumers.
The topic is the source of truth for consumption. Although messages are only stored once on the topic partition, there can be different ways of consuming those messages. Consumers are grouped together for consuming messages. Each group of consumers is a subscription on a topic. Each consumer group can have its own way of consuming the messages (exclusive, shared, or failover), and that choice can differ across consumer groups. This combines queuing and streaming in one model and API. It was designed and implemented with the goal of neither impacting performance nor introducing cost overhead, while giving users the flexibility to consume messages in the way that best fits the use case at hand.
As the name suggests, an exclusive subscription allows only a single consumer in the subscription (consumer group) to consume a topic partition at any given time. Figure 1 below illustrates an example of an exclusive subscription. There is one active consumer A-0 with subscription A. Messages m0 through m4 are delivered in order and consumed by A-0. If another consumer A-1 wants to attach to subscription A, it will not be allowed to do so.
Figure 1. Exclusive subscription
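The behavior in Figure 1 can be approximated with a small toy model. This is a sketch in plain Python, not the Pulsar client API; the `ExclusiveSubscription` class and its methods are hypothetical names chosen for illustration.

```python
# Toy model of an exclusive subscription (not the Pulsar client API).
# ExclusiveSubscription, attach, and dispatch are hypothetical names.

class ExclusiveSubscription:
    def __init__(self, name):
        self.name = name
        self.consumer = None  # at most one consumer may be attached

    def attach(self, consumer_name):
        """Attach a consumer, refusing if one is already attached."""
        if self.consumer is not None:
            raise RuntimeError(
                f"subscription {self.name} already has consumer {self.consumer}"
            )
        self.consumer = consumer_name

    def dispatch(self, messages):
        """Deliver every message, in order, to the single consumer."""
        return [(self.consumer, m) for m in messages]


sub_a = ExclusiveSubscription("A")
sub_a.attach("A-0")

try:
    sub_a.attach("A-1")  # a second consumer is refused
    rejected = False
except RuntimeError:
    rejected = True

deliveries = sub_a.dispatch(["m0", "m1", "m2", "m3", "m4"])
```

Every message goes to A-0 in publish order, which is what makes this subscription type suitable for order-sensitive processing.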
With failover subscriptions, multiple consumers can attach to the same subscription. For a given topic partition, however, one consumer is elected as the master consumer for that topic partition. The other consumers are designated failover consumers. When the master consumer disconnects, the partition is reassigned to one of the failover consumers, which becomes the new master consumer. When this happens, all non-acked messages are delivered to the new master consumer. This is similar to consumer partition rebalancing in Apache Kafka. Figure 2 illustrates a failover subscription. Consumers B-0 and B-1 are subscribed to consume messages via subscription B. B-0 is the master consumer and receives all messages. B-1 is the failover consumer and will take over consumption if consumer B-0 fails.
Figure 2. Failover subscription
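The takeover in Figure 2 can be sketched as follows. This is a toy model in plain Python rather than Pulsar code, and all names are hypothetical. It also assumes, purely for brevity, that the next consumer in attach order is promoted; the post does not specify how the master is elected.

```python
# Toy model of a failover subscription (not the Pulsar client API).
# Assumption: the next consumer in attach order becomes the new master.

class FailoverSubscription:
    def __init__(self):
        self.consumers = []  # attach order; the first entry is the master
        self.unacked = []    # messages delivered but not yet acknowledged

    @property
    def master(self):
        return self.consumers[0] if self.consumers else None

    def attach(self, name):
        self.consumers.append(name)

    def dispatch(self, message):
        """Deliver a message to the current master consumer."""
        self.unacked.append(message)
        return (self.master, message)

    def ack(self, message):
        self.unacked.remove(message)

    def disconnect(self, name):
        """Drop a consumer; if it was the master, redeliver unacked messages."""
        was_master = name == self.master
        self.consumers.remove(name)
        if was_master and self.master is not None:
            return [(self.master, m) for m in self.unacked]
        return []


sub_b = FailoverSubscription()
sub_b.attach("B-0")
sub_b.attach("B-1")

first = [sub_b.dispatch(m) for m in ["m0", "m1", "m2"]]
sub_b.ack("m0")                        # B-0 acks only m0 before failing
redelivered = sub_b.disconnect("B-0")  # B-1 takes over and gets m1, m2
```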
With shared subscriptions, you can have as many consumers as you want attach to the same subscription. Messages are delivered in a round-robin distribution across multiple consumers, and any given message is delivered to only one consumer. When a consumer disconnects, all the messages that were delivered to it and not acknowledged are rescheduled for delivery to the remaining consumers on that subscription. Figure 3 illustrates a shared subscription. Consumers C-1, C-2, and C-3 all consume messages on the same topic partition. Each consumer receives around ⅓ of the messages. If you want to increase the consumption rate, you can simply add more consumers to the same subscription (as many as you want) without increasing the number of partitions.
Figure 3. Shared subscription
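A rough sketch of round-robin dispatch with rescheduling on disconnect is shown below. It is a toy model in plain Python, not Pulsar code; the `SharedSubscription` class and its methods are hypothetical, and a real broker's dispatch policy is likely more involved than a simple counter.

```python
# Toy model of a shared subscription (not the Pulsar client API).
# SharedSubscription and its methods are hypothetical names.

class SharedSubscription:
    def __init__(self, consumers):
        self.consumers = list(consumers)
        self.next_index = 0
        self.unacked = {}  # message -> consumer currently holding it

    def dispatch(self, message):
        """Hand the message to the next consumer in round-robin order."""
        consumer = self.consumers[self.next_index % len(self.consumers)]
        self.next_index += 1
        self.unacked[message] = consumer
        return consumer

    def ack(self, message):
        del self.unacked[message]

    def disconnect(self, consumer):
        """Remove a consumer and reschedule its unacked messages."""
        self.consumers.remove(consumer)
        pending = [m for m, c in self.unacked.items() if c == consumer]
        return {m: self.dispatch(m) for m in pending}


sub_c = SharedSubscription(["C-1", "C-2", "C-3"])
assignment = {m: sub_c.dispatch(m) for m in ["m0", "m1", "m2", "m3", "m4", "m5"]}

sub_c.ack("m1")                   # C-2 acks m1 but not m4
moved = sub_c.disconnect("C-2")   # m4 is rescheduled to a remaining consumer
```

Each of the three consumers receives two of the six messages, and only the unacknowledged message held by the disconnected consumer is delivered again.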
Exclusive and failover subscriptions allow only one consumer per topic partition per subscription. They consume messages in partition order. They are best applied to streaming use cases where strict ordering is required. Shared subscriptions, on the other hand, allow multiple consumers per topic partition. Each consumer in the same subscription receives only a portion of the messages published to a topic partition. Shared subscriptions are best for queuing use cases where ordering is not required, and they let you scale the number of consumers beyond the number of partitions.
A subscription in Pulsar is effectively the same as a consumer group in Apache Kafka. Creating subscriptions is highly scalable and very cheap. You can create as many subscriptions as you need. Different subscriptions on the same topic don't have to be of the same subscription type. This means that you can have one failover subscription with 10 consumers and one shared subscription with 20 consumers on the same topic. If the shared subscription is slow in processing events, you can add more consumers to the shared subscription without changing the number of partitions. Figure 4 depicts a topic that has 3 subscriptions, A, B, and C, and illustrates how messages flow through the system from producers to consumers.
Figure 4. Unified streaming and queuing in Apache Pulsar
Besides the unified messaging API, since a Pulsar topic partition is effectively a distributed log stored in Apache BookKeeper, it also provides a reader API (similar to the consumer API but without cursor management) for users to fully control how to consume messages themselves.
When using a messaging system distributed across machines, failures can occur. When consumers consume messages from a topic, both the consumers and the message brokers serving the topic partition can fail. When such a failure occurs, it is useful to be able to resume consumption from where the consumers left off once everything recovers, so that you neither miss messages nor reprocess messages that have already been acknowledged. In Apache Kafka, this resume point is known as the offset, and the process of updating it is called message acknowledgment, or committing the offset.
In Apache Pulsar, cursors are used for tracking message acknowledgment for each subscription. The cursor is updated any time a consumer acks messages on the topic partition. Updating the cursor ensures that the consumers will not receive those messages again. However, cursors are more than the simple offsets used in Apache Kafka.
There are two ways to acknowledge messages in Apache Pulsar: individually or cumulatively. With cumulative acknowledgment, the consumer only needs to acknowledge the last message it received. All the messages in the topic partition up to (and including) the provided message ID are marked as acknowledged and will not be redelivered to the consumer. Cumulative acknowledgment is effectively the same as an offset update in Apache Kafka.
The differentiating feature of Apache Pulsar is the ability to ack individually, also known as selective acking. Consumers are able to acknowledge messages individually, and acked messages will not be redelivered. Figure 5 illustrates the difference between individual and cumulative acks (messages in gray boxes are acknowledged and will not be redelivered). The top of the diagram shows an example of a cumulative ack: all messages up to and including M12 are marked as acked. The bottom of the diagram shows an example of individual acks: only messages M7 and M12 are acknowledged, so in the case of a consumer failure, all the messages except M7 and M12 will be redelivered.
Consumers in an exclusive or failover subscription can ack messages individually or cumulatively, while consumers in a shared subscription can only ack messages individually. The ability to ack messages individually makes consumer failures easier to handle. For applications where processing a message takes a long time or is very expensive, it is extremely important to avoid redelivering messages that have already been acknowledged.
The flexibility in choosing the subscription type and acknowledgment method allows Pulsar to support various messaging and streaming use cases in a simple unified API.
Figure 5. Individual vs. cumulative ack
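The two acknowledgment modes in Figure 5 can be modeled with a toy cursor. This is a sketch in plain Python, not Pulsar's cursor implementation; the `Cursor` class, its field names, and the choice of 15 messages (ids 0 through 14) are assumptions made for illustration.

```python
# Toy cursor (not Pulsar's implementation). Field names are hypothetical.

class Cursor:
    def __init__(self):
        self.acked_up_to = -1            # every id <= this value is acked
        self.acked_individually = set()  # acked ids above acked_up_to

    def ack_cumulative(self, msg_id):
        """Mark msg_id and everything before it as acknowledged."""
        self.acked_up_to = max(self.acked_up_to, msg_id)
        self.acked_individually = {
            i for i in self.acked_individually if i > self.acked_up_to
        }

    def ack_individual(self, msg_id):
        """Mark only msg_id as acknowledged."""
        if msg_id > self.acked_up_to:
            self.acked_individually.add(msg_id)
        # Advance the contiguous prefix if the gap has been filled.
        while self.acked_up_to + 1 in self.acked_individually:
            self.acked_up_to += 1
            self.acked_individually.remove(self.acked_up_to)

    def to_redeliver(self, last_id):
        """Ids in [0, last_id] that would be redelivered after a failure."""
        return [
            i for i in range(self.acked_up_to + 1, last_id + 1)
            if i not in self.acked_individually
        ]


cumulative = Cursor()
cumulative.ack_cumulative(12)
print(cumulative.to_redeliver(14))   # [13, 14]

individual = Cursor()
individual.ack_individual(7)
individual.ack_individual(12)
print(individual.to_redeliver(14))   # everything except 7 and 12
```

The cumulative case needs only a single position, much like an offset, while the individual case also has to remember which messages beyond that position were acknowledged.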
Cursors are managed by brokers and stored in BookKeeper ledgers. For more details, see "Cursors in Apache Pulsar".
In contrast with traditional messaging systems, messages are not removed immediately after they are acknowledged. Pulsar brokers only update the cursor when they receive message acknowledgments. A message can only be deleted after all subscriptions have consumed it (that is, it is marked as acknowledged in every subscription's cursor). However, Pulsar also allows you to keep messages for a longer time even after all subscriptions have consumed them. This is done by configuring a message retention period. Figure 6 illustrates what message retention looks like in a topic partition with 2 subscriptions. Subscription A has consumed all the messages before M6, and subscription B has consumed all the messages before M10. That means all the messages before M6 (in gray boxes) are safe to remove. Messages M6 through M9 have not yet been consumed by subscription A, so they cannot be removed.
If a topic partition is configured with a message retention period, messages M0 to M5 will be kept for the configured time period, even though A and B have already consumed them.
Figure 6. Message retention
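The rule in Figure 6 boils down to two checks: a message is removable only when every subscription's cursor has passed it and any retention period has elapsed. The sketch below is plain Python, not Pulsar code. The `removable` function is hypothetical, and it assumes for simplicity that the retention period is measured from the publish time, which the post does not specify.

```python
# Toy retention check (not Pulsar code). removable is a hypothetical helper.
# Assumption: the retention period is measured from the publish time.

def removable(messages, cursors, now, retention_seconds=0):
    """Return ids that may be deleted.

    messages: dict mapping message id -> publish timestamp (seconds)
    cursors:  dict mapping subscription -> first id it has NOT consumed
    """
    slowest = min(cursors.values())  # the slowest subscription gates deletion
    return [
        msg_id
        for msg_id, published in sorted(messages.items())
        if msg_id < slowest and now - published >= retention_seconds
    ]


# Twelve messages M0..M11, all published at t=100.
messages = {i: 100 for i in range(12)}
cursors = {"A": 6, "B": 10}  # A consumed before M6, B consumed before M10

no_retention = removable(messages, cursors, now=200)
with_retention = removable(messages, cursors, now=200, retention_seconds=3600)

print(no_retention)    # [0, 1, 2, 3, 4, 5]
print(with_retention)  # []
```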
Besides message retention, Pulsar also supports message time-to-live (TTL). A message is automatically marked as acknowledged if it is not consumed by any consumer within the configured TTL period. The difference between the two is that message retention applies to messages that have been acknowledged and would otherwise be deleted, whereas TTL applies to messages that have not been consumed. Retention is thus a time limit applied to a topic, while TTL is a time limit on consumption within a subscription. Figure 6 above illustrates TTL in Pulsar. If there is no consumer alive for subscription B, for example, message M10 will automatically be marked as acknowledged after the configured TTL period has elapsed, even if no consumer has actually read the message.
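TTL expiry can be sketched in the same toy style. This is plain Python, not Pulsar code; the `expire` function is hypothetical, and measuring the TTL from the publish time is an assumption of this sketch.

```python
# Toy TTL expiry (not Pulsar code). expire is a hypothetical helper.
# Assumption: TTL is measured from the publish time.

def expire(backlog, now, ttl_seconds):
    """Split a subscription's unconsumed backlog by TTL.

    backlog: list of (message id, publish timestamp) pairs
    Returns (ids auto-marked as acknowledged, remaining backlog).
    """
    expired = [m for m, t in backlog if now - t >= ttl_seconds]
    remaining = [(m, t) for m, t in backlog if now - t < ttl_seconds]
    return expired, remaining


# Subscription B has no live consumer; M10 was published at t=0, M11 at t=50.
backlog_b = [(10, 0), (11, 50)]
auto_acked, still_pending = expire(backlog_b, now=60, ttl_seconds=30)

print(auto_acked)     # [10]
print(still_pending)  # [(11, 50)]
```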
The table below lists the similarities and differences between Apache Pulsar and Apache Kafka.
| | Apache Kafka | Apache Pulsar |
|---|---|---|
| Consumption | More focused on streaming, exclusive messaging on partitions. No shared consumption. | Unified messaging model and API. |
| Acking | Simple offset management. | Unified messaging model and API. |
| Retention | Messages are deleted based on retention. If a consumer doesn't read messages before the retention period ends, it will lose data. | Messages are only deleted after all subscriptions have consumed them. No data is lost even if the consumers of a subscription are down for a long time. Messages can be kept for a configured retention period even after all subscriptions have consumed them. |
| TTL | No TTL support | Supports message TTL |
If you would like a short summary of how Apache Pulsar's messaging model differs from Apache Kafka's, here is mine:
Apache Pulsar combines high-performance streaming (which Apache Kafka pursues) and flexible traditional queuing (which RabbitMQ pursues) into a unified messaging model and API. Pulsar gives you one system for both streaming and queuing, with the same high performance, using a unified API.
In this blog post, I walked you through Apache Pulsar's messaging model, which unifies queuing and streaming into a single API. Applications can use this single API for both high-performance queuing and streaming without the overhead of setting up, say, RabbitMQ for queuing and Kafka for streaming. I hope that this post gives you an idea of how message consumption, removal, and retention work in Apache Pulsar and that you've learned the difference between the Pulsar and Kafka messaging models. In upcoming blog posts, I will walk you through architectural details of Apache Pulsar and the differences that Pulsar presents in comparison to Apache Kafka with respect to data distribution, replication, availability, and durability.
If you are interested in this topic, and want to learn more about Apache Pulsar, please visit the official website at https://pulsar.apache.org. You can also participate in the Pulsar community via:
- The Pulsar Slack channel. You can self-register at https://apache-pulsar.herokuapp.com/
- The Pulsar email list.
To get the latest updates about Pulsar, you can follow the project on Twitter @apache_pulsar.