10 Reasons to Choose Apache Pulsar Over Apache Kafka
Apache Pulsar's distinguishing features, such as tiered storage, stateless brokers, geo-aware replication, and multi-tenancy, may be reasons to choose it over Apache Kafka.
Today, many data architects, engineers, DevOps teams, and business leaders are struggling to understand the pros and cons of Apache Pulsar and Apache Kafka. As someone who has worked with Kafka in the past, I wanted to compare these two technologies.
If you are looking for insights on when to use Pulsar, here are 10 advantages of the technology that might be the deciding factors for you.
Pulsar’s Brokers Are Stateless (Easier Scale-Out)
In Kafka, you select a fixed number of brokers. Later, you realize you need more brokers to scale out your application. Since Kafka stores the messages on the brokers themselves, it requires you to re-partition the topic (reassign partitions and move their data) to make full use of the newly added brokers.
In Pulsar, the state is kept in a separate storage layer (Apache BookKeeper). The broker layer is separate from the storage layer, allowing you to add and use brokers without moving any data. This means you can fully leverage a new broker without the need to re-partition existing data.
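To make the idea concrete, here is a toy model of a broker layer that is separate from the storage layer. It is only a sketch: the class names (`StorageLayer`, `Cluster`), the broker names, and the round-robin assignment are assumptions made for illustration and are not how Pulsar actually balances load.

```python
from dataclasses import dataclass, field


@dataclass
class StorageLayer:
    """Stands in for the separate storage layer; it holds all message data."""
    messages: dict = field(default_factory=dict)  # topic -> list of messages
    bytes_moved: int = 0  # never incremented: brokers do not copy data around

    def append(self, topic: str, message: bytes) -> None:
        self.messages.setdefault(topic, []).append(message)


@dataclass
class Cluster:
    storage: StorageLayer
    brokers: list
    owner: dict = field(default_factory=dict)  # topic -> broker name

    def assign(self) -> None:
        # Hypothetical round-robin ownership; a real system balances by load.
        for i, topic in enumerate(sorted(self.storage.messages)):
            self.owner[topic] = self.brokers[i % len(self.brokers)]

    def add_broker(self, name: str) -> None:
        self.brokers.append(name)
        self.assign()  # only the ownership map changes, no data is copied


storage = StorageLayer()
for topic in ("orders", "payments", "clicks"):
    storage.append(topic, b"hello")

cluster = Cluster(storage, brokers=["broker-1", "broker-2"])
cluster.assign()
before = dict(cluster.owner)

cluster.add_broker("broker-3")
after = dict(cluster.owner)
```

In this model, adding `broker-3` changes which broker serves a topic, while the stored messages stay exactly where they were.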
Tiered Storage (Longer Message Retention and Cost Savings for Storage)
Kafka has a default retention period of 7 days, which means data is deleted after one week. Pulsar, by default, retains all unacknowledged data but discards acknowledged data immediately.
Both Kafka and Pulsar allow you to change this behavior by setting custom retention policies. However, there is usually a limit on how much data you can store in your main storage, and adding more storage increases costs. Tiered storage allows you to choose the right and most cost-effective storage for different types of data. For example, historic data is not needed all the time, only when bootstrapping (backfilling) applications, so you don’t need the same storage type for different types of data.
Pulsar’s storage layer is organized into segments that are spread across all storage nodes. Segments can be written to the main storage or off-loaded to a different type of storage. This allows Pulsar to offer tiered storage, which Kafka does not yet support. Tiered storage offers multiple layers of storage, such as main storage (SSD-based) and historic storage (S3), and allows you to use them transparently.
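The segment-based layout described above can be sketched with a small toy model. Everything here is hypothetical (the `Segment` class, the tier names, the size budget); it only illustrates the idea that old segments can move to cheaper storage while readers still see a single ordered log.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: int
    size_mb: int
    tier: str = "primary"  # "primary" (e.g. SSD) or "archive" (e.g. S3)


def offload_oldest(segments, primary_budget_mb):
    """Move the oldest segments to the archive tier until primary fits the budget."""
    used = sum(s.size_mb for s in segments if s.tier == "primary")
    for seg in sorted(segments, key=lambda s: s.segment_id):
        if used <= primary_budget_mb:
            break
        if seg.tier == "primary":
            seg.tier = "archive"
            used -= seg.size_mb
    return segments


def read_all(segments):
    """Readers see one ordered log regardless of which tier holds a segment."""
    return [s.segment_id for s in sorted(segments, key=lambda s: s.segment_id)]


# Five 100 MB segments, but only 250 MB of primary storage to spend.
log = [Segment(i, size_mb=100) for i in range(5)]
offload_oldest(log, primary_budget_mb=250)
tiers = [s.tier for s in log]
```

With these numbers, the three oldest segments end up in the archive tier and the two newest stay on primary storage.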
Quorum-Based Replication (Improved Latency Consistency)
For replication, Pulsar uses a quorum-based algorithm, as opposed to the leader-follower-based approach in Kafka. The guarantees are the same, but the quorum approach tends to yield more consistent latencies. Consistent latency is important for many applications, for example, to reach certain SLAs, such as the response time for a query.
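A rough way to see why a quorum can smooth out latency is to compare "acknowledge after the k-th fastest replica" with "acknowledge after every replica". The sketch below is a deliberately simplified model with made-up latency numbers. It is not a benchmark, and the wait-for-all function is not a faithful model of Kafka's replication protocol.

```python
def quorum_ack_latency(replica_latencies_ms, ack_quorum):
    """Write completes once the ack_quorum-th fastest replica has responded."""
    return sorted(replica_latencies_ms)[ack_quorum - 1]


def wait_for_all_latency(replica_latencies_ms):
    """Write completes only after every replica has responded."""
    return max(replica_latencies_ms)


# Hypothetical per-replica latencies (ms) for four writes. One replica stalls
# on the third write, e.g. because of a GC pause or a slow disk.
writes = [
    [4, 5, 6],
    [5, 4, 6],
    [4, 90, 5],
    [6, 5, 4],
]

quorum = [quorum_ack_latency(w, ack_quorum=2) for w in writes]
wait_all = [wait_for_all_latency(w) for w in writes]
```

In this model, the single slow replica shows up as a spike only in `wait_all`; the quorum path is unaffected because two of the three replicas still answer quickly.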
Geo-Aware Replication (Improved Availability)
Pulsar has built-in geo-aware replication. This allows Pulsar to replicate data across data centers in different geographical locations. Having copies of the messages in multiple data centers keeps them available in the case of data center outages or network partitions. No external tooling is needed.
Multi-Tenancy (Simplified Infrastructure and Management)
Pulsar includes support for multi-tenancy, which enables multiple user groups to share the same cluster, either via access control or in entirely different namespaces. In Kafka, this feature is still under discussion. Without multi-tenancy, you need to build an abstraction layer on top of the messaging system, or use an entirely new cluster for a different group of users.
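Tenants and namespaces are visible directly in Pulsar topic names, which follow the pattern `persistent://tenant/namespace/topic` (or `non-persistent://...`). The helper below is a small sketch that splits such a name into its parts; the function names and the example tenants (`marketing`, `billing`) are made up for illustration.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split a fully qualified topic name into domain, tenant, namespace, topic."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unexpected topic domain in {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


def same_tenant(a: str, b: str) -> bool:
    """True if both topics belong to the same tenant."""
    return parse_topic(a).tenant == parse_topic(b).tenant


clicks = parse_topic("persistent://marketing/campaigns/clicks")
shared = same_tenant(
    "persistent://marketing/campaigns/clicks",
    "persistent://billing/invoices/created",
)
```

Because the tenant and namespace are part of every topic name, two user groups can share one cluster without their topics colliding.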
Encryption (Improved Security)
Pulsar offers full end-to-end encryption from the client to the storage nodes. Full in-flight encryption is often a requirement for data security. Currently, Kafka does not have end-to-end encryption.
Multi-Protocol Support (Easy to Integrate With Existing Applications)
Pulsar can speak other protocols, such as AMQP (the protocol used by RabbitMQ) and even Kafka (!). Additionally, support is available for Presto for reading historical stream events in parallel.
Pulsar Functions (Turn-Key Stream Processing)
Pulsar Functions offer a way to do lightweight stream processing on top of Pulsar, conceptually similar to Kafka Streams. Interestingly, Pulsar Functions are deployed directly on the broker nodes (or as pods in a Kubernetes cluster), whereas Kafka Streams jobs run as separate applications. Because of this, many stream processing tasks can be solved directly with Apache Pulsar, which reduces operational complexity.
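To give a feel for how small such a function can be, here is a sketch in Python. As far as I understand the Python language-native interface, a function can be a plain module-level `process` function that receives one message and returns the value to publish. The transformation itself is an invented example, and deployment details (input and output topics, packaging) are left out, so check the Pulsar Functions documentation before relying on it.

```python
def process(input):
    """Handle one incoming message and return the value to publish.

    Illustrative transformation only: append an exclamation mark.
    """
    return "{}!".format(input)


# Local check without a Pulsar cluster: call the function directly.
result = process("hello")
```

Since the function is ordinary Python, it can be unit-tested without a running cluster.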
Apache Flink Integration (Full-Blown Batch and Stream Processing)
The Pulsar community has communicated openly about the limitations of Pulsar Functions, e.g., in state management and DAG flows. In case Pulsar Functions aren't a fit for your needs, there is an actively maintained Pulsar connector for Apache Flink.
Pulsar Has Been Battle-Tested (Proven to Work at Scale)
Pulsar is well-established. It was originally developed and used internally at Yahoo, open-sourced in 2016, and later donated to the Apache Software Foundation. Since then, it's been used in mission-critical applications by Tencent, Splunk, and many others.
As With All Tech — It’s Not All Sunshine and Rainbows
Pulsar requires two additional systems: Apache BookKeeper and Apache ZooKeeper. Kafka "just" requires ZooKeeper. More systems can increase operational complexity. However, this separation is also what gives Pulsar its additional flexibility, and both Kafka and Pulsar require setup and maintenance either way.
There is no simple answer to when to choose Pulsar over Kafka, and the impact of your decision could be significant. In this post, I've shared some key differences that I hope can help you and your team make the right decision. If you want to learn more about Apache Pulsar, you can visit pulsar.apache.org or join the mailing lists or the Pulsar Slack channel. Feel free to reach out via Twitter @stadtlegende.
Published at DZone with permission of Maximilian Michels. See the original article here.