Bring Streaming to Apache Cassandra with Apache Pulsar
Apache Pulsar® is an open-source, distributed messaging and streaming platform that’s easy to deploy, simple to scale, and packed with developer-friendly APIs. In this article, you will learn how to stream from Pulsar to Apache Cassandra®, the powerful NoSQL database designed to support data-heavy applications in the cloud.
One such technology is Apache Pulsar®, an open-source, distributed messaging and streaming platform that’s easy to deploy, simple to scale, and packed with developer-friendly APIs. So the next question is: how can you stream from Pulsar to Apache Cassandra®, the powerful NoSQL database designed to support data-heavy applications in the cloud?
Join our beginner-friendly Pulsar workshop on YouTube and learn how to connect Pulsar with Cassandra for streaming! In this post, we’ll set the scene with an introduction to Pulsar and guide you through four hands-on exercises where you’ll use these free, cloud-native technologies: Katacoda, Kesque, GitPod, and DataStax Astra DB. Each exercise will also be linked to the step-by-step instructions on the DataStax Developers GitHub wiki.
Let’s dig in.
A Quick Introduction to Apache Pulsar
For a bit of background, Pulsar was originally developed by Yahoo! and open-sourced in 2016 as a cloud-native, distributed messaging and streaming platform. Today, it’s a top-level Apache Software Foundation project and is used by dozens of companies worldwide, including Comcast, Verizon Media, and (yours truly) DataStax.
Pulsar is largely favored by enterprises and developers for its superior resilience and lightweight compute process, which makes Pulsar ideal for real-time apps and streaming data between sensors and IoT devices. On a slightly more technical level, anyone using Pulsar will likely gush over the following features:
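Pulsar supports both publish/subscribe messaging and queuing. Consumers attach to a topic through a named subscription, and the subscription type decides how messages are delivered. Three common types are:
- Shared subscription: Pulsar distributes the messages on that topic across all of the subscription’s consumers, so each message goes to only one of them.
- Exclusive subscription: only one consumer can attach to the subscription, and it receives every message. Consumers that each need their own copy of the data use separate subscriptions.
- Failover subscription: as with an exclusive subscription, a single consumer receives the messages, but if that consumer fails, Pulsar sends the data to a backup consumer.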
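To make the difference between the three types concrete, here is a minimal in-memory sketch. It is a toy model, not the Pulsar client API: the `Subscription` class, its methods, and the consumer names are all hypothetical, and the failover order is simplified to "whoever attached first".

```python
from collections import defaultdict


class Subscription:
    """Toy model of one named subscription on a topic (not the Pulsar API)."""

    MODES = {"exclusive", "shared", "failover"}

    def __init__(self, mode):
        if mode not in self.MODES:
            raise ValueError(f"unknown subscription mode: {mode}")
        self.mode = mode
        self.consumers = []                  # consumer names, in attach order
        self.delivered = defaultdict(list)   # consumer name -> messages received
        self._next = 0                       # round-robin pointer for shared mode

    def attach(self, name):
        # Exclusive: a second consumer is rejected outright.
        if self.mode == "exclusive" and self.consumers:
            raise RuntimeError("exclusive subscription already has a consumer")
        self.consumers.append(name)

    def detach(self, name):
        self.consumers.remove(name)

    def dispatch(self, message):
        if not self.consumers:
            raise RuntimeError("no consumer attached")
        if self.mode == "shared":
            # Shared: each message goes to exactly one consumer, round-robin.
            target = self.consumers[self._next % len(self.consumers)]
            self._next += 1
        else:
            # Exclusive and failover: one active consumer receives everything.
            # In failover mode the next consumer in line takes over once the
            # active one detaches (simplified here to attach order).
            target = self.consumers[0]
        self.delivered[target].append(message)
        return target


shared = Subscription("shared")
shared.attach("a")
shared.attach("b")
for msg in ["m1", "m2", "m3", "m4"]:
    shared.dispatch(msg)
print(dict(shared.delivered))   # {'a': ['m1', 'm3'], 'b': ['m2', 'm4']}

failover = Subscription("failover")
failover.attach("primary")
failover.attach("backup")
failover.dispatch("m1")
failover.detach("primary")      # simulate the active consumer failing
failover.dispatch("m2")
print(dict(failover.delivered))  # {'primary': ['m1'], 'backup': ['m2']}
```

In a real cluster the broker also tracks acknowledgements and redelivery, which this sketch leaves out on purpose.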
Additionally:
- Designed for multi-tenancy: Pulsar was built from the ground up as a multi-tenant system, which allows for a more cost-effective deployment that you can share across multiple teams and across multiple applications. This also applies to the next feature.
- Seamless geo-replication: With built-in geo-replication, Pulsar keeps your data safe by easily replicating persistently stored messages across multiple Pulsar clusters.
- Better scaling: Pulsar separates compute from storage, which makes it simpler for developers to expand capacity to hundreds of nodes.
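Multi-tenancy shows up directly in how topics are named: a full Pulsar topic name has the form `persistent://tenant/namespace/topic` (or `non-persistent://...`), and a bare topic name is expanded to the `public` tenant and `default` namespace. The sketch below parses that format with plain Python. The `TopicName` type, the `parse_topic` helper, and the `team-a`/`billing` names are assumptions made up for illustration.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str

    def __str__(self):
        return f"{self.domain}://{self.tenant}/{self.namespace}/{self.topic}"


def parse_topic(name: str) -> TopicName:
    """Split a topic name into its parts; bare names use public/default."""
    if "://" not in name:
        # Pulsar expands a short name to persistent://public/default/<name>.
        return TopicName("persistent", "public", "default", name)
    domain, rest = name.split("://", 1)
    if domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unknown topic domain: {domain}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic, got: {rest}")
    return TopicName(domain, *parts)


print(parse_topic("orders"))
print(parse_topic("persistent://team-a/billing/invoices").tenant)
```

Two teams can then share one cluster while keeping their topics apart, for example under `persistent://team-a/...` and `persistent://team-b/...`.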
These features make Pulsar a good fit not only for Cassandra but also for any cloud-native architecture. Speaking of which, let’s take a look at Pulsar’s unique architecture.
Understanding the Architecture Behind Pulsar
Pulsar is a tiered, distributed system composed of three components:
- Apache BookKeeper®: An open-source storage service that handles persistent storage of messages.
- Apache ZooKeeper®: An open-source service that stores metadata and handles coordination tasks for the cluster.
- Brokers: Stateless components that mainly handle and load-balance messages between producers and consumers, and store messages in BookKeeper instances (bookies).
In the middle of the diagram, we have the Pulsar brokers themselves, which are what talk to the producers and consumers. These basically take a logical model of topics and messages and turn them into storage that can be assigned to the bookies.
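As a rough picture of how messages end up spread over bookies, the sketch below mimics BookKeeper’s round-robin striping: each entry is written to a few consecutive bookies in the ensemble, starting at a position derived from the entry number. The `write_set` function and the bookie names are hypothetical, and the real system also handles acknowledgements and bookie failures, which are ignored here.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Return the bookies that would store a given entry.

    ensemble      -- ordered list of bookies available to the ledger
    write_quorum  -- number of bookies each entry is written to
    """
    if not 1 <= write_quorum <= len(ensemble):
        raise ValueError("write_quorum must be between 1 and the ensemble size")
    start = entry_id % len(ensemble)
    # Take write_quorum consecutive bookies, wrapping around the ensemble.
    return [ensemble[(start + i) % len(ensemble)] for i in range(write_quorum)]


bookies = ["bookie-1", "bookie-2", "bookie-3"]
for entry_id in range(4):
    print(entry_id, write_set(entry_id, bookies, write_quorum=2))
```

Because the brokers hold no message data themselves, adding bookies grows storage and adding brokers grows serving capacity, independently of each other.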
Now that you have some background, let’s move on to the workshop where you’ll learn how to use Pulsar and get familiar with the free technologies you can use to simplify your streaming setup.
The Workshop: Four Labs. One Mission.
In this workshop, we give you four simple “labs” that will show you how to connect Pulsar with Cassandra for streaming.
There’s nothing to install and no software to pay for, so flex your coding fingers and get a head start with each lab description below.
Lab 1: Set up Apache Pulsar
In this first lab, you’ll learn how to:
- Install Apache Pulsar from the tarball
- Configure infrastructure components in Pulsar
- Create a topic to store messages
- Read and write messages on the topic
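If you want a preview of what those steps look like, they map roughly onto Pulsar’s bundled command-line tools. The sketch below only assembles and prints the commands; it doesn’t run them. The install directory, topic name, subscription name, and message are assumptions, and the lab instructions may use different values or extra steps.

```python
import shlex


def lab_commands(home, topic, subscription, message):
    """Build the command lines for a local, single-node Pulsar session."""
    return [
        # Start broker, BookKeeper, and ZooKeeper together in one process.
        [f"{home}/bin/pulsar", "standalone"],
        # Create a (non-partitioned) topic to hold the messages.
        [f"{home}/bin/pulsar-admin", "topics", "create", topic],
        # In a second terminal: subscribe and wait for a message.
        [f"{home}/bin/pulsar-client", "consume", topic, "-s", subscription],
        # In a third terminal: publish a message to the topic.
        [f"{home}/bin/pulsar-client", "produce", topic, "--messages", message],
    ]


commands = lab_commands(
    home="apache-pulsar",                          # hypothetical tarball directory
    topic="persistent://public/default/my-topic",  # hypothetical topic
    subscription="first-subscription",             # hypothetical subscription
    message="hello-pulsar",
)
for command in commands:
    print(shlex.join(command))
```

The consumer is listed before the producer because a brand-new subscription typically starts at the latest position, so it only sees messages published after it subscribes.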
To do all this without installing anything, you’ll be leveraging Katacoda, an interactive platform for software engineers to learn and experiment with different technologies.
Follow the instructions for Lab 1 on GitHub to get started.
Lab 2: Produce and Consume Messages With Kesque
In this second lab, you’ll meet Kesque, a fully managed cloud messaging service powered by Pulsar. As a side note, DataStax acquired Kesque and now includes it as part of Luna Streaming, which is a completely free, production-ready distribution of Pulsar with handy admin and monitoring tools.
For now, you will simply use Kesque itself to:
- Create a topic
- Use the free IDE GitPod to create a producer and consumer in Java
- Create a message schema using the Kesque UI
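A message schema is simply an agreement between producer and consumer about which fields a message carries and what types they have. The sketch below shows the idea with a JSON-encoded record in plain Python. The lab itself uses Java and the Kesque UI, so treat this only as an illustration: the `SensorReading` type and its fields are made up, and no Pulsar client is involved.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class SensorReading:          # hypothetical message type
    sensor_id: str
    temperature: float
    recorded_at: int          # epoch milliseconds


def _matches(value, expected_type):
    """Check a decoded JSON value against a declared field type."""
    if isinstance(value, bool):
        return expected_type is bool      # bool is an int subclass; keep it out
    if expected_type is float:
        return isinstance(value, (int, float))
    return isinstance(value, expected_type)


def encode(reading: SensorReading) -> bytes:
    """Producer side: serialize the record to JSON bytes."""
    return json.dumps(asdict(reading)).encode("utf-8")


def decode(payload: bytes) -> SensorReading:
    """Consumer side: parse the bytes and reject anything off-schema."""
    data = json.loads(payload.decode("utf-8"))
    expected = {f.name: f.type for f in fields(SensorReading)}
    if not isinstance(data, dict) or set(data) != set(expected):
        raise ValueError("payload fields do not match the schema")
    for name, expected_type in expected.items():
        if not _matches(data[name], expected_type):
            raise TypeError(f"field {name!r} has the wrong type")
    return SensorReading(**data)


payload = encode(SensorReading("s-1", 21.5, 1700000000000))
print(decode(payload))
```

With a schema registered on the topic, this kind of check happens on the platform side instead of in every consumer.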
To give you a better understanding, here’s a simple diagram of how all these technologies will work together.
Lab 3: Connect Cassandra With Astra DB
Now it’s time to connect Pulsar to Cassandra and create a database where you can store the messages sent from Pulsar/Kesque. But instead of installing Cassandra and dealing with all the operational complexity that comes with it, you can just use Astra DB and then connect it to Kesque.
Astra DB is a multi-cloud database-as-a-service (DBaaS) built on Cassandra. It’s the simplest way to benefit from Cassandra’s robust and highly scalable architecture — without the headache of managing the details yourself. So, in this lab you’ll:
- Sign up for a free Astra DB account
- Create a database in Astra DB
- Create a table to store the data sent from Pulsar
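The table you create needs a column for every message field you want to keep, plus a primary key. As a sketch of what that looks like, the helper below builds a CQL `CREATE TABLE` statement from a column list. The helper itself, the `demo` keyspace, the `readings` table, and the columns are all assumptions; the lab defines its own table.

```python
def create_table_cql(keyspace, table, columns, partition_key, clustering_key=()):
    """Build a CQL CREATE TABLE statement.

    columns        -- dict of column name -> CQL type, in display order
    partition_key  -- columns that decide which node stores a row
    clustering_key -- columns that order rows inside a partition
    """
    for key in (*partition_key, *clustering_key):
        if key not in columns:
            raise ValueError(f"primary key column {key!r} is not defined")
    column_lines = [f"    {name} {cql_type}," for name, cql_type in columns.items()]
    key_parts = ["(" + ", ".join(partition_key) + ")", *clustering_key]
    primary_key = f"    PRIMARY KEY ({', '.join(key_parts)})"
    header = f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} ("
    return "\n".join([header, *column_lines, primary_key, ");"])


statement = create_table_cql(
    keyspace="demo",                      # hypothetical keyspace
    table="readings",                     # hypothetical table
    columns={
        "sensor_id": "text",
        "recorded_at": "timestamp",
        "temperature": "double",
    },
    partition_key=["sensor_id"],
    clustering_key=["recorded_at"],
)
print(statement)
```

You could paste the printed statement into a CQL console to create the table, adjusting names and types to match your messages.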
Follow the instructions for Lab 3 on GitHub.
Lab 4: Stream From Pulsar to Cassandra
In this fourth lab, you’ll finally start streaming. Here’s the big picture of what you’ve set up so far and what’s next.
So, in this last lab you will:
- Create a sink in Pulsar/Kesque
- Connect the sink with Astra DB
- Watch the messages stream into your table in Astra DB
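Conceptually, a sink is a small loop: take each message off the topic, map its fields to table columns, and write a row. The sketch below models that loop in memory, assuming JSON messages. It is not the actual connector or its configuration format; the function names, the `demo.readings` table, and the field-to-column mapping are hypothetical.

```python
import json


def build_insert(table, mapping):
    """Build a parameterized INSERT for the mapped columns."""
    columns = ", ".join(mapping)
    placeholders = ", ".join("?" for _ in mapping)
    return f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"


def to_row(payload, mapping):
    """Pick the mapped fields out of one JSON message, in column order."""
    value = json.loads(payload)
    return tuple(value[field] for field in mapping.values())


def run_sink(messages, table, mapping, execute):
    """Write every message as one row; returns the number of rows written."""
    statement = build_insert(table, mapping)
    written = 0
    for payload in messages:
        execute(statement, to_row(payload, mapping))
        written += 1
    return written


# Column name -> message field (hypothetical names).
mapping = {"sensor_id": "id", "temperature": "temp"}
messages = [
    b'{"id": "s-1", "temp": 21.5}',
    b'{"id": "s-2", "temp": 19.0}',
]
rows = []   # stands in for the database table
count = run_sink(messages, "demo.readings", mapping,
                 execute=lambda statement, params: rows.append(params))
print(count, rows)   # 2 [('s-1', 21.5), ('s-2', 19.0)]
```

The managed sink in the lab does the same job for you, so all you provide is the connection details and the mapping.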
Follow the instructions for Lab 4 on GitHub.
Follow the Full Workshop on YouTube and Keep Learning
By the end of this workshop, you will have used completely free technologies to successfully stream from Pulsar to Cassandra. Congratulations!
Remember: if you need more guidance during this workshop, you can follow the whole thing step-by-step with the workshop video on YouTube (skip to minute 19 for the labs).
Published at DZone with permission of Cedrick Lunven. See the original article here.