An Introduction to Stream Processing with Pulsar Functions
Get up and going with Pulsar Functions.
There is a lot of excitement about “serverless,” including debates about what exactly it means (and whether “serverless” is even a meaningful name, given that code still runs on a server somewhere). Regardless of your exact definition, the fundamental idea of serverless is to simplify life for developers by decoupling them from the infrastructure that executes the programming logic they create.
In traditional monolithic application development, developers spend a significant share of their time considering how their code integrates with the architecture and operation of the overall application. The promise of serverless, by contrast, is that developers can focus exclusively on their logic by virtue of a simple API and abstraction, leaving the infrastructure and operations teams to handle the environment that executes that logic. General-purpose serverless frameworks exist, but the same concepts can also be applied to more specific technologies.
Stream processing has traditionally been the domain of specialized stream processing engines (SPEs) such as Apache Storm, Apache Heron, and others. These SPEs offer sophisticated frameworks and execution models capable of carrying out a wide array of processing.
Their approach, based on functional programming concepts (e.g. map, flatmap, etc.) and compilation of processing flows into directed acyclic graphs (DAGs), has also been incorporated into many hybrid stream processing systems including Apache Spark Streaming, Apache Kafka Streams and Apache Flink. Although powerful and flexible, these frameworks are unfamiliar to the bulk of developers, cumbersome to learn, and complex for operations teams to manage in production.
Complexity and overhead have been a significant barrier to the adoption of streaming in data processing. However, new technologies are bringing serverless concepts to the domain of stream processing. In this article, we’ll look at how Pulsar Functions brings serverless concepts into stream processing inside the Apache Pulsar messaging system.
Why Pulsar Functions?
A significant portion of data processing use cases is simple and lightweight. Simple ETL (extract, transform, and load) operations, event-based services, real-time aggregation, and event routing are examples of use cases that do not require complex topologies or processing graphs.
Although these use cases can be implemented using an SPE, developers and users have been plagued by the following:
- Setting up a whole separate stream processing cluster is too complex and burdensome, especially given that users typically need only a very small subset of the features of the SPE.
- The cost of operations was unreasonably high for this simple processing — since full-blown SPEs have many features, they naturally have a high level of complexity in terms of deployment, monitoring, and maintenance.
- The API of full-blown SPEs is too complex for most simple use cases. Many SPEs offer APIs based on a functional programming model (e.g. map, flatmap, reduce, etc.). These APIs can be powerful tools, but for many use cases, especially when the user is not comfortable with the functional programming paradigm, they are too complex and unwieldy.
Pulsar Functions was created to make it much easier to develop and deploy processing logic on streaming data. It was developed with the following design goals:
- Simple API: Anyone with the ability to write a function in a supported language should be able to be productive in a matter of minutes.
- Multi-language: Support popular languages like Java, Scala, Python, Go, and JavaScript.
- Built-in state management: To simplify the architecture for the developer, functions should be able to keep state across invocations. The system should take care of persisting this state in a robust manner. Basic operations such as incrementBy, get, put, and update are a must (see the sketch after this list).
- Managed runtime: Developers should not need to worry about where and how to run their computation. They simply submit their computation and the system runs it.
- Automatic load balancing: The managed runtime should take care of assigning workers to the functions.
- Scale up and down: Users should be able to scale the number of function instances up and down using the managed runtime.
- Fault tolerance: The managed runtime should also run the developer’s computation in a reliable and fault-tolerant manner to minimize downtime.
- Multi-tenancy: Different computations should be isolated from each other. Developers should specify the amount of resources their computation needs, and the runtime will enforce these resource quotas.
- Flexible deployment model: Computations should be able to run as a thread, a process, a Docker container, etc. They should also support running on external schedulers like Kubernetes.
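To make the state-management goal concrete, here is a minimal sketch of a word-counting function. It assumes the context-aware PulsarFunction interface introduced later in this article and a counter-style state API on the context; the incrCounter method name and signature are assumptions, not a confirmed API.
import java.util.Arrays;
// A minimal sketch of built-in state management: counting word
// occurrences across messages. The runtime, not the function,
// is responsible for persisting the counters.
public class WordCountFunction implements PulsarFunction<String, Void> {
    @Override
    public Void process(String input, Context context) throws Exception {
        // Increment a per-word counter for each word in the message.
        Arrays.stream(input.split("\\s+"))
              .forEach(word -> context.incrCounter(word, 1));
        return null;
    }
}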
What Are Pulsar Functions?
Pulsar Functions is a lightweight processing framework that lives natively inside the Apache Pulsar messaging and streaming platform (see my earlier post for an introduction to Pulsar). It takes its inspiration not only from stream processing engines like Apache Heron and Apache Storm, but also from Function as a Service (FaaS) offerings like AWS Lambda and Google Cloud Functions.
Pulsar Functions enables you to write processing functions in common languages, such as Java, Python, and Go, and deploy those functions to a Pulsar cluster. No complex SDK is required. Pulsar handles setting up the function’s execution environment, providing resiliency, and ensuring that message delivery guarantees are met. The processing logic can be anything you can fit in a function, including data transformations, dynamic routing, data enrichment, analytics, and more.
The beauty of Pulsar Functions is that you get the benefits of an SPE without needing to deploy one. Pulsar Functions can handle many of the processing tasks often sent to an SPE without the complexity of deploying, managing, and developing against one. If you are already using an SPE or still need one, you can easily connect Pulsar to just about any stream processing engine you like (including Apache Spark Streaming, Apache Storm, Apache Heron, or Apache Flink).
How Pulsar Functions Works
A Pulsar Function consumes data from one or more Pulsar topics, processes the data using custom logic, and, if necessary, writes results to other Pulsar topics using a simple API. One or more instances of a Pulsar Function execute the processing logic defined by the user. A function can also use the provided state interface to persist intermediate results. Other functions can query that state to retrieve these results.
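That state can also be inspected from outside the cluster. As a hedged sketch, the pulsar-admin tool provides a querystate subcommand along these lines (the exact flag names are assumptions), which could read the counter that the word-count sketch shown earlier maintains for a given word:
$ bin/pulsar-admin functions querystate \
  --name myFunction \
  --key hello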
In its simplest form, you don’t even need an SDK to implement a Pulsar Function. In Java, for example, a user can simply implement the java.util.function.Function interface, which has just a single apply method. Here’s an example of a Pulsar Function that applies a simple transformation to a message (appending a ‘!’ to the string):
import java.util.function.Function;
public class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return String.format("%s!", input);
    }
}
If the user needs context-related information, such as the name of the function, the user can implement the PulsarFunction interface instead of the Java Function interface. Here’s an example:
public interface PulsarFunction<I, O> {
    O process(I input, Context context) throws Exception;
}
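For instance, here is a hypothetical function that uses the context to tag each message with the name of the function that processed it; the getFunctionName accessor is an assumption about what the Context interface exposes.
// A sketch of a context-aware function: appends the processing
// function's name to each message it handles.
public class TaggingFunction implements PulsarFunction<String, String> {
    @Override
    public String process(String input, Context context) throws Exception {
        return String.format("%s (processed by %s)", input, context.getFunctionName());
    }
}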
Pulsar Functions can be deployed in multiple configurations, as we will discuss in detail next.
Pulsar Functions Deployment Options
Pulsar Functions are run by executors called instances. A single instance executes one copy of the function. Pulsar Functions have parallelism built in because a function can have many instances, the number of which can be set in the function’s configuration.
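For example, using the create command shown later in this article, the instance count might be set with a parallelism flag; the flag name is an assumption, while the other options are taken from the examples below.
$ bin/pulsar-admin functions create \
  --jar examples/api-examples.jar \
  --className org.apache.pulsar.functions.api.examples.ExclamationFunction \
  --name myFunction \
  --parallelism 3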
To maximize deployment flexibility, Pulsar Functions offer several execution environments to support multiple deployment options and a number of runtimes to execute functions written in different programming languages. The following execution environments are currently supported:
| Runtime | Description |
| --- | --- |
| Process runtime | Each instance is run as a process. |
| Kubernetes/Docker runtime | Each instance is run as a Docker container. |
| Threaded runtime | Each instance is run as a thread. This type is applicable only to Java instances, since the Pulsar Functions framework itself is written in Java. |
Each execution environment incurs different costs and provides different isolation guarantees.
Running a Pulsar Function
The easiest way to run a Pulsar Function is to instantiate a runtime and a function and run them locally (local run mode). A helper command line tool makes this very simple. In local run mode, the function runs as a standalone runtime that can be monitored and controlled by whatever process, Docker container, or thread control mechanisms are available.
Users can spawn these runtimes across machines manually or use sophisticated schedulers like Mesos/Kubernetes to distribute them across a cluster. Below is an example of the command to start a Pulsar function in “local run” mode:
$ bin/pulsar-admin functions localrun \
--inputs persistent://sample/standalone/ns1/test_src \
--output persistent://sample/standalone/ns1/test_result \
--jar examples/api-examples.jar \
--className org.apache.pulsar.functions.api.examples.ExclamationFunction
A user can also run a function inside the Pulsar cluster alongside the broker. In this mode, users can "submit" their functions to a running Pulsar cluster, and Pulsar will take care of distributing them across the cluster and monitoring and executing them.
This model allows developers to focus on writing their functions and not worry about managing a function’s life cycle. Below is an example of submitting a Pulsar Function to be run in a Pulsar cluster:
$ bin/pulsar-admin functions create \
--inputs persistent://sample/standalone/ns1/test_src \
--output persistent://sample/standalone/ns1/test_result \
--jar examples/api-examples.jar \
--className org.apache.pulsar.functions.api.examples.ExclamationFunction \
--name myFunction
Another option is to place the entire configuration for the function in a YAML file, like this:
inputs: persistent://sample/standalone/ns1/test_src
output: persistent://sample/standalone/ns1/test_result
jar: examples/api-examples.jar
className: org.apache.pulsar.functions.api.examples.ExclamationFunction
name: myFunction
If you configure a function via YAML, you can use this much simpler create command:
$ bin/pulsar-admin functions create \
--configFile ./my-function-config.yaml
Processing Guarantees
Pulsar Functions offer the following processing guarantees, which can be specified on a per-function basis:
- At most once
- At least once
- Effectively once
Effectively once processing is achieved using a combination of at-least-once processing and server-side message deduplication. This means that a state update may be attempted twice, but it will be applied only once; any duplicate is discarded on the server side.
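A guarantee is chosen when the function is created. As a sketch, you might extend the create command from earlier with a processing-guarantees flag; the flag name and value format are assumptions based on the CLI conventions shown above.
$ bin/pulsar-admin functions create \
  --inputs persistent://sample/standalone/ns1/test_src \
  --output persistent://sample/standalone/ns1/test_result \
  --jar examples/api-examples.jar \
  --className org.apache.pulsar.functions.api.examples.ExclamationFunction \
  --name myFunction \
  --processingGuarantees EFFECTIVELY_ONCE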
Conclusion
From this introduction, I hope I’ve piqued your interest in Pulsar Functions and shown you how their extended capabilities allow you to use Pulsar as a unified system for processing data streams. There are many more capabilities and possibilities with Pulsar Functions; you can read more about them on the Apache Pulsar website.
In future posts, we will compare Pulsar Functions to other frameworks, such as Apache Kafka Streams and Apache Flink.
Stay tuned for more exciting blogs on Apache Pulsar!