23. April 2021
7 min

An Introduction to Event-Stream-Processing platforms

Apache Kafka is currently the dominant leader in the field of Event Stream Processing. However, there are many up-and-coming technologies which learns from the limitations of Kafka and provides a similar or even richer set of features. As the other platforms are increasingly gaining their momentum, an important question arises: which platform to choose from these competitors? This blog series will try to answer this question with a comprehensive evaluation of the key features of these platforms.

Let’s start the series by first trying to understand what an Event-Stream-Processing platform is and introducing the platforms to be considered along with their general concepts.

Why do we need an Event-Stream-Processing platform?

When building distributed systems or microservices applications, one of the main concerns is how to handle the exchange of data between systems and among services. Event-Driven-Architecture (EDA), which promotes the idea of ​​sharing data in the form of events, is one of the popular approaches. This approach has many benefits as well as numerous challenges. One of the main advantages of using events is that it ensures the minimum coupling between applications since the exchange of events is usually asynchronous. Moreover, instead of having the control flow dictated from the source component, with events, inversion of control is achieved when any application can listen to the events and react independently. This provides a good basis to scale and expand the system with more functionalities more freely. For a deeper look into the EDA topic, here are some interesting articles for reference:

There are different levels of utilization of events. At the highest level, events are used as the primitive form of data store for the entire system. All state changes in the system are recorded as events and all components rely on this single source of truth to derive the current state of the system. This idea arose and was named differently in various domains. In the enterprise software development community, it is usually known as the combination of event sourcing and CQRS patterns. In the world of internet companies, the very same idea is referred to as event stream processing. If you are interested, Martin Kleppmann wrote an amazing blogpost to break down these different terms from different communities and extract the similarities between them. The essential idea is placing events streams in the center of the infrastructure as the data backbone. These streams can be continuously processed, enriched, and transformed to de-normalize data into different representations which match the access patterns of different specialized data systems.

In order to build such systems, an Event-Stream-Processing (ESP) platform comes in handy.

Functionalities of an ESP platform

As the events streams will serve as the integration point for all data systems and applications in the infrastructure, the platform must have a number of important functionalities:

  • First and foremost, an ESP platform must have a persistence layer for event storage.
  • Means to publish and consume events with various messaging patterns must be supported by the platform.
  • The platform should provide or facilitate the realization of the abstraction of event streams on top of the low-level event storage and allow the high-level processing on this abstraction. Out-of-the-box data integration solution should also be supported to quickly connect various data systems into the platform and establish the data flow between them with minimum setup and maintenance.
  • Finally, as an important part of any system operating on a significant scale, monitoring tools must be provided by the platform.

Now that the essential functionalities of an ESP platform are covered, let’s quickly go through some of the many ESP platforms currently available. As a side note, only open-source platforms are considered because they can serve as a good starting point when someone wants to go into the world of Event Stream Processing before making more investment into the commercially licensed platforms. Moreover, there are many fully-managed service of the open-source platforms on the market. Therefore, a good overview of these open-source platforms could be beneficial for the long run.

Of course, there is Apache Kafka, which was first created at LinkedIn and now is the top open-source ESP platform. There is Apache Pulsar , which was initially developed at Yahoo! and now is one of the main competitors of Kafka. RocketMQ is another popular name which is developed mostly by the company Alibaba. There are also a number of open-source projects on ESP from the Cloud Native Computing Foundation (CNCF). For example, there are NATS Streaming which is developed on top of the NATS messaging system by the company Synadia,  Pravega which is currently a sandbox project of CNCF.  

Within the scope of this series, only three ESP platforms: Apache Kafka, Apache Pulsar and NATS Streaming, are considered.

Apache Kafka

If you are unfamiliar with Kafka, you are welcome to check out our separated blog series of Kafka 101 which covers most of the basic concepts of Kafka and could serve as a good starting point for new users.

Apache pulsar

Apache Pulsar is a distributed messaging and streaming platform which was originally created at Yahoo! and later open-sourced as a project of the Apache Software Foundation.

Pulsar is usually set up and runs as cluster with three required components:

  • Pulsar broker: this component is responsible for serving requests from clients. Unlike Kafka, the Pulsar broker is stateless. Whenever it receives a request from client, it will in turn call the storage layer in the cluster to retrieve necessary data and information.
  • BookKeeper: this is an open-source storage service which persist messages or events in distributed logs called ledgers. Pulsar utilizes BookKeeper as its storage layer.
  • Zookeeper: Pulsar relies on Zookeeper to manage all the metadata of both brokers and BookKeeper nodes (usually called Bookies) in the cluster.

Messages published to Pulsar are organized into different topics, which are the logical storage abstractions of Pulsar on top of BookKeeper ledgers. A Pulsar topic can be logically segmented into multiple partitions. Internally, a partition is also a normal Pulsar topic which is managed transparently to users. The partitioning allows multiple client instances to jointly consume a Pulsar topic for scalability in event streaming use cases. This will be elaborated in more detail in the subsequent blogpost on the message delivery feature of the platforms.

Pulsar provides three main client APIs:

  • Producer API: this is used to publish messages to Pulsar.
  • Consumer API: this is used for consuming messages from Pulsar. The consumption model of this API is quite similar to traditional messaging systems, consumer must create a subscription to a Pulsar topic and send acknowledgement back to Pulsar after it has consumed successfully a message. The Pulsar broker will keep track of the consumption status of each subscription and actively push unacknowledged messages to consumers.
  • Reader API: this API is also used for consuming messages from Pulsar. However, unlike Consumer API, reader does not have to create subscription and acknowledge messages. Client using reader API can freely jump to any point on a Pulsar topic to consume messages.

In addition to the low-level APIs, Pulsar also support two additional higher-level tools for stream processing and data integration:

  • Pulsar Functions: with Pulsar Functions, users can implement simple processing logic as function and deploy it to the Pulsar cluster similarly to AWS Lambda. Each time a message is received on Pulsar, the function will automatically be invoked to transform the input and/or generate new message on the output topic.
  • Pulsar IO connectors: this tool provides a convenient way for data integration with other data systems. Users can implement and deploy connector to Pulsar cluster to import and export data in and out of Pulsar. In addition, there are also numerous off-the-shelf connectors for many common data systems such as PostgreSQL, HDFS, Cassandra and even Kafka.

Apart from the open-source release, there are also a few companies which provide Pulsar as a fully-managed service such as StreamNative, Splunk.

NATS Streaming

This is an open-source ESP platform built on top of the NATS messaging system. Therefore, this can be a good choice for organization which has NATS in the current infrastructure and wants to quickly enable ESP. Although at the moment NATS Streaming is not actively developed anymore and soon will be replaced by another new project named JetStream, the new project is also built on top of NATS and has similar concept to NATS Streaming. By evaluating NATS Streaming and addressing its current limitations, the evaluation result here could also be applicable to the new platform.

Normally, multiple server instances are grouped together into a running NATS Streaming cluster. Internally, a NATS Streaming server is made up from a normal NATS server which handles the delivery of messages and a streaming module which is in charge of persistence functionality. When a message is sent to NATS Streaming, it first traverses through the NATS server and then is received and persisted by the listening streaming module. When a client wants to consume a message, the streaming module retrieves the message from the persistence layer and sends it to client via the NATS server. There is no direct connection between the client and the streaming module. As a result, when a message is published and the connection between the NATS server and the streaming module is disrupted for some reasons, the message will be lost since the messaging layer of NATS server is non-persistent. This is one of the factors which directly affects the message delivery semantics on NATS Streaming. This point will be elaborated in more detail in the blogpost about message delivery functionality.

Regarding the persistence layer, users have the flexibility to attach different durable storage systems to the streaming server. Currently, NATS Streaming supports out-of-the-box two types of persistence layer, namely, messages (or events) can be persisted directly on the file system of the server instance or they can be stored in a relational database.

Messages on NATS Streaming are organized into different channels. NATS Streaming provides client API to publish and subscribe to messages on its channels. Unlike Kafka and Pulsar, a NATS channel is the smallest consumption unit cannot be further partitioned. When a client subscribes to a NATS channel, the server will maintain the consumption status of each subscriber and subscriber must acknowledge with the server after consuming each message. Otherwise, the server will keep re-pushing unacknowledged messages to client.

Summary

So there you go! After this first blog of the series, we can now answer the questions why an ESP platform is needed and what are the essential functionalities a platform should have. The overview of Kafka and Pulsar should also lay a good basis for the more in-depth evaluation of these platforms later. The next blog posts in this series will sequentially go through 4 main point of functionalities on each considered platform.

Comment article