State of the art: stream processing (part 1)
/ˌsteɪt əv ðɪ ˈɑːt/
"The most recent stage in the development of a product, incorporating the newest technology, ideas, and features."
Are you trying to make sense of the numerous data stream processing frameworks out there? Look no further! In this article, we dive into a comparison of popular options such as Flink, ksqlDB, Storm, Faust, Apex, Estuary, and many, many more.
This is part 1 of a series of 3 articles. A compiled table of the mentioned frameworks and libraries will be made available after the third article!
Stream processing
There are a lot of definitions and alternative terms floating around, but here's a longer definition to make sure we are on the same page:
Stream processing is a method of handling and analyzing data streams in real-time. It involves detecting, collecting, and processing events or "data in motion" as they happen, instead of working with data that is stored in a database or file. This allows for real-time decision-making and actions to be taken based on the data. In terms of data engineering, event processing involves designing and building systems that can handle and process high volumes of data streams with low latency, while also providing fault tolerance and scalability. This is typically done with the use of specialized frameworks and tools that are designed specifically for event processing.
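To make "data in motion" a bit more concrete, here is a minimal, framework-free Python sketch (the event source and alert threshold are made up for illustration): each event is handled the moment it arrives, and the state is updated incrementally instead of re-scanning a stored dataset.

```python
import time
from typing import Iterator

def sensor_events() -> Iterator[dict]:
    """Pretend event source: in a real system this would be Kafka, Pulsar, etc."""
    for i in range(10):
        yield {"sensor": "s1", "value": i * 3.5}
        time.sleep(0.1)  # events trickle in over time

# Stateful processing: keep a running aggregate instead of loading
# the full dataset from storage and computing the answer afterwards.
count, total = 0, 0.0
for event in sensor_events():
    count += 1
    total += event["value"]
    running_avg = total / count
    if running_avg > 10:  # react the moment the condition is met
        print(f"alert: running average {running_avg:.1f} after {count} events")
```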
Most of the frameworks and libraries used for event processing can do a whole lot more, sometimes even batch data processing - for this article we will still include them, but only focus on the streaming component.
Categorizing
Streaming data processing frameworks and libraries can be categorized in several ways; I tried to focus on the main properties potential users evaluate them by. Naturally, there is a lot of overlap between the categories for many engines, so keep that in mind.
- Programming languages: Frameworks and libraries are written in specific programming languages, such as Java for Apache Kafka or Python for Faust. This matters because a framework's language often correlates with the size of its pool of potential contributors, especially for OSS.
- Exposed APIs: Frameworks and libraries provide specific APIs for processing streaming data, such as the Postgres wire protocol-compatible API of RisingWave, the Apache Kafka Streams API, or PyFlink and Flink SQL. This is one of the bigger factors when shopping around for event processing frameworks, as most often we would like them to integrate natively into our existing stack instead of building out new build processes, developer tooling, etc. for new APIs.
- Use cases: Frameworks and libraries are designed for specific use cases, such as real-time analytics, event-driven architectures, and data pipelines. Some use cases, for example, can do perfectly fine without the technical overhead of stateful processing, while others depend on it entirely.
- Integrations: Modern frameworks and libraries are designed to work with specific data sources and targets, such as Kafka or AWS SQS on the source side, and most of the popular data warehouses/lakes, like GCP BigQuery or AWS S3, on the sink side.
- Deployment: Frameworks and libraries can be deployed in different ways, such as on-premises, in the cloud, or as managed services. Having the option to self-host can be important for certain organizations for cost and regulatory purposes, while other companies are fine with paying a bit more and not having to care about day-to-day operations.
- Processing model: Frameworks and libraries support different processing models, like batch, real-time, micro-batch, etc. This is a more technical factor, but it can heavily influence the choice: extending the same engine's real-time capabilities to batch processing via a Kappa architecture can greatly simplify engineering requirements.
Let's take a look at the biggest frameworks for each category in detail!
Programming languages
The King
Java is considered the "King" of programming languages in the event processing space (and the Platform Engineering side of Data Engineering as a whole, for now), primarily due to the popularity and wide adoption of Apache Kafka. Kafka is a distributed streaming platform written in Java, and it has become the de facto standard for event streaming in many organizations. As a result, many other streaming data processing frameworks and libraries have been built on top of or in conjunction with Kafka, and thus are also written in Java.
The current biggest (in terms of corporate adoption and probably even OSS contributions) stream processing framework, Flink, is also written in Java, which allows the two to integrate tightly and strengthens the language's hold on the domain.
The popularity of Java as a programming language for event processing also means that there are a large number of potential contributors to open-source projects written in Java. The Java community is one of the largest and most active in the software development world, with a wide range of developers, engineers, and researchers contributing to various projects. This results in a wealth of knowledge and expertise, as well as a large number of contributors, which can lead to more robust and feature-rich frameworks and libraries.
Furthermore, because Java is a widely used and well-established language, there is a large pool of experienced Java developers available to work with these frameworks and libraries, making it easier for organizations to find and hire the talent they need to build and maintain their event processing systems.
Another amazing piece of Java software is definitely worth a mention: Hazelcast, a distributed computation and storage platform for consistently low-latency querying, aggregation, and stateful computation against event streams and traditional data sources. They also offer client libraries in various languages, which let us easily interact with the internal low-latency processing engine, Jet.
Runners-up
While Java is an amazing language for robust real-time systems, other languages are present in the space; for example, Redpanda, a Kafka API-compatible streaming data platform written in C++, has been implementing its own transformation framework built on WASM.
The quick rise in popularity of Rust also brings us incredible pieces of software such as Materialize and RisingWave. Both of these frameworks take the approach of building complete streaming data platforms from the ground up and exposing a familiar SQL interface to ease adoption.
There are also frameworks implemented in higher-level languages, such as Python; for example, Faust, initially created by Robinhood, is not just a wrapper around some other library - it is a pure Python implementation! It takes a lot of inspiration from Kafka Streams but gives it a Pythonic twist, which is a breath of fresh air after using so many Java API wrappers.
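To give you a taste of that Pythonic twist, here's a hedged sketch of a Faust agent - the broker address, topic name, and Order model are illustrative assumptions, not anything from Faust itself:

```python
import faust

# Broker address and topic name are assumptions for illustration.
app = faust.App("orders-app", broker="kafka://localhost:9092")

class Order(faust.Record):
    account_id: str
    amount: float

orders_topic = app.topic("orders", value_type=Order)

@app.agent(orders_topic)
async def process_orders(orders):
    # An agent is an async stream processor, Faust's analogue of a
    # Kafka Streams processor, consuming the topic event by event.
    async for order in orders:
        print(f"account {order.account_id} spent {order.amount}")

if __name__ == "__main__":
    app.main()
```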
Another Rust & Python implementation (and a personal favorite!) is Bytewax, which, at its core, is a distributed processing engine that uses the dataflow computational model, implemented in Rust and exposed via a comfy Python library. Bytewax boasts a wide range of use cases, from ELT pipelines to online machine-learning workloads.
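And here's roughly what a Bytewax dataflow looks like - Bytewax's API has been evolving quickly between releases, so treat the exact module paths below as assumptions tied to one release line; the dataflow shape is the point:

```python
# Module paths reflect one Bytewax release line and may differ in yours.
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingInput
from bytewax.connectors.stdio import StdOutput

flow = Dataflow()
flow.input("inp", TestingInput(range(5)))  # toy input instead of, say, Kafka
flow.map(lambda x: x * x)                  # one step in the dataflow graph
flow.output("out", StdOutput())
# Run with: python -m bytewax.run this_module:flow
```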
To round out the scene, we also have Nuclio, a serverless compute engine, and Benthos - a high-performance and resilient stream processor - both written in Go!
Exposed APIs
The second section we'll look into today is about the APIs the frameworks expose to us, the users, as abstractions over their internal mechanisms.
The usual suspects
Exposed APIs are one of the most important factors to consider when evaluating event processing frameworks and libraries, as they greatly impact the ease of integration into an organization's existing codebase. A well-designed API can significantly simplify the process of integrating a new framework into an existing system, reducing the need for additional development and testing.
For example, Apache Flink provides several different APIs for different programming languages, including a Java API, PyFlink for Python, and a SQL interface. This allows developers to use the language and tools they are most comfortable with and enables them to easily integrate Flink into their existing codebase. There is some feature disparity between the APIs, but the very active developer community is working on smoothing these differences out.
The SQL interface gives analysts an opportunity to dip their toes into an otherwise very engineering-heavy environment, which can be a great enabler of real-time analytics! For now, the greater good is mass adoption in the streaming analytics space, so features like this are a big win.
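To give a feel for how the Python and SQL surfaces meet, here's a hedged PyFlink sketch using the Table API and Flink's built-in datagen connector (the table schema is made up for illustration):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming-mode table environment.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Flink SQL, executed from Python: a throwaway source backed by the
# built-in 'datagen' connector (the schema is illustrative).
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id BIGINT,
        url STRING
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5'
    )
""")

# The same environment also serves the relational API.
result = t_env.sql_query(
    "SELECT user_id, COUNT(url) AS click_count FROM clicks GROUP BY user_id"
)
result.execute().print()
```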
Kafka Streams, a stream processing library built on top of Kafka, also provides a simple and easy-to-use Java API, allowing developers to easily integrate it into their existing codebase.
Materialize and RisingWave, both classified as "streaming data warehouses", provide a SQL API in a Postgres wire-compatible way, which allows developers to use SQL to interact with the data and to integrate with other SQL-compatible systems. They both enable stateful processing through DDL statements that create living, breathing materialized views, updated in real-time based on incoming data.
Developer familiarity with the PostgreSQL wire protocol should not be underestimated! Here's a great article from the folks over at Materialize about what this compatibility entails and why it is so awesome.
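Because these systems speak the Postgres wire protocol, a plain Postgres driver is all you need - here's a hedged sketch using psycopg2 against Materialize (the connection details are assumptions, and the orders source is presumed to already exist):

```python
import psycopg2

# Connection details are assumptions; any Postgres driver works because
# Materialize and RisingWave speak the Postgres wire protocol.
conn = psycopg2.connect("postgresql://materialize@localhost:6875/materialize")
conn.autocommit = True

with conn.cursor() as cur:
    # A materialized view that is kept up to date as new orders stream in.
    # Assumes an 'orders' source has already been created upstream.
    cur.execute("""
        CREATE MATERIALIZED VIEW account_totals AS
        SELECT account_id, SUM(amount) AS total
        FROM orders
        GROUP BY account_id
    """)
    # Reading the view is just a SELECT; results reflect the data seen so far.
    cur.execute("SELECT * FROM account_totals")
    for row in cur.fetchall():
        print(row)
```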
An honorable mention among frameworks that expose a SQL API is ksqlDB, a framework tightly integrated with Kafka that received a lot of support from Confluent - but after Confluent's recent acquisition of Immerok, a managed Flink platform, it's safe to expect that ksqlDB will not receive enough attention in the future to stay relevant.
If we venture into more managed solutions, Quix for example offers SDKs in Python and C#, but also has an API available that allows communication through HTTP or WebSockets, which can make it a great fit for a lot of backend applications. They also take care of managing everything related to the underlying message broker, so the developer only has to write business logic!
In the same vein, Apache Pulsar enables serverless functions to be used as transformations on live data.
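In Pulsar's Python runtime, a function is just a class with a process method; here's a minimal sketch (wiring it to input and output topics happens at deploy time, which is omitted here):

```python
from pulsar import Function

class Exclaim(Function):
    def process(self, input, context):
        # Called once per message on the input topic; the return value
        # is published to the configured output topic.
        context.get_logger().info("processing one message")
        return input + "!"
```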
Similarly, Estuary has a product called Flow, which also handles the whole real-time processing lifecycle from an operational perspective, starting with its managed connectors, which can handle CDC as well, and enabling us to write and deploy transformations.
Youngblood
If you are moderately familiar with the stream processing world, none of these names are new to you, and you might think that there is no innovation happening at all. Thankfully, this is not the case!
For example, there's StreamPark - currently under incubation in the Apache Software Foundation - which allows developers to create streaming apps faster than before by abstracting away the gritty details of Flink and its environment.
Another great example of specialized software is the aforementioned Nuclio, a serverless compute framework built for event processing and targeting the Data Science community and their online machine learning workloads. While written in Go, it offers various runtimes that let developers use their preferred language and simply outsource the processing, Lambda-style, to Nuclio.
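A Nuclio function in Python boils down to a single handler; here's a minimal sketch (deployment configuration and triggers are left out):

```python
def handler(context, event):
    # 'event' carries the trigger payload (HTTP body, stream record, etc.);
    # 'context' gives access to logging and other platform facilities.
    context.logger.info("got an event")
    return f"processed {len(event.body)} bytes"
```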
While not strictly a stream processing framework or library, Tinybird is worth a mention, as their managed ClickHouse solution can be a great alternative for large-scale data collection and transformation. They provide a slick UI and a great developer experience via their API or CLI.
Until next time
In conclusion, when evaluating event processing frameworks and libraries, it's important to consider the programming language they are written in, as well as the exposed APIs they provide. Frameworks and libraries written in popular programming languages like Java have a large number of potential contributors to open-source projects and a wealth of knowledge and expertise available. Exposed APIs play a crucial role in the ease of integration of event processing frameworks and libraries into an organization's existing codebase, and make it easier for developers to use the language and tools they are most comfortable with.
Thank you for reading this article on event processing and streaming data processing frameworks. I hope you found it informative. In the next chapter of this series, we will delve deeper into the other categories like use cases, integrations, deployment, and processing models, and discuss the biggest frameworks for each category in detail. Stay tuned for the next article!