Is stream processing the next big thing in big data?
by Kostas Tzoumas, data Artisans.
One of the major trends of late 2015 and 2016 has been the accelerated growth of data streaming. Often also referred to as real-time data or fast data, the term refers to two things: first, technologies that allow you to get answers from data with very low latency, and second, an architecture in which data is treated as ever-flowing event streams rather than static files or tables. As one of the initial members of the Apache Flink™ project, and a co-founder of data Artisans, the company founded by Flink's original creators, I have been very close to this trend.
There are several drivers for data stream processing. From a business perspective, monitoring the business in real time and providing demand-based and push services to customers is becoming increasingly important. To name a few examples: financial services want to offer push-based products to their customers and prevent fraud in real time; in manufacturing and IoT, it is important to monitor machine data continuously and issue alerts; and in retail, product recommendations that take the latest purchasing data into account can be markedly more relevant. From a technology perspective, it is becoming more and more popular to architect software systems as connected microservices rather than as a monolithic infrastructure. This trend also implies that the data backend is better architected as ever-flowing streams of events rather than as static tables. In the end, stream processing enables the obvious: continuous analysis of data that is continuously produced (which describes most interesting data sets nowadays).
This is not the first time that stream processing has been tried; several products existed in the past but never reached mass adoption, partly because they were hard to use and partly because far less data was available as streams back then. Early open source approaches to stream processing lacked at least one of three capabilities: the ability to handle streams of high volume, the ability to provide answers with low latency, and the ability to provide correct answers under failure scenarios and when events arrive out of order. This forced early adopters to build hybrid solutions (such as the so-called “Lambda architecture”), or to work around the limitations of batch platforms.
This has now changed, with mature technologies such as Apache Kafka, Apache Flink, and Apache Beam (incubating) providing scale, distribution, and consistency. In particular, Apache Flink pioneered a stream processor that can do all three: handle streams of high volume, provide low-latency results, and guarantee correctness. And all that in a developer-friendly package. With a framework like Flink, stream processing can go well beyond the “real-time” niche. While real-time applications are very important, Flink jobs can be used to intuitively implement continuous applications that were previously either standalone applications or periodic batch jobs (e.g., in Hadoop). Moreover, Flink treats traditional batch processing as a special case of stream processing (after all, what is a file if not a stream that happens to have a beginning and an end?). This means that by treating data as event streams and using a powerful stream processor like Flink, one can cover real-time, continuous, as well as historical data processing with a single framework.
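The "batch is a bounded stream" idea can be sketched in a few lines of plain Python (this is not Flink code; `running_sum` is a hypothetical operator used only for illustration). The same streaming operator processes both a finite collection, standing in for a file or table, and an unbounded generator, standing in for a live event stream:

```python
from itertools import islice

def running_sum(events):
    """A generic streaming operator: emits a running total for each
    incoming event, without knowing whether the stream ever ends."""
    total = 0
    for value in events:
        total += value
        yield total

# "Batch" input: a file is just a stream with a beginning and an end.
batch = [3, 1, 4, 1, 5]
print(list(running_sum(batch)))  # -> [3, 4, 8, 9, 14]

# "Streaming" input: an unbounded generator of events.
def sensor_readings():
    n = 0
    while True:
        n += 1
        yield n

# The same operator applies unchanged; we just sample the first results.
print(list(islice(running_sum(sensor_readings()), 5)))  # -> [1, 3, 6, 10, 15]
```

The point of the sketch is that the operator itself never branches on whether its input is bounded; boundedness is a property of the source, which is exactly the unification the article describes.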
For more information on Apache Flink and real-time (or not) data stream processing, I invite you to check out the upcoming Flink Forward 2016 conference website, the project’s website, as well as the data Artisans technical blog.