On Stream Processing and Apache Flink. Q&A with Fabian Hueske
Q1. What is parallel stream processing and how this technology differs from traditional batch data processing?
Distributed stream processing frameworks process events from unbounded data streams. In contrast to batch applications that read, process, and produce bounded amounts of input and output data, stream processing applications continuously ingest and process unbounded data streams and produce unbounded results. Due to the unbounded nature of the input and the need for incremental processing, state and time management are important cornerstones for stream processing applications. Moreover, most stream processing applications have stricter latency SLAs and up-time requirements than traditional batch applications.
Q2. What use cases benefits from stream processing?
Stream processing can be applied to a wide spectrum of applications. The most obvious are probably ETL and analytics use cases that can benefit from much lower processing latencies than periodic batch jobs. However, stream processors are also becoming popular frameworks to implement transactional business applications and microservice architectures.
Q3. What are the technical challenges for stream processing applications?
Dealing with unbounded input requires careful state management and reasoning about when a computation can be performed. Advanced stream processing frameworks such as Flink provide fault tolerance, state consistency guarantees, scalability, and APIs to significantly ease application development. However, stream processing still requires a bit of a mind shift for many data engineers and application developers.
Q4. There are a number of open source frameworks for stream processing frameworks available today, including Apache’s Spark Streaming, Storm, Kafka Streams, Samza, Flink, and Apex. How do you choose the right framework?
Not every framework works equally well for every use case. While some ETL and analytics applications work fine with latencies of seconds or even a few minutes, applications like fraud detection on credit card transactions have much tighter latency SLAs. The expressiveness and features of a frameworks the programming model, such as the control over state and time that it offers should definitely be considered. Finally, the operational features of a framework need to be evaluated. Since most stream processing applications are designed to run continuously for weeks or months, it is important to be able to upgrade an application, scale it in or out, or migrate it to a different cluster.
Q5. You are a longtime Apache Flink committer. How can Apache Flink be useful for developing stream processing applications?
Apache Flink features APIs on different levels of abstraction. Analytical use cases can be easily realized with Flink’s Table API or SQL. These APIs are designed as unified APIs for batch and stream sources, meaning that the same query produces the same result regardless whether it is evaluated over a file or a Kafka topic (given that both contain the same data). Event-driven application that require explicit control over state and time are better served by Flink’s DataStream API and ProcessFunctions. Moreover, Flink includes special-purpose libraries, such as the CEP (complex event processing) library to define and match patterns over event streams.
Q6. What is special about Flink’s system architecture?
Apache Flink has several notable features including its mechanism to take consistent checkpoints of an applications state with low overhead that are used recover the application in case of a failure. Flink’s watermark-based support for event time allows users to trade off processing latency and result completeness. However, the architectural choice that I find most exciting is Flink’s unified approach to batch and stream processing. While still being under development, future versions of Flink will have unified APIs and a unified processing engine to process bounded and unbounded data, treating bounded input as a special case for which optimized processing techniques are leveraged. This will allow Flink users to run the same code on streaming and bounded data with the performance characteristics of dedicated stream and batch processors.
Q7. How much effort is required to deploy and configure Flink clusters?
It is very easy to start a local Flink instance. You can download the distribution and start Flink with a few commands. Moreover, the Flink community provides Docker images that make it really easy to spin up a local instance. For distributed setups, it really depends on the environment. Flink runs a bare metal cluster but also supports cluster resource managers such as Kubernetes, Hadoop Yarn, or Mesos. Of course a proper production setup requires many steps for proper setup like integrating it into the overall system architecture or cloud environment and connecting it to existing logging and metrics infrastructure. The Flink documentation give many pointers for how to do this.
Flink applications can be deployed in two ways: the traditional cluster mode where a Flink cluster is started and one or more applications are submitted for execution. Or an application-centric approach where a job is submitted to a resource manager and the Flink processes are started just for this application. We call these modes job mode and library mode, respectively because Flink behaves more like a library than like a distributed cluster framework in the latter case.
Q8. What are the key lessons learned in implementing scalable streaming applications with Flink’s DataStream API?
I think the most important lesson is to learn how to work with time. Stream processing applications continuously ingest unbounded streams which forces developers to decide when to treat the input as complete and to perform a computation or to emit a result. Event timestamps and watermarks provide important information for this decision, but still the developer needs understand how they work and the trade-offs at hand.
Q9. How do you ensure continuous run and maintenance of these applications in operational environments?
Checkpoints are Flink’s mechanism to ensure state consistency when recovery from failures. A running application periodically takes a consistent checkpoint of its state and writes it to a persistent storage like HDFS or S3.
In case of a failure, Flink restarts the application and initializes the state of all operators from the latest checkpoint. Since a checkpoint also includes the reading positions of its sources (for example Kafka partition offsets), Flink can guarantee exactly-once state consistency and jobs continue processing as if the failure never happened. Since checkpointing is an expensive operation, Flink features many optimizations that significantly reduce the overhead of taking and loading checkpoints. The same mechanism can be used to scale applications in and out, upgrade the logic of an application without losing its state, or migrate the application to a different cluster.
Q10. What is the road map ahead for Apache Flink?
The Flink community has promoted the idea that batch processing is a special case of stream processing for a long time. However, Flink still has separate APIs for batch and stream processing which are translated to separate parts of the execution engine. This will change in the next versions of Flink. The community is finally working towards a truly unified processing engine and will later also consolidate its APIs. A big step towards that goal was Alibaba’s recent donation of its internal Flink fork called Blink that features many improvements for batch scheduling, fault-tolerance, and SQL. The community is currently integrating many of Blink’s features into the Flink master. See this blog post for details. There are more exciting efforts on the way, for example integrating Flink with Hive, adding a Python Table API, and improving support for interactive data exploration use cases. You can find an extensive summary of Flink’s roadmap on our website