“I would consider stream data analysis to be a major unique selling proposition for Flink. Due to its pipelined architecture Flink is a perfect match for big data stream processing in the Apache stack.”–Volker Markl
I have interviewed Volker Markl, Professor and Chair of the Database Systems and Information Management group at the Technische Universität Berlin. Main topic of the interview is the Apache Top-Level Project, Flink.
Q1. Was it difficult for the Stratosphere Research Project (i.e., a project originating in Germany) to evolve and become an Apache Top-Level Project under the name Flink?
Volker Markl: I do not have a frame of reference. However, I would not consider the challenges for a research project originating in Germany to be any different from any other research project underway anywhere else in the world.
Back in 2008, when I conceived the idea for Stratosphere and attracted co-principal investigators from TU Berlin, HU Berlin, and the Hasso Plattner Institute Potsdam, we jointly worked on a vision and had already placed a strong emphasis on systems building and open-source development early on. It took our team about three years to deliver the first open-source version of Stratosphere and then it took us several more years to gain traction and increase our visibility.
We had to make strides to raise awareness and make the Stratosphere Research Project more widely known in the academic, commercial, research, and open-source communities, particularly, on a global scale. Unfortunately, despite our having started in 2008, we had not foreseen there being a naming problem. The name Stratosphere was trademarked by a commercial entity and as such we had to rename our open-source system. Upon applying for Apache incubation, we put the renaming issue to a vote and finally agreed upon the name Flink, a name that I am very happy with.
Flink is a German word that means ‘agile or swift.’ It suits us very well since this is what the original project was about. Overall, I would say, our initiating this project in Germany (or in Europe for that matter) did not impose any major difficulties.
Q2. What are the main data analytics challenges that Flink is attempting to address?
Volker Markl: Our key vision for both Stratosphere and now Flink was “to reduce the complexity that other distributed data analysis engines exhibit, by integrating concepts from database systems, such as declarative languages, query optimization, and efficient parallel in-memory and out-of-core algorithms, with the Map/Reduce framework, which allows for schema on read, efficient processing of user-code, and massive scale-out.” In addition, we introduced two novel features.
One focused on the ‘processing of iterative algorithms’ and the other on ‘streaming.’ For the former, we recognized that fixed-point iterations were crucial for data analytics.
Hence, we incorporated varying iterative algorithm processing optimizations.
For example, we use delta-iterations to avoid unnecessary work, reduce communication, and run analytics faster.
Moreover, this concept of iterative computations is tightly woven into the Flink query optimizer, thereby alleviating the data scientist from (i) having to worry about caching decisions, (ii) moving invariant code out of the loop, and (iii) thinking about & building indexes for data used between iterations. For the latter, since Flink is based on a pipelined execution engine akin to parallel database systems, this formed a good basis for us to integrate streaming operations with rich windowing semantics seamlessly into the framework. This allows Flink to process streaming operations in a pipelined way with lower latency than (micro-)batch architectures and without the complexity of lambda architectures.
Q3. Why is Flink an alternative to Hadoop’s MapReduce solution?
Volker Markl: Flink is a scalable data analytics framework that is fully compatible with the Hadoop ecosystem.
In fact, most users employ Flink in Hadoop clusters. Flink can run on Yarn and it can read from & write to HDFS. Flink is compatible with all Hadoop input & output formats and (as of recently and in a beta release) even has a Map/Reduce compatibility mode. Additionally, it supports a mode to execute Flink programs on Apache Tez. Flink can handle far more complex analyses than Map/Reduce programs. Its programming model offers higher order functions, such as joins, unions, and iterations. This makes coding analytics simpler than in Map/Reduce. For large pipelines consisting of many Map/Reduce stages Flink has an optimizer, similar to what Hive or Pig offer for Map/Reduce for relational operations. However, in contrast, Flink optimizes extended Map/Reduce programs and not scripting language programs built on top.
In this manner, Flink reduces an impedance mismatch for programmers. Furthermore, Flink has shown to grossly outperform Map/Reduce for many operations out of the box and since Flink is a stream processor at its core, it can also process continuous streams.
Q4. Could you share some details about Flink’s current performance and how you reduce latency?
Volker Markl: Flink is a pipelined engine. A great deal of effort has been placed in enabling efficient memory management.
The system gracefully switches between in-memory and out-of-core algorithms.
The Flink query optimizer intelligently leverages partitioning and other interesting data properties in more complex analysis flows. Thereby, reducing communication, process-ing overhead, and thus latency. In addition, the delta iteration feature reduces the overhead during iterative computations, speeds up analytics, and shortens execution time. There are several performance studies on the web that show that Flink has very good performance or outperforms other systems.
Q5. What about Flink’s reliability and ease of use?
Volker Markl: We have had very good feedback regarding both usability and reliability. It is extremely easy to get started with Flink if you are familiar with Java, Scala, or Python. Flink APIs are very clean. For example, the table, graph, and dataset APIs are easy to use for anyone who has been writing data analytics programs in Java and Scala or in systems, such as MATLAB, Python, or R.
Flink supports a local mode for debugging and a lot of effort has been put on it requiring little configuration, so that developers can move a job to production with small effort.
Flink has had native memory management and operations on serialized data from very early on. This reduces configuration and enables very robust job execution.
The system has been tested on clusters with hundreds of nodes. Projects that develop notebook functionality for rapid prototyping, namely Apache Zeppelin are integrating with Flink to further reduce overhead and get an analysis pipeline up and running.
Like other open-source projects, Flink is constantly improving its reliability and ease-of-use with each release. Most recently, a community member created an interactive shell, which will make it easier for first-time users to conduct data analysis with Flink. The Berlin Big Data Center (http://bbdc.berlin) is currently prototyping machine learning and text mining libraries for Flink based on the Apache Mahout DSL.
SICS (The Swedish Institute for Computer Science) in Stockholm is currently working on a solution to ease installation, whereas Data Artisans is providing tooling to further improve the ease of use.
Q6. How well does Flink perform for real time (as opposed to batch) big data analytics?
Volker Markl: I would consider stream data analysis to be a major unique selling proposition for Flink. Due to its pipelined architecture Flink is a perfect match for big data stream processing in the Apache stack. It provides native data streams with window operations and an API for streaming that matches the API for the analysis of data at rest.
The community has added a novel way to checkpoint streams with low overhead and is now working on surfacing persistent state functionality.
Data does not have to be moved across system boundaries (e.g., as in a lambda architecture) when combining both streams and datasets. Programmers do not have to learn different programming paradigms when crafting an analysis. Administrators do not have to manage the complexity of multiple engines as in a lambda architecture (for instance, managing version compatibility). And of course the performance shows a clear benefit due to deep integration.
Q7. What are the new Flink features that the community is currently working on?
Volker Markl: There are plenty of new features. A major ongoing effort is graduating Flink’s streaming API and capabilities from beta status. A recent blog post details this work (http://data-artisans.com/stream-processing-with-flink.html). Another effort is continuing to expand Flink’s libraries, namely, FlinkML for Machine Learning & Gelly for graph processing by adding more algorithms.
Flink’s Table API is a first step towards SQL support, which is planned for both batch and streaming jobs. The ecosystem around Flink is also growing with systems, such as Apache Zeppelin, Apache Ignite, and Google Cloud Dataflow integrating with Flink.
Q8. What role does Data Artisans (a Berlin-based startup) play in the Flink project?
Volker Markl: The startup data Artisans was created by a team of core Flink committers & initiators of the Flink project. They are committed to growing the Apache Flink community and code base.
Q9. Is Flink an alternative to Spark and Storm?
Volker Markl: I would consider Flink to be an alternative to Spark for batch processing, if you need graceful degradation for out-of-core operations or processing iterative algorithms that can be incrementalized. Also, Flink is an alternative to Spark, if you need real data streaming with a latency that the Spark microbatch processing cannot provide. Flink is an alternative to any lambda architecture, involving Storm with either Hadoop or Spark, as it can process richer operations and can easily process data at rest and data in motion jointly in a single processing framework.
Q10. What are the major differences between Flink, Spark, and Storm?
Volker Markl: Overall, the core distinguishing feature of Flink over the other systems is an efficient native streaming engine that supports both batch processing and delta iterations. In particular, it enables efficient machine learning and graph analysis through query optimization across APIs as well as its highly optimized memory management, which supports graceful degradation from in-memory to out-of-core algorithms for very large distributed datasets.
Flink is an alternative to those projects, although many people are using several engines on the same Hadoop cluster built on top of YARN, depending on the specific workload and taste.
At its core, Flink is a streaming engine, surfacing batch and streaming APIs. In contrast, at its core, Spark is at an in-memory batch engine that executes streaming jobs as a series of mini-batches. Compared to Storm, Flink streaming has a checkpointing mechanism with lower overhead, as well as an easy to use API. Certainly, Flink supports batch processing quite well. In fact, a streaming dataflow engine is a great match for batch processing, which is the approach that parallel databases (e.g., Impala) have been following.
Q11. Is Flink already used in production?
Volker Markl: Indeed, two companies already use Flink in production for both batch and stream processing, and a larger number of companies are currently trying out the system. For that reason, I am looking forward to the first annual Flink conference, called Flink Forward (http://flink-forward.org), which will take place on Oct 12-13, 2015 in Berlin, where I am certain we will hear more about its use in production.
Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA, http://www.dima.tu-berlin.de/) group at the Technische Universität Berlin (TU Berlin). Volker also holds a position as an adjunct full professor at the University of Toronto and is director of the research group “Intelligent Analysis of Mass Data” at DFKI, the German Research Center for Artificial Intelligence.
Earlier in his career, Dr. Markl lead a research group at FORWISS, the Bavarian Research Center for Knowledge-based Systems in Munich, Germany, and was a Research Staff member & Project Leader at the IBM Almaden Research Center in San Jose, California, USA. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 18 patents, has transferred technology into several commercial products, and advises several companies and startups.
He has been speaker and principal investigator of the Stratosphere research project that resulted in the “Apache Flink” big data analytics system and is currently leading the Berlin Big Data Center (http://bbdc.berlin). Dr. Markl currently also serves as the secretary of the VLDB Endowment and was recently elected as one of Germany’s leading “digital minds” (Digitale Köpfe) by the German Informatics Society (GI).
A detailed Bio can be found at http://www.user.tu-berlin.de/marklv.
– MONDAY JAN 12, 2015, The Apache Software Foundation Announces Apache™ Flink™ as a Top-Level Project
– Apache Flink Frequently Asked Questions (FAQ)
– Mirror of Apache Flink
– On Apache Ignite v1.0. Interview with Nikita Ivanov. ODBMS Industry Watch, February 26, 2015
– AsterixDB: Better than Hadoop? Interview with Mike Carey. ODBMS Industry Watch, October 22, 2014
– Common misconceptions about SQL on Hadoop, ODBMS.org
– SQL over Hadoop: Performance isn’t everything… ODBMS.org
– Getting Up to Speed on Hadoop and Big Data. ODBMS.org
Follow ODBMS.org on Twitter: @odbmsorg