Languages for Big Data Queries
Prof. Dr. Christoph Reichenbach, Frankfurt Big Data Lab
Professor for Software Engineering and Programming Languages
Goethe University Frankfurt
– January 20, 2015
As we gain access to increasing numbers of data sources and better tools for data analytics, we also gain opportunities for refining our Big Data applications. However, evolving these applications is still more challenging today than evolving `regular’ software. There are several reasons for this increased level of challenge; in my experience, the main ones are latency, unstructured data, and fault tolerance. Here, `latency’ means that atomic Big Data analysis steps (Map-Reduce programs, mostly) take a substantial amount of time before they return a result on realistic data. `Unstructured data’ refers to the common case of data sources that do not obey a strict typing discipline. `Fault tolerance’ refers to analysis steps that tolerate partial failures, falling back to older results where possible. These are what software engineers call essential complexities: challenges that are intrinsic to the problem. We can try to mitigate these complexities, but we cannot eliminate them.
In addition to these essential complexities we are also facing a number of accidental complexities, that is, complexities caused purely by the particular approaches we choose for solving Big Data challenges. These avoidable complexities largely manifest in how we manage plumbing and intermediate artefacts. For example, it is still not uncommon in the industry to find Map-Reduce pipelines that require data analysts to manually synchronise resource identifiers between stages; due to the high latency intrinsic to Big Data, simple naming bugs can lead to months of wasted computation time.
To address these accidental complexities, practitioners have implemented a number of special-purpose programming languages. Within the Big Data community there appear to be two dominant approaches to designing such languages. One is to move the challenge into a dedicated `plumbing language’, which effectively corresponds to what the programming language community would call a dataflow language: a language that defines sources (databases, files with unstructured data), sinks (the final output, or intermediate results), and operators, which take any number of sources and emit their result to a sink; that sink can then again become a source for a subsequent computational stage. This idea of building an explicit processing pipeline is the basic concept underlying systems such as Flume and Pig. It simplifies the naming of shared sinks and sources, and permits optimisations such as operator rewriting and (at least conceptually) detecting and exploiting sharing across multiple flow programs.
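To make the source/operator/sink structure concrete, here is a minimal sketch of the dataflow idea in plain Python. The `stage` helper and the log-analysis example are purely illustrative and correspond to no real system’s API:

```python
# Minimal dataflow-pipeline sketch: sources feed operators, operators
# write to sinks, and each sink can serve as the source of the next
# stage. All names here are illustrative; real systems (Flume, Pig)
# have far richer APIs and run stages as distributed jobs.

def stage(source, operator):
    """Apply an operator to a source, materialising the result as a sink."""
    sink = list(operator(source))
    return sink  # the sink doubles as a source for the next stage

# Source: unstructured log lines.
log_lines = ["GET /index", "POST /login", "GET /index"]

# Stage 1: parse each line into (method, path) records.
parsed = stage(log_lines, lambda src: (line.split() for line in src))

# Stage 2: keep only GET requests.
gets = stage(parsed, lambda src: (rec for rec in src if rec[0] == "GET"))

# Stage 3: count requests per path.
def count_by_path(src):
    counts = {}
    for _, path in src:
        counts[path] = counts.get(path, 0) + 1
    yield from counts.items()

counts = stage(gets, count_by_path)
print(dict(counts))  # {'/index': 2}
```

Because every intermediate sink has an explicit name in the program, there is no separate resource identifier to keep synchronised between stages by hand.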
The alternative approach is to try to translate SQL into the world of Big Data. Examples of this approach are HiveQL, Impala, Drill, and Dremel. Using such a query language is very appealing from a usability angle— data analysts can use known language concepts and defer to the query optimiser to lay out the optimal processing pipeline automatically.
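The appeal is easy to demonstrate with ordinary SQL. The sketch below uses Python’s built-in sqlite3 module as a stand-in engine, with an invented table for illustration; a HiveQL query reads much the same, but the optimiser behind it lays out distributed Map-Reduce stages instead of local table scans:

```python
import sqlite3

# Declarative style: the analyst states *what* they want and leaves
# the *how* (the processing pipeline) to the query optimiser.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (method TEXT, path TEXT)")
conn.executemany("INSERT INTO requests VALUES (?, ?)",
                 [("GET", "/index"), ("POST", "/login"), ("GET", "/index")])

rows = conn.execute("""
    SELECT path, COUNT(*) AS hits
    FROM requests
    WHERE method = 'GET'
    GROUP BY path
""").fetchall()
print(rows)  # [('/index', 2)]
```

Note that this computes the same per-path counts as the hand-built pipeline above would, but the filter placement, grouping strategy, and intermediate results are all the engine’s business, not the analyst’s.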
The principal limitation of such query languages is their limited set of built-in operators. Dataflow languages, especially Flume (which is effectively a library in a general-purpose programming language), are designed to work with user-specified operators and are therefore easily extensible. Query languages are harder to extend, as each logical operator (as written by the data analyst) may have multiple physical operators (concrete algorithms, corresponding to the operators of dataflow languages), and the query optimiser must be smart enough to know when to pick which physical operator.
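The logical/physical distinction can be sketched as follows: one logical join, two physical implementations, and a toy heuristic standing in for the optimiser. All names and the size threshold are invented for illustration, not taken from any real system:

```python
# One *logical* operator (join) backed by two *physical* operators;
# a toy heuristic stands in for the query optimiser's choice.

def nested_loop_join(left, right, key):
    """Physical operator 1: compare every pair (cheap setup, O(n*m))."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    """Physical operator 2: index one side first (pays off on big inputs)."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [(l, r) for l in left for r in index.get(l[key], [])]

def join(left, right, key):
    """Logical operator: picks a physical operator by a size heuristic."""
    if len(left) * len(right) < 100:  # threshold invented for illustration
        return nested_loop_join(left, right, key)
    return hash_join(left, right, key)

users  = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
visits = [{"id": 1, "path": "/index"}, {"id": 1, "path": "/login"}]
print(join(users, visits, "id"))
```

A data analyst only ever writes `join`; extending the language means not just adding a new physical operator but also teaching the optimiser when that operator is the right choice.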
Our experiences in our own PQL/Java system have shown that building an extensible query language is both possible and practical, but that writing extensions to such systems requires more background than for dataflow systems.
Thus, data analysts will most likely find that dedicated query languages are ideal if the language operators cover their exact needs— and otherwise that they are better served by dataflow languages, plus any additional operators they might need, implemented in a suitable general-purpose language.
A second limitation in query languages is the construction of processing pipelines. Analogously to the construction of query plans in traditional databases, the challenge of building an optimal processing pipeline from a declarative query is computationally very hard, most likely intractable, and may depend on information that is not available while the pipeline is constructed, such as which of two external data sources is bigger. This means that declarative languages may produce suboptimal pipeline structures unless their query planners are highly sophisticated (e.g., able to dynamically update the processing pipeline). Unfortunately, in Big Data applications, the difference between `optimal’ and `suboptimal’ can easily be the difference between `practical’ and `infeasible’.
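A toy cost model illustrates how much pipeline structure can matter. Assuming, purely for illustration, that a join costs the product of its input sizes and shrinks its output by a fixed selectivity, merely reordering the same three inputs changes the total cost by several orders of magnitude:

```python
# Toy illustration of why pipeline structure matters: joining three
# inputs in different orders can do vastly different amounts of work,
# and the best order depends on sizes that may be unknown at planning
# time. The cost model below is invented for illustration only.

def pipeline_cost(sizes, selectivity):
    """Cost of joining left-to-right, where a join of inputs of sizes
    a and b costs a * b and its output has roughly a * b * selectivity
    rows (selectivity is an assumed constant)."""
    total, intermediate = 0, sizes[0]
    for s in sizes[1:]:
        total += intermediate * s
        intermediate = int(intermediate * s * selectivity)
    return total

# Same three inputs, two join orders: starting with the small input
# keeps every intermediate result tiny; starting with the large ones
# pays for a huge intermediate join first.
print(pipeline_cost([10, 1_000_000, 1_000_000], 1e-6))
print(pipeline_cost([1_000_000, 1_000_000, 10], 1e-6))
```

If the planner does not know which external source is the small one, it cannot reliably pick the cheap order; this is precisely the kind of information that may only become available while the pipeline is already running.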
Of course, partitioning all `Big Data’ languages into the above two categories does not do the breadth of available languages justice. I have so far omitted several interesting languages, such as Sawzall and our own PQL/Java, both of which express parallelisable computations directly, akin to the Big Data SQL dialects, though Sawzall prefers a more imperative program structure. I have also omitted work by the functional programming community, especially the Haskell community, who have been experimenting with parallel execution for longer than most other communities, exploiting the fact that their language is free of side effects in the conventional sense and can thus be parallelised more easily. Another omission is LogicBlox, a proprietary declarative language that extends Datalog. LogicBlox is strictly more powerful than SQL and almost as concise.
Despite this added power, LogicBlox has produced query plans that were efficient enough to substantially speed up several known analyses from the programming languages community that had previously been hand-optimised.
LogicBlox and Haskell stand out from the other languages mentioned here in that they cover the full spectrum of data analytics needs from operators to plumbing to visualisation in a single language.
As we can see, a rich fauna of languages has evolved in the world of Big Data, with functional, declarative, dataflow, and imperative languages all trying to leave their mark. These languages attempt to avoid accidental complexity and mitigate the effect of essential complexity in Big Data analytics. As with programming languages outside the world of Big Data, more abstract languages (declarative languages, in particular) can offer the quickest path from idea to implementation, though limitations in the catalogue of operators and suboptimal processing pipelines may limit them in some applications.
While the separation between the dataflow-plus-operators approach and the declarative-languages approach may seem unfortunate at first glance, we should not forget that the latter effectively compiles into the former. There is still untapped potential in bridging these two approaches— for example, who says that we can’t build a declarative query language that learns from enough examples of hand-written dataflow programs and operators, or that adjusts its pipelines dynamically? Similarly, we may not have heard the last of the Haskell community, especially with last year’s release of the Haxl library, and an Open Source Datalog processor for Big Data might yet make an appearance.
I thus remain optimistic that the fauna of Big Data processing languages has a few rounds of evolution left for us.