On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum.
“With terabytes, things are actually pretty simple — most conventional databases scale to terabytes these days. However, try to scale to petabytes and it’s a whole different ball game.” –Florian Waas.
On the subject of Big Data Analytics, I interviewed Florian Waas (flw). Florian is the Director of Software Engineering at EMC/Greenplum and heads up the Query Processing team.
Q1. What are the main technical challenges for big data analytics?
Florian Waas: Put simply, in the Big Data era the old paradigm of shipping data to the application isn’t working any more. Rather, the application logic must “come” to the data or else things will break: this is counter to conventional wisdom and the established notion of strata within the database stack.
Instead of stand-alone products for ETL, BI/reporting and analytics we have to think about seamless integration: in what ways can we open up a data processing platform to enable applications to get closer?
What language interfaces, but also what resource management facilities can we offer? And so on.
At Greenplum, we’ve pioneered a couple of ways to make this integration reality: a few years ago with a Map-Reduce interface for the database and more recently with MADlib, an open source in-database analytics package. In fact, both rely on a powerful query processor under the covers that automates shipping application logic directly to the data.
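To make the contrast in this answer concrete, here is a minimal sketch of shipping data to the application versus shipping the logic to the data. It uses Python's built-in sqlite3 module purely as a stand-in: it is not Greenplum, not MADlib, and not a distributed system, and the table and function names are made up for illustration.

```python
import sqlite3

# sqlite3 is only a stand-in here; it is not Greenplum and not MADlib.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 5.0), ("west", 15.0)],
)

def average_in_application(conn):
    """Ship every row to the application, then aggregate there."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        count, total = totals.get(region, (0, 0.0))
        totals[region] = (count + 1, total + amount)
    return {region: total / count for region, (count, total) in totals.items()}

def average_in_database(conn):
    """Ship the logic to the data: only one row per region comes back."""
    rows = conn.execute("SELECT region, AVG(amount) FROM sales GROUP BY region")
    return dict(rows)

app_side = average_in_application(conn)
db_side = average_in_database(conn)
```

Both functions produce the same averages, but the first moves every row across the database boundary while the second moves one row per region. At petabyte scale that difference in data movement is the point of the answer above.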
Q2. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?
Florian Waas: With terabytes, things are actually pretty simple — most conventional databases scale to terabytes these days. However, try to scale to petabytes and it’s a whole different ball game.
Scale and performance requirements strain conventional databases. Almost always, the problems are a matter of the underlying architecture. If not built for scale from the ground up, a database will ultimately hit the wall. This is what makes it so difficult for the established vendors to play in this space: you cannot simply retrofit a 20+ year-old architecture to become a distributed MPP database overnight.
Having said that, over the past few years, a whole crop of new MPP database companies has demonstrated that multiple petabytes don’t pose a terribly big challenge if you approach them with the right architecture in mind.
Q3. How do you handle structured and unstructured data?
Florian Waas: As a rule of thumb, we suggest to our customers to use Greenplum Database for structured data and to consider Greenplum HD (Greenplum’s enterprise Hadoop edition) for unstructured data. We’ve equipped both systems with high-performance connectors to import and export data to each other, which makes for a smooth transition when using one for pre-processing for the other, querying HD from Greenplum Database, or whatever combination the application scenario might call for.
Having said this, we have seen a growing number of customers loading highly unstructured data directly into Greenplum Database and converting it into structured data on the fly through in-database logic for data cleansing, etc.
Q4. Cloud computing and open source: Do they play a role at Greenplum? If yes, how?
Florian Waas: Cloud computing is an important direction for our business and hardly any vendor is better positioned than EMC in this space. Suffice it to say, we’re working on some exciting projects.
So, stay tuned!
As you know, Greenplum has been very close to the open source movement, historically. Besides our ties with the Postgres and Hadoop communities, we released our own open source distribution of MADlib for in-database analytics (see also madlib.net).
Q5. In your blog you write that classical database benchmarks “aren’t any good at assessing the query optimizer”. Can you please elaborate on this?
Florian Waas: Unlike customer workloads, standard benchmarks pose few challenges for a query optimizer – the emphasis in these benchmarks is on query execution and storage structures. Recently, several systems that have no query optimizer to speak of have scored top results in the TPC-H benchmark.
And, while impressive at these benchmarks, these systems usually do not perform well in customer accounts when faced with ad-hoc queries. That’s where a good optimizer makes all the difference.
Q6. Why do we need specialized benchmarks for a subcomponent of a database?
Florian Waas: On the one hand, an optimizer benchmark will be a great tool for consumers.
A significant portion of the total cost of ownership of a database system comes from the cost of query tuning and manual query rewriting, in other words, the shortcomings of the query optimizer. Without an optimizer benchmark it’s impossible for consumers to compare the maintenance cost. That’s like buying a car without knowing its fuel consumption!
On the other hand, an optimizer benchmark will be extremely useful for engineering teams in optimizer development. It’s somewhat ironic that vendors haven’t invested in a methodology to show off that part of the system where most of their engine development cost goes.
Q7. Are you aware of any work in this area (Benchmarking query optimizers)?
Florian Waas: Funny you should ask. Over the past months I’ve been working with coworkers and colleagues in the industry on some techniques. We’re still far away from a complete benchmark, but we’ve made some inroads.
Q8. You have done some work on “dealing with plan regressions caused by changes to the query optimizer”. Could you please explain what the problem is and what kind of solutions you developed?
Florian Waas: A plan regression is a regression of a query due to changes to the optimizer from one release to the next. For the customer this could mean that, after an upgrade or patch release, one or more of their truly critical queries might run slower, maybe even so slow that it starts impacting their daily business operations.
With the current test technology, plan regressions are very hard to guard against simply because the size of the input space makes it impossible to achieve perfect test coverage. This dilemma made a number of vendors increasingly risk averse and turned into the biggest obstacle for innovation in this space. Some vendors came up with rather reactionary safety measures. To use another car analogy: many of these are akin to driving with defective brakes but wearing a helmet in the hopes that this will help prevent the worst in a crash.
I firmly believe in fixing the defective brakes, so to speak, and developing better test and analysis tools. We’ve made some good progress on this front and are starting to see some payback already. This is an exciting and largely under-developed area of research!
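As a rough illustration of what a plan regression looks like from the testing side, the sketch below compares the same queries across two releases and flags those whose plan changed and whose runtime grew noticeably. This is not Greenplum's tooling or the approach described in the answer; the `QueryRun` record, the plan fingerprints, the timings and the 1.5x threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryRun:
    plan: str        # hypothetical textual fingerprint of the chosen plan
    seconds: float   # measured runtime of the query

def find_plan_regressions(old_runs, new_runs, slowdown=1.5):
    """Return names of queries whose plan changed and whose runtime grew
    by at least the given factor between two releases."""
    regressions = []
    for name, old in old_runs.items():
        new = new_runs.get(name)
        if new is None:
            continue  # query not run on the new release; nothing to compare
        if new.plan != old.plan and new.seconds >= slowdown * old.seconds:
            regressions.append(name)
    return sorted(regressions)

# Made-up measurements for three queries on two releases.
old_release = {
    "q1": QueryRun("hash_join(a,b)", 2.0),
    "q2": QueryRun("index_scan(c)", 1.0),
    "q3": QueryRun("seq_scan(d)", 4.0),
}
new_release = {
    "q1": QueryRun("nested_loop(a,b)", 9.0),  # plan changed, much slower
    "q2": QueryRun("index_scan(c)", 1.1),     # same plan, small variation
    "q3": QueryRun("index_scan(d)", 0.5),     # plan changed, faster
}
regressed = find_plan_regressions(old_release, new_release)
```

A check like this only covers the queries that were actually run, which is the coverage problem raised above: the space of possible queries and data distributions is far larger than any fixed test set.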
Q9. Glenn Paulley of Sybase in a keynote at SIGMOD 2011 asked the question of ‘how much more complexity can database systems deal with?’ What is your take on this?
Florian Waas: Unnecessary complexity is bad. I think everybody will agree with that. Some complexity is inevitable though, and the question becomes: How are we dealing with it?
Database vendors have all too often fallen into the trap of implementing questionable features quickly without looking at the bigger picture. This has led to tons of internal complexity and special casing, not to mention the resulting spaghetti code.
When abstracted correctly and broken down into sound building blocks a lot of complexity can actually be handled quite well. Again, query optimization is a great example here: modern optimizers can be a joy to work with.
They are built and maintained by small surgical teams that innovate effectively, whereas older models require literally dozens of engineers just to maintain the code base and fix bugs.
In short, I view dealing with complexity primarily as an exciting architecture and design challenge and I’m proud we assembled a team here at Greenplum that’s equally excited to take on this challenge!
Q10. I published an interview with Marko Rodriguez and Peter Neubauer, leaders of the Tinkerpop2 project. What is your opinion on Graph Analysis and Manipulation for databases?
Florian Waas: Great stuff these guys are building; I’m interested to see how we can combine Big Data with graph analysis!
Q11. Anything else you wish to add?
Florian Waas: It’s been fun!
Florian Waas (flw) is Director of Software Engineering at EMC/Greenplum and heads up the Query Processing team. His day job is to bring theory and practice together in the form of scalable and robust database technology.