On Pivotal HD. Interview with Scott Yara and Florian Waas.
“A distribution is not–or not necessarily–a fork of the code and we have no intention to fork Hadoop. At this point, the value-add that we bring to the table is strictly layered on top of Apache HD and interacts cleanly with the vanilla Hadoop stack” –Scott Yara and Florian Waas.
On Monday, February 25th, Greenplum announced a new Hadoop distribution: Pivotal HD. I put a few questions about Pivotal HD to Scott Yara, Senior Vice President, Products and Co-Founder Greenplum/EMC, and Florian Waas, Senior Director of Advanced Research and Development at Greenplum/EMC.
Q1. What is in your opinion the status of adoption of, and investment in, open source projects such as Hadoop within the Enterprise?
Scott Yara, Florian Waas: We have seen a massive shift in perception when it comes to open source.
In the past, innovation was primarily driven by commercial R&D departments and open source was merely trying to catch up to them. And even though a number of open source projects from that era have become household names they weren’t necessarily viewed as leaders in innovation.
This has fundamentally changed in recent years: open source has become a hotbed of innovation, in particular in infrastructure technology. Hadoop and a variety of other data management and database products are testament to this change. Enterprise customers do realize this trend and have started adopting open source at large scale. It allows them to get their hands on new technology much faster than was the case before, and as an additional perk this technology comes without the dreaded vendor lock-in.
By now, even the most conservative enterprises have developed open source strategies that ensure they have their finger on the pulse and that adoption cycles are short and effective.
So, in short, the prospects for open source have never been better!
Q2. In your opinion is the future of Hadoop made of hybrid products?
Scott Yara, Florian Waas: Hadoop is a collection of products or tools and, apart from the relatively mature HDFS interfaces, is still evolving. Its original value proposition has changed quite dramatically. Remember, initially it was all about MapReduce, the cool programming paradigm that lets you whip up large-scale distributed programs in no time, requiring only rudimentary programming skills.
Yet, that’s not the reason Hadoop has attracted the attention of enterprises lately. Frankly, the MapReduce programming paradigm was a non-starter for most enterprise customers: it’s at too low a level of abstraction and curating and auditing MapReduce programs is prohibitively expensive for customers unless they have a serious software development shop dedicated to it. What has caught on, however, is the idea of ‘cheap scalable storage’!
In our view the future of Hadoop is really this: a solid abstraction of storage in the form of HDFS with any number of different processing stacks on top, including higher-level query languages. Naturally this will be a collection of different products, hybrids where necessary. I think we’ve only seen the tip of the iceberg yet.
Q3. Why introduce a new Hadoop distribution?
Scott Yara, Florian Waas: Let’s be clear about one thing first: to us a distribution is simply a bundle of software components that comes with the assurance that the bundled products have been integration-tested and certified. To enterprise customers this assurance is vital as it gives them the single point of contact when things go wrong. And exactly this is the objective of Pivotal HD.
A distribution is not–or not necessarily–a fork of the code and we have no intention to fork Hadoop. At this point, the value-add that we bring to the table is strictly layered on top of Apache HD and interacts cleanly with the vanilla Hadoop stack.
As long as no vendor actively subverts the Hadoop project, we don’t see any need to fork. That being said, if a single vendor sweeps up a significant number of contributors or even committers of any individual project it always raises a couple of red flags and customers will be concerned whether somebody is going to hijack the project. At this point, we’re not aware of any such threat to the open-source nature of Hadoop.
Q4. How did you expand Hadoop capabilities as a data platform with Pivotal HD?
Scott Yara, Florian Waas: Pivotal HD is a full Apache HD distribution plus some Pivotal add-ons. As we said before, the HDFS abstraction is a pretty good one, but the standard stack on top of it is severely lacking in performance and expressiveness, so we give customers better alternatives. For enterprise customers this means: you can use Pivotal HD like regular Hadoop where applicable but if you need more, you get it in the same bundle.
Q5. What is the rationale behind introducing HAWQ, a relational database that runs atop HDFS?
Scott Yara, Florian Waas: Not quite. We’ve transplanted a modern distributed query engine onto HDFS. We stripped out a lot of “incidental” database technology that databases are notorious for. HAWQ gives enterprises the best of both worlds: high-performance query processing for a query language they already know on the one hand, and scalable open storage on the other hand. And, unlike with a database, data isn’t locked away in a proprietary format: in HAWQ you can access all stored data with any number of tools when you need to.
Q6. How does Pivotal HD differ from Hadapt in this respect?
Scott Yara, Florian Waas: Hadapt is still in its infancy with what looks like a long way to go, mainly because they couldn’t tap into an MPP database product.
Folks sometimes forget how much work goes into building a commercially viable query processor. In the case of Greenplum, it’s been about 10 years of engineering.
Q7. How does HAWQ work?
Scott Yara, Florian Waas: HAWQ is a modern distributed and parallel query processor atop HDFS, with all the features you truly need, but without the bloat of a complete RDBMS.
Obviously there are a number of rather technical details on how exactly the two worlds integrate, and interested readers can find specific technical descriptions on our website.
Q8. You write in the Greenplum Blog that “HAWQ draws from the 10 years of development on the Greenplum Database product”. Can you be more specific?
Scott Yara, Florian Waas: Building a distributed query engine that is general and powerful enough to support deep analytics is a very tough job. There are no shortcuts. Hive and all of these SQL-ish interfaces we’ve recently seen are an attempt at it and work well for simple queries but basically failed to deliver solid performance when it comes to advanced analytics.
Having spent a long time working on DB internals we sometimes keep forgetting how steep a development this technology has undergone. Folks new in this space constantly “discover” some of the problems query processing has dealt with for a long time already, like join ordering—this learning-by-doing approach is kind of cute, but not necessarily effective.
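To illustrate why join ordering matters, here is a minimal Python sketch that estimates the cost of every left-deep join order for three tables. The table names, cardinalities, selectivities, and the cost model (sum of estimated intermediate result sizes) are all hypothetical assumptions chosen for illustration; this is not how the Greenplum or HAWQ optimizer works.

```python
from itertools import permutations

# Hypothetical table sizes and join selectivities, for illustration only.
cardinality = {"orders": 1_000_000, "customers": 10_000, "regions": 10}
selectivity = {
    frozenset({"orders", "customers"}): 1 / 10_000,
    frozenset({"customers", "regions"}): 1 / 10,
}

def plan_cost(order):
    """Sum the estimated sizes of all intermediate results of a left-deep plan."""
    joined = {order[0]}
    size = cardinality[order[0]]
    cost = 0
    for table in order[1:]:
        size *= cardinality[table]
        for other in joined:
            # Pairs without a join predicate behave like a cross product (selectivity 1).
            size *= selectivity.get(frozenset({other, table}), 1.0)
        joined.add(table)
        cost += size
    return cost

# Exhaustive search is only feasible for a handful of tables.
best_order = min(permutations(cardinality), key=plan_cost)
```

With these made-up numbers, joining the two small tables first yields the cheapest plan, while joining `orders` with `regions` first (a cross product here) is roughly ten times more expensive.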
Q9. Why does HAWQ have its own execution engine separate from MapReduce? Why does it manage its own data?
Scott Yara, Florian Waas: Looks like we’re answering the questions in the wrong order.
MapReduce is a great tool to teach parallelism and distributed processing and makes for an intuitive hands-on experience. But unless your problem is as simple as a distributed word-count, MapReduce quickly becomes a major development and maintenance headache; and even then the resulting performance is sub-standard.
In short, MapReduce, while maybe great for software shops with deep expertise in distributed programming and a do-it-yourself attitude, is not enterprise-ready.
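For readers unfamiliar with the word-count example mentioned above, the following is a minimal single-process Python sketch of the MapReduce pattern. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions and are not Hadoop APIs; a real job would distribute these phases across a cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for a single word."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce sequentially over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big ideas", "data platforms"])
```

Even in this toy form, the logic is spread over three separate phases, which hints at why anything more complex than counting quickly becomes harder to write and maintain than a declarative query.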
Q10. HAWQ supports columnar or row-oriented storage. Why this design choice?
Scott Yara, Florian Waas: Columnar vs. row-orientation really is a smoke screen; always has been. We’ve long advocated viewing columnar for what it is: a feature, not an architectural principle. If your query processor follows even the most basic software engineering principles, supporting column-orientation is really easy.
Plenty of white papers have been written on the differences and discussed the application scenarios where one out-performs the other and vice versa. As so often, there is no one-size-fits-all. HAWQ lets customers use what they feel is the right format for the job. We want customers to be successful, not blindly follow an ideology.
The same way different requirements in the workload demand different orientation, HAWQ can ingest different data formats way beyond column or row orientation—optimized for query processing, or optimized for 3rd party applications, etc.—which rounds out the picture.
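As a rough illustration of the difference between the two orientations, the Python sketch below stores the same small table row by row and column by column. The table contents and the helper names `to_columns` and `to_rows` are hypothetical and unrelated to HAWQ's actual storage formats.

```python
# Hypothetical three-row table, used only for illustration.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.5},
    {"id": 3, "region": "EU", "amount": 4.5},
]

def to_columns(rows):
    """Convert row-oriented records into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def to_rows(columns):
    """Convert column-oriented data back into row-oriented records."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]

columns = to_columns(rows)

# An aggregate over one column touches every record in the row layout,
# but only a single list in the columnar layout.
total_from_rows = sum(row["amount"] for row in rows)
total_from_columns = sum(columns["amount"])
```

Both layouts hold the same information and convert into each other losslessly, which is one way to read the point that orientation is a feature of the storage layer rather than an architectural principle.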
Q11. Could you give us some technical detail on how the SQL parallel query optimizer and planner works?
Scott Yara, Florian Waas: What you see in HAWQ today is the tried and true Greenplum Database MPP optimizer with a couple of modifications but largely the same battle-tested technology. That’s what allowed us to move ahead of the competition so quickly while everybody else is still trying to catch up to basic MPP functionality.
Having said that, we’re constantly striving for improvement and pushing the limits. Over the past years, we have invested in what we believe is a ground-breaking optimizer infrastructure which we’ll unveil later this summer. So, stay tuned!
Q12. Could you give us some details on the partitioning strategy you use and what kind of benchmark results do you have?
Scott Yara, Florian Waas: The benchmarks are a funny thing: hardly any competitor can run even the most basic database benchmarks yet, so we’re comparing on the simple, almost trivial, queries only. Anyways, here’s what we’ve been seeing so far: if the query is completely trivial the nearest competitor is slower by at least a factor of two. For anything even slightly more complex the difference widens quickly to one to two orders of magnitude.
Q13. Apache Hadoop is open source. Do you have plans to open up HAWQ and the other technologies layered atop it?
Scott Yara, Florian Waas: We’ve been debating this but haven’t really made a decision, as of yet.
Q14. The Hadoop market is crowded: e.g., Cloudera (Impala), Hortonworks’ Data Platform for Windows, Intel’s Hadoop distribution, NewSQL data store Hadapt. How do you stand out from this crowd of competitors?
Scott Yara, Florian Waas: We clearly captured a position of leadership with HAWQ and enterprise customers do recognize that. We’ve also received a lot of attention from competitors which shows that we clearly hit a nerve and deliver a piece of the puzzle enterprises have long been waiting for.
Q15. With Pivotal HD are you competing in the same market space as Teradata Aster?
Scott Yara, Florian Waas: Aster has traditionally targeted a few select verticals. For all we can tell, it looks like we’re seeing the continuation of that strategy with a highly specialized offering going forward.
In contrast to that, Pivotal HD strives to be a general purpose solution for as broad a customer spectrum as you can imagine.
Q16. Jeff Hammerbacher in 2011 said “The best minds of my generation are thinking about how to make people click ads… That sucks. If instead of pointing their incredible infrastructure at making people click on ads, they pointed it at great unsolved problems in science, how would the world be different today?” What is your take on this?
Scott Yara, Florian Waas: Jeff garnered a lot of attention with this quote but let’s face it, this type of criticism isn’t exactly novel nor very productive. For decades, Joseph Weizenbaum, one of the pioneers of AI famously lamented about the genius and technology wasted on TV satellites. Along the same lines other MIT faculty have decried the fact that their most successful engineering students become quants on Wall Street. The list is probably long.
Instead of scolding people for what they didn’t do, I’d say, let’s empower people and give them tools to do great things and solve truly important problems. It’s not a coincidence that Big Data problems are at the heart of the most pressing challenges humanity faces today. So, let’s get moving!
Scott Yara, Senior Vice President, Products and Co-Founder, Greenplum/EMC.
In his role as SVP, Products, Scott is responsible for the division’s overall product development and go-to-market efforts, including engineering, product management, and marketing. Scott is a co-founder of Greenplum and was President of the company. Prior to Greenplum, Scott served as vice president for Digital Island, a publicly traded Internet infrastructure services company that was acquired by Cable & Wireless in 2001. Prior to Digital Island, Scott served as vice president for Sandpiper Networks, an Internet content delivery services company that merged with Digital Island in 1999. At Sandpiper, Scott helped to create the industry’s first content delivery network (CDN), a globally distributed computing infrastructure comprised of several thousand servers, and used by many of the industry’s largest Internet services including Microsoft and Disney.
As Senior Director of Advanced Research and Development at Greenplum/EMC, Florian Waas heads up the division’s department of Impossible Ideas. That is to say, his day job is to look into ideas that are far from ready to be undertaken as engineering efforts, and then look at what it would take to turn theory into practice.
He obtained his MSc in Computer Science from Passau University, Germany and a PhD in database research from the University of Amsterdam. Florian Waas has worked as a researcher for several European research consortia and universities in Germany, Italy, and The Netherlands. Before joining Greenplum, Florian Waas held positions at Microsoft and Amazon.com.