On Big Data, Analytics and Hadoop. Interview with Daniel Abadi.
“Some people even think that “Hadoop” and “Big Data” are synonymous (though this is an over-characterization). Unfortunately, Hadoop was designed based on a paper by Google in 2004 which was focused on use cases involving unstructured data (e.g. extracting words and phrases from Webpages in order to create Google’s Web index). Since it was not originally designed to leverage the structure in relational data in order to take short-cuts in query processing, its performance for processing relational data is therefore suboptimal.”— Daniel Abadi.
Q1. On the subject of “eventual consistency”, Justin Sheehy of Basho recently said in an interview (1): “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”.
What is your opinion on this? Would you wish to have an “eventual consistency” update to your bank account?
Daniel Abadi: It’s definitely true that it’s a common misconception that banks use the ACID features of database systems to perform certain banking transactions (e.g. a transfer from one customer to another customer) in a single atomic transaction. In fact, they use complicated workflows that leverage semantics that reasonably approximate eventual consistency.
Q2. Justin Sheely also said in the same interview that “the use of eventual consistency in well-designed systems does not lead to inconsistency.” How do you ensure that a system is well designed then?
Daniel Abadi: That’s exactly the problem with eventual consistency. If your database only guarantees eventual consistency, you have to make sure your application is well-designed to resolve all consistency conflicts.
For example, eventually consistent systems allow you to update two different replicas to two different values simultaneously. When these updates propagate to the each other, you get a conflict. Application code has to be smart enough to deal with any possible kind of conflict, and resolve them correctly. Sometimes simple policies like “last update wins” is sufficient to handle all cases.
But other applications are far more complicated, and using an eventually consistent database makes the application code far more complex. Often this complexity can lead to errors and security flaws, such as the ATM heist that was written up recently. Having a database that can make stronger guarantees greatly simplifies application design.
Q3. You have created a start up called Hadapt which claims to be the ” first platform to combine Apache Hadoop and relational DBMS technologies”. What is it? Why combining Hadoop with Relational database technologies?
Daniel Abadi: Hadoop is becoming the standard platform for doing large scale processing of data in the enterprise. It’s rate of growth far exceeds any other “Big Data” processing platform. Some people even think that “Hadoop” and “Big Data” are synonymous (though this is an over-characterization). Unfortunately, Hadoop was designed based on a paper by Google in 2004 which was focused on use cases involving unstructured data (e.g. extracting words and phrases from Webpages in order to create Google’s Web index).
Since it was not originally designed to leverage the structure in relational data in order to take short-cuts in query processing, its performance for processing relational data is therefore suboptimal.
At Hadapt, we’re bringing 3 decades of relational database research to Hadoop. We have added features like indexing, co-partitioned joins, broadcast joins, and SQL access (with interactive query response times) to Hadoop, in order to both accelerate its performance for queries over relational data and also provide an interface that third party data processing and business intelligence tools are familiar with.
Therefore we have taken Hadoop, which used to be just a tool for super-smart data scientists, and brought it to the mainstream by providing a high performance SQL interface that business analysts and data analysis tools already know how to use. However, we’ve gone a step further and made it possible to include both relational data and non-relational data in the same query; so what we’ve got now is a platform that people can use to do really new and innovative types of analytics involving both unstructured data like tweets or blog posts and structured data such as traditional transactional data that usually sits in relational databases.
Q4. What is special about tha Hadapt architecture that makes it different from other Hadoop-based products?
Daniel Abadi: The prevalent architecture that people use to analyze structured and unstructured data is a two-system configuration, where Hadoop is used for processing the unstructured data and a relational database system is used for the structured data. However, this is a highly undesirable architecture, since now you have two systems to maintain, two systems where data may be stored, and if you want to do analysis involving data in both systems, you end up having to send data over the network which can be a major bottleneck. What is special about the Hadapt architecture is that we are bringing database technology to Hadoop, so that Hadapt customers only need to deploy a single cluster — a normal Hadoop cluster — that is optimized for both structured and unstructured data, and is capable of pushing the envelope on the type of analytics that can be run over Big Data.
Q5. You claim that Hadapt SQL queries are an order of magnitude faster than Hadoop+Hive? Do you have any evidence for that? How significant is this comparison for an enterprise?
Daniel Abadi: We ran some experiments in my lab at Yale using the queries from the TPC-H benchmark and compared the technology behind Hadapt to both Hive and also traditional database systems.
These experiments were peer-reviewed and published in SIGMOD 2011. You can see the full 12 page paper here. From my point of view, query performance is certainly important for the enterprise, but not as important as the new type of analytics that our platform opens up, and also the enterprise features around availability, security, and SQL compliance that distinguishes our platform from Hive.
Q6. You say that “Hadoop-connected SQL databases” do not eliminate “silos”. What do you mean with silos? And what does it mean in practice to have silos for an enterprise?
Daniel Abadi: A lot of people are using Hadoop as a sort of data refinery. Data starts off unstructured, and Hadoop jobs are run to clean, transform, and structure the data. Once the data is structured, it is shipped to SQL databases where it can be subsequently analyzed. This leads to the raw data being left in Hadoop and the refined data in the SQL databases. But it’s basically the same data — one is just a cleaned (and potentially aggregated) version of the other. Having multiple copies of the data can lead to all kinds of problems. For example, let’s say you want to update the data in one of the two locations — it does not get automatically propagated to the copy in the other silo. Furthermore, let’s say you are doing some analysis in the SQL database and you see something interesting and want to drill down to the raw data — if the raw data is located on a different system, such a drill down becomes highly nontrivial. Furthermore, data provenance is a total nightmare. It’s just a really ugly architecture to have these two systems with a connector between them.
Q7. What is “multi-structured” data analytics?
Daniel Abadi: I would define it as analytics than include both structured data and unstructured data in the same “query” or “job”. For example, Hadapt enables keyword search, advanced analytic functions, and common machine learning algorithms over unstructured data, all of which can be invoked from a SQL query that includes data from relational tables stored inside the same Hadoop cluster. A great example of how this can be used in real life can be seen in the demo given by Data Scientist Mingsheng Hong.
Q8. On the subject of Big Data Analytics, Werner Vogels of Amazon.com said in an interview  that “one of the core concepts of Big Data is being able to evolve analytics over time. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources.”
What is your opinion on this? How Hadapt can help on this?
Daniel Abadi: Yes, I agree. This is precisely why you need a system like Hadapt that has extreme flexibility in terms of the types of data that it can analyze (see above) and types of analysis that can be performed.
Q9. With your research group at Yale you are developing a new data management prototype system called Calvin. What is it? What is special about it?
Daniel Abadi: Calvin is a system designed and built by a large team of researchers in my lab at Yale (Alex Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Anton Petrov, Michael Giuffrida, and Aaron Segal) that scales transactional processing in database systems to millions of transactions per second while still maintaining ACID guarantees.
If you look at what is available in the industry today if you want to scale your transactional workload, you have all these NoSQL systems like HBase or Cassandra or Riak — all of which have to eliminate ACID guarantees in order to achieve scalability. On the other hand, you have NewSQL systems like VoltDB or various sharded MySQL implementations that maintain ACID but can only scale when transactions rarely cross physical server boundaries. Calvin is unique in the way that it uses a deterministic execution model to eliminate the need for commit protocols (such as two phase commit) and allow arbitrary transactions (that can include data on multiple different physical servers) to scale while still maintaining the full set of ACID guarantees. It’s a cool project. I recommend your readers check out the most recent research paper describing the system.
Q10. Looking at three elements: Data, Platform, Analysis, what are the main research challenges ahead?
Daniel Abadi: Here are a few that I think are interesting:
(1) Scalability of non-SQL analytics. How do you parallelize clustering, classification, statistical, and algebraic functions that are not “embarrassingly parallel” (that have traditionally been performed on a single server in main memory) over a large cluster of shared-nothing servers.
(2) Reducing the cognitive complexity of “Big Data” so that it can fit in the working set of the brain of a single analyst who is wrangling with the data.
(3) Incorporating graph data sets and graph algorithms into database management systems.
(4) Enabling platform support for probabilistic data and probabilistic query processing.
Daniel Abadi is an associate professor of computer science at Yale University where his research focuses on database system architecture and implementation, and cloud computing. He received his PhD from MIT, where his dissertation on column-store database systems led to the founding of Vertica (which was eventually acquired by Hewlett Packard).
His research lab at Yale has produced several scalable database systems, the best known of which is HadoopDB, which was subsequently commercialized by Hadapt, where Professor Abadi serves as Chief Scientist. He is a recipient of a Churchill Scholarship, an NSF CAREER Award, a Sloan Research Fellowship, the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, and the 2007 VLDB best paper award.
– MapReduce vs RDBMS [PPT]
– Big Data and Analytical Data Platforms:
Blog Posts | Free Software | Articles | PhD and Master Thesis|