On Hadoop RDBMS. Interview with Monte Zweben.
“HBase and Hadoop are the only technologies proven to scale to dozens of petabytes on commodity servers, currently being used by companies such as Facebook, Twitter, Adobe and Salesforce.com.”–Monte Zweben.
Is it possible to turn Hadoop into a RDBMS? On this topic, I have interviewed Monte Zweben, Co-Founder and Chief Executive Officer of Splice Machine.
Q1. What are the main challenges of applications and operational analytics that support real-time, interactive queries on data updated in real-time for Big Data?
Monte Zweben: Let’s break down “real-time, interactive queries on data updated in real-time for Big Data”. “Real-time, interactive queries” means that results need to be returned in milliseconds to a few seconds.
For “Data updated in real-time” to happen, changes in data should be reflected in milliseconds. “Big Data” is often defined as dramatically increased volume, velocity, and variety of data. Of these three attributes, data volume typically dominates, because unlike the other attributes, its growth is virtually unbounded.
Traditional RDBMSs like MySQL or Oracle can support real-time, interactive queries on data updated in real-time, but they struggle on handling Big Data. They can only scale up on larger servers that can cost hundreds of thousands, if not millions of dollars per server.
Big Data technologies such as Hadoop can easily handle Big Data data volumes with their ability to scale-out on commodity hardware. However, with their batch analytics heritage, they often struggle to provide real-time, interactive queries. They also lack ACID transactions to support data updated in real time.
So, real-time applications and operational analytics had to choose between real-time interactive queries on data updated in real-time, or Big Data volumes. With Splice Machine, these applications can have the best of both worlds: real-time interactive queries, the reliability of real-time updates on ACID transactions, and the ability to handle Big Data volumes with a 10x price/performance improvement over traditional RDBMSs.
Q2. You suggested that companies should replace their traditional RDBMS systems. Why and when? Do you really think this is always possible? What about legacy systems?
Monte Zweben: Companies should consider replacing their traditional RDBMSs when they experience significant cost or scaling issues. Our informal surveys of customers indicate that up to half of traditional RDBMSs experience cost or scaling issues. The biggest barrier to migrating from a traditional RDBMS to a new database like Splice Machine is converting custom stored procedure (e.g., PL/SQL code). Operational analytics often have limited custom stored procedure code, so the migration process is generally straightforward.
Operational applications typically have thousands of lines of custom stored procedure code, but in extreme cases it can run into hundreds of thousands to millions of lines of code. There are actually commercially-supported tools that will convert from PL/SQL to the Java needed for Splice Machine. We have typically seen them convert from 70-95% accurately, but it will obviously depend on the complexity of the original code. Financially, migration makes sense for many companies to get an ongoing 10x price/performance, but there are cases when it does not make sense because converting custom code is too expensive.
Q3. Is scale-out the solution to Big Data at scale? Why?
Monte Zweben: Scale-out is definitely the critical technology to making Big Data work at scale. Scale-out leverages inexpensive, commodity hardware to parallelize queries to easily achieve a 10x price/performance improvement over existing database technologies.
Q4. You have announced your real-time relational database management system. What is special about Splice Machine`s Hadoop RDBMS?
Monte Zweben: We are the only Hadoop RDBMS. There are obviously many RDBMSs, but we are the only one with scale-out technology from Hadoop. Hadoop is the only scale-out technology proven to scales into dozens of petabytes on commodity hardware at companies like Facebook. There are other SQL-on-Hadoop technologies, but none of them can support real-time ACID transactions.
Q5 Hadoop-connected SQL databases do not eliminate “silos”. How do you handle this?
Q6. How did you manage to move Hadoop beyond its batch analytics heritage to power operational applications and real-time analytics?
Monte Zweben: At its core, Hadoop is a distributed file system (HDFS) where data cannot be updated or deleted. If you want to update or delete anything, you have to reload all the data (i.e., batch load). As a file system, it has very limited ability to seek specific data; instead, you use Java MapReduce programs to scan all of the data to find the data you need. It can easily take hours or even days for queries to return data (i.e., batch analytics). There is no way you could support a real-time application on top of HDFS and MapReduce.
By using HBase (a real-time key value store on top of HDFS), Splice Machine provides a full RDBMS on top of Hadoop.
You can now get real-time, interactive queries on real-time updated data on Hadoop, necessary to support operational applications and analytics.
Q7. How do you use Apache Derby™ and Apache HBase™/Hadoop?
Monte Zweben: Splice Machine marries two proven technology stacks: Apache Derby for ANSI SQL and HBase/Hadoop for proven scale out technology. With over 15 years of development, Apache Derby is a Java-based SQL database. Splice Machine chose Derby because it is a full-featured ANSI SQL database, lightweight (<3 MB), and easy to embed into the HBase/Hadoop stack.
HBase and Hadoop are the only technologies proven to scale to dozens of petabytes on commodity servers, currently being used by companies such as Facebook, Twitter, Adobe and Salesforce.com. Splice Machine chose HBase and Hadoop because of their proven auto-sharding, replication, and failover technology.
Q8. Why did you replace the storage engine in Apache Derby with HBase?
Q9. Why did you redesign the planner, optimizer, and executor of Apache Derby?
Monte Zweben: We redesigned the planner, optimizer, and executor of Derby because Splice Machine has a distributed computing infrastructure instead of its old shared-disk storage. Distributed computing requires a functional re-architecting because computation must be distributed to where the data is, instead of moving the data to the computation.
Q10. What are the main benefits for developers and database architects who build applications?
Monte Zweben: There are two main benefits to Splice Machine for developers and database architects. First, no longer is data scaling a barrier to using massive amounts of data in an application; you no longer need to prune data or rewrite applications to do unnatural acts like manual sharding. Second, you can enjoy the scaling with all the critical features of an RDBMS – strong consistency, joins, secondary indexes for fast lookups, and reliable updates with transactions. Without those features, developers have to implement those functions for each application, a costly, time-consuming, and error-prone process.
Monte Zweben, Co-Founder and Chief Executive Officer, Splice Machine
A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Branch Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.
In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of Red Prairie. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company, and is the chairman of Clio Music, an advanced music research and development company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.
Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science. Monte’s involvement with CMU, which has been a long-time leader in distributed computing and Big Data research, helped inspire the original concept behind Splice Machine.
ODBMS.org: Several Free Resources on Hadoop.
–> FOLLOW ODBMS.ORG ON TWITTER: @odbmsorg