Interview with Mike Stonebraker.
“I believe that “one size does not fit all”. I.e. in every vertical market I can think of, there is a way to beat legacy relational DBMSs by 1-2 orders of magnitude.” — Mike Stonebraker.
I have interviewed Mike Stonebraker, serial entrepreneur and professor at MIT. In particular, I wanted to know more about his last endeavor, VoltDB.
Q1. In your career you developed several data management systems, namely: the Ingres relational DBMS, the object-relational DBMS PostgreSQL, the Aurora Borealis stream processing engine(commercialized as StreamBase), the C-Store column-oriented DBMS (commercialized as Vertica), and the H-Store transaction processing engine (commercialized as VoltDB). In retrospective, what are, in a nutshell, the main differences and similarities between all these systems? What are they respective strengths and weaknesses?
Stonebraker: In addition, I am building SciDB, a DBMS oriented toward complex analytics.
I believe that “one size does not fit all”. I.e. in every vertical market I can think of, there is a way to beat legacy relational DBMSs by 1-2 orders of magnitude.
The techniques used vary from market to market. Hence, StreamBase, Vertica, VoltDB and SciDB are all specialized to different markets. At this point Postgres and Ingres are legacy code bases.
Q2. In 2009 you co-founded VoltDB, a commercial start up based on ideas from the H-Store project. H-Store is a distributed In Memory OLTP system. What is special of VoltDB? How does it compare with other In-memory databases, for example SAP HANA, or Oracle TimesTen?
Stonebraker: A bunch of us wrote a paper “Through the OLTP Looking Glass and What We Found There” (SIGMOD 2008). In it, we identified 4 sources of significant OLTP overhead (concurrency control, write-ahead logging, latching and buffer pool management).
Unless you make a big dent in ALL FOUR of these sources, you will not run dramatically faster than current disk-based RDBMSs. To the best of my knowledge, VoltDB is the only system that eliminates or drastically reduces all four of these overhead components. For example, TimesTen uses conventional record level locking, an Aries-style write ahead log and conventional multi-threading, leading to substantial need for latching. Hence, they eliminate only one of the four sources.
Q3. VoltDB is designed for what you call “high velocity” applications. What do you mean with that? What are the main technical challenges for such systems?
Stonebraker: Consider an application that maintains the “state” of a multi-player internet game. This state is subject to a collection of perhaps thousands of
streams of player actions. Hence, there is a collective “firehose” that the DBMS must keep up with.
In a variety of OLTP applications, the input is a high velocity stream of some sort. These include electronic trading, wireless telephony, digital advertising, and network monitoring.
In addition to drinking from the firehose, such applications require ACID transactions and light real-time analytics, exactly the requirements of traditional OLTP.
In effect, the definition of transaction processing has been expanded to include non-traditional applications.
Q4. Goetz Grafe (HP fellow) said in an interview that “disk-less databases are appropriate where the database contains only application state, e.g., current account balances, currently active logins, current shopping carts, etc. Disks will continue to have a role and economic value where the database also contains history (e.g. cold history such as transactions that affected the account balances, login & logout events, click streams eventually leading to shopping carts, etc.)” What is your take on this?
Stonebraker: In my opinion the best way to organize data management is to run a specialized OLTP engine on current data. Then, send transaction history data,
perhaps including an ETL component, to a companion data warehouse. VoltDB is a factor of 50 or so faster than legacy RDBMSs on the transaction piece, while column stores, such as Vertica, are a similar amount faster on historical analytics. In other words, specialization allows each component to run dramatically faster than a “one size fits all” solution.
A “two system” solution also avoids resource management issues and lock contention, and is very widely used as a DBMS architecture.
Q5. Where will the (historical) data go if we have no disks? In the Cloud?
Stonebraker: Into a companion data warehouse. The major DW players are all disk-based.
Q6. How VoltDB ensures durability?
Stonebraker: VoltDB automatically replicates all tables. On a failure, it performs “Tandem-style” failover and eventual failback. Hence, it totally masks most errors. To protect against cluster-wide failures (such as power issues), it supports snapshotting of data and an innovative “command logging” capability. Command logging
has been shown to be wildly faster than data logging, and supports the same durability as data logging.
Q7. How does VoltDB support atomicity, consistency and isolation?
Stonebraker: All transaction are executed (logically) in timestamp order. Hence, the net outcome of a stream of transactions on a VoltDB data base is equivalent
to their serial execution in timestamp order.
Q8. Would you call VoltDB a relational database system? Does it supports standard SQL? How do you handle scalability problems for complex joins of large amount of data?
Stonebraker: VoltDB supports standard SQL.
Complex joins should be run on a companion data warehouse. After all, the only way to interleave “big reads” with “small writes” in a legacy RDBMS is to use snapshot isolation or run with a reduced level of consistency.
You either get an out-of-date, but consistent answer or an up-to-date, but inconsistent answer. Directing big reads to a companion DW, gives you the same result as snapshot isolation. Hence, I don’t see any disadvantage to doing big reads on a companion system.
Concerning larger amounts of data, our experience is that OLTP problems with more than a few Tbyte of data are quite rare. Hence, these can easily fit in main memory, using a VoltDB architecture.
In addition, we are planning extensions of the VoltDB architecture to handle larger-than-main-memory data sets. Watch for product announcements in this area.
Q9. Does VoltDB handle disaster recovery? If yes, how?
Stonebraker: VoltDB just announced support for replication over a wide area network. This capability support failover to a remote site if a disaster occurs. Check
out voltdb web site for details.
Q10. VoltDB`s mission statement is “to deliver the fastest, most scalable in-memory database products on the planet”. What performance measurements do you have until now to sustain this claim?
Stonebraker: We have run TPC-C at about 50 X the performance of a popular legacy RDBMS. In addition, we have shown linear TPC-C scalability to 384 cores
(more than 3 million transactions per second). That was the biggest cluster we could get access to; there is no reason why VoltDB would not continue to scale.
Q11. Can In-Memory Data Management play a significant role also for Big Data Analytics (up to several PB of data)? If yes, how? What are the largest data sets that VoltDB can handle?
Stonebraker: VoltDB is not focused on analytics. We believe they should be run on a companion data warehouse.
Most of the warehouse customers I talk to want to keep increasing large amounts of increasingly diverse history to run their analytics over. The major data warehouse players are routinely being asked to manage petabyte-sized data warehouses. It is not clear how important main memory will be in this vertical market.
Q12. You were very critical about Apache Hadoop, but VoltDB offers an integration with Hadoop. Why? How does it work technically?
What are the main business benefits from such an integration?
Stonebraker: Consider the “two system” solution mentioned above. VoltDB is intended for the OLTP portion, and some customers wish to run Hadoop as a data
warehouse platform. To facilitate this architecture, VoltDB offers a Hadoop connector.
Q13. How “green” is VoltDB? What are the tradeoff between total power consumption and performance: Do you have any benchmarking results for that?
Stonebraker: We have no official benchmarking numbers. However, on a large variety of applications VoltDB is a factor of 50 or more faster than traditional RDBMSs. Put differently, if legacy folks need 100 nodes, then we need 2!
In effect, if you can offer vastly superior performance (say times 50) on the same hardware, compared to another system, then you can offer the same performance on 1/50th of the hardware. By definition, you are 50 times “greener” than they are.
Q14. You are currently working on science-oriented DBMSs and search engines for accessing the deep web. Could you please give us some details. What kind of results did you obtain so far?
Stonebraker: We are building SciDB, which is oriented toward complex analytics (regression, clustering, machine learning, …). It is my belief that such analytics
will become much more important off into the future. Such analytics are invariably defined on arrays, not tables. Hence, SciDB is an array DBMS, supporting a dialect of SQL for array data. We expect it to be wildly faster than legacy RDBMSs on this kind of application. See SciDB.org for more information.
Q15. You are a co-founder of several venture capital backed start-ups. In which area?
Check the company web sites for more details.
Dr. Stonebraker has been a pioneer of data base research and technology for more than a quarter of a century. He was the main architect of the INGRES relational DBMS, and the object-relational DBMS, POSTGRES. These prototypes were developed at the University of California at Berkeley where Stonebraker was a Professor of Computer Science for twenty five years. More recently at M.I.T. he was a co-architect of the Aurora/Borealis stream processing engine, the C-Store column-oriented DBMS, and the H-Store transaction processing engine. Currently, he is working on science-oriented DBMSs, OLTP DBMSs, and search engines for accessing the deep web. He is the founder of five venture-capital backed startups, which commercialized his prototypes. Presently he serves as Chief Technology Officer of VoltDB, Paradigm4, Inc. and Goby.com.
Professor Stonebraker is the author of scores of research papers on data base technology, operating systems and the architecture of system software services. He was awarded the ACM System Software Award in 1992, for his work on INGRES. Additionally, he was awarded the first annual Innovation award by the ACM SIGMOD special interest group in 1994, and was elected to the National Academy of Engineering in 1997. He was awarded the IEEE John Von Neumann award in 2005, and is presently an Adjunct Professor of Computer Science at M.I.T.