Interview with Mike Stonebraker.

by Roberto V. Zicari on May 2, 2012

“I believe that “one size does not fit all”. I.e. in every vertical market I can think of, there is a way to beat legacy relational DBMSs by 1-2 orders of magnitude.” — Mike Stonebraker.

I have interviewed Mike Stonebraker, serial entrepreneur and professor at MIT. In particular, I wanted to know more about his latest endeavor, VoltDB.

RVZ

Q1. In your career you developed several data management systems, namely: the Ingres relational DBMS, the object-relational DBMS PostgreSQL, the Aurora/Borealis stream processing engine (commercialized as StreamBase), the C-Store column-oriented DBMS (commercialized as Vertica), and the H-Store transaction processing engine (commercialized as VoltDB). In retrospect, what are, in a nutshell, the main differences and similarities between all these systems? What are their respective strengths and weaknesses?

Stonebraker: In addition, I am building SciDB, a DBMS oriented toward complex analytics.
I believe that “one size does not fit all”. I.e. in every vertical market I can think of, there is a way to beat legacy relational DBMSs by 1-2 orders of magnitude.
The techniques used vary from market to market. Hence, StreamBase, Vertica, VoltDB and SciDB are all specialized to different markets. At this point Postgres and Ingres are legacy code bases.

Q2. In 2009 you co-founded VoltDB, a commercial startup based on ideas from the H-Store project. H-Store is a distributed in-memory OLTP system. What is special about VoltDB? How does it compare with other in-memory databases, for example SAP HANA or Oracle TimesTen?

Stonebraker: A bunch of us wrote a paper “Through the OLTP Looking Glass and What We Found There” (SIGMOD 2008). In it, we identified 4 sources of significant OLTP overhead (concurrency control, write-ahead logging, latching and buffer pool management).
Unless you make a big dent in ALL FOUR of these sources, you will not run dramatically faster than current disk-based RDBMSs. To the best of my knowledge, VoltDB is the only system that eliminates or drastically reduces all four of these overhead components. For example, TimesTen uses conventional record-level locking, an ARIES-style write-ahead log and conventional multi-threading, leading to a substantial need for latching. Hence, it eliminates only one of the four sources.

Q3. VoltDB is designed for what you call “high velocity” applications. What do you mean by that? What are the main technical challenges for such systems?

Stonebraker: Consider an application that maintains the “state” of a multi-player internet game. This state is subject to a collection of perhaps thousands of streams of player actions. Hence, there is a collective “firehose” that the DBMS must keep up with.

In a variety of OLTP applications, the input is a high velocity stream of some sort. These include electronic trading, wireless telephony, digital advertising, and network monitoring.
In addition to drinking from the firehose, such applications require ACID transactions and light real-time analytics, exactly the requirements of traditional OLTP.

In effect, the definition of transaction processing has been expanded to include non-traditional applications.

Q4. Goetz Graefe (HP fellow) said in an interview that “disk-less databases are appropriate where the database contains only application state, e.g., current account balances, currently active logins, current shopping carts, etc. Disks will continue to have a role and economic value where the database also contains history (e.g. cold history such as transactions that affected the account balances, login & logout events, click streams eventually leading to shopping carts, etc.)” What is your take on this?

Stonebraker: In my opinion the best way to organize data management is to run a specialized OLTP engine on current data. Then, send transaction history data, perhaps including an ETL component, to a companion data warehouse. VoltDB is a factor of 50 or so faster than legacy RDBMSs on the transaction piece, while column stores, such as Vertica, are a similar amount faster on historical analytics. In other words, specialization allows each component to run dramatically faster than a “one size fits all” solution.

A “two system” solution also avoids resource management issues and lock contention, and is very widely used as a DBMS architecture.

Q5. Where will the (historical) data go if we have no disks? In the Cloud?

Stonebraker: Into a companion data warehouse. The major DW players are all disk-based.

Q6. How does VoltDB ensure durability?

Stonebraker: VoltDB automatically replicates all tables. On a failure, it performs “Tandem-style” failover and eventual failback. Hence, it totally masks most errors. To protect against cluster-wide failures (such as power issues), it supports snapshotting of data and an innovative “command logging” capability. Command logging has been shown to be wildly faster than data logging, and supports the same durability as data logging.
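
To make the distinction concrete, here is a minimal Python sketch of the general idea of command logging: the log records which procedure was invoked and with which arguments, rather than the rows it changed, and recovery replays those invocations against a snapshot. This is an illustration under stated assumptions, not VoltDB code. The procedures `deposit` and `transfer`, the class `CommandLoggedStore`, and the function `recover` are hypothetical names, and an in-memory list stands in for a durable log file. The approach only works if the logged procedures are deterministic.

```python
import json

# Hypothetical deterministic procedures (illustrative names, not a VoltDB API).
def deposit(state, account, amount):
    state[account] = state.get(account, 0) + amount

def transfer(state, src, dst, amount):
    # Only move money if the source account can cover it.
    if state.get(src, 0) >= amount:
        state[src] -= amount
        state[dst] = state.get(dst, 0) + amount

PROCEDURES = {"deposit": deposit, "transfer": transfer}

class CommandLoggedStore:
    def __init__(self):
        self.state = {}
        self.log = []  # assumption: stands in for a durable, append-only file

    def execute(self, name, *args):
        # Record the invocation (name + arguments), not the modified data.
        self.log.append(json.dumps([name, list(args)]))
        PROCEDURES[name](self.state, *args)

def recover(snapshot, log):
    # Rebuild state by replaying logged commands on top of a snapshot.
    state = dict(snapshot)
    for entry in log:
        name, args = json.loads(entry)
        PROCEDURES[name](state, *args)
    return state

store = CommandLoggedStore()
store.execute("deposit", "alice", 100)
store.execute("transfer", "alice", "bob", 30)
recovered = recover({}, store.log)  # replay from an empty snapshot
```

Each log entry here is a short record of the call, regardless of how many rows the procedure touches, which is one intuition for why logging commands can be cheaper than logging data.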

Q7. How does VoltDB support atomicity, consistency and isolation?

Stonebraker: All transactions are executed (logically) in timestamp order. Hence, the net outcome of a stream of transactions on a VoltDB database is equivalent to their serial execution in timestamp order.
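
As a toy illustration of what "equivalent to serial execution in timestamp order" means, the following Python sketch applies transactions one at a time, ordered by timestamp, even when they arrive out of order. It is a simplified single-process model, not a description of VoltDB's actual scheduler; `run_in_timestamp_order`, `add_ten`, and `double` are hypothetical names.

```python
import heapq

def run_in_timestamp_order(transactions, state):
    """Apply (timestamp, fn) pairs serially, lowest timestamp first."""
    # The index i breaks timestamp ties so functions are never compared.
    heap = [(ts, i, fn) for i, (ts, fn) in enumerate(transactions)]
    heapq.heapify(heap)
    while heap:
        _, _, fn = heapq.heappop(heap)
        fn(state)  # one transaction at a time: no interleaving
    return state

def add_ten(state):
    state["balance"] += 10

def double(state):
    state["balance"] *= 2

# The transaction with timestamp 2 arrives before the one with timestamp 1.
arrivals = [(2, double), (1, add_ten)]
final = run_in_timestamp_order(arrivals, {"balance": 5})
# Timestamp order gives (5 + 10) * 2 = 30; arrival order would give 5 * 2 + 10 = 20.
```

Because the outcome depends only on the timestamps, any replica that sees the same set of transactions reaches the same final state.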

Q8. Would you call VoltDB a relational database system? Does it support standard SQL? How do you handle scalability problems for complex joins of large amounts of data?

Stonebraker: VoltDB supports standard SQL.
Complex joins should be run on a companion data warehouse. After all, the only way to interleave “big reads” with “small writes” in a legacy RDBMS is to use snapshot isolation or run with a reduced level of consistency.
You either get an out-of-date but consistent answer, or an up-to-date but inconsistent answer. Directing big reads to a companion DW gives you the same result as snapshot isolation. Hence, I don’t see any disadvantage to doing big reads on a companion system.

Concerning larger amounts of data, our experience is that OLTP problems with more than a few Tbytes of data are quite rare. Hence, these can easily fit in main memory, using a VoltDB architecture.

In addition, we are planning extensions of the VoltDB architecture to handle larger-than-main-memory data sets. Watch for product announcements in this area.

Q9. Does VoltDB handle disaster recovery? If yes, how?

Stonebraker: VoltDB just announced support for replication over a wide area network. This capability supports failover to a remote site if a disaster occurs. Check out the VoltDB web site for details.

Q10. VoltDB’s mission statement is “to deliver the fastest, most scalable in-memory database products on the planet”. What performance measurements do you have so far to support this claim?

Stonebraker: We have run TPC-C at about 50X the performance of a popular legacy RDBMS. In addition, we have shown linear TPC-C scalability to 384 cores (more than 3 million transactions per second). That was the biggest cluster we could get access to; there is no reason why VoltDB would not continue to scale.

Q11. Can In-Memory Data Management play a significant role also for Big Data Analytics (up to several PB of data)? If yes, how? What are the largest data sets that VoltDB can handle?

Stonebraker: VoltDB is not focused on analytics. We believe they should be run on a companion data warehouse.

Most of the warehouse customers I talk to want to keep increasingly large amounts of increasingly diverse history to run their analytics over. The major data warehouse players are routinely being asked to manage petabyte-sized data warehouses. It is not clear how important main memory will be in this vertical market.

Q12. You were very critical about Apache Hadoop, but VoltDB offers an integration with Hadoop. Why? How does it work technically? What are the main business benefits from such an integration?

Stonebraker: Consider the “two system” solution mentioned above. VoltDB is intended for the OLTP portion, and some customers wish to run Hadoop as a data warehouse platform. To facilitate this architecture, VoltDB offers a Hadoop connector.

Q13. How “green” is VoltDB? What is the tradeoff between total power consumption and performance? Do you have any benchmarking results for that?

Stonebraker: We have no official benchmarking numbers. However, on a large variety of applications VoltDB is a factor of 50 or more faster than traditional RDBMSs. Put differently, if legacy folks need 100 nodes, then we need 2!

In effect, if you can offer vastly superior performance (say times 50) on the same hardware, compared to another system, then you can offer the same performance on 1/50th of the hardware. By definition, you are 50 times “greener” than they are.
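
The arithmetic behind this claim can be sketched in a few lines of Python. The sketch takes the uniform 50x speedup on identical hardware from the answer above and additionally assumes that every node draws the same power regardless of which system it runs. That second assumption is not a measured result. The function names `nodes_needed` and `hardware_ratio` are hypothetical.

```python
import math

def nodes_needed(legacy_nodes, speedup):
    # Nodes required to match the legacy cluster's throughput, rounded up.
    return math.ceil(legacy_nodes / speedup)

def hardware_ratio(legacy_nodes, speedup):
    # How many times less hardware is used; under the equal-power-per-node
    # assumption this is also the ratio of energy consumed.
    return legacy_nodes / nodes_needed(legacy_nodes, speedup)

nodes = nodes_needed(100, 50)    # the "100 nodes versus 2" example
ratio = hardware_ratio(100, 50)
```

With these inputs the sketch reproduces the figures quoted in the answer: 2 nodes and a ratio of 50.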

Q14. You are currently working on science-oriented DBMSs and search engines for accessing the deep web. Could you please give us some details? What kind of results have you obtained so far?

Stonebraker: We are building SciDB, which is oriented toward complex analytics (regression, clustering, machine learning, …). It is my belief that such analytics will become much more important in the future. Such analytics are invariably defined on arrays, not tables. Hence, SciDB is an array DBMS, supporting a dialect of SQL for array data. We expect it to be wildly faster than legacy RDBMSs on this kind of application. See SciDB.org for more information.

Q15. You are a co-founder of several venture-capital-backed start-ups. In which areas?

Stonebraker: The recent ones are: StreamBase (stream processing), Vertica (data warehouse market), VoltDB (OLTP), Goby.com (data aggregation of web sources), Paradigm4 (SciDB and complex analytics).

Check the company web sites for more details.

——————————–
Mike Stonebraker
Dr. Stonebraker has been a pioneer of data base research and technology for more than a quarter of a century. He was the main architect of the INGRES relational DBMS, and the object-relational DBMS, POSTGRES. These prototypes were developed at the University of California at Berkeley where Stonebraker was a Professor of Computer Science for twenty five years. More recently at M.I.T. he was a co-architect of the Aurora/Borealis stream processing engine, the C-Store column-oriented DBMS, and the H-Store transaction processing engine. Currently, he is working on science-oriented DBMSs, OLTP DBMSs, and search engines for accessing the deep web. He is the founder of five venture-capital backed startups, which commercialized his prototypes. Presently he serves as Chief Technology Officer of VoltDB, Paradigm4, Inc. and Goby.com.

Professor Stonebraker is the author of scores of research papers on data base technology, operating systems and the architecture of system software services. He was awarded the ACM System Software Award in 1992, for his work on INGRES. Additionally, he was awarded the first annual Innovation award by the ACM SIGMOD special interest group in 1994, and was elected to the National Academy of Engineering in 1997. He was awarded the IEEE John Von Neumann award in 2005, and is presently an Adjunct Professor of Computer Science at M.I.T.

– On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum. (February 1, 2012)

– A super-set of MySQL for Big Data. Interview with John Busch, Schooner. (February 20, 2012)

– Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB. (December 14, 2011)

– On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica. (November 16, 2011)

– vFabric SQLFire: Better then RDBMS and NoSQL? (October 24, 2011)

– The future of data management: “Disk-less” databases? Interview with Goetz Graefe. (August 29, 2011).

3 Comments

Stefan Edlich

Great interview!
And it's really nice to see that Stonebraker is also arguing for a polyglot persistence world.

Please continue with this series. (Perhaps SAP HANA next? 😉)
Regards
Stefan Edlich

Evgeniy Grigoriev

My question is not directly about NewSQL, nor about the details of the thesis “one size does not fit all”. I understand that special tasks need special solutions, but I would like to ask about the “universal” DBMSs that are still used by a huge number of developers (I think these DBMSs can be better described with the word “usual” rather than “universal”). Mike Stonebraker is known to me as an author of the “Third-Generation Database System Manifesto”, which gives a list of requirements that DBMSs must satisfy, and I hope Mike can answer my question (if he sees it).

Was the spirit of the “Manifesto” implemented fully and properly in the “usual” DBMSs? This question is not about language extensions and new technologies. They exist, but they look like pieces of a puzzle that has not been assembled yet. As a result, many of the new possibilities are not popular (e.g. the “object” extensions of SQL3), and developers prefer to use all the new versions of “usual” DBMSs in the old way. So, from the point of view of these developers, current “usual” DBMSs can hardly be called third-generation ones. Is it possible to say that the “Manifesto” is still relevant?

Marten R.

Regarding question 13 (How “green” is VoltDB?), it would be interesting to have benchmarking results, including performance and power consumption, of VoltDB and traditional RDBMSs on the same hardware. The ratio of power consumption to performance, resulting in a value like “watt per performance-point”, could answer the question of how “green” VoltDB is more accurately. The academic question is: If VoltDB is 50 times faster than a traditional RDBMS, is it also 50 times more efficient when comparing the “watt per performance-point” values?

Great interview with a lot of interesting insights!
