“Distributed joins are hard to scale”: Interview with Dwight Merriman.
On the topic of NoSQL databases, I asked a few questions to Dwight Merriman, CEO of 10gen.
You can read the interview below.
Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?
The catalyst for acceptance of new non-relational tools has been the desire to scale –particularly the desire to scale operational databases horizontally on commodity hardware. Relational is a powerful tool that every developer already knows; using something else has to clear a pretty big hurdle. The push for scale and speed is helping people clear that and add other tools to the mix. However, once the new tools are part of the established set of things developers use, they often find there are other benefits – such as agility of development of data backed applications. So we see two big benefits : speed and agility.
Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them? How do you position 10gen?
The products vary on a few dimensions:
– what’s the data model?
– what’s the scale-out / data distribution model?
– what’s the consistent model (strong consistency or eventual or both?)
– does the product make development more agile, or just focus on scaling only
MongoDB uses a document-oriented data model, auto-sharding (range based partitioning) for data distribution, and is more towards the strong consistency side of the spectrum (for example atomic operations on single documents are possible). Agility is a big priority for the mongodb project.
Q3. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. Do you agree with this? How do these new data stores compare with object-oriented databases?
Some products have a simple philosophy, and others in the space do not. However a little bit must be left out if one wants to scale well horizontally. The philosophy with MongoDB project is to try to make the functionality rich, but not to add any features which make scaling hard or impossible. Thus there is quite a bit of functionality : ad hoc queries, secondary indexes, sorting, atomic operations, etc.
Q4. Recently you annouced the completion of a round of funding adding Sequoia Capital as New Investment Partner. Why Sequoia Capital is investing in 10gen and how do you see the database market shaping on?
We think the “NoSQL” will be an important new tool for companies of all sizes. We expect large companies in the future to have in house three types of data stores : a traditional RDBMS, a NoSQL database, and a data warehousing technology.
Q5. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?
We agree. It has nothing to do with SQL. It has to do with relational though – distributed joins are hard to scale.
Q6. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant inmemory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?
Without loss of generality, we believe no. With loss of generality, perhaps yes. For example if you say, i only want star schemas, i only want to bulk load data at night and query it all day, i only want to run a few really expensive queries not millions of tiny ones — then it works and you are now in the realm of the relational business intelligent / data warehousing products, which do scale out quite well.
If you put some restrictions on – say require the user to do their data models certain ways, or force them to run a proprietary low latency network, or only have 10 nodes instead of 1000, you can do it. But we think there is a need for scalable solutions that work on commodity hardware, particularly because cloud computing is such an important trend in the future. And there are other benefits too such as developer agility.