Interview with Rick Cattell: There is no “one size fits all” solution.
I start 2011 with an interview with Dr. Rick Cattell.
Rick is best known for his contributions to database systems and middleware: he was a founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), and a co-creator of JDBC.
Rick worked for over twenty years at Sun Microsystems in management and senior technical roles, and for ten years in research at Xerox PARC and Carnegie Mellon University.
You can download the article by Rick Cattell: “Relational Databases, Object Databases, Key-Value Stores, Document Stores, and Extensible Record Stores: A Comparison.”
Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?
Rick Cattell: Basically, things changed with “Web 2.0” and with other applications where there were thousands or millions of users writing as well as reading a database. RDBMSs could not scale to this number of writers. Amazon (with Dynamo) and Google (with BigTable) were forced to develop their own scalable datastores. A host of others followed suit.
Q2. There has recently been a proliferation of “new data stores”, such as “document stores” and “NoSQL databases”: what are the differences between them?
Rick Cattell: That’s a good question. The proliferation and differences are confusing, and I have no one-paragraph answer to this question. The systems differ in data model, consistency model, and many other dimensions. I wrote a couple of papers and provide some references on my website; these may be helpful for more background. There I categorize several kinds of “NoSQL” data stores according to data model: key-value stores, document stores, and extensible record stores. I also discuss scalable SQL stores.
Q3. How do the new data stores compare with relational databases?
Rick Cattell: In a nutshell, NoSQL datastores give up SQL and they give up ACID transactions, in exchange for scalability. Scalability is achieved by partitioning and/or replicating the data over many servers. There are some other advantages, as well: for example, the new data stores generally do not demand a fixed data schema, and provide a simpler programming interface, e.g. a RESTful interface.
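As a rough sketch of the partitioning idea Cattell describes (server names and the hash function are hypothetical choices for illustration), a key-value store can map each key deterministically to one server:

```python
import hashlib

# Hypothetical sketch: sharding keys across servers by hashing.
# The server list and hash choice are illustrative, not any
# particular product's implementation.
SERVERS = ["node0", "node1", "node2", "node3"]

def shard_for(key: str) -> str:
    """Map a key deterministically to one of the servers."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# Every client computes the same mapping, so reads and writes for a
# given key always reach the same server, with no central coordinator.
print(shard_for("user:42"))
```

Because routing needs only the key, each operation touches a single server, which is what lets these systems add servers to scale writes.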
Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?
Rick Cattell: It is true that OODBs provide features that NoSQL systems do not, like integration with OOPLs and ACID transactions. On the other hand, OODBs do not provide horizontal scalability. There is no “one size fits all” solution, just as OODBs and RDBMSs are good for different applications.
Q5. With the emergence of cloud computing, new data management systems have surfaced. What is your opinion of the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?
Rick Cattell: There are a number of data management issues with cloud computing, in addition to the scaling issue I already discussed. For example, if you don’t know which servers your software is going to run on, you cannot tune your hardware (RAM, flash, disk, CPU) to your software, or vice versa.
Q6. What do cloud stores omit that enables them to scale so well?
Rick Cattell: You haven’t defined “cloud stores”. I’m going to assume that you mean something similar to what we discussed earlier: new data stores that provide horizontal scaling. In which case, I answered that question earlier: they give up SQL and ACID.
Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?
Rick Cattell: As I interpret this question, systems such as MongoDB already have this. Also, a SQL interpreter has been ported to BigTable, but the lower-level interface has proven to be more popular. The main scalability problem with declarative queries is when queries require operations like joins or transactions that span many servers: then you get killed by the node coordination and data movement.
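A toy sketch can make the cost difference concrete (the keys, records, and placement rule are made up for illustration): a primary-key read is routed to one shard, while a declarative filter on a non-key attribute must be scattered to every shard and gathered:

```python
# Hypothetical sketch: primary-key access vs. a declarative query
# over sharded data. Three in-process dicts stand in for servers.
NSHARDS = 3

def shard_of(key: str) -> int:
    # Deterministic placement: the key alone identifies the shard.
    return (int(key.lstrip("u")) - 1) % NSHARDS

shards = [{} for _ in range(NSHARDS)]
for key, rec in {"u1": {"city": "Rome"}, "u2": {"city": "Oslo"},
                 "u3": {"city": "Rome"}}.items():
    shards[shard_of(key)][key] = rec

def get(key: str) -> dict:
    # Primary-key read: one shard, one network hop.
    return shards[shard_of(key)].get(key)

def query_by_city(city: str) -> list:
    # Declarative filter on a non-key column: scatter to every shard,
    # scan, and gather -- coordination cost grows with server count.
    return [k for shard in shards for k, rec in shard.items()
            if rec["city"] == city]
```

Joins are worse still, since matching rows may live on different servers and must be moved between them before they can be combined.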
Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management.
To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?
Rick Cattell: I agree with Stonebraker. There are actually two points here: one about performance (of each server) and one about scalability (of all the servers together). We already discussed the latter.
Stonebraker makes an important point about the former: with databases that fit mostly in RAM (on distributed servers), the DBMS architecture needs to change dramatically; otherwise 90% of your overhead goes into transaction coordination, locking, latching for multi-threading, buffer management, and other operations that are “acceptable” in traditional DBMSs, where you spend your time waiting for disk. Stonebraker and I had an argument a year ago, and reached agreement on this as well as other issues on scalable DBMSs. We wrote a paper about our agreement, which will appear in CACM. It can be found on my website in the meantime.
Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations or your transactions span many nodes. Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?
Rick Cattell: Yes, I believe so. MySQL Cluster is already close to doing so, and I believe that VoltDB and Clustrix will do so. The key to scalability with RDBMSs is to avoid SQL and transactions that span nodes, as you say. VoltDB demands that transactions be encapsulated as stored procedures, and allows some control over how tables are sharded over nodes. This allows transactions to be pre-compiled and pre-analyzed to execute on a single node, in general.
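The VoltDB-style idea Cattell describes can be sketched roughly as follows (the node layout, routing rule, and procedure are hypothetical): a transaction is encapsulated as a procedure keyed by the sharding column, so it can be routed to, and executed entirely on, a single node:

```python
# Hypothetical sketch: a "stored procedure" routed by shard key so
# the whole transaction runs on one node. In-process dicts stand in
# for nodes; this is not VoltDB's actual API.
NODES = [{"accounts": {}} for _ in range(4)]

def node_for(account_id: int) -> dict:
    # The accounts table is sharded by account_id.
    return NODES[account_id % len(NODES)]

def deposit(account_id: int, amount: int) -> int:
    """Runs entirely on the node owning account_id, so no cross-node
    locking or two-phase commit is needed."""
    node = node_for(account_id)
    balance = node["accounts"].get(account_id, 0) + amount
    node["accounts"][account_id] = balance
    return balance
```

Because the procedure and its shard key are known up front, the system can pre-analyze the transaction and confirm it touches only one node, which is where the scalability comes from.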
Q10. There are also XML DBs, which go beyond relational. Hybridization with relational has turned out to be very useful: for example, DB2 has a huge investment in XML, which is extensively published and has also succeeded commercially. MonetDB did substantial work in that area early on as well. How do they relate to the “new data stores”?
Rick Cattell: With XML, we have yet another data model, like relational and object-oriented. XML data can be stored in a separate DBMS like MonetDB, or can be transformed for storage in another DBMS, as with DB2. The focus of the new NoSQL data stores is generally not a new data model, but new scalability. In fact, they generally have quite simple data models. The “document stores” like MongoDB and CouchDB do allow nested objects, which might make them more amenable to storing XML. But in my experience, the new data stores are being used to store simpler data, like the key-value pairs required for user information on a web site.
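The simple use case Cattell mentions can be sketched as follows (the store, key scheme, and field names are made up for illustration): per-user records kept as key-value pairs, where the value is a schemaless nested document of the kind MongoDB or CouchDB would accept:

```python
import json

# Hypothetical sketch: user records as key-value pairs whose values
# are nested JSON documents. A local dict stands in for a remote
# key-value / document store.
store = {}

def put_user(user_id: str, doc: dict) -> None:
    store[f"user:{user_id}"] = json.dumps(doc)  # no fixed schema

def get_user(user_id: str) -> dict:
    return json.loads(store[f"user:{user_id}"])

put_user("42", {"name": "Ada", "prefs": {"theme": "dark", "lang": "en"}})
```

Nothing here requires a schema or a query language: each user's data is fetched whole by key, which is exactly the access pattern these stores scale well.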
Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?
Rick Cattell: This is an even harder question to answer than the ones contrasting the DBMSs themselves, because each application has characteristics that might make you lean one way or another. I made an attempt at answering this in the paper I mentioned, but I only scratched the surface… I concluded that there is no “cookbook” answer to tell you which way to go.