On NoSQL. Interview with Rick Cattell.
” There aren’t enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves. Likewise, companies with closed source will have trouble finding customers willing to risk a closed source (or limited open source) solution. It will be interesting to see what happens. But I don’t see NoSQL going away, there is a well-established following.” –Rick Cattell.
I have asked Rick Cattell, one of the leading independent consultants in database systems, a few questions on NoSQL.
Q1. For years, you have been studying the NoSQL area and writing articles about scalable databases. What is new in the last year, in your view? What is changing?
Rick Cattell: It seems like there’s a new NoSQL player every month or two, now!
It’s hard to keep track of them all. However, a few players have become much more popular than the others.
Q2. Which players are those?
Rick Cattell: Among the open source players, I hear the most about MongoDB, Cassandra, and Riak now, and often HBase and Redis. However, don’t forget that the proprietary players like Amazon, Oracle, and Google have NoSQL systems as well.
Q3. How do you define “NoSQL”?
Rick Cattell: I use the term to mean systems that provide a simple operations like key/value storage or simple records and indexes, and that focus on horizontal scalability for those simple operations. Some people categorize horizontally scaled graph databases and object databases to be “NoSQL” as well. However, those systems have very different characteristics. Graph databases and object databases have to efficiently break connections up over distributed servers, and have to provide operations that somehow span servers as you traverse the graph. Distributed graph/object databases have been around for a while, but efficient distribution is a hard problem. The NoSQL databases simply distribute (or shard) each data type based on a primary key; that’s easier to do efficiently.
Q4. What other categories of systems do you see?
Rick Cattell: Well, there are systems that focus on horizontal scaling for full SQL with joins, which are generally called “NewSQL“, and systems optimized for “Big Data” analytics, typically based on Hadoop map/reduce. And of course, you can also sub-categorize the NoSQL systems based on their data model and distribution model.
Q5. What subcategories would those be?
Rick Cattell: On data model, I separate them into document databases like MongoDB and CouchBase, simple key/value stores like Riak and Redis, and grouped-column stores like HBase and Cassandra. However, a categorization by data model is deceptive, because they also differ quite a bit in their performance and concurrency guarantees.
Q6: Which systems perform best?
Rick Cattell: That’s hard to answer. Performance is not a scale from “good” to “bad”… the different systems have better performance for different kinds of applications. MongoDB performs incredibly well if all your data fits in distributed memory, for example, and Cassandra does a pretty good job of using disk, because of its scheme of writing new data to the end of disk files and consolidating later.
Q7: What about their concurrency guarantees?
Rick Cattell: They are all over the place on concurrency. The simplest provide no guarantees, only “eventual consistency“. You don’t know which version of data you’ll get with Cassandra. MongoDB can keep a “primary” replica consistent if you can live with their rudimentary locking mechanism.
Some of the new systems try to provide full ACID transactions. FoundationDB and Oracle NoSQL claim to do that, but I haven’t yet verified that. I have studied Google’s Spanner paper, and they do provide true ACID consistency in a distributed world, for most practical purposes. Many people think the CAP theorem makes that impossible, but I believe their interpretation of the theorem is too narrow: most real applications can have their cake and eat it too, given the right distribution model. By the way, graph/object databases also provide ACID consistency, as does VoltDB, but as I mentioned I consider them a different category.
Q8: I notice you have an unpublished paper on your website, called 2x2x2 Requirements for Scalability. Can you explain what the 2x2x2 means?
Rick Cattell: Well, the first 2x means that there are two different kinds of scalability: horizontal scaling over multiple servers, and vertical scaling for performance on a single server. The remaining 2×2 means that there are two key features needed to achieve the horizontal and vertical scaling, and for each of those, there are two additional things you have to do to make the features practical.
Q9: What are those key features?
Rick Cattell: For horizontal scaling, you need to partition and replicate your data. But you also need automatic failure recovery and database evolution with no downtime, because when your database runs on 200 nodes, you can’t afford to take the database offline and you can’t afford operator intervention on every failure.
To achieve vertical scaling, you need to take advantage of RAM and you need to avoid random disk I/O. You also need to minimize the overhead for locking and latching, and you need to minimize network calls between servers. There are various ways to do that. The best systems have all eight of these key features. These eight features represent my summary of scalable databases in a nutshell.
Q10: What do you see happening with NoSQL, going forward?
Rick Cattell: Good question. I see a lot of consolidation happening… there are too many players! There aren’t enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves. Likewise, companies with closed source will have trouble finding customers willing to risk a closed source (or limited open source) solution.
It will be interesting to see what happens. But I don’t see NoSQL going away, there is a well-established following.
R. G. G. “Rick” Cattell is an independent consultant in database systems.
He previously worked as a Distinguished Engineer at Sun Microsystems, most recently on open source database systems and distributed database scaling. Dr. Cattell served for 20+ years at Sun Microsystems in management and senior technical roles, and for 10 years in research at Xerox PARC and at Carnegie-Mellon University. Dr. Cattell is best known for his contributions in database systems and middleware, including database scalability, enterprise Java, object/relational mapping, object-oriented databases, and database interfaces. He is the author of several dozen papers and five books, and a co-inventor of six U.S. patents.
At Sun he instigated the Enterprise Java, Java DB, and Java Blend projects, and was a contributor to a number of Java APIs and products. He previously developed the Cedar DBMS at Xerox PARC, the Sun Simplify database GUI, and SunSoft’s CORBA-database integration.
He is a co-founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), the co-creator of JDBC, the author of the world’s first monograph on object/relational and object databases, a recipient of the ACM Outstanding PhD Dissertation Award, and an ACM Fellow.
– ODBMS.org Free Downloads and Links
In this section you can download free resources covering the following topics:
Big Data and Analytical Data Platforms
Cloud Data Stores
NoSQL Data Stores
Graphs and Data Stores
Entity Framework (EF) Resources
Object-Relational Impedance Mismatch
NewSQL, XML, RDF Data Stores, RDBMS
Follow ODBMS.org on Twitter: @odbmsorg