Robert Greene on “New and Old Data stores”
I am back covering the topic “New and Old Data stores”.
I asked Robert Greene, CTO and V.P. Open Source Operations at Versant, several questions.
Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?
Robert Greene: Well, it’s a question of innovation in the face of need. When relational databases were invented, applications and their models were simpler, data was smaller, and there were fewer concurrent users. There was no internet, no wireless devices, no global information systems. In the mid-1990s, even Larry Ellison stated that complexly related information, at the time found largely in niche application areas like CAD, did not fit well with the relational model. Now, complexity is pervasive in nearly all applications.
Further, the relational model is based on a runtime relationship execution engine, re-calculating relations from primary-key/foreign-key data associations even though the vast majority of data relationships remain fixed once established. When data continues to grow at enormous rates, the approach of re-calculating the relations becomes impractical. Today even ordinary applications start to see data at sizes which in the past were seen only in data warehousing solutions, the first data management space to embrace a non-relational approach to data management.
So, in a generation when millions of users are accessing applications linked to near real-time analytic algorithms, at times operating over terabytes of data, innovation must occur to deal with these new realities.
Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them?
Robert Greene: The answer to this could require a book, but let’s try to distill it into the fundamentals.
I think the biggest difference is the programming model. There is some overlap, so you don’t see clear distinctions, but for each type (object databases, distributed file systems, key-value stores, document stores, and graph stores) the manner in which the user stores and retrieves data varies considerably. The OODB uses language integration, distributed file systems use map-reduce, key-value stores use data keys, document stores use keys plus queries over an indexed metadata overlay, and graph stores use a navigational expression language. I think it is important to point out that “store” is probably a more appropriate label than “database” for many of these technologies, as most do not implement the classical ACID requirements defined for a database.
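A toy contrast between two of those programming models may make the difference concrete. This is only a sketch with a hypothetical in-memory API, not any vendor’s interface: a key-value store retrieves opaque blobs by exact key, while a document store can also query through an index built over metadata extracted from the documents.

```python
import json

# Key-value model: the exact key is the only access path; the value is
# an opaque blob the store itself cannot query into.
kv = {}
kv["user:1"] = '{"name": "Ada", "city": "London"}'
blob = json.loads(kv["user:1"])  # the application must know the key

# Document model: the store parses documents and maintains a secondary
# index over exposed metadata (here, "city"), enabling queries by value.
docs = {}
city_index = {}

def insert(key, doc):
    docs[key] = doc
    city_index.setdefault(doc["city"], []).append(key)

insert("user:1", {"name": "Ada", "city": "London"})
insert("user:2", {"name": "Alan", "city": "London"})

# Query via the metadata index rather than by key.
londoners = [docs[k] for k in city_index["London"]]
```

The underlying storage is similar in both cases; what differs is how much of the lookup machinery lives inside the store versus in the application.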
Beyond the programming model, these technologies vary considerably in architecture: how they actually store data, retrieve it from disk, and facilitate backup, recovery, reliability, replication, etc.
Q3. How do new data stores compare with relational databases?
Robert Greene: As described above, they have a very different programming model than the RDB. In some ways they are all subsets of the RDB, but their specialization allows them to do what they do (at times) better than the RDB.
Most of them utilize an underlying architecture which I call “the oldest scalability architecture of the relational database”: the key-value/blob architecture. The RDB has long suffered performance problems at scale, and historically many architects have gotten around those issues by removing the JOIN operation from the implementation. They manage identity from the application space and store information in either single tables and/or blobs of isolatable information. This comparison is obvious for key-value stores. However, you can also see this approach in the document store, which stores its information as key-JSON objects. The keys to those documents (JSON blob objects) must be managed by user-implemented layers in the application space. Try to implement a basic collection reference and you will find yourself writing lots of custom code. Of course, JSON objects also have metadata which can be extracted and indexed, allowing document stores to provide better ways of finding data, but the underlying architecture is key-value.
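The kind of custom code the answer alludes to can be sketched briefly. The snippet below is a hypothetical illustration (the store class, key scheme, and helper names are invented, not any product’s API) of application-managed identity over a plain key-value store: to model “an author has many posts,” the application must mint its own keys and maintain the collection reference by hand inside the parent blob.

```python
import json
import uuid

class KVStore:
    """Stand-in for a key-value/document store: keys map to JSON blobs."""
    def __init__(self):
        self._data = {}

    def put(self, key, obj):
        self._data[key] = json.dumps(obj)

    def get(self, key):
        return json.loads(self._data[key])

store = KVStore()

def new_key(prefix):
    # Identity is managed by the application, not the store.
    return f"{prefix}:{uuid.uuid4().hex}"

author_key = new_key("author")
store.put(author_key, {"name": "Ada", "post_keys": []})

def add_post(author_key, title):
    """Hand-written 'collection reference': no JOIN, no referential
    integrity -- just a read-modify-write of the parent blob."""
    post_key = new_key("post")
    store.put(post_key, {"title": title, "author_key": author_key})
    author = store.get(author_key)
    author["post_keys"].append(post_key)
    store.put(author_key, author)
    return post_key

add_post(author_key, "Why blobs scale")

# Traversal also runs in application code: fetch the parent, then
# dereference each child key one by one.
titles = [store.get(k)["title"]
          for k in store.get(author_key)["post_keys"]]
```

Note that nothing here is transactional: if the process dies between the two `put` calls in `add_post`, the post blob exists but the parent never references it, which is exactly the class of problem a database with real ACID semantics would absorb for you.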
Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?
Robert Greene: They compare similarly in that they achieve better scalability than the RDB by managing identity in the application layer, much as the object database does. However, the approach is significantly less opaque, because in those NoSQL stores the management of identity is not integrated into the language constructs and abstracted away behind the user API as it is with the object database. Plus, there is a big difference in the delivery of the ACID properties of a database. The NoSQL databases are almost exclusively non-transactional unless you use them in only the narrowest of use cases.
Q5. With the emergence of cloud computing, new data management systems have surfaced. What is your opinion of the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?
Robert Greene: Unquestionably, the world is moving to a platform-as-a-service (PaaS) computing model. Databases will play a role in this transition in all forms. The challenges in delivering data management technology that is effective in these “cloud” computing architectures turn out to be very similar to those of delivering technology for the new n-core chip architectures. They are challenges of distributed data management, whether across machines or across cores: splitting the problem into pieces and managing the distributed execution in the face of concurrent updates. The often overlooked aspect in these discussions is the operational element: how to effectively develop, debug, manage, and administer the production deployments of this technology within distributed computing environments.
Q6. What are cloud stores omitting that enables them to scale so well?
Robert Greene: I think architecture plays the biggest role in their ability to scale. It is the application-managed identity approach to data retrieval, data distribution, and semi-static data relations. These are things they actually have in common with object databases, which, incidentally, you also find in some of the world’s largest, most demanding application domains. I think that is the biggest scalability story for those technologies. If you look past architecture, then it comes down to some of the sacrifices made in the area of fully supporting the ACID requirements of a database. Taking the “eventually consistent” approach in some cases makes a tremendous amount of sense, if you can afford probabilistic results instead of determinism.
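The “eventually consistent” trade-off mentioned above can be illustrated with a deliberately simplified sketch (the class and method names are invented for illustration): a primary node acknowledges writes immediately and ships them to replicas later, so a read served by a replica may return a stale result until the system converges.

```python
class Replica:
    """One node's local copy of the data."""
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    def __init__(self, n_replicas=2):
        self.primary = Replica()
        self.replicas = [Replica() for _ in range(n_replicas)]
        self._pending = []  # replication log not yet shipped

    def write(self, key, value):
        # Acknowledge after updating only the primary: fast, but the
        # replicas temporarily disagree with it.
        self.primary.data[key] = value
        self._pending.append((key, value))

    def read(self, key, from_replica=0):
        # Reads may be served by any replica for scalability.
        return self.replicas[from_replica].data.get(key)

    def sync(self):
        """Ship the log; after this the replicas have converged."""
        for key, value in self._pending:
            for r in self.replicas:
                r.data[key] = value
        self._pending.clear()

store = EventuallyConsistentStore()
store.write("x", 1)
stale = store.read("x")   # replica has not yet seen the write
store.sync()
fresh = store.read("x")   # after convergence the write is visible
```

This is the “probabilistic results instead of determinism” point in miniature: between the write and the sync, what a reader observes depends on which node answers.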
Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?
Robert Greene: I am sure you will see this, as virtually all database technologies that remain relevant will live in the cloud.
Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?
Robert Greene: I agree with the theory. Reality, though, does introduce some practical limitations during implementation. Technology is doing a remarkable job of removing those bottlenecks. For example, you can now get non-volatile memory appliances which are 5TB in size, effectively eliminating disk I/O as what was historically the #1 bottleneck in database systems. Still, architecture will continue to play the strongest role in performance and scalability. Relational databases and other implementations which must calculate relationships at runtime, based on data values, over growing volumes of data will remain performance challenged.
Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?
Robert Greene: Yes, of course; they already do, in products from vendors like Netezza, Greenplum, and AsterData. The question is whether they will perform well in the face of those scalability requirements. This distinction between performance and scalability is often overlooked.
However, I think this notion that you cannot scale well if your transactions span many nodes is nonsense. It is a question of implementation. Just because a database has 100 nodes does not mean that all transactions will operate on data within all 100 nodes. Transactions naturally partition and span some percentage of nodes, especially with regard to relevant data. Access in a multi-node system can be parallelized in all aspects of a transaction. Further, at a commit boundary, in the overwhelming majority of cases the number of nodes where data is inserted, changed, deleted, and/or logically dependent is some small fraction of all the physical nodes in the system. Therefore, advanced two-phase commit protocols can do interesting things like rolling back non-active nodes, parallelizing protocol handshaking, and using asynchronous I/O and handshaking to finalize the commit. Is it complicated? Yes. But is it too complicated to work? Not by a long shot.
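The core of the argument above, that a commit need only coordinate the nodes a transaction actually touched, can be sketched in a few lines. This is a minimal toy coordinator (class and function names are invented, and real protocols add logging, timeouts, and recovery): in a 100-node cluster, only the participants that staged writes take part in the two-phase commit.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.committed = {}   # durable state
        self.staged = {}      # prepared-but-uncommitted writes per txid

    def prepare(self, txid, writes):
        # Phase 1: vote yes by staging the writes.
        self.staged[txid] = writes
        return True

    def commit(self, txid):
        # Phase 2: make the staged writes durable.
        self.committed.update(self.staged.pop(txid))

    def abort(self, txid):
        self.staged.pop(txid, None)

def two_phase_commit(txid, writes_by_node):
    """Coordinator: only the touched nodes participate, regardless of
    how many nodes exist in the cluster overall."""
    participants = list(writes_by_node)
    if all(node.prepare(txid, w) for node, w in writes_by_node.items()):
        for node in participants:
            node.commit(txid)
        return True
    for node in participants:
        node.abort(txid)
    return False

cluster = [Node(f"n{i}") for i in range(100)]       # 100-node cluster
touched = {cluster[3]: {"a": 1}, cluster[42]: {"b": 2}}
ok = two_phase_commit("tx1", touched)               # only 2 of 100 nodes involved
```

The 98 untouched nodes never see a protocol message, which is why the commit cost tracks the transaction’s footprint rather than the cluster size.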
Q10. There are also XML DBs, which go beyond relational. Hybridization with relational has turned out to be very useful. For example, DB2 has a huge investment in XML, which is extensively published and has also succeeded commercially. MonetDB did substantial work in that area early on as well. How do they relate to “new data stores”?
Robert Greene: I really look at XML databases as large index engines. I have seen implementations of these which look very much like document stores, the main difference being that they generally index everything, whereas the document stores appear to be much more selective about the metadata exposed for indexing and query. Still, I think the challenge for XML DBs is the mismatch in their use within the programming paradigm. Developers think of XML as a data interchange and transformation technology. It is not perceived as transactional data management and storage, and developers don’t program in XML, so it feels clunky for them to figure out how to wrap it into their logical transactions. I suspect it feels a little less clunky if what you are dealing with are documents. Perhaps they should be considered the original document stores.
Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?
Robert Greene: I choose the right tool for the job. This is again one of those questions which deserves several books. There is no one best solution for all applications, and the deciding factors can be complicated, but here are the major influencing factors I think about. I look at it from the perspective of whether the application is data driven or model driven.
If it is model driven, I lean towards ODB or RDB.
If it is data driven, I lean towards NoSQL or RDB.
If the project is model driven and has a complex known model, the ODB is a good choice because it handles the complexity well. If the project is model driven and has a simple known model, the RDB is a good choice: you should not be performance penalized if the model is simple, and there are lots of available choices and people who know how to use the technology.
If the project is data driven and the data is small, the RDB is good for the prior reasons. If the project is data driven and the data is huge, then NoSQL is a good choice because it takes a better architectural approach to huge data, allowing the use of things like map-reduce for parallel processing and/or application-managed identity for better data distribution.
Of course, even within these categorizations you have ranges of value in different products. For example, MySQL and Oracle are both RDBs, so which one do you choose? Similarly, db4o and Versant are both ODBs, so which one should you choose? So, I also look at the selection process from the perspective of two additional requirements: data volume and concurrency. Within a given category, these will help narrow in on a good choice. For example, if you look at the company Oracle, you would consider MySQL to be less data scalable and less concurrent than the Oracle database, yet they are both RDBs. Similarly, if you look at the company Versant, you would consider db4o to be less data scalable and less concurrent than the Versant database, yet they are both ODBs.
Finally, I say you should test and evaluate any selection within the context of your major requirements. Get the core use cases mocked up and put your top choices to the test; it is the only way to be sure.