Don White on “New and old Data stores”.
“New and old Data stores” : This time I asked Don White a number of questions.
Don White is a senior development manager at Progress Software Inc., responsible for all feature development and engineering support for ObjectStore.
RVZ
Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?
Don White: Speaking from an OODB perspective, OODBs grew out of the recognition that the relational model is not the best fit for all application needs. OODBs continue to deliver their traditional value, which is transparency in handling and optimizing the movement of rich data models between storage and virtual memory. The emergence of common ORM solutions must have provided benefit for some RDB-based shops, where I presume they needed to use object-oriented programming for data that already fits well into an RDB. There is something important to understand: if you fail to leverage what your storage system is good at, then you are using the wrong tool or the wrong approach. The relational model wants to model with relations and wants to perform joins, so an application’s data access pattern should expect to query the database the way the model wants to work. An ORM mapping for an RDB that tries to query and build one object at a time will have really poor performance. If you try to query in bulk to avoid the costs of model transitions, then you likely have to live with compromises in the form of less than optimal locking and/or fetch patterns. A project with model complexity that pursues OOA/OOD for a solution will find implementation easier with an OOP and will find storage of that data easier and more effective with an OODB. As for newer data stores that are neither OODB nor RDB, they appear to be trying to fill a need for a storage solution that is less than a general database solution. Not trying to be everything to everybody allows for different implementation tradeoffs to be made.
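The point about building one object at a time versus querying the way the relational model wants to work can be illustrated with a small sketch. It is not taken from any particular ORM: the `customer` and `orders` tables, the sample rows and both loader functions are hypothetical, and an in-memory SQLite database stands in for the RDB. Both loaders build the same result, but the first issues one query per customer while the second issues a single join.

```python
import sqlite3

def build_db():
    """Create a hypothetical in-memory schema with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    """)
    conn.executemany("INSERT INTO customer VALUES (?, ?)",
                     [(1, "Ann"), (2, "Bob"), (3, "Cy")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5)])
    return conn

def load_one_at_a_time(conn):
    """Object-at-a-time access: one query for customers, then one per customer."""
    queries = 0
    result = {}
    customers = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
    queries += 1
    for cid, name in customers:
        rows = conn.execute(
            "SELECT total FROM orders WHERE customer_id = ? ORDER BY id", (cid,)
        ).fetchall()
        queries += 1
        result[name] = [total for (total,) in rows]
    return result, queries

def load_with_join(conn):
    """Set-oriented access: a single join returns everything in one round trip."""
    result = {}
    rows = conn.execute("""
        SELECT c.name, o.total
        FROM customer c LEFT JOIN orders o ON o.customer_id = c.id
        ORDER BY c.id, o.id
    """).fetchall()
    for name, total in rows:
        result.setdefault(name, [])
        if total is not None:  # customers without orders yield a NULL total
            result[name].append(total)
    return result, 1
```

With three customers the first loader issues four queries and the second issues one. The gap grows with the number of objects, which is the cost the answer above attributes to model transitions.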
Q2. There has recently been a proliferation of “new data stores”, such as “document stores” and “NoSQL databases”. What are the differences between them?
Don White: This probably needs to be answered by people involved with the products trying to distinguish themselves. I lump document stores in the NoSQL category. However, there do seem to be some common themes or subclass types within the new NoSQL stores, such as document stores and key-value stores. Each subclass seems to have a different way of declaring groups of related information and differs in how it exposes ways to find and access stored information. In case it is not obvious, you can argue an OODB has some characteristics of the NoSQL stores, although any discussion will have to clearly define the scope of what is included in the NoSQL discussion.
Q3. How do new data stores compare with relational databases?
Don White: In general there seems to be recognition that relational technology has difficulty making tradeoffs when managing fluid schema requirements and when optimizing access to related information.
Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?
Don White: These new data stores are not object oriented. Some might provide language bindings to object-oriented languages, but they do not preserve OOA/OOD as implemented in an OOP all the way through to the storage model. The new data systems are very data centric and are not trying to facilitate the melding of data and behavior. These new storage systems present specific model abstractions and provide their own specific storage structures. In some cases they offer schema flexibility, but it is basically used just to manage data and not for building sophisticated data structures with type-specific behavior. One way of keeping modeling abilities in perspective: you can use an OODB as a basis to build any other database system, NoSQL or even relational. The simple reason is that an OODB can store any structure a developer needs and/or can even imagine. A document store, a name/value pair store or an RDB store each presents a particular abstraction for a data store, but under the hood there is an implementation to serve that abstraction. No matter what that implementation looks like for the store, it could be put into an OODB. Of course the key is determining whether the data store abstraction presented works for your actual model and application space.
The problem with an OODB is that not everyone is looking to build a database of their own design; many prefer someone else to supply the storage abstraction and worry about the details that make the abstraction work. That is not to say the only way to interface with an OODB is a 3GL program, but the most effective way to use an OODB is when the storage model matches the needs of the in-memory model. That is a very leading statement because it really forces a particular question: why would you want to store data differently than how you intend to use it? I guess the simple answer is when you don’t know how you are going to use your data. But if you don’t know how you are going to use it, then why is any data store abstraction better than another? If you want to use an OO model and implementation, then you will find a non-OODB is a poor way of handling that situation.
To generalize, it appears the newer stores make different compromises in the management of the data to suit their intended audience. In other words, they are not developing a general purpose database solution, so they are willing to make tradeoffs that traditional database products would/should/could not make. The new data stores do not provide traditional database query language support or even strict ACID transaction support. They do provide abstractions for data storage and processing capabilities that leverage the idiosyncrasies of their chosen implementation data structures and/or relaxations in the strictness of the transaction model to try to make gains in processing.
Q5. With the emergence of cloud computing, new data management systems have surfaced. What is your opinion of the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?
Don White: One challenge is just trying to understand what is really meant by cloud computing. In some form it is about how to leverage computing resources or facilities available through a network. Those resources could be software or hardware; leveraging them requires nothing to be installed on the accessing device, only a network connection. The network is the virtual mainframe and any device used to access the network is the virtual terminal endpoint. You have the same concerns in trying to leverage the computing power of a virtual mainframe as with a real local machine: how to optimize computing resources, how to share them among many users and how to keep them running. You have an interesting upside with all the possible scalability, but with the power and flexibility come new levels of management complexity. You have to consider how algorithms for processing and handling data can be distributed and coordinated. When you involve more than one machine to do anything, you have to consider what happens when any node or connecting piece fails along the way.
Q6. What are cloud stores omitting that enables them to scale so well?
Don White: Strict serialized transaction processing for one. I think you will find the more complex a data model needs to be, the more need there is for strict serialized transactions. You can’t expect to navigate relationships cleanly if you don’t promise to keep all data strictly serialized.
The data and/or the storage abstractions used in the new models seem devoid of any sophisticated data processing and relationship modeling. What is being managed and distributed is simple data, where the algorithms and the data needing to be managed can be easily partitioned/dispersed and the required processing is easily replicated with basic coordination requirements. It is easy to imagine how to process queries that can be replicated in bulk across simple data stored in structures that are amenable to being split apart.
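A minimal sketch of that idea, under stated assumptions: the `PartitionedStore` class, its method names and the use of a CRC32 hash to pick a partition are all hypothetical and not drawn from any of the products discussed. Each partition is an ordinary dictionary standing in for a node. Simple key lookups touch one partition, and a bulk query runs the same predicate on every partition and sums the partial counts.

```python
import zlib

class PartitionedStore:
    """Toy key/value store split across several in-process partitions."""

    def __init__(self, num_partitions):
        # Each dict stands in for a separate machine (an assumption of this sketch).
        self.partitions = [dict() for _ in range(num_partitions)]

    def _index(self, key):
        # A stable hash of the key decides which partition owns it.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def put(self, key, value):
        self.partitions[self._index(key)][key] = value

    def get(self, key):
        return self.partitions[self._index(key)].get(key)

    def count_where(self, predicate):
        # The same work is replicated on every partition;
        # the only coordination needed is adding up the partial results.
        partials = [sum(1 for v in part.values() if predicate(v))
                    for part in self.partitions]
        return sum(partials)
```

Nothing here relates one record to another, which is why the split is easy. Once records refer to each other across partitions, the coordination question raised in the rest of this answer comes back.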
Why are serialized transactions important? They make sure there is one view of history, which is necessary to maintain integrity among related data. Some systems try to pass off something less than serializable isolation as adequate for transaction processing; however, allowing updates to occur without the prerequisite read locks risks trying to use data that is not correct. If you are using pointers rather than indirect references as part of your processing, the things you point to have to exist for it to run. Once you materialize a key/value based relationship as a pointer, there has to be a commitment not only to the existence of the relationship (the thing pointed to) but also to the state of the data involved in the relationship that allows the existence to be valid.
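The risk of updating without protecting what was read can be shown with a toy simulation. Everything in it is hypothetical: the `Store` and `Txn` classes, the on-call scenario, and the use of commit-time validation of the read set as a stand-in for read locks (a real system might use locking instead). Two transactions each read both records from their own snapshot, each decides the invariant "at least one person stays on call" still holds, and each writes a different record.

```python
class Store:
    """Tiny versioned key/value store, for illustration only."""
    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}

    def begin(self):
        return Txn(self)

class Txn:
    def __init__(self, store):
        self.store = store
        self.snapshot = dict(store.data)            # values as of begin
        self.start_versions = dict(store.version)   # versions as of begin
        self.reads = set()
        self.writes = {}

    def read(self, key):
        self.reads.add(key)
        return self.writes.get(key, self.snapshot[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self, validate_reads):
        # Write-write conflicts are always checked; reads only when asked.
        keys = set(self.writes) | (self.reads if validate_reads else set())
        if any(self.store.version[k] != self.start_versions[k] for k in keys):
            return False  # something this transaction depended on has changed
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.version[key] += 1
        return True

def go_off_call(txn, me, other):
    # Only leave if, as far as this transaction can see, both are on call.
    if txn.read(me) and txn.read(other):
        txn.write(me, False)

def run(validate_reads):
    store = Store({"alice": True, "bob": True})
    t1, t2 = store.begin(), store.begin()
    go_off_call(t1, "alice", "bob")
    go_off_call(t2, "bob", "alice")
    return t1.commit(validate_reads), t2.commit(validate_reads), store.data
```

Without read validation both transactions commit and nobody is left on call, even though each one saw a valid state. With read validation the second commit is rejected because a record it read has changed, which preserves a single view of history.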
Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?
Don White: Can’t answer that. It will be a shame if these systems end up having to build many things that are already available in other database systems, which could have given them those things for free.
Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?
Don White: I don’t have any argument with the overheads identified, however I would say I don’t want to use SQL, a non-procedural way of getting to data, when I can solve my problem faster by using navigation of data structures specifically geared to solve my targeted problem. I have seen customers put SQL interfaces on top of specialized models stored in an OODB. They use SQL through ODBC as a standard endpoint to get at the data, but the implementation model under the hood is a model the customer implemented that performs queries faster than what a traditional relational implementation could do.
Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?
Don White: Hmm, what is the barrier? What makes SQL hard to span nodes? I suppose one inherent problem is that an RDB is built around the relational model, which is based on joining relations. If processing is going to be spread across many nodes, then where does joining take place? So either there are possibly single points of failure, or there is some layer of complicated partitioning that has to be managed to figure out how to join data together.
Q10. There are also XML DBs, which go beyond relational. Hybridization with relational turned out to be very useful. For example, DB2 has a huge investment in XML, and it is extensively published, and it has also succeeded commercially. MonetDB did substantial work in that area early on as well. How do they relate to “new data stores”?
Don White: I would think one thing that has to be addressed is how you store and process non-text information that is ultimately represented as text in XML. String-based models are a poor means of managing relationships and numeric information. There are also costs in trying to make sure the information is valid for the real data type you want it to be.
A product would have to decide how to handle types that are not meant to be textual. For example, you can’t expect to accurately compare/restrict floating point numbers that are represented as text, and storing numbers as text is certainly an inefficient storage model. Most likely you would want to leverage parsed XML for your processing, so if the data is not stored in a parsed format, you will have to pay for parsing when moving the data to and from the storage model. XML can be used to store trees of information, but not all data is easily represented with XML.
Common data modeling needs like graphs and non-containment relationships among data items would be a challenge. Evaluating any type of storage system should be based on the type of data model needed and how it will be used.
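The remark above about comparing floating point numbers represented as text is easy to demonstrate. In this small sketch the function names and sample values are made up for illustration; the first filter compares the stored strings directly, the second pays the parsing cost and compares real numbers.

```python
def filter_less_than_text(values, limit):
    """Restrict by comparing the textual representations character by character."""
    return [v for v in values if v < limit]

def filter_less_than_numeric(values, limit):
    """Restrict by parsing each value to a float first (the cost of a text model)."""
    return [v for v in values if float(v) < float(limit)]

# Hypothetical numeric values as they might appear in XML text nodes.
prices = ["9.5", "10.25", "100.0", "2e1"]

as_text = filter_less_than_text(prices, "20")        # ['10.25', '100.0']
as_numbers = filter_less_than_numeric(prices, "20")  # ['9.5', '10.25']
```

The text comparison keeps "100.0" and drops "9.5" because only the leading characters matter, and it has no notion that "2e1" and "20" denote the same number.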
Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?
Don White: Make sure you choose a tool for the job at hand. I think the one thing we know is that the relational model has been used to solve lots of problems, but it is not the end-all and be-all of data storage solutions. Other data storage models can offer advantages for more than niche situations.