“Marrying objects with graphs”: Interview with Darren Wood.
Is it possible to have both objects and graphs?
This is what Objectivity has done. They recently launched InfiniteGraph, a Distributed Graph Database for the Enterprise. Is InfiniteGraph a NoSQL database? How does it relate to their Objectivity/DB object database?
To learn more, I interviewed Darren Wood, Chief Architect, InfiniteGraph, Objectivity, Inc.
Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?
Wood: I think at some level, there have always been requirements for data stores that don’t fit the traditional relational data model.
Objectivity (and many others) have built successful long-term businesses meeting scale and complexity requirements that generally went beyond what was offered by the RDBMS market. In many cases, the other significant player in this market was not a generally available alternative technology at all, but “home grown” systems designed to meet a specific data challenge. This included everything from application-managed, file-based systems to better-known projects like Dynamo (Amazon) and BigTable (Google).
This trend is not in any way a failing of RDBMS products; it is simply a recognition that not all data fits squarely into rows and columns and not all data access patterns can be expressed precisely or efficiently in SQL.
Over time, this market has simply grown to a point where it makes sense to start grouping data storage, consistency and access model requirements together and create solutions that are designed specifically to meet them.
Q2. There has recently been a proliferation of new data stores, such as document stores and NoSQL databases. What are the differences between them?
Wood: Although NoSQL has been broadly categorized as a collection of Graph, Document, Key-Value, and BigTable style data stores, it is really a collection of alternatives which are best defined by the use case for which they are most suited.
Dynamo-derived Key-Value stores, for example, mostly trade off data consistency for extreme horizontal scalability, with a high tolerance of host failure and network partitioning. This is obviously suited to systems serving large numbers of concurrent clients (like a web property) that can tolerate stale or inconsistent data to some extent.
Graph databases are another great example of this. They treat relationships (edges) as first class citizens and organize data physically to accelerate traversal performance across the graph. There is also typically a very “graph oriented” API to simplify applications that view their data as a graph. This is a perfect example of how providing a database with a specific data model and access pattern can dramatically simplify application development and significantly improve performance.
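To make the idea of edges as first class citizens concrete, here is a minimal in-memory sketch in Python. It is not InfiniteGraph's API; the `Edge` and `Graph` names, the `traverse` method, and the sample data are all hypothetical, chosen only to show how storing outgoing edges next to each vertex turns a traversal into a series of local lookups rather than joins.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A relationship stored as its own object, not as a foreign key."""
    source: str
    target: str
    label: str


class Graph:
    """Hypothetical toy graph: outgoing edges are kept with their source vertex."""

    def __init__(self):
        self._out = defaultdict(list)

    def add_edge(self, source, target, label):
        edge = Edge(source, target, label)
        self._out[source].append(edge)
        return edge

    def traverse(self, start, max_depth):
        """Breadth-first walk; returns {vertex: depth} for vertices within max_depth hops."""
        depths = {start: 0}
        queue = deque([start])
        while queue:
            vertex = queue.popleft()
            if depths[vertex] == max_depth:
                continue  # do not expand beyond the requested depth
            for edge in self._out.get(vertex, []):
                if edge.target not in depths:
                    depths[edge.target] = depths[vertex] + 1
                    queue.append(edge.target)
        return depths


g = Graph()
g.add_edge("alice", "bob", "knows")
g.add_edge("bob", "carol", "knows")
g.add_edge("carol", "dave", "knows")
within_two_hops = g.traverse("alice", 2)
```

Each step of the walk only touches the edges stored with the current vertex, which is the access pattern a graph database organizes its physical storage around.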
Q3. Objectivity has recently launched InfiniteGraph, a Distributed Graph Database for the Enterprise. Is InfiniteGraph DB a NoSQL graph database? How does it relate to your Objectivity/DB object database?
Wood: InfiniteGraph had been an idea within our company for some time. Objectivity has a long and successful track record of providing customers with a scalable, distributed data platform, and in many cases the underlying data model was a graph. In various customer accounts our Systems Engineers would assist in building custom applications to perform high-speed data ingest and advanced relationship analytics.
For the most part, there was a common “graph analytics” theme emerging from these engagements and the differences were really only in the specifics of the customer’s domain or industry.
Eventually, an internal project began with a focus on management and analysis of graph data. It took the high performance distributed data engine from Objectivity/DB and married it to a graph management and analysis platform which makes development of complex graph analytic applications significantly easier. We were very happy with what we had achieved in the first iteration and eventually offered it as a public beta under the InfiniteGraph name. A year later we are now into our third public release and adding even more features focused around scaling the graph in a distributed environment.
Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed object cache over multiple machines. How do these new data stores compare with object-oriented databases?
Wood: I think that most of the technologies you mention generally have unique properties that appeal to some segment of the database market. In some cases it’s the data model (like the flexibility of document model databases) and in others it is the ease of use or deployment and simple interface that will attract users (which is often said about CouchDB). Systems architects that evaluate various solutions will invariably balance ease of use and flexibility with other requirements like performance, scalability and supported consistency models.
Object Databases will be treated in much the same way. They handle complex object models really well and minimize the impedance mismatch between your application and the database, so in cases where this is an important requirement, they will always be considered a good option.
Of course not all OODBMS implementations are the same, so even within this genre there are significant differences to help make a clear choice for a particular use case.
Q5. With the emergence of cloud computing, new data management systems have surfaced.
What is in your opinion the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?
Wood: This is an area of great interest to us. Coming from a distributed data background, we are well positioned to take advantage of the trend for sure. I think there are a couple of major things that distinguish traditional “distributed systems” from the typical cloud environment.
Firstly, products that live in the cloud need to be specifically designed for tolerance to host failures and network partitions or inconsistencies. Cloud platforms are invariably built on low cost commodity hardware and provide a virtualized environment that offers a lower class of reliability than that seen in dedicated “enterprise class” hardware. This essentially requires that availability be built into the software to some extent, which translates to replication and redundancy in the data world.
Another important requirement in the cloud is ease of deployment. The elasticity of a cloud environment generally leads to “on the fly” provisioning of resources, so spinning up nodes needs to be simple from a deployment and configuration perspective. When you look at the technologies (like Dynamo-based KV stores) that have their roots in early cloud-based systems, there is a real focus in these areas.
Q6. Will cloud store projects end up with support for declarative queries and declarative secondary keys?
Wood: If you look at most of the KV type stores out there, a lot of them are now turning some focus to what is being termed “search”. Cross population indexing and complex queries were not the primary design goal of these systems; however, many users find that “some” capability in this area is necessary, especially if the store is being used exclusively as the persistence engine of the system.
An alternative to this is actually using multiple data stores side by side (so called polyglot persistence) and directing each query at the system that has been best designed to handle it. We are seeing a lot of this in the Graph Database market.
Q7. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?
Wood: I agree totally that NoSQL has nothing to do with SQL; it’s an unfortunate term which is often misunderstood. It is simply about choosing the data store with just the right mix of trade-offs and characteristics that are the closest match to your application’s requirements. The problem was that, for the most part, people are familiar with RDBMS/SQL, so NoSQL became a “place” for other lesser known data stores and models to call home (hence the unfortunate name).
A good example is ACID, mentioned in the above abstract. ACID in itself is one of these choices: some use cases require it and others don’t. Arguing about the efficiency of its implementation is something of a moot point if it isn’t a requirement at all! In other cases, like graph databases, the physical data model can dramatically affect performance for certain types of navigational queries which the RDBMS data model and SQL query language are simply not designed for.
I think this post was a reaction to the idea that NoSQL somehow threatened RDBMS and SQL, when that isn’t the case at all. There is still a large proportion of data problems out there that are very well suited to the RDBMS model and the plethora of implementations.
Q8. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes. Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?
Wood: Certainly, partitioning and replication of an RDBMS can be used to great effect where the application suits the relational model well; however, this doesn’t change its suitability for a specific task. Even a set of indexed partitions doesn’t make sense if a KV store is all that is required. Using a distributed hashing algorithm doesn’t require execution of lookups on every node, so there is no reason to pay an overhead for a generalized query when it is not required. Of course this doesn’t devalue the existence of a partitioned RDBMS at all, since there are many applications where this would be a perfect solution.
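The point about distributed hashing can be illustrated with a small consistent-hashing sketch in Python. This is an assumption-laden toy, not the algorithm of any product named in the interview: the `HashRing` class, the `node_for` method, the node names, and the choice of SHA-256 with 64 virtual points per node are all hypothetical. What it shows is that the client computes the owner of a key locally, so a lookup contacts one node instead of fanning out to all of them.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual points per node."""

    def __init__(self, nodes, vnodes=64):
        # Place several points per node on the ring to even out the key distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        """Return the single node responsible for key (first point clockwise)."""
        index = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


nodes = ["node-a", "node-b", "node-c"]
ring = HashRing(nodes)
owner = ring.node_for("user:42")  # computed locally, no node is queried
```

Because the mapping is a pure function of the key and the node list, every client that knows the membership arrives at the same owner without a coordinating query.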