Graphs vs. SQL. Interview with Michael Blaha
“For traditional business applications, the schema is known in advance, so there is no need to use a graph database which has weaker enforcement of integrity. If instead, you’re dealing with at best a generic model to which it conforms, then a schema-oriented approach does not provide much. Instead a graph-oriented approach is more natural and easier to develop against.”— Michael Blaha
Graphs, SQL and Databases. On this topic I have interviewed our expert Michael Blaha.
Q1. A lot of today’s data can be modeled as a heterogeneous set of “vertices” connected by a heterogeneous set of “edges”, people, events, items, etc. related by knowing, attending, purchasing, etc. This world view is not new as the object-oriented community has a similar perspective on data. What is in your opinion the main difference with respect to a graph-centric data world?
Michael Blaha: This world view is also not new because this is the approach Charlie Bachman took with network databases many years ago. I can think of at least two major distinguishing aspects of graph-centric databases relative to relational databases.
(1) Graph-centric databases are occurrence-oriented while relational databases are schema-oriented. If you know the schema in advance and must ensure that data conforms to it, then a schema-oriented approach is best. Examples include traditional business applications, such as flight reservations, payroll, and order processing.
(2) Graph-centric databases emphasize navigation. You start with a root object and pull together a meaningful group of related objects. Relational databases permit navigation via joins, but such navigation is more cumbersome and less natural. Many relational database developers are not adept at performing such navigation.
Q2. The development of scalable graph applications such for example for Facebook, and Twitter require different kind of databases than SQL. Most of these large Web companies have built their own internal graph databases. But what about other enterprise applications?
Michael Blaha: The key is the distinction between being occurrence-oriented and schema-oriented. For traditional business applications, the schema is known in advance, so there is no need to use a graph database which has weaker enforcement of integrity. If instead, you’re dealing with at best a generic model to which it conforms, then a schema-oriented approach does not provide much. Instead a graph-oriented approach is more natural and easier to develop against.
Q3: Marko Rodriguez and Peter Neubauer in an interview say that “the benefit of the graph comes from being able to rapidly traverse structures to an arbitrary depth (e.g., tree structures, cyclic structures) and with an arbitrary path description (e.g. friends that work together, roads below a certain congestion threshold). We call this data processing pattern, the graph traversal pattern. This mind set is much different from the set theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk’s traversal”. What is your take on this?
Michael Blaha: That’s a great point and one that I should have mentioned in my answer to Q1. Relational databases have poor handling of recursion. I will note that the vendor products have extensions for this but they aren’t natural and are an awkward graft onto SQL. Graph databases, in contrast, are great with handling recursion. This is a big advantage of graph databases for applications where recursion arises.
Q4. Is there any synergy between graphs and conventional relational databases?
Michael Blaha: Graphs are also important for relational databases, and more so than some persons may realize…
— Graphs are clearly relevant for data modeling. An Entity-Relationship data model portrays the database structure as a graph.
— Graphs are also important for expressing database constraints. The OMG’s Object Constraint Language (OCL) expresses database constraints using graph traversal. The OCL is a textual language so it can be tedious to use, but it is powerful. The Common Warehouse Metamodel (CWM) specifies many fine constraints with the OCL and is a superb example of proper OCL usage.
— Even though the standard does not emphasize it, the OCL is also an excellent language for database traversal as a starting point for database queries. Bill Premerlani and I explained this in a past book (Object-Oriented Modeling and Design for Database Applications).
— Graphs are also helpful for characterizing the complexity of a relational database design. Robert Hilliard presents an excellent technique for doing this in his book (Information-Driven Business).
Q5. You say that graphs are important for data modeling, but at the end you do not store graphs in a relational database but tables, and you need joins to link them together… Graph databases in contrast cache what is on disk into memory and vendors claim that this makes for a highly reusable in-memory cache. What is your take on this?
Michael Blaha: Relational databases play many optimization games behind the covers. So in general, possible performance differences are often not obvious. I would say that the difference in expressiveness is what determines suitable applications for graph and relational databases and performance is a secondary issue, except for very specialized applications.
Q6: What are advantages of SQL relative to graph databases?
Michael Blaha: Here are some advantages of SQL:
— SQL has a widely-accepted standard.
— SQL is a set-oriented language. This is good for mass-processing of set-oriented data.
— SQL databases have powerful query optimizers for handling set-oriented queries, such as for data warehouses.
— The transaction processing behavior of relational databases (the ACID properties) are robust, powerful, and sound.
— SQL has extensive support for controlling data access.
Q7: What are disadvantages of SQL relative to graph databases?
Michael Blaha: Here are some disadvantages of SQL:
— SQL is awkward for processing the explosion of data that can result from starting with an object and traversing a graph.
SQL, at best, awkwardly handles recursion.
— SQL has lots of overhead for multi-user locking that can make it difficult to access individual objects and their data.
— Advanced and specialty applications often require less rigorous transaction processing with reduced overhead and higher throughput.
Q8: For which applications is SQL best? For which applications are graph databases best?
— SQL is schema based. Define the structure in advance and then store the data. This is a good approach for conventional data processing such as many business and financial systems.
— Graph databases are occurrence based. Store data and relationships as they are encountered. Do not presume that there is an encompassing structure. This is a good approach for some scientific and engineering applications as well as data that is acquired from Web browsers and search engines.
Q9. What about RDF quad/triple stores?
Michael Blaha: I have not paid much attention to this. RDF is an entity-attribute-value approach. From what I can tell, it seems occurrence based and not schema based and my earlier comments apply.
Michael Blaha is a partner at Modelsoft Consulting Corporation.
Blaha received his doctorate from Washington University in St. Louis, Missouri in Chemical Engineering with his dissertation being about databases. Both his academic background and working experience involve engineering and computer science. He is an alumnus of the GE R&D Center in Schenectady, New York, working there for eight years. Since 1993, Blaha has been a consultant and trainer in the areas of modeling, software architecture, database design, and reverse engineering. Blaha has authored six U.S. patents, four books, and many papers. Blaha is an editor for IEEE Computer as well as a member of the IEEE-CS publications board. He has also been active in the IEEE Working Conferences on Reverse Engineering.
– On Big Graph Data. August 6, 2012
– Applying Graph Analysis and Manipulation to Data Stores. June 22, 2011
ODBMS.org Resources on Graphs and Data Stores
Blog Posts | Free Software | Articles, Papers, Presentations| Tutorials, Lecture Notes
Follow ODBMS.org on Twitter: @odbmsorg
Very interesting view about Graph Databases. We, at Sparsity Technologies, are the developers of DEX (http://www.sparsity-technologies.com/dex)
How DEX relates to Graph databases? Please explain,
Thank you for the interview. It’s very interesting for me. And it’s very usual.
First of all I cannot understand at all how it is possible to oppose the term “schema” to the term “graph”. Michael said that graphs are well-known since a time of network databases. But the network databases always have a scheme. If I have data schema described in OO terms and I will get a graph of objects linked with references. So there is no opposition between graphs and schemas.
Secondly. Michael said that “Relational databases permit navigation via joins”.. I would correct this phrase as “Today relational DBMS permit navigation via evident joins” because generally relational DBMS can permit navigation without evident joins but with very usual navigation paths. Relations do not oppose to the navigation, it’s just a drawback of existing RDBMS.
And I think it’s better to use term “DBMS” when we are speaking about data manipulation abilities because “database” is just big piece of data.
A good news for me that “possible performance differences (between relational DBMS and other ones – EG) are often not obvious”. It means when “difference in expressiveness” is the only problem. But (once again) this is not a general problem and this is not a SQL problem even (it’s just a language which can be slightly extended to get new possibilities). You can visit my site http://rxo-project.com to be convinced of this.
I would like to discuss the pros and cons of different approaches for “storage system” in relation to a basic reference model. Such an approach has been proposed by Peter Chen in his famous paper “The Entity-Relationship Model – Toward a Unified View of Data”. His ideas provide us with the means to express the differences between DMBSs or their shortcomings in comparison to the generic model. The profile of a “storage system” can be described by (1) the existence / model of the schema, (2) the model of the instance level and (3) the functionality of the system (as e.g. ACID-support, access-language, performance). Depending on the purpose and circumstances a data store may live well without a schema and can operate without ACID or formal access-language – if it is prepared to pay the price for such a profile.
A graph data base has (1) no schema – or at least not a formally defined schema. There is a sort of implicit use of a “schema” in each query – otherwise it would be impossible to define a generic search: You specify a search “Friends” that “work_together” – and not “amici” “lavorando insieme” – because you know that there are entity-types releated by relationship-types with exactly these names. On the instance level (2) graph data bases know about entities with attributes and about relationships but without attributes. They provide(3) the D(urability) of the ACID-package and a toolbox to access the instance data, but no formal, schema-related language.
A relational DBMS has a (1) schema with entities and process-related relationships (constraints). The constraints can be derived from the conceptual model, but are not a complete representation of the conceptual (entity-relationship) model. On the instance-level (2) an RDBMS knows about entities (tables) and attributes (columns) but there is no equivalent to a relationship. The relationships of the conceptual model are emulated on the instance level by foreign keys or relation-tables for m:n relationships or relationships with attributes. The SQL-query-language (3) reflects the tabular schema and reestablishes relations via the interpretation of special attributes (foreign keys) in a join-operation. It is common practice to use the conceptual model only for the initial design and the creation of the first cut database; it is never really tested and not included in the evitable changes of a database. It finally “remains pinned to the wall” (Blakely) as an outdated document.
The integration of (1) the (entity-relationship) based schema and (2) a congruent instance-model leads to an erDBMS. Such a DBMS stores the relationships between entity-instances and the attributes of relationships in the database. This has an impact on performance – it slows down the write process and speeds up the read access. (3) An erDBMS provides full ACID-support and has an access-language (erSQL) related to the conceptual schema. It navigates (similar to a graph database) through the structure and collects attributes controlled by an SQL-like select-statement. Schemas are extensible; changes to the schema which affect the instance data require also changes to the instance data: the structural consistency of schema and instance data is automatically enforced. An erDBMS is certainly not another form of an ORM which attempts to bridge the semantic gap between the conceptual model and the (relational) instance data.
SQL-databases are undoubtedly the market leader – Stonebraker has sufficiently explained why this is the case. This may change with the combination of the erDBMS-model using in-memory storage. Our tests with Metasafe and VoltDB (as persistence layer) show promising results.
I would explain two my points.
Point one (an graph vs. schema).
Let me explain why I think that there is no opposition between graphs and schemas. When I see picture like this I understand that only two tables are enough in relational DB to present this graph even it will be million times more.
We have a constant simple schema of data which describe complex graph (because of the simplicity the schema may be not defined manifestly but it exists anyway). The graph describes an object domain, the schema describe data of this object domain. These descriptions are differ, but not opposite. They are orthogonal. For me there is no sense to compare them directly or in form of systems, which implement them.
Point two (“Why everybody wants and doesn’t want to replace RDBMS”).
For me current relational DBMS has only one drawback. Offering advantages of RM, they force users to describe object domain in terms of relational data model. “If you want to use relations you have to create the relations manifestly”. For me RM itself is very formal, and according to RM, all advantages of this formality are available when all data are presented as a set of relations. Once again: all data have to be _presented_ as a set of relations (not described manifestly! So RM doesn’t pretend ever to be used to describe something (inc. object domains).
Better ways exist to describe objects domains. But it doesn’t mean that it’s necessary to throw out relational DBMS, because RM really has a big number of advantages. Unfortunately specialists who understand that RM is not good to describe object model (once again RM doesn’t pretend to do it) try to forget about RM in a new DBMS at all. As a result tons of the new DBMS appear which offer to user a kind of revolution both for mentality and to practical work
But all, what a most of RDBMS users need, is just an ability to describe object domain in more natural way and, then, present the object domain data in relational form.
I don’t see the picture in my previous message after “When I see picture like this”. I use “img” tag to insert it/ I try to do it once again and give a link. It’s just an usual graph.
If you have yet investigated triplestores I would strongly recommend you do. They are essentially graph databases. You mention:
Michael Blaha: Here are some advantages of SQL:
– SQL has a widely-accepted standard.
IMHO one advantage of triplestores over other graph stores if they use RDF and SPARQL – both widely accepted (open) standards from the W3C. There are many triplestores to use from so you have no vendor lock in.