On Big Graph Data.
” The ultimate goal is to ensure that the graph community is not hindered by vendor lock-in” –Marko A. Rodriguez.
“There are three components to scaling OLTP graph databases: effective edge compression, efficient vertex centric query support, and intelligent graph partitioning” — Matthias Broecheler.
Titan is a new distributed graph database available in alpha release. It is an open source Apache project maintained and funded by Aurelius. To learn more about it, I have interviewed Dr. Marko A. Rodriguez and Dr. Matthias Broecheler cofounders of Aurelius.
Q1. What is Titan?
Q2. Who needs to handle graph-data and why?
MARKO: Much of today’s data is composed of a heterogeneous set of “things” (vertices) connected by a heterogeneous set of relationships (edges) — people, events, items, etc. related by knowing, attending, purchasing, etc. The property graph model leveraged by Titan espouses this world view. This world view is not new as the object-oriented community has a similar perspective on data.
However, graph-centric data aligns well with the numerous algorithms and statistical techniques developed in both the network science and graph theory communities.
Q3. What are the main technical challenges when storing and processing graphs?
MATTHIAS: At the interface level, Titan strives to strike a balance between simplicity, so that developers can think in terms of graphs and traversals without having to worry about the persistence and efficiency details. This is achieved by both using the Blueprint’s API and by extending it with methods that allow developers to give Titan “hints” about the graph data. Titan can then exploit these “hints” to ensure performance at scale.
Q4. Graphs are hard to scale. What are the key ideas that make it so that Titan scales? Do you have any performance metrics available?
MATTHIAS: There are three components to scaling OLTP graph databases: effective edge compression, efficient vertex centric query support, and intelligent graph partitioning.
Edge compression in Titan comprises various techniques for keeping the memory footprint of each edge as small as possible and storing all edge information in one consecutive block of memory for fast retrieval.
Vertex centric queries allow users to query for a specific set of edges by leveraging vertex centric indices and a query optimizer.
Graph data partitioning refers to distributing the graph across multiple machines such that frequently co accessed data is co-located. Graph partitioning is a (NP-) hard problem and this is an aspect of Titan where we will see most improvement in future releases.
The current alpha release focuses on balanced partitioning and multi-threaded parallel traversals for scale.
MARKO: To your question about performance metrics, Matthias and his colleague Dan LaRocque are currently working on a benchmark that will demonstrate Titan’s performance when tens of thousands of transactions are concurrently interacting with Titan. We plan to release this benchmark via the Aurelius blog.
[Edit: The benchmark is now available here. ]
Q5. What is the relationships of Titan with other open source projects you were previously involved with, such as TinkerPop? Is Titan open source?
MARKO: Titan is a free, open source Apache2 project maintained and funded by Aurelius . Aurelius (our graph consulting firm) developed Titan in order to meet the scalability requirements of a number of our clients.
In fact, Pearson is a primary supporter and early adopter of Titan. TinkerPop, on the other hand, is not directly funded by any company and as such, is an open source group developing graph-based tools that any graph database vendor can leverage.
With that said, Titan natively implements the Blueprint 2 API and is able to leverage the TinkerPop suite of technologies: Pipes, Gremlin, Frames, and Rexster.
We believe this demonstrates the power of the TinkerPop stack — if you are developing a graph persistence store, implement Blueprints and your store automatically gets a traversal language, an OGM (object-to-graph mapper) framework, and a RESTful server.
Q6. How is Titan addressing the problem of analyzing Big Data at scale?
MATTHIAS: Titan is an OLTP database that is optimized for many concurrent users running short transactions, e.g. graph updates or short traversals against one huge graph. Titan significantly simplifies the development of scalable graph applications such as Facebook, Twitter, and the like.
Interestingly enough, most of these large companies have built their own internal graph databases.
We hope Titan will allow organizations to not reinvent the wheel. In this way, companies can focus on the value their data adds, not on the “plumbing” needed to process that data.
MARKO: In order to support the type of global OLAP operations typified by the Big Data community, Aurelius will be providing a suite of technologies that will allow developers to make use of global graph algorithms. Faunus is a Hadoop-connector that implements a multi-relational path algebra developed by myself and Joshua Shinavier. This algebra allows users to derive smaller, “semantically rich” graphs that can then be effectively computed on within the memory confines of a single machine. Fulgora will be the in-memory processing engine. Currently, as Matthias has shown in prototype, Fulgora can store ~90 billion edges on a 64-Gig RAM machine for graphs with a natural, real-world topology. Titan, Faunus, and Fulgora form Aurelius’ OLAP story
Q7. How do you handle updates?
MATTHIAS: Updates are bundled in transactions which are executed against the underlying storage backend. Titan can be operated on multiple storage backends and currently supports Apache Cassandra, Apache HBase and Oracle BerkeleyDB.
The degree of transactional support and isolation depends on the chosen storage backend. For non-transactional storage backends Titan provides its own locking system and fine grained locking support to achieve consistency while maintaining scalability.
Q8. Do you offer support for declarative queries?
MARKO: Titan implements the Blueprints 2 API and as such, supports Gremlin as its query/traversal language. Gremlin is a data flow language for graphs whereby traversals are prescriptively described using path expressions.
MATTHIAS: With respects to a declarative query language, the TinkerPop teams is currently in the design process of a graph-centric language called “Troll.” We invite anybody interested in graph algorithms and graph processing to help in this effort.
We want to identify the key graph use cases and then build a language that addresses those most effectively. Note that this is happening in TinkerPop and any Blueprints-enabled graph database will ultimately be able to add “Troll” to their supported languages.
Q9. How does Titan compare with other commercial graph databases and RDF triple stores?
MARKO: As Matthias has articulated previously, Titan is optimized for thousands of concurrent users reading and writing to a single massive graph. Most popular graph databases on the market today are single machine databases and simply can’t handle the scale of data and number of concurrent users that Titan can support. However, because Titan is a Blueprints-enabled graph database, it provides that same perspective on graph data as other graph databases.
In terms of RDF quad/triple stores, the biggest obvious difference is the data model. RDF stores make use of a collection of triples composed of a subject, predicate, and object. There is no notion of key/value pairs associated with vertices and edges like Blueprints-based databases. When one wants to model edge weights, timestamps, etc., RDF becomes cumbersome. However, the RDF community has a rich collection of tools and standards that make working with RDF data easy and compatible across all RDF vendors.
For example, I have a deep appreciation for OpenRDF.
Similar to OpenRDF, TinkerPop hopes to make it easy for developers to migrate between various graph solutions whether they be graph databases, in-memory graph frameworks, Hadoop-based graph processing solutions, etc.
The ultimate goal is to ensure that the graph community is not hindered by vendor lock-in.
Q10. How does Titan compare with respect to NoSQL data stores and NewSQL databases?
MATTHIAS: Titan builds on top of the innovation at the persistence layer that we have seen in recent years in the NoSQL movement. At the lowest level, a graph database needs to store bits and bytes and therefore has to address the same issues around persistence, fault tolerance, replication, synchronization, etc. that NoSQL solutions are tackling.
Rather than reinventing the wheel, Titan is standing on the shoulders of giants by being able to utilize different NoSQL solutions for storage through an abstract storage interface. This allows Titan to cover all three sides of the CAP theorem triangle — please see here.
Q11. Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?
MATTHIAS: We absolutely agree with Mike on this. The relational model is a way of looking at your data through tables and SQL is the language you use when you adopt this tabular view. There is nothing intrinsically inefficient about tables or relational algebra. But its important to note that the relational model is simply one way of looking at your data. We promote the graph data model which is the natural data representation for many applications where entities are highly connected with one another. Using a graph database for such applications will make developers significantly more productive and change the way one can derive value from their data.
Dr. Marko A. Rodriguez is the founder of the graph consulting firm Aurelius. He has focused his academic and commercial career on the theoretical and applied aspects of graphs. Marko is a cofounder of TinkerPop and the primary developer of the Gremlin graph traversal language.
Dr. Matthias Broecheler has been researching and developing large-scale graph database systems for many years in both academia and in his role as a cofounder of the Aurelius graph consulting firm. He is the primary developer of the distributed graph database Titan.
Matthias focuses most of his time and effort on novel OLTP and OLAP graph processing solutions.
Resources on Graphs and Data Stores
Blog Posts | Free Software | Articles, Papers, Presentations| Tutorials, Lecture Notes