“Applying Graph Analysis and Manipulation to Data Stores.”

by Roberto V. Zicari on June 22, 2011

“This mind set is much different from the set theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk, a traversal.” (Marko Rodriguez and Peter Neubauer)
__________________________________________________

Interview with Marko Rodriguez and Peter Neubauer.

The open source community is quite active in the area of graph analysis and manipulation, and in its applicability to new data stores. I wanted to know more about an open source initiative called “Tinkerpop”.
I have interviewed Marko Rodriguez and Peter Neubauer, who are the leaders of the “Tinkerpop” project.

RVZ

Q1. You recently started a project called “Tinkerpop”. What is it?

Marko Rodriguez and Peter Neubauer:
TinkerPop is an open-source graph software group. Currently, we provide a stack of technologies (called the TinkerPop stack) and members contribute to those aspects of the stack that align with their expertise. The stack starts just above the database layer (just above the graph persistence layer) and connects to various graph database vendors — e.g. Neo4j, OrientDB, DEX, RDF Sail triple/quad stores, etc.

The graph database space is relatively nascent. When TinkerPop started back in 2009, graph database vendors were primarily focused on graph persistence issues: storing and retrieving a graph structure to and from disk. Given the expertise of the original TinkerPop members (Marko, Peter, and Josh), we decided to take our research (from our respective institutions) and apply it to the creation of tools one step above the graph persistence layer. Out of that effort came Gremlin, the first TinkerPop project. In late 2009, Gremlin was pieced apart into multiple self-contained projects: Blueprints and Pipes.
From there, other TinkerPop products have emerged, which we discuss later.

Q2. Who currently works on “Tinkerpop”?

Marko Rodriguez and Peter Neubauer:
The current members of TinkerPop are Marko A. Rodriguez (USA), Peter Neubauer (Sweden), Joshua Shinavier (USA), Stephen Mallette (USA), Pavel Yaskevich (Belarus), Derrick Wiebe (Canada), and Alex Averbuch (New Zealand).
However, recently, while not yet an “official member” (i.e. picture on website), Pierre DeWilde (Belgium) has contributed much to TinkerPop through code reviews and community relations. Finally, we have a warm, inviting community where users can help guide the development of the TinkerPop stack.

Q3. You say that you intend to provide higher-level graph processing tools, APIs and constructs. Who needs them, and for what?

Marko Rodriguez and Peter Neubauer:
TinkerPop facilitates the application of graphs to various problems in engineering. These problems are generally defined as those that require expressivity and speed when traversing a joined structure. The joined structure is provided by a graph database. With a graph database, a user does not arbitrarily join two tables according to some predicate, as there is no notion of tables.
There exists only a single atomic structure known as the graph. However, in order to unite disparate data, a traversal is enacted that moves over the data in order to yield some computational side-effect, e.g. a search, a score, a rank, or a pattern match.
The benefit of the graph comes from being able to rapidly traverse structures to an arbitrary depth (e.g. tree structures, cyclic structures) and with an arbitrary path description (e.g. friends that work together, roads below a certain congestion threshold). Moreover, this space provides a unique way of thinking about data processing.
We call this data processing pattern the graph traversal pattern.
This mindset is much different from the set-theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk, a traversal.
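
To make the graph traversal pattern concrete, here is a minimal sketch in plain Python. It does not use any TinkerPop API; the toy edge list and the helper names (`out_neighbors`, `walk`, `reachable`, `friends_who_work_together`) are assumptions made up for illustration. It shows a walk along a fixed path description, a traversal to arbitrary depth that tolerates cycles, and the "friends that work together" example.

```python
from collections import deque

# Hypothetical toy graph: each edge is (source, label, target).
EDGES = [
    ("alice", "friend", "bob"),
    ("bob", "friend", "carol"),
    ("carol", "friend", "dave"),
    ("alice", "works_with", "bob"),
    ("bob", "works_with", "dave"),
]


def out_neighbors(vertex, label):
    """Yield vertices reached from `vertex` over one outgoing edge with `label`."""
    for source, edge_label, target in EDGES:
        if source == vertex and edge_label == label:
            yield target


def walk(start, labels):
    """Follow a fixed path description: one edge label per step."""
    frontier = [start]
    for label in labels:
        frontier = [t for v in frontier for t in out_neighbors(v, label)]
    return frontier


def reachable(start, label):
    """Traverse to arbitrary depth along `label`; the seen-set handles cycles."""
    seen, queue = set(), deque([start])
    while queue:
        vertex = queue.popleft()
        for target in out_neighbors(vertex, label):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def friends_who_work_together(vertex):
    """Vertices that are both a friend and a co-worker of `vertex`."""
    return set(out_neighbors(vertex, "friend")) & set(out_neighbors(vertex, "works_with"))
```

Note that nothing here joins tables: every query is expressed as steps taken from a starting vertex.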

Q4. Why use graphs and not objects and/or classical relations? What about the non-normalized data structures offered by NoSQL databases?

Marko Rodriguez and Peter Neubauer:
In a world where memory is expensive, hybrid memory/disk technology is a must (colloquially, a database).
A graph database is nothing more than a memory/disk technology that allows for the rapid creation of an in-memory object (sub)graph from a disk-based (full)graph. A traversal (the means by which data is queried/processed) is all about filling in memory those aspects of the persisted graph that are being touched as the traverser moves along the graph’s vertices and edges.
Graph databases simply cache what is on disk into memory which makes for a highly reusable in-memory cache.
In contrast, with a relational database, where any table can be joined with any table, many different data structures are constructed from the explicit tables persisted. Unlike a relational database, a graph database has one structure, itself.
Thus, components of itself are always reusable. Hence, a “highly reusable cache.” Given this description, if a persistence engine is sufficiently fast at creating an in-memory cache, then it meets the processing requirements of a graph database user.
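
The "highly reusable cache" idea can be sketched as follows. This is only an illustration under stated assumptions: the `DISK` dictionary stands in for a persisted graph, and `CachedGraph` and `traverse` are hypothetical names, not part of any graph database. The point is that a traversal pulls into memory only the vertices it touches, and a later traversal over the same region reuses them.

```python
# Hypothetical stand-in for the full graph persisted on disk:
# vertex id -> list of adjacent vertex ids.
DISK = {1: [2, 3], 2: [4], 3: [4], 4: [], 5: [1]}


class CachedGraph:
    """Loads adjacency lists from 'disk' on first touch and keeps them in memory."""

    def __init__(self, disk):
        self._disk = disk
        self.cache = {}       # the in-memory subgraph built so far
        self.disk_reads = 0   # how many times 'disk' was actually consulted

    def neighbors(self, vertex_id):
        if vertex_id not in self.cache:
            self.disk_reads += 1
            self.cache[vertex_id] = list(self._disk[vertex_id])
        return self.cache[vertex_id]


def traverse(graph, start):
    """Depth-first walk that touches only what is reachable from `start`."""
    seen, stack = set(), [start]
    while stack:
        vertex = stack.pop()
        if vertex in seen:
            continue
        seen.add(vertex)
        stack.extend(graph.neighbors(vertex))
    return seen
```

In this sketch, a traversal from vertex 1 never loads vertex 5, and a second traversal from vertex 2 needs no further disk reads because its region is already cached.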

Q5. Besides graph databases, who may need Tinkerpop tools? Could they be useful for users of relational databases as well, or of other databases such as NoSQL or object databases? If yes, how?

Marko Rodriguez and Peter Neubauer:
In the end, the TinkerPop stack is based on the low-level Blueprints API.
By implementing the Blueprints API and making it sufficiently speedy, any database can, in theory, provide graph processing functionality. So yes, TinkerPop could be leveraged by other database technologies.
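
The idea of one low-level contract with many backends can be sketched like this. The `Graph` class below is a hypothetical, heavily reduced contract written for illustration; it is not the actual Blueprints API, and the backend and helper names are likewise assumptions. Any store that can answer `out_edges` can run the same traversal code.

```python
from abc import ABC, abstractmethod


class Graph(ABC):
    """Hypothetical minimal graph contract (illustrative, not the Blueprints API)."""

    @abstractmethod
    def add_edge(self, source, label, target):
        """Store a directed, labeled edge."""

    @abstractmethod
    def out_edges(self, source):
        """Return a list of (label, target) pairs leaving `source`."""


class AdjacencyGraph(Graph):
    """Backend 1: adjacency lists held in a dictionary."""

    def __init__(self):
        self._adjacency = {}

    def add_edge(self, source, label, target):
        self._adjacency.setdefault(source, []).append((label, target))

    def out_edges(self, source):
        return list(self._adjacency.get(source, []))


class EdgeTableGraph(Graph):
    """Backend 2: a flat list of rows, loosely like a relational edge table."""

    def __init__(self):
        self._rows = []

    def add_edge(self, source, label, target):
        self._rows.append((source, label, target))

    def out_edges(self, source):
        return [(label, target) for s, label, target in self._rows if s == source]


def friends_of_friends(graph, start):
    """A tool written only against the contract, so it runs on either backend."""
    result = set()
    for label, friend in graph.out_edges(start):
        if label != "friend":
            continue
        for label2, candidate in graph.out_edges(friend):
            if label2 == "friend" and candidate != start:
                result.add(candidate)
    return result


def load_sample(graph):
    """Fill any backend with the same small, made-up data set."""
    graph.add_edge("marko", "friend", "peter")
    graph.add_edge("peter", "friend", "josh")
    graph.add_edge("peter", "friend", "marko")
    graph.add_edge("marko", "created", "gremlin")
    return graph
```

Both backends give the same answer, but the edge-table version scans every row on each step, which is why the implementation also has to be sufficiently fast to be useful for traversals.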

Q6. Tinkerpop is composed of several sub-projects: Gremlin, Pipes, Blueprints and more. At first glance, it is difficult to grasp how they are related to each other. What are all these sub-projects? Do they all relate to each other?

Marko Rodriguez and Peter Neubauer:
The TinkerPop stack, described from bottom to top:
Blueprints: A graph API with an operational semantics test suite that, when implemented, yields a Blueprints-enabled graph database which is accessible to all TinkerPop products.
Pipes: A data flow framework that allows for lazy graph traversing.
Gremlin: A graph traversal language that compiles down to Pipes.
Frames: An object-to-graph mapper that turns vertices and edges into objects and relations (and vice versa).
Rexster: A RESTful graph server that exposes the TinkerPop suite “over the wire.”

Q7. Is there a unified API for Tinkerpop? And if yes, what does it look like?

Marko Rodriguez and Peter Neubauer:
Blueprints is the foundation of TinkerPop.
You can think of Blueprints as the JDBC of the graph database community. Many graph vendors, while providing their own APIs, also provide a Blueprints implementation so the TinkerPop stack can be used with their database. Currently, Neo4j, OrientDB, DEX, RDF Sail, TinkerGraph, and Rexster are all TinkerPop promoted/supported implementations.
However, out in the greater developer community, there exist implementations for HBase (GraphBase) and Redis (Blueredis). Moreover, the graph database vendor InfiniteGraph plans to release a Blueprints implementation in the near future.

Q8. In your projects you speak of “dataflow-inspired traversal models”. What are they?

Marko Rodriguez and Peter Neubauer:
Data flow graph processing, in the Pipes/Gremlin-sense, is a lazy iteration approach to graph traversing.
In this model, chains of pipes are connected. Each pipe is a computational step that is one of three types of operations: transform, filter, or side-effect.
A transformation pipe will take data of one type and emit data of another type. For example, given a vertex, a pipe will emit its outgoing edges. A filter pipe will take data and either emit it or not. For example, given an edge, emit it if its label equals “friend.” Finally, a side-effect pipe will take data and emit the same data; in the process, however, it will yield some side-effect.
For example, it may increment a counter, update a ranking, or print a value to standard output.
Pipes is a library of general purpose pipes that can be composed to effect a graph traversal based computation. Finally, Gremlin is a DSL (domain specific language) that supports the concise specification of a pipeline. The Gremlin code base is actually quite small; all of the work is in Pipes.
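
The three kinds of pipe described above can be approximated with Python generators. This is a sketch of the idea only, not the Pipes library or Gremlin syntax; the function names, the edge tuples, and the `counter` dictionary are assumptions made for illustration. Because generators are lazy, nothing is computed until a result is pulled from the end of the chain.

```python
# Hypothetical toy data: each edge is (source, label, target).
EDGES = [
    ("marko", "friend", "peter"),
    ("marko", "created", "gremlin"),
    ("marko", "friend", "josh"),
]


def out_edges_pipe(vertices, edges):
    """Transform: take vertices, emit their outgoing edges."""
    for vertex in vertices:
        for edge in edges:
            if edge[0] == vertex:
                yield edge


def label_filter_pipe(edge_stream, label):
    """Filter: emit an edge only if its label matches."""
    for edge in edge_stream:
        if edge[1] == label:
            yield edge


def count_pipe(items, counter):
    """Side-effect: emit each item unchanged while incrementing a counter."""
    for item in items:
        counter["count"] += 1
        yield item


# Compose the pipes into a pipeline; no work happens at this point.
counter = {"count": 0}
pipeline = count_pipe(
    label_filter_pipe(out_edges_pipe(["marko"], EDGES), "friend"),
    counter,
)
count_before_pulling = counter["count"]

# Pull one result, then drain the rest.
first = next(pipeline)
count_after_first = counter["count"]
rest = list(pipeline)
```

The counter only moves as items are pulled through the chain, which is the lazy iteration behaviour the answer describes.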

Q9. How can other developers contribute to this project?

Marko Rodriguez and Peter Neubauer:
New members tend to be users. A user will get excited about a particular product or some tangent idea that is generally useful to the community. They provide thoughts and code, and ultimately, if they “click” with the group (coding style, approach, etc.), they become members. For example, Stephen Mallette was very keen on advancing Rexster and, as such, has worked and continues to work wonders on the server codebase.
Pavel Yaskevich was interested in the compiler aspects of Gremlin and contributed on that front through many versions. Pavel is also a contributing member to Cassandra’s recent query language known as CQL.
Derrick Wiebe has contributed a lot to Pipes and, in his day job, needed to advance particular aspects of Blueprints (and luckily, this benefits others). There are no hard rules for membership. Primarily it's about excitement, dedication, and expert-level development.
In the end, the community requires that TinkerPop be a solid stack of technologies that is well thought out and consistent throughout. In TinkerPop, it's less about features and lines of code than it is about a consistent story that resonates well with those succumbing to the graph mentality.

____________________________________________________________________________________

Marko A. Rodriguez:
Dr. Marko A. Rodriguez currently owns the graph consulting firm Aurelius LLC. Prior to this venture, he was a Director’s Fellow at the Center for Nonlinear Studies at the Los Alamos National Laboratory and a Graph Systems Architect at AT&T.
Marko’s work for the last 10 years has focused on the applied and theoretical aspects of graph analysis and manipulation.

Peter Neubauer:
Peter Neubauer has been deeply involved in programming for over a decade and is co-founder of a number of popular open source projects such as Neo4j, TinkerPop, OPS4J and Qi4j. Peter loves connecting things, writing novel prototypes and throwing together new ideas and projects around graphs and society-scale innovation.
Right now, Peter is the co-founder and VP of Product Development at Neo4j Technology, the company sponsoring the development of the Neo4j graph database.
If you want brainstorming, feed him a latte and you are in business.

______________________________________

For further readings

Graphs and Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Tutorials, Lecture Notes

“Marrying objects with graphs”: Interview with Darren Wood.

“Interview with Jonathan Ellis, project chair of Apache Cassandra”.

“The evolving market for NoSQL Databases: Interview with James Phillips.”
