On Graph databases. Q&A with Chris Gioran
“I believe that graph databases are a milestone in the evolution of database systems.”
Q1. You are responsible for the curation of the technical roadmap for Neo4j products. Can you share with us what you are currently working on?
Graph databases are a relatively new technology compared with relational database systems. The latter have decades over which their model and techniques have evolved. Their domination over the market has established certain norms around what database systems should offer to be considered serious back-end storage solutions and systems of record. So, at Neo4j, we have two main tracks of work. One is to solve the same basic problems that any database has to solve, effectively and from scratch. For instance, ACID capabilities, reliable storage, performance, resiliency and scale through distribution, advanced administrative tools, and more. The other is to take advantage of and expose a useful surface for all the novel capabilities the graph model presents us with.
One major component of our architecture that continues to evolve, and that sits on both tracks, is Neo4j Fabric. This was introduced with Neo4j 4.0 and is our approach to graph distribution. We chose to approach horizontal scaling not through store-based sharding techniques but through a higher-level concept: graph composition. Doing this presents a very intuitive surface to the user and is very sympathetic to the peculiarities of a native graph database. At the same time, it requires solving some complex technical issues, mostly centered around distributed query planning. This work is symbiotic with the language itself and with how graph components can be declared, combined, and queried as a unit. We are constantly making discoveries on this front, and future versions of Neo4j will include a steady stream of performance and usability enhancements for Fabric.
Graph analytics is another interesting area. Graph databases are uniquely positioned to unify analytics and transaction processing on a single platform. It is possible to have high-volume, small-size transactional workloads executed on exactly the same database as the one on which you are running complex large-scale graph algorithms. To that end, we are working on making the surface between the two common, so you can use Cypher to execute both kinds of workloads and have the system decide the best way to treat your query.
Finally, there is Neo4j Aura™, Neo4j’s fully managed service offering that reduces friction as complex applications shift to the cloud. Neo4j Aura has already proven to be a successful product, and, as with all cloud products, we are investing heavily in cost optimization for customers, expanding our cloud provider coverage, and increasing the automation of the system as much as possible. These improvements also benefit self-managed installations, so all work happens with both use cases in mind.
Q2. How do you capture and deal with the requirements of the platform as they will form in the future?
We have multiple sources of input that we use to inform the future evolution of the platform. Most of these are standard across the industry – field and user feedback, market research, trends in the broader data management space, etc.
We try to understand the basic ideas behind what users try to achieve with the platform or with the approaches they take to solve their problems. Our users get increasingly ambitious when they understand the power of modeling their data as graphs. We consistently see new things they try to do which, once composed into a greater picture, show us what technology we should be developing to solve a much broader class of problems. It is also interesting to note that, more often than not, these solutions coincide with improvements we have already identified as possible just by better understanding how our product works in production. Neo4j Aura has been particularly instrumental in this, where we can see how usage of the platform evolves in real time.
Further, there is the product engineering team itself. We are experts at what makes a good graph database, and that expertise results in novel ideas and approaches for solving known problems and for discovering what new technologies can be developed. These are long-running tracks of work that need to be put in the context of short- and medium-term requirements, have their surfaces refined, and go through multiple iterations before shipping as features.
Eventually, these long-term processes either become the core implementation for a class of features already requested, or we present them as an innovation that moves the graph database space forward.
Q3. What does it mean to use graphs to store data and how is this different from other databases?
There are two ways to answer this question – talk about implementation or the model. They are connected, of course, but for simplicity, let’s treat them separately.
On the implementation side, a native graph database stores the connection between the nodes as materialized entities. Effectively, this means that joins are precomputed, stored on disk, and exposed as first-class citizens with their own identifiers, constraints, and properties. This allows for paths of arbitrary lengths to be traversed using complicated predicates without imposing the prohibitive memory consumption that an on-the-fly computed join would. However, this also requires a completely different approach to query planning, paging in memory, transaction state handling, what indexes are relevant, and even affects how distribution across machines happens. Solving all these issues results in systems that are quite novel in how they approach data storage and, of course, present a new way of modeling applications.
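To make the "materialized join" idea concrete, here is a minimal, illustrative sketch in Python. All class and function names here are invented for illustration and this is not Neo4j's actual storage layout; it only shows the principle of relationships stored as first-class records that both endpoints point to directly:

```python
# Sketch: each relationship is a stored record with its own identifier, type,
# and properties, and both endpoint nodes keep direct references to it, so
# traversal is pointer-chasing rather than an on-the-fly computed join.

class Relationship:
    def __init__(self, rel_id, rel_type, start, end, props=None):
        self.id = rel_id
        self.type = rel_type
        self.start = start          # direct reference to the start node
        self.end = end              # direct reference to the end node
        self.props = props or {}

class Node:
    def __init__(self, node_id, props=None):
        self.id = node_id
        self.props = props or {}
        self.rels = []              # materialized adjacency: no join table scan

def connect(store, rel_type, a, b, props=None):
    # Persist the relationship once, then link it from both endpoints.
    rel = Relationship(len(store), rel_type, a, b, props)
    store.append(rel)
    a.rels.append(rel)
    b.rels.append(rel)
    return rel

def neighbours(node, rel_type=None):
    # Constant-cost-per-hop expansion: follow stored pointers.
    for rel in node.rels:
        if rel_type is None or rel.type == rel_type:
            yield rel.end if rel.start is node else rel.start

store = []
alice, bob, carol = Node(1, {"name": "Alice"}), Node(2, {"name": "Bob"}), Node(3, {"name": "Carol"})
connect(store, "KNOWS", alice, bob)
connect(store, "KNOWS", bob, carol)
print([n.props["name"] for n in neighbours(bob, "KNOWS")])  # ['Alice', 'Carol']
```

The point of the sketch is that expanding from a node never consults a join table: the connections were paid for once, at write time.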
From the user’s perspective, the ability to treat relationships as first-class citizens has multiple implications for how an application is modeled and implemented. For example, in the absence of join tables, there is no limit to how complex a self-join operation may be. Most relational systems will limit the number of joins possible in a query, but with a graph database, that limitation is no longer present. This means paths can be arbitrarily long and interconnected without worrying about performance. Relationships can be manipulated independently of their nodes, without the need to recreate them as part of a query. Paths can be traversed, with complex predicates determining the direction of the traversal, without prior knowledge of, or concern about, how complicated the actual underlying graph will be. This is not just a matter of performance but a fundamentally new way of thinking about what it means to interact with a database.
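The arbitrary-length traversal with per-step predicates described above can be sketched as follows. The adjacency-map representation and function names are assumptions made for illustration; the example approximates what a Cypher variable-length pattern such as `()-[:KNOWS*]->()` expresses:

```python
# Sketch: enumerate every simple path from a start node whose relationships
# all satisfy a predicate, with no fixed bound on path length.

def paths_from(graph, start, predicate, path=None):
    # graph: dict mapping node -> iterable of (rel_type, neighbour) pairs.
    path = path or [start]
    yield path
    for rel_type, nxt in graph.get(start, ()):
        if nxt not in path and predicate(rel_type):
            yield from paths_from(graph, nxt, predicate, path + [nxt])

graph = {
    "a": [("KNOWS", "b")],
    "b": [("KNOWS", "c"), ("WORKS_WITH", "d")],
    "c": [("KNOWS", "d")],
}
# Follow only KNOWS relationships, to any depth.
knows_paths = [p for p in paths_from(graph, "a", lambda t: t == "KNOWS") if len(p) > 1]
print(knows_paths)  # [['a', 'b'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]
```

Note that the caller never states how deep the traversal may go; the predicate and the graph's actual structure decide that together.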
As an additional example of this, let’s consider the case of graph algorithms. For analytics and AI/ML applications, graphs have proven to be the best data structure to work with. State-of-the-art data science pipelines will first perform a transformation of the data to a graph and then will run their algorithms on top of that. This is not a coincidence. Graph algorithms are both intuitive and fast precisely because they can operate on connected data without going through the extra step of computing connections or exposing the fact to the user. They require less effort from the user, letting them focus on what matters – getting knowledge out of their data.
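As a toy illustration of a graph algorithm operating directly on connected data, here is a plain power-iteration PageRank over an adjacency map. This is a textbook sketch, not the implementation used by Neo4j's Graph Data Science library:

```python
# Sketch: PageRank by power iteration over a dict of node -> outgoing targets.
# Because the connections already exist as a structure, the algorithm is a
# direct walk over them; no join step precedes it.

def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, targets in graph.items():
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: redistribute its rank evenly.
                for t in nodes:
                    new[t] += damping * rank[n] / len(nodes)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
# "c" accumulates the most rank: it is pointed at by both "a" and "b".
```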
Q4. Complexity increases over time. How do you gain insights from complex datasets by extracting knowledge at the data model level?
In short, extracting insight from a dataset is inevitable. The more complex the dataset, the more you can learn from it. All you have to do is look. In the graph model, every piece of data the user adds gives us a bit of insight into the real world. A path is made slightly longer, the distribution of values in a property becomes more concrete, or two subgraphs are connected or split apart. By allowing everything to potentially connect to everything, the user is free to explore and update their dataset in an almost ad-hoc fashion. But because they have to abide by the graph structure, connections will be present, allowing for declarative data retrieval that still produces structured results. To put it another way, the user can add arbitrary simple paths to the graph. Any connections created because of this are still discoverable without having to explicitly manage them through, for example, foreign keys. This means that the value of the data going into the database is greater than before it got there because it becomes part of a bigger whole.
At a data model level, every added relationship creates a new path, a new way to get from one part of the graph to another. This means that in a graph database, the user doesn’t just add data – they add structure. That relationship can be treated as a structural component, a way to get from one subgraph to another without necessarily looking at its contents. The result is that, for example, a shortest path query may produce different results as time goes by without the user needing to rewrite the query. This can happen without needing to know about the writes that caused the result to change. Those writes may even introduce new relationship types or whole new subgraphs – to the reader, this doesn’t matter, and it shouldn’t matter. Structure and its discoverability are what matters, and that is one of the great abstractions that graph databases offer.
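The shortest-path point above can be shown with a toy example: the same declarative question returns a different answer after a write adds structure, with no query rewrite. This uses a plain breadth-first search, not Neo4j's actual path-finding machinery:

```python
# Sketch: BFS shortest path over an adjacency map. The "query" never changes;
# only the graph does.
from collections import deque

def shortest_path(graph, start, goal):
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

graph = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(shortest_path(graph, "a", "d"))  # ['a', 'b', 'c', 'd']

# A later write adds one relationship; the identical question now has a
# shorter answer, and the reader needed no knowledge of that write.
graph["a"].append("d")
print(shortest_path(graph, "a", "d"))  # ['a', 'd']
```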
This is just an example from an OLTP scenario, where the quality of results returned increases as data gets added to the database and the structure gets richer. But, as a more straightforward use case, Neo4j Bloom comes to mind. Data visualization is a distinguishing feature for graph databases, and Bloom allows you to see the full dataset you are working with. Being able to inspect the data creates a completely different perception of the underlying structure of the database. Clusters of data become obvious, subgraphs are readily identifiable, and you can see the dimensions of the dataset you are working with. The more data you add, the more complex it gets; the result is more structure and, with it, more knowledge.
Q5. Graph composition is one of the most advanced concepts for graph databases. How is it useful in practice?
Graph composition is a design pattern that plays directly to the strengths of a graph database. It is a combination of the lack of necessity for a mandatory schema and the natural composability of graphs. Any two graphs can be brought together and, if the user can identify a way to traverse from one to the other, for example, based on some join condition or domain mapping, the parts will immediately form a bigger whole. Interestingly, the reverse is also true for exactly the same reasons – two sides of a coin, in a way. Any graph can be decomposed into component graphs and have the join or mapping made explicit in a query.
This pattern is as fundamental as it is powerful. What it needs is a system with a native way to support it at the query and administrative level, which is how we came up with Neo4j Fabric. By allowing the user to declaratively define the graph “stitching” operation across a collection of otherwise disjoint graphs, it is possible to treat them as a single unit, made up of pieces that are logically isolated. By “logically isolated,” I mean that each component is a self-contained unit that can be queried independently, even if it contains a subset of the complete dataset. In this way, users can split up their data in databases in ways driven explicitly and purely by their domain instead of relying on some store-level mapping tool that uses incomplete pieces of information (like range-based hash functions or similar). But there are two additional benefits that may be easy to overlook.
The first is that this approach makes all distribution decisions a concern for the query runtime. By having a declarative way of composing and decomposing graphs, all the information available to the runtime, such as data metrics, plan optimization rules, and cluster topology, can be used for distributing the query and gathering results. This is a fundamentally more powerful technique than any static method, including store-level sharding, since it allows the system to adapt dynamically to many more streams of information about the environment.
The second benefit is that it makes federation and distribution of data a purely modeling concern. All methods of data distribution require input from the user to let the system know the data roots, distribution keys, traversal methods, and so on. Using graph composition as the primitive, all these concerns are concentrated at the declarative level and provide much more flexibility. Graphs can be created ad hoc and brought together dynamically, have aggregations created over them and stored in new databases for global analytics, and even split out in pieces as appropriate. In a way, this lets the user evolve their system as they get more sources of data, refactor their model, and understand more about its structure. It represents, in many ways, a departure from the traditional method of achieving scalability by elevating it from an administrative concern to a knowledge management one.
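The composition primitive discussed above can be sketched as merging independently queryable adjacency maps through a declared mapping. This is an illustrative toy, not Fabric's actual mechanics, and all names are invented:

```python
# Sketch: two disjoint graphs become one logical graph once the user declares
# where a traversal may cross between them (the "stitching").

def compose(graph_a, graph_b, stitching):
    # stitching: iterable of (node_in_a, node_in_b) pairs, a domain-driven
    # join condition declared by the user rather than a store-level mapping.
    combined = {}
    for g in (graph_a, graph_b):
        for node, nbrs in g.items():
            combined.setdefault(node, []).extend(nbrs)
    for a_node, b_node in stitching:
        combined.setdefault(a_node, []).append(b_node)
        combined.setdefault(b_node, []).append(a_node)
    return combined

# Each component graph remains independently queryable on its own...
customers = {"cust:1": ["cust:2"], "cust:2": []}
orders = {"order:9": ["order:10"], "order:10": []}

# ...but with a declared mapping (customer 2 placed order 9), traversals
# treat the union as a single whole: 'order:10' is reachable from 'cust:1'.
whole = compose(customers, orders, [("cust:2", "order:9")])
```

The decomposition direction is the same operation read backwards: remove the stitched pairs and each component stands alone again.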
Q6. What do you think is the future potential of graph databases?
I believe that graph databases are a milestone in the evolution of database systems. If we look at the topic historically, the first major evolution was the transition from the earliest systems, like IBM’s IMS, to System R. The introduction of relational algebra and SQL made data management declarative and allowed query planners/optimizers to be created so the system itself could reason about data access. This allowed for optimal disk usage and a huge performance boost by taking that concern away from the user. It required both the creation of novel algorithms and system architectures and that the user provide information ahead of time in the form of a schema.
Since then, things have stagnated. The result was a quickly evolving application ecosystem forced to twist the relational model to fit increasingly incompatible development models and techniques, new languages, and an ever-growing amount of stored data. The NoSQL explosion of the late 2000s was proof that something needed to change, and a whole new generation of systems emerged, showcasing many ideas about the future of databases.
The most interesting model to come out of that period is the graph database. The graph model combines an algebraic structure that allows for query optimization with a much richer structure that lets programmers manipulate data in ways compatible with how they design their applications. There is no more object-relational mismatch and no need for a schema declared ahead of time.
In the ‘80s, that wouldn’t work. There was not enough knowledge about optimizing queries, no technology for Just-In-Time compilation, and no multithreading support. Today we have all these things we can take advantage of to remove even more cognitive overhead from the user – instead letting them focus on thinking about their domain model and how to extract knowledge. I think that’s the direction we need to build on.
We keep coming up with ideas on how to do even more on behalf of the user – creating new ways to optimize queries, discovering methods of distributing and combining data, formatting transformations for import/export, and making graphs a vital component of every data processing pipeline. We can see the potential for graph databases in general, and Neo4j in particular, to be a one-stop shop for every data operation. We already see that both real-time transactional and heavy-duty analytical workloads are equally well served by such systems, and we know that we have barely scratched the surface of what is possible. As we develop new methods of declarative graph data manipulation, things will only get more interesting.
Qx. Anything else you wish to add?
The topics we covered in this Q&A are explored more thoroughly in our series Neo4j Under The Hood. There you can listen to me cover the ideas that underlie complexity and knowledge and how they are treated differently in a graph database, as well as how Neo4j takes advantage of and exposes these concepts.
Chris Gioran is the Chief Architect at Neo4j. He has been building database management systems for more than a decade and is currently curating the technical roadmap for all of Neo4j’s data management platform. Originally from Athens, Greece, he currently lives in Malmö, Sweden, working from the engineering HQ of Neo4j.
Sponsored by Neo4j.