On Graph Databases. Interview with Emil Eifrem
“The IoT will have many, many trillions of connections, particularly considering it’s not just the devices that are connected, but people, organizations, applications, and the underlying network” –Emil Eifrem.
I have interviewed Emil Eifrem, CEO of Neo Technology. Among the topics we discussed: Graph Databases, the new release of Neo4j, and how graphs relate to the Internet of Things.
Q1. Michael Blaha said in an interview: “The key is the distinction between being occurrence-oriented and schema-oriented. For traditional business applications, the schema is known in advance, so there is no need to use a graph database which has weaker enforcement of integrity. If instead, you’re dealing with at best a generic model to which it conforms, then a schema-oriented approach does not provide much. Instead a graph-oriented approach is more natural and easier to develop against.”
What is your take on this?
Emil Eifrem: While graphs do excel where requirements and/or data have an element of uncertainty or unpredictability, many of the gains that companies experience from using graph databases don’t require the schema to be dynamic. What makes graph databases suitable is problems where the relationships between the data, and not just the data, matter.
That said, I agree that a graph-oriented approach is incredibly natural and easier to develop against. We see this again and again, with development cycle times reduced by 90% in some cases. Performance is also a common driver.
Q2. You recently released Neo4j 2.2. What are the main enhancements you have made to the internal architecture in Neo4j 2.2, and why?
Emil Eifrem: Neo4j 2.2 makes huge strides in performance and scalability. Cypher queries run up to 100 times faster than before, thanks to a Cost-Based Optimizer that includes Visual Query Plans as a tuning aid.
Read scaling for highly concurrent workloads can be as much as 10 times higher with the new In-Memory Page Cache, which helps users take better advantage of modern hardware. Write scaling is also significantly higher for highly concurrent transactional workloads. Our engineering team found some clever ways of increasing throughput by buffering writes to a single transaction log, rather than blocking while transactions commit one at a time, each to two transaction logs (graph and index).
The last internal architecture change was to integrate a bulk loader into the product. It’s blindingly fast. We use Neo4j internally, and a load that took many hours transactionally runs in four minutes with the bulk loader. It operates at throughputs of a million records per second, even for extremely large graphs.
Besides all of the internal improvements, this release also includes a lot of the top-requested developer features from the community in the developer tooling, such as built-in learning and visualization improvements.
Q3. With Neo4j 2.2 you introduce a new page cache. Could you please explain what is new with this page cache?
Emil Eifrem: Neo4j has two levels of caching. In earlier versions, Neo4j delegated the lower level cache to the operating system, by memory mapping files. OS memory mapping is optimized for a wide range of workloads.
As users have continued to push into bigger and bigger workloads, with more and more data, we decided it was time to build a specialized cache designed for Neo4j workloads. The page cache uses an LRU-K algorithm, and is auto-configured and statistically optimized, to deliver vastly improved scalability in highly concurrent workloads. The result is much better read scaling in multi-core environments that maintains the ultra-fast performance that’s been the hallmark of Neo4j.
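To make the LRU-K idea concrete, here is a minimal Python sketch with K=2. It is an illustration only and not Neo4j's implementation: the class name `LRUKCache`, the `load` callback, and the choice to drop a page's access history on eviction are all assumptions made for brevity. The point it shows is that a page touched only once is evicted before a page that has been touched repeatedly, even if the single touch was more recent.

```python
from collections import deque
from itertools import count


class LRUKCache:
    """Toy LRU-K page cache (hypothetical, not Neo4j code).

    Evicts the page whose K-th most recent access is oldest.
    """

    def __init__(self, capacity, k=2):
        self.capacity = capacity
        self.k = k
        self.pages = {}        # page_id -> page contents
        self.history = {}      # page_id -> deque of the last k access ticks
        self.clock = count(1)  # logical clock: every access gets a new tick

    def _touch(self, page_id):
        ticks = self.history.setdefault(page_id, deque(maxlen=self.k))
        ticks.append(next(self.clock))

    def _victim(self):
        def key(page_id):
            ticks = self.history[page_id]
            # A page seen fewer than k times has no k-th access yet;
            # treat it as tick 0 so it is evicted first.
            kth = ticks[0] if len(ticks) == self.k else 0
            return (kth, ticks[-1])  # tie-break on latest access (plain LRU)
        return min(self.pages, key=key)

    def get(self, page_id, load):
        if page_id not in self.pages:
            if len(self.pages) >= self.capacity:
                victim = self._victim()
                del self.pages[victim]
                del self.history[victim]  # simplification: history is dropped
            self.pages[page_id] = load(page_id)
        self._touch(page_id)
        return self.pages[page_id]


cache = LRUKCache(capacity=2)
for page_id in ("a", "a", "b", "c"):
    cache.get(page_id, load=lambda pid: f"contents of {pid}")
print(sorted(cache.pages))  # ['a', 'c']
```

A plain LRU cache would have evicted page `a` here, because `b` was touched more recently; LRU-K keeps `a` because it has proven to be reused.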
Q4. How is this new page cache helping to overcome some of the limitations imposed by current IO systems? Do you have any performance measurements to share with us?
Emil Eifrem: The benefits kick in progressively as you add cores and threads. In the labs we’ve seen up to 10 times higher read throughput compared to previous versions of Neo4j, in large simulations. We also have some very positive reports from the field indicating similar gains.
Q5. What enhancements did you introduce in Neo4j 2.2 to improve both transactional and batch write performance?
Emil Eifrem: Write throughput has gone up because of two improvements. One is the fast-write buffering architecture. This lets multiple transactions flush to disk at the same time, in a way that improves throughput without sacrificing latency. Secondly, there is a change to the structure of the transaction logs. Prior to 2.2, writes were committed one at a time, with a two-phase commit across the graph and its index. With the unified transaction log, multiple writes can be committed together, using a more efficient approach for ensuring ACIDity between the graph and indexes.
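The batching idea described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Neo4j's log format or API: `GroupCommitLog`, its record layout, and the in-memory `io.StringIO` sink standing in for a log file are all hypothetical. It shows graph and index changes going into one record of one log, and a single flush covering every transaction buffered so far.

```python
import io


class GroupCommitLog:
    """Toy unified transaction log that flushes buffered commits in batches."""

    def __init__(self, sink):
        self.sink = sink    # file-like object standing in for the log file
        self.pending = []   # committed but not yet flushed records
        self.flushes = 0    # number of physical flushes performed

    def commit(self, tx_id, graph_changes, index_changes):
        # Graph and index changes share ONE record, so they reach the log together.
        record = f"{tx_id}|graph={graph_changes}|index={index_changes}\n"
        self.pending.append(record)

    def flush(self):
        # One write plus one flush covers every buffered transaction.
        if not self.pending:
            return 0
        batch, self.pending = self.pending, []
        self.sink.write("".join(batch))
        self.sink.flush()
        self.flushes += 1
        return len(batch)


log = GroupCommitLog(io.StringIO())
for tx_id in ("tx1", "tx2", "tx3"):
    log.commit(tx_id, "create node", "add index entry")
print(log.flush(), log.flushes)  # 3 1
```

Three transactions cost one flush instead of three, which is where the throughput gain in this simplified model comes from.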
For bulk initial loading, there’s something entirely different, a utility called “neo4j-import” that’s designed to load data at extremely high rates. We’ve seen complex graphs with tens of billions of nodes and relationships loading at rates of 1M records per second.
Q6. You introduced a cost-based query planner for Cypher, which uses statistics about data sets. How does it work? What statistics do you use about data sets?
Emil Eifrem: In 2.2 we introduced both a cost-based optimizer and a visual query planner for Cypher.
The cost-based optimizer gathers statistics such as the total number of nodes by label and calculates the most efficient query path based not just on information about the question being asked, but information about the data patterns in the graph. While some Cypher read queries perform just as fast as they did before, others can be 100 times faster.
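As a rough illustration of how node counts by label can steer a plan, the Python sketch below picks the cheapest starting point for a pattern that touches two labels. It is a deliberately simplified model and not the Cypher planner: the functions `estimate_cost` and `choose_start`, the labels, and the counts are all invented for the example, and cost is taken to be nothing more than the number of nodes to scan.

```python
def estimate_cost(start_label, label_counts):
    """Estimated rows produced by scanning every node with this label."""
    return label_counts[start_label]


def choose_start(pattern_labels, label_counts):
    """Start the traversal from the label with the lowest estimated cost."""
    return min(pattern_labels, key=lambda label: estimate_cost(label, label_counts))


# Hypothetical statistics gathered from the graph.
label_counts = {"Person": 1_000_000, "City": 500}

# For a pattern such as (p:Person)-[:LIVES_IN]->(c:City),
# scanning 500 cities first is far cheaper than scanning a million people.
print(choose_start(["Person", "City"], label_counts))  # City
```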
The visual query planner provides insight into how the Neo4j optimizer will execute a query, helping users write better and faster queries because Cypher is more transparent.
Q7. Gartner recently said that “Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
Graph analysis does not necessarily imply the need for a dedicated graph database. Do you have any comment?
Emil Eifrem: Gartner is making a business statement, not a technology statement. Graph analysis refers to a business activity. The best tool we know for carrying out relevant and valuable graph analysis is a graph database.
The real value of using a dedicated graph database is in the power to use data relationships easily. Because data and its relationships are stored and processed as they naturally occur in a graph database, elements such as index-free adjacency lead to ultra-accurate and speedy responses to even the most complex queries.
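Index-free adjacency can be pictured with a small Python sketch. It is only an analogy and not how Neo4j stores data on disk: the `Node` class and the `friends_of_friends` helper are hypothetical. Each node holds direct references to its neighbors, so a two-hop traversal follows those references instead of looking anything up in a global index.

```python
class Node:
    """A node that stores direct references to its adjacent nodes."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []  # references to Node objects, not keys into an index

    def connect(self, other):
        self.neighbors.append(other)


def friends_of_friends(node):
    """Names reachable in exactly two hops by following neighbor references."""
    return {
        fof.name
        for friend in node.neighbors
        for fof in friend.neighbors
        if fof is not node
    }


alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.connect(bob)
bob.connect(carol)
bob.connect(alice)
print(friends_of_friends(alice))  # {'carol'}
```

The cost of each hop depends on how many neighbors a node has, not on the total size of the graph.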
Businesses that build operational applications on the right graph database experience measurable benefits: better performance overall, more competitive applications that incorporate previously impossible-to-include real-time features, easier development cycles that lead to faster time-to-market, and higher revenues thanks to speedier innovation and sharper fraud detection.
Q8. How do you position Neo4j with respect to RDBMSs that handle XML and RDF data, and to NoSQL databases that handle graph-based data?
Emil Eifrem: While it is possible to use Neo4j to model an RDF-style graph, our observation is that most people who have tried doing this have found RDF much more difficult to learn and use than the property graph model and associated query methods. This is unsurprising given that RDF is a web standard created by an organization chartered with world wide web standards (the W3C), which has a very different set of requirements than organizations do for their enterprise databases. We saw the need to invent a model suited for persistent data inside of an enterprise, for use as a database data model.
As for XML, that again is a great data transport and exchange mechanism, and is conceptually similar to what’s done in document databases. But it’s not really a suitable model for database storage if what you care about is relating things across your network. XML databases experienced some hype early on but never caught on.
While we’re talking about document databases … there’s another point here worth drilling into, which is not the data model, but the consistency model. If you’re dealing with isolated documents, then it’s okay for the scope of the transaction to be limited to one object. This means eventual consistency is okay. With graphs, because things relate to one another, if you don’t ensure that related things get written to the database on an “all or nothing” basis, then you can very quickly corrupt your graph. This is why BASE is sufficient for other forms of NoSQL, but not for graphs.
Q9. Could you please give us some examples of how graph databases could help support the Internet of Things (IoT)?
Emil Eifrem: We love Neo4j for the Internet of Things, but we’d like to see it renamed to Internet of Connected Things! After all, the value is in the connections between all of the things, that is, the connections and interactions between the devices.
Two points are worth remembering:
- Devices in isolation bring little value to the IoT; rather, it’s the connections between devices that truly bring forth the latent possibilities.
- We’re not just speaking about tracking billions of connections; the IoT will have many, many trillions of connections, particularly considering it’s not just the devices that are connected, but people, organizations, applications, and the underlying network.
Understanding and managing these connections will be at least as important for businesses as understanding and managing the devices themselves. Imagination is key to unlocking the value of connected things. For example, in a telecommunications or aviation network, the questions, “What cell tower is experiencing problems?” and “Which plane will arrive late?” can be answered much more accurately by understanding how the individual components are connected and impact one another. Understanding connections is also key to understanding dependencies and uncovering cascading impacts.
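The cascading-impact question can be sketched as a breadth-first traversal over dependency edges. The Python below is an illustrative sketch with made-up data: the function `impacted_by`, the `depends_on_me` mapping, and the component names are assumptions, and a real deployment would run an equivalent traversal inside the graph database.

```python
from collections import deque


def impacted_by(failed, depends_on_me):
    """Return every component reachable from `failed` along dependency edges."""
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in depends_on_me.get(node, ()):
            if dependent not in seen and dependent != failed:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical network: each key lists the components that depend on it.
network = {
    "tower-7": ["router-2"],
    "router-2": ["sensor-a", "sensor-b"],
    "sensor-a": [],
}

print(sorted(impacted_by("tower-7", network)))  # ['router-2', 'sensor-a', 'sensor-b']
```

A problem at one tower surfaces every downstream device in a single traversal, which is the kind of dependency question that is awkward to answer when components are tracked in isolation.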
Q10. What are your top 3 favourite case studies for Neo4j?
Emil Eifrem: My top three use cases are:
- Real-time Recommendations – Personalize product, content and service offers by leveraging data relationships. (dynamic pricing, financial services products, online retail, routing & networks)
- Fraud Detection – Improve existing fraud detection methods by uncovering hidden relationships to discover fraud rings and indirections. (financial services, health care, government, gaming)
- Master Data Management – Improve business outcomes through storage and retrieval of complex and ‘hierarchical’ master data. Top MDM data sets across our customer base include: customer (360 degree view), organizational hierarchy, employee (HR), product / product line management, metadata management / data governance, and CMDB, as well as digital assets. (financial services, telecommunications, insurance, agribusiness)
Q11. Anything else you wish to add?
Emil Eifrem: Yes, one thing. As much as we’re a product company, we are very passionate educators and evangelists. Our mission is to help the world make sense of data. We discovered that graphs are an amazing way of doing that, and we’re working hard to share that with the world.
For anyone interested in learning more, our web site offers a lot of great learning resources: talks, examples, free training… we’ve even worked with O’Reilly to offer up their Graph Databases e-book. By the time this article is published, the second edition should be up on http://graphdatabases.com. Any of your readers who are interested are welcome to come and learn, and become part of the amazing and rapidly growing worldwide graph database community.
It’s been great to speak with you!
Emil is the founder of the Neo4j open source graph database project, the most widely deployed graph database in the world. As the CEO of Neo4j’s commercial sponsor, Neo Technology, Emil spreads the word about the powers of graphs everywhere. Emil is a co-author of the O’Reilly Media book “Graph Databases” and presents regularly at conferences around the world such as JAOO, JavaOne, QCon, and OSCON.
Follow ODBMS.org on Twitter: @odbmsorg