On Graph Databases. Q&A with Brad Bebee
Q1. What is a graph database?
A1. Graphs are a data model that consists of nodes and links, sometimes called vertices and edges. Graphs are great for storing relationships between data because the edges (links) can be queried directly. A graph database is optimized for storing and retrieving graph data, and supports query languages and APIs that are designed for asking questions over relationships. What makes it a graph database, as opposed to another type of graph processing system, is that a graph database also offers features such as consistency, durability, and high availability, which customers are familiar with from traditional relational databases and which are required for building modern, global applications.
Q2. Why graph over traditional RDBMS?
A2. Have you ever been in a brainstorming meeting where someone builds a new system concept on a whiteboard? Often what you see looks like a graph, e.g., customers interacting with different components and connecting different systems. And then, we take that “graph” we put on the whiteboard and make it into a relational database schema. You could store your graph data in an RDBMS, but queries will be hard to write (many joins) and slow to execute. If you change your data model, you’ll need to re-optimize your RDBMS.
It turns out that for applications that are about relationships and traversing them, for example, knowledge graphs, identity graphs/customer 360, and security posture awareness, a graph data model is a faster and more straightforward way to represent the data. Because the relationships (links) are first-class objects in the data model, you can query or traverse them directly. You don’t need to create join queries, which can be complex to write and slow to execute.
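To make the contrast concrete, here is a minimal Python sketch, not anything Neptune-specific. It answers the same two-hop question ("who do my contacts know?") twice: once with a self-join over a relational edge table in SQLite, and once by following edges in an adjacency list. The `knows` table, the names, and both helper functions are hypothetical and exist only for illustration. The sketch shows the shape of each query, not their relative performance.

```python
import sqlite3
from collections import defaultdict

# Hypothetical "knows" relationships, used only for illustration.
EDGES = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "erin")]


def friends_of_friends_sql(person):
    """Two-hop lookup in a relational edge table: one self-join per hop."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO knows VALUES (?, ?)", EDGES)
    rows = conn.execute(
        """
        SELECT DISTINCT second.dst
        FROM knows AS first
        JOIN knows AS second ON second.src = first.dst
        WHERE first.src = ?
        ORDER BY second.dst
        """,
        (person,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


def build_adjacency(edges):
    """Index the edges by source node so each hop is a direct lookup."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    return adjacency


def friends_of_friends_graph(adjacency, person):
    """Two-hop lookup by following edges directly from each node."""
    result = set()
    for friend in adjacency.get(person, []):
        result.update(adjacency.get(friend, []))
    return sorted(result)
```

Each additional hop in the relational version needs another join on the same table, while the graph version just adds another step that follows edges from the current set of nodes.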
Also, graph access patterns are different than relational database access patterns. Relational data tends to be highly normalized with regular access patterns. Graph data tends to have random access patterns because there are many paths through the data and the one taken depends on the specific query. Graph databases are optimized for this type of random access that occurs during relationship traversal in a graph query. The types of query optimizations that work for RDBMS don’t always apply for graphs.
Q3. What are the market opportunity and typical use cases for graph databases?
A3. One of the most exciting things for me about working in the graph space is the breadth of use cases that I get to see customers using graphs to solve. Everything from Netflix building and scaling data lineage to NBC Universal managing social interaction with their live content to Wiz helping customers manage their cloud security posture. The top use cases we see are customers managing information with knowledge graphs, connecting and managing customers with identity graphs or Customer 360, using relationships to detect fraud, and improving security using graphs. That doesn’t even touch the long tail of financial, life sciences, gaming, and other use cases. If you have a challenge, there’s a graph for you!
I believe that there are far more challenges that can be solved with graphs than customers who understand how and when to use a graph and graph database. The real opportunity is enabling customers to be as familiar with using graphs as they are with relational databases, key-value stores, and document databases today. Gartner’s research indicates that the adoption of graph technologies is growing: “By 2025, graph technologies will be used in 80% of the data and analytics innovations, up from 10% in 2021, facilitating rapid decision making across the enterprise” (“Market Guide: Graph Database Management Solutions”, M. Adrian, A. Jaffri, D. Feinberg, 24 May 2021).
Q4. What are the main differences between the top three most popular graph query languages: Cypher, Gremlin, and SPARQL?
A4. At a high level, there are two major graph models: labeled property graphs (LPG) and Resource Description Framework (RDF) graphs. LPGs consist of nodes and edges with properties. LPGs are supported by a number of graph vendors, including Amazon Neptune, and open-source projects like Apache TinkerPop. However, LPGs aren’t formally described and can have small but important differences between providers. There is ongoing work within the Linked Data Benchmark Council (LDBC) to add this formalization for LPGs. RDF is a technical recommendation of the World Wide Web Consortium (W3C). RDF, the SPARQL Query Language for RDF (SPARQL), and a number of other recommendations collectively make up something called the Semantic Web. RDF graphs consist of subject-predicate-object triples, e.g., “Brad is a Person.”
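As a rough illustration of the two models, the Python sketch below writes the same small set of facts first as a labeled property graph and then as RDF-style triples. The `Node` and `Edge` classes, the `http://example.org/` namespace, and the `types_of` helper are assumptions made up for this example; they are not part of any Neptune, TinkerPop, or W3C API. Only the `rdf:type` IRI is the standard one.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A labeled property graph node: an id, labels, and key/value properties."""
    node_id: str
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A labeled property graph edge; edges can carry properties too."""
    source: str
    target: str
    label: str
    properties: dict = field(default_factory=dict)


# Property graph view (all identifiers are made up for illustration).
brad = Node("brad", labels={"Person"}, properties={"name": "Brad"})
neptune = Node("neptune", labels={"Service"}, properties={"name": "Amazon Neptune"})
manages = Edge("brad", "neptune", "MANAGES", properties={"role": "General Manager"})

# RDF view: every statement is a (subject, predicate, object) triple.
EX = "http://example.org/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
triples = {
    (EX + "brad", RDF_TYPE, EX + "Person"),  # "Brad is a Person"
    (EX + "brad", EX + "name", "Brad"),
    (EX + "neptune", RDF_TYPE, EX + "Service"),
    (EX + "brad", EX + "manages", EX + "neptune"),
}


def types_of(subject, graph):
    """Return the rdf:type objects recorded for a subject."""
    return sorted(o for s, p, o in graph if s == subject and p == RDF_TYPE)
```

One difference shows up even in this tiny example: the `role` property sits directly on the `MANAGES` edge in the property graph, while a plain RDF triple has no slot for it, so it would need additional statements to express.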
Customers can query property graphs with languages like Apache TinkerPop Gremlin or openCypher. Gremlin provides imperative traversal capabilities that are great when you need specific, programmatic graph operations. openCypher gives you a declarative model to quickly write queries in the style of SQL. Customers can use SPARQL to write declarative graph queries for RDF graphs.
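The imperative versus declarative distinction can be sketched in plain Python. This is only an analogy under stated assumptions: the code below is not Gremlin, openCypher, or SPARQL, and the edge list, `out`, and `match` are hypothetical helpers written for this example. The imperative version spells out each hop in order; the declarative version states the pattern the answer must satisfy and leaves the search to a generic matcher.

```python
# Hypothetical edge list: (source, label, target).
EDGES = [
    ("brad", "MANAGES", "neptune"),
    ("neptune", "SUPPORTS", "gremlin"),
    ("neptune", "SUPPORTS", "opencypher"),
    ("neptune", "SUPPORTS", "sparql"),
]


def out(nodes, label, edges=EDGES):
    """One imperative traversal step: follow outgoing edges with a label."""
    return [dst for n in nodes for (src, lbl, dst) in edges if src == n and lbl == label]


def imperative_query():
    """Spell out each hop in order, in the spirit of a traversal language."""
    current = ["brad"]
    current = out(current, "MANAGES")
    current = out(current, "SUPPORTS")
    return sorted(current)


def match(patterns, edges=EDGES):
    """Declarative matching: find variable bindings that satisfy every pattern.

    Terms starting with '?' are variables; anything else is a constant.
    """
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for edge in edges:
                candidate = dict(binding)
                for term, value in zip(pattern, edge):
                    if term.startswith("?"):
                        # Bind the variable, or reject if already bound differently.
                        if candidate.setdefault(term, value) != value:
                            break
                    elif term != value:
                        break
                else:
                    next_solutions.append(candidate)
        solutions = next_solutions
    return solutions


def declarative_query():
    """State the shape of the answer and let the matcher find the bindings."""
    patterns = [("brad", "MANAGES", "?svc"), ("?svc", "SUPPORTS", "?lang")]
    return sorted(b["?lang"] for b in match(patterns))
```

Both functions return the same three query languages; what differs is whether the caller or the engine decides how to walk the graph.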
There are differences in the style, syntax, and features of the languages and models. What we’ve learned is that customers are excited about using graphs. The choices of graph models and query languages can hinder more than help developers in their journey to use graphs. We’ve written about our ideas to help bring the graph models together: “Graph? Yes! Which one? Help!”.
Q5. You launched Amazon Neptune in 2018. Why?
A5. We launched Amazon Neptune because customers are excited about graphs!
Seriously, though, we heard from customers that they needed a graph database service that supported open-source and open-standard query languages, while providing the enterprise features and scalability that are needed for global, modern applications.
In the past four years, the use cases and growth that we’ve seen from customers have shown us that it is truly Day 1 for graphs, and that there are far more challenges that can be solved with graphs than customers who understand how and when to use a graph and graph database.
Q6. What exactly is Amazon Neptune?
A6. Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications. It offers developers the most choice of graph data models and APIs, supporting Gremlin and openCypher for labeled property graphs and SPARQL for RDF graphs. As a cloud-native, fully managed database service, it frees customers from hardware setup, provisioning, software patching, and backups.
Q7. Tell us a bit about Amazon Neptune’s main features.
A7. Neptune is a purpose-built graph database that supports strong consistency and provides fast, interactive graph queries in milliseconds over graphs that contain billions of nodes, edges, and properties. All Neptune clusters can be configured with high availability, encryption at rest, and read replicas without expensive commercial licensing. Neptune can scale to clusters of up to 128 TiB of data and supports multi-region operations through its global database feature.
We also offer the AWS Graph Notebook, which provides an easy way for developers and users to get started building graph applications using Jupyter Notebooks. It works great with Neptune, but also supports other graph databases that use openCypher, Apache TinkerPop, or SPARQL.
Neptune ML is a capability of Neptune that uses Graph Neural Networks (GNNs), a machine learning technique purpose-built for graphs, to make easy, fast, and more accurate predictions using graph data. With Neptune ML, you can improve the accuracy of most predictions for graphs by over 50% when compared to making predictions using non-graph methods.
Q8. Who uses Amazon Neptune? Could you give us some customer examples?
A8. Thousands of customers use Neptune every day. I shared some examples above, but let’s look at a few more. Identity graphs and customer 360 are popular graph applications. Customers leverage existing investments in customer data platforms (CDP) and other capabilities by using graphs to make relationships. Cox Automotive has used Neptune to create a better 360 view of their customer data across digital platforms. ADP has built a graph modeling their human resources to reduce human capital management (HCM) costs. Games 24×7, an online gaming company based in India, has a great fraud use case. They run an online Rummy platform, but needed a way to detect fraud based on collusion during tournaments. They solved it using a graph. Careem, a Middle East-based ride-sharing company now owned by Uber, combined machine learning and Amazon Neptune to detect and stop losses from fraud.
In addition to these customer examples, I would like to point out a new, extensive research blog we just published that walks through the process of using graph neural networks (GNNs) for fraud detection with Neptune, Amazon SageMaker, and the Deep Graph Library (DGL).
Q9. What about Amazon Neptune integrations? What is the relationship between Amazon S3 and Amazon Neptune?
A9. A graph is not an island! In fact, most of the graph applications that we see include using a graph alongside other services, databases, or analytics. Graphs are the glue that can help to connect data and make it accessible. A graph database is the way to query those connections and relationships quickly. Amazon Neptune allows you to store graph data in Amazon S3 and do a fast, parallel bulk load into a Neptune cluster. We also have integrations with AWS Identity and Access Management (IAM) to provide fine-grained API access control, Amazon OpenSearch for full-text retrieval, Neptune connectors for Amazon Athena, and many more.
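For readers who want a feel for the S3 bulk load, here is a hedged Python sketch that only builds the HTTP request and does not send it. It assumes the loader is reached with a POST to a `/loader` path on the cluster endpoint, that the default port is 8182, and that the JSON fields are `source`, `format`, `iamRoleArn`, `region`, and `failOnError`. These reflect my understanding of the Neptune bulk loader and should be checked against the current Neptune documentation. The function name, endpoint, bucket, and role ARN are placeholders invented for this example.

```python
import json
import urllib.request


def build_bulk_load_request(endpoint, s3_uri, data_format, iam_role_arn, region):
    """Build (but do not send) a bulk-load request for a Neptune cluster.

    Field names and the port are assumptions to verify against the
    Neptune loader documentation before use.
    """
    payload = {
        "source": s3_uri,            # S3 prefix or object holding the graph data
        "format": data_format,       # e.g. "csv" for Gremlin-style CSV files
        "iamRoleArn": iam_role_arn,  # role the cluster assumes to read from S3
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values only; none of these resources exist.
request = build_bulk_load_request(
    endpoint="example-cluster.example.com",
    s3_uri="s3://example-bucket/graph-data/",
    data_format="csv",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleNeptuneLoadRole",
    region="us-east-1",
)
```

Sending the request (for example with `urllib.request.urlopen`) would need network access to the cluster from inside its VPC, plus request signing if IAM authentication is enabled, which this sketch leaves out.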
Qx. Anything else you wish to add?
Ax. Thanks for taking the time to chat. As you can see, I believe that graphs are awesome and that they help customers use the relationships in their data to gain insights. For anyone reading who agrees, or who wants to find out, let me know at @b2ebs (Twitter). Also, we have a free trial for anyone who wants to get started with building graph applications.
………………………………………………
Brad Bebee, General Manager of Amazon Neptune
Brad is the General Manager of Amazon Neptune, AWS’s fully managed graph database service. He believes that graphs are awesome and they help customers use the relationships in their data to gain insights. Prior to joining AWS, he was the CEO of Blazegraph and was an active open-source contributor on the Blazegraph platform. He is a subject matter expert in graph and knowledge representation with experience ranging from the precursors of DARPA’s DAML program to large-scale data analytics. In his career, Brad has served as a CEO, CTO, CFO, managed operating divisions, and performed advanced technology development for commercial and public-sector customers.
Sponsored by AWS