On Graph Databases, Gen AI and the Cloud. Q&A with Jim Webber
Q1. Graph databases are increasingly being regarded as essential infrastructure for AI systems. Why?
Large Language Models (LLMs) are incredibly powerful for AI systems, but businesses have to balance the models’ creativity and tendency to hallucinate against the business need for factual data. The way they can do that is by providing context (in the form of a limited number of tokens) to the LLM when it processes requests.
The pattern that initially emerged to support this is called RAG – Retrieval Augmented Generation – where a database full of facts gets put on the retrieval path between an LLM and its user. Typically the outcome of this process is some high-quality tokens (facts) from the chosen database, which are used to provide context to the LLM and help improve its accuracy.
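To make the pattern concrete, here is a minimal sketch of that retrieval path in Python. The `search_index` and `llm_complete` names are hypothetical stand-ins for whatever retrieval store and LLM client an implementation actually uses:

```python
# Minimal sketch of the RAG retrieval path described above.
# `search_index` and `llm_complete` are hypothetical stand-ins for
# the retrieval store and LLM client you actually use.

def retrieve_facts(search_index, question: str, k: int = 5) -> list[str]:
    """Fetch the k most relevant facts for the user's question."""
    return search_index.query(question, top_k=k)

def answer_with_rag(search_index, llm_complete, question: str) -> str:
    facts = retrieve_facts(search_index, question)
    # The retrieved facts become the context tokens the LLM is grounded on.
    prompt = (
        "Answer using ONLY the facts below.\n"
        "Facts:\n" + "\n".join(f"- {fact}" for fact in facts) +
        f"\n\nQuestion: {question}"
    )
    return llm_complete(prompt)
```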
But you can supplement this another way. Instead of having just a plain relational database, many organisations have opted to upgrade to a knowledge graph, where all the topological/ontological features of the graph are brought forward to make sure they’re getting the best possible set of tokens for the LLM.
Knowledge graphs are proven to provide a significant boost in performance compared to other data models, so much so that graph databases are now regarded as the default choice for RAG. In fact, I would go as far as to say that GraphRAG is eating RAG.
Q2. What about specialised Vector Databases?
I get asked this a lot. Normally I would ask the questioner to reflect on why there aren’t string databases or int databases. It’s because a vector, after all, is just another data type and it’s unusual to have a whole database positioned around a single data type.
We’re seeing this in the market right now. In the last survey I saw a few days ago, MongoDB (a document database) and Neo4j (a graph database, and my employer) were both listed in the top 10 vector stores. I would opine that the other “pure” vector stores in that survey would probably be better at the vector operations than MongoDB and Neo4j. But the advantage that graph technology brings is that vectors are an on-ramp into the knowledge graph – it’s the correlating ID for the RAG era. Once you’re in the knowledge graph, you can leverage topology not just geometry (though geometry is still there – vectors *also* exist if you need them).
What do I mean by all this talk of geometry and topology? In a vector database we say things are similar if they are close to each other in a vector space. Think of it as similar to the trigonometry that you learned at school, albeit with more dimensions. But closeness doesn’t always mean sameness. Apple’s music business, Apple’s tech business, and Apple the fruit are all apple-ish, but are they similar? Depending on your encoding algorithm they might well be.
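For the geometric view, closeness is typically measured with cosine similarity. A toy sketch – real embeddings have hundreds of dimensions, and the values below are invented purely for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Closeness in vector space: 1.0 means same direction, 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real encoders use hundreds of dimensions.
apple_tech  = [0.9, 0.1, 0.2]
apple_fruit = [0.8, 0.3, 0.1]

# Both are "apple-ish", so an encoder may place them close together
# even though they mean very different things.
print(cosine_similarity(apple_tech, apple_fruit))  # high score, ~0.97
```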
Compare this to topology. Apple the fruit will be in a completely different part of the graph from Apple's tech and music businesses. Those two businesses might be close together, since they're both part of the same company, but the tech business would probably be closer to, say, Microsoft or Google, while the music business would be a few hops away from EMI (joined, no doubt, by the Beatles). Now – as Microsoft Research has shown – you can also increase your accuracy automatically by curating your knowledge graphs with algorithms. It's completely domain-agnostic and makes GraphRAG accuracy higher.
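For the topological view, here is a sketch of measuring distance in hops rather than angles, using the official neo4j Python driver against a hypothetical graph of :Entity nodes (the connection details and node names are invented for illustration):

```python
# Sketch of the topological view: distance is hops along relationships,
# not angles in a vector space. Assumes a hypothetical graph of :Entity
# nodes carrying a `name` property.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

def hops_between(a: str, b: str) -> int:
    """Shortest path length (up to 6 hops) between two named entities."""
    query = (
        "MATCH p = shortestPath((x:Entity {name: $a})-[*..6]-(y:Entity {name: $b})) "
        "RETURN length(p) AS hops"
    )
    with driver.session() as session:
        record = session.run(query, a=a, b=b).single()
        return record["hops"]

# Apple's music business sits a few hops from EMI; Apple the fruit does not.
print(hops_between("Apple Music", "EMI"))
```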
With all that in mind, with tongue-in-cheek I would ask back – what about specialised vector databases?
Q3. Early this year, ISO published a new standard for database languages, ISO GQL. What does this signify for the industry to you? How will this benefit the development of graph technology?
The ISO GQL effort has been several years in the making. It's only the second database language standard published by the working group after SQL, and a significant milestone for the industry as a whole.
Roughly speaking, the ISO working group took the much-loved Cypher query language and turned it into an industrial-strength standard. Impartial observers might remark that doing so was an unlikely outcome, since SQL has absorbed into its own body every other data model that has emerged (XML, objects, etc.). But SQL couldn't really absorb graphs (apart from the subset that is SQL/PGQ for legacy vendors) because they are different and important.
This has created disruption in the market that is potentially a good thing for graph fans, and an interesting one for SQL fans, because ultimately this is the SQL moment for graphs. SQL made relational databases ready for widespread deployment with a single standard API (in principle), giving buyers confidence that they could invest without technological lock-in, and staff programmes of work with professionals who know the standard language. The story is the same for graphs – now we’re no longer competing on proprietary APIs but on the quality of our product offerings. It’s a good thing, because it opens up our industry to innovation below the API while keeping users happy above it.
With ISO GQL levelling the playing field, I expect there to be a significant uptick of use in enterprise in particular, followed by a thinning of vendors – much as happened in the post-SQL relational world.
Q4. What are the reasons for enterprise demand for Neo4j’s cloud offering?
I think the market is being pushed in three ways that happen to be converging right now, hence the demand.
Firstly, Cloud is here. We’re now seeing workloads from highly regulated industries (e.g., banking) moving not just to the cloud but to cloud-based graph databases specifically. Although a relatively new phenomenon, it is a general trend that even the most critical workloads and data of regulated industries can be moved to the right kind of cloud infrastructure. Concerns about data being too sensitive have now been empirically addressed, and users have not been slow to notice.
Secondly, graphs are no longer a niche. If having an ISO standard for graphs wasn't enough, it's very clear that we and others are faring well in the market. Graphs are favoured by alpha-geeks, but are also a regular part of the enterprise toolkit. The vast majority of the Forbes Global 2000 and Fortune 500 are looking to solve very real problems with graph technology like ours – problems that can't be easily solved with other tech. They are not companies simply looking to geek out!
And finally, there's the rise of AI. We're now in a world where CEOs are demanding a technical strategy around LLMs, which is certainly a first in my career. No CEO ever asked for a middleware strategy or a Web app strategy! But given the eyes of the world are now trained on AI, practitioners are under pressure to show results, with only about 30% of projects making it into production according to IBM. This is where GraphRAG comes to the fore. If organisations are going to make sure their AI doesn't hallucinate in front of the CEO, they'd better integrate a knowledge graph into the system before the demo.
Q5. What is special about Neo4j AuraDB? A transformation of your Aura cloud database management system (DBMS) portfolio was announced earlier this year. What capabilities does this transformation encompass? What were the drivers behind the changes?
AuraDB is our fully managed, cloud-native graph database service offering scalability, high availability, and developer-friendly features for building graph-based applications. It brings graph data to you, wherever you are, at whatever level of use, without the operational baggage.
As well as providing support for more use cases (improved performance, AI workflows), it also offers a raft of features aimed at the enterprise, including the ability to integrate AuraDB with your preferred cloud ecosystem (e.g. AWS, Azure, Google) and with third parties like Spark and Kafka. For larger deployments it offers multi-tenancy, compliance tools, and centralised management, too.
It’s ultimately all part of our journey towards being “operationally boring”, which has been driven by extensive customer feedback over many years. Graphs are very exciting, but running them should ultimately be easy. Nobody wants to have an exciting database after all, especially at 4am when you’re on call.
Q6. Do you help support large language models (LLMs)? How?
Neo4j is deeply embedded within the LLM community. As I mentioned earlier, GraphRAG is the pre-eminent pattern in use today for integrating knowledge graphs with LLMs, and Neo4j is often the database that hosts the knowledge graph, given its extensive integrations with tools like LangChain or even our own Neo4j GraphRAG package for Python. At a more technical level, graphs are often used during the training of LLMs too, to improve understanding of entities, relationships, and context, while graph-based features (such as hierarchy or centrality measures) can also be used to engineer features for fine-tuning. To avoid retraining, Neo4j can equally be deployed as an external memory for LLMs, providing a scalable way to augment language models with up-to-date knowledge.
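As a rough sketch of that external-memory pattern: the vector hit acts as the on-ramp I described earlier, and the surrounding graph neighbourhood supplies the context. It assumes a Neo4j 5 vector index named chunk_embeddings over :Chunk nodes and a :MENTIONS relationship – both invented for illustration:

```python
# Sketch of Neo4j as external memory for an LLM: a vector search finds the
# entry points, then the graph neighbourhood supplies extra context tokens.
# Assumes a hypothetical vector index "chunk_embeddings" over :Chunk nodes
# and a :MENTIONS relationship to :Entity nodes.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

def graph_context(question_embedding: list[float], k: int = 5) -> list[str]:
    query = """
    CALL db.index.vector.queryNodes('chunk_embeddings', $k, $embedding)
    YIELD node, score
    // The vector hit is the on-ramp; expand into the graph for related facts.
    OPTIONAL MATCH (node)-[:MENTIONS]->(e:Entity)<-[:MENTIONS]-(related:Chunk)
    RETURN node.text AS text, collect(DISTINCT related.text)[..3] AS neighbours
    """
    context = []
    with driver.session() as session:
        for row in session.run(query, k=k, embedding=question_embedding):
            context.append(row["text"])
            context.extend(row["neighbours"])
    return context
```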
We know Neo4j enhances conversational AI with grounded responses and enriched summarisation supported by knowledge graphs. But what's really exciting to me is the symbiosis between graphs and LLMs. At one level, users can converse in natural language with Neo4j and have LLMs perform text-to-Cypher (where Cypher is our query language, and the immediate ancestor of ISO GQL) to create good-quality graph queries without having to learn Cypher at all.
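A minimal sketch of that text-to-Cypher loop, where `llm_complete` is again a hypothetical stand-in for an LLM client and the schema line is invented for illustration:

```python
# Sketch of the text-to-Cypher pattern: give the LLM the graph schema and a
# natural-language question, get back a Cypher query, then run it.
# `llm_complete` is a hypothetical stand-in for your LLM client.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

# Invented toy schema for illustration.
SCHEMA = "(:Person {name})-[:ACTED_IN]->(:Movie {title, released})"

def ask(llm_complete, question: str) -> list[dict]:
    prompt = (
        f"Graph schema: {SCHEMA}\n"
        f"Write a single Cypher query answering: {question}\n"
        "Return only the Cypher, no commentary."
    )
    cypher = llm_complete(prompt)
    with driver.session() as session:
        return [record.data() for record in session.run(cypher)]

# ask(llm_complete, "Which movies did Keanu Reeves act in after 2000?")
```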
At another level, LLMs can be used to actually create their own knowledge graphs. This seems insane to me, from an entropic point of view, but the empirical data is unequivocal. You can read a corpus, get the LLM to create a knowledge graph and then use that automatically generated knowledge graph for your GraphRAG and, amazingly, accuracy will improve!
On top of all that, Microsoft Research has famously demonstrated that if you run some automatic algorithms over your automatically generated graph (they chose hierarchical clustering) then accuracy improves even more! Finally, when it comes to writing good prompts for an LLM to create a good knowledge graph, Microsoft Research also has a great pattern for sampling a data set and getting an LLM to suggest good prompts that would result in the generation of a valuable knowledge graph from the subsequent large data set. What's amazing is that none of this is domain-specific – it's an easy-to-automate interplay between a knowledge graph and an LLM.
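To illustrate the graph-generation step, here is a hedged sketch in which a hypothetical `llm_extract_triples` function asks an LLM for (subject, relation, object) triples, and MERGE keeps the resulting graph deduplicated:

```python
# Sketch of LLM-driven knowledge-graph construction: the LLM extracts
# (subject, relation, object) triples from raw text, which are written to
# the graph. `llm_extract_triples` is a hypothetical stand-in that parses
# triples out of the model's output.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

def build_graph(llm_extract_triples, corpus: list[str]) -> None:
    with driver.session() as session:
        for document in corpus:
            for subj, rel, obj in llm_extract_triples(document):
                # MERGE deduplicates entities as the corpus is ingested.
                # Relationship types can't be parameterised in Cypher, so the
                # relation name is stored as a property here for simplicity.
                session.run(
                    "MERGE (s:Entity {name: $s}) "
                    "MERGE (o:Entity {name: $o}) "
                    "MERGE (s)-[:RELATED {type: $r}]->(o)",
                    s=subj, o=obj, r=rel,
                )
```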
Q7. What is a GraphRAG and why is it important?
I’ve mentioned GraphRAG a lot in my previous responses, but let me recap here.
RAG is where we use a data source to help make LLM responses more accurate by feeding them tokens of context.
GraphRAG is a step further, in that it uses a knowledge graph (usually hosted in a graph database) to make those tokens of context as high-quality as possible to get higher accuracy from the LLM.
Both RAG and GraphRAG are systems development patterns which are widely implemented in middleware and trivial to deploy – if you can deploy a Web app, you can deploy a RAG solution. But as mentioned, I firmly believe that GraphRAG is eating RAG.
Q8. Gartner noted the importance of GraphRAG in its Hype Cycle™ for AI in Software Engineering, 2024 report: “RAG techniques in an enterprise context suffer from problems related to the veracity and completeness of responses caused by limitations in the accuracy of retrieval, contextual understanding, and response coherence. KGs (Knowledge Graphs), a well-established technology, can represent data held within documents and the metadata relating to the documents. Combining both aspects allows RAG applications to retrieve text based on the similarity to the question and contextual representation of the query and corpus, improving response accuracy.” What is your take on this?
Our GraphRAG infrastructure of course understands text chunks – they’re the raw data for most LLM applications. So I’d agree with Gartner at the systems level – you have to show your provenance when asked, but if I were building such a system using Neo4j I’d lean on its affordances and save myself the pain of running multiple different databases.
Q9. What are your plans to add new GenAI features to Neo4j’s core offering and expand capabilities for mainstream cloud adoption?
We think that knowledge graphs and frameworks are going to be the foundation on which most GenAI systems will be built in the future. It’s a big reason why we ensure our tech is available where developers are working on those systems, and we are well integrated into the frameworks they are using. There is so much promise in GenAI, but only if we can balance its creativity with dependable data. We think that’s a plausible future, and we’re investing heavily in that area. Stay tuned.
Qx. Anything else you wish to add?
Many of us in the industry have followed your work for a long time – thanks for what you do!
……………………………………………

Jim Webber is Neo4j’s Chief Scientist, where he leads the research group and works on a variety of database topics including query languages and runtimes, temporality, streaming, scale, and fault-tolerance. He is also a Visiting Professor at Newcastle University, UK, and has co-authored several books on graph technology including Graph Databases – 1st and 2nd Editions and Building Knowledge Graphs.