Data Science and Graph Databases. Q&A with Alicia Frame
Q1. Can you explain why a data scientist would need a graph database?
Alicia Frame: Data scientists need graph databases when they have connected data. A graph database lets them reason about the relationships between their data points and leverage those relationships for inference.
Q2. What level of technical expertise is required by a data scientist to install a graph database?
Alicia Frame: We’re all about making graph databases approachable and accessible to everyone. To get started with Neo4j, you can download Neo4j Desktop and have a graph database up and running with a few clicks. If you don’t want to install anything, you can use Aura, our cloud offering, and get up and running with no setup at all.
Once you’ve got Neo4j installed, Cypher (our graph query language) and libraries like our Graph Data Science library are pretty simple to learn. We have browser guides like `:play movies` or `:play graph-data-science` to teach you the basics, and tons of resources on our developer pages.
Graph databases should be accessible to anyone who has graphy data. You don’t need a PhD in network science or to spend six months in a bootcamp just to learn to write a query.
Q3. What makes graph databases more effective at running algorithms?
Alicia Frame: Native graph databases are really efficient for running graph algorithms because they’re built to store graph data in the right shape. You don’t have to create the relationships between data by joining tables together; the relationships are front and center. By running algorithms directly in the database, you avoid the I/O cost of pulling your data out, converting it into another format for some external library, and then having to build an ETL pipeline to store your results and get them into production. We have data structures that are optimized for global graph traversals and aggregations, which lets you run computationally complex algorithms on enterprise-scale graphs (think tens of billions of nodes!).
In general, I think of graph algorithms as a data scientist’s secret superpower. They’re maybe not the first thing you learn in school, but once you understand how to use graph algorithms to describe the topology, connectivity, and structure of your connected data, you suddenly have a new, powerful toolkit available to you. Everyone knows how to run XGBoost or throw some data at a neural network, but with graphs, you have another aspect of your data to explore and leverage for improved model performance.
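Editor's illustration: the contrast between joining tables and following stored relationships can be sketched in a few lines of plain Python. This is a toy example under stated assumptions, not how Neo4j stores or traverses data. The `people` and `knows_rows` data and both function names are hypothetical, chosen only to show the shape of the two approaches.

```python
from collections import defaultdict

# Hypothetical toy data: people and "knows" relationships held as rows,
# the way a relational table would hold them.
people = {1: "Ann", 2: "Bob", 3: "Cy", 4: "Dee"}
knows_rows = [(1, 2), (2, 3), (3, 4)]


def friends_of_friends_join(person_id):
    """Relational style: self-join the relationship rows by matching keys."""
    result = set()
    for src1, dst1 in knows_rows:          # first hop
        if src1 != person_id:
            continue
        for src2, dst2 in knows_rows:      # second hop: scan again for a key match
            if src2 == dst1:
                result.add(people[dst2])
    return result


# Graph style: each node keeps a direct list of its neighbours,
# so a traversal follows those lists instead of matching keys.
adjacency = defaultdict(list)
for src, dst in knows_rows:
    adjacency[src].append(dst)


def friends_of_friends_graph(person_id):
    """Graph style: follow stored relationships two hops out."""
    return {
        people[second]
        for first in adjacency[person_id]
        for second in adjacency[first]
    }
```

Both functions return the same answer for the same person. The difference is that the join version rescans the relationship rows at every hop, while the adjacency version only touches the neighbours it actually visits.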
Q4. How would graph databases change the day-to-day work of a data scientist?
Alicia Frame: I’m not sure they drastically change the day-to-day work of a data scientist since you’re still pulling data together, cleaning it, querying it and building models. Graph databases make some of those steps easier and faster, though, because they’re really good at representing connected data. Instead of pulling from 15 databases and knitting the results together in Python, you can represent all your heterogeneous data in one place. And when you’re trying to query or subset your data into the right shape for your analysis, Cypher makes it really fast to pull everything together without wasting time figuring out complicated ER diagrams or waiting on table joins.
Being able to run graph algorithms within the graph database is a game changer. Neo4j’s Graph Data Science library lets you seamlessly run over 50 different graph algorithms without building ETL pipelines to get your RDBMS data into the right shape, and we don’t have the same memory limitations you run into with open-source network science libraries like NetworkX or igraph. By being able to run graph algorithms at scale, we provide a reliable way of incorporating graph-based features that describe network topology into machine learning pipelines. The point of using a graph database isn’t to change the way that you work, or blow up everything you spent the last six months building. It’s to help you do your job faster, better, and more reliably.
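Editor's illustration: the idea of graph-based features feeding a machine learning pipeline can be sketched as below. This is a minimal, plain-Python sketch and not the Graph Data Science library or its API. The `pagerank` and `add_graph_features` helpers, the damping factor of 0.85, the fixed iteration count, and the sample `edges` and `rows` are all assumptions made for illustration.

```python
from collections import Counter


def pagerank(edges, damping=0.85, iterations=50):
    """Simple power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def add_graph_features(rows, edges):
    """Attach in-degree and PageRank to each row of tabular attributes."""
    ranks = pagerank(edges)
    in_degree = Counter(dst for _, dst in edges)
    return {
        node: {
            **attrs,
            "in_degree": in_degree[node],
            "pagerank": ranks.get(node, 0.0),
        }
        for node, attrs in rows.items()
    }


# Hypothetical sample data: tabular attributes plus a small directed graph.
edges = [("a", "c"), ("b", "c"), ("c", "d")]
rows = {"a": {"age": 31}, "b": {"age": 45}, "c": {"age": 28}, "d": {"age": 52}}
features = add_graph_features(rows, edges)
```

Each row in `features` now carries its original attributes plus two columns describing its position in the network, which is the kind of input a downstream model such as XGBoost could consume alongside ordinary tabular features.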
Alicia Frame is currently the Lead Product Manager and Data Scientist at Neo4j, where she works on the company’s Product Management team to set the roadmap and strategy for developing graph-based machine learning tools. She earned her Ph.D. in computational biology from the University of North Carolina at Chapel Hill and a B.S. in biology and mathematics from the College of William and Mary in Virginia. She has over 8 years of experience in enterprise data science at BenevolentAI, Dow AgroSciences, and the EPA.