My Neo4j Summer Road Trip to the World of Healthcare [Part 1]

My Neo4j Summer Road Trip to the World of Healthcare [Part 1]

 By Yaqi Shi, Developer Relations | July 4, 2016

Nothing excites me more than getting on an adventurous road trip to explore the unknown. What’s even better is this adventure doesn’t involve getting sweaty and exhausted to drive long hours in the middle of nowhere.

Yes, I’m talking about a digital road trip. This summer I will rock the world of healthcare using Neo4jto see how each group of stakeholders connect to each other. You are welcome to join me if you are wondering what interesting discoveries I will find and curious how to use a graph database to uncover your own adventurous world. Now, let’s go!

Unlock the World: Data Is Your Magic Tool

No one can deny that among all the major industries in the U.S., healthcare is one of the most complicated in that it involves a wide range of stakeholders from providers to drug manufacturers to legislators as well as patients. Coming from a medical background, I am extremely interested in discovering how each party plays their roles and influence the whole industry.

My journey of exploring the healthcare industry started with a whiteboard. In a sunny laidback afternoon, me and my friends were discussing who plays a role in the healthcare industry and we started to draw each party with their relationships on a whiteboard. The graph started like this:

A Data Model for Healthcare

After 10 minutes of discussion, the graph grew into something look like this:

An Extended Data Model of the Healthcare Industry

Now this looks amazing! The left part of the graph demonstrates the procedures of a clinical encounter between a patient and a provider, the right part of the graph represents the relationships among different stakeholders.

My initial curiosity lies in the right side of the graph, and I am pretty sure there are a lot of public datasets that I can use. By create a graph database, I am hoping to answer interesting questions such as:

  1. Tracking prescription drug abuse history by states, hospitals and providers
  2. Tracking Medicare Part D costs by drugs, providers and states
  3. Looking for lobbying history specifically for individual drug manufacturers and lobbyist information

The next question is how am I going to model such a highly interconnected system to answer these questions in a relational database? While I’m sure I could do it and later query the database by joining all the tables together, I would have to write a 1-2 page long query (!) for each piece of information that I’m curious about.

Well, that doesn’t sound fun, and I don’t want to spend my summer in writing long and complicated SQL queries. But wait, I could totally represent the system in a graph database just like how I drew it all out on a whiteboard!

I spent the next few days teaching myself how to data modeling for a graph database using this awesome online tutorial. Check it out if you are interested, but I am going to show you what I did to model the data in a graph database.

First and Foremost, Draw out the Domain

Searching for the right data is like looking for a piece of a leaf in the wild jungle. By first drawing out the general graph of the system that I am trying to modeling, it gives me the idea of which types of trees I need to look for, so to speak.

In my project, I quickly identified data related to drugs, drug manufactures, providers and lobbyists that are essential for me to construct the system.

Second, Document What is Available

Knowing the targets, I quickly started my journey of data collection. There is no magic about how to quickly locate the targets other than using Google.

After a few days of data collection along with further studying about each stakeholder of the healthcare industry (Wikipedia always answers my questions), I found a few major sites where I can download the data that I need:

Data Type URL
Provider Prescriptions CSV https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Part-D-Prescriber.html
Provider Enumeration System CSV http://download.cms.gov/nppes/NPI_Files.html
FDA Drug Codes CSV http://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm
Drug Manufacturers CSV http://www.fda.gov/drugs/informationondrugs/ucm135778.htm
Lobbying Disclosures XML http://disclosures.house.gov/ld/ldsearch.aspx
Lobbying Contributions XML http://disclosures.house.gov/lc/lcsearch.aspx
Congress, Legislators & Bills Other https://www.govtrack.us/developers

Modeling data in a graph database requires a thorough understanding of both the context as well as the characters of the data. Having the data sources helps me to better define the domain of my model:

  • A healthcare provider prescribes drugs
  • A drug manufacturer produces drugs
  • The drug manufacturer hires a lobbying firm, which files a disclosure
  • The lobbying disclosure relates to one or several issues
  • The lobbying firm makes contributions to a Congress representative’s campaign

Third, Answer Two Questions to Refine Your Model

Having a definition of the domain is still not sufficient to identify the nodes and relationships in a graph model. It’s worth taking a moment to look in detail of the data that I found.

The general idea of mapping data from an RDBMS to a graph is that rows are usually treated as nodes, columns are properties and keys are relationships between nodes. Here are a few screenshots from the “Introduction to Graph Databases” online training course that demonstrate this data modeling process:

RDBMS Tables Represented in a Data Model
An RDBMS JOIN Table Represented in a Data Model
A Hybrid Data Model of an RDBMS and Graph Database
RDBMS JOIN Tables Are Equal to Relationships in a Graph Database
A Graph Database Data Model

When looking at the details of your datasets, answering these two questions is helpful to convert RDBMS tables into a graph data model:

    1. Do the two tables contain columns that share the same values?

      Answering this question is helpful to link two nodes together. The question is easy to answer if you could find the key columns in both tables. But what if the two tables don’t share a foreign key? Then it’s good to document which columns in the tables represent the same object.

      For example: Provider prescription data doesn’t contain FDA drug codes (which can be used as a foreign key). However, both provider prescription data and FDA drug code data contains columns of brand_name and generic_name, thus I’m able to link these two tables together by brand_name and generic_name.

  1. Which columns (i.e., properties) can be extracted as an independent node?

    Remember that columns are usually treated as properties? But sometimes you may want to create a node for some columns, and doing so can be beneficial when querying your data.

    For example: In the lobbying disclosure data, a lobbying firm can be treated as a property of the Disclosure node, but it can also be extracted as an independent node which has a relation FILED pointing to Disclosure and another relation LOCATED_IN to the State node.

    A good reason for treating the lobbying firm as a node rather than a property is that the lobbying firm contains a lot of information such as: namehouseID and location which can be stored as properties for that node. Another reason, perhaps more beneficial, is that it allows you to create a relationship between lobbying firm and state, thus offering the option to query “which lobbying firms and drug manufacturers are located in the same state”.

Last, Create the Graph Schema

Although the graph database is schemaless, it’s good to have a clear idea of how your data will be stored in a graph before you create the database, and solid schema documentation will be really helpful for later development. Here is the graph I created after studying both the context and the characters of the healthcare dataset:

Part 1 of Using Neo4j to Graph the Healthcare Industry

I created this graph in only a few minutes using the Arrows tool (created by Alistair Jones).

I hope you found the first week of our Neo4j healthcare road trip exciting! Starting next week, I will show you how to ETL the data into Neo4j. As you have seen, there are two types of data in my project: CSV and XML, I will start to extract XML data by calling APOC which stands for “Awesome Procedures On Cypher” and load the data into Neo4j by using the py2neo package in Python.

I’ve also linked the project’s GitHub repository here where you can find the detailed schema documentation, current code development and some other fun stuff.


About the Author

Yaqi Shi, Developer Relations

Yaqi Shi ImageYaqi Shi graduated from Capital Medical University in China with a bachelor’s degree of Medicine and is currently working on a Master’s of Science in Health Informatics at the University of San Francisco. She is interested in applying the most advanced technology to promote health. Before joining the Neo4j team, she volunteered with IT Support at the Asian and Pacific Islander Wellness Center and worked as a web assistant at the University of San Francisco School of Management.

Sponsored by  Neo4j

You may also like...