Getting Started with the Microsoft Concept Graph in Neo4j
By Cristina Escalante, COO of SilverLogic | March 8, 2017
What does the study of concepts (or categories, depending on your field of study) tell us about the human mind?
A result of the Probase research project, the Microsoft Concept Graph harnesses billions of web pages and search logs to build a huge graph of relations between words (like “apple”) and their concepts (like “fruit” or “hardware company”). Using this data, the team at Microsoft hopes to build better search engines, spell-checkers, recommendation engines, taxonomies and more.
This blog post will walk through how we can harness Neo4j to delve into the Single Instance Conceptualization dataset provided by the first release of the Microsoft Concept Graph in late 2016. Specifically, it will walk through importing the data into Neo4j using neo4j-import and using Cypher to determine when “apple” means a dessert instead of a particular company. I encourage you to read more in the excellent papers by the Microsoft team listed in the References at the end of this post.
Concepts embody our knowledge of the kinds of things there are in the world. Tying our past experiences to our present interactions with the environment, they enable us to recognize and understand new objects and events.
– Gregory Murphy, The Big Book of Concepts
The question is then: How do we pass human concepts to machines, and how do we enable machines to conceptualize?
– Microsoft Concept Graph
The Model
Concepts, Instances, and IS_A Relationships
The first release of the Microsoft Concept Graph can be summarized as a set of Instance vertices connected to a set of Concept vertices by weighted IS_A edges. Or, in Neo4j terms, Instance nodes connected to Concept nodes by IS_A relationships carrying a probability property that denotes the probability of the Instance belonging to the Concept. As a result, the relationships between an instance and its concepts show its distribution over the concept vector space. More scoring functions are included in the dataset’s API.
In the dataset, Instances are English noun phrases (NPs) and Concepts are the mental buckets or categories an NP may belong to. For example, instances of the concept snake include the words “boa,” “python,” and “viper,” which are also instances of the concepts artist (p=0.128), language (p=0.557), and car (p=0.107), respectively.
Download & Import: The V1 Release
Download link for the Microsoft Concept Graph: https://concept.research.microsoft.com/Home/Download
This first release, called Single Instance Conceptualization, provides the core Is_A data mined from billions of web pages. It contains 5,376,526 unique concepts, 12,501,527 unique instances, and 85,101,174 Is_A relations.
The data is in a single tab-separated file, 330MB zipped and 1.2GB uncompressed, which we can import with neo4j-import (so make sure you’re using the .tar version of Neo4j).
The data in the file is organized by Concept, Instance and Probability, like so:
Concept | Instance | Probability |
---|---|---|
state | california | 18062 |
supplement | msm glucosamine sulfate | 15942 |
Important: Note that the probability is out of 10^4.
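To make the scaling concrete, here is a minimal sketch that divides the third column by 10^4 so it reads as a value between 0 and 1. The function name to_probability is hypothetical, and the two sample rows simply reuse the weights this post quotes for “apple”; it assumes the scaling described above holds for the rows you feed it.

```shell
# Convert the raw third column (assumed to be a probability scaled by 10^4)
# into a decimal value. Reads tab-separated rows on stdin.
# Note: to_probability is a hypothetical helper, not part of any tool.
to_probability() {
  awk -F'\t' '{ printf "%s\t%s\t%.4f\n", $1, $2, $3 / 10000 }'
}

# Two rows for "apple", using the weights quoted in this post.
printf 'fruit\tapple\t6315\ncompany\tapple\t4353\n' | to_probability
```

The Cypher queries later in the post do the same division with tofloat(r.probability)/10000.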
This relatively simple graph can be represented like so:
- Instance {“name”: “apple”}
- Concept {“name”: “fruit”}
- Concept {“name”: “company”}
- (apple)-[IS_A {“probability”: 6315}]->(fruit)
- (apple)-[IS_A {“probability”: 4353}]->(company)
# a quick peek at the data
head -n 10 data-concept-instance-relations.txt
factor age 35167
free rich company datum size 33222
free rich company datum revenue 33185
state california 18062
supplement msm glucosamine sulfate 15942
factor gender 14230
factor temperature 13660
metal copper 11142
issue stress pain depression sickness 11110
variable age 9375

# extract concepts (this can take a few seconds)
echo "name:ID(Concept)" > concepts.txt
cat data-concept-instance-relations.txt | cut -d $'\t' -f 1 | sort | uniq >> concepts.txt

# extract instances (this can take a few seconds)
echo "name:ID(Instance)" > instances.txt
cat data-concept-instance-relations.txt | cut -d $'\t' -f 2 | sort | uniq >> instances.txt

# create the header row for the relationships import
echo $':END_ID(Concept)\t:START_ID(Instance)\tprobability' > is_a.hdr

# import into Neo4j
$NEO4J_HOME/bin/neo4j-import --into concepts.db --id-type string --delimiter TAB \
  --bad-tolerance 100000 --skip-duplicate-nodes true --skip-bad-relationships true \
  --nodes:Concept concepts.txt \
  --nodes:Instance instances.txt \
  --relationships:IS_A is_a.hdr,data-concept-instance-relations.txt

...
IMPORT DONE in 1m 27s 888ms.
Imported:
  17878053 nodes
  33377320 relationships
  51255373 properties
Peak memory usage: 410.36 MB

# Add two Constraints/Indexes
echo $' CREATE CONSTRAINT ON (i:Instance) ASSERT i.name IS UNIQUE;\n CREATE CONSTRAINT ON (c:Concept) ASSERT c.name IS UNIQUE;' | $NEO4J_HOME/bin/neo4j-shell -path concepts.db
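Before running the extraction on the full 1.2GB file, it can help to try the same steps on a handful of rows and inspect the node files by eye. The sketch below wraps the echo/cut/sort steps in a hypothetical helper, extract_nodes, and runs it on a made-up three-row sample in the same Concept/Instance/weight layout; the file names and sample rows are assumptions for illustration only.

```shell
# Build a node file: a header line followed by the unique values of one column.
# $1 = input TSV, $2 = column number, $3 = header line, $4 = output file
# Note: extract_nodes is a hypothetical helper mirroring the steps above.
extract_nodes() {
  echo "$3" > "$4"
  cut -f "$2" "$1" | sort -u >> "$4"
}

# Hypothetical three-row sample, written to a scratch directory.
workdir="$(mktemp -d)"
printf 'fruit\tapple\t6315\ncompany\tapple\t4353\nfood\tpie\t256\n' > "$workdir/sample.tsv"

extract_nodes "$workdir/sample.tsv" 1 "name:ID(Concept)" "$workdir/concepts_sample.txt"
extract_nodes "$workdir/sample.tsv" 2 "name:ID(Instance)" "$workdir/instances_sample.txt"

# "apple" appears twice in the sample but only once in the node file.
cat "$workdir/instances_sample.txt"
```

Each node file should hold the header plus one line per distinct name, which is what neo4j-import expects for the ID columns.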
Now that you’ve created the concepts.db graph, you can move it to $NEO4J_HOME/data/databases and update $NEO4J_HOME/conf/neo4j.conf to mount concepts.db:
# The name of the database to mount
dbms.active_database=concepts.db
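If you would rather script that config change than edit the file by hand, something along these lines may work. It is only a sketch: set_active_database is a hypothetical helper, it assumes the setting sits on a single line (possibly commented out with a leading #), and the demo runs against a scratch copy rather than your real neo4j.conf.

```shell
# Set dbms.active_database in a Neo4j config file.
# $1 = path to the config file, $2 = database directory name
# Note: set_active_database is a hypothetical helper, not a Neo4j command.
set_active_database() {
  tmp="$(mktemp)"
  # Drop any existing setting, commented out or not, then append the new one.
  grep -v '^#\{0,1\}dbms\.active_database=' "$1" > "$tmp" || true
  echo "dbms.active_database=$2" >> "$tmp"
  # Write back in place so the file keeps its original permissions.
  cat "$tmp" > "$1"
  rm -f "$tmp"
}

# Demo on a scratch file that mimics a commented-out default.
conf="$(mktemp)"
printf '# The name of the database to mount\n#dbms.active_database=graph.db\n' > "$conf"
set_active_database "$conf" concepts.db
cat "$conf"
```

Running it twice leaves a single dbms.active_database line, so it is safe to repeat.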
You should now be able to start the Neo4j Browser and see the Concept Graph.
Let’s Explore the Concept Graph
How is the word “apple” represented in the concept space?
MATCH (i:Instance {name:"apple"})-[r:IS_A]->(c:Concept)
RETURN i.name AS Instance, tofloat(r.probability)/10000 AS `is a(n)`, c.name AS Concept
ORDER BY r.probability DESC LIMIT 10;

https://www.dropbox.com/s/t971i04ej991xpk/apple_graph.svg?dl=0

+-------------------------------------+
| Instance | is a(n) | Concept        |
+-------------------------------------+
| "apple"  | 0.6315  | "fruit"        |
| "apple"  | 0.4353  | "company"      |
| "apple"  | 0.1152  | "food"         |
| "apple"  | 0.0764  | "brand"        |
| "apple"  | 0.0750  | "fresh fruit"  |
| "apple"  | 0.0568  | "fruit tree"   |
| "apple"  | 0.0483  | "crop"         |
| "apple"  | 0.0280  | "corporation"  |
| "apple"  | 0.0279  | "manufacturer" |
| "apple"  | 0.0257  | "firm"         |
+-------------------------------------+
How is the word “pie” represented in the concept space?
MATCH (i:Instance {name:"pie"})-[r:IS_A]->(c:Concept)
RETURN i.name AS Instance, tofloat(r.probability)/10000 AS `is a(n)`, c.name AS Concept
ORDER BY r.probability DESC LIMIT 10;

https://www.dropbox.com/s/7vmnsk1avl9j1b4/pie_graph.svg?dl=0

+------------------------------------+
| Instance | is a(n) | Concept       |
+------------------------------------+
| "pie"    | 0.0256  | "food"        |
| "pie"    | 0.0245  | "dessert"     |
| "pie"    | 0.0208  | "baked goods" |
| "pie"    | 0.018   | "bakery item" |
| "pie"    | 0.0105  | "baked good"  |
| "pie"    | 0.0097  | "item"        |
| "pie"    | 0.0087  | "product"     |
| "pie"    | 0.0054  | "food item"   |
| "pie"    | 0.0041  | "sweet"       |
| "pie"    | 0.0041  | "dish"        |
+------------------------------------+
10 rows
9321 ms
Adding some context: What Concepts represent both an apple and a pie?
We want to be very sure we’re talking about apple in the sense of the food, not Apple in the sense of the company.
MATCH (a:Instance {name:"apple"})-[r1:IS_A]->(c:Concept)<-[r2:IS_A]-(b:Instance {name:"pie"})
USING INDEX a:Instance(name)
USING INDEX b:Instance(name)
RETURN c.name AS Concept, tofloat(r1.probability)*tofloat(r2.probability)*10^-8 AS prob
ORDER BY prob DESC LIMIT 10;

https://www.dropbox.com/s/s27h2uh16k4mwyw/apple_pie.svg?dl=0

+-------------------------------------+
| Concept     | prob                  |
+-------------------------------------+
| "food"      | 0.00294912            |
| "item"      | 2.4056000000000001E-4 |
| "product"   | 1.5747E-4             |
| "fruit"     | 6.315E-5              |
| "snack"     | 3.959E-5              |
| "food item" | 3.51E-5               |
| "dessert"   | 3.43E-5               |
| "name"      | 7.92E-6               |
| "dish"      | 4.92E-6               |
| "case"      | 3.92E-6               |
+-------------------------------------+
10 rows
15 ms
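The prob column is just the product of the two scaled weights: for “food”, 1152 × 256 × 10^-8 = 0.00294912, matching the 0.1152 and 0.0256 seen in the earlier tables. The sketch below reproduces that join directly on tab-separated rows with awk, as a way to see what the query computes. The helper name joint_probability and the sample file are hypothetical; the five sample weights are the ones quoted in this post.

```shell
# For two instances, list the concepts they share together with the product
# of their weights, each rescaled from the 10^4 scale (hence the 1e-8).
# $1 = TSV file (Concept, Instance, weight), $2 and $3 = instance names
# Note: joint_probability is a hypothetical helper for illustration.
joint_probability() {
  awk -F'\t' -v a="$2" -v b="$3" '
    $2 == a { pa[$1] = $3 }   # weights of the first instance, keyed by concept
    $2 == b { pb[$1] = $3 }   # weights of the second instance, keyed by concept
    END {
      for (c in pa) if (c in pb) printf "%s\t%.8f\n", c, pa[c] * pb[c] * 1e-8
    }
  ' "$1" | sort -t "$(printf '\t')" -k2,2nr
}

# Sample rows using weights quoted in this post.
sample="$(mktemp)"
printf 'fruit\tapple\t6315\ncompany\tapple\t4353\nfood\tapple\t1152\nfood\tpie\t256\ndessert\tpie\t245\n' > "$sample"

joint_probability "$sample" apple pie
```

In this small sample only “food” is shared, so “fruit”, “company” and “dessert” drop out, which is exactly the disambiguating effect the query relies on.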
Adding some context: What instances are similar to both apples and pies?
We can even go further and check the instances of those concepts and aggregate by them, instead of relying only on the probabilities stored on the IS_A relationship, allowing us to deduce that things that are both apples and pies are bread-like fruit-based cakes.
MATCH (a:Instance {name:"apple"})-[:IS_A]->(c:Concept)<-[:IS_A]-(b:Instance {name:"pie"})
USING INDEX a:Instance(name)
USING INDEX b:Instance(name)
MATCH (c)<-[:IS_A]-(o:Instance)
WHERE o <> a and o <> b
WITH o, count(*) AS freq
ORDER BY freq DESC LIMIT 10
RETURN o.name AS Instance, freq;

+--------------------+
| Instance    | freq |
+--------------------+
| "bread"     | 115  |
| "fruit"     | 113  |
| "cake"      | 110  |
| "cookie"    | 109  |
| "chocolate" | 102  |
| "cheese"    | 99   |
| "vegetable" | 93   |
| "egg"       | 93   |
| "banana"    | 91   |
| "fish"      | 91   |
+--------------------+
10 rows
4900 ms
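Here freq counts, for every other instance, how many of the concepts shared by “apple” and “pie” it also belongs to. As a rough illustration of that counting logic outside Neo4j, the sketch below does the same thing in two awk passes over a tab-separated file. The helper name shared_neighbors and every row of the sample data are made up for the example, so the counts it prints say nothing about the real dataset.

```shell
# Count, for each other instance, how many concepts it shares with BOTH
# of the two given instances.
# $1 = TSV file (Concept, Instance, weight), $2 and $3 = instance names
# Note: shared_neighbors is a hypothetical helper for illustration.
shared_neighbors() {
  awk -F'\t' -v a="$2" -v b="$3" '
    NR == FNR {                 # first pass: record the concepts of a and of b
      if ($2 == a) ca[$1] = 1
      if ($2 == b) cb[$1] = 1
      next
    }
    # second pass: count other instances under concepts common to a and b
    ($1 in ca) && ($1 in cb) && $2 != a && $2 != b { freq[$2]++ }
    END { for (o in freq) printf "%s\t%d\n", o, freq[o] }
  ' "$1" "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Entirely hypothetical sample rows.
sample="$(mktemp)"
printf 'food\tapple\t1\nfood\tpie\t1\nfood\tbread\t1\nfood\tcake\t1\n' > "$sample"
printf 'item\tapple\t1\nitem\tpie\t1\nitem\tbread\t1\n' >> "$sample"
printf 'fruit\tapple\t1\nfruit\tbanana\t1\ncompany\tapple\t1\ncompany\tmicrosoft\t1\n' >> "$sample"

shared_neighbors "$sample" apple pie
```

In this sample “bread” sits under both shared concepts and “cake” under one, while “banana” and “microsoft” never appear because their concepts are not shared with “pie”.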
Conclusion
Although the Microsoft Concept Graph is currently a bit sparser than other concept graphs online, the research that created it is a valuable addition to the study of taxonomy and language.
References
- Zhongyuan Wang, Haixun Wang, Ji-Rong Wen, and Yanghua Xiao, An Inference Approach to Basic Level of Categorization, in ACM International Conference on Information and Knowledge Management (CIKM), ACM – Association for Computing Machinery, October 2015.
- Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Zhu, Probase: A Probabilistic Taxonomy for Text Understanding, in ACM International Conference on Management of Data (SIGMOD), May 2012.