Semantics 101

Originally published on the MarkLogic website.


Semantics \si-mæn-tiks\ noun

Broadly speaking, semantics is the study of meaning either through language or logic. In the realm of knowledge management, the goal is to embed words or logically link to the most contextually relevant information so as to anticipate the information needs of the user. To achieve this goal in the context of IT, we leverage semantic technologies, a broad array of tools and techniques that includes both Semantic Enrichment and the Semantic Web. MarkLogic is a database platform that enables you to harness the power of the Semantic Web by including an RDF Triple Store that you can query with SPARQL. MarkLogic can store billions of triples, and when those facts are linked together, they create context and relevance.


“Semantic Web Technology” versus “Semantic Technology”?

MarkLogic Semantics falls in the category of “Semantic Web Technologies”, which is slightly different from what experts refer to as “Semantic Technologies.” However, the distinction is often not clearly made in the market, and even with MarkLogic, it gets somewhat confusing because MarkLogic does work with partners that provide “Semantic Technologies.” For example, MarkLogic partners with Smartlogic to form a complete semantics stack, with MarkLogic storing and managing the triples data, and Smartlogic providing entity enrichment and ontology management. That said, we like to be accurate with the “semantics” of semantics.

Semantic Web Technologies

Semantic Web Technologies refer to a family of specific W3C standards that allow an exchange of related data, whether it resides on the Web or within organizations. They require a flexible data model (RDF), a query language (SPARQL), and a common serialization or markup format (e.g., RDFa, Turtle, N-Triples). RDF allows you to break knowledge down into small statements called triples, which are linked together in a graph-like representation that has no fixed hierarchy.

Semantic Technologies

Semantic Technologies are a variety of linguistic tools and techniques, such as Natural Language Processing (NLP) and artificial intelligence, that analyze unstructured text in order to classify and relate it. By identifying the parts of speech (a subject from a predicate, etc.), powerful algorithms can pinpoint entities (people, places, things, time, etc.), concepts, and categories. Once analyzed, text can be further enriched with vocabularies, dictionaries, taxonomies, and ontologies, so that assets can be found regardless of which representation is used (e.g., Coca-Cola, Coke, KO).
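
As a rough illustration of vocabulary-based enrichment, the sketch below maps several surface forms to one concept so that a search for the concept finds every document that mentions any of them. The vocabulary, the concept label "Coca-Cola Company", the sample sentences, and the `enrich` helper are all assumptions made for this example. Real enrichment tools use context to disambiguate terms; this naive token lookup does not.

```python
# Hypothetical vocabulary: each surface form maps to one canonical concept.
VOCABULARY = {
    "coca-cola": "Coca-Cola Company",
    "coke": "Coca-Cola Company",
    "ko": "Coca-Cola Company",
}

def enrich(text):
    """Return the set of canonical concepts mentioned in the text."""
    concepts = set()
    for token in text.lower().split():
        token = token.strip(".,;:!?()\"'")  # drop surrounding punctuation
        if token in VOCABULARY:
            concepts.add(VOCABULARY[token])
    return concepts

# Invented sample documents.
documents = [
    "Coke sales rose in the third quarter.",
    "Analysts upgraded KO on Tuesday.",
    "Weather was mild in Atlanta.",
]

tagged = [(doc, enrich(doc)) for doc in documents]
hits = [doc for doc, concepts in tagged if "Coca-Cola Company" in concepts]
print(hits)
```

Both the "Coke" and the "KO" documents are returned for the single concept, even though neither contains the string "Coca-Cola".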


What are examples of RDF and SPARQL?

Example of RDF

RDF (Resource Description Framework) is the common data format for Linked Data, and using this RDF standard liberates data from the containers it comes in, making it available for more automated processes. The international standards organization W3C recommends RDF, and has been guiding the standards for RDF since 2004. RDF is based on using HTTP URIs to look up and describe resources.

<http://example.org/dir/js> <http://xmlns.com/foaf/0.1/name> "John Smith" . 
<http://example.org/dir/js> <http://xmlns.com/foaf/0.1/livesIn> "England" .
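
To show how little structure a triple carries, the two statements above can be read into plain Python tuples. This is a minimal sketch, not a conforming N-Triples parser: the regular expression and the `parse_line` helper are assumptions made for this example, and they handle only IRIs and simple quoted literals.

```python
import re

# Matches one simplified statement: <subject> <predicate> object .
# The object is either an <IRI> or a plain "quoted literal".
TRIPLE_RE = re.compile(
    r'^\s*<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.\s*$'
)

def parse_line(line):
    """Return (subject, predicate, object) for one statement, or None."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    subject, predicate, iri_object, literal_object = match.groups()
    obj = iri_object if iri_object is not None else literal_object
    return (subject, predicate, obj)

data = '''
<http://example.org/dir/js> <http://xmlns.com/foaf/0.1/name> "John Smith" .
<http://example.org/dir/js> <http://xmlns.com/foaf/0.1/livesIn> "England" .
'''

# Blank lines do not match the pattern and are skipped.
triples = [t for t in map(parse_line, data.splitlines()) if t is not None]
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```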

Example of a SPARQL Query

W3C also defines SPARQL as a standard query language for RDF. SPARQL was first defined as the standard semantics query language in 2008, and according to W3C Director, Tim Berners-Lee, “Trying to use the Semantic Web without SPARQL is like trying to use a relational database without SQL.”

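SELECT ?person ?place
WHERE
{
  # The predicate IRIs were lost from this listing; the two below are
  # illustrative stand-ins, not necessarily the original ones.
  ?person <http://xmlns.com/foaf/0.1/livesIn> ?place .
  ?place <http://example.org/isIn> "England" .
}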
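
To make the idea of a graph pattern concrete, here is a small Python sketch of the kind of join such a query expresses: find every person who lives in a place that is in England. The sample data, the predicate names `livesIn` and `isIn`, and the `match` and `query` helpers are assumptions invented for this illustration. A real SPARQL engine does far more (indexes, optimization, filters, and so on).

```python
# Tiny in-memory triple list; all names here are hypothetical sample data.
triples = [
    ("js", "livesIn", "London"),
    ("ab", "livesIn", "Paris"),
    ("London", "isIn", "England"),
    ("Paris", "isIn", "France"),
]

def match(pattern, store):
    """Yield variable bindings for one (s, p, o) pattern.

    Terms starting with '?' are variables; anything else must match exactly.
    """
    for triple in store:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated within one pattern must bind consistently.
                if bindings.get(term, value) != value:
                    break
                bindings[term] = value
            elif term != value:
                break
        else:
            yield bindings

def query(patterns, store):
    """Join several patterns, keeping only mutually consistent bindings."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Substitute already-bound variables before matching.
            bound = tuple(solution.get(term, term) for term in pattern)
            for bindings in match(bound, store):
                next_solutions.append({**solution, **bindings})
        solutions = next_solutions
    return solutions

results = query(
    [("?person", "livesIn", "?place"), ("?place", "isIn", "England")],
    triples,
)
for row in results:
    print(row["?person"], row["?place"])
```

The shared variable `?place` is what links the two patterns: only places that satisfy both survive the join.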

For more information, visit the MarkLogic Developer website to do an exercise with SPARQL.


What’s a real-world example of semantics?

Semantics is sometimes hard to grasp because it is usually fairly new to people, even though the basic technology itself (i.e., storing data as triples) has been around for over a decade. In fact, you probably encounter semantics every day without knowing it. Google reads RDFa and other Linked Data markup embedded in Web pages and uses it to automatically provide “rich snippets” of information in many search results. For example, a search for “Germany World Cup” shows top-level results driven by semantics. The data resides on other Web pages but is readable by Google, including metadata about the team, live match results, and video content.


What’s an example of an app using MarkLogic semantics?

Applied Relevance created an application on MarkLogic called Epinomy, a time series search engine for world economic data.

The challenge Epinomy addressed is figuring out how to combine time series data with other unstructured and constantly changing data such as global economic indicator data. For example, the World Bank publishes data for poverty, inflation, and GDP in the SKOS SDMX Data Cube format, a triples format for tracking economic indicators and doing statistical analysis. But there are lots of other economic data sources that are not already formatted for easy analysis. With relational databases, this challenge is difficult or even impossible to solve, but with MarkLogic Semantics, new data can be incorporated in days, not months.

Consider the difficulty in trying to search across various data sources for a common term such as “Euro zone.” It means something different from “European Union”, “Europe OECD”, or “Europe.” Or what about a term such as “Small States,” which is different from “Least Developed Countries,” “Lower Middle Income,” or “Low & Middle Income.” Semantics provides the ability to map all of these terms so that a user can perform natural language searches.

Semantics also allows the application to quickly create facets without pre-defining what they should be. Facets, or the categories of results typically grouped down a left-hand column on a webpage, are created in Epinomy completely using triples. It happens dynamically on the fly, is dependent on the content loaded, and is presented fast to the user.
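
The idea of facets that emerge from the data can be sketched in a few lines of Python: group the triples by predicate and count the distinct values, without declaring the facet names in advance. The series identifiers, predicate names, and the `build_facets` helper are hypothetical. This is not how Epinomy or MarkLogic implements facets, only an illustration of the principle.

```python
from collections import Counter, defaultdict

# Hypothetical triples describing economic data series.
triples = [
    ("series1", "region", "Euro zone"),
    ("series1", "indicator", "GDP"),
    ("series2", "region", "Euro zone"),
    ("series2", "indicator", "Inflation"),
    ("series3", "region", "Small States"),
    ("series3", "indicator", "GDP"),
]

def build_facets(store):
    """Group object values by predicate and count how often each occurs."""
    facets = defaultdict(Counter)
    for _subject, predicate, obj in store:
        facets[predicate][obj] += 1
    return facets

facets = build_facets(triples)
for name, counts in facets.items():
    print(name, dict(counts))
```

Loading a triple with a new predicate would produce a new facet automatically, because nothing in `build_facets` names the facets ahead of time.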

Another challenge is when the same economic data is released multiple times. These multiple “vintages” of the same data would typically be a headache to deal with. Semantics handles the various vintages of data by simply creating new sets of triples tagged as “vintage.” And, the natural language search was also designed so that a search can specifically return those vintage values.


What problems does semantics solve?

Existing Problems

  • Problem: The Web is built to link documents, but not data.
    The Web is a network of HTML documents linked together using HTTP. With this simple framework, the Web has unleashed information like nothing else before. But, information is locked in the Web pages where it was published. And, the confusion is compounded by the sheer increase in data volumes. For this reason, a Google search can deliver millions of results and yet still fail to answer the question asked.
  • Problem: There is no context for understanding data.
    Consider the word “cook”—the computer does not know whether you mean a chef, the act of cooking, or the Cook Islands. And, even if the computer did know that you meant a chef, it would not know that you would also be interested in the particular restaurants that the chef works at in a particular city.
  • Problem: Applications create walled gardens within organizations.
    Applications have historically been built on relational databases with a specific use in mind, creating walled gardens of data that prevent the data from being used for anything beyond its original design. For example, just imagine the difficult task of trying to mash up data from bank statements, mobile phone usage, weather data, and a friend list from Facebook. Similar examples get replayed again and again within organizations around the world.

Semantics as a Solution

  • Solution: Linking data using a universal standard.
    Using RDF as a standard to link data creates a structure that allows discovery of facts that can be universally understood. This means that an application can communicate with another application without a human middle-man. A perfect example is the Google search that returns top-level facts that the user wants to know rather than a list of links to documents.
  • Solution: Linking data within ontologies.
    Semantic ontologies provide context. Ontologies—collections, categories, hierarchies, or taxonomies—relate data by defining different classes for events, people, or things. For example, consider a classification such as plants, which has sub-classifications such as flowers and shrubs. Then consider a “rose” in this context, which means the flower, and not the actress Rose Byrne. In addition to helping build better navigation and search experiences, ontologies are also helpful in publishing more relevant content and making sense of metadata.
  • Solution: Linking data together to be searched holistically.
    Semantics is predicated on the relationships between data, which makes it an ideal tool to link and search across both structured and unstructured data by using its standard query language, SPARQL. This is particularly useful when you want to create sophisticated queries that span multiple data sets. An example would be, “Provide all of the health insurance beneficiaries that earned over $100,000 and lived in Atlanta, GA in the year 2010”—combining data about insurance, income, geography, and time.
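
The “rose” example from the ontology solution above can be sketched as a small class hierarchy. Given a context class such as “plant”, the lookup keeps only the entities whose class falls under that context. The class names, entity identifiers, and helper functions are assumptions made for this illustration, not part of any standard ontology.

```python
# Hypothetical mini-ontology: child class -> parent class.
SUBCLASS_OF = {
    "flower": "plant",
    "shrub": "plant",
    "actress": "person",
}

# Two distinct entities that share the label "rose": id -> (label, class).
ENTITIES = {
    "rose_flower": ("rose", "flower"),
    "rose_byrne": ("rose", "actress"),
}

def is_a(cls, ancestor):
    """Walk up the hierarchy and report whether cls falls under ancestor."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False

def resolve(label, context_class):
    """Return the entities with this label that belong to the context class."""
    return [
        entity_id
        for entity_id, (entity_label, cls) in ENTITIES.items()
        if entity_label == label and is_a(cls, context_class)
    ]

print(resolve("rose", "plant"))
print(resolve("rose", "person"))
```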

RDF Triple Stores versus Graph Databases?

MarkLogic includes an RDF Triple Store, which you could say is a type of graph database because both store linked data. But, the distinction is often muddy and the question often comes up as to how RDF Triple Stores are different from graph databases. There are many similarities, and when looking at a data visualization, it is often impossible to know what type of database is even being used because they can produce similar looking end user experiences.

Both graph databases and Triple Stores are designed to store linked data but they are different tools and should be used for different purposes. In general, Triple Stores are better at finding things (and sub-graphs) in the graph, and graph databases are better for doing queries or analysis that concern the whole graph. A second major difference is that Triple Stores are built around standards whereas graph stores are not.

How They Are Similar

  • Graph databases and RDF Triple Stores store linked data and focus on the relationships between the data. Data points are called nodes, and the relationships between data points are called edges
  • A web of nodes and edges can be put together into interesting visualizations—a defining characteristic of graph databases and Triple Stores

How They Are Different

  • RDF Triple Stores are better at finding things (and sub-graphs) in the graph, and graph databases are better for doing queries or analysis that concern the whole graph
  • RDF Triple Stores are built around common standards defined by W3C for the data format (RDF) and query language (SPARQL) whereas graph databases have no standard language or data model. Different graph databases support different languages, such as G, GraphLog, GOOD, SoSQL, BiQL, SNQL, TinkerPop, Gremlin, and more. One popular graph database, Neo4j, can store RDF triples and use SPARQL but generally focuses on its own proprietary language, Cypher
  • RDF Triple Stores focus solely on storing rows of RDF triples whereas graph databases can store various types of graphs, including undirected graphs, weighted graphs, hypergraphs, etc.
  • RDF Triple Stores are edge-centric whereas graph databases are node, or property, centric. RDF Triple Stores are really just a list of graph edges, many of which are ‘properties’ of a node and not critical to the graph structure itself
  • With RDF Triple Stores, the cost of traversing an edge tends to be logarithmic (MarkLogic avoids this issue using its triple index and shared-nothing architecture), whereas graph databases are better optimized for graph traversals (degrees of separation or shortest path algorithms)
  • RDF Triple Stores provide inferences on data but graph databases do not (e.g., an inference would be, “If John lives in London, and London is in England, then John lives in England”)
  • RDF Triple Stores are more synonymous with the “semantic web” and the standardized universe of knowledge being stored as RDF triples on DBpedia and other sources whereas graph databases are seen as less universal and more purpose-built for specific applications
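
The inference mentioned in the list above (“If John lives in London, and London is in England, then John lives in England”) can be sketched as a single rule applied repeatedly until no new facts appear. The predicate names `livesIn` and `isIn` and the `infer_lives_in` function are assumptions for this example. It is a hand-written illustration of forward chaining, not a description of how any particular Triple Store implements inference.

```python
def infer_lives_in(triples):
    """Apply one rule until no new facts appear:

    (x livesIn y) and (y isIn z)  =>  (x livesIn z)
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        # Snapshot the current facts as lists so the set can grow safely.
        lives_in = [(s, o) for s, p, o in facts if p == "livesIn"]
        is_in = [(s, o) for s, p, o in facts if p == "isIn"]
        for person, place in lives_in:
            for inner, outer in is_in:
                new_fact = (person, "livesIn", outer)
                if inner == place and new_fact not in facts:
                    facts.add(new_fact)
                    changed = True
    return facts

facts = infer_lives_in([
    ("John", "livesIn", "London"),
    ("London", "isIn", "England"),
])
print(("John", "livesIn", "England") in facts)
```

The inferred triple was never stored explicitly; it follows from the two stated facts and the rule.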

What makes MarkLogic’s Triple Store different?

Documents + Data + Triples

MarkLogic Semantics adds the capabilities of an Enterprise Triple Store to its existing document store and database. One of the key reasons that semantics has not become more popular is that most other Triple Stores are stand-alone technologies and nothing else. MarkLogic is a true database platform that can natively store and query documents (JSON, XML, geospatial data) and relationships (RDF Triples). This is a powerful combination that allows you to unify your data into a single system and draw powerful insights and inferences from your data more quickly at scale.

Combination Queries

The unique ability to handle multiple data types means you can do combination queries that span across different data types. With MarkLogic you can query across documents, facts, and metadata, and present results “in context.” This is done over REST or from a server-side program written in JavaScript or XQuery.

For example, imagine you work in a call center. Someone calls and says, “Some maniac in a blue van just tried to run me down. I got the first three letters of his license plate: ABC.” You could look up ABC* in the vehicle licensing database. That would give you lots of results, and probably wouldn’t help much. The question is how we can use all of the additional information we have to narrow our results. If there really is a maniac in a blue van, there’s probably another incident report that will give you more information. That incident report would mention a “blue van” and it would be around the same time and place. And, if it has the license plate, it’ll start with “ABC.” But, that’s a really hard query!

With MarkLogic you can query across all these kinds of information—a date range, a geospatial query, a full-text search, and a triples query, all in a single simple efficient query. You can use all the context you have to search across all your information to get just exactly what you need to complete your task.
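
The logic of such a combination query can be sketched in plain Python as one filter that applies all the constraints together: a time window, a distance limit, a text match, and a license plate prefix. This does not use any MarkLogic API. The report records, field names, thresholds, and the `find_related` function are invented for illustration, and the plate prefix check stands in for the structured part of the query.

```python
from datetime import datetime, timedelta
from math import hypot

# Hypothetical incident reports; every field name and value is invented.
reports = [
    {"id": 1, "time": datetime(2014, 7, 1, 9, 5), "location": (51.50, -0.12),
     "text": "Blue van ran a red light", "plate": "ABC 1234"},
    {"id": 2, "time": datetime(2014, 7, 1, 9, 20), "location": (51.51, -0.13),
     "text": "Red car parked illegally", "plate": "ABC 9876"},
    {"id": 3, "time": datetime(2014, 6, 1, 9, 0), "location": (51.50, -0.12),
     "text": "Blue van speeding", "plate": "ABD 5555"},
]

def find_related(reports, when, where, phrase, plate_prefix,
                 window=timedelta(hours=1), radius=0.05):
    """Return the reports that satisfy all four constraints at once."""
    matches = []
    for report in reports:
        near_in_time = abs(report["time"] - when) <= window
        # Crude planar distance in degrees, good enough for a sketch.
        dx = report["location"][0] - where[0]
        dy = report["location"][1] - where[1]
        near_in_space = hypot(dx, dy) <= radius
        mentions_phrase = phrase.lower() in report["text"].lower()
        plate_matches = report["plate"].startswith(plate_prefix)
        if near_in_time and near_in_space and mentions_phrase and plate_matches:
            matches.append(report)
    return matches

call_time = datetime(2014, 7, 1, 9, 30)
call_location = (51.50, -0.12)
related = find_related(reports, call_time, call_location, "blue van", "ABC")
print([r["id"] for r in related])
```

Each constraint alone matches several reports; only the combination narrows the result to the one report worth reading.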

Enterprise Features

MarkLogic’s Triple Store comes with all of the features MarkLogic has built and proven over the past decade—including ACID transactions, scalability and elasticity, high availability and disaster recovery, and certified security.

With semantics in particular, MarkLogic’s certified security provides a unique differentiator. MarkLogic carries a Common Criteria Security Certification and runs in classified government systems. Its role-based security model is extremely granular. With Semantics, it enables you to define exactly which users are able to see which triples.
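
Conceptually, triple-level visibility means that each triple carries the set of roles allowed to read it, and a query only ever sees the triples a user's roles permit. The sketch below illustrates that idea only. The role names, sample triples, and `visible_triples` function are assumptions, and it does not represent MarkLogic's actual security implementation or API.

```python
# Hypothetical triples, each tagged with the roles allowed to read it.
secured_triples = [
    (("John", "livesIn", "London"), {"analyst", "admin"}),
    (("John", "salary", "100000"), {"admin"}),
]

def visible_triples(store, user_roles):
    """Return the triples readable by a user holding any of the given roles."""
    roles = set(user_roles)
    return [triple for triple, allowed in store if allowed & roles]

print(visible_triples(secured_triples, ["analyst"]))
print(visible_triples(secured_triples, ["admin"]))
```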

Massive Scale

MarkLogic has a specialized triple index to ensure querying triples is fast. The triple index is actually one of more than 30 different indexes, and MarkLogic has focused on indexing content for search since its first release over ten years ago.

MarkLogic also has a triple cache to help better manage the use of memory to ensure optimal performance at scale. Some Triple Stores insist you store the whole Triple Store index in memory, but the MarkLogic Triple Store does not need to fit in memory.

Both of these features, the specialized triple index and the triple cache, make MarkLogic a scalable, elastic, high-performance Triple Store. With other Triple Stores, volume quickly becomes an issue. Some Triple Stores try to scale with clustered systems, but for parallel query only. That is, they can have three-node clusters, but only if each node has the same data on it. MarkLogic’s shared-nothing architecture helps it cluster easily, whether storing and querying documents or triples.

MarkLogic can store about 1 billion triples per node, at about 350 bytes per triple, and can scale to hundreds of billions of triples.
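
As a back-of-the-envelope check on these figures, the short calculation below multiplies them out. It is a rough sizing sketch only: it ignores replication, other indexes, and any operational overhead, and the 100 billion triple total is a hypothetical example rather than a figure from MarkLogic.

```python
TRIPLES_PER_NODE = 1_000_000_000   # "about 1 billion triples per node"
BYTES_PER_TRIPLE = 350             # "about 350 bytes per triple"

def storage_per_node_gb():
    """Approximate triple storage per node, in decimal gigabytes."""
    return TRIPLES_PER_NODE * BYTES_PER_TRIPLE / 1e9

def nodes_needed(total_triples):
    """Nodes required at the per-node figure, rounded up."""
    return -(-total_triples // TRIPLES_PER_NODE)

print(storage_per_node_gb())            # 350.0
print(nodes_needed(100_000_000_000))    # 100
```

At these rates, one node holds roughly 350 GB of triple data, and a hypothetical 100 billion triples would need on the order of 100 nodes.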


Where can I learn more?

We have an entire section in our documentation devoted to developing applications with semantics, as well as tutorials to walk you through development.

Documentation: Semantics Developer’s Guide
Tutorial: SPARQL 101

Sponsored by MarkLogic