On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen
“The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point”
–Kingsley Uyi Idehen.
I have interviewed Kingsley Idehen founder and CEO of OpenLink Software. The main topics of this interview are: the Semantic Web, and the Virtuoso Hybrid Data Server.
Q1. The vision of the Semantic Web is the one where web pages contain self describing data that machines will be able to navigate them as easily as humans do now. What are the main benefits? Who could profit most from the Semantics Web?
Kingsley Uyi Idehen: The vision of a Semantic Web is actually the vision of the Web. Unbeknownst to most, they are one and the same. The goal was always to have HTTP URIs denote things, and by implication, said URIs basically resolve to their meaning  .
Paradoxically, the Web bootstrapped on the back of URIs that denoted HTML documents (due to Mosaic’s ingenious exploitation of the “view source” pattern ) thereby accentuating its Web of hyper-linked Documents (i.e., Information Space) aspect while leaving its Web of hyper-linked Data aspect somewhat nascent.
The nascence of the Web of hyper-linked Data (aka Web of Data, Web of Linked Data etc.) laid the foundation for the “Semantic Web Project” which naturally evoled into “The Semantic Web” meme. Unfortunately, “The Semantic Web” meme hit a raft of issues (many self inflicted) that basically disconnected it from its original Web vision and architecture aspect reality.
The Semantic Web is really about the use of hypermedia to enhance the long understood entity relationship model  via the incorporation of _explicit_ machine- and human-comprehensible entity relationship semantics via the RDF data model. Basically, RDF is just about an enhancement to the entity relationship model that leverages URIs for denoting entities and relations that are described using subject->predicate->object based proposition statements.
For the rest of this interview, I would encourage readers to view “The Semantic Web” phrase as meaning: a Web-scale entity relationship model driven by hypermedia resources that bear entity relationship model description graphs that describe entities and their relations (associations).
To answer your question, the benefits of the Semantic Web are as follows: fine-grained access to relevant data on the Web (or private Web-like networks) with increasing degrees of serendipity .
Q2. Who is currently using Semantic Web technologies and how? Could you please give us some examples of current commercial projects?
Kingsley Uyi Idehen: I wouldn’t used “project” to describe endeavors that exploit Semantic Web oriented solutions. Basically, you have entire sectors being redefined by this technology. Examples range from “Open Government” (US, UK, Italy, Spain, Portugal, Brazil etc..) all the way to publishing (BBC, Globo, Elsevier, New York Times, Universal etc..) and then across to pharmaceuticals (OpenPHACTs, St. Judes, Mayo, etc.. ) and automobiles (Daimler Benz, Volkswagen etc..). The Semantic Web isn’t an embryonic endeavor deficient on usecases and case studies, far from it.
Q3. Virtuoso is a Hybrid RDBMS/Graph Column store. How does it differ from relational databases and from XML databases?
Kingsley Uyi Idehen:: First off, we really need to get the definitions of databases clear. As you know, the database management technology realm is vast. For instance, there isn’t anything such thing as a non relational database.
Such a system would be utterly useless beyond an comprehendible definition, to a marginally engaged audience. A relational database management system is typically implemented with support for a relational model oriented query language e.g., SQL, QUEL, OQL (from the Object DBMS era), and more recently SPARQL (for RDF oriented databases and stores). Virtuoso is comprised of a relational database management system that supports SQL, SPARQL, and XQuery. It is optimized to handle relational tables and/or relational property graphs (aka. entity relationship graphs) based data organization. Thus, Virtuoso is about providing you with the ability to exploit the intensional (open world propositions or claims) and extensional (closed world statements of fact) aspects of relational database management without imposing either on its users.
Q4. Is there any difference with Graph Data stores such as Neo4j?
Kingsley Uyi Idehen: Yes, as per my earlier answer, it is a hybrid relational database server that supports relational tables and entity relationship oriented property graphs. It’s support for RDF’s data model enables the use of URIs as native types. Thus, every entity in a Virtuoso DBMS is endowed with a URI as its _super key_. You can de-reference the description of a Virtuoso entity from anywhere on a network, subject to data access policies and resource access control lists.
Q5. How do you position Virtuoso with respect to NoSQL (e.g Cassandra, Riak, MongoDB, Couchbase) and to NewSQL (e.g.NuoDB, VoltDB)?
Kingsley Uyi Idehen: Virtuoso is a SQL, NoSQL, and NewSQL offering. Its URI based _super keys_ capability differentiates it from other SQL, NewSQL, and NoSQL relational database offerings, in the most basic sense. Virtuoso isn’t a data silo, because its keys are URI based. This is a “deceptively simple” claim that is very easy to verify and understand. All you need is a Web Browser to prove the point i.e., a Virtuoso _super key_ can be placed in the address bar of any browser en route to exposing a hypermedia based entity relationship graph that navigable using the Web’s standard follow-your-nose pattern.
Q6. RDF can be encoded in various formats. How do you handle that in Virtuoso?
Kingsley Uyi Idehen: Virtuoso supports all the major syntax notations and data serialization formats associated with the RDF data model. This implies support for N-Triples, Turtle, N3, JSON-LD, RDF/JSON, HTML5+Microdata, (X)HTML+RDFa, CSV, OData+Atom, OData+JSON.
Q7. Does Virtuoso restrict the contents to triples?
Kingsley Uyi Idehen: Assuming you mean: how does it enforce integrity constraints on triple values?
It doesn’t enforce anything per se. since the principle here is “schema last” whereby you don’t have a restrictive schema acting as an inflexible view over the data (as is the case with conventional SQL relational databases). Of course, an application can apply reasoning to OWL (Web Ontology Language) based relation semantics (i.e, in the so-called RBox) as option for constraining entity types that constitute triples. In addition, we will soon be releasing a SPARQL Views mechanism that provides a middle ground for this matter whereby the aforementioned view can be used in a loosely coupled manner at the application, middleware, or dbms layer for applying constraints to entity types that constitute relations expressed by RDF triples.
Q8. RDF can be represented as a direct graph. Graphs, as data structure do not scale well. How do you handle scalability in Virtuoso? How do you handle scale-out and scale-up?
Kingsley Uyi Idehen: The fundamental mission statement of Virtuoso has always be to destroy any notion of performance and scalability as impediments to entity relationship graph model oriented database management. The crux of the matter with regards to Virtuoso is that it is massively scalable due for the following reasons:
• fine-grained multi-threading scoped to CPU cores
• vectorized (array) execution of query commands across fine-grained threads
• column-store based physical storage which provides storage layout and data compaction optimizations (e.g., key compression)
• share-nothing clustering that scales from multiple instances (leveraging the items above) on a single machine all the way up to a cluster comprised of multiple machines.
The scalability prowess of Virtuoso are clearly showcased via live Web instances such as DBpedia and the LOD Cloud Cache (50+ Billion Triples). You also have no shortage of independent benchmark reports to compliment the live instances:
•50 – 150 Billion scale Berlin SPARQL Benchmark (BSBM) report (.pdf)
Q9. Could you give us some commercial examples where Virtuoso is in use?
Kingsley Uyi Idehen: Elsevier, Globo, St. Judes Medical, U.S. Govt., EU, are a tiny snapshot of entities using Virtuoso on a commercial basis.
Q10. Do you plan in the near future to develop integration interfaces to other NoSQL data stores?
Kingsley Uyi Idehen: If a NewSQL or NoSQL store supports any of the following, their integration with Virtuoso is implicit: HTTP based RESTful interaction patterns, SPARQL, ODBC, JDBC, ADO.NET, OLE-DB. In the very worst of cases, we have to convert the structured data returned into 5-Star Linked Data using Virtuoso’s in-built Linked Data middleware layer for heterogeneous data virtualization.
Q11. Virtuoso supports SPARQL. SPARQL is not SQL, how do handle querying relational data then?
Kingsley Uyi Idehen: Virtuoso support SPARQL, SQL, SQL inside SPARQL and SPARQL inside SQL (we call this SPASQL). Virtuoso has always had its own native SQL engine, and that’s integral to the entire product. Virtuoso provides an extremely powerful and scalable SQL engine as exemplified by the fact that the RDF data management services are basically driven by the SQL engine subsystem.
Q12. How do you support Linked Open Data? What advantages are the main benefits of Linked Open Data in your opinion?
Kingsley Uyi Idehen: Virtuoso enables you expose data from the following sources, courtesy of its in-built 5-star Linked Data Deployment functionality:
• RDF based triples loaded from Turtle, N-Triples, RDF/XML, CSV etc. documents
• SQL relational databases via ODBC or JDBC connections
• SOAP based Web Services
• Web Services that provide RESTful interaction patterns for data access.
• HTTP accessible document types e.g., vCard, iCalendar, RSS, Atom, CSV, and many others.
Q13. What are the most promising application domains where you can apply triple store technology such as Virtuoso?
Kingsley Uyi Idehen: Any application that benefits from high-performance and scalable access to heterogeneously shaped data across disparate data sources. Healthcare, Pharmaceuticals, Open Government, Privacy enhanced Social Web and Media, Enterprise Master Data Management, Big Data Analytics etc..
Q14. Big Data Analysis: could you connect Virtuoso with Hadoop? How does Viruoso relate to commercial data analytics platforms, e.g Hadapt, Vertica?
Kingsley Uyi Idehen: You can integrate data managed by Hadoop based ETL workflows via ODBC or Web Services driven by Hapdoop clusters that expose RESTful interaction patterns for data access. As for how Virtuoso relates to the likes of Vertica re., analytics, this is about Virtuoso being the equivalent of Vertica plus the added capability of RDF based data management, Linked Data Deployment, and share-nothing clustering. There is no job that Vertica performs that Virtuoso can’t perform.
There are several jobs that Virtuoso can perform that Vertica, VoltDB, Hadapt, and many other NoSQL and NewSQL simply cannot perform with regards to scalable, high-performance RDF data management and Linked Data deployment. Remember, RDF based Linked Data is all about data management and data access without any kind of platform lock-in. Virtuoso locks you into a value proposition (performance and scale) not the platform itself.
Q15. Do you also benchmark loading trillion of RDF triples? Do you have current benchmark results? How much time does it take to querying them?
Kingsley Uyi Idehen: As per my earlier responses, there is no shortage of benchmark material for Virtuoso.
The benchmarks are also based on realistic platform configurations unlike the RDBMS patterns of the past which compromised the utility of TPC benchmarks.
Q16. In your opinion, what are the main current obstacles for the adoption of Semantic Web technologies in the Enterprise?
Kingsley Uyi Idehen:The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point.
. — 5-Star Linked Data URIs and Semiotic Triangle
. — what do HTTP URIs Identify?
. — View Source Pattern & Web Bootstrap
. — Unified View of Data using the Entity Relationship Model (Peter Chen’s 1976 dissertation)
. — Serendipitous Discovery Quotient (SDQ).
Kingsley Idehen is the Founder and CEO of OpenLink Software. He is an industry acclaimed technology innovator and entrepreneur in relation to technology and solutions associated with data management systems, integration middleware, open (linked) data, and the semantic web.
Kingsley has been at the helm of OpenLink Software for over 20 years during which he has actively provided dual contributions to OpenLink Software and the industry at large, exemplified by contributions and product deliverables associated with: Open Database Connectivity (ODBC), Java Database Connectivity (JDBC), Object Linking and Embedding (OLE-DB), Active Data Objects based Entity Frameworks (ADO.NET), Object-Relational DBMS technology (exemplified by Virtuoso), Linked (Open) Data (where DBpedia and the LOD cloud are live showcases), and the Semantic Web vision in general.
–ODBMS.org free resources on : Relational Databases, NewSQL, XML Databases, RDF Data Stores
Follow ODBMS Industry Watch on Twitter: @odbmsorg