“Semantic technologies may be unfamiliar, but when you have used them for a while you will realise they are no harder than many other technologies…in fact I would argue they are easier.”– John Goodwin.
On the topics of Semantic web technologies, ontology engineering, and linked data, I have interviewed John Goodwin. John is Principal Scientist in the Research Department of Ordnance Survey, which is Great Britain’s National Mapping Authority.
Q1. You are a senior data scientist at the Ordnance Survey, Great Britain’s national mapping agency. What is your role there?
John Goodwin: I am a Principal Scientist in the Research Department of Ordnance Survey, which is Great Britain’s National Mapping Authority [note: we are authorities now…not agencies].
I have worked in research for Ordnance Survey for over 10 years now, and my research was mainly focused on semantic web technologies, ontology engineering and linked data. The Principal Scientist role is a fairly new one for me, and as part of this role I am now responsible for a stream of research work around data management, data delivery and web services. This involves looking at new and novel technologies that ensure we have the correct infrastructure and data models to meet the challenges of the future. Furthermore, it involves investigating new ways we can serve our data to the end customer.
Q2. Do you have a Big Data problem at the Ordnance Survey? Could you please give us some examples of Big Data Use Cases?
John Goodwin: Hmmm, that is debatable. Ordnance Survey certainly has big ‘data problems’ but I don’t know if they qualify as ‘big data’ problems. I have heard Big Data defined as any data that won’t fit into Excel (which is a definition I personally hate), and if that is the case then we certainly have ‘Big Data’. Ordnance Survey currently stores information about half a billion topographic features, and 27.5 million geocoded addresses (with around 500,000 changes a year). So we may not have the sheer volume of data that some folk have, but I believe the combination of volume and complexity means that performing analysis over this data or running queries would certainly be a ‘Big Data’ problem.
For example, if you wanted to calculate the number of postboxes in Scotland, or find the length of all roads in Great Britain, you could be waiting some time using traditional database solutions.
Q3. The vision of the Semantic Web is one where web pages contain self-describing data that machines will be able to navigate as easily as humans do now. What are the main benefits? Who could profit most from the Semantic Web?
John Goodwin: I think an immediate benefit is the ability to provide more structured data to search engines so that they can provide better search services. Structured web content means more meaningful search results and offers new ways to summarise pages and present those summaries in a search engine.
Q4. Who is currently using Semantic Web technologies and how? Could you please give us some examples of current commercial projects?
John Goodwin: One interesting example is a company called Garlik (now part of Experian). Garlik provides services to protect people from identity theft and financial fraud. They use semantic web technologies to integrate a number of different datasets, and provide a flexible way to integrate new datasets so they can perform queries across these datasets to find potential victims more easily. The BBC are great users of linked data technology and used triplestore technologies as part of their content management systems for their World Cup and Olympics websites. Again the flexibility of the technology, and the ability to link data across the whole of the BBC, proved invaluable.
We are using linked data technologies at Ordnance Survey in research projects to look at ways of integrating our data with third party data.
The major search engines are backing an initiative called schema.org which will provide a unified schema for structured data in web content, and this has the potential (as mentioned above) to provide a richer search experience.
Q5. Do you use Linked Data? What are the main benefits of Linked Data in your opinion?
John Goodwin: I am a big supporter of linked data, and this has been the focus of my work for the last few years. I have used it in research projects and also produced the Ordnance Survey linked data.
Linked data is great for data integration – a common data language makes it easy (or rather easier) to bring a number of disparate datasets together. It is also more flexible than traditional relational database technologies. Like other NoSQL technologies linked data can be seen as ‘schemaless’ to some extent. This means if you want to change the data model by, say, adding new attributes or properties it is very easy to do so. Furthermore, and this is a more personal thing, I find graphs to be the most natural way to think about data. It feels far more intuitive and I have to say I think querying graph data using SPARQL is far easier than querying relational data using SQL (especially if you have a lot of joins).
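The 'schemaless' point can be illustrated with a minimal sketch. It assumes a toy in-memory graph held as Python tuples rather than a real triplestore, and every identifier in it (the `ex:` names, `match`, `cities_within`) is made up for illustration. The pattern matching only loosely imitates what a SPARQL basic graph pattern does.

```python
# Toy in-memory graph: each fact is a (subject, predicate, object) triple.
# All identifiers are hypothetical and exist only for this illustration.
triples = {
    ("ex:PlaceA", "rdf:type", "ex:City"),
    ("ex:PlaceA", "ex:within", "ex:RegionX"),
    ("ex:PlaceB", "rdf:type", "ex:City"),
    ("ex:PlaceB", "ex:within", "ex:RegionX"),
    ("ex:PlaceC", "rdf:type", "ex:Village"),
    ("ex:PlaceC", "ex:within", "ex:RegionX"),
}

# Adding a new property needs no schema change: it is just one more triple.
triples.add(("ex:PlaceA", "ex:note", "added later"))


def match(graph, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def cities_within(graph, region):
    """Combine two patterns on a shared subject, much like a join."""
    cities = {s for s, _, _ in match(graph, p="rdf:type", o="ex:City")}
    inside = {s for s, _, _ in match(graph, p="ex:within", o=region)}
    return sorted(cities & inside)


print(cities_within(triples, "ex:RegionX"))
```

In a relational design the new `ex:note` attribute would typically mean altering a table; here the existing queries keep working unchanged.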
Q6. What data management technologies are best suited to model and query Linked Open Data?
John Goodwin: Linked Open Data is built around W3C standards such as RDF (Resource Description Framework) – which is the data language of choice on the linked data web (although some people like to debate whether or not RDF is needed for linked data). RDF is to the web of data as HTML is to the web of documents…or at least that is how I see it. RDF has its own query language called SPARQL. A large number of programming libraries (e.g. Jena) are emerging to handle RDF. Furthermore, RDF can be stored in databases called triplestores and there are many triplestores to choose from. I am not in a position to advocate one triplestore over another but there are a large number of great technologies being developed by SMEs and more traditional database vendors alike. Furthermore, there are a number of open source options. We have experimented with a number of them at Ordnance Survey.
Q7. How do you integrate data from different sources that are not in Linked Open Data format (e.g. relational, raw data, etc.)?
John Goodwin: So far by converting the underlying data to RDF. Most relational data is a simple script away from being RDF. Tools do exist to help ‘triplify’ the data, but if I am honest I find that most of the time it is easier to write a quick Python script to do the job.
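As a rough idea of what such a 'quick Python script' might look like, here is a minimal sketch that turns CSV rows into N-Triples lines using only the standard library. It is not the script used at Ordnance Survey: the `http://example.org/` namespace, the column names, the sample rows and the function names are all assumptions made for illustration, and every value is emitted as a plain string literal.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not a real dataset


def escape_literal(value):
    """Escape characters that N-Triples string literals do not allow raw."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(rows, key_column, entity="postbox"):
    """Emit one triple per non-empty cell, using the key column as subject."""
    lines = []
    for row in rows:
        subject = f"<{BASE}{entity}/{quote(row[key_column])}>"
        for column, value in row.items():
            if column == key_column or value == "":
                continue  # the key is already in the subject; skip blanks
            predicate = f"<{BASE}def/{quote(column)}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


# Made-up sample data standing in for an exported relational table.
sample = io.StringIO("id,name,postcode\n1,Main Street box,AB1 2CD\n2,Station box,\n")
ntriples = rows_to_ntriples(csv.DictReader(sample), key_column="id")
print("\n".join(ntriples))
```

A real conversion would also map foreign keys to URIs instead of literals and reuse existing vocabularies where possible; this sketch leaves both out.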
OpenRefine is a useful tool that lets you clean up CSV data and has a plugin that allows export of data to RDF. OpenRefine additionally has the benefit of being able to work with reconciliation APIs. If a linked data site offers a reconciliation API you can use it with OpenRefine to, for example, convert a column of city names or postcodes to URIs in the Ordnance Survey linked data. This is useful when you need to create explicit links to other datasets. For example, if you had a spreadsheet with place names like ‘City of Southampton’ you could use OpenRefine and the Ordnance Survey linked data reconciliation API to turn ‘City of Southampton’ into its URI.
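The idea behind reconciling a column can be sketched in a few lines. This sketch does not call OpenRefine or any real reconciliation API: a local dictionary stands in for the service's answers, and the URIs, the table and the function names are hypothetical placeholders rather than actual Ordnance Survey identifiers.

```python
# Hypothetical lookup table standing in for a reconciliation service.
# The URIs are placeholders, not real Ordnance Survey linked data URIs.
KNOWN_PLACES = {
    "city of southampton": "http://example.org/id/place/southampton",
    "city of portsmouth": "http://example.org/id/place/portsmouth",
}


def normalise(label):
    """Lower-case the label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def reconcile_column(labels, lookup):
    """Pair each label with its URI, or None when nothing matches."""
    return [(label, lookup.get(normalise(label))) for label in labels]


column = ["City of Southampton", "  city of  Portsmouth", "Atlantis"]
for label, uri in reconcile_column(column, KNOWN_PLACES):
    print(repr(label), "->", uri)
```

A real reconciliation service usually returns ranked candidates with scores rather than a single exact match, which is why unmatched or ambiguous values still need a manual check.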
Q8. What are the most promising application domains where you can apply RDF triple store technology such as AllegroGraph and Virtuoso?
John Goodwin: I think any domain where you either want to integrate lots of disparate datasets or you want a data model that is flexible, and where schema evolution might be a problem. I think geospatial is a promising domain as ‘everything happens somewhere’ and location provides a useful integration hub for many datasets. Semantic web technologies have also been used widely in the bioinformatics domain. The BBC are another great use case – they use the technology to integrate data across their whole enterprise. This brings together data from news, radio, sport, television and music and allows new and exciting ways to explore the data.
I dare say it is also a technology that will prove useful/interesting to certain three letter American government agencies that have made the news recently.
Q9. Do you use Data Analytics at the Ordnance Survey and for what?
John Goodwin: I would say currently we don’t really – though it depends what you mean by analytics. We are largely concerned with collecting and maintaining data, and then shipping this out as products and services. We have experimented with an IBM® Netezza® appliance to perform queries over our data that would have taken too much time in traditional databases, to answer questions such as ‘how many post boxes are there in Great Britain?’.
Q10. Can you do data analytics using Linked Open Data? If yes how?
John Goodwin: I think again it depends what is meant by analytics. Linked data offers a great way to bring lots of datasets together and then, maybe, materialise a view of those integrated datasets that could then be used to perform some analytics. Many people are doing ‘graph analytics’, and given that linked data is a graph I think there is some interesting work to be done in looking at the intersections of graph/network theory and linked data.
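One very simple form of graph analytics over linked data is counting how many links touch each resource. The sketch below assumes the integrated data has already been materialised as resource-to-resource triples; the `ex:` identifiers and the `degree_counts` helper are invented for illustration, and degree is only one of many possible network measures.

```python
from collections import Counter

# Made-up resource-to-resource links, as they might look after integration.
links = [
    ("ex:A", "ex:linksTo", "ex:B"),
    ("ex:A", "ex:linksTo", "ex:C"),
    ("ex:B", "ex:linksTo", "ex:C"),
    ("ex:D", "ex:linksTo", "ex:C"),
]


def degree_counts(triples):
    """Count, for each node, the triples in which it is subject or object."""
    degree = Counter()
    for subject, _, obj in triples:
        degree[subject] += 1
        degree[obj] += 1
    return degree


degree = degree_counts(links)
print(degree.most_common(1))
```

Literal values (names, numbers) would need to be filtered out before counting, since only resources are nodes in the network sense.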
Q11. What are the main current obstacles for the adoption of Semantic Web technologies in the Enterprise?
John Goodwin: I think there are two main obstacles. The first one is a perception that RDF and linked data are hard, and somehow we need to overcome this perception. Lots of things in the ICT domain are hard…RDBMS is hard, C++ is hard etc. Semantic technologies may be unfamiliar, but when you have used them for a while you will realise they are no harder than many other technologies…in fact I would argue they are easier. I know a lot of developers who have moved onto using SPARQL and after a few months using it find it much easier to understand than SQL. Furthermore, I think it is harder to hire people with expertise in these technologies – there are still more people skilled up in traditional RDBMS and other newer NoSQL technologies like MongoDB.
I think the second obstacle is that semantic web technologies are, obviously, not going to be as mature as a good old relational database. There are some great triplestores out there, and there are enterprises who have successfully incorporated them (the BBC are a great example) but being a relatively new technology I suspect many enterprises are nervous to invest.
John Goodwin went to university at Royal Holloway and Bedford New College (University of London – based in Egham, Surrey) and graduated in 1992 with a 1st class honours degree in mathematics. Following that he moved to Cambridge and studied Part III of the Mathematics Tripos at the Department of Applied Maths and Theoretical Physics (University of Cambridge) where he obtained a Certificate of Advanced Study in Mathematics. John then moved to the University of Southampton to start his PhD.
He graduated in 1997 with a PhD in “The Cauchy Problem in Spacetimes with Closed Timelike Curves” (which can very roughly be paraphrased as ‘do time machines blow up when you turn them on?’). In 1998 John left academia to start work at Ordnance Survey (located at postcode SO16 0AS) as a systems developer. He left Ordnance Survey in 2000 to start work at a small software company called Neusciences where he gained experience in various A.I. techniques. After just ten months at Neusciences John returned to Ordnance Survey to work in the research department where his research concentrated on the semantic web, ontologies and linked data. On the back of this research John produced the current Ordnance Survey linked data. He is currently still at Ordnance Survey and working as a Principal Scientist, where he leads research (at a technical and strategic level) into data management, data delivery and services.
John currently chairs the UK location council linked data working group and participates in the UK Government Linked Data working group.