Semantic Publishing Benchmark (SPB)
The Semantic Publishing Benchmark (SPB) is an LDBC benchmark for testing the performance of RDF engines, inspired by the Media/Publishing industry. In particular, LDBC worked with the British Broadcasting Corporation (BBC) to define this benchmark, for which the BBC donated workloads, ontologies and data. The publishing industry is an area where significant adoption of RDF is taking place.
There have been many academic benchmarks for RDF but none of these are truly industrial-grade. The SPB combines a set of complex queries under inference with continuous updates and special failover tests for systems implementing replication.
SPB performance is measured by producing a workload of CRUD (Create, Read, Update, Delete) operations that are executed simultaneously. The benchmark offers a data generator that uses real reference data to produce datasets of various sizes, which makes it possible to test the scalability of RDF systems. The benchmark workload consists of (a) editorial operations that add new data or alter or delete existing data, and (b) aggregation operations that retrieve content according to various criteria. The benchmark also tests conformance to various rules in the OWL2-RL rule-set.
The SPB specification contains the description of the benchmark and the data generator. All information about its software components can be found on the SPB developer page.
Semantic Publishing Benchmark (SPB) Audited Results for Scale Factors SF1 – 64M, SF3 – 256M and SF5 – 1G triples are shown below.
Scale Factor | Interactive (Q/s) | Updates (ops/sec) | Analytical | Cost | Software | Hardware | Test Sponsor | Date |
1 | 100.85 | 10.19 | n.a. | €37,504 | GraphDB EE6.2 | Xeon1650v3 6-core 3.5GHz 96GB RAM | ONTOTEXT AD | 2015Apr26 |
1 | 142.7588 | 10.6725 | n.a. | €35,323 | GraphDB SE 6.3 alpha | CPU Intel Xeon E5-1650 v3 3.5GHz, 15MB L3 cache, s2011 | ONTOTEXT AD | 2015Jun10 |
3 | 29.90 | 9.50 | n.a. | €37,504 | GraphDB EE6.2 | Xeon1650v3 6-core 3.5GHz 96GB RAM | ONTOTEXT AD | 2015Apr26 |
3 | 54.6364 | 9.4967 | n.a. | €35,323 | GraphDB SE 6.3 alpha | CPU Intel Xeon E5-1650 v3 3.5GHz, 15MB L3 cache, s2011 | ONTOTEXT AD | 2015Jun10 |
1 | 149.0385 | 156.8325 | n.a. | $20,213 (€17,801 rate of 21/06/2015) | Virtuoso Opensource Version 7.50.3213 | Intel Xeon E5-2630, 6x 2.30GHz, Sockel 2011, boxed, 192 GB RAM | OpenLink Software | 2015Jun09 |
3 | 80.6158 | 92.7072 | n.a. | $20,213 (€17,801 rate of 21/06/2015) | Virtuoso Opensource Version 7.50.3213 | Intel Xeon E5-2630, 6x 2.30GHz, Sockel 2011, boxed, 192 GB RAM | OpenLink Software | 2015Jun09 |
3 | 115.3838 | 109.8517 | n.a. | $24,528 (€21,601 rate of 21/06/2015) | Virtuoso Opensource Version 7.50.3213 | Amazon EC2, r3.8xlarge | OpenLink Software | 2015Jun10 |
5 | 32.2789 | 72.7192 | n.a. | $20,213 (€17,801 rate of 21/06/2015) | Virtuoso Opensource Version 7.50.3213 | Intel Xeon E5-2630, 6x 2.30GHz, Sockel 2011, boxed, 192 GB RAM | OpenLink Software | 2015Jun09 |
5 | 45.8101 | 55.4467 | n.a. | $24,528 (€21,601 rate of 21/06/2015) | Virtuoso Opensource Version 7.50.3213 | Amazon EC2, r3.8xlarge | OpenLink Software | 2015Jun10 |
The query substitution parameters for the scale factors SF1 and SF3 are given in the table below:
SF | Query Substitution Parameters |
1 | generated.64.7z |
3 | generated.256.7z |
For Developers
The Semantic Publishing Benchmark v2.0 (SPB) is an LDBC benchmark for RDF database engines inspired by the Media/Publishing industry, particularly by the BBC’s Dynamic Semantic Publishing approach.
The application scenario considers a media or publishing organization that deals with large volumes of streaming content, namely news, articles or “media assets”. This content is enriched with metadata that describes it and links it to reference knowledge: taxonomies and databases that include relevant concepts, entities and factual information. This metadata allows publishers to efficiently retrieve relevant content according to their various business models. For instance, some, like the BBC, can use it to maintain a rich and interactive web presence for their content, while others, e.g. news agencies, can provide better-defined content feeds.
From a technology standpoint, the benchmark assumes that an RDF database is used to store both the reference knowledge (mostly static) and the metadata (which grows constantly to stay in sync with the inflow of streaming content). The main interactions with the repository are (i) updates that add new metadata or alter it, and (ii) queries that retrieve content according to various criteria.
New features of SPB 2.0:
- Larger sizes of Reference Data – added reference data entities from DBpedia (Companies, Events, Persons)
- Larger amount of GeoNames locations – added GeoNames ids of locations from all over Europe
- Added owl:sameAs mappings between GeoNames ids and DBpedia locations
- Added two new queries to the basic interactive query-mix, querying the relations between entities in reference data
- Requires inference support for RDFS (subPropertyOf, subClassOf) and OWL (TransitiveProperty, SymmetricProperty, sameAs)
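To make the inference requirement concrete, here is a minimal sketch of what the listed constructs entail, written as naive forward chaining over an in-memory set of triples. It is only an illustration under stated assumptions: the `ex:` identifiers are made up, the rules are simplified versions of the RDFS/OWL semantics, and this is not how SPB or any benchmarked engine implements inference.

```python
# Illustrative forward chaining over a tiny triple set.
# All ex: identifiers are hypothetical and are not SPB reference data.

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
SAME_AS = "owl:sameAs"
TRANSITIVE = "owl:TransitiveProperty"
SYMMETRIC = "owl:SymmetricProperty"


def infer(triples):
    """Return the closure of `triples` under simplified versions of five rules."""
    closure = set(triples)
    while True:
        new = set()
        transitive = {s for s, p, o in closure if p == TYPE and o == TRANSITIVE}
        symmetric = {s for s, p, o in closure if p == TYPE and o == SYMMETRIC}
        for s, p, o in closure:
            # subClassOf: an instance of a class is an instance of its superclasses
            if p == TYPE:
                new |= {(s, TYPE, sup) for c, q, sup in closure
                        if q == SUBCLASS and c == o}
            # subPropertyOf: a statement also holds for the super-property
            new |= {(s, sup, o) for sub, q, sup in closure
                    if q == SUBPROP and sub == p}
            # SymmetricProperty (owl:sameAs is itself symmetric)
            if p in symmetric or p == SAME_AS:
                new.add((o, p, s))
            # TransitiveProperty (owl:sameAs is itself transitive)
            if p in transitive or p == SAME_AS:
                new |= {(s, p, o2) for s2, p2, o2 in closure
                        if p2 == p and s2 == o}
            # sameAs: statements about s also hold for o
            if p == SAME_AS:
                new |= {(o, p2, o2) for s2, p2, o2 in closure if s2 == s}
                new |= {(s2, p2, o) for s2, p2, o2 in closure if o2 == s}
        if new <= closure:  # fixpoint reached, nothing left to derive
            return closure
        closure |= new


facts = {
    ("ex:London", TYPE, "ex:City"),
    ("ex:City", SUBCLASS, "ex:Place"),
    ("ex:partOf", TYPE, TRANSITIVE),
    ("ex:London", "ex:partOf", "ex:England"),
    ("ex:England", "ex:partOf", "ex:UK"),
    ("ex:geo-1", SAME_AS, "ex:London"),  # stands in for a GeoNames-to-DBpedia mapping
}
closure = infer(facts)
```

A query for everything that is part of `ex:UK` would only return `ex:London` and `ex:geo-1` if the engine derives the transitive and sameAs consequences, which is the kind of behaviour the SPB 2.0 queries depend on.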
SPB consists of a Data Generator for producing synthetic data, a Query Driver that offers two workloads (basic and advanced), and a set of real reference knowledge data and ontologies provided by the BBC, DBpedia and GeoNames.
Main components of the benchmark software are:
- https://github.com/ldbc/ldbc_spb_bm_2.0/blob/master/doc : the SPB documentation, including a Full Disclosure Report template
- https://github.com/ldbc/ldbc_spb_bm_2.0 : contains, as one integral unit, all necessary components of the benchmark software:
  - Data generator, which can produce consistent data in parallel and at different scales, allowing for experiments with various dataset sizes
  - Query driver, which executes both workloads:
    - basic – consisting of an interactive query-mix for evaluating RDF systems in the most common use-cases
    - advanced – consisting of interactive and analytical query-mixes, adding complexity to the query workload, e.g. faceted, analytical and drill-down queries
  - Reference datasets and ontologies – a set of reference data and ontologies provided by the BBC, DBpedia and GeoNames, used in the process of generating the synthetic data
  - Validation of query results, used to validate query results from both workloads
The SPB data generator produces large synthetic datasets of scalable size. The synthetic data consists of a large number of annotations of media assets that refer to entities found in the reference datasets. An annotation (also called a creative work) is metadata about one or more real entities. The metadata consists of various properties, e.g. description, date of creation, tagged entities, etc. The data generator models three types of relations in the synthetic data it produces:
- Clustering of data: a clustering effect is produced by generating creative works about a single entity over a period of time. The number of creative works starts at a high peak and follows a smooth decay
- Correlations of entities: correlations are produced by generating creative works about two or three entities from the reference data over a period of time. At the beginning and end of the correlation period each entity is tagged separately, while in the middle of the period all entities are tagged together
- Random tagging of entities: a random distribution of tagged entities is created, simulating random ‘noise’ in the generated data
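The three relation types can be sketched in a few lines. The following is a rough illustration only: the exponential decay, the split of the correlation period into thirds, and all function and parameter names are assumptions made for this example, not the generator's actual model or configuration.

```python
import random


def clustering_counts(peak, days, decay=0.5):
    """Creative works per day about one entity: a high peak, then a smooth decay.

    Exponential decay is an assumption; the source only says "smooth decay".
    """
    return [max(1, int(peak * decay ** day)) for day in range(days)]


def correlation_tags(entity_a, entity_b, days):
    """Tag sets per day: entities tagged separately at the start and end of
    the period, and together in the middle (here: the middle third)."""
    start, end = days // 3, days - days // 3
    schedule = []
    for day in range(days):
        if start <= day < end:
            schedule.append([{entity_a, entity_b}])  # one work tags both
        else:
            schedule.append([{entity_a}, {entity_b}])  # separate works
    return schedule


def random_tags(entities, count, rng):
    """Background noise: each creative work tags one uniformly chosen entity."""
    return [{rng.choice(entities)} for _ in range(count)]


counts = clustering_counts(peak=100, days=5)
schedule = correlation_tags("ex:EntityA", "ex:EntityB", days=6)
noise = random_tags(["ex:EntityA", "ex:EntityB", "ex:EntityC"], 4, random.Random(42))
```

With these inputs, `counts` falls from 100 on the first day to 6 on the fifth, and `schedule` tags the two hypothetical entities together only on the third and fourth days.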
The SPB data generator can generate data at various scales defined by the benchmark user, from 1M triples up to billions. Generated data is saved to files in a standard RDF serialization format and split into chunks whose size is also defined by the user.
Generated synthetic data can be loaded into the benchmarked RDF system either by using the test driver or manually. Once the data is loaded, various statistics about it are analyzed, and query substitution parameters are generated and saved to files.
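As an illustration of how substitution parameters are typically used, the sketch below fills a query template with a value drawn from a pool of pre-generated parameters. The `$entity` placeholder syntax, the `ex:about` property and the parameter values are all hypothetical; the real SPB templates and parameter file formats are described in the benchmark documentation.

```python
import random
from string import Template

# Hypothetical aggregation query template; $entity and $limit are placeholders.
QUERY_TEMPLATE = Template("""
PREFIX ex: <http://example.org/>
SELECT ?work WHERE {
  ?work ex:about <$entity> .
} LIMIT $limit
""")


def instantiate(template, parameter_pool, rng, limit=10):
    """Pick one pre-generated parameter value and substitute it into the template."""
    return template.substitute(entity=rng.choice(parameter_pool), limit=limit)


# Stand-in for values that would be read from the generated parameter files.
pool = ["http://example.org/entity/1", "http://example.org/entity/2"]
query = instantiate(QUERY_TEMPLATE, pool, random.Random(0))
```

Drawing parameters from values known to exist in the loaded data keeps the queries from returning trivially empty results.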
The SPB query driver starts the workloads by simultaneously executing two types of agents: editorial agents (executing insert/update/delete operations) and aggregation agents (executing select/construct/describe operations). All agents run in parallel, simulating real multi-user use of the RDF system under test.
LDBC came out of an EU FP7 project and is now a non-profit organization sustained by its members and sponsored by Oracle Labs and IBM.
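The agent model can be sketched with ordinary threads. This is a minimal illustration, assuming stub operations in place of real SPARQL requests; `run_workload` and its parameters are names invented for this example and do not correspond to the actual driver's API.

```python
import threading
import time


def run_workload(editorial_agents, aggregation_agents, duration_s, update_op, query_op):
    """Run both agent types in parallel for `duration_s` seconds and
    return the completed operations per second for each type."""
    counts = {"updates": 0, "queries": 0}
    lock = threading.Lock()
    stop = threading.Event()

    def agent(kind, op):
        # Each agent repeats its operation until the run is stopped.
        while not stop.is_set():
            op()
            with lock:
                counts[kind] += 1

    threads = [threading.Thread(target=agent, args=("updates", update_op))
               for _ in range(editorial_agents)]
    threads += [threading.Thread(target=agent, args=("queries", query_op))
                for _ in range(aggregation_agents)]
    for thread in threads:
        thread.start()
    time.sleep(duration_s)
    stop.set()
    for thread in threads:
        thread.join()
    return {kind: n / duration_s for kind, n in counts.items()}


def stub():
    """Stand-in for a SPARQL update or query sent to the system under test."""
    time.sleep(0.001)


rates = run_workload(2, 4, 0.2, update_op=stub, query_op=stub)
print(rates)
```

The two rates correspond loosely to the "Updates (ops/sec)" and "Interactive (Q/s)" columns of the results table, although audited runs follow the rules of the SPB specification rather than this simplified loop.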