Social Network Benchmark
The Social Network Benchmark in fact consists of three distinct benchmarks on a common dataset, one for each of its three workloads. Each workload produces a single performance metric at a given scale and a price/performance metric at that scale. The full disclosure further breaks the metric down into its constituent parts, e.g. individual query execution times.
- Interactive Workload. The Interactive SNB workload is the first one we are releasing, in draft stage. It is defined in plain text, yet we have example implementations in Neo4j's Cypher, SPARQL and SQL. The Interactive Workload tests a system's throughput on relatively simple queries with concurrent updates. One could call it an OLTP workload, but although each query typically touches only a small fraction of the database, this can still amount to hundreds of thousands of values (often the two-step neighborhood of a person in the social graph).
- Business Intelligence Workload. A first draft of this workload is formulated in SPARQL and has been tested against OpenLink Virtuoso. The BI workload consists of complex structured queries for analyzing the online behavior of users for marketing purposes. The workload stresses query execution and optimization. Queries typically touch a large fraction of the data and do not require repeatable read. The queries will run concurrently with a trickle load of updates (not yet released). Unlike in the Interactive Workload, the queries touch more data as the database grows.
- Graph Analytics Workload. This workload is not yet available. It will test the functionality and scalability of the system under test (SUT) for graph analytics that typically cannot be expressed in a query language. The workload is still under development, but will consist of algorithms such as PageRank, clustering and breadth-first search. The analysis runs over most of the data in the graph as a single operation and itself produces large intermediate results. It is not expected to be transactional or to be isolated from possible concurrent updates.
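To make the "two-step neighborhood" mentioned for the Interactive Workload concrete, here is a minimal sketch of a hop-limited breadth-first search, one of the algorithms named for the Graph Analytics Workload. The toy graph, the person names and the function `neighborhood` are illustrative assumptions and are not part of the benchmark specification.

```python
from collections import deque


def neighborhood(graph, start, max_hops=2):
    """Return every vertex reachable from `start` in at most `max_hops` edges.

    `graph` maps a vertex to the set of its neighbours. The start vertex
    itself is not part of the result.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        vertex, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(vertex, ()):
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, dist + 1))
    seen.discard(start)
    return seen


# Hypothetical toy friendship graph (undirected, stored as adjacency sets).
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "erin"},
    "dave": {"bob", "frank"},
    "erin": {"carol"},
    "frank": {"dave"},
}

print(sorted(neighborhood(friends, "alice")))  # ['bob', 'carol', 'dave', 'erin']
```

In a power-law graph the size of this set grows quickly with the number of friends, which is why a nominally small query can still touch a large number of values.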
All the SNB scenarios share a common, scalable synthetic dataset, generated by a state-of-the-art data generator. We strongly believe in a single dataset that makes sense for all workloads; that is, the Interactive and BI workloads will traverse data that also has sensible PageRank outcomes, graph clustering structure, etc. This is in contrast to LinkBench, released by the Facebook team that manages the OLTP workload on the Facebook Graph. LinkBench is closely tuned to the low-level MySQL query patterns Facebook sees, but its graph has an unrealistic structure (no community structure and no correlations between values and structure).
Social Network Benchmark (SNB) Audited Results
| Scale factor | Throughput (ops/s) | Cost | Software | Hardware | Test sponsor | Date |
|---|---|---|---|---|---|---|
| 10 | 101.20 | €30,427 | Sparksee 5.1.1 | 2*(Xeon 2630v3 8-core 2.4GHz) 256GB RAM | Sparsity Technologies SA | 2015Apr27 |
| 30 | 1287.17 | €20,212 | Virtuoso 07.50.3213v7fasttrack | 2*(Xeon 2630 6-core 2.4GHz) 192GB RAM | OpenLink Software | 2015Apr27 |
| 30 | 86.50 | €30,427 | Sparksee 5.1.1 | 2*(Xeon 2630v3 8-core 2.4GHz) 256GB RAM | Sparsity Technologies SA | 2015Apr27 |
| 100 | 1200.00 | €20,212 | Virtuoso 07.50.3213v7fasttrack | 2*(Xeon 2630 6-core 2.4GHz) 192GB RAM | OpenLink Software | 2015Apr27 |
| 100 | 81.70 | €37,927 | Sparksee 5.1.1 | 2*(Xeon 2630v3 8-core 2.4GHz) 256GB RAM | Sparsity Technologies SA | 2015Apr27 |
| 300 | 635 | €20,212 | Virtuoso 07.50.3213v7fasttrack | 2*(Xeon 2630 6-core 2.4GHz) 192GB RAM | OpenLink Software | 2015Apr27 |
The Social Network Benchmark (SNB) consists of a data generator that generates a synthetic social network, used in three workloads: Interactive, Business Intelligence and Graph Analytics. Currently, only the Interactive Workload has been released in draft stage. A preview of the read-only part of the Business Intelligence Workload is also available.
The main SNB components are:
- https://github.com/ldbc/ldbc_snb_docs The SNB benchmark specification document
- https://github.com/ldbc/ldbc_snb_datagen The data generator exploits parallelism via Hadoop, so you can generate very large datasets on cluster hardware. Even if you do not possess a cluster, you can use a local install of Hadoop in pseudo-distributed mode to take advantage of multi-core parallelism on a single computer.
- https://github.com/ldbc/ldbc_driver The query driver is used to generate the Interactive Workload, which consists of concurrent inserts and read queries in a certain mix. This program is parallel, though currently only multi-core; a cluster version will be added later. Generating the inserts in parallel is not trivial, since the graph structure is complex and we do not want to insert, e.g., a post before its author has registered, or a friendship before both users exist (otherwise, referential integrity constraints might be violated). The query driver keeps track of the progress of all its parallel clients generating inserts and synchronizes them where necessary.
- https://github.com/ldbc/ldbc_snb_interactive_vendors Vendor-specific driver implementations for the Interactive Workload, provided as examples.
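The synchronization problem described for the query driver can be illustrated with a small sketch. This is not the ldbc_driver implementation; it is a minimal illustration under stated assumptions: each insert event carries an explicit list of the events it depends on, every dependency has a strictly earlier timestamp, and the names `run_inserts` and `apply_insert` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_inserts(events, apply_insert, workers=4):
    """Apply insert events in parallel while respecting their dependencies.

    `events` is a list of (event_id, timestamp, depends_on) tuples.
    Assumption: every dependency has a strictly earlier timestamp than the
    events that depend on it, so submitting in timestamp order cannot deadlock.
    """
    ordered = sorted(events, key=lambda e: e[1])
    done = {event_id: threading.Event() for event_id, _, _ in ordered}

    def task(event_id, depends_on):
        for dep in depends_on:
            done[dep].wait()  # block until the prerequisite insert has run
        try:
            apply_insert(event_id)
        finally:
            # Always signal, so a failed insert cannot hang its dependents;
            # the failure itself is re-raised by result() below.
            done[event_id].set()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task, eid, deps) for eid, _, deps in ordered]
    for future in futures:
        future.result()


# Hypothetical event stream: a friendship needs both persons, a post its author.
events = [
    ("person:1", 1, []),
    ("person:2", 2, []),
    ("knows:1-2", 3, ["person:1", "person:2"]),
    ("post:1", 4, ["person:1"]),
]
log = []
run_inserts(events, log.append, workers=3)
print(log)  # order may vary, but dependencies always come first
```

The thread pool hands out tasks in submission order, so the earliest unfinished event always has its prerequisites satisfied. A real driver would also have to pace events according to their timestamps, which this sketch leaves out.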
The SNB data generator is a further development of the S3G2 correlated graph generator. It generates a social network with power-law structure that additionally has correlations between values and correlations between structure and values. An example of the former (correlations between attribute values) is that people from a certain country have a distribution of first and last names in which the names typical for that country are more prevalent. An example of correlations between structure and values is that people who studied at the same university or share an interest are more likely to be friends. Most of the data volume in the social network is not in the friends graph, but in the posts. These posts contain plausible topic-centered textual data taken from DBpedia: the participants in a discussion in effect read DBpedia pages to each other, paragraph by paragraph. The topics of a discussion are skewed towards the interests of the forum owner (and hence also correlated). In all, this data generator is state-of-the-art and has been used, e.g., in the SIGMOD 2014 programming contest, which focused on graph analytics.
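The two kinds of correlation can be sketched in a few lines. The following is a toy illustration only, not the S3G2 or SNB generator algorithm: it ignores the power-law degree distribution, and the dictionaries, probabilities and function names are all hypothetical.

```python
import random

# Hypothetical toy dictionaries; a real generator draws on much larger ones.
NAMES_BY_COUNTRY = {
    "Germany": ["Hans", "Karl", "Anna"],
    "Spain": ["Carlos", "Maria", "Jose"],
}
ALL_NAMES = [name for names in NAMES_BY_COUNTRY.values() for name in names]
UNIVERSITIES = ["U1", "U2", "U3", "U4"]


def make_person(rng, person_id, p_typical=0.9):
    """Value-value correlation: the name distribution depends on the country."""
    country = rng.choice(sorted(NAMES_BY_COUNTRY))
    # With probability p_typical pick a name typical for the country,
    # otherwise pick from the global pool.
    pool = NAMES_BY_COUNTRY[country] if rng.random() < p_typical else ALL_NAMES
    return {
        "id": person_id,
        "country": country,
        "name": rng.choice(pool),
        "university": rng.choice(UNIVERSITIES),
    }


def make_friendships(rng, persons, p_same=0.3, p_other=0.02):
    """Structure-value correlation: a shared university makes an edge likelier."""
    edges = []
    for i, a in enumerate(persons):
        for b in persons[i + 1:]:
            p = p_same if a["university"] == b["university"] else p_other
            if rng.random() < p:
                edges.append((a["id"], b["id"]))
    return edges


rng = random.Random(42)
persons = [make_person(rng, i) for i in range(200)]
edges = make_friendships(rng, persons)
university = {p["id"]: p["university"] for p in persons}
same = sum(university[a] == university[b] for a, b in edges)
print(f"{same} of {len(edges)} friendships join people from the same university")
```

Only about a quarter of all person pairs share a university in this setup, yet most generated friendships fall within a university, which is the kind of structure a uniformly random graph lacks.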
After data generation, two post-processing steps are performed:
- splitting the generated data at a timepoint: all interactions that took place before that time will be bulk-loaded, whereas everything after it will be inserted online as part of the Interactive Workload. The query driver inserts the events in parallel, yet ensures referential integrity, i.e. before adding a friendship edge, it first makes sure that the friend already exists in the network. Note that it is not trivial to guarantee this with parallel client sessions and a complex graph structure (one cannot partition the workload among the clients without still having dependencies).
- analyzing the generated data in order to derive parameter bindings for the read queries. In order to produce comparable query plans and query runtimes for the same query with different parameters, we actively look for parameters that lead to similarly sized (intermediate) query results. This step is necessary to keep query behavior understandable even though the graph structure of the SNB is complex and irregular and the data distributions are skewed and correlated.
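Both post-processing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual LDBC tooling: events are plain dictionaries with a `time` field, an event exactly at the cutoff is treated as part of the online stream, and parameter selection is approximated by a sliding window over precomputed result sizes. The names `split_at` and `pick_similar` are hypothetical.

```python
def split_at(events, cutoff):
    """Split events into a bulk-load part and a time-ordered update stream."""
    bulk = [e for e in events if e["time"] < cutoff]
    stream = sorted((e for e in events if e["time"] >= cutoff),
                    key=lambda e: e["time"])
    return bulk, stream


def pick_similar(sizes, k):
    """Pick k parameter candidates whose result sizes are closest together.

    `sizes` maps a candidate (e.g. a person id) to the size of the
    (intermediate) result it would produce. The heuristic sorts candidates
    by size and returns the window of k with the smallest max-min spread.
    """
    if not 0 < k <= len(sizes):
        raise ValueError("k must be between 1 and the number of candidates")
    ranked = sorted(sizes.items(), key=lambda kv: kv[1])
    best = min(range(len(ranked) - k + 1),
               key=lambda i: ranked[i + k - 1][1] - ranked[i][1])
    return [candidate for candidate, _ in ranked[best:best + k]]


# Hypothetical events and result sizes.
events = [{"id": "a", "time": 5}, {"id": "b", "time": 20}, {"id": "c", "time": 12}]
bulk, stream = split_at(events, cutoff=10)
print([e["id"] for e in bulk], [e["id"] for e in stream])  # ['a'] ['c', 'b']

sizes = {"p1": 12, "p2": 950, "p3": 1010, "p4": 990, "p5": 40000, "p6": 15}
print(pick_similar(sizes, 3))  # ['p2', 'p4', 'p3']
```

Running the same query with `p1` and `p5` would compare a result of 12 rows with one of 40000, so the measured runtimes would say more about the parameter than about the system; choosing from a narrow size band avoids that.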
LDBC came out of an EU FP7 project and is now a non-profit organization sustained by its members and sponsored by Oracle Labs and IBM.
email: info AT ldbcouncil DOT org