Fifth Workshop on Big Data Benchmarking (5th WBDB) August 5-6, 2014.

The Fifth Workshop on Big Data Benchmarking (5th WBDB) August 5-6, 2014, Potsdam, Germany:

– Videos of all talks.


Program

TUESDAY, AUGUST 5, 2014

SESSION 1, Chair: Chaitan Baru
0900–0915 Welcome & Introduction to WBDB – Chaitan Baru, Kai Sachs, Matthias Uflacker (San Diego Supercomputer Center; SAP; HPI)
0915–1015 Keynote Talk: An Approach to Benchmarking Industrial Big Data Applications [abstract] [video] – Umesh Dayal (Hitachi)
1015–1050 In-Memory Processing in Healthcare and Life Sciences – Dominik Bertram (SAP)
1050–1115 BREAK

SESSION 2, Chair: Tilmann Rabl
1115–1145 TPCx-HS: Industry’s First Standard for Benchmarking Big Data Systems – Raghunath Nambiar (Cisco)
1145–1215 Benchmarking SQL-on-Hadoop Systems: TPC or not TPC? – Avrilia Floratou, Fatma Ozcan and Berni Schiefer (IBM)
1215–1245 Extending The OLTP-Bench Framework for Big Data Systems – Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino and Philippe Cudré-Mauroux (U. of Fribourg, CMU, Microsoft)
1245–1345 LUNCH

SESSION 3, Chair: Raghu Nambiar
1345–1425 LDBC: Linked Data Benchmark Council – Andrey Gubichev (TU Munich)
1425–1445 SQL on Hadoop Benchmark – Nitin Guleria, Mei Ya Chan and Maryam Samizadeh (University of Toronto)
1445–1505 Benchmarking Virtualized Hadoop Clusters – Todor Ivanov, Roberto Zicari and Alejandro Buchmann (Goethe Universität Frankfurt am Main, Technische Universität Darmstadt)
1505–1530 Towards A Complete BigBench Implementation – Tilmann Rabl, Michael Frank, Manuel Danisch, Bhaskar Gowda and Hans-Arno Jacobsen (University of Toronto, bankmark, Intel)
1545–1600 BREAK

SESSION 4
1600–1700 Discussion – BigBench and other benchmarks (breakout groups)

WORKSHOP DINNER
1715–2215 Boat Ride, Reception, and Dinner – Venue: Schloss Glienicke

WEDNESDAY, AUGUST 6, 2014

SESSION 5, Chair: Kai Sachs
0900–0950 Keynote Talk: A TU Delft Perspective on Benchmarking Big Data in the Data Center – Alexandru Iosup (TU Delft)
0950–1020 BW-EML SAP Standard Application Benchmark – Heiko Gerwens and Tobias Kutning (SAP)
1020–1050 FoodBroker – Generating Synthetic Datasets for Graph-Based Business Analytics – André Petermann, Martin Junghanns, Robert Müller and Erhard Rahm (University of Leipzig)
1050–1115 BREAK

SESSION 6, Chair: Enno Folkerts
1115–1140 And all of a sudden: Main Memory Is Less Expensive Than Disk – Martin Boissier, Carsten Meyer, Matthias Uflacker and Christian Tinnefeld (HPI)
1140–1200 Big Data and the Network – Eyal Gutkind (Mellanox)
1200–1230 PopulAid: In-Memory Data Generation for Customized Benchmarks – Ralf Teusner, Michael Perscheid, Malte Appeltauer, Jonas Enderlein, Thomas Klingbeil and Michael Kusber (HPI, SAP ICP)
1230–1340 LUNCH

SESSION 7, Chair: Matthias Uflacker
1340–1410 Benchmarking Elastic Query Processing on Big Data – Dimitri Vorona, Florian Funke, Alfons Kemper and Thomas Neumann (TU Munich)
1410–1500 Benchmarking IoT – Ashok Joshi, Raghunath Nambiar and Michael Brey (Oracle, Cisco)
1500–1530 BREAK

SESSION 8
1530–1645 Discussions – Charter and Agenda for the SPEC RG (breakout groups)
1645–1700 Wrap Up

Keynote Speakers

An Approach to Benchmarking Industrial Big Data Applications

Umesh Dayal – Vice-President and Senior Fellow, Big Data Lab, Hitachi America Ltd.

Through the increasing use of interconnected sensors, instrumentation, and smart machines, and the proliferation of social media and other open data, industrial operations and physical systems are generating ever increasing volumes of data of many different types. At the same time, advances in computing, storage, communications, and big data technologies are making it possible to store, process, and analyze enormous volumes of data at scale and at speed.  The convergence of Operations Technology (OT) and Information Technology (IT), powered by innovative analytics, holds the promise of using insights derived from these rich types of data to better manage our systems, resources, environment, health, social infrastructure, and industrial operations. Opportunities to apply innovative analytics abound in many industries (manufacturing, power distribution, oil and gas exploration and production, telecommunications, healthcare, agriculture, mining, to name a few) and similarly in Government (homeland security, smart cities, transportation, accountable care). In developing several such applications over the years, we have come to realize that extant benchmarks for decision support, streaming data, or event processing are not adequate for industrial big data applications, because they do not reflect the range of data and analytics processing characteristic of such applications. In this talk, we will outline an approach we are taking to define a benchmark that is motivated by a typical industrial operations scenario. We will describe the main issues we are considering for the benchmark, including the typical data and processing requirements; representative queries and analytics operations over a mix of streaming and stored, structured and unstructured data; and a system architecture.


A TU Delft Perspective on Benchmarking Big Data in the Data Center

Alexandru Iosup – Delft University of Technology, Delft, the Netherlands

Big Data – loosely defined as the processing and preservation of data that may be too high-volume, volatile, or varied for regular data management systems – has become a topic of interest for a variety of domains, such as e-Government, e-Science, and online gaming. Big Data is the outcome of more and larger living labs, more demanding and culturally diverse customers, the advent of big science, and the almost complete automation of many large-scale processes. To cope with the data deluge, we have already started to build complex hardware and software ecosystems. A hundred flowers bloomed and continue to do so, which is promising for the field, but poses significant challenges in selecting an adequate ecosystem and in tuning it for in-house workloads. These challenges would greatly benefit from an understanding of the performance of these ecosystems. In contrast to other data processing fields, notably traditional databases, there is no common benchmarking approach for Big Data. We propose that detailed performance evaluation and modeling focused on specific application domains, crystallized later on in benchmarks, could help address this situation. In this presentation we focus on three important topics in Big Data processing: elastic Big Data processing, graph processing, and time-based analytics. For each, we propose a method for evaluating the performance of Big Data processing platforms, and apply the method in real-world experiments on multi-year, multi-TB data sets. We show unique quantitative information and comparative results. We also show that even relatively small datasets can pose significant challenges to today’s Big Data processing tools when the processing toolchain is complex, and posit that the “next V” in Big Data processing could be the vicissitudes of complex processing.


Accepted Presentations

And all of a sudden: Main Memory Is Less Expensive Than Disk

Martin Boissier, Carsten Meyer, Matthias Uflacker and Christian Tinnefeld

Until today, the common wisdom for storage has been: storing data in main memory is more expensive than storing it on disk. While this is true for the price per byte, the picture looks different for the price per bandwidth. For data-driven applications with high throughput demands, I/O bandwidth can easily become the major bottleneck. Comparing the costs of different storage media in relation to bandwidth requirements shows that the old wisdom of inexpensive disks and expensive main memory is no longer valid in every situation. The higher the bandwidth requirements become, the more cost-effective main memory is. And all of a sudden: main memory is less expensive than disk.

In this paper we argue that upcoming database workloads will have increasing bandwidth requirements and thus favor in-memory databases, as they are less expensive. We discuss mixed enterprise workloads in comparison to traditional transactional workloads and show with a simple cost evaluation that main memory databases can incur a lower total cost of ownership than their disk-based counterparts.
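
This back-of-the-envelope argument can be made concrete. The sketch below compares price per byte with price per unit of bandwidth for disk and DRAM; all prices and hardware figures are hypothetical assumptions chosen for illustration, not numbers from the paper.

    # Illustrative cost-per-bandwidth comparison. All prices and hardware
    # figures below are hypothetical assumptions, not data from the paper.
    media = {
        "HDD":  {"price_usd": 100.0, "capacity_gb": 4000, "bandwidth_gbps": 0.15},
        "DRAM": {"price_usd": 800.0, "capacity_gb": 64,   "bandwidth_gbps": 10.0},
    }

    for name, m in media.items():
        per_gb = m["price_usd"] / m["capacity_gb"]
        per_bandwidth = m["price_usd"] / m["bandwidth_gbps"]
        print(f"{name}: ${per_gb:.3f} per GB, ${per_bandwidth:.0f} per GB/s")

    # With these assumed numbers DRAM costs roughly 500x more per byte, but
    # roughly 8x less per GB/s of bandwidth: once a workload is
    # bandwidth-bound, main memory can be the cheaper medium.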


Benchmarking Elastic Query Processing on Big Data

Dimitri Vorona, Florian Funke, Alfons Kemper and Thomas Neumann

Existing analytical query benchmarks, such as TPC-H, often assess database system performance on on-premises hardware installations. On the other hand, benchmarks for cloud-based analytics measure elasticity, but often focus on simpler queries and semi-structured data. With our benchmark draft we attempt to bridge the gap by challenging analytical platforms to answer complex queries on structured business data leveraging the flexible infrastructure of the cloud.


Benchmarking IoT

Ashok Joshi, Raghunath Nambiar and Michael Brey

The Internet of Things (IoT) is the network of physical objects accessed through the Internet, as defined by technology analysts and visionaries. These objects contain embedded technology to interact with internal states or the external environment. In other words, when objects can sense and communicate, it changes how and where decisions are made, and who makes them.

This paper looks into the “benchmarking” aspects of the Internet of Things (IoT).


Benchmarking SQL-on-Hadoop Systems: TPC or not TPC?

Avrilia Floratou, Fatma Ozcan and Berni Schiefer

Benchmarks are important tools for evaluating systems, as long as their results are clear, transparent, and reproducible, and they are conducted with candor and due diligence. Today, many vendors of SQL-on-Hadoop products use the data generators and the queries of existing TPC benchmarks but fail to adhere to the rules, producing results that are neither transparent nor reproducible. As the SQL-on-Hadoop movement continues to gain traction, it is important to bring some order to this “wild west” of benchmarking. First, everyone should agree on the rules. On that front, new rules and policies should be defined to satisfy the demands of the new generation of SQL systems. The new benchmark evaluation schemes should be cheap, effective, and robust enough to embrace the variety of SQL-on-Hadoop systems and their corresponding vendors. Second, existing TPC benchmarks may not be sufficient to evaluate the features and performance of these systems. In this paper, we discuss the problems we observe in current benchmarking practices and argue that if we want to bring standardization to this space, all SQL-on-Hadoop vendors should reach an agreement on benchmarking rules and processes and should adhere to them when conducting experiments and publishing performance results.


Benchmarking Virtualized Hadoop Clusters

Todor Ivanov, Roberto Zicari and Alejandro Buchmann

This work investigates the performance of Big Data applications in virtualized Hadoop environments. We evaluate and compare the performance of applications running on a virtualized Hadoop cluster with separated data and computation layers against a standard Hadoop installation. Our experiments show that computation-intensive (i.e., CPU-bound) workloads perform up to 43% better on a Data-Compute Hadoop cluster than on a standard Hadoop installation.


BW-EML SAP Standard Application Benchmark

Heiko Gerwens and Tobias Kutning

The focus of this presentation is the latest addition to the BW SAP Standard Application Benchmarks: the BW-EML benchmark. The benchmark was developed as a modern successor to the previous BW benchmarks. With near-real-time and ad-hoc reporting capabilities on big data volumes, the BW-EML benchmark matches the demands of modern business warehouse customers. The development of the benchmark faced the challenge of two conflicting goals. On the one hand, reproducibility of benchmark results is a key requirement; on the other hand, variability in the query workload is necessary to reflect the requirements of ad-hoc reporting. The presentation gives insight into how these conflicting goals were reconciled in the BW-EML benchmark.
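
One common way to reconcile reproducibility with ad-hoc variability (a general technique, not necessarily the mechanism BW-EML uses) is to draw query parameters from a pseudo-random generator with a fixed seed: every run issues varied queries, yet any run can be replayed exactly. A minimal sketch, with hypothetical dimension and table names:

    import random

    # Seeded generator: varied, "ad-hoc" query parameters that are still
    # exactly reproducible by rerunning with the same seed. Dimension and
    # table names are hypothetical, not BW-EML's actual schema.
    def query_stream(seed, n):
        rng = random.Random(seed)
        dimensions = ["region", "product_group", "fiscal_quarter"]
        for _ in range(n):
            dim = rng.choice(dimensions)
            year = rng.randint(2008, 2013)
            yield (f"SELECT {dim}, SUM(revenue) FROM sales "
                   f"WHERE year = {year} GROUP BY {dim}")

    for q in query_stream(seed=42, n=3):
        print(q)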


Extending The OLTP-Bench Framework for Big Data Systems

Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino and Philippe Cudré-Mauroux

Efforts to build novel systems that cope with pressing big data challenges are often hindered by the need to reinvent the benchmarking wheel. In fact, researchers and developers alike are still limited to a small number of workloads, typically inadequate for their specific use case, and they often spend an unnecessary amount of time defining and implementing a new benchmark to showcase their solution. This is due to the lack of a universal and extensible benchmarking infrastructure. In this talk, we present OLTP-Bench, an extensible “batteries included” DBMS benchmarking testbed that aims at facilitating the integration of new ad-hoc benchmarks. OLTP-Bench is a configurable workload driver, allowing precise control of the desired transaction rate, mixture, and workload skew during an experiment. Moreover, OLTP-Bench facilitates the process of running and documenting benchmarking experiments thanks to a set of utilities for running distributed clients and monitoring tools. We report on our experience building OLTP-Bench, porting fifteen popular benchmarks, and running them in the cloud.
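
The three knobs named here (rate, mixture, skew) are easy to picture in code. The sketch below is a conceptual driver loop under invented names; it illustrates the idea only and is not OLTP-Bench's actual API.

    import random
    import time

    # Conceptual workload driver: transaction mixture, target rate, and
    # skewed key access. All names are illustrative assumptions.
    TX_MIX = {"new_order": 0.45, "payment": 0.43, "order_status": 0.12}
    RATE_TPS = 100     # target transaction rate (transactions/second)
    N_KEYS = 10_000    # size of the key space

    def pick_transaction(rng):
        # Weighted choice implementing the transaction mixture.
        return rng.choices(list(TX_MIX), weights=TX_MIX.values())[0]

    def run(duration_s, seed=7):
        rng = random.Random(seed)
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            tx = pick_transaction(rng)
            # Heavy-tailed (skewed) key selection: a few keys are hot.
            key = min(int(rng.paretovariate(1.2)), N_KEYS)
            # submit(tx, key)  # hypothetical hook into the system under test
            time.sleep(1.0 / RATE_TPS)  # crude fixed-rate pacing

    run(duration_s=1)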


FoodBroker – Generating Synthetic Datasets for Graph-Based Business Analytics

André Petermann, Martin Junghanns, Robert Müller and Erhard Rahm

We present FoodBroker, a new data generator for benchmarking graph-based business intelligence systems and approaches. It covers two realistic business processes and their involved master and transactional data objects. The interactions are correlated in controlled ways to enable non-uniform distributions for data and relationships. The generated dataset can be arbitrarily scaled and allows comprehensive graph- and pattern-based analysis.
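
The central idea, correlating interactions so that data and relationship distributions come out non-uniform, can be sketched in a few lines. The entity names and the "quality" attribute below are illustrative assumptions, not FoodBroker's actual model.

    import random

    rng = random.Random(1)

    # Master data: each customer carries a latent quality score that later
    # influences transactional outcomes (attribute is illustrative only).
    customers = [{"id": c, "quality": rng.random()} for c in range(100)]

    # Transactional data: quotation confirmation depends on the customer's
    # quality, producing controlled, non-uniform outcome distributions and
    # edges linking master and transactional objects.
    quotations = []
    for q in range(1000):
        cust = rng.choice(customers)
        confirmed = rng.random() < 0.2 + 0.6 * cust["quality"]
        quotations.append({"id": q, "customer": cust["id"], "confirmed": confirmed})

    share = sum(q["confirmed"] for q in quotations) / len(quotations)
    print(f"confirmed quotations: {share:.0%}")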


PopulAid: In-Memory Data Generation for Customized Benchmarks

Ralf Teusner, Michael Perscheid, Malte Appeltauer, Jonas Enderlein, Thomas Klingbeil and Michael Kusber

During software development, it is often necessary to access real customer data in order to validate requirements and performance thoroughly. However, company and legal policies often restrict access to such sensitive information. Without real data, developers have to either create their own customized test data manually or rely on standardized benchmarks. While the former tends to lack scalability and edge cases, the latter solves these issues but cannot reflect a company’s production data distributions.

In this paper, we propose PopulAid, a tool that allows developers to create customized benchmarks. We offer a convenient data generator that incorporates specific characteristics of real-world applications to generate synthetic data. This way, companies do not need to reveal sensitive data, yet developers have access to realistic development artifacts. We demonstrate our approach by generating a customized benchmark with medical information for the development of SAP’s healthcare solution.
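
A generator of this kind can be thought of as a set of per-column value sources tuned to characteristics observed in the real data. The sketch below illustrates that idea; the column definitions and frequencies are invented for illustration and are neither PopulAid's actual API nor real medical data.

    import random

    rng = random.Random(3)

    # Per-column generators mimicking characteristics of production data,
    # e.g. the empirical frequency of diagnosis codes (all values invented).
    columns = {
        "patient_id": lambda: rng.randint(1, 10**6),
        "age":        lambda: max(0, int(rng.gauss(52, 18))),
        "diagnosis":  lambda: rng.choices(
            ["I10", "E11", "J45", "M54"], weights=[40, 30, 20, 10])[0],
    }

    def generate_rows(n):
        return [{name: gen() for name, gen in columns.items()} for _ in range(n)]

    for row in generate_rows(3):
        print(row)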


SQL on Hadoop Benchmark

Nitin Guleria, Mei Ya Chan and Maryam Samizadeh

Big data is an area of considerable interest to industry, academia, and a large user base. Various products have emerged to store and analyze such data. As these products are diverse, there is a need to evaluate and compare their performance, which is possible through benchmarking.

In this paper, we present the benchmarking of three popular SQL on Hadoop systems: Shark, Presto and Impala. The benchmark covers a data model demonstrating the volume aspect of big data systems containing structured data. The data model is based on StackOverflow data dumps. The workload is modelled around a set of queries against the data model.

We illustrate the results of benchmarking and comparing the SQL-like query engines Shark, Presto, and Impala on Hadoop, using Oracle VirtualBox in a single-node Linux setup. We extract, transform, and load the StackOverflow data dumps, run the workload by executing the queries on all three systems, and evaluate the response time of each query.
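
To give a flavor of such a workload, the snippet below shows a representative query over a StackOverflow-style schema and how its response time might be measured. The table and column names are assumptions, not necessarily the paper's actual queries, and the cursor stands for any DB-API-style client to Shark, Presto, or Impala.

    import time

    # Hypothetical workload query over a StackOverflow-style schema.
    QUERY = """
        SELECT u.display_name, COUNT(*) AS answers, AVG(p.score) AS avg_score
        FROM posts p JOIN users u ON p.owner_user_id = u.id
        WHERE p.post_type = 'answer'
        GROUP BY u.display_name
        ORDER BY answers DESC
        LIMIT 10
    """

    def run_and_time(cursor, sql):
        # Wall-clock response time for one query, measured the way the
        # benchmark evaluates each query on each system under test.
        start = time.perf_counter()
        cursor.execute(sql)
        cursor.fetchall()
        return time.perf_counter() - start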


The Emergence of Modified Hadoop online based MapReduce technology in Cloud Environment

Shaikh Muhammad Allayear, Mohammad Salahuddin, Delwar Hossain and Park Sung Soon

As the web, social networking, and smartphone applications have become popular, data has been growing drastically every day; such data is called big data. The exponential growth of data first presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, Microsoft, Facebook, and Twitter. Data volumes to be processed by cloud applications are growing much faster than computing power, and this growth demands new strategies for processing and analyzing information. Hadoop MapReduce has become a powerful computation model that addresses these problems. MapReduce is a programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing for search engines and machine learning. Earlier versions of Hadoop MapReduce have several performance problems, such as the connection of map to reduce tasks, data overload, and time consumption. In this paper, we propose a modified MapReduce architecture, the MapReduce Agent (MRA), that resolves these performance problems. MRA can reduce completion time, improve system utilization, and give better performance. MRA uses multiple connections and handles error recovery with a Q-chained load-balancing system. We also discuss various applications and implementations of the MapReduce programming model in cloud environments.
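
The two-function interface is easiest to see on the canonical word-count example. The sketch below simulates the model in plain Python; the real Hadoop framework runs the same map, shuffle/sort, and reduce phases distributed across a cluster.

    from itertools import groupby
    from operator import itemgetter

    # Word count in the MapReduce model: map emits (key, value) pairs, the
    # framework groups them by key, and reduce folds each group.
    def map_fn(line):
        for word in line.split():
            yield (word.lower(), 1)

    def reduce_fn(word, counts):
        return (word, sum(counts))

    def mapreduce(lines):
        pairs = [kv for line in lines for kv in map_fn(line)]   # map phase
        pairs.sort(key=itemgetter(0))                           # shuffle/sort
        return [reduce_fn(key, [v for _, v in group])           # reduce phase
                for key, group in groupby(pairs, key=itemgetter(0))]

    print(mapreduce(["big data big cluster", "big data benchmarking"]))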


Towards A Complete BigBench Implementation

Tilmann Rabl, Michael Frank, Manuel Danisch, Bhaskar Gowda and Hans-Arno Jacobsen

BigBench was the first proposal for an end-to-end big data analytics benchmark. It features a set of 30 realistic queries based on real big data use cases. In this paper, we present updates on our development of a complete implementation on the Hadoop ecosystem, focusing on the changes we have made to the data set, scaling, the refresh process, and the metric.


TPCx-HS: Industry’s First Standard for Benchmarking Big Data Systems

Raghunath Nambiar

The designation “Big Data” has become a mainstream buzz phrase across many industries as well as research circles. Big Data was identified as one of the top areas for benchmark development at the most recent TPC Technology Conference on Performance Evaluation and Benchmarking. However, today many vendors are making performance claims that are not easily verifiable in the absence of a neutral industry-wide benchmark.

With this in mind, the TPC has created a Big Data Working Group (TPC-BDWG) tasked with developing industry standards for benchmarking Big Data systems. The Workshop Series on Big Data Benchmarking (WBDB) has significantly influenced the TPC’s direction in developing a set of standards for benchmarking Big Data systems.

The first benchmark from the TPC is TPCx-HS, designed to stress both hardware and software, including the Hadoop run-time, Hadoop Filesystem API compatible systems, and MapReduce layers. TPCx-HS can be used to assess a broad range of system topologies and implementation methodologies of Hadoop clusters in a technically rigorous, directly comparable, and vendor-neutral manner.
