Fifth Workshop on Big Data Benchmarking (5th WBDB) August 5-6, 2014.

The Fifth Workshop on Big Data Benchmarking (5th WBDB) August 5-6, 2014, Potsdam, Germany:

– Videos of all talks.


Program

TUESDAY, AUGUST 5, 2014

SESSION 1, Chair: Chaitan Baru
0900–0915 Welcome & Introduction to WBDB – Chaitan Baru, Kai Sachs, Matthias Uflacker (San Diego Supercomputer Center; SAP; HPI)
0915–1015 Keynote Talk: An Approach to Benchmarking Industrial Big Data Applications [abstract] [video] – Umesh Dayal (Hitachi)
1015–1050 In-Memory Processing in Healthcare and Life Sciences – Dominik Bertram (SAP)
1050–1115 BREAK

SESSION 2, Chair: Tilmann Rabl
1115–1145 TPCx-HS: Industry’s First Standard for Benchmarking Big Data Systems – Raghunath Nambiar (Cisco)
1145–1215 Benchmarking SQL-on-Hadoop Systems: TPC or not TPC? – Avrilia Floratou, Fatma Ozcan and Berni Schiefer (IBM)
1215–1245 Extending The OLTP-Bench Framework for Big Data Systems – Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino and Philippe Cudré-Mauroux (U. of Fribourg, CMU, Microsoft)
1245–1345 LUNCH

SESSION 3, Chair: Raghu Nambiar
1345–1425 LDBC: Linked Data Benchmark Council – Andrey Gubichev (TU Munich)
1425–1445 SQL on Hadoop Benchmark – Nitin Guleria, Mei Ya Chan and Maryam Samizadeh (University of Toronto)
1445–1505 Benchmarking Virtualized Hadoop Clusters – Todor Ivanov, Roberto Zicari and Alejandro Buchmann (Goethe Universität Frankfurt am Main, Technische Universität Darmstadt)
1505–1530 Towards A Complete BigBench Implementation – Tilmann Rabl, Michael Frank, Manuel Danisch, Bhaskar Gowda and Hans-Arno Jacobsen (University of Toronto, bankmark, Intel)
1545–1600 BREAK

SESSION 4
1600–1700 Discussion – BigBench and other benchmarks (breakout groups)

WORKSHOP DINNER
1715–2215 Boat Ride, Reception, and Dinner – Venue: Schloss Glienicke

WEDNESDAY, AUGUST 6, 2014

SESSION 5, Chair: Kai Sachs
0900–0950 Keynote Talk: A TU Delft Perspective on Benchmarking Big Data in the Data Center – Alexandru Iosup (TU Delft)
0950–1020 BW-EML SAP Standard Application Benchmark – Heiko Gerwens and Tobias Kutning (SAP)
1020–1050 FoodBroker – Generating Synthetic Datasets for Graph-Based Business Analytics – André Petermann, Martin Junghanns, Robert Müller and Erhard Rahm (University of Leipzig)
1050–1115 BREAK

SESSION 6, Chair: Enno Folkerts
1115–1140 And all of a sudden: Main Memory Is Less Expensive Than Disk – Martin Boissier, Carsten Meyer, Matthias Uflacker and Christian Tinnefeld (HPI)
1140–1200 Big Data and the Network – Eyal Gutkind (Mellanox)
1200–1230 PopulAid: In-Memory Data Generation for Customized Benchmarks – Ralf Teusner, Michael Perscheid, Malte Appeltauer, Jonas Enderlein, Thomas Klingbeil and Michael Kusber (HPI, SAP ICP)
1230–1340 LUNCH

SESSION 7, Chair: Matthias Uflacker
1340–1410 Benchmarking Elastic Query Processing on Big Data – Dimitri Vorona, Florian Funke, Alfons Kemper and Thomas Neumann (TU Munich)
1410–1500 Benchmarking IoT – Ashok Joshi, Raghunath Nambiar and Michael Brey (Oracle, Cisco)
1500–1530 BREAK

SESSION 8
1530–1645 Discussions – Charter and Agenda for the SPEC RG (breakout groups)
1645–1700 Wrap Up

Keynote Speakers

An Approach to Benchmarking Industrial Big Data Applications

Umesh Dayal – Vice-President and Senior Fellow, Big Data Lab, Hitachi America Ltd.

Through the increasing use of interconnected sensors, instrumentation, and smart machines, and the proliferation of social media and other open data, industrial operations and physical systems are generating ever increasing volumes of data of many different types. At the same time, advances in computing, storage, communications, and big data technologies are making it possible to store, process, and analyze enormous volumes of data at scale and at speed.  The convergence of Operations Technology (OT) and Information Technology (IT), powered by innovative analytics, holds the promise of using insights derived from these rich types of data to better manage our systems, resources, environment, health, social infrastructure, and industrial operations. Opportunities to apply innovative analytics abound in many industries (manufacturing, power distribution, oil and gas exploration and production, telecommunications, healthcare, agriculture, mining, to name a few) and similarly in Government (homeland security, smart cities, transportation, accountable care). In developing several such applications over the years, we have come to realize that extant benchmarks for decision support, streaming data, or event processing are not adequate for industrial big data applications, because they do not reflect the range of data and analytics processing characteristic of such applications. In this talk, we will outline an approach we are taking to define a benchmark that is motivated by a typical industrial operations scenario. We will describe the main issues we are considering for the benchmark, including the typical data and processing requirements; representative queries and analytics operations over a mix of streaming and stored, structured and unstructured data; and a system architecture.


A TU Delft Perspective on Benchmarking Big Data in the Data Center

Alexandru Iosup – Delft University of Technology, Delft, the Netherlands

Big Data – loosely defined as the processing and preservation of data that may be too high-volume, volatile, or varied for regular data management systems – has become a topic of interest for a variety of domains, such as e-Government, e-Science, and online gaming. Big Data is the outcome of more and larger living labs, more demanding and culturally diverse customers, the advent of big science, and the almost complete automation of many large-scale processes. To cope with the data deluge, we have already started to build complex hardware and software ecosystems. A hundred flowers bloomed and continue to do so, which is promising for the field, but poses significant challenges in selecting an adequate ecosystem and in tuning it for in-house workloads. These challenges would greatly benefit from an understanding of the performance of these ecosystems. In contrast to other data processing fields, notably traditional databases, there is no common benchmarking approach for Big Data. We propose that detailed performance evaluation and modeling focused on specific application domains, crystallized later on in benchmarks, could help address this situation. In this presentation we focus on three important topics in Big Data processing: elastic Big Data processing, graph processing, and time-based analytics. For each, we propose a method for evaluating the performance of Big Data processing platforms, and apply the method in real-world experiments on multi-year, multi-TB data sets. We show unique quantitative information and comparative results. We also show that even relatively small datasets can pose significant challenges to today’s Big Data processing tools when the processing toolchain is complex, and posit that the “next V” in Big Data processing could be the vicissitudes of complex processing.


Accepted Presentations

And all of a sudden: Main Memory Is Less Expensive Than Disk

Martin Boissier, Carsten Meyer, Matthias Uflacker and Christian Tinnefeld

Until today, the common wisdom for storage has been: storing data in main memory is more expensive than storing it on disk. While this is true for the price per byte, the picture looks different for the price per bandwidth. For data-driven applications with high throughput demands, I/O bandwidth can easily become the major bottleneck. Comparing the costs of different storage media in relation to bandwidth requirements shows that the old wisdom of inexpensive disks and expensive main memory is no longer valid in every situation. The higher the bandwidth requirements become, the more cost-effective main memory is. And all of a sudden: main memory is less expensive than disk.

In this paper we argue that upcoming database workloads will have increasing bandwidth requirements and thus favor in-memory databases, as they are less expensive. We discuss mixed enterprise workloads in comparison to traditional transactional workloads and show with a simple cost evaluation that main memory databases can incur a lower total cost of ownership than their disk-based counterparts.
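
This back-of-the-envelope argument can be made concrete. The sketch below compares price per byte with price per unit of bandwidth for disk and DRAM; all prices and hardware figures are hypothetical assumptions chosen for illustration, not numbers from the paper.

    # Illustrative cost-per-bandwidth comparison. All prices and hardware
    # figures below are hypothetical assumptions, not data from the paper.
    media = {
        "HDD":  {"price_usd": 100.0, "capacity_gb": 4000, "bandwidth_gbps": 0.15},
        "DRAM": {"price_usd": 800.0, "capacity_gb": 64,   "bandwidth_gbps": 10.0},
    }

    for name, m in media.items():
        per_gb = m["price_usd"] / m["capacity_gb"]
        per_bandwidth = m["price_usd"] / m["bandwidth_gbps"]
        print(f"{name}: ${per_gb:.3f} per GB, ${per_bandwidth:.0f} per GB/s")

    # With these assumed numbers DRAM costs roughly 500x more per byte, but
    # roughly 8x less per GB/s of bandwidth: once a workload is
    # bandwidth-bound, main memory can be the cheaper medium.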


Benchmarking Elastic Query Processing on Big Data

Dimitri Vorona, Florian Funke, Alfons Kemper and Thomas Neumann

Existing analytical query benchmarks, such as TPC-H, often assess database system performance on on-premises hardware installations. On the other hand, benchmarks for cloud-based analytics measure elasticity, but often focus on simpler queries and semi-structured data. With our benchmark draft we attempt to bridge the gap by challenging analytical platforms to answer complex queries on structured business data leveraging the flexible infrastructure of the cloud.


Benchmarking IoT

Ashok Joshi, Raghunath Nambiar and Michael Brey

The Internet of Things (IoT) is the network of physical objects accessed through the Internet, as defined by technology analysts and visionaries. These objects contain embedded technology to interact with internal states or the external environment. In other words, when objects can sense and communicate, it changes how and where decisions are made, and who makes them.

This paper looks into the “benchmarking” aspects of the Internet of Things (IoT).


Benchmarking SQL-on-Hadoop Systems: TPC or not TPC?

Avrilia Floratou, Fatma Ozcan and Berni Schiefer

Benchmarks are important tools for evaluating systems, as long as their results are clear, transparent, and reproducible, and they are conducted with candor and due diligence. Today, many vendors of SQL-on-Hadoop products use the data generators and the queries of existing TPC benchmarks but fail to adhere to the rules, producing results that are neither transparent nor reproducible. As the SQL-on-Hadoop movement continues to gain traction, it is important to bring some order to this “wild west” of benchmarking. First, everyone should agree on the rules. On that front, new rules and policies should be defined to satisfy the demands of the new generation of SQL systems. The new benchmark evaluation schemes should be cheap, effective, and robust enough to embrace the variety of SQL-on-Hadoop systems and their corresponding vendors. Second, existing TPC benchmarks may not be sufficient to evaluate the features and performance of these systems. In this paper, we discuss the problems we observe in current benchmarking practices and argue that if we want to bring standardization to this space, all SQL-on-Hadoop vendors should reach an agreement on benchmarking rules and processes and should adhere to them when conducting experiments and publishing performance results.


Benchmarking Virtualized Hadoop Clusters

Todor Ivanov, Roberto Zicari and Alejandro Buchmann

This work investigates the performance of Big Data applications in virtualized Hadoop environments. We evaluate and compare the performance of applications running on a virtualized Hadoop cluster with separated data and computation layers against a standard Hadoop installation. Our experiments show that computation-intensive (i.e., CPU-bound) workloads perform up to 43% better on a Data-Compute Hadoop cluster than on a standard Hadoop installation.


BW-EML SAP Standard Application Benchmark

Heiko Gerwens and Tobias Kutning

The focus of this presentation is the latest addition to the BW SAP Standard Application Benchmarks: the BW-EML benchmark. The benchmark was developed as a modern successor to the previous BW benchmarks. With near-real-time and ad-hoc reporting capabilities on big data volumes, the BW-EML benchmark matches the demands of modern business warehouse customers. The development of the benchmark faced the challenge of two conflicting goals. On the one hand, reproducibility of benchmark results is a key requirement; on the other hand, variability in the query workload is necessary to reflect the requirements of ad-hoc reporting. The presentation gives insight into how these conflicting goals were reconciled in the BW-EML benchmark.
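
One common way to reconcile reproducibility with ad-hoc variability (a general technique, not necessarily the mechanism BW-EML uses) is to draw query parameters from a pseudo-random generator with a fixed seed: every run issues varied queries, yet any run can be replayed exactly. A minimal sketch, with hypothetical dimension and table names:

    import random

    # Seeded generator: varied, "ad-hoc" query parameters that are still
    # exactly reproducible by rerunning with the same seed. Dimension and
    # table names are hypothetical, not BW-EML's actual schema.
    def query_stream(seed, n):
        rng = random.Random(seed)
        dimensions = ["region", "product_group", "fiscal_quarter"]
        for _ in range(n):
            dim = rng.choice(dimensions)
            year = rng.randint(2008, 2013)
            yield (f"SELECT {dim}, SUM(revenue) FROM sales "
                   f"WHERE year = {year} GROUP BY {dim}")

    for q in query_stream(seed=42, n=3):
        print(q)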


Extending The OLTP-Bench Framework for Big Data Systems

Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino and Philippe Cudré-Mauroux

Efforts to build novel systems that cope with pressing big data challenges are often hindered by the need to reinvent the benchmarking wheel. In fact, researchers and developers alike are still limited to a small number of workloads, typically inadequate for their specific use case, and they often spend an unnecessary amount of time defining and implementing a new benchmark to showcase their solution. This is due to the lack of a universal and extensible benchmarking infrastructure. In this talk, we present OLTP-Bench, an extensible “batteries included” DBMS benchmarking testbed that aims at facilitating the integration of new ad-hoc benchmarks. OLTP-Bench is a configurable workload driver, allowing precise control of the desired transaction rate, mixture, and workload skew during an experiment. Moreover, OLTP-Bench facilitates the process of running and documenting benchmarking experiments thanks to a set of utilities for running distributed clients and monitoring tools. We report on our experience building OLTP-Bench, porting fifteen popular benchmarks, and running them in the cloud.
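
The three knobs named here (rate, mixture, skew) are easy to picture in code. The sketch below is a conceptual driver loop under invented names; it illustrates the idea only and is not OLTP-Bench's actual API.

    import random
    import time

    # Conceptual workload driver: transaction mixture, target rate, and
    # skewed key access. All names are illustrative assumptions.
    TX_MIX = {"new_order": 0.45, "payment": 0.43, "order_status": 0.12}
    RATE_TPS = 100     # target transaction rate (transactions/second)
    N_KEYS = 10_000    # size of the key space

    def pick_transaction(rng):
        # Weighted choice implementing the transaction mixture.
        return rng.choices(list(TX_MIX), weights=TX_MIX.values())[0]

    def run(duration_s, seed=7):
        rng = random.Random(seed)
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            tx = pick_transaction(rng)
            # Heavy-tailed (skewed) key selection: a few keys are hot.
            key = min(int(rng.paretovariate(1.2)), N_KEYS)
            # submit(tx, key)  # hypothetical hook into the system under test
            time.sleep(1.0 / RATE_TPS)  # crude fixed-rate pacing

    run(duration_s=1)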


FoodBroker – Generating Synthetic Datasets for Graph-Based Business Analytics

André Petermann, Martin Junghanns, Robert Müller and Erhard Rahm

We present FoodBroker, a new data generator for benchmarking graph-based business intelligence systems and approaches. It covers two realistic business processes and their involved master and transactional data objects. The interactions are correlated in controlled ways to enable non-uniform distributions for data and relationships. The generated dataset can be arbitrarily scaled and allows comprehensive graph- and pattern-based analysis.
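
The central idea, correlating interactions so that data and relationship distributions come out non-uniform, can be sketched in a few lines. The entity names and the "quality" attribute below are illustrative assumptions, not FoodBroker's actual model.

    import random

    rng = random.Random(1)

    # Master data: each customer carries a latent quality score that later
    # influences transactional outcomes (attribute is illustrative only).
    customers = [{"id": c, "quality": rng.random()} for c in range(100)]

    # Transactional data: quotation confirmation depends on the customer's
    # quality, producing controlled, non-uniform outcome distributions and
    # edges linking master and transactional objects.
    quotations = []
    for q in range(1000):
        cust = rng.choice(customers)
        confirmed = rng.random() < 0.2 + 0.6 * cust["quality"]
        quotations.append({"id": q, "customer": cust["id"], "confirmed": confirmed})

    share = sum(q["confirmed"] for q in quotations) / len(quotations)
    print(f"confirmed quotations: {share:.0%}")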


PopulAid: In-Memory Data Generation for Customized Benchmarks

Ralf Teusner, Michael Perscheid, Malte Appeltauer, Jonas Enderlein, Thomas Klingbeil and Michael Kusber

During software development, it is often necessary to access real customer data in order to validate requirements and performance thoroughly. However, company and legal policies often restrict access to such sensitive information. Without real data, developers have to either create their own customized test data manually or rely on standardized benchmarks. While the former tends to lack scalability and edge cases, the latter solves these issues but cannot reflect a company’s production data distributions.

In this paper, we propose PopulAid, a tool that allows developers to create customized benchmarks. We offer a convenient data generator that incorporates specific characteristics of real-world applications to generate synthetic data. This way, companies do not need to reveal sensitive data, yet developers have access to realistic development artifacts. We demonstrate our approach by generating a customized benchmark with medical information for the development of SAP’s healthcare solution.
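
A generator of this kind can be thought of as a set of per-column value sources tuned to characteristics observed in the real data. The sketch below illustrates that idea; the column definitions and frequencies are invented for illustration and are neither PopulAid's actual API nor real medical data.

    import random

    rng = random.Random(3)

    # Per-column generators mimicking characteristics of production data,
    # e.g. the empirical frequency of diagnosis codes (all values invented).
    columns = {
        "patient_id": lambda: rng.randint(1, 10**6),
        "age":        lambda: max(0, int(rng.gauss(52, 18))),
        "diagnosis":  lambda: rng.choices(
            ["I10", "E11", "J45", "M54"], weights=[40, 30, 20, 10])[0],
    }

    def generate_rows(n):
        return [{name: gen() for name, gen in columns.items()} for _ in range(n)]

    for row in generate_rows(3):
        print(row)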


SQL on Hadoop Benchmark

Nitin Guleria, Mei Ya Chan and Maryam Samizadeh

Big data is an area of considerable interest to industry, academia, and a large user base. Various products have emerged to store and analyze such data. As these products are diverse, there is a need to evaluate and compare their performance, which is possible through benchmarking.

In this paper, we present the benchmarking of three popular SQL on Hadoop systems: Shark, Presto and Impala. The benchmark covers a data model demonstrating the volume aspect of big data systems containing structured data. The data model is based on StackOverflow data dumps. The workload is modelled around a set of queries against the data model.

We illustrate the results of benchmarking and comparing the SQL-like query engines Shark, Presto, and Impala on Hadoop, using Oracle VirtualBox in a single-node Linux setup. We extract, transform, and load the StackOverflow data dumps, run the workload by executing the queries on all three systems, and evaluate the response time of each query.
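
To give a flavor of such a workload, the snippet below shows a representative query over a StackOverflow-style schema and how its response time might be measured. The table and column names are assumptions, not necessarily the paper's actual queries, and the cursor stands for any DB-API-style client to Shark, Presto, or Impala.

    import time

    # Hypothetical workload query over a StackOverflow-style schema.
    QUERY = """
        SELECT u.display_name, COUNT(*) AS answers, AVG(p.score) AS avg_score
        FROM posts p JOIN users u ON p.owner_user_id = u.id
        WHERE p.post_type = 'answer'
        GROUP BY u.display_name
        ORDER BY answers DESC
        LIMIT 10
    """

    def run_and_time(cursor, sql):
        # Wall-clock response time for one query, measured the way the
        # benchmark evaluates each query on each system under test.
        start = time.perf_counter()
        cursor.execute(sql)
        cursor.fetchall()
        return time.perf_counter() - start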


The Emergence of Modified Hadoop online based MapReduce technology in Cloud Environment

Shaikh Muhammad Allayear, Mohammad Salahuddin, Delwar Hossain and Park Sung Soon

As the web, social networking, and smartphone applications have become popular, data has been growing drastically every day; such data is called big data. The exponential growth of data first presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, Microsoft, Facebook, and Twitter. Data volumes to be processed by cloud applications are growing much faster than computing power, and this growth demands new strategies for processing and analyzing information. Hadoop MapReduce has become a powerful computation model that addresses these problems. MapReduce is a programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing for search engines and machine learning. Earlier versions of Hadoop MapReduce have several performance problems, such as the connection of map to reduce tasks, data overload, and time consumption. In this paper, we propose a modified MapReduce architecture, the MapReduce Agent (MRA), that resolves these performance problems. MRA can reduce completion time, improve system utilization, and give better performance. MRA uses multiple connections and handles error recovery with a Q-chained load-balancing system. We also discuss various applications and implementations of the MapReduce programming model in cloud environments.
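
The two-function interface is easiest to see on the canonical word-count example. The sketch below simulates the model in plain Python; the real Hadoop framework runs the same map, shuffle/sort, and reduce phases distributed across a cluster.

    from itertools import groupby
    from operator import itemgetter

    # Word count in the MapReduce model: map emits (key, value) pairs, the
    # framework groups them by key, and reduce folds each group.
    def map_fn(line):
        for word in line.split():
            yield (word.lower(), 1)

    def reduce_fn(word, counts):
        return (word, sum(counts))

    def mapreduce(lines):
        pairs = [kv for line in lines for kv in map_fn(line)]   # map phase
        pairs.sort(key=itemgetter(0))                           # shuffle/sort
        return [reduce_fn(key, [v for _, v in group])           # reduce phase
                for key, group in groupby(pairs, key=itemgetter(0))]

    print(mapreduce(["big data big cluster", "big data benchmarking"]))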


Towards A Complete BigBench Implementation

Tilmann Rabl, Michael Frank, Manuel Danisch, Bhaskar Gowda and Hans-Arno Jacobsen

BigBench was the first proposal for an end-to-end big data analytics benchmark. It features a set of 30 realistic queries based on real big data use cases. In this paper, we present updates on our development of a complete implementation on the Hadoop ecosystem, focusing on the changes we have made to the data set, scaling, the refresh process, and the metric.


TPCx-HS: Industry’s First Standard for Benchmarking Big Data Systems

Raghunath Nambiar

The designation “Big Data” has become a mainstream buzz phrase across many industries as well as research circles. Big Data was identified as one of the top areas for benchmark development at the most recent TPC Technology Conference on Performance Evaluation and Benchmarking. However, today many vendors are making performance claims that are not easily verifiable in the absence of a neutral industry-wide benchmark.

With this in mind, the TPC has created a Big Data Working Group (TPC-BDWG) tasked with developing industry standards for benchmarking Big Data systems. The Workshop Series on Big Data Benchmarking (WBDB) has significantly influenced the TPC’s direction in developing a set of standards for benchmarking Big Data systems.

The first benchmark from the TPC is TPCx-HS, designed to stress both hardware and software, including the Hadoop run-time, Hadoop Filesystem API compatible systems, and MapReduce layers. TPCx-HS can be used to assess a broad range of system topologies and implementation methodologies of Hadoop clusters in a technically rigorous, directly comparable, and vendor-neutral manner.
