BigDataBench is an open-source big data benchmark suite developed as a multi-discipline research effort. The current version is BigDataBench 3.0. It includes 6 real-world and 2 synthetic data sets, and 32 big data workloads, covering micro and application benchmarks from the search engine, social network, and e-commerce domains. To generate representative and diverse big data workloads, BigDataBench focuses on units of computation that frequently appear in Cloud "OLTP", OLAP, interactive analytics, and offline analytics. BigDataBench also provides several (parallel) big data generation tools, collectively called BDGS, which generate scalable big data, e.g., at PB scale, from small-scale real-world data while preserving its original characteristics. For example, on an 8-node cluster, BDGS generates 10 TB of wiki data in 5 hours. Different implementations are provided for the same workloads: currently, we and other developers have implemented the offline analytics workloads using MapReduce, MPI, Spark, and DataMPI, and the interactive analytics and OLAP workloads using Shark, Impala, and Hive. The web link is


BigDataBench: a Big Data Benchmark Suite from Internet Services. Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Cheng Zhen, Gang Lu, Kent Zhan, Xiaona Li, and Bizhu Qiu. The 20th IEEE International Symposium on High Performance Computer Architecture (HPCA 2014), February 15-19, 2014, Orlando, Florida, USA.


As the architecture, systems, and data management communities pay greater attention to innovative big data systems and architecture, the pressure to benchmark and evaluate these systems rises. However, the complexity, diversity, frequently changing workloads, and rapid evolution of big data systems raise great challenges for big data benchmarking. Considering the broad use of big data systems, for the sake of fairness, big data benchmarks must include diverse data and workloads, which is the prerequisite for evaluating big data systems and architecture. Most state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence are not qualified to serve the purposes mentioned above.

This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite, BigDataBench, not only covers broad application scenarios, but also includes diverse and representative data sets. Currently, we choose 19 big data benchmarks along the dimensions of application scenarios, operations/algorithms, data types, data sources, software stacks, and application types, and they are comprehensive for fairly measuring and evaluating big data systems and architecture. BigDataBench is publicly available from the project home page

We also comprehensively characterize the 19 big data workloads included in BigDataBench with varying data inputs. On a typical state-of-practice processor, the Intel Xeon E5645, we make the following observations. First, in comparison with traditional benchmarks, including PARSEC, HPCC, and SPEC CPU, big data applications have very low operation intensity, defined as the ratio of the total number of instructions to the total number of bytes of memory accessed. Second, the volume of the data input has a non-negligible impact on micro-architectural characteristics, which may pose challenges for simulation-based big data architecture research. Last but not least, corroborating the observations in CloudSuite and DCBench (which use smaller data inputs), we find that the numbers of L1 instruction cache (L1I) misses per 1000 instructions (in short, MPKI) of the big data applications are higher than those of the traditional benchmarks; we also find that L3 caches are effective for the big data applications, again corroborating the observation in DCBench.
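Both metrics above are simple ratios over hardware performance counters. As a minimal sketch, the following Python snippet computes them from purely illustrative counter values (hypothetical numbers, not measured results from the paper):

```python
# Minimal sketch of the two micro-architectural metrics discussed above.
# All counter values below are hypothetical, for illustration only.

def operation_intensity(total_instructions, total_bytes_accessed):
    """Ratio of total instructions to total bytes of memory accessed."""
    return total_instructions / total_bytes_accessed

def mpki(misses, total_instructions):
    """Cache misses per 1000 instructions (MPKI)."""
    return misses * 1000 / total_instructions

# Hypothetical counters for one run:
instructions = 2_000_000_000    # retired instructions
bytes_accessed = 9_600_000_000  # bytes read/written to memory
l1i_misses = 40_000_000         # L1 instruction cache misses

print(operation_intensity(instructions, bytes_accessed))  # ~0.21 instructions/byte
print(mpki(l1i_misses, instructions))                     # 20.0 MPKI
```

A low operation intensity (well under one instruction per byte here) means the workload is memory-bound, which matches the paper's first observation.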

Download Paper (LINK to .PDF)

BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. Zijian Ming, Chunjie Luo, Wanling Gao, Rui Han, Qiang Yang, Lei Wang, and Jianfeng Zhan. The Fourth Workshop on Big Data Benchmarking (WBDB 2014). To appear in Lecture Notes in Computer Science.

Abstract. The complexity and diversity of big data systems and their rapid evolution give rise to various new challenges in designing benchmarks that test such systems efficiently and successfully. Data generation is a key issue in big data benchmarking: it aims to generate application-specific data sets that meet the 4V requirements of big data (i.e., volume, velocity, variety, and veracity). Since small-scale real-world data are much more easily accessible, generating scalable synthetic data (volume) of different types (variety) at controllable generation rates (velocity) while keeping the important characteristics of the raw data (veracity) is an important issue. To date, most existing techniques only generate big data sets of specific data types, such as structured data, or only support specific big data systems, e.g., Hadoop. To address this issue, we develop a tool, called Big Data Generator Suite (in short, BDGS), to efficiently generate scalable big data, up to petabyte (PB) scale, while employing data models to capture and preserve the important characteristics of real data during data generation. The effectiveness of BDGS is demonstrated by developing six data generators based on a variety of real-world data sets from different internet service domains. These data generators cover three representative data types (structured, semi-structured, and unstructured) and three data sources (text, graph, and table data). BDGS is an integrated part of our open-source big data benchmarking project, BigDataBench, which is publicly available from the project home page. We also evaluate BDGS under different experimental settings, and show that BDGS can rapidly generate big data, with generation time growing linearly as data volume increases.

Keywords: Big Data, Benchmark, Data Generator, Scalable, Veracity 

Download Paper (LINK to .PDF)
