China released its first industry-standard big data benchmark suite
By Jianfeng Zhan, Professor of Computer Science and Engineering at the Institute of Computing Technology, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences.
Recently, the Chinese Academy of Sciences and the China Academy of Telecommunication Research, together with a number of industry partners including Huawei, Microsoft (China), IBM CDL, Intel (China), Baidu, China Mobile, Sina, ZTE, and INSPUR, released China’s first industry-standard big data benchmark suite, BigDataBench-DCA. The specifications have been submitted to China’s Ministry of Industry and Information Technology and are under review.
The specifications and source code are publicly available.
BigDataBench-DCA comprises six real-world data sets (covering unstructured text, semi-structured text, unstructured graph, and structured and semi-structured table data), their corresponding scalable data generation tools, and ten workloads that are I/O-intensive, CPU-intensive, or hybrid.
BigDataBench-DCA is a subset of BigDataBench, an open-source big data benchmark suite.
The current version, BigDataBench 3.1, models five important big data application domains: search engine, social networks, e-commerce, multimedia analytics, and bioinformatics. In specifying representative big data workloads, BigDataBench focuses on units of computation that appear frequently in OLTP, Cloud “OLTP”, OLAP, and interactive and offline analytics in each application domain.
It also considers a variety of data models with different types and semantics. In addition, BigDataBench provides an end-to-end application benchmarking framework that abstracts data operations and workload patterns, allowing flexible benchmarking scenarios to be created and extended to other application domains.
For the same big data benchmark specifications, BigDataBench provides different implementations: for example, the offline analytics workloads are implemented using MapReduce, MPI, Spark, and DataMPI, and the interactive analytics and OLAP workloads using Shark, Impala, and Hive. In addition to including real-world data sets, BigDataBench provides a set of parallel big data generation tools, BDGS, which generate scalable big data (e.g., at PB scale) from small or medium-scale real-world data while preserving the characteristics of the original data.
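The idea of scaling a small real-world data set up to a much larger synthetic one while preserving its characteristics can be illustrated with a minimal sketch. This is not the BDGS algorithm: it substitutes a deliberately simple unigram (word-frequency) model, and the seed text, function names, and sizes are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_unigram_model(seed_text):
    """Fit a word-frequency model to a small seed corpus (illustrative only)."""
    counts = Counter(seed_text.split())
    words = sorted(counts)                  # fixed order for reproducibility
    weights = [counts[w] for w in words]    # raw counts act as sampling weights
    return words, weights


def generate_text(model, num_words, rng_seed=0):
    """Sample a synthetic corpus of num_words tokens from the fitted model."""
    words, weights = model
    rng = random.Random(rng_seed)
    return rng.choices(words, weights=weights, k=num_words)


def relative_frequencies(tokens):
    """Return each word's share of the token stream."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical seed corpus standing in for a real-world data set.
seed_text = "big data benchmark big data suite big workload data"
model = fit_unigram_model(seed_text)

# Scale the 9-word seed up to 100,000 words.
synthetic = generate_text(model, num_words=100_000)

seed_freq = relative_frequencies(seed_text.split())
synthetic_freq = relative_frequencies(synthetic)
```

Comparing `seed_freq` with `synthetic_freq` shows that the word distribution of the generated corpus stays close to that of the seed, which is the property a scalable generator needs. A real generator would fit a richer model to capture more than word frequencies.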
To model and reproduce multi-application or multi-user scenarios in clouds or datacenters, BigDataBench provides a multi-tenancy version, which supports flexible setting and replaying of mixed workloads according to real workload traces: the Facebook, Google, and Sogou traces. For system and architecture research, i.e., architecture, OS, networking, and storage, the number of benchmarks is multiplied by the number of implementations and hence becomes massive.
To reduce the research or benchmarking cost, a small number of representative benchmarks, called the BigDataBench subset, are selected according to workload characteristics from a specific perspective.
For example, because simulation-based research is very time-consuming, a BigDataBench architecture subset is provided for the architecture community on each of the MARSSx86, gem5, and Simics simulators.
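Selecting a small representative subset according to workload characteristics can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used to build the BigDataBench subset: it uses a greedy farthest-point heuristic, and the workload names and feature values are hypothetical placeholders for measured characteristics.

```python
import math


def select_representatives(workloads, k):
    """Greedily pick k workloads that are far apart in feature space.

    workloads maps a workload name to a tuple of normalized characteristics.
    Starts from the alphabetically first workload, then repeatedly adds the
    workload whose nearest already-chosen neighbor is farthest away.
    """
    names = sorted(workloads)
    chosen = [names[0]]
    while len(chosen) < min(k, len(names)):
        candidates = [n for n in names if n not in chosen]
        farthest = max(
            candidates,
            key=lambda n: min(
                math.dist(workloads[n], workloads[c]) for c in chosen
            ),
        )
        chosen.append(farthest)
    return chosen


# Hypothetical normalized characteristics per workload; the three
# dimensions are placeholders, not measurements from BigDataBench.
workloads = {
    "workload_a": (0.90, 0.10, 0.20),
    "workload_b": (0.85, 0.15, 0.25),
    "workload_c": (0.20, 0.80, 0.90),
    "workload_d": (0.25, 0.75, 0.85),
    "workload_e": (0.50, 0.50, 0.10),
}

subset = select_representatives(workloads, k=3)
print(subset)  # ['workload_a', 'workload_c', 'workload_e']
```

Here `workload_b` and `workload_d` are dropped because each behaves almost identically to a workload already in the subset, so simulating them would add cost without adding much new information.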