MLPerf: ML benchmark suite

by Roberto Zicari · June 1, 2018

A broad ML benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms.

Overview

The MLPerf effort aims to build a common set of benchmarks that enables the machine learning (ML) field to measure system performance for both training and inference, from mobile devices to cloud services. We believe that a widely accepted benchmark suite will benefit the entire community, including researchers, developers, builders of machine learning frameworks, cloud service providers, hardware manufacturers, application providers, and end users.

Historical Inspiration

We are motivated in part by the Standard Performance Evaluation Corporation (SPEC) benchmarks for general-purpose computing and the Transaction Processing Performance Council (TPC) benchmarks for database systems, which drove rapid, measurable performance improvements in both fields for decades starting in the 1980s.

Goals

Learning from the 40-year history of benchmarks, MLPerf has these primary goals:

Accelerate progress in ML via fair and useful measurement

Serve both the commercial and research communities

Enable fair comparison of competing systems yet encourage innovation to improve the state-of-the-art of ML

Enforce replicability to ensure reliable results

Keep benchmarking effort affordable so all can participate

General Approach

Our approach is to select a set of ML problems, each defined by a dataset and a quality target, and then measure the wall-clock time to train a model for each problem.

System Performance Metrics

Following the precedent of DAWNBench, the primary MLPerf metric is the wall-clock time to train a model to a target quality, which is often hours or days. The target quality is based on the original publication result, less a small delta to allow for run-to-run variance.
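The time-to-quality metric described above can be illustrated with a minimal Python sketch. This is not MLPerf's reference harness: the function names (target_quality, time_to_target), the callback parameters, and the choice to include evaluation time in the measurement are all assumptions made for illustration. It also assumes a quality metric where higher is better.

```python
import time
from typing import Callable, Optional


def target_quality(published: float, delta: float) -> float:
    """Published result less a small delta for run-to-run variance."""
    return published - delta


def time_to_target(
    train_one_epoch: Callable[[], None],
    evaluate: Callable[[], float],
    target: float,
    max_epochs: int,
) -> Optional[float]:
    """Return wall-clock seconds until evaluate() first reaches target.

    Returns None if the target is not reached within max_epochs.
    Assumption: evaluation time counts toward the measured time.
    """
    start = time.perf_counter()
    for _ in range(max_epochs):
        train_one_epoch()
        if evaluate() >= target:  # higher-is-better metric assumed
            return time.perf_counter() - start
    return None
```

A caller would pass in its own training step and evaluation function; a run that never reaches the target yields None rather than a time, so it cannot be mistaken for a valid result.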

Following SPEC’s precedent, we will publish a score that summarizes performance for our set of Closed or Open benchmarks: the geometric mean of results for the full suite.
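As a rough illustration of the summary score, the geometric mean of n positive per-benchmark results is the n-th root of their product. The sketch below computes it via logarithms to avoid overflow on long products. The function name and the idea of passing results as a plain list of positive numbers are assumptions; the source does not specify how individual results are normalized before they are combined.

```python
import math
from typing import Sequence


def geometric_mean(results: Sequence[float]) -> float:
    """Geometric mean of positive per-benchmark results."""
    if not results:
        raise ValueError("at least one result is required")
    if any(r <= 0 for r in results):
        raise ValueError("all results must be positive")
    # exp(mean(log(x))) equals the n-th root of the product of x.
    return math.exp(sum(math.log(r) for r in results) / len(results))
```

Unlike an arithmetic mean, the geometric mean is not dominated by the benchmark with the largest absolute value: results of 2 and 8 summarize to 4, not 5.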

SPEC also reports power (a useful proxy for cost), and DAWNBench reports cloud cost. MLPerf will report power for mobile or on-premises systems and cost for cloud systems.

Datasets and Model Sources

Image Classification

Dataset: Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. S.; Berg, A. C. & Fei-Fei, L. (2015), ‘ImageNet Large Scale Visual Recognition Challenge’, International Journal of Computer Vision (IJCV).

Model: He, K.; Zhang, X.; Ren, S. & Sun, J. (2015), ‘Deep Residual Learning for Image Recognition’, CoRR abs/1512.03385.

Object Identification

Dataset: Lin, T.-Y.; Maire, M.; Belongie, S. J.; Bourdev, L. D.; Girshick, R. B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P. & Zitnick, C. L. (2014), ‘Microsoft COCO: Common Objects in Context’, CoRR abs/1405.0312.

Model: He, K.; Gkioxari, G.; Dollár, P. & Girshick, R. B. (2017), ‘Mask R-CNN’, CoRR abs/1703.06870.

Translation

Dataset: WMT English-German from Bojar, O.; Buck, C.; Federmann, C.; Haddow, B.; Koehn, P.; Monz, C.; Post, M. & Specia, L., ed. (2014), Proceedings of the Ninth Workshop on Statistical Machine Translation, Association for Computational Linguistics, Baltimore, Maryland, USA.

Model: Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L. & Polosukhin, I. (2017), ‘Attention Is All You Need’, CoRR abs/1706.03762.

Speech-to-Text

Dataset: Panayotov, V.; Chen, G.; Povey, D. & Khudanpur, S. (2015), Librispeech: An ASR corpus based on public domain audio books, in ‘2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)’, pp. 5206-5210.

Model: Amodei, D.; Anubhai, R.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Chen, J.; Chrzanowski, M.; Coates, A.; Diamos, G.; Elsen, E.; Engel, J.; Fan, L.; Fougner, C.; Han, T.; Hannun, A. Y.; Jun, B.; LeGresley, P.; Lin, L.; Narang, S.; Ng, A. Y.; Ozair, S.; Prenger, R.; Raiman, J.; Satheesh, S.; Seetapun, D.; Sengupta, S.; Wang, Y.; Wang, Z.; Wang, C.; Xiao, B.; Yogatama, D.; Zhan, J. & Zhu, Z. (2015), ‘Deep Speech 2: End-to-End Speech Recognition in English and Mandarin’, CoRR abs/1512.02595.

Recommendation

Dataset: Harper, F. M. & Konstan, J. A. (2015), ‘The MovieLens Datasets: History and Context’, ACM Trans. Interact. Intell. Syst. 5(4), 19:1–19:19.

Model: He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X. & Chua, T.-S. (2017), ‘Neural Collaborative Filtering’, CoRR abs/1708.05031.

Sentiment Analysis

Dataset: Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y. & Potts, C. (2011), Learning Word Vectors for Sentiment Analysis, in ‘Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies’, Association for Computational Linguistics, Portland, Oregon, USA, pp. 142–150.

Model: Johnson, R. & Zhang, T. (2014), ‘Effective use of word order for text categorization with convolutional neural networks’, CoRR abs/1412.1058.

Reinforcement Learning

Dataset: Games from Iyama Yuta 6 Title Celebration, between contestants Murakawa Daisuke, Sakai Hideyuki, Yamada Kimio, Hyakuta Naoki, Yuki Satoshi, and Iyama Yuta.

Model: Tensorflow/minigo implementation by Andrew Jackson.

——————————

Supporting companies