Statistical Workload Injector for MapReduce (SWIM)
Yanpei Chen, Sara Alspaugh, Archana Ganapathi, Rean Griffith, Randy Katz
MapReduce systems face enormous challenges due to the growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. SWIM includes:
Repository of real-life MapReduce workloads from production systems.
Workload synthesis tools to generate representative test workloads by sampling historical MapReduce cluster traces.
Workload replay tools to execute the historical or test workloads with low performance overhead.
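As a rough illustration of what "sampling historical MapReduce cluster traces" can look like, the sketch below draws whole time windows at random from a trace and concatenates them into a shorter test workload. It is a minimal sketch under stated assumptions, not SWIM's actual synthesis tool: the `Job` record, its field names, the `synthesize_workload` function, and the window-sampling scheme itself are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Job:
    """Hypothetical per-job trace record; field names are assumptions."""
    submit_time: float   # seconds since the start of the trace
    input_bytes: int
    shuffle_bytes: int
    output_bytes: int


def synthesize_workload(trace: List[Job], window_s: float,
                        num_windows: int, seed: int = 0) -> List[Job]:
    """Build a test workload by sampling fixed-length windows from a trace.

    Each sampled window keeps its jobs' data sizes and their relative
    arrival times, so bursts inside a window survive the sampling.
    Windows are drawn with replacement (an assumption of this sketch).
    """
    if not trace or window_s <= 0 or num_windows <= 0:
        return []
    rng = random.Random(seed)
    trace_end = max(job.submit_time for job in trace)
    total_windows = int(trace_end // window_s) + 1

    synthetic: List[Job] = []
    for slot in range(num_windows):
        start = rng.randrange(total_windows) * window_s
        for job in trace:
            if start <= job.submit_time < start + window_s:
                # Shift the job so the sampled window lands in output slot `slot`.
                shifted = slot * window_s + (job.submit_time - start)
                synthetic.append(Job(shifted, job.input_bytes,
                                     job.shuffle_bytes, job.output_bytes))
    synthetic.sort(key=lambda j: j.submit_time)
    return synthetic


# Usage: compress a made-up 1000-second trace into a 300-second test workload.
example_trace = [Job(10.0 * i, 1000 * (i + 1), 100 * i, 10 * i) for i in range(100)]
test_workload = synthesize_workload(example_trace, window_s=60.0, num_windows=5, seed=42)
```

Sampling whole windows rather than individual jobs is one way to preserve arrival patterns within each window; other sampling schemes are equally plausible.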
SWIM enables rigorous performance measurement of MapReduce systems. SWIM contains suites of workloads of thousands of jobs, with complex data, arrival, and computation patterns. This represents an advance over previous MapReduce pseudo-benchmarks of limited diversity and scope. SWIM informs both highly targeted, workload-specific optimizations and designs intended to bring general benefit.
We believe MapReduce cluster operators can use SWIM to accomplish other previously challenging tasks, including but not limited to provisioning and planning resources in multiple dimensions, tuning configurations for diverse job types within a workload, anticipating workload consolidation behavior, and quantifying workload superposition in multiple dimensions.
SWIM is currently integrated with Hadoop. The performance and evaluation science behind it is extensible to MapReduce systems in general.
SWIM is currently open source under the New BSD License, except for files derived from Apache Hadoop, which are under the Apache License 2.0.
Download SWIM: LINK