Optimizing Parallel Job Performance in Data-Intensive Clusters
University of California, Berkeley
PhD Thesis- Technical Report No. UCB/EECS-2014-9
January 26, 2014
Extensive data analysis has become the enabler for diagnostics and decision making in many modern systems. These analyses have both competitive as well as social benefits. To cope with the deluge in data that is growing faster than Moore’s law, computation frameworks have resorted to massive parallelization of analytics jobs into many ne-grained tasks.
These frameworks promised to provide efficient and fault-tolerant execution of these tasks. However, meeting this promise in clusters spanning hundreds of thousands of machines is challenging and a key departure from earlier work on parallel computing. A simple but key aspect of parallel jobs is the all-or-nothing property: unless all tasks of a job are provided equal improvement, there is no speedup in the completion of the job. The all-or-nothing property is critical for the promise of efficient and fault-tolerant parallel computations on large clusters. Meeting this promise in clusters of these scales is challenging and a key departure from prior work on distributed systems. This PhD will look at the execution of a job from first principles and propose techniques spanning the software stack of data analytics systems such that its tasks achieve homogeneous performance while overcoming the various heterogeneities. To that end, we will propose techniques for (i) caching and cache replacement for parallel jobs, which outperforms even Belady’s MIN (that uses an oracle), (ii) data locality, and (iii) straggler mitigation. Our analyses and evaluation are performed using workloads from Facebook and Bing production datacenters. Along the way, we will also describe how we broke the myth of disk-locality’s importance in datacenter computing.
Advisor: Ion Stoica
DOWNLOAD PhD: LINK to .PDF