(The Need for) A Global Resource Manager for Data Analytics Clusters
by Ganesh Ananthanarayanan, Researcher at Microsoft Research
The necessity to process large quantities of data has led to widespread adoption of data parallel analytics frameworks like Google MapReduce, Hadoop, Microsoft Cosmos and Spark. These frameworks automatically compose a job into many small tasks, which then run in parallel on massively large clusters. While these frameworks were originally designed for batch processing, they have increasingly evolved to also serve interactive analytics.
All-or-nothing: An important property of data parallel analytics is that the job finishes (and provides the result) only after its last task finishes. This simple “all-or-nothing” property is crucial to job performance. Therefore, analytics frameworks strive to achieve equal progress across all the tasks of a job. This would lead to no task being left behind, thus providing predictable performance, crucial especially for interactive analytics.
Achieving such equal progress to satisfy the all-or-nothing property, however, is extremely challenging in large complex clusters. Tasks are dependent on resources from multiple layers during their lifetime. The scheduler, that simultaneously deals with tasks of many jobs together, decides the order in which tasks are scheduled. Tasks read their inputs from the underlying distributed file system (e.g., the Hadoop Distributed File System), the priority of access to which decides their progress. Often, data is cached in memory, thus a cache hit or miss significantly alters their completion. When tasks transfer data across the network, they contend with other network flows which decide their completion time. Finally, the resources on the machine that the tasks execute – processors and their caches and memory – also impact performance.
Existing solutions to satisfy the all-or-nothing property of jobs is often piecemeal and ad hoc, e.g., only at the caching layer or only at the scheduler. An important challenge for frameworks to provide predictable performance, thus, is a cluster-wide global resource manager that allocates resources to tasks of a job jointly at all the layers. Such a global resource manager that scales to large cluster sizes will be a high-value proposition for data analytics clusters.
Dr. Ganesh Ananthanarayanan is a Researcher at Microsoft Research.