Big Data: Improving Hadoop for Petascale Processing at Quantcast.
“We were an early user of Hadoop and found ourselves pushing its scalability bounds and of necessity innovating our own solutions. For example we wrote our own API on top of it, rebuilt its sorter, and developed an alternative file system to achieve better performance and cost-effectiveness” — Jim Kelly and Sriram Rao.
Quantcast released last October the Quantcast File System (QFS) to open source. I have interviewed Jim Kelly, head of R&D at Quantcast, and Sriram Rao, Principal Scientist in Microsoft’s Cloud and Information Services Lab (CISL). Both worked closely on the development of the QFS platform at Quantcast.
Q1. What is the main business of Quantcast?
Jim Kelly: Quantcast measures digital audiences for millions of web destinations and helps advertisers reach their audiences more effectively. We observe billions of anonymous digital media consumption events every month and use large scale machine learning techniques to characterize audiences and to identify and reach relevant ones for advertisers.
We’re not in the distributed file system business, we’re in the same category as Google and Yahoo: an advertising company doing pioneering work in big data and sharing it with the rest of the community.
Q2. What are the main big data processing challenges you have experienced so far?
Jim Kelly: We’ve been doing big data processing since the launch of our measurement product in 2006, which put us in the big data space before it had become a space. We were an early user of Hadoop and found ourselves pushing its scalability bounds and of necessity innovating our own solutions. For example we wrote our own API on top of it, rebuilt its sorter, and developed an alternative file system to achieve better performance and cost-effectiveness. Our business and data volume have grown steadily since then. We currently receive over 50 TB of data per day, continually challenging us to operate at a scale beyond most organizations’ experience and most technologies’ comfort zone.
Sriram: In terms of big data, Hadoop is targeted to the sweet spot where the volume of data being processed in a computation is roughly the size of the amount of RAM in the cluster. At Quantcast, what we found was that we were frequently jobs where the amount of data being data processed per day (even with compression) was substantial. For instance, when I started at Quantcast in 2008, we were doing 100TB per day. Scaling the data processing volume required us to rethink components of Hadoop and build novel soutions. As we developed novel solutions and deployed them in our cluster, we were able to increase data processing volumes to as much as 500TB per day over a 2-year period. This increase occurred purely due to software improvements on the same hardware.
Q3. What lessons did you learn so far in using Hadoop for Big Data Analytics?
Jim Kelly: Hadoop has been a fantastic boon for us and the rest of the community. By providing a framework for developers (and increasingly, non-developers) to do parallel computing without a lot of systems knowledge, it has democratized an arcane and difficult problem. It has been a catalyst for the emergence of the big data community: a whole ecosystem of people, companies and technologies focused on different aspects of the problem. We’ve been a big beneficiary, both in terms of technology we’ve been able to use directly, and team members we’ve been able to attract because they want to do cutting-edge work in the dynamic space Hadoop has created. The lesson, I think, is how much value a technology can provide above and beyond what it does when you run it.
Sriram: The Apache Hadoop distribution is the same one that powers Yahoo!’s large clusters. Having an open source software tested at such scale has provided a great starting point for companies looking to process big data. While the data processing needs at Quantcast are extreme (10’s of PB per day), the main lesson I think is that, for a vast majority of the companies interested in data analytics Hadoop has become a stable platform which works “out-of-the-box”.
Q4. What are the scalability limits of current implementation of Hadoop?
Jim Kelly: It’s difficult to generalize, because there are so many different environments using Hadoop in different ways. What becomes an issue for one set of use cases on one set of hardware is not necessarily relevant for another, and I think it’s safe to say that the great majority of environments working with Hadoop needn’t worry about its scalability.
In our environment we’re processing a steady diet of many-terabyte jobs on many hundreds of servers, and we’ve invested a great deal in making the “sort” phase of MapReduce more efficient. In between the “map” and the “reduce,” Hadoop has to sort and group the mapper output and distribute it to the appropriate reducers. In general every reducer will have to read some of every mapper’s output, and with M mappers and N reducers that causes MxN disk seeks. When your jobs are large enough to require M and N in the thousands, this becomes a big performance bottleneck. We wrote a blog post describing the problem in more detail.
In our own environment we’ve achieved more efficiency at this scale by developing the Quantcast File System (QFS) and Quantsort. Sriram joined us in 2008 and led the effort to adapt the Kosmos File System (KFS) to the job and to build a new sort architecture on top of it. Using the two systems, we’ve sorted data sets of nearly a petabyte, and we regularly process over 20 PB per day.
QFS also improved where Hadoop was scaling too readily for us: cost. By using a more efficient data encoding, it lets us do more data processing using only half the hardware we would need with Hadoop’s HDFS, and therefore half the power and half the cooling and half the maintenance.
Sriram: The fundamental issue with Hadoop is how “shuffle” phase is handled as the data volume scales.
In a Map-Reduce computation, the output generated by map tasks during the “map” phase is partitioned and a reducer obtains its input partition from each of the map tasks. In particular, for large jobs, the size of the map output may exceed the amount of RAM in the cluster. When that happens, data has to be written to disk and then read back and transported over the network.
The “shuffle” is an “external” distributed merge-sort and is known to be seek intensive. Hadoop uses traditional mechanisms for implementing the shuffle and this is known to have scalability issues when the volume of data to be shuffled exceeds the amount of RAM in the cluster.
Q5. Sriram what was your contribution to the QFS project?
Sriram: QFS is an evolution of the KFS system that I started at Kosmix. The bulk of the KFS code base is in QFS. At Quantcast, we continued to evolve KFS by improving performance, by adding features, etc. One key feature we added at Quantcast was support multi-writer append (i.e., atomic append). It turns out that this feature can be leveraged to build high-performance sort architectures, which are fundamental to doing Map-Reduce computations at scale. In general, I worked on all aspects of the KFS system.
Q6. How does the Kosmos File System (KFS) relate to the QFS?
Jim Kelly: QFS evolved from KFS and has a couple years’ improvements on it. The most significant is Reed-Solomon erasure coding, which doubles effective storage and improves performance.
Sriram: The QFS project has its roots in the KFS system that I built. The bulk of the KFS code base is in QFS.
I started KFS originally at Kosmix Corp (now, @WalmartLabs) in 2006. At that point, Kosmix was trying to build a search engine and KFS was intended to be the storage substrate. The KFS architecture is based on the Google filesystem paper. The main idea in KFS is that blocks of a file are striped across nodes and replicated for fault-tolerance. We eventually released KFS as open source in 2007, after which I joined Quantcast. At Quantcast, we continued to evolve KFS by improving performance, by adding features, etc. KFS 0.5 is a snapshot of the code base that was run in production cluster at Quantcast in 2010. So, if you will, think of KFS as QFS 0.5. The significant difference between the two systems is erasure coding.
Q7. Sriram you also worked on another open source project Sailfish. Could you tell us a bit more about it?
Sriram: Data analytics computations are run on machines in a datacenter. The frameworks for doing these computations were designed in 2004-2006 timeframe when bandwidth within the datacenter was a scarce resource. However, that design point is changing.
Over the next few years, technological trends suggest that bandwidth within a datacenter will substantially increase. Inter-node connectivity is expected to go up from 1Gbps between any pair of nodes to 10Gbps (and possibly higher). Given such plentiful bandwidth, how would we build analytical engines (such as, Map-Reduce computation engines)? Sailfish is an attempt at re-designing how the shuffle data is transported in Map-Reduce frameworks by taking advantage of better connectivity.
Sailfish design is based on two key ideas:
1. Use network-wide data aggregation to improve disk subsystem performance.
2. Gather statistics on the aggregated data to plan for reduce phase of execution.
We instantiated these two ideas in the context of Hadoop-0.20.2, where we re-worked the entire shuffle pipeline and how the reduce phase of execution is done. Sailfish leverages the concurrent append capabilities of KFS to move data from mappers to reducers. When a Hadoop Map-Reduce job is run with Sailfish, (1) user does not tune sort parameters and (2) number of reducers in a job and their task assignment is determined at run-time in a data-dependent manner. Essentially, Sailfish simplifies running large-scale Hadoop-based Map-Reduce computations. We have also added support for preemption in Sailfish, where the preemption is implemented via a checkpoint/restart mechanism. This allows us to seamlessly handle skew.
Here is the Sailfish website which has links to the papers we have published and the software we have released.
Q8. Who is currently using Sailfish?
Sriram: Sailfish is a research prototype which has been released to open-source so that other researchers/users can benefit from our work. We are currently working on migrating many of the ideas from Sailfish to Hadoop version 2.0 (aka YARN). We have filed some JIRAs and intend to contribute the code to Hadoop.
Q9. Quantcast released last October the Quantcast File System (QFS) to open source, (link here). Why?
Jim Kelly: We’ve always relied heavily on open source software internally, and it’s good to be able to give back. It’s a chance for developers to have more of an impact than they could working on a proprietary system. And we believe file systems particularly, as foundational infrastructure components, benefit from the open source model where they can get a lot of scrutiny and be challenged to prove themselves in many different environments.
Q10. Is QFS supposed to be alternative to Apache Hadoop`s HDFS or just a complementary solution?
Jim Kelly: Hadoop has a wide variety of customers with diverse priorities. Some are just getting started and put a lot of value on ease of use. Others care most about performance at small scale, or high availability. HDFS offers broad functionality to meet needs for all of them.
Quantcast’s priorities are cost efficiency and high performance at large scale, so in developing QFS we went deeper around that use case. QFS has significantly improved the performance and economics of our data processing and can do the same for other organizations running their own clusters at large scale. By contrast organizations just getting started or needing specific HDFS features will probably find HDFS is a better fit.
Q11. What are the key technical innovations of QFS?
Jim Kelly: Architecturally QFS places a couple of different bets than HDFS. It’s implemented in C++ rather than Java, which allows better control and optimization in a number of areas. It also leans more heavily on modern networks, which have become much faster since HDFS launched and allow different optimization choices.
Those differences in approach make possible QFS’s Reed-Solomon erasure coding, which doubles storage efficiency and improves performance. The concurrent append feature allows many processes to write simultaneously to a single file, enabling applications like Quantsort. QFS’s use of direct I/O also improves both performance and manageability, by ensuring processes live within a predictable memory footprint.
Sriram: One of KFS’s key innovations that we did at Quantcast was support for “atomic append”.
This is the ability to allow multiple writers to append data to a file. Basically, think of the file as a “bag” into which data is dropped and there are no ordering guarantees on the order in which data is appended to a file. It turned out that this feature can be leveraged to build high-performance sort engines.
Q12. What kind of benchmark results do you have comparing QFS with Hadoop`s HDFS? What benchmark did you use for that?
Jim Kelly: Both QFS and HDFS rely on a central server to manage the file system, and its performance is critical as every cluster operation is dispatched through it. We’ve tested the two head-to-head for various operations and found QFS’s leaner C++ implementation pays off. Directory listings were 22% faster, and directory creation was nearly three times as fast.
These were our tests, and we encourage other people to run their own. We’ve put scripts and instructions to help on github.
Our whole-cluster tests will be harder for others to replicate at the same scale we did on our production cluster. We attempted to get some sense of how the file system affects real-world job run time. We measured end-to-end run time for simple Hadoop jobs that wrote or read 20TB of user data on an otherwise idle cluster, using QFS or HDFS as the underlying file system. The average throughput of the write job was about 75% higher using QFS, due to having to write less physical data. The read job throughput was 47% higher, primarily due to better parallelism. More information about these tests is available on the QFS wiki.
Q13. How reliable is QFS? What is your experience in using QFS in the Quantcast environment?
Jim Kelly: We started running KFS internally in 2008 on secondary storage and over years evolved it into today’s QFS, in the process adding features to make it more manageable at large scale and hardening it for production use. By 2011 it was running reliably enough that we were comfortable going all in. We copied all our data over from HDFS and have been running QFS exclusively since then for all our storage and MapReduce work. QFS has handled over six exabytes of I/O since then, so we’re confident it’s ready for other organizations’ production workloads.
Q14. You mentioned Hadoop’s architectural limitations for large jobs (terabytes and beyond). The more tasks a job uses, the less efficient its disk I/O becomes. To improve that, Quantcast developed an improved Hadoop Sort, called Quantsort. What is special about Quantsort?
Jim Kelly: Quantsort avoids the MxN combinatorics Hadoop faces shuffling data between mappers and reducers. It leverages QFS’s concurrent append feature to shuffle data through a fixed number of intermediate files. Reducers read their data in larger chunks, keeping disk I/O efficient. We wrote a blog post describing Quantsort more fully.
Q15. How is the performance of Quantsort?
Jim Kelly: Fast. Sriram did Quantcast’s first large sort back in 2009, of 140 TB in 6.2 hours on much less hardware than we have today. Quantsort’s record on a production job is 971 TB in 6.4 hours.
Sriram: For large jobs (think 10’s-100’s of TB of data and beyond) where the map output exceeds the size of RAM, Quantsort performance can be upto an order of magnitude faster when compared to Hadoop’s shuffle. Quantsort enables you to process more data quickly using the same hardware. Since jobs finish faster, it has the multiplier effect of allowing users to run more jobs on the cluster.
Q17. Is Quantsort part of QFS and therefore, also open source?
Jim Kelly: No, Quantsort is a separate code base.
Q18 In 2011, Quantcast migrated all production data to QSF and turned off HDFS. How did you manage to migrate the data stored in HDFS to QSF?
Jim Kelly: The data migration was very straightforward, a matter of running a trivial MapReduce job that copies data from one place to another. We avoided any outage for our production jobs by phasing them over, making them read from either file system but write to QFS during the transition.
Q19. How easy is to plug-in and use QFS into an existing Hadoop cluster which has already data stored in HDFS?
Jim Kelly: QFS is plug-in compatible with Hadoop, using KFS bindings that are already included in standard Hadoop distributions, so it’s very easy to integrate. The data storage format is different, requiring either migrating data to QFS or phasing it in gradually. The QFS wiki has a migration guide.
Q20. KFS has also been released an open source project. How does it relate to the Quantcast File System (QFS)?
Jim Kelly: Quantcast adopted KFS in 2008, became the main sponsor of contributions to the KFS project for the next two years and subsequently evolved it into QFS. You’ll find the concurrent append feature in the KFS repository, for example, but Reed-Solomon encoding is only in QFS.
Qx Anything you wish to add?
Jim Kelly: An invitation to other organizations doing distributed batch processing and interested in doubling their cluster capacity to give QFS a try. We’re happy to field questions or receive feedback on the QFS mailing list.
Jim Kelly leads Quantcast’s R&D team, which works on both adding computing capacity through cluster software innovations and using it up through new analytic and modeling products. Having been at Quantcast for six years, he has seen its data volumes and processing challenges grow from zero to petabytes and led technical and organizational changes that have kept Quantcast a step ahead. Previously he held engineering leadership roles at Oracle, Kana, and Scopus Technology (acquired by Siebel). Jim holds a PhD in physics from Princeton University.
Sriram Rao is currently a Principal Scientist in Microsoft’s Cloud and Information Services Lab (CISL), and formerly lead the Quantcast’s Cluster Team. Sriram was the creator of the Kosmos File System (KFS), which would become the underlying architecture for the Quantcast File System (QFS) that was released to open source last year. Sriram kiis also the creator of Sailfish, another open source project that is geared towards processing massive amounts of data. Previously he has worked at Quantcast, Yahoo, and Kosmix (now @WalmartLabs). Sriram holds a PhD in Computer Sciences from UT-Austin.
–ODBMS.org: Big Data and Analytical Data Platforms – Free Software
–ODBMS.org: Big Data and Analytical Data Platforms – Articles
Follow ODBMS.org on Twitter: @odbmsorg