Hadoop at Yahoo. Interview with Mithun Radhakrishnan
“The main challenge when working with “big data” in Yahoo has always been our definition of “big”. :] There are several thousands of feeds on Yahoo’s Hadoop clusters, with daily, hourly and up-to-the-minute data frequencies, spanning Petabytes of data”.–Mithun Radhakrishnan
Q1. You work on Apache Hive, in the Yahoo Hadoop team. What are the most current projects you are working on?
Mithun Radhakrishnan: I work on the Hive team at Yahoo. Currently, we are migrating our Hadoop clusters from Hadoop 0.23 (initial release of YARN) to Hadoop 2.5. My team has been focusing on making sure that Hive 0.12 is performant on Hadoop 2.5, as well as rolling out Hive 0.13 to Yahoo’s Grid infrastructure. We have also been busy trying to enhance the performance of Hive queries, as well as of the Hive metastore, to work effectively at Yahoo’s large scale.
Q2. What are the most important challenges you are facing for the deployment, scaling and performance of Hive-related services at Yahoo?
Mithun Radhakrishnan:The main challenge when working with “big data” in Yahoo has always been our definition of “big”. :] There are several thousands of feeds on Yahoo’s Hadoop clusters, with daily, hourly and up-to-the-minute data frequencies, spanning Petabytes of data. Each feed would correspond to a Hive table, with the timestamp (date, hour, minute) being just one of several levels of partition-keys. Some of our more popular feeds add hundreds of thousands of partitions daily, and span millions overall. We’re working on optimizations in Hive’s metadata-storage, to scale to these high levels.
Another recent challenge has been the increased adoption of Business Intelligence and Data visualization tools (such as Tableau and MicroStrategy), connected directly to Grid data over HiveServer2. Such use imposes expectations not only on Hive query performance, but also on data transport as well as the metastore.
And finally, the hardware on which Hadoop runs at Yahoo is heterogeneous, accumulated over many years of usage at Yahoo. While our newer clusters use bleeding-edge hardware with gobs of memory, some of our clusters are several years old.
At our scale, we don’t have the luxury of completely replacing our hardware every year. We need our Grid software (Hadoop, Hive, Pig, etc.) to be performant on a variety of processor/memory/disk configurations.
Q3. What kind of Hive-related services did you implement at Yahoo?
Mithun Radhakrishnan: Yahoo has traditionally been an Apache Pig shop, but recently, we’ve seen an increase in the number of Hive jobs. This may be attributed to increased SQL-based analytics, proliferation of Business Intelligence tools, and some use of Hive for data transformations.
At Yahoo, we use HCatalog (i.e. Hive’s metadata server) for interoperability between Pig, Hive and MapReduce. An HCatalog Server runs as a separate service, serving metadata about various datasets.
Users consume this data using Hive directly, or using Pig and MapReduce (via HCatalog wrappers).
The data lifecycle (ingestion, replication and retirement) is managed via the Grid Data Management (GDM) suite, which was a pre-cursor to the Apache Falcon project. GDM is tightly integrated with HCatalog, and deals with data-registrations and discovery with HCatalog.
To enable data analysis and visualization tools for analysts, we deploy HiveServer2 instances. This allows direct JDBC/ODBC based connections to Grid data, to drastically cut down analysis and decision time, as well as unnecessary intermediate copies in a separate data warehouse.
A large number of users employ Oozie jobs to produce/consume Hive data, using Oozie’s “Hive Actions“. The Yahoo Hive and Oozie teams have integrated the two systems to reduce latencies in data processing pipelines.
Q4, What is Y!Grid ? And what is it useful for?
Mithun Radhakrishnan: Y!Grid is Yahoo’s Grid of Hadoop Clusters that’s used for all the “big data” processing that happens in Yahoo today. It currently consists of 16 clusters in multiple datacenters, spanning 32,500 nodes, and accounts for almost a million Hadoop jobs every day. Around 10% of those are Hive jobs.
No one else makes as much use out of Hadoop every single day as Yahoo does. Some of the notable use cases of Hadoop at Yahoo include:
Content Personalization for increasing engagement by presenting personalized content to users based on their profile and current activity
Ad Targeting and Optimization for serving the right ad to the right customer by targeting billions of impressions everyday based on recent user activities
New Revenue Streams from native ads and mobile search monetization through better serving, budgeting, reporting and analytics
Data Processing Pipelines for aggregating various dimensions of event level traffic data (page, ad, link views, link clicks, etc.) across billions of audience, search, and advertising events everyday
Mail Anti-spam and Membership Anti-abuse for blocking billions of spam emails and hundreds of thousands of abusive accounts per day through machine learning algorithms
Search Assist and Analytics for improving the Yahoo Search experience by processing billions of web pages
Q5. What are the strengths and weakness of Hive in your experience?
Mithun Radhakrishnan: Hive combines the immense computing power of Hadoop with the accessibility and expressiveness of SQL. Its main strengths include:
Scale: Hive scales easily to multi-terabyte datasets, and isn’t shackled by memory constraints.
SQL: Hive allows business logic to be expressed in SQL. This lowers the bar of entry for usage, allowing data-analysts with little Hadoop experience to use their expertise with Hadoop data. (Performance tuning is, admittedly, a different kettle of fish. ;])
Standard: Apache Hive supports analytics through Tableau, Microstrategy and Microsoft Excel, and has supported this for the longest time.
Strong community: The dev community in Hive is brilliant, vibrant and active (as a glance at the Git log would reveal. ;]) We’ve recently seen the introduction of an Apache Tez backend, vectorization support, optimized file-formats like ORC, as well as the promise of very interesting things to come (such as the new Cost Based Optimizer, and an Apache Spark-based back-end).
Which is not to say that everything’s perfect:
M/R: Until recently, Hive’s physical plans could only target MapReduce, which caused multi-stage queries to run quite slowly. Hive 0.13 now supports the expression of physical plans as arbitrary DAGs, using Apache Tez. This dramatically boosts performance, as our benchmarks have shown.
Standard SQL: HiveQL isn’t quite SQL92-compliant yet, although it’s tending in the right direction. Industry-standard benchmarks like TPC-h and TPC-ds typically need rewriting to run on Hive. To borrow a simile from Rowan Atkinson: it is sort of like Andrew Lloyd Webber rearranging the score of Evita to suit the vocal range of Britney Spears. :]
Metastore performance and data throughput in HiveServer2 still have room for improvement.
Q6. You are an Apache HCatalog committer. What are your most important contributions? Who is currently using HCatalog and for what?
Mithun Radhakrishnan: The Apache HCatalog project has been merged with the main Apache Hive project now.
My work with HCatalog has primarily revolved around integration with other projects. Specifically:
I worked on the HCatalog notification system, to send JMS compliant notifications in response to changes to a dataset’s metadata. In Yahoo, we use this specifically with Oozie, to kick off Oozie workflows as soon as their dependency dataset-partitions are published in HCatalog. This reduces workflow launch latencies, end-to-end pipeline execution times, while also reducing NameNode pressure caused by polling.
I’ve worked (and am still working) on integration with data ingestion services like GDM. My focus at the moment is on metastore performance, and replication of tables/partitions across HCatalog instances.
HCatalog is an integral part of data processing pipelines at Yahoo, given its integration with GDM and Oozie. Outside of Yahoo, HCatalog is also used at Twitter and LinkedIn, as far as I’m aware. I’m sure there are other firms as well.
HCatalog is also used externally by several projects such as Apache Falcon, Apache Oozie, etc.
Q7. You have been benchmarking various versions of Hive. What are the main results you have obtained?
Mithun Radhakrishnan: I’ve had the opportunity to benchmark Apache Hive 0.10 through Hive 0.13, across various scales of input data, multiple data formats and tuning parameters. We’ve observed that the query performance has improved steadily, with each major release. But the jump in Hive 0.13 has been quite phenomenal. The switch to a more expressive physical execution engine in Apache Tez, coupled with vectorization, ORC files and table/column statistics has really paid dividends.
For the Yahoo Hadoop team, the main result from the benchmark was that Apache Hive 0.13 supports a “high dynamic range” of data scale: it is performant enough at the 100GB scale to approach interactivity, while simultaneously also scaling to 10+ TB of data. Given that the system scales over such a wide range, and that Yahoo already deploys Hive in production, we find little reason to deploy any other frameworks for SQL-based analytics on Y!Grid.
Q8. How did you define the workloads for your benchmark of Hive?
Mithun Radhakrishnan: When we started off, we considered creating a Yahoo-specific benchmark: a set of Hive scripts and accompanying datasets to represent the Yahoo workload. The problem was that there was a variety of datasets, and several Hive users, running different kinds of workloads.
In the end, we opted to use the TPC-h benchmarks instead. These are industry standard, more or less representative of the jobs we run at Yahoo. Hortonworks was already running a large subset of TPC-ds benchmarks on Hive. We decided that TPC-h would allow for complementary coverage. We did partition the data and transform it in the way that we would have with production data.
At the time, the comparisons most people were trying to make were between Shark (on Apache Spark) and Hive.
Shark engineers had posted results from running a port of TPC-h, transliterated to Hive’s SQL dialect. We figured we’d get an apples-to-apples comparison by running those scripts against Hive.
Q9. Did you compare Hive with other Big Data software platforms?
Mithun Radhakrishnan: The objective of the benchmark was primarily to track Apache Hive’s progressive performance gains viz. prior versions. However, I did compare Hive 0.12 and 0.13’s performance against Shark 0.7.1 and Shark 0.8 (which was trunk at the time.)
The results were mixed. At the 100GB scale, I did see Shark perform admirably. But I ran into problems with Shark at scale. A large majority of queries simply didn’t complete on Shark at the 10+TB scale. It appeared that a lot of time was lost in shuffling data between consecutive stages. Coupled with the fact that Shark was only compatible with Hive Metastore v0.9, that the number of reducers wasn’t deduced per job, the lack of support for security or interoperability with our existing production systems, Apache Hive looked the better fit for Y!Grid.
I haven’t had the opportunity to compare Hive against other systems yet. I do hope to, as soon as I can find the time, but my day-job keeps me pretty busy. :]
Qx Anything else you wish to add?
Mithun Radhakrishnan: Lots of people fret over query performance. Performance is important, but one must think holistically about the data and workload that needs to be processed, hardware choices available at various price points, holistic long-term TCOs of operating the system, current and future use cases, support, etc. Everyone’s situation would be a bit different and something to take into account when thinking about a SQL-on-Hadoop solution.
My name is Mithun Radhakrishnan. I work on Apache Hive, in the Yahoo Hadoop team. My team is responsible for the deployment, scaling and performance of Hive-related services (including HCatalog and HiveServer2) on the Y!Grid, the largest production Hadoop Clusters in existence today.
I’ve been working on Hadoop-related projects in Yahoo since 2009, including the Grid Data Management System (pre-cursor to Apache Falcon), HCatalog and Hive. I’m an Apache HCatalog committer and Hive contributor. Prior to working at Yahoo, I was a firmware developer at Hewlett-Packard, writing hardware self-diagnostic and healing firmware for HP’s big-iron boxen (Integrity Servers, running Intel Itaniums).
I’m currently working broadly on getting the Hive Metastore to perform at Yahoo-scale.
I’ve recently had the pleasure of benchmarking various versions of Hive (0.10-13), with different settings, file-formats, etc., to gauge progressive performance gains. I’ll be presenting my findings at Strata 2014.
–Hive on Apache Tez: Benchmarked at Yahoo! Scale. Mithun Radhakrishnan (Yahoo! Inc.). Talk at Strata+Hadoop Conference. 2:35pm Thursday, 10/16/2014
Follow ODBMS.org on Twitter: @odbmsorg