On Virtualize Hadoop. Interview with Joe Russell.
“A common misconception when virtualizing Hadoop clusters is that we decouple the data nodes from the physical infrastructure. This is not necessarily true. When users virtualize a Hadoop cluster using Project Serengeti, they separate data from compute while preserving data locality. By preserving data locality, we ensure that performance isn’t negatively impacted, essentially making the infrastructure appear static.” – Joe Russell.
In June last year, VMware announced an open source project called Serengeti.
The main idea of Project Serengeti is to enable users and companies to quickly deploy, manage and scale Apache Hadoop on virtual infrastructure.
I have interviewed Joe Russell, Product Line Marketing Manager for Big Data at VMware.
Q1. Why Virtualize Hadoop?
Joe Russell: Hadoop is a technology that Enterprises are increasingly using to process large amounts of information. While the technology is generally pretty early in its lifecycle, we are starting to see more enterprise-grade use cases. In its current form, Hadoop is difficult to use and lacks the toolsets to efficiently deploy, run, and manage clusters in an Enterprise context. Virtualizing Hadoop not only brings enterprise-tested High Availability and Fault Tolerance to Hadoop, but it also allows for much more agile and automated management of Hadoop clusters. Additionally, virtualization allows for separation of data and compute, which allows users to preserve data locality and paves the way towards more advanced use cases such as mixed workload deployments and Hadoop-as-a-service.
Q2. You claim to be able to deploy an Apache Hadoop cluster (HDFS, MapReduce, Pig, Hive) in minutes on an existing vSphere cluster using Serengeti. How do you do this? Could you give us an example of how you customize a Hadoop cluster?
Joe Russell: You will be able to customize Hadoop clusters easily from within the Serengeti tool by specifying node and resource allocations through an easy-to-use user interface.
Q3. There are concerns about the approach of decoupling Apache Hadoop nodes from the underlying physical infrastructure. Quoting Steve Loughran (HP Research): “Hadoop contains lots of assumptions about running in a static infrastructure; its scheduling and recovery algorithms assume this.” What is your take on this?
Joe Russell: A common misconception when virtualizing Hadoop clusters is that we decouple the data nodes from the physical infrastructure. This is not necessarily true. When users virtualize a Hadoop cluster using Project Serengeti, they separate data from compute while preserving data locality. By preserving data locality, we ensure that performance isn’t negatively impacted, essentially making the infrastructure appear static. Additionally, it creates true multi-tenancy within more layers of the Hadoop stack, not just the name node.
I think there is some confusion when we say “in the cloud”. Here, Steve is talking about running it on a public cloud like Amazon. Steve is largely introducing the concept of data locality, or the notion that large amounts of data are hard to move. In this scenario, it makes sense to bring compute resources to the data to ensure performance isn’t negatively impacted by networking limitations. VMware advocates that Hadoop should be virtualized, as it introduces a level of flexibility and management that allows companies to easily deploy, manage, and scale internal Hadoop clusters.
Q4. How do you ensure High Availability (HA)? How do you protect against host and VM failures?
Joe Russell: We ensure High Availability (HA) by leveraging vSphere’s tested solution via Project Serengeti’s integration with vCenter (management console of vSphere).
In the event of physical server failure, affected virtual machines are automatically restarted on other production servers with spare capacity. In the case of operating system failure, vSphere HA restarts the affected virtual machine on the same physical server.
In Hadoop nomenclature, this means that there is HA on more than just the name node. vSphere’s solution also allows for HA on the jobtracker node, metastores, and on the management server, which are critical pieces of any Hadoop system that require high availability.
More importantly, as Hadoop is a batch-oriented process, it is important that when a physical host does fail, you are able to pause and then restart that job from the point in time at which it went down. VMware’s vSphere solution allows for this and has been tested amongst the biggest Enterprises for the better part of the past decade.
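The two restart rules described in this answer can be illustrated with a toy model. This is only a sketch under stated assumptions, not vSphere's API or its actual placement algorithm: the `Host` class, the slot-based capacity, the "most spare capacity first" choice, and all host and VM names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Host:
    """Hypothetical physical server with a fixed number of VM slots."""
    name: str
    capacity: int
    vms: List[str] = field(default_factory=list)

    def spare(self) -> int:
        return self.capacity - len(self.vms)


def restart_after_host_failure(hosts: Dict[str, Host], failed: str) -> Dict[str, str]:
    """Physical server failure: restart each affected VM on another host
    that still has spare capacity. Returns a VM -> new host mapping;
    VMs that cannot be placed are left out of the result."""
    orphaned = hosts[failed].vms
    hosts[failed].vms = []
    survivors = [h for name, h in hosts.items() if name != failed]
    placement: Dict[str, str] = {}
    for vm in orphaned:
        # Assumption: prefer the surviving host with the most free slots.
        target = max(survivors, key=lambda h: h.spare(), default=None)
        if target is None or target.spare() <= 0:
            continue
        target.vms.append(vm)
        placement[vm] = target.name
    return placement


def restart_after_guest_os_failure(hosts: Dict[str, Host], vm: str) -> str:
    """Operating system failure: the VM is restarted on the same host."""
    for host in hosts.values():
        if vm in host.vms:
            return host.name
    raise KeyError(vm)


hosts = {
    "esx1": Host("esx1", 2, ["namenode", "jobtracker"]),
    "esx2": Host("esx2", 3, ["datanode1"]),
    "esx3": Host("esx3", 2, []),
}
same_host = restart_after_guest_os_failure(hosts, "datanode1")
moved = restart_after_host_failure(hosts, "esx1")
```

In this toy run the name node and jobtracker VMs land on different surviving hosts, while the data node VM with a failed guest OS stays where it was.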
Q5. How do you get Data Insights? Do you already have examples of how virtualized Hadoop is currently used in Enterprises? If yes, which ones?
Joe Russell: Data Insights occur farther up the stack with analytics vendors.
Project Serengeti is a tool that allows you to run Hadoop on top of vSphere and is a solution designed to allow users to consolidate Hadoop clusters on a single underlying virtual infrastructure. The tool allows users to run different types of Hadoop distributions on a hypervisor to gain the benefits of virtualization, which include efficiency, elasticity, and agility.
Q6. Does Serengeti only work with the VMware vSphere® platform?
Joe Russell: Project Serengeti today only works with the vSphere hypervisor.
However, VMware made the decision to open source Project Serengeti to make the code available to anyone who wishes to use it.
By making it open source vs. just offering a free closed source product, VMware allows users to take the Serengeti code and alter it for their own purposes. For example, any user could download the Project Serengeti code and alter it to make it work with hypervisors other than vSphere. While it isn’t in VMware’s interest to dedicate resources to make Project Serengeti run with other hypervisors, it doesn’t prevent users from doing so. This is an important point.
Q7. VMware is working with the Apache Hadoop community to contribute changes to the Hadoop Distributed File System (HDFS) and Hadoop MapReduce projects to make them “virtualization-aware”. What does it mean? What are these changes?
Joe Russell: Hadoop Virtual Extensions (“HVE”) is one example of this. VMware contributed HVE back to the Apache community to make Hadoop distributions virtualization-aware. This means inserting a node group layer between the rack and host layers to make Hadoop distributions topology-aware on virtualized platforms. In its simplest terms, this allows VMware to preserve data locality and increase reliability through the separation of data and compute.
A link to a whitepaper with further detail can be found here (.pdf).
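The node group idea from this answer can be sketched in a few lines. This is a minimal illustration, not Hadoop's or HVE's actual code: it assumes locations written as `/rack/nodegroup/vm` paths, where a node group stands for the VMs sharing one physical host, and the function names and sample paths are hypothetical. The placement rule shown (never two replicas in the same node group) is a simplification of what a virtualization-aware policy would enforce.

```python
from typing import List


def path_parts(location: str) -> List[str]:
    """Split a location such as '/rack1/ng1/vm1' into its layers."""
    return [part for part in location.split("/") if part]


def distance(a: str, b: str) -> int:
    """Hops from each location up to their closest common ancestor.
    With three layers: 0 = same VM, 2 = same node group (same physical
    host), 4 = same rack, 6 = different racks."""
    pa, pb = path_parts(a), path_parts(b)
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def pick_replicas(candidates: List[str], count: int) -> List[str]:
    """Choose up to `count` locations so that no two replicas share a
    node group; otherwise one physical host failure could remove
    several copies of the same block at once."""
    chosen: List[str] = []
    used_groups = set()
    for location in candidates:
        rack, group, _vm = path_parts(location)
        if (rack, group) in used_groups:
            continue
        chosen.append(location)
        used_groups.add((rack, group))
        if len(chosen) == count:
            break
    return chosen


candidates = [
    "/rack1/ng1/vm1",
    "/rack1/ng1/vm2",  # same physical host as vm1, so it is skipped
    "/rack1/ng2/vm3",
    "/rack2/ng3/vm4",
]
replicas = pick_replicas(candidates, 3)
```

Without the extra layer, `vm1` and `vm2` would look like two independent hosts in the same rack, and a scheduler could neither avoid co-locating replicas on them nor recognize reads between them as host-local.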
Q8. What about the performance of such “virtualized” Hadoop? Do you have performance measures to share?
Joe Russell: Please see the whitepaper referenced above.
Q9. What is the value of Hadoop-in-cloud? How does it relate to the virtualization of Hadoop?
Joe Russell: I don’t necessarily understand the question and it would be particularly helpful to define what you mean by “Hadoop-in-Cloud”.
I think you may be referring to Hadoop-as-a-Service, which is valuable in that users are able to deliver Hadoop resources to internal users based on need. Centralized control through Hadoop-as-a-Service ensures high cluster utilization, lower TCO, and an agile framework to adjust to ever-changing business needs. As Enterprises increasingly look to service internal customers, I expect Hadoop-as-a-Service to become more popular as the Hadoop technology emerges within the enterprise. Please keep in mind that this relates both to private and public clouds. Virtualizing Hadoop is the first step toward being able to provision Hadoop in the cloud.
Q10. VMware also announced updates to Spring for Apache Hadoop. Could you tell us what these updates are?
Q11. VMware is working with a number of Apache Hadoop distribution vendors (Cloudera, Greenplum, Hortonworks, IBM and MapR) to support a wide range of distributions. Why? Could you tell us exactly what VMware’s contribution is?
Joe Russell: VMware is focused on providing a common underlying virtual infrastructure so each of these vendors can run their software better on vSphere. Project Serengeti is a toolset that pre-configures, tunes and makes it easier to deploy and run Hadoop with increased reliability on vSphere. These efforts make it easier for enterprises to make architectural decisions around how to set up Hadoop within their companies. Deciding to virtualize Hadoop can have dramatic effects not only on companies just beginning to use Hadoop, but also on more advanced users of the technology. VMware’s contributions through Project Serengeti allow each vendor’s software to run better on virtualized infrastructure. As you know, these contributions are available for anyone to use.
Q12. Serengeti, “Virtualize” Hadoop, Hadoop in the Cloud, Spring for Apache Hadoop: what is the global picture here? How do all of these efforts relate to each other? What are the main benefits for developers and users of Apache Hadoop?
Joe Russell: All of these efforts improve the technology and make it easier for developers and users of Hadoop to actually use Hadoop. Additionally, these efforts focus on virtualizing Hadoop to make the technology more elastic, reliable, and performant.
VMware is focused on bringing the benefits of virtualization to Hadoop, both from a community standpoint and a customer standpoint. It has been open in its approach to contributing back technology that makes it easier for users and developers to utilize virtualization for their Hadoop clusters. At the same time, it is investing in bringing Hadoop to its existing customers by making the technology more reliable and building easy-to-use tools around it, so that it is easier to deploy and administer in an Enterprise setting with SLAs and business-critical workloads.
Joe Russell is responsible for product strategy, GTM, evangelism and product marketing of Big Data at VMware.
He has over a decade of experience in a blend of product marketing, finance, operations, and M&A roles.
Previously he worked for Yahoo!, and as an Investment Banking – Technology M&A Analyst for GCA Savvian, Credit Suisse and Societe Generale.
He holds an MSc in Accounting & Finance from the London School of Economics and Political Science, a BS in Economics with Honors from the University of Washington, and an MBA from the Wharton School, University of Pennsylvania.
ODBMS.org- Lecture Notes: Data Management in the Cloud.
by Michael Grossniklaus, David Maier, Portland State University.
Course Description: “Cloud computing has recently seen a lot of attention from research and industry for applications that can be parallelized on shared-nothing architectures and have a need for elastic scalability. As a consequence, new data management requirements have emerged with multiple solutions to address them. This course will look at the principles behind data management in the cloud as well as discuss actual cloud data management systems that are currently in use or being developed. The topics covered in the course range from novel data processing paradigms (MapReduce, Scope, DryadLINQ), to commercial cloud data management platforms (Google BigTable, Microsoft Azure, Amazon S3 and Dynamo, Yahoo PNUTS) and open-source NoSQL databases (Cassandra, MongoDB, Neo4J). The world of cloud data management is currently very diverse and heterogeneous. Therefore, our course will also report on efforts to classify, compare and benchmark the various approaches and systems. Students in this course will gain broad knowledge about the current state of the art in cloud data management and, through a course project, practical experience with a specific system.”
Lecture Notes | Intermediate/Advanced | English | ~280 slides (PDF) | 2011-12