On HP Distributed R. Interview with Walter Maguire and Indrajit Roy

by Roberto V. Zicari on April 9, 2015

“Predictive analytics is a market which has been lagging the growth of big data – full of tools developed twenty or more years ago which simply weren’t built with today’s challenges in mind.”–Walter Maguire and Indrajit Roy

HP announced HP Distributed R. I wanted to learn more about it, so I interviewed Walter Maguire, Chief Field Technologist with the HP Big Data Group, and Indrajit Roy, principal researcher at HP, who provided the answers with the assistance of Malu G. Castellanos, manager and technical contributor in the Vertica group of Hewlett Packard.

RVZ

Q1. HP announced HP Distributed R. How does it differ from standard R?

Maguire, Roy: R is a very popular statistical analysis tool, but it was conceived before the era of Big Data: it is single-threaded and cannot analyze massive datasets. HP Distributed R brings scalability and high performance to R users. It is not a competing version of R. Rather, it is an open source package that can be installed on vanilla R. Once installed, R users can leverage the pre-built distributed algorithms and the Distributed R API to benefit from cluster computing and dramatically expand the scale of the data they can analyze.
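For illustration, a minimal sketch of that workflow (the start/stop calls follow the open source release; the algorithm package and its exact signature are illustrative, and X is assumed to be a distributed array built beforehand):

    library(distributedR)     # the open source package, installed on vanilla R
    distributedR_start()      # launch R worker processes across the cluster

    library(HPdcluster)       # companion package of pre-built distributed algorithms
    model <- hpdkmeans(X, centers = 5)   # k-means runs in parallel over the darray X

    distributedR_shutdown()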

Q2. How does HP Distributed R work?

Maguire, Roy: HP Distributed R has three components:
(1) an open source distributed runtime that executes R functions,
(2) a fast, parallel data loader to ingest data from different sources such as the Vertica database, and
(3) a mechanism to deploy the model in the Vertica database.
The distributed runtime is the core of HP Distributed R.
It starts multiple R workers on the cluster, breaks the user's program into multiple independent tasks, and executes them in parallel across the cluster. The runtime hides much of the internal data communication. For example, the user does not need to know how many machines make up the cluster or where data resides within it. In essence, it allows any R algorithm which has been ported to Distributed R to act like a massively parallel system.
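As a rough sketch of how the first two components fit together (the loader comes from the companion HPdata package as we recall it; the table name and the "VerticaDSN" ODBC data source are hypothetical):

    library(distributedR)
    distributedR_start()       # component 1: the distributed runtime
    distributedR_status()      # inspect the workers the runtime launched

    library(HPdata)            # component 2: parallel loaders
    # Pull a Vertica table into a distributed data frame, partitioned across workers.
    orders <- db2dframe(tableName = "public.orders", dsn = "VerticaDSN")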

Q3. Could you tell us some details on how users write Distributed R programs to benefit from scalability and high performance?

Maguire, Roy: A programmer can use HP Distributed R's API to write distributed applications. The API consists of two types of language constructs. First, the API provides distributed data structures. These are distributed versions of R's common data structures such as array, data.frame, and list. As an example, distributed arrays can store hundreds of gigabytes of data in memory, spread across a cluster. Second, the API provides a way for users to express parallel tasks on distributed data structures. While R users can write their own custom distributed applications using this API, we expect most R users to be interested in the built-in algorithms. Just as R has built-in functions such as kmeans for clustering and glm for regression, HP Distributed R provides distributed versions of common clustering, classification, and graph algorithms.
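A small sketch showing the two construct types together (the darray/foreach/splits/update pattern follows the open source documentation; exact signatures may differ):

    library(distributedR)
    distributedR_start()

    # A distributed array: one million rows stored as four 250K-row partitions.
    A <- darray(dim = c(1e6, 10), blocks = c(250000, 10), data = 0)

    # One parallel task per partition; each task sees only its local block.
    foreach(i, 1:npartitions(A), function(a = splits(A, i)) {
      a <- a + 1       # operate on the local partition
      update(a)        # publish the modified partition back into A
    })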

Q4. R already has many packages that provide parallelism constructs. How do they fit into Distributed R?

Maguire, Roy: Yes, R has a number of open source parallel packages. Unfortunately, none of these packages can handle hundreds or thousands of gigabytes of data, and none provides built-in distributed data structures and computational algorithms. HP Distributed R fills that functionality gap, along with enterprise support – which is critical for customers before they deploy R in production systems.
Also, it's worth noting that using Distributed R doesn't prevent an R programmer from using their current libraries in their current environment. Those libraries just won't gain the scale and performance benefits of Distributed R.

Q5. Why is there a need to streamline language constructs in order to move R forward in the era of Big Data?

Maguire, Roy: The open source community has done a tremendous job of advancing R—different algorithms, thousands of packages, and a great user community.
However, in the case of parallelism and Big Data there is a confusing mix of R extensions. These packages have overlapping functionality, in many cases completely different syntax, and none of them solve all the issues users face with Big Data. We need to ensure that future R contributors can use a standard set of interfaces and write applications that are portable across different backend packages. This is not just our concern, but something that members of R-core and other companies are interested in as well. Our goal is to help the open source community streamline some of the language constructs so they can spend more time answering analytic questions and less time trying to make sense of the different R extensions.

Q6. What are in your opinion the strengths and weaknesses of the current R parallelism constructs?

Maguire, Roy: Some packages, such as "parallel", are very useful. "parallel" is accessible to most R users, already ships with R, and makes it easy to express embarrassingly parallel applications (those in which individual tasks don't need to coordinate with each other). Still, "parallel" and other packages lack concepts such as distributed data structures, which provide much-needed performance on massive data. Additionally, it is not clear whether the infrastructure implementing the existing parallel constructs has been tested on large, multi-gigabyte data.
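For instance, an embarrassingly parallel job in the "parallel" package can be as simple as this (the chunk files are hypothetical; mclapply forks processes, so this form applies on Unix-alikes):

    library(parallel)

    # Fit one model per chunk; the 100 tasks never need to talk to each other.
    r2 <- mclapply(1:100, function(i) {
      chunk <- read.csv(sprintf("chunk_%03d.csv", i))   # hypothetical input files
      summary(lm(y ~ x, data = chunk))$r.squared
    }, mc.cores = detectCores())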

Q7. When are MPI and R wrappers around MPI a good option?

Maguire, Roy: MPI is a powerful tool. It is widely used in the scientific and high performance computing domains.
If you have an existing MPI application and want to expose it to R users, the right thing is to make it available through R wrappers. It does not make sense to rewrite these optimized scientific applications in R or any other language.
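A hedged sketch of that wrapper pattern using the Rmpi package (assumes a working MPI installation; the calls reflect Rmpi's documented API):

    library(Rmpi)

    mpi.spawn.Rslaves(nslaves = 4)         # start four R workers under MPI
    mpi.bcast.cmd(r <- mpi.comm.rank())    # run a command on every worker
    print(mpi.remote.exec(r))              # collect each worker's rank

    mpi.close.Rslaves()
    mpi.quit()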

Q8. Why, for in-memory processing, can adding some form of distributed objects to R potentially improve performance?

Maguire, Roy: In-memory processing represents a big change moving forward. The key idea is to remove bottlenecks, such as the disk, which slow down applications. In HP Distributed R, distributed objects provide a way to store and manipulate data in memory. Without these distributed objects, data on worker nodes would be ephemeral and users would not be able to reference remote data. Worse, there would be performance issues. For example, many machine learning applications are iterative and need to execute tasks for multiple rounds. Without the concept of distributed objects, applications would end up re-broadcasting data to remote servers in each round. This results in a lot of data movement and very poor performance. Incidentally, this is a good example of why we undertook Distributed R in the first place. Implementing the bare bones of a parallel application is relatively straightforward, but thousands or tens of thousands of edge cases arise once that application is in use, due to the nature of distributed processing.
This is when the value of a cohesive parallel framework like Distributed R becomes very apparent.
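A toy sketch of the iterative pattern (API names as in the open source release; the gradient step itself is illustrative): the large darray X is created once and referenced by name in every round, so only the small model vector crosses the network.

    X <- darray(dim = c(1e6, 10), blocks = c(250000, 10))  # loaded once, stays resident
    w <- runif(10)                                         # small model, lives at the master

    for (iter in 1:20) {
      G <- darray(dim = c(4, 10), blocks = c(1, 10))       # one partial gradient per task
      foreach(i, 1:npartitions(X),
              function(x = splits(X, i), g = splits(G, i), w = w) {
        g <- t(t(x) %*% (x %*% w))   # toy partial gradient from the resident block
        update(g)
      })
      w <- w - 0.01 * colSums(getpartition(G))  # only this small vector is shipped back
    }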

Q9. Do you think that simple parallelism constructs, such as lapply, operating on distributed data structures may make it easier to program in R?

Maguire, Roy: Yes, we need to ensure that R users have a simple API to express parallelism. Implementing machine learning algorithms requires deep knowledge. Couple that with parallelism, and you are left with a very small set of people who can really write such applications. To ensure that R users continue to contribute, we need an API which is familiar to current R users. Constructs from the apply() family are a good choice. In fact, we are exploring these kinds of APIs with members of R-core.
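To make the direction concrete, a purely hypothetical sketch of what such a construct might look like (dlapply is not a shipping API; it merely mirrors base::lapply over a distributed structure dX):

    # Each function call runs on the worker that already holds the partition,
    # so the familiar lapply idiom scales without the user managing tasks.
    partition_means <- dlapply(dX, function(block) colMeans(block))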

Q10. R is an open-source software project. What about HP Distributed R?

Maguire, Roy: Just like R, HP Distributed R is a GPL-licensed open source project. Our code is available on GitHub and we try to release a new version every few months. We provide enterprise support for customers who need it. If you have HP Vertica Enterprise Edition, you will see additional benefits from integrating Vertica with Distributed R.
For example, you can build a machine learning model in Distributed R and then deploy it in Vertica to score data in real time in an analytic application – something many of our customers need.
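A hedged sketch of that round trip (the regression package and the deployment helper are named as we recall the release; the DSN, model name, table, and the in-database scoring call are illustrative):

    library(HPdregression)     # distributed regression algorithms
    model <- hpdglm(responses = Y, predictors = X, family = binomial())

    # Push the trained model into Vertica for in-database scoring.
    deploy.model(model = model, dsn = "VerticaDSN", modelName = "churn_glm")

    # Then score inside Vertica, close to the data (SQL, shown as comments):
    #   SELECT id, glmPredict(x1, x2, x3 USING PARAMETERS model = 'churn_glm')
    #   FROM events;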

Qx. Anything you wish to add?

Maguire, Roy: Predictive analytics is a market which has been lagging the growth of big data – full of tools developed twenty or more years ago which simply weren’t built with today’s challenges in mind.
With HP Distributed R we are not only providing users with scalable and high performance solutions, but also making a difference in the open source community. We look forward to nurturing contributors who can straddle the world of data science and distributed systems.
A core tenet of our big data strategy is to create a positive developer experience, and we are very focused on technology development and fulfillment choices which support that goal.

———————
Walter Maguire has twenty-eight years of experience in analytics and data technologies.
He practiced data science before it had a name, worked with big data when “big” meant a megabyte, and has been part of the movement which has brought data management and analytic technologies from back-office, skunk works operations to core competencies for the largest companies in the world. He has worked as a practitioner as well as a vendor, working with analytics technologies ranging from SAS and R to data technologies such as Hadoop, RDBMS and MPP databases. Today, as Chief Field Technologist with the HP Big Data Group, Walt has the unique pleasure of addressing strategic customer needs with Haven, the HP big data platform.

Indrajit Roy is a principal researcher at HP. His research focuses on next-generation distributed systems that solve the challenges of Big Data. Indrajit's pet project is HP Distributed R, a new open source product that helps data scientists. Indrajit has multiple patents, publications, and a best paper award. In the past he worked on computer security and parallel programming. Indrajit received his PhD in computer science from the University of Texas at Austin.

Related Posts

On Big Data Analytics. Interview with Anthony Bak. ODBMS Industry Watch, December 7, 2014

Predictive Analytics in Healthcare. Interview with Steve Nathan. ODBMS Industry Watch, August 26, 2014

On Column Stores. Interview with Shilpa Lawande. ODBMS Industry Watch, July 14, 2014

Follow ODBMS.org on Twitter: @odbmsorg

##
