Today, HPE announced HPE Distributed R, a massive leap forward in the world of predictive and statistical analytics. A scalable and high-performance engine for the R language, HP Distributed R allows tasks to be split across multiple nodes, enabling scale where before there simply was none. Now data scientists can analyze billions of rows for regression, PageRank, and much more, all while using the familiar RStudio and R console relied on by an estimated user base of more than 2 million. Below is an overview of a workshop hosted at HP Labs, focused on the newfound benefits of Distributed R.
Over the last two decades, R has established itself as the most-used open source tool in data analysis. R’s greatest strength is its user community, which has collectively contributed thousands of packages that extend R’s use in everything from cancer research to graph analysis. But as users in these and many other areas embrace distributed computing, we need to ensure that R continues to be easy for people to write, share, and contribute code.
When it comes to distributed computing, though, R has many packages that provide parallelism constructs but no standardized API. Each package has its own syntax, its own parallelism techniques, and its own set of supported operating systems. Unfortunately, this makes it difficult for users to write distributed programs for themselves, or to make contributions that extend easily to other scenarios.
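To make the fragmentation concrete, here is a minimal sketch that runs the same computation through two interfaces from R's bundled `parallel` package: the snow-style `parLapply` and the fork-based `mclapply`. The function `square` and the worker counts are illustrative choices, not anything discussed at the workshop.

```r
library(parallel)  # ships with base R

square <- function(x) x^2
inputs <- 1:8

# Style 1: snow-style cluster of worker processes.
# The cluster must be created and stopped explicitly.
cl <- makeCluster(2)
res_cluster <- parLapply(cl, inputs, square)
stopCluster(cl)

# Style 2: fork-based workers, configured through an argument instead.
# Forking is unavailable on Windows, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res_fork <- mclapply(inputs, square, mc.cores = cores)

# Both should agree with plain sequential lapply.
identical(res_cluster, lapply(inputs, square))
identical(res_fork, lapply(inputs, square))
```

Even inside a single package, the two styles differ in setup, argument names, and platform support, which is the kind of inconsistency a standardized API would aim to hide.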
Figuring it was time to brainstorm and standardize an API, Michael Lawrence (Genentech, R-core member) and I recently organized a workshop on “Distributed Computing in R” at HP Labs. It was attended by members of R-core and some of R’s most important academic (e.g., Univ. of Iowa, Yale, Purdue), research lab (e.g., AT&T research, ORNL), and industry (e.g., TIBCO, Revolution Analytics, Microsoft) contributors. These attendees have authored many popular R packages such as snow, Rcpp, RHIPE, foreach, and Bioconductor.
The one-and-a-half-day workshop featured a number of interesting talks and many collaborative discussions. In his presentation, Robert Gentleman, R’s co-creator, emphasized the need to streamline language constructs in order to move R forward in the era of Big Data. The other talks were given by authors of prominent R packages, who presented overviews of their packages and commented on the strengths and weaknesses of their parallelism constructs.
The talks were grouped into three sessions. The first focused on interfaces around MPI, such as R’s snow package. The second session looked at R’s integration with external analytics systems like Hadoop MapReduce, and the third included talks on external memory algorithms and packages to access data that don’t fit in main memory.
All of these talks are available at the workshop home page.
Overall, a few common themes emerged:
- For those who prefer high-performance computing and are willing to write at a low-level interface, MPI and R wrappers around MPI are a very good option.
- For in-memory processing, adding some form of distributed objects in R can potentially improve performance.
- Using simple parallelism constructs, such as lapply, that operate on distributed data structures may make it easier to program in R.
- Any high-level API should support multiple back ends, each of which can be optimized for a specific platform, much like R’s snow and foreach packages run on any available back end.
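The last two themes can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `partition` and `dist_lapply` are hypothetical names invented here, a plain list of chunks stands in for a distributed data structure, and none of this is the API of Distributed R or an outcome of the workshop.

```r
library(parallel)  # ships with base R

# Hypothetical helper: split a vector into partitions, standing in
# for a distributed data structure.
partition <- function(x, n_parts) {
  lapply(splitIndices(length(x), n_parts), function(idx) x[idx])
}

# Hypothetical high-level entry point: apply `f` to every partition
# on the chosen back end. Back end names are illustrative only.
dist_lapply <- function(parts, f,
                        backend = c("sequential", "cluster"),
                        workers = 2) {
  backend <- match.arg(backend)
  switch(backend,
    sequential = lapply(parts, f),
    cluster = {
      cl <- makeCluster(workers)
      on.exit(stopCluster(cl))  # always release the workers
      parLapply(cl, parts, f)
    }
  )
}

values <- as.numeric(1:100)
parts <- partition(values, 4)

# Same user code, different back ends: sum each partition, then combine.
total_seq <- sum(unlist(dist_lapply(parts, sum, backend = "sequential")))
total_par <- sum(unlist(dist_lapply(parts, sum, backend = "cluster")))
```

The user-facing call stays the same while the back end changes, so a new platform could be supported by adding one more branch rather than asking users to rewrite their code.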
The workshop was hugely valuable in bringing the R community together to think about how we can evolve R to best serve our future needs. A big thanks to all the attendees!
Our next step is to act on the outcomes of the workshop. Stay tuned, and Michael and I will report back as we make progress.
Resources
– Data sheet HP Distributed R (.PDF): 20150128_DS_HP_Vertica_Distibuted_R_web
– Download: https://my.vertica.com/downloads/hp-vertica-distributed-r-1-0-0/
– Infographic (.PDF): Vertica_R_infographics_01262015
Sponsored by HP Software