On Big Data Analytics –Interview with David Smith.
“The data you’re likely to need for any real-world predictive model today is unlikely to be sitting in any one data management system. A data scientist will often combine transactional data from a NoSQL system, demographic data from a RDBMS, unstructured data from Hadoop, and social data from a streaming API” –David Smith.
On the subject of Big Data Analytics I have interviewed David Smith, Vice President of Marketing and Community at Revolution Analytics.
Q1. How would you define the job of a data scientist?
David Smith: A data scientist is someone charged of analyzing and communicating insight from data.
It’s someone with a combination of skills: computer science, to be able to access and manipulate the data; statistical modeling, to be able to make predictions from the data; and domain expertise, to be able to understand and answer the question being asked.
Q2. What are the main technical challenges for Big Data predictive analytics?
David Smith: For a skilled data scientist, the main challenge is time. Big Data takes a long time just to move (so don’t do that, if you don’t have to!), not to mention the time required to apply complex statistical algorithms. That’s why it’s important to have software that can make use of modern data architectures to fit predictive models to Big Data in the shortest time possible. The more iterations a data scientist can make to improve the model, the more robust and accurate it will be.
Q3. R is an open source programming language for statistical analysis. Is R useful for Big Data as well? Can you analyze petabytes of data with R, and at the same time ensure scalability and performance?
David Smith: Petabytes? That’s a heck of a lot of data: even Facebook has “only” 70 Pb of data, total. The important thing to remember is that “Big Data” means different things in different contexts: while raw data in Hadoop may be measured in the petabytes, by the time a data scientist selects, filters and processes it you’re more likely to be in the terabytes or even gigabyte range when the data’s ready to be applied to predictive models.
Open Source R , with its in-memory, single-threaded engine, will still struggle even at this scale, though. That’s why Revolution Analytics added scalable, parallelized algorithms to R, making predictive modeling on terabytes of data possible. With Revolution R Enterprise , you can use SMP servers or MPP grids to fit powerful predictive models to hundreds of millions of rows of data in just minutes.
Q4. Could you give us some information on how Google, and Bank of America use R for their statistical analysis?
David Smith: Google has more than 500 R users , where R is used to study the effectiveness of ads, for forecasting, and for statistical modeling with Big Data.
In the financial sector, R is used by banks like Bank of America and Northern Trust and insurance companies like Allstate for a variety of applications, including data visualization, simulation, portfolio optimization, and time series forecasting.
Q5. How do you handle the Big Data Analytics “process” challenges with deriving insight?
- capturing data
- aligning data from different sources (e.g., resolving when two objects are the same)
- transforming the data into a form suitable for analysis
- modeling it, whether mathematically, or through some form of simulation
- understanding the output
- visualizing and sharing the results
David Smith: These steps reflect the fact that data science is an iterative process: long gone are the days where we would simply pump data through a black-box algorithm and hope for the best. Data transformation, evaluation of multiple model options, and visualizing the results are essential to creating a powerful and reliable statistical model. That’s why the R language has proven so popular: its interactive language encourages exploration, refinement and presentation, and Revolution R Enterprise provides the speed and big-data support to allow the data scientist to iterate through this process quickly.
Q6. What is the tradeoff between Accuracy and Speed that you usually need to make with Big Data?
David Smith: Real-time predictive analytics with Big Data are certainly possible. (See here for a detailed explanation.) Accuracy comes with real-time scoring of the model, which is dependent on a data scientist building the predictive model on Big Data. To maintain accuracy, that model will need to be refreshed on a regular basis using the latest data available.
Q7. In your opinion, is there a technology which is best suited to build Analytics Platform? RDBMS, or instead non relational database technology, such as for example columnar database engine? Else?
David Smith: The data you’re likely to need for any real-world predictive model today is unlikely to be sitting in any one data management system. A data scientist will often combine transactional data from a NoSQL system, demographic data from a RDBMS, unstructured data from Hadoop, and social data from a streaming API.
That’s one of the reasons the R language is so powerful: it provides interfaces to a variety of data storage and processing systems, instead of being dependent on any one technology.
Q8. Cloud computing: What role does it play with Analytics? What are the main differences between Ground vs Cloud analytics?
David Smith: Cloud computing can be a cost-effective platform for the Big-Data computations inherent in predictive modeling: if you occasionally need a 40-node grid to fit a big predictive model, it’s convenient to be able to spin one up at will. The big catch is in the data: if your data is already in the cloud you’re golden, but if it lives in a ground-based data center it’s going to be expensive (in time *and* money) to move it to the cloud.
David Smith, Vice President, Marketing & Community, Revolution Analytics
David Smith has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment.
He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products. David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project.
Today, David leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog.
Prior to joining Revolution Analytics, David served as vice president of product management at Zynchros, Inc.
Follow ODBMS.org on Twitter: @odbmsorg