{"id":2058,"date":"2013-02-27T08:32:32","date_gmt":"2013-02-27T08:32:32","guid":{"rendered":"http:\/\/www.odbms.org\/blog\/?p=2058"},"modified":"2014-12-20T18:17:37","modified_gmt":"2014-12-20T18:17:37","slug":"on-big-data-analytics-interview-with-david-smith","status":"publish","type":"post","link":"https:\/\/www.odbms.org\/blog\/2013\/02\/on-big-data-analytics-interview-with-david-smith\/","title":{"rendered":"On Big Data Analytics &#8211;Interview with David Smith."},"content":{"rendered":"<p><strong><em> &#8220;The data you\u2019re likely to need for any real-world predictive model today is unlikely to be sitting in any one data management system. A data scientist will often combine transactional data from a NoSQL system, demographic data from a RDBMS, unstructured data from Hadoop, and social data from a streaming API&#8221;<\/em> &#8211;David Smith.<\/strong><\/p>\n<p>On the subject of <em>Big Data Analytics<\/em> I have interviewed <strong>David Smith<\/strong>, Vice President of Marketing and Community at <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.revolutionanalytics.com');\"  href=\"http:\/\/www.revolutionanalytics.com\">Revolution Analytics. <\/a><\/p>\n<p>RVZ<\/p>\n<p><strong>Q1. How would you define the job of a data scientist?<\/strong><\/p>\n<p><strong>David Smith:<\/strong> A data scientist is someone charged of analyzing and communicating insight from data.<br \/>\nIt\u2019s someone with a combination of skills: computer science, to be able to access and manipulate the data; statistical modeling, to be able to make predictions from the data; and domain expertise, to be able to understand and answer the question being asked.<\/p>\n<p><strong>Q2. What are the main technical challenges for Big Data predictive analytics? <\/strong><\/p>\n<p><strong>David Smith:<\/strong> For a skilled data scientist, the main challenge is time. 
Big Data takes a long time just to move (so don\u2019t do that, if you don\u2019t have to!), not to mention the time required to apply complex statistical algorithms. That\u2019s why it\u2019s important to have software that can make use of modern data architectures to fit predictive models to Big Data in the shortest time possible. The more iterations a data scientist can make to improve the model, the more robust and accurate it will be.<\/p>\n<p><strong>Q3. <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/R_(programming_language)');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/R_(programming_language)\">R<\/a> is an open source programming language for statistical analysis. Is R useful for Big Data as well? Can you analyze petabytes of data with R, and at the same time ensure scalability and performance?<\/strong><\/p>\n<p><strong>David Smith:<\/strong> Petabytes? That\u2019s a heck of a lot of data: even Facebook has \u201conly\u201d 70 PB of data, total. The important thing to remember is that \u201cBig Data\u201d means different things in different contexts: while raw data in Hadoop may be measured in the petabytes, by the time a data scientist selects, filters and processes it, you\u2019re more likely to be in the terabyte or even gigabyte range when the data\u2019s ready to be applied to predictive models.<br \/>\n<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.revolutionanalytics.com\/what-is-open-source-r\/');\"  href=\"http:\/\/www.revolutionanalytics.com\/what-is-open-source-r\/\">Open Source R<\/a>, with its in-memory, single-threaded engine, will still struggle even at this scale, though. That\u2019s why Revolution Analytics added scalable, parallelized algorithms to R, making predictive modeling on terabytes of data possible. 
With <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.revolutionanalytics.com\/products\/enterprise-big-data.php');\"  href=\"http:\/\/www.revolutionanalytics.com\/products\/enterprise-big-data.php\">Revolution R Enterprise<\/a>, you can use <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Symmetric_multiprocessing');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Symmetric_multiprocessing\">SMP servers<\/a> or <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Grid_computing');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Grid_computing\">MPP grids<\/a> to fit powerful predictive models to hundreds of millions of rows of data in just minutes.<\/p>\n<p><strong>Q4. Could you give us some information on how Google and Bank of America use R for their statistical analysis?<\/strong><\/p>\n<p><strong>David Smith:<\/strong> <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/blog.revolutionanalytics.com\/2012\/07\/applications-of-r-at-google.html');\"  href=\"http:\/\/blog.revolutionanalytics.com\/2012\/07\/applications-of-r-at-google.html\">Google has more than 500 R users<\/a>, and R is used there to study the effectiveness of ads, for forecasting, and for statistical modeling with Big Data.<br \/>\nIn the financial sector, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/blog.revolutionanalytics.com\/2010\/09\/rs-time-is-now.html');\"  href=\"http:\/\/blog.revolutionanalytics.com\/2010\/09\/rs-time-is-now.html\">R is used by banks like Bank of America<\/a> and <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.revolutionanalytics.com\/why-revolution-r\/case-studies\/northern-trust-optimizes-operational-risk-with-revolution-r-monte-carlo-simulation.php');\"  href=\"http:\/\/www.revolutionanalytics.com\/why-revolution-r\/case-studies\/northern-trust-optimizes-operational-risk-with-revolution-r-monte-carlo-simulation.php\">Northern Trust 
<\/a> and insurance companies like <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/blog.revolutionanalytics.com\/2012\/10\/allstate-big-data-glm.html');\"  href=\"http:\/\/blog.revolutionanalytics.com\/2012\/10\/allstate-big-data-glm.html\">Allstate<\/a> for a variety of applications, including data visualization, simulation, portfolio optimization, and time series forecasting.<\/p>\n<p><strong>Q5. How do you handle the Big Data Analytics &#8220;process&#8221; challenges of deriving insight?<br \/>\n&#8211; capturing data<br \/>\n&#8211; aligning data from different sources (e.g., resolving when two objects are the same)<br \/>\n&#8211; transforming the data into a form suitable for analysis<br \/>\n&#8211; modeling it, whether mathematically, or through some form of simulation<br \/>\n&#8211; understanding the output<br \/>\n&#8211; visualizing and sharing the results<\/strong><\/p>\n<p><strong>David Smith:<\/strong> These steps reflect the fact that data science is an iterative process: long gone are the days when we would simply pump data through a black-box algorithm and hope for the best. Data transformation, evaluation of multiple model options, and visualizing the results are essential to creating a powerful and reliable statistical model. That\u2019s why the R language has proven so popular: its interactive nature encourages exploration, refinement and presentation, and Revolution R Enterprise provides the speed and\u00a0big-data support to allow the data scientist to iterate through this process quickly.<\/p>\n<p><strong>Q6. What is the tradeoff between Accuracy and Speed that you usually need to make with Big Data?<\/strong><\/p>\n<p><strong>David Smith:<\/strong> Real-time predictive analytics with Big Data are certainly possible. 
(See <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/blog.revolutionanalytics.com\/2012\/11\/real-time-predictive-analytics-with-big-data-and-r.html');\"  href=\"http:\/\/blog.revolutionanalytics.com\/2012\/11\/real-time-predictive-analytics-with-big-data-and-r.html\">here<\/a> for a detailed explanation.) Real-time scoring depends on a predictive model that a data scientist has already built on Big Data. To maintain accuracy, that model will need to be refreshed on a regular basis using the latest data available.<\/p>\n<p><strong>Q7. In your opinion, is there a technology which is best suited to building an Analytics Platform? RDBMS, or instead non-relational database technology, such as for example a columnar database engine? Something else?<\/strong><\/p>\n<p><strong>David Smith:<\/strong> The data you\u2019re likely to need for any real-world predictive model today is unlikely to be sitting in any one data management system. A data scientist will often combine transactional data from a NoSQL system, demographic data from a RDBMS, unstructured data from Hadoop, and social data from a streaming <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Application_programming_interface');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Application_programming_interface\">API<\/a>.<br \/>\nThat\u2019s one of the reasons the R language is so powerful: it provides interfaces to a variety of data storage and processing systems, instead of being dependent on any one technology.<\/p>\n<p><strong>Q8. Cloud computing: What role does it play in Analytics? 
What are the main differences between Ground vs Cloud analytics?<\/strong><\/p>\n<p><strong>David Smith:<\/strong> <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Cloud_computing');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Cloud_computing\">Cloud computing<\/a> can be a cost-effective platform for the Big-Data computations inherent in predictive modeling: if you occasionally need a 40-node grid to fit a big predictive model, it\u2019s convenient to be able to spin one up at will. The big catch is in the data: if your data is already in the cloud you\u2019re golden, but if it lives in a ground-based data center it\u2019s going to be expensive (in time *and* money) to move it to the cloud.<br \/>\n&#8212;&#8212;&#8212;<\/p>\n<p><strong>David Smith, <\/strong><em>Vice President, Marketing &amp; Community, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.revolutionanalytics.com');\"  href=\"http:\/\/www.revolutionanalytics.com\">Revolution Analytics<\/a><br \/>\nDavid Smith has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment.<br \/>\nHe continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products. 
David Smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS (Emacs Speaks Statistics) project.<br \/>\nToday, David leads marketing for Revolution R, supports R communities worldwide, and is responsible for the Revolutions blog.<br \/>\nPrior to joining Revolution Analytics, David served as vice president of product management at Zynchros, Inc.<\/em><br \/>\n&#8212;<\/p>\n<p><strong>Related Posts<\/strong><\/p>\n<p><strong>&#8211;<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/blog\/2013\/02\/big-data-analytics-at-netflix-interview-with-christos-kalantzis-and-jason-brown\/');\"  href=\"http:\/\/www.odbms.org\/blog\/2013\/02\/big-data-analytics-at-netflix-interview-with-christos-kalantzis-and-jason-brown\/\">Big Data Analytics at Netflix. Interview with Christos Kalantzis and Jason Brown. February 18, 2013<\/a><\/strong><\/p>\n<p><strong>&#8211; <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/blog\/2013\/02\/lufthansa-and-data-analytics-interview-with-james-dixon\/');\"  href=\"http:\/\/www.odbms.org\/blog\/2013\/02\/lufthansa-and-data-analytics-interview-with-james-dixon\/\">Lufthansa and Data Analytics. Interview with James Dixon. February 4, 2013<\/a><\/strong><\/p>\n<p>&#8211;<strong><a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/blog\/2013\/01\/on-big-data-velocity-interview-with-scott-jarr\/');\"  href=\"http:\/\/www.odbms.org\/blog\/2013\/01\/on-big-data-velocity-interview-with-scott-jarr\/\">On Big Data Velocity. Interview with Scott Jarr. 
January 28, 2013<\/a><\/strong><\/p>\n<p><strong>Resources<\/strong><br \/>\n<strong>&#8211; Big Data and Analytical Data Platforms<\/strong><br \/>\n<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/free-downloads-and-links\/big-data-and-analytical-data-platforms\/');\"  href=\"http:\/\/www.odbms.org\/free-downloads-and-links\/big-data-and-analytical-data-platforms\/\" target=\"_blank\">Blog Posts |\u00a0Free Software |\u00a0Articles |\u00a0PhD and Master Thesis |<\/a><\/p>\n<p><strong>&#8211; Cloud Data Stores<\/strong><br \/>\n<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/downloads.aspx#cloud_bp');\"  href=\"http:\/\/www.odbms.org\/downloads.aspx#cloud_bp\">Blog Posts <\/a>|<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/downloads.aspx#cloud_ln');\"  href=\"http:\/\/www.odbms.org\/downloads.aspx#cloud_ln\"> Lecture Notes<\/a>|<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/downloads.aspx#cloud_ar');\"  href=\"http:\/\/www.odbms.org\/downloads.aspx#cloud_ar\"> Articles and Presentations<\/a>|\u00a0<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/downloads.aspx#cloud_pm');\"  href=\"http:\/\/www.odbms.org\/downloads.aspx#cloud_pm\">PhD and Master Thesis<\/a>|<\/p>\n<p><strong>&#8211; NoSQL Data Stores<\/strong><br \/>\n<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/downloads.aspx#nosql_bp');\"  href=\"http:\/\/www.odbms.org\/downloads.aspx#nosql_bp\">Blog Posts <\/a>|<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/downloads.aspx#nosql_fr');\"  href=\"http:\/\/www.odbms.org\/downloads.aspx#nosql_fr\"> Free Software<\/a> |\u00a0<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/downloads.aspx#nosql_ap');\"  href=\"http:\/\/www.odbms.org\/downloads.aspx#nosql_ap\">Articles, Papers, Presentations<\/a>|<a 
onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/downloads.aspx#nosql_tu');\"  href=\"http:\/\/www.odbms.org\/downloads.aspx#nosql_tu\">Documentations, Tutorials, Lecture Notes<\/a> |<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/downloads.aspx#nosql_ph');\"  href=\"http:\/\/www.odbms.org\/downloads.aspx#nosql_ph\"> PhD and Master Thesis<\/a><\/p>\n<p><strong>Follow ODBMS.org on Twitter: <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/twitter.com\/odbmsorg');\"  href=\"https:\/\/twitter.com\/odbmsorg\">@odbmsorg<\/a><\/strong><\/p>\n<p>##<\/p>\n<!-- AddThis Advanced Settings generic via filter on the_content --><!-- AddThis Share Buttons generic via filter on the_content -->","protected":false},"excerpt":{"rendered":"<p>&#8220;The data you\u2019re likely to need for any real-world predictive model today is unlikely to be sitting in any one data management system. A data scientist will often combine transactional data from a NoSQL system, demographic data from a RDBMS, unstructured data from Hadoop, and social data from a streaming API&#8221; &#8211;David Smith. 
On the [&hellip;]<!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[21,35,55,66,97,193,224,239,411,412,413,446,485,490,499,501],"_links":{"self":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts\/2058"}],"collection":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/comments?post=2058"}],"version-history":[{"count":1,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts\/2058\/revisions"}],"predecessor-version":[{"id":3704,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts\/2058\/revisions\/3704"}],"wp:attachment":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/media?parent=2058"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/categories?post=2058"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/tags?post=2058"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}