{"id":3140,"date":"2014-04-14T07:00:30","date_gmt":"2014-04-14T07:00:30","guid":{"rendered":"http:\/\/www.odbms.org\/blog\/?p=3140"},"modified":"2014-04-14T07:31:48","modified_gmt":"2014-04-14T07:31:48","slug":"interview-mike-stonebraker-paul-brown","status":"publish","type":"post","link":"https:\/\/www.odbms.org\/blog\/2014\/04\/interview-mike-stonebraker-paul-brown\/","title":{"rendered":"On the SciDB array database. Interview with Mike Stonebraker and Paul Brown."},"content":{"rendered":"<blockquote><p><strong>&#8220;SciDB is both a data store and a massively parallel compute engine for numerical processing. The inclusion of this computational platform is what makes us the first \u201ccomputational database\u201d, not just a SQL-style decision support DBMS. Hence, we need a new moniker to describe this class of interactions.\u00a0We settled on computational databases, but if your readers have a better suggestion, we are all ears!&#8221;<br \/>\n&#8211;Mike Stonebraker, Paul Brown.<\/strong><\/p><\/blockquote>\n<p>On the SciDB\u00a0array database, I have interviewed\u00a0<b>Mike Stonebraker, <\/b>MIT Professor and <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.paradigm4.com');\"  href=\"http:\/\/www.paradigm4.com\" target=\"_blank\">Paradigm4<\/a> co-founder and CTO, and<b> Paul Brown<\/b>, Paradigm4 Chief Architect.<\/p>\n<p>RVZ<\/p>\n<p><b>Q1: What is SciDB and why did you create it?<\/b><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown:\u00a0<\/b><\/b><a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/scidb.org');\"  href=\"http:\/\/scidb.org\" target=\"_blank\">SciDB<\/a> is an open source array database with scalable, built-in complex analytics, programmable from <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/R_(programming_language)');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/R_(programming_language)\" target=\"_blank\">R<\/a> and <a 
onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Python_(programming_language)');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Python_(programming_language)\" target=\"_blank\">Python<\/a>. The requirements for SciDB emerged from discussions between academic database researchers\u2014Mike Stonebraker and Dave DeWitt\u2014 and scientists at the first Extremely Large Databases conference (XLDB) at <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www-conf.slac.stanford.edu\/ssi\/2007\/lateReg\/default.htm');\"  href=\"http:\/\/www-conf.slac.stanford.edu\/ssi\/2007\/lateReg\/default.htm\" target=\"_blank\">SLAC in 2007<\/a> about coping with the peta-scale data from the forthcoming LSST telescope.<\/p>\n<p>Recognizing that commercial and industrial users were about to face the same challenges as scientists, Mike Stonebraker founded Paradigm4 in 2010 to make the ideas explored in early prototypes available as a commercial-quality software product.\u00a0Paradigm4 develops and supports both a free, open-source Community Edition (<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.scidb.org\/forum');\"  href=\"http:\/\/www.scidb.org\/forum\" target=\"_blank\">scidb.org\/forum<\/a>) and an Enterprise Edition with additional features (<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.paradigm4.com');\"  href=\"http:\/\/www.paradigm4.com\" target=\"_blank\">paradigm4.com<\/a>).<\/p>\n<p><b>Q2. With the rise of Big Data analytics, is <\/b><b>the convergence of analytic needs between science and industry really happening?<\/b><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown: \u00a0<\/b><\/b>There is a \u201csea change\u201d occurring as companies move from Business Intelligence (think SQL analytics) to Complex Analytics (think predictive modelling, clustering, correlation, principal components analysis, graph analysis, etc.). 
Obviously science folks have been doing complex analytics on big data all along.<\/p>\n<p>Another force driving this sea change is all the machine-generated data produced by cell phones, genomic sequencers, and devices on the Industrial Internet and the Internet of Things.\u00a0 Here too science folks have been working with big data from sensors, instruments, telescopes and satellites all along.\u00a0 So it is quite natural that a scalable computational database like SciDB that serves the science world is a good fit for the emerging needs of commercial and industrial users.<\/p>\n<p>There will be a convergence of the two markets as many more companies aspire to develop innovative products and services using complex analytics on big and diverse data. In the forefront are companies doing electronic trading on Wall Street; insurance companies developing new pricing models using telematics data; pharma and biotech companies analyzing genomics and clinical data; and manufacturing companies building predictive models to anticipate repairs on expensive machinery.\u00a0 We expect everybody will move to this new paradigm over time. \u00a0After all, a predictive model integrating diverse data is much more useful than a chart of numbers about past behavior.<\/p>\n<p><b>Q3. What are the typical challenges posed by scientific analytics?<\/b><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown: \u00a0<\/b><\/b>We asked a lot of working scientists the same question, and published a paper in IEEE Computing in Science &amp; Engineering summarizing their answers (*see citation below). In a nutshell, there are four primary issues.<\/p>\n<p>1.\u00a0<b>Scale<\/b>. 
Science has always been intensely \u201cdata driven\u201d.\u00a0 With the ever-increasing massive data-generating capabilities of scientific instruments, sensors, and computer simulations, the average scientist is overwhelmed with data and needs data management and analysis tools that can scale to meet his or her needs, now and in the future.<\/p>\n<p>2.\u00a0<b>New Analytic Methods<\/b>. Historically analysis tools have focused on business users, and have provided easy-to-use interfaces for submitting SQL aggregates to data warehouses.\u00a0 Such business intelligence (BI) tools are not useful to scientists, who universally want much more complex analyses, whether it be outlier detection, curve fitting, analysis of variance, predictive models or network analysis.\u00a0 Such \u201ccomplex analytics\u201d is defined on arrays in linear algebra, and requires a new generation of client-side tools and server-side tools in DBMSs.<\/p>\n<p>3. \u00a0\u00a0<b>Provenance<\/b>. One of the central requirements that scientists have is reproducibility. They need to be able to send their data to colleagues to rerun their experiments and produce the same answers.\u00a0 As such, it is crucial to keep prior versions of data in the face of updates, error correction, and the like.\u00a0 The right way to provide such provenance is through a no-overwrite DBMS, which allows time travel back to when the experiment in question was performed.<\/p>\n<p>4.\u00a0\u00a0<b>Interactivity<\/b>. Unlike business users, who are often comfortable with batch reporting of information, scientific users are invariably exploring their data, asking \u201cwhat if\u201d questions and testing hypotheses.\u00a0 What they need is interactivity on very large data sets.<\/p>\n<p><b>Q3. 
What are in your opinion the commonalities between scientific and industrial analytics?<\/b><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown: \u00a0<\/b><\/b>We would state the question in reverse \u201cWhat are the differences between the two markets?\u201d In our opinion, the two markets will converge quickly as commercial and industrial companies move to the analytic paradigms pervasive in the science marketplace.<\/p>\n<p><b>Q4. How come in the past the database system software community has failed to build the kinds of systems that scientists needed for managing massive data sets?<\/b><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown:\u00a0<\/b><\/b>Mostly it\u2019s because scientific problems represent a $0 billion market! However, the convergence of industrial requirements and science requirements means that science can \u201cpiggy back\u201d on the commercial market and get their needs met.<\/p>\n<p><b>Q5. SciDB is a scalable array database with native complex analytics. Why did you choose a data model based on multidimensional arrays?<\/b><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown:\u00a0<\/b><\/b>Our main motivation is that at scale, the complex analyses done by \u201cpost sea change\u201d users are invariably about applying parallelized linear algebraic algorithms to arrays. 
Whether you are doing regression, singular value decomposition, finding eigenvectors, or doing operations on graphs, you are performing a sequence of matrix operations.\u00a0 Obviously, this is intuitive and natural in an array data model, whereas you have to recast tables into arrays if you begin with an RDBMS or keep data in files.\u00a0 Also, a native array implementation can be made much faster than a table-based system by directly implementing multi-dimensional clustering and doing selective replication of neighboring data items.<\/p>\n<p>Our secondary motivation is that, just like mathematical matrices, geospatial data, time-series data, image data, and graph data are most naturally organized as arrays.\u00a0 By preserving the inherent ordering in the data, SciDB supports extremely fast selection (including vectors, planes, and \u2018hypercubes\u2019), multi-dimensional windowed aggregates, and re-gridding of the data to change spatial or temporal resolution.<\/p>\n<p><b>Q6. In a nutshell, how do you manage scalability with a high degree of tolerance to failures?<\/b><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown:\u00a0<\/b><\/b>In a nutshell? Partitioning and redundancy (k-replication).<\/p>\n<p>First, SciDB splits each array\u2019s attributes apart, just like any columnar system. Then we partition each array into rectilinear blocks we call \u201cchunks\u201d. Then we employ a variety of mapping functions that map an array\u2019s chunks to SciDB instances. For each copy of an array we use a different mapping function to create copies of each chunk on a different node of the cluster. If a node goes down, we figure out where there is a redundant copy of the data and move the computation there.<\/p>\n<p><b>Q7. 
How do you handle data compression in SciDB?<\/b><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown: \u00a0<\/b><\/b>Use of compression in modern data stores is a very important topic.\u00a0 Minimizing storage while retaining information and supporting extremely rapid data access informs every level of SciDB\u2019s design. For example, SciDB splits every array into single-attribute components. We compress a chunk\u2019s worth of cell values for a specific attribute.\u00a0 At the lowest level, we compress attribute data using techniques like run-length encoding.\u00a0 In addition, our implementation has an abstraction for compression to support other compression algorithms.<\/p>\n<p><b>Q8. Why support two query languages?<\/b><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown: \u00a0<\/b><\/b>Actually, the primary interfaces we are promoting are R and Python, as they are the languages of choice of data scientists, quants, bioinformaticians, and scientists.\u00a0\u00a0 <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.paradigm4.com\/scidb-r\/');\"  href=\"http:\/\/www.paradigm4.com\/scidb-r\/\" target=\"_blank\">SciDB-R<\/a> and <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.paradigm4.com\/scidb-py\/');\"  href=\"http:\/\/www.paradigm4.com\/scidb-py\/\" target=\"_blank\">SciDB-Py<\/a> allow users to interactively query SciDB using R and Python. Data is persisted in SciDB. Math operators are overloaded so that complex analytical computations execute scalably in the database.<\/p>\n<p>Early on we surveyed potential and existing SciDB users, and found there were two very different types. By and large, commercial users using RDBMSs said \u201cmake it look like SQL\u201d. 
For those users we created <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.paradigm4.com\/technology\/aql-afl-query-languages\/');\"  href=\"http:\/\/www.paradigm4.com\/technology\/aql-afl-query-languages\/\" target=\"_blank\">AQL\u2014array SQL<\/a>. On the other hand, data scientists and programmers preferred R, Python, and functional languages. For the second class of users we created SciDB-R, SciDB-Py, and <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.paradigm4.com\/technology\/aql-afl-query-languages\/');\"  href=\"http:\/\/www.paradigm4.com\/technology\/aql-afl-query-languages\/\" target=\"_blank\">AFL\u2014an array functional language<\/a>.<\/p>\n<p>All queries get compiled into a query plan, which is a sequence of algebraic operations.\u00a0 Essentially all relational versions of SQL do exactly the same thing. In SciDB, AFL, the array functional language, is the underlying language of algebraic operators. Hence, it is easy to surface and support AFL in addition to AQL, SciDB-R, and SciDB-Py, allowing us to satisfy the preferred mode of working for many classes of users.<\/p>\n<p><b>Q9. You defined SciDB as a computational database &#8211; not a data warehouse, not a business-intelligence database, and not a transactional database. Could you please elaborate more on this point?<\/b><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown:\u00a0<\/b><\/b>In our opinion, there are two mature markets for DBMSs: transactional DBMSs that are optimized for large numbers of users performing short write-oriented ACID transactions, and data warehouses, which strive for high performance on SQL aggregates and other read-oriented longer queries.\u00a0 The users of SciDB fit into neither category.\u00a0 They are universally doing more complex mathematical calculations than SQL aggregates on their data, and their DBMS interactions are typically longer read-oriented queries. 
SciDB is both a data store and a massively parallel compute engine for numerical processing. The inclusion of this computational platform is what makes us the first \u201ccomputational database\u201d, not just a SQL-style decision support DBMS. Hence, we need a new moniker to describe this class of interactions. We settled on computational databases, but if your readers have a better suggestion, we are all ears!<\/p>\n<p><b>Q10. How does SciDB differ from analytical databases, such as, for example, HP Vertica,\u00a0and in-memory analytics databases such as SAP HANA?<\/b><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown:\u00a0<\/b><\/b>Both are data warehouse products, optimized for warehouse workloads.\u00a0 SciDB serves a different class of users from these other systems. Our customers\u2019 data are naturally represented as arrays that don&#8217;t fit neatly or efficiently into relational tables. \u00a0Our users want more sophisticated analytics\u2014more numerical, statistical, and graph analysis\u2014and not so much SQL OLAP.<\/p>\n<p><strong>Q11. What about Teradata?<\/strong><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown:\u00a0<\/b><\/b>Another data warehouse vendor.\u00a0 Plus, SciDB runs on commodity hardware clusters or in a cloud and not on proprietary appliances or expensive servers.<\/p>\n<p><b>Q12. 
Anything else you wish to add?<\/b><b><\/b><\/p>\n<p><b>Mike Stonebraker,\u00a0<b>Paul Brown: \u00a0<\/b><\/b>SciDB is currently being used by commercial users for computational finance, bioinformatics and clinical informatics, satellite image analysis, and industrial analytics.\u00a0 The publicly accessible <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.ncbi.nlm.nih.gov\/variation\/tools\/1000genomes\/help\/');\"  href=\"http:\/\/www.ncbi.nlm.nih.gov\/variation\/tools\/1000genomes\/help\/\" target=\"_blank\">NIH NCBI One Thousand Genomes browser<\/a> has been running on SciDB since the Fall of 2012.<\/p>\n<p>Anyone can try out SciDB using an AMI or a VM available at <a href=\"scidb.org\/forum\" target=\"_blank\">scidb.org\/forum<\/a>.<\/p>\n<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;<\/p>\n<p><strong>Mike Stonebraker<\/strong> <em>, CTO Paradigm4<br \/>\nRenowned database researcher, innovator, and entrepreneur: Berkeley, MIT, Postgres, Ingres, Illustra, Cohera, Streambase, Vertica, VoltDB, and now Paradigm4.<br \/>\n<\/em><br \/>\n<strong>Paul Brown<\/strong> <em>, Chief Architect Paradigm4<br \/>\nPremier database \u2018plumber\u2019 and researcher moving from the \u201cI\u2019s\u201d (Ingres, Illustra, Informix, IBM) to a \u201cP\u201d (Paradigm4).<\/em><br \/>\n&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<br \/>\n<strong>Resources<\/strong><\/p>\n<p><b>*Citation for IEEE paper<\/b><br \/>\nStonebraker, M.; Brown, P.; Donghui Zhang; Becla, J., <strong>&#8220;SciDB: A Database Management System for Applications with Complex Analytics,&#8221;<\/strong>\u00a0<i>Computing in Science &amp; Engineering<\/i>\u00a0, vol.15, no.3, pp.54,62, May-June 2013<br \/>\ndoi: 10.1109\/MCSE.2013.19,\u00a0URL:\u00a0<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/ieeexplore.ieee.org\/stamp\/stamp.jsp?tp=&amp;arnumber=6461866&amp;isnumber=6549993');\"  
href=\"http:\/\/ieeexplore.ieee.org\/stamp\/stamp.jsp?tp=&amp;arnumber=6461866&amp;isnumber=6549993\" target=\"_blank\">http:\/\/ieeexplore.ieee.org\/stamp\/stamp.jsp?tp=&amp;arnumber=6461866&amp;isnumber=6549993<\/a><\/p>\n<p>&#8211; <strong>ODBMS.org:<\/strong> <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/2014\/04\/paradigm4\/');\"  href=\"http:\/\/www.odbms.org\/2014\/04\/paradigm4\/\" target=\"_blank\">free resources related to Paradigm4<\/a><\/p>\n<p><strong>Related Posts<\/strong><\/p>\n<p>&#8211;\u00a0<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/blog\/2013\/01\/the-gaia-mission-one-year-later-interview-with-william-omullane\/');\"  href=\"http:\/\/www.odbms.org\/blog\/2013\/01\/the-gaia-mission-one-year-later-interview-with-william-omullane\/\" target=\"_blank\"><strong>The Gaia mission, one year later. Interview with William O\u2019Mullane. ODBMS Industry Watch,\u00a0January 16, 2013<\/strong><\/a><\/p>\n<p>&#8211; <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/blog\/2011\/04\/objects-in-space-vs-friends-in-facebook\/');\"  href=\"http:\/\/www.odbms.org\/blog\/2011\/04\/objects-in-space-vs-friends-in-facebook\/\" target=\"_blank\"><strong>Objects in Space vs. Friends in Facebook. ODBMS Industry Watch,<\/strong>\u00a0April 13, 2011.<\/a><\/p>\n<p><strong>Follow ODBMS.org on Twitter: <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/twitter.com\/odbmsorg');\"  href=\"https:\/\/twitter.com\/odbmsorg\" target=\"_blank\">@odbmsorg<\/a><\/strong><\/p>\n<p>##<\/p>\n<!-- AddThis Advanced Settings generic via filter on the_content --><!-- AddThis Share Buttons generic via filter on the_content -->","protected":false},"excerpt":{"rendered":"<p>&#8220;SciDB is both a data store and a massively parallel compute engine for numerical processing. 
The inclusion of this computational platform is what makes us the first \u201ccomputational database\u201d, not just a SQL-style decision support DBMS. Hence, we need a new moniker to describe this class of interactions.\u00a0We settled on computational databases, but if your [&hellip;]<!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[35,66,84,656,655,133,654,658,388,659,653,482,485,516,523,549,569,657,616],"_links":{"self":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts\/3140"}],"collection":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/comments?post=3140"}],"version-history":[{"count":21,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts\/3140\/revisions"}],"predecessor-version":[{"id":3207,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts\/3140\/revisions\/3207"}],"wp:attachment":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/media?parent=3140"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/categories?post=3140"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/tags?post=3140"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}