On the SciDB array database. Interview with Mike Stonebraker and Paul Brown.
“SciDB is both a data store and a massively parallel compute engine for numerical processing. The inclusion of this computational platform is what makes us the first “computational database”, not just a SQL-style decision support DBMS. Hence, we need a new moniker to describe this class of interactions. We settled on computational databases, but if your readers have a better suggestion, we are all ears!”
–Mike Stonebraker, Paul Brown.
On the SciDB array database, I have interviewed Mike Stonebraker, MIT Professor and Paradigm4 co-founder and CTO, and Paul Brown, Paradigm4 Chief Architect.
Q1: What is SciDB and why did you create it?
Mike Stonebraker, Paul Brown: SciDB is an open source array database with scalable, built-in complex analytics, programmable from R and Python. The requirements for SciDB emerged from discussions between academic database researchers—Mike Stonebraker and Dave DeWitt— and scientists at the first Extremely Large Databases conference (XLDB) at SLAC in 2007 about coping with the peta-scale data from the forthcoming LSST telescope.
Recognizing that commercial and industrial users were about to face the same challenges as scientists, Mike Stonebraker founded Paradigm4 in 2010 to make the ideas explored in early prototypes available as a commercial-quality software product. Paradigm4 develops and supports both a free, open-source Community Edition (scidb.org/forum) and an Enterprise Edition with additional features (paradigm4.com).
Q2. With the rise of Big Data analytics, is the convergence of analytic needs between science and industry really happening?
Mike Stonebraker, Paul Brown: There is a “sea change” occurring as companies move from Business Intelligence (think SQL analytics) to Complex Analytics (think predictive modelling, clustering, correlation, principal components analysis, graph analysis, etc.). Obviously science folks have been doing complex analytics on big data all along.
Another force driving this sea change is all the machine-generated data produced by cell phones, genomic sequencers, and by devices on the Industrial Internet and the Internet of Things. Here too science folks have been working with big data from sensors, instruments, telescopes and satellites all along. So it is quite natural that a scalable computational database like SciDB that serves the science world is a good fit for the emerging needs of commercial and industrial users.
There will be a convergence of the two markets as many more companies aspire to develop innovative products and services using complex analytics on big and diverse data. In the forefront are companies doing electronic trading on Wall Street; insurance companies developing new pricing models using telematics data; pharma and biotech companies analyzing genomics and clinical data; and manufacturing companies building predictive models to anticipate repairs on expensive machinery. We expect everybody will move to this new paradigm over time. After all, a predictive model integrating diverse data is much more useful than a chart of numbers about past behavior.
Q3. What are the typical challenges posed by scientific analytics?
Mike Stonebraker, Paul Brown: We asked a lot of working scientists the same question, and published a paper in the IEEE Computing Science & Engineering summarizing their answers (*see citation below). In a nutshell, there are 4 primary issues.
1. Scale. Science has always been intensely “data driven”. With the ever-increasing massive data-generating capabilities of scientific instruments, sensors, and computer simulations, the average scientist is overwhelmed with data and needs data management and analysis tools that can scale to meet his or her needs, now and in the future.
2. New Analytic Methods. Historically analysis tools have focused on business users, and have provided easy-to-use interfaces for submitting SQL aggregates to data warehouses. Such business intelligence (BI) tools are not useful to scientists, who universally want much more complex analyses, whether it be outlier detection, curve fitting, analysis of variance, predictive models or network analysis. Such “complex analytics” is defined on arrays in linear algebra, and requires a new generation of client-side tools and server side tools in DBMSs.
3. Provenance. One of the central requirements that scientists have is reproducibility. They need to be able to send their data to colleagues to rerun their experiments and produce the same answers. As such, it is crucial to keep prior versions of data in the face of updates, error correction, and the like. The right way to provide such provenance is through a no-overwrite DBMS; which allows time-travel back in time to when the experiment in question was performed.
4. Interactivity. Unlike business users who are often comfortable with batch reporting of information, scientific users are invariably exploring their data, asking “what if” questions and testing hypotheses. What they need in interactivity on very large data sets.
Q3. What are in your opinion the commonalities between scientific and industrial analytics?
Mike Stonebraker, Paul Brown: We would state the question in reverse “What are the differences between the two markets?” In our opinion, the two markets will converge quickly as commercial and industrial companies move to the analytic paradigms pervasive in the science marketplace.
Q4. How come in the past the database system software community has failed to build the kinds of systems that scientists needed for managing massive data sets?
Mike Stonebraker, Paul Brown: Mostly it’s because scientific problems represent a $0 billion market! However, the convergence of industrial requirements and science requirements means that science can “piggy back” on the commercial market and get their needs met.
Q5. SciDB is a scalable array database with native complex analytics. Why did you choose a data model based on multidimensional arrays?
Mike Stonebraker, Paul Brown: Our main motivation is that at scale, the complex analyses done by “post sea change” users are invariably about applying parallelized linear algebraic algorithms to arrays. Whether you are doing regression, singular value decomposition, finding eigenvectors, or doing operations on graphs, you are performing a sequence of matrix operations. Obviously, this is intuitive and natural in an array data model, whereas you have to recast tables into arrays if you begin with an RDBMS or keep data in files. Also, a native array implementation can be made much faster than a table-based system by directly implementing multi-dimensional clustering and doing selective replication of neighboring data items.
Our secondary motivation is that, just like mathematical matrices, geospatial data, time-series data, image data, and graph data are most naturally organized as arrays. By preserving the inherent ordering in the data, SciDB supports extremely fast selection (including vectors, planes, ‘hypercubes’), doing multi-dimensional windowed aggregates, and re-gridding it to change spatial or temporal resolution.
Q6. How do you manage in a nutshell scalability with high degrees of tolerance to failures?
Mike Stonebraker, Paul Brown: In a nutshell? Partitioning, and redundancy (k-replication).
First, SciDB splits each array’s attributes apart, just like any columnar system. Then we partition each array into rectilinear blocks we call “chunks”. Then we employ a variety of mapping functions that map an array’s chunks to SciDB instances. For each copy of an array we use a different mapping function to create copies of each chunk on different node of the cluster. If a node goes down, we figure out where there is a redundant copy of the data and move the computation there.
Q7. How do you handle data compression in SciDB?
Mike Stonebraker, Paul Brown: Use of compression in modern data stores is a very important topic. Minimizing storage while retaining information and supporting extremely rapid data access informs every level of SciDB’s design. For example, SciDB splits every array into single-attribute components. We compress a chunk’s worth of cell values for a specific attribute. At the lowest level, we compress attribute data using techniques like run-length encoding on data. In addition, our implementation has an abstraction for compression to support other compression algorithms.
Q8. Why supporting two query languages?
Mike Stonebraker, Paul Brown: Actually the primary interfaces we are promoting are R and Python as they are the languages of choice of data scientists, quants, bioinformaticians, and scientists. SciDB-R and SciDB-Py allow users to interactively query SciDB using R and Python. Data is persisted in SciDB. Math operators are overloaded so that complex analytical computations execute scalably in the database.
Early on we surveyed potential and existing SciDB users, and found there were two very different types. By and large, commercial users using RDMBSs said “make it look like SQL”. For those users we created AQL—array SQL. On the other hand, data scientists and programmers preferred R, Python, and functional languages. For the second class of users we created SciDB-R, SciDB-Py, and AFL—an array functional language.
All queries get compiled into a query plan, which is a sequence of algebraic operations. Essentially all relational versions of SQL do exactly the same thing. In SciDB, AFL, the array functional language, is the underlying language of algebraic operators. Hence, it is easy to surface and support AFL in addition to AQL, SciDB-R, and SciDB-Py, allowing us to satisfy the preferred mode of working for many classes of users.
Q9. You defined SciDB a computational database – not a data warehouse, not a business-intelligence database, and not a transactional database. Could you please elaborate more on this point?
Mike Stonebraker, Paul Brown: In our opinion, there are two mature markets for DBMSs: transactional DBMSs that are optimized for large numbers of users performing short write-oriented ACID transactions, and data warehouses, which strive for high performance on SQL aggregates and other read-oriented longer queries. The users of SciDB fit into neither category. They are universally doing more complex mathematical calculations than SQL aggregates on their data, and their DBMS interactions are typically longer read-oriented queries. SciDB is both a data store and a massively parallel compute engine for numerical processing. The inclusion of this computational platform is what makes us the first “computational database”, not just a SQL-style decision support DBMS. Hence, we need a new moniker to describe this class of interactions. We settled on computational databases, but if your readers have a better suggestion, we are all ears!
Q10. How does SciDB differ from analytical databases, such as for example HP Vertica, and in-memory analytics databases such as SAP HANA?
Mike Stonebraker, Paul Brown: Both are data warehouse products, optimized for warehouse workloads. SciDB serves a different class of users from these other systems. Our customers’ data are naturally represented as arrays that don’t fit neatly or efficiently into relational tables. Our users want more sophisticated analytics—more numerical, statistical, and graph analysis—and not so much SQL OLAP.
Q11. What about Teradata?
Mike Stonebraker, Paul Brown: Another data warehouse vendor. Plus, SciDB runs on commodity hardware clusters or in a cloud and not on a proprietary appliances or expensive servers.
Q12. Anything else you wish to add?
Mike Stonebraker, Paul Brown: SciDB is currently being used by commercial users for computational finance, bioinformatics and clinical informatics, satellite image analysis, and industrial analytics. The publicly accessible NIH NCBI One Thousand Genomes browser has been running on SciDB since the Fall of 2012.
Anyone can try out SciDB using an AMI or a VM available at scidb.org/forum.
Mike Stonebraker , CTO Paradigm4
Renowned database researcher, innovator, and entrepreneur: Berkeley, MIT, Postgres, Ingres, Illustra, Cohera, Streambase, Vertica, VoltDB, and now Paradigm4.
Paul Brown , Chief Architect Paradigm4
Premier database ‘plumber’ and researcher moving from the “I’s” (Ingres, Illustra, Informix, IBM) to a “P” (Paradigm4).
*Citation for IEEE paper
Stonebraker, M.; Brown, P.; Donghui Zhang; Becla, J., “SciDB: A Database Management System for Applications with Complex Analytics,” Computing in Science & Engineering , vol.15, no.3, pp.54,62, May-June 2013
doi: 10.1109/MCSE.2013.19, URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461866&isnumber=6549993
– ODBMS.org: free resources related to Paradigm4
Follow ODBMS.org on Twitter: @odbmsorg