Can Columnar Database Systems Help Mathematical Analytics?

Can Columnar Database Systems Help Mathematical Analytics?

by Carlos Ordonez, Department of Computer Science, University of Houston

It is now folklore that row DBMSs will dominate transaction processing and columnar DBMS will be prerred to evaluate decision support queries [1]. The main reason is that columnar DBMSs provide orders of magnitude improvement in SQL query processing speed, preserving the parallel speedup of row-based parallel DBMSs. However, it is unclear if columnar DBMSs can help with more complex analytical tasks, involving mathematical processing, like linear algebra, convex optimization, machine learning and multivariate statistics [3].

In the DBMS world it is fair to say row, column and arrays are currently the main technologies to process structured data (tables or arrays). However, column, row and array DBMSs have fundamentally different storage mechanisms, which lead to different query processing algorithms and optimizations. Compressed storage, a standard feature in columnar DBMSs, with one file per column, is significantly different from row blocks and B-trees used by row DBMSs. Therefore, adapting physical storage and older algorithms to decrease query processing time is difficult. In a columnar DBMS maintaining column values sorted is essential, whereas in a row DBMS indexing data structures (B-trees, hash tables or bitmaps) are the most common ordering mechanism, complemented by dynamically ordering block rows during query processing.

As shown by more recent research, in an array DBMS storage is neither row, nor column-based, but based on multidimensional subarrays (called chunks in SciDB [2], indexed by a grid data structure with chunk boundaries in main memory. From a programming perspective, SQL and similar query languags (like AFL/AQL in SciDB) are cumbersome and difficult to use for vector and matrix computations, the most common element in mathematical processing. Finally, specialized languages for mathematical processing like R and Matlab will remain the preferred programming mechanism for analysts, but the underlying systems will remain slower and difficult to scale for big data. As shown by recent trends, we may see more fast, parallel connectors (i.e. fast transfer) between the DBMS and mathematical packages like R and Matlab, but in this case the DBMS performs only simple analytic processing (mostly parallel scans) and mathematical processing is delegated to the mathematical package (i.e. not scalable and increasing data redundancy).

We conjecture that columnar DBMSs will be exploited to accelerate key mathematical computations involving large data sets. Specifically, columnar DBMSs have the potential to evaluate matrix multiplication of sparse matrices [4], the most common form of a data set in big data problems (large size, high dimensionality). But clearly, SQL queries are not enough and may be inefficient since they require transforming large tabular data sets (i.e. an SQL table) to a vertical triple storage format (row, column, value). This vertical storage format requires expressing matrix multiplication as a query combining an inner join and a GROUP BY aggregation. Needless to say, the join operation may become a bottleneck, no matter how fast the columnar DBMS is. A promising research direction is to exploit User-Defined Functions (UDFs) [5] with optimized processing for matrices on a columnar architecture. We should point UDFs provide significant programming flexibility, like a traditional language scuh as C++, but computations are restricted to matrices that can fit in RAM. Fortunately, data set summarization matrices, like the summarization matrices for linear models presented in [5], fit this I/O and CPU processing pattern.

To conclude, we should mention that columnar storage is also a trend in the Hadoop ecosystem. Therefore, we will likely see the same trend on pushing mathematical processing into big data systems.

References:

[1] Valerie Barr, Michael Stonebraker: A valuable lesson, and whither Hadoop? Commun. ACM (CACM) 58(1):18-19 (2015)

[2] Michael Stonebraker, Paul Brown, Donghui Zhang, Jacek Becla: SciDB: A Database Management System for Applications with Complex Analytics. Computing in Science and Engineering (CSE) 15(3):54-62 (2013)

[3] Michael Stonebraker, A Biased Take on a Moving Target: Complex Analytics, Red Book, 5h edition.

[4] Carlos Ordonez, Achyuth Gurram, Nirmala Rai: Recursive Query Evaluation in a Column DBMS to Analyze Large Graphs. DOLAP 2014:71-80

[5] Carlos Ordonez: Statistical Model Computation with UDFs. IEEE Trans. Knowl. Data Eng. (TKDE) 22(12):1752-1765 (2010)

You may also like...