Data summarization is an essential mechanism for accelerating analytic algorithms on large data sets. In this work we present a comprehensive data summarization matrix, the Gamma matrix, from which we can derive equivalent equations for many analytic algorithms. Iterative algorithms are thereby restructured to work in two phases: (1) incremental and parallel summarization of the data set in one pass; (2) iteration in main memory, exploiting the summarization matrix in many intermediate computations. We show that our summarization matrix captures essential statistical properties of the data set and allows iterative algorithms to run substantially faster in main memory.
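The two phases can be illustrated with a minimal sketch. It assumes, purely for illustration, that the summary is the sum of outer products z z^T over all points, where z = [1, x_1, ..., x_d, y]; this layout and the helper names `summarize`, `merge`, `mean_and_variance` and `simple_regression` are hypothetical and are not taken from the text above. Under that assumption, partial summaries from disjoint partitions combine by addition, and model quantities can then be computed from the small matrix without rereading the data.

```python
# Sketch only: the layout z = [1, x_1..x_d, y] and Gamma = sum of z z^T
# are assumptions made for illustration, not the paper's stated definition.

def summarize(rows):
    """Phase 1: one pass over (x, y) rows, accumulating sum of z z^T."""
    gamma = None
    for x, y in rows:
        z = [1.0, *x, y]
        if gamma is None:
            gamma = [[0.0] * len(z) for _ in z]
        for i, zi in enumerate(z):
            for j, zj in enumerate(z):
                gamma[i][j] += zi * zj
    return gamma


def merge(a, b):
    """Partial summaries from disjoint partitions combine by addition."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def mean_and_variance(gamma, j):
    """Phase 2: mean and population variance of column j of z."""
    n = gamma[0][0]
    mean = gamma[0][j] / n
    return mean, gamma[j][j] / n - mean * mean


def simple_regression(gamma):
    """Phase 2: least-squares slope and intercept for d = 1."""
    n, sx, sy = gamma[0][0], gamma[0][1], gamma[0][2]
    sxx, sxy = gamma[1][1], gamma[1][2]
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Toy data following y = 2x + 1, split into two partitions.
data = [([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0), ([4.0], 9.0)]
gamma = merge(summarize(data[:2]), summarize(data[2:]))
slope, intercept = simple_regression(gamma)
```

In this sketch the raw rows are touched only inside `summarize`; every later computation reads the (d+2) x (d+2) summary, which is what would let repeated model evaluations stay in main memory.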
Specifically, we show that our summarization matrix benefits statistical models including PCA, linear regression and variable selection. From a systems perspective, we carefully study the efficient computation of the summarization matrix in two parallel database systems: the array DBMS SciDB and the columnar relational DBMS HP Vertica.
We also propose general optimizations based on data density, as well as system-dependent optimizations for each platform. We present an experimental evaluation benchmarking both system and algorithm performance. Our experiments show that our algorithms compute models significantly faster than existing machine learning libraries in R and Spark, and that R combined with SciDB generally runs our algorithms significantly faster than all the other parallel analytic systems we compared. More importantly, this combination removes the main-memory and performance limitations of R.