SCALABLE MACHINE LEARNING ALGORITHMS IN PARALLEL DATABASE SYSTEMS EXPLOITING A DATA SUMMARIZATION MATRIX

by Roberto Zicari · July 5, 2017

Date: 2016-05

Author: Zhang, Yiqun, Department of Computer Science, University of Houston

Abstract

Data summarization is an essential mechanism for accelerating analytic algorithms on large data sets. In this work we present a comprehensive data summarization matrix, the Gamma matrix, from which we can derive equivalent equations for many analytic algorithms. Iterative algorithms are thereby restructured to work in two phases: (1) incremental and parallel summarization of the data set in one pass; (2) iteration in main memory, exploiting the summarization matrix in many intermediate computations. We show that our summarization matrix captures essential statistical properties of the data set and allows iterative algorithms to run much faster in main memory.
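The two phases can be illustrated with a minimal sketch. The abstract does not define the Gamma matrix, so the code assumes one common formulation: the sum over all points of z z^T, where z = (1, x_1, ..., x_d, y). The function names (partial_gamma, merge, linear_regression_from_gamma) and the toy data are hypothetical and chosen only for illustration. Phase 2 here is a single closed-form solve of the normal equations rather than a true iterative algorithm.

```python
from typing import Iterable, List, Sequence

Matrix = List[List[float]]


def partial_gamma(rows: Iterable[Sequence[float]], d: int) -> Matrix:
    """Phase 1 on one partition: accumulate sum of z z^T in a single pass.

    Assumption: each row holds d features followed by the target y,
    and z = (1, x_1, ..., x_d, y).
    """
    k = d + 2
    g = [[0.0] * k for _ in range(k)]
    for row in rows:
        z = [1.0, *row]
        for a in range(k):
            for b in range(k):
                g[a][b] += z[a] * z[b]
    return g


def merge(g1: Matrix, g2: Matrix) -> Matrix:
    """Partial summaries from different partitions combine by addition."""
    return [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(g1, g2)]


def solve(a: Matrix, b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def linear_regression_from_gamma(g: Matrix, d: int) -> List[float]:
    """Phase 2: solve the normal equations using only the summary.

    The top-left (d+1)x(d+1) block of g equals [1,X]^T [1,X];
    the first d+1 entries of the last column equal [1,X]^T y.
    """
    a = [row[: d + 1] for row in g[: d + 1]]
    b = [row[d + 1] for row in g[: d + 1]]
    return solve(a, b)


# Toy data generated from y = 1 + 2*x1 + 3*x2, split into two partitions.
partition_a = [(1.0, 0.0, 3.0), (0.0, 1.0, 4.0)]
partition_b = [(1.0, 1.0, 6.0), (2.0, 1.0, 8.0)]

gamma = merge(partial_gamma(partition_a, 2), partial_gamma(partition_b, 2))
beta = linear_regression_from_gamma(gamma, 2)  # intercept, then coefficients
```

Once the summary is built, the model is computed without touching the original rows again, which is why the second phase fits in main memory even when the data set does not.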

Specifically, we show that our summarization matrix benefits statistical models including PCA, linear regression, and variable selection. From a systems perspective, we carefully study the efficient computation of the summarization matrix in two parallel database systems: the array DBMS SciDB and the columnar relational DBMS HP Vertica.
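As a hedged illustration of how a summary of this kind can feed PCA, the sketch below derives a covariance matrix from three sufficient statistics: the count n, the linear sums L, and the sums of cross-products Q. The abstract does not state that the thesis computes PCA this way; the names summarize and covariance and the sample points are assumptions made for the example, and the eigendecomposition step of PCA is left out.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def summarize(rows: Sequence[Sequence[float]]) -> Tuple[int, List[float], Matrix]:
    """One pass over a non-empty data set: count, linear sums, cross-product sums."""
    d = len(rows[0])
    n = 0
    lin = [0.0] * d
    quad = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for a in range(d):
            lin[a] += x[a]
            for b in range(d):
                quad[a][b] += x[a] * x[b]
    return n, lin, quad


def covariance(n: int, lin: List[float], quad: Matrix) -> Matrix:
    """Population covariance from the summary alone: Q/n - mu mu^T."""
    d = len(lin)
    mu = [s / n for s in lin]
    return [[quad[a][b] / n - mu[a] * mu[b] for b in range(d)] for a in range(d)]


points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
n, lin, quad = summarize(points)
cov = covariance(n, lin, quad)
```

The covariance matrix has size d x d regardless of the number of rows, so any subsequent eigendecomposition operates on a small in-memory matrix.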

We also propose general optimizations based on data density, as well as system-dependent optimizations for each platform. We present an experimental evaluation that benchmarks both system and algorithm performance. Our experiments show that our algorithms run significantly faster than existing machine learning libraries for model computation in R and Spark, and that R working together with SciDB generally runs our algorithm significantly faster than all the other parallel analytic systems compared. More importantly, this combination eliminates the main memory and performance limitations of R.

View/Open ZHANG-THESIS-2016.pdf (232.1Kb)

URI http://hdl.handle.net/10657/1482
