OPTIMIZED ALGORITHMS FOR DATA ANALYSIS IN PARALLEL DATABASE SYSTEMS

by Roberto Zicari · July 5, 2017

Date: 2017-04-26

Author: Cabrera, Wellington, Department of Computer Science, University of Houston

Abstract

Large data sets are generally stored on disk organized as rows, columns, or arrays, with row storage being the most common. At the same time, matrix multiplication appears frequently in machine learning algorithms as an important primitive operation. Since database management systems do not support matrix operations, analytical tasks are commonly performed outside the database system, in external libraries or mathematical tools.

In this work, we optimize several analytic algorithms that benefit from a fast in-database matrix multiplication. Specifically, we study how to compute in-database parallel matrix multiplication to solve two major families of big data analytics problems: machine learning models and graph algorithms. We focus on three cases: the product of a matrix by its transpose, the powers of a square matrix, and iterated matrix-vector multiplication. Based on this foundation, we introduce important optimizations to the computation of fundamental linear models in machine learning: linear regression, variable selection, and principal component analysis. In addition, we present parallel graph algorithms that take advantage of matrix powers and parallel matrix-vector multiplication to solve several graph problems: transitive closure, all-pairs shortest paths, reachability from a single source vertex, single-source shortest paths, connected components, and PageRank.
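
To make two of the three cases above concrete, here is a minimal in-memory sketch in plain Python. It is not the dissertation's in-database parallel implementation. The function names `gram` and `reachable` are hypothetical, and the representations used (a list of rows for the data set, a dense 0/1 adjacency matrix for the graph) are assumptions chosen for illustration. `gram` forms the product of a transposed matrix with itself in a single pass over the rows, which is the quantity linear regression and principal component analysis start from. `reachable` answers single-source reachability by iterating a Boolean matrix-vector product.

```python
def gram(X):
    """Return X^T X for a row-stored data set X (a list of equal-length rows)."""
    d = len(X[0])
    G = [[0.0] * d for _ in range(d)]
    # One pass over the rows: each row contributes its outer product to G.
    for row in X:
        for i in range(d):
            for j in range(d):
                G[i][j] += row[i] * row[j]
    return G


def reachable(adj, source):
    """Return the vertices reachable from `source`, in increasing order.

    adj[i][j] is truthy when there is an edge i -> j (assumed dense layout).
    Each loop iteration is one Boolean matrix-vector product.
    """
    n = len(adj)
    frontier = [i == source for i in range(n)]
    seen = frontier[:]
    while any(frontier):
        # nxt[j] = OR over i of (frontier[i] AND adj[i][j]), skipping seen vertices.
        nxt = [
            any(frontier[i] and adj[i][j] for i in range(n)) and not seen[j]
            for j in range(n)
        ]
        seen = [s or x for s, x in zip(seen, nxt)]
        frontier = nxt
    return [i for i in range(n) if seen[i]]
```

Because each row contributes independently to the sum in `gram`, the rows could in principle be partitioned across workers and the partial results added together. That property is what makes this product a natural fit for row storage, though the partitioning scheme itself is not shown here.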

View/Open CABRERA-DISSERTATION-2017.pdf (2.439Mb)
