**Bringing Big Data Analytics to a Traditional DBMS.**

by **Carlos Ordonez**, University of Houston, October 2014.

Big Data is a fuzzy term with a broad meaning, used to refer to data sets so large, growing so fast, and with such diverse content that well-known algorithms must be rethought in order to analyze them properly. Social media, online retailers, and web site logs with millions of data records and files are examples of such massive data sets. It is fair to say that our daily life is impacted by Big Data in ways we are not fully aware of; from the ads that target us with amazing precision while we browse the web, to the ebb and flow of the financial markets, we can say that we live in the era of Big Data.

The major challenge with big data is not storing it: disk storage is cheap, fast, and easy to automate, and data repositories keep growing at an ever faster pace. The real challenge is making sense of big data: its analysis takes longer. Hence the fundamental problem to solve is "big data analytics". Based on decades of research, there are plenty of analytical algorithms and techniques, ranging from exploratory analysis to sophisticated mathematical models. In general, when analyzing a data set the analyst ends up computing a statistical model, a machine learning model, or some other kind of mathematical model.

The big questions these days are HOW and WHERE to compute such models.

The “how” involves a wide spectrum of tools, programming languages and hardware.

The "where" is perhaps a harder question: inside or outside the system where the data originates, in main memory or on disk, on a local computer (even a laptop) or on a cloud system.

Most research these days proposes to move data to a large computer cluster with "unlimited" CPU power and disk space (i.e., elastic resources), where the dominant technology is the Hadoop Distributed File System (HDFS), building upon Google's innovative analytics infrastructure. This research direction has proven successful for analyzing diverse and complex data from the Internet, especially web pages, files, and natural language text. However, when it comes to tabular data, particularly databases, the answer is unclear.

We defend a contrarian point of view: large data sets can be analyzed "in situ" inside a database management system, closer to where the data records are stored and originally processed (if coming from an OLTP system). This approach provides several benefits, including reduced data movement, lower data redundancy, easier analysis with queries, higher security, adoption by organizations already familiar with DBMS technology and, foremost, being faster than most other approaches.
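As a minimal illustration of query-based, in-situ analysis (a sketch with a hypothetical one-column table, not the author's actual system; SQLite stands in for a full DBMS), the sufficient statistics for a simple statistical model can be computed inside the database in a single pass, with only the tiny model leaving the system rather than the data:

```python
import sqlite3

# In-memory stand-in for a DBMS table (hypothetical schema: sales(amount)).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (amount REAL)")
con.executemany("INSERT INTO sales VALUES (?)", [(10.0,), (12.0,), (14.0,)])

# One aggregation query reads the table once and returns the sufficient
# statistics n, sum(x), sum(x^2); mean and variance are derived from them
# without ever exporting the raw records.
n, s, q = con.execute(
    "SELECT COUNT(*), SUM(amount), SUM(amount * amount) FROM sales"
).fetchone()
mean = s / n
variance = q / n - mean * mean
print(n, mean)  # 3 12.0
```

The same pattern scales to multiple columns, where the sums and cross-products form a summarization matrix from which many linear models can be derived.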

Recent research from the DBMS group at the University of Houston has proposed novel algorithms that analyze massive data sets by exploiting DBMS technology, including: a matrix operator that summarizes a multidimensional data set, reading it only once and fully in parallel, from which many popular statistical models can be computed; an incremental clustering algorithm that attacks a major problem in Bayesian statistics, reducing thousands of iterations to tens of iterations using a clever combination of MCMC and EM methods evaluated with queries instead of a traditional programming language; and a recursive algorithm to analyze large graphs, such as one representing a social network like Facebook.
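To give a flavor of the recursive-query idea for graph analysis (a sketch with a hypothetical edge table; SQLite stands in for the column DBMS used in the paper), reachability from a source vertex can be computed entirely inside the DBMS with a recursive query:

```python
import sqlite3

# Hypothetical edge list for a tiny directed graph: 1->2, 2->3, 1->4.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
con.executemany("INSERT INTO edge VALUES (?, ?)", [(1, 2), (2, 3), (1, 4)])

# Recursive query: the set of vertices reachable from vertex 1.
# Each recursive step joins the vertices found so far with the edge table,
# the query-based analogue of iterative graph traversal; UNION removes
# duplicates, so the recursion terminates once no new vertices appear.
rows = con.execute("""
    WITH RECURSIVE reach(v) AS (
        SELECT 1
        UNION
        SELECT e.dst FROM edge e JOIN reach r ON e.src = r.v
    )
    SELECT v FROM reach ORDER BY v
""").fetchall()
reachable = [v for (v,) in rows]
print(reachable)  # [1, 2, 3, 4]
```

Keeping the traversal as a query lets the DBMS optimizer and storage engine do the heavy lifting, which is the point of the in-situ approach.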

*References:*

Carlos Ordonez, Yiqun Zhang, Wellington Cabrera: The Gamma Operator for Big Data Summarization on an Array DBMS. Journal of Machine Learning Research (JMLR): Workshop and Conference Proceedings (BigMine 2014), 36:88-103, 2014.

David Sergio Matusevich, Carlos Ordonez: A Clustering Algorithm Merging MCMC and EM Methods Using SQL Queries. Journal of Machine Learning Research (JMLR): Workshop and Conference Proceedings (BigMine 2014), 36:61-76, 2014.

Carlos Ordonez, Achyuth Gurram, Nirmala Rai: Recursive Query Evaluation in a Column DBMS to Analyze Large Graphs. Proc. ACM International Workshop on Data Warehousing and On-line Analytical Processing (DOLAP), 2014.