On in-database machine learning. Interview with Waqas Dhillon.
“The goal of in-database machine learning is to bring popular machine learning algorithms and advanced analytical functions directly to the data, where it most commonly resides – either in a data warehouse or a data lake.” — Waqas Dhillon.
I have interviewed Waqas Dhillon, Product Manager – Machine Learning at Vertica. We talked about in-database machine learning and the new machine learning features of Vertica.
RVZ
Q1. What is in-database machine learning?
Waqas Dhillon: The goal of in-database machine learning is to bring popular machine learning algorithms and advanced analytical functions directly to the data, where it most commonly resides – either in a data warehouse or a data lake. While machine learning is a common mechanism for developing insights across a variety of use cases, the growing volume of data has increased the complexity of building predictive models, since few tools are capable of processing these massive datasets. As a result, most organizations down-sample their data, which can hurt the accuracy of machine learning models and adds unnecessary steps to the predictive analytics process.
In-database machine learning changes the scale and speed at which these machine learning algorithms can be trained and deployed, removing common barriers and accelerating time to insight on predictive analytics projects. To that end, we’ve built machine learning and data preparation functions natively into Vertica, so the computational processes can be parallelized across nodes – scaling out to address performance requirements, larger data volumes, and many concurrent users. Vertica’s in-database machine learning aims to eliminate the need to download and install separate packages, purchase third-party tools, or move data out of the database. Unlike traditional statistical analysis tools, we’ve given users the ability to archive and manage machine learning models inside the database, so they can train, deploy, and manage their models with a few simple lines of SQL.
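To make this concrete, here is a minimal sketch of what such a workflow can look like in Vertica SQL, following the general pattern of Vertica’s machine learning functions; the table and column names (listings, price, sqft, bedrooms, new_listings) are hypothetical, and exact signatures may vary across Vertica versions:

    -- Train a linear regression model on data already in the database
    SELECT LINEAR_REG('price_model', 'listings', 'price', 'sqft, bedrooms');

    -- Inspect the stored model
    SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='price_model');

    -- Score new rows in place, without moving data out of the database
    SELECT PREDICT_LINEAR_REG(sqft, bedrooms
           USING PARAMETERS model_name='price_model') AS predicted_price
    FROM new_listings;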
Q2. What problem domains are most suitable for using Predictive Analytics?
Waqas Dhillon: Most organizations are realizing the role that predictive analytics can play in addressing certain business challenges to create a competitive advantage. While simple business intelligence and reporting have played a key role in understanding how an organization operates and where improvements can be made, the volume of data now available, combined with the power of machine learning, is driving the adoption of forward-looking, predictive analytics projects. This adoption is compounded by growing end-user and customer demand for applications with embedded intelligence that no longer just identify ‘what happened’ but predict ‘what will happen’.
In general, machine learning models using linear regression, logistic regression, naïve Bayes, etc. are better suited for problem domains involving structured data analysis. Beyond this, the most suitable domains for using predictive analytics are driven by the use cases and business applications that drive new revenue opportunities, increase operational efficiencies, or both.
Q3. Can you give us some examples?
Waqas Dhillon: In-database machine learning and the use of predictive analytics can drive tangible business benefits across a broad range of industries. Below are some of the most common industries and use cases where I’ve seen an adoption of predictive analytics capabilities:
• Financial services organizations can discover fraud, detect investment opportunities, identify clients with high-risk profiles, or determine the probability of an applicant defaulting on a loan.
• Communication service providers can leverage a variety of network probe and sensor data to analyze network performance, predict capacity constraints, and ensure quality service delivery to end customers.
• Marketing and sales organizations can use machine learning to analyze buying patterns, segment customers, personalize the shopping experience, and predict which targeted marketing campaigns will be most effective.
• Oil and gas organizations can leverage machine learning to analyze minerals to find new energy sources, streamline oil distribution for increased efficiency and cost effectiveness, or predict mechanical or sensor failures for proactive maintenance.
• Transportation organizations can analyze trends and identify patterns that can be used to enhance customer service, optimize routes, and increase profitability.
Q4. How do you handle machine learning on Big Data using an in-database approach?
Waqas Dhillon: The Vertica Analytics Platform was built from the start for Big Data analytics and other analytical workloads where speed, scalability, and simplicity are crucial requirements.
Since we had spent years building out such a high-performance, scalable SQL engine, we started to ask ourselves, “Why should we limit the scope of our platform to standard SQL functions and descriptive analytics? Why not extend the power of Vertica to include more advanced analytics and machine learning functions?”
While some solutions might be limited by inherent architectural problems, such as lacking a shared-nothing cluster architecture suitable for big data analytics, Vertica has an incredible engine for performing analytics on large-scale data. That’s why we felt it was such an obvious choice to build machine learning functions natively into the platform. By building these machine learning capabilities on top of a foundation that already provides a tested, reliable distributed architecture and columnar compression, customers can now leverage these core features for advanced and predictive analytics use cases.
In Vertica, we implemented all of the in-database algorithms from scratch to run in parallel across multiple nodes in a cluster. Using parallel execution for model training, as well as scoring, not only results in extremely fast performance but also lets these algorithms run on much larger datasets than traditional machine learning tools can handle.
Using Vertica for machine learning provides another great advantage: because the computation engine and the data storage management system are combined, there is no need to move data between a database and a statistical analysis tool. You can build, share, and deploy your machine learning pipelines in place, where the data lives. This is a very important consideration when working with Big Data, since it’s not just difficult but sometimes outright impossible to move data at that scale between different tools.
Q5. How does Vertica support the machine learning process? Can you give some examples?
Waqas Dhillon: Vertica supports the entire machine learning workflow from data exploration and preparation to model deployment.
Users can explore their data using native database functions. As an analytics database, Vertica includes a large number of functions to support data exploration, and many more have recently been added to the machine learning library. Users can also prepare data with functions for normalization, outlier detection, sampling, imbalanced data processing, missing value imputation and many other native SQL and extended functions. They can also train and test advanced machine learning models like random forests and support vector machines on very large data sets.
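As an illustrative, hedged sketch of that data-preparation step, the following calls use the normalization and imputation functions from Vertica’s machine learning library on a hypothetical customers table; the table and column names are assumptions, and the exact signatures should be checked against the documentation for your version:

    -- Rescale numeric columns to the [0,1] range, writing results to a new view
    SELECT NORMALIZE('customers_norm', 'customers', 'age, income', 'minmax');

    -- Replace missing values in a column with the column mean
    SELECT IMPUTE('customers_imputed', 'customers', 'income', 'mean');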
There are multiple model evaluation metrics, like ROC, lift table, and AUC, that can be used to assess existing trained models. Any model built within Vertica can be stored inside the platform, shared with other users of the same Vertica instance, or exported to other Vertica databases. This is quite useful when training models on a test cluster and then moving them to a production cluster. Training and managing models inside the database also removes the overhead of transferring data into another system for analysis, along with the maintenance of that system.
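For instance, the evaluation functions are exposed as SQL analytic functions. A hedged sketch, assuming a previously trained logistic regression model named fraud_model and a hypothetical test_data table (EXPORT_MODELS is available in more recent Vertica versions):

    -- Build an ROC table from observed labels and predicted probabilities
    SELECT ROC(obs, prob USING PARAMETERS num_bins=100) OVER()
    FROM (SELECT is_fraud AS obs,
                 PREDICT_LOGISTIC_REG(amount, num_items
                        USING PARAMETERS model_name='fraud_model',
                        type='probability') AS prob
          FROM test_data) AS scored;

    -- Export a trained model so it can be imported into a production cluster
    SELECT EXPORT_MODELS('/tmp/exported_models', 'fraud_model');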
Q6. How did you take advantage of a Massively Parallel Processing (MPP) Architecture, when implementing in-database machine learning in Vertica?
Waqas Dhillon: Vertica’s MPP architecture provided a great foundation on top of which we built a range of in-database machine learning functions, from data ingestion to model storage and scoring capabilities.
For data ingestion, Vertica already had an extremely fast COPY command for loading data in parallel, storing it across multiple nodes in a cluster. When we wrote our distributed machine learning algorithms, we could rely on this existing data distribution across the nodes and focus our engineering efforts on the computation logic used to parallelize model training. We also used a built-in distributed file system to maintain intermediate results as well as the final, trained models. These machine learning functions are mainly developed using Vertica’s C++ SDK, and are executed by Vertica’s distributed execution engine.
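As a small sketch of that ingestion path (the file path and table definition are hypothetical), a bulk load uses the standard COPY statement, and Vertica distributes the loaded rows across the cluster automatically:

    -- Create a table and bulk-load a delimited file;
    -- rows are segmented across the nodes of the cluster
    CREATE TABLE sensor_readings (sensor_id INT, ts TIMESTAMP, reading FLOAT);
    COPY sensor_readings FROM '/data/readings.csv' DELIMITER ',' DIRECT;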
To give an example of a machine learning algorithm implemented natively within Vertica to leverage the MPP architecture, let’s look at Random Forests. Random Forests is a popular algorithm among data scientists for training predictive models, and it can be applied to both regression and classification problems. It provides good prediction performance and is quite robust against overfitting. However, the running time and memory footprint of this algorithm in R’s randomForest package or Python’s scikit-learn can be a major hurdle when working with large data volumes.
Our distributed implementation of Random Forest overcomes these obstacles. Model training is distributed across multiple nodes in the cluster, with different trees trained in parallel on different nodes, and the results are then combined into a single classification model. This model can then be used to score, in parallel, data that might be distributed across many nodes (possibly hundreds) in a cluster.
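A hedged sketch of that workflow in SQL, with hypothetical table and column names (parameter names such as ntree and max_depth follow Vertica’s documented random forest interface):

    -- Train a random forest classifier; trees are built in parallel across nodes
    SELECT RF_CLASSIFIER('churn_rf', 'training_data', 'churned',
                         'tenure, monthly_usage, plan_type'
           USING PARAMETERS ntree=100, max_depth=10);

    -- Score in parallel against data distributed across the cluster
    SELECT customer_id,
           PREDICT_RF_CLASSIFIER(tenure, monthly_usage, plan_type
                  USING PARAMETERS model_name='churn_rf') AS predicted_churn
    FROM current_customers;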
Q7. You offer SQL-based machine learning functions. Is this an extension to SQL? Can you give us some examples?
Waqas Dhillon: Although Vertica follows the SQL standard, it offers multiple SQL extensions, such as windowing functions and pattern matching. The in-database machine learning algorithms are now part of the database’s analytical toolset, allowing users to write SQL-like commands to run machine learning processes. They go beyond the other, simpler SQL extensions users will find within Vertica.
For example, a simpler SQL extension in Vertica is event series pattern matching. Event patterns are simply a series of events that occur in an order, or pattern, that you specify. Vertica evaluates each row in your table, looking for the events you define. When Vertica finds a sequence of rows that conforms to your pattern among a dataset of possibly hundreds of billions of rows or more, it outputs the rows that contribute to the match.
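In Vertica this is expressed with the MATCH clause; here is a condensed, hedged sketch modeled on the clickstream example in the documentation, with hypothetical table and column names:

    -- Find sessions where a visitor landed, browsed, and then purchased
    SELECT uid, sid, ts, event_name(), match_id()
    FROM clickstream_log
    MATCH (
      PARTITION BY uid, sid ORDER BY ts
      DEFINE
        Entry    AS action = 'landing',
        Onsite   AS action = 'view',
        Purchase AS action = 'purchase'
      PATTERN P AS (Entry Onsite* Purchase)
    );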
An example of a SQL extension for machine learning would be support vector machines (SVM). SVM is a very powerful algorithm that can be applied to large data sets for both classification and regression problems. For instance, an SVM model can be trained to predict the sales revenue of an e-commerce platform. There are many other extended SQL functions in Vertica as well to support a typical machine learning workflow from data preparation to model deployment.
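A hedged sketch of that regression case, using the SVM regression function from Vertica’s machine learning library (the table and column names are hypothetical):

    -- Train an SVM regression model to predict sales revenue
    SELECT SVM_REGRESSOR('revenue_svm', 'orders_history', 'revenue',
                         'page_views, cart_adds, discount_rate');

    -- Apply the model to new data in place
    SELECT PREDICT_SVM_REGRESSOR(page_views, cart_adds, discount_rate
                  USING PARAMETERS model_name='revenue_svm') AS predicted_revenue
    FROM orders_current;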
Q8. What are the common barriers to Applying Machine Learning at Scale?
Waqas Dhillon: There are several challenges when it comes to applying machine learning to massive volumes of data. Predictive analytics can be complex, especially when big data is added to the mix. Since larger data sets yield more accurate results, high-performance, distributed, and parallel processing is required to obtain insights at a speed suitable for today’s business.
Traditional machine learning tools require data scientists to build and tune models using only small subsets of data (called down-sampling) and to move data across different databases and tools, often resulting in inaccuracies, delays, increased costs, and slower access to critical insights:
• Slower development: Delays in moving large volumes of data between systems increases the amount of time data scientists spend creating predictive analytics models, which delays time-to-value.
• Inaccurate predictions: Since large data sets cannot be processed due to memory and computational limitations with traditional methods, only a subset of the data is analyzed, reducing the accuracy of subsequent insights and putting at risk any business decisions based on these insights.
• Delayed deployment: Owing to complex processes, deploying predictive models into production is often slow and tedious, jeopardizing the success of big data initiatives.
• Increased costs: Additional hardware, software tools, and administrator and developer resources are required for moving data, building duplicate predictive models, and running them on multiple platforms to obtain the desired results.
• Model management: Archiving and managing machine learning models is a challenge with most data science tools, as they usually lack a mechanism for model management.
Q9. How do you overcome such barriers in Vertica?
Waqas Dhillon: Capable of storing large amounts of diverse data while also providing key built-in machine learning algorithms, Vertica eliminates or minimizes many of these barriers. Built from the ground up to handle massive volumes of data, Vertica is designed specifically to address the challenges of big data analytics using a balanced, distributed, compressed columnar paradigm.
Massively parallel processing enables data to be handled at petabyte scale for your most demanding use cases. Column store capabilities provide data compression, reducing big data analytics query times from hours to minutes or minutes to seconds, compared to legacy technologies. In addition, as a full-featured analytics system, Vertica provides advanced SQL-based analytics including pattern matching, geospatial analytics and many more capabilities.
As an optimized platform enabling advanced predictive modeling to be run from within the database and across large data sets, Vertica eliminates the need for data duplication and processing on alternative platforms—typically requiring multi-vendor offerings—that add complexity and cost. Now that same speed, scale, and performance used for SQL-based analytics can be applied to machine learning algorithms, with both running on a single system for additional simplification and cost savings.
————————————-
Waqas Dhillon, Product Manager – Machine Learning, Vertica
Waqas is the product management lead for machine learning with Vertica. In his current role, he drives the strategy and implementation of advanced analytics and machine learning features in the Vertica MPP platform. Waqas holds a bachelor’s degree in computer software engineering from NUST and a master’s degree in management from Harvard University.
Prior to his current role, Waqas worked in multiple positions where he applied data analytics and machine learning to consumer research and revenue growth for companies in the consumer packaged goods and telecommunications industries.
Resources
– Vertica in-database machine learning: product page.
– Vertica in-database machine learning: full documentation.
– Try Vertica for free
Related Posts
– On using AI and Data Analytics in Pharmaceutical Research. Interview with Bryn Roberts ODBMS Industry Watch, Published on 2018-09-10
– On AI and Data Technology Innovation in the Rail Industry. Interview with Gerhard Kress ODBMS Industry Watch, Published on 2018-07-31
– On Artificial Intelligence, Machine Learning, and Deep Learning. Interview with Pedro Domingos ODBMS Industry Watch, Published on 2018-06-18
Follow us on Twitter: @odbmsorg
##