Big Data. Small Laptop. (Part 3 of 3)
Developing useful predictive models in a resource constrained environment.
Doing big data analytics when resources are limited can be challenging. Maybe all you have is an old desktop or a hand-me-down laptop and no money for an upgrade. What can one do in such circumstances?
This is the final article in a series of three, which discuss some practical expedients for doing machine learning in large scale data environments when resources are limited. The two previous articles[i] discussed ways to whittle down the universe of potential data using domain expertise and discussed the merits of sampling. In this article, I’m going to discuss data pre-processing, variable selection and choice of algorithm (in no particular order). As before, the focus is on predictive modelling of individual behaviour, but the same principles apply to many other application domains.
My main focus is going to be on an “old favourite:” linear regression (least-squares). I’m going to be talking about how to use it as a preliminary analysis and modelling tool for classification problems. Now obviously, most people with a traditional statistics background will immediately say that that’s not appropriate: “Linear regression for classification, that’s outrageous!” but please bear with me.
Let’s start by comparing logistic regression with linear regression. For decades, logistic regression has acted as the baseline or “gold standard” for binary classification problems. For a machine learning solution to be considered any good it must at least match, if not exceed, the performance of logistic regression. Consequently, when developing classification models, it’s not a bad idea to start with a logistic regression model to gauge your final solution against. Logistic regression is not particularly CPU intensive compared to modern (deep) neural network and ensemble based approaches, but even logistic regression can require hours or days of CPU if one has data sets containing thousands of variables[ii]. Linear regression, on the other hand, requires 1-2 orders of magnitude less CPU; i.e. in situations where it’s taking hours for a logistic regression to run, a linear regression can be undertaken in a matter of seconds.
One argument against using linear regression for classification is that some of the standard regression assumptions are broken. Another is that linear regression provides inaccurate probability estimates that can be negative or greater than 1. However, if all one is interested in is maximising discrimination within the population, i.e. optimising measures such as AUC, GINI, KS, etc., then linear regression offers much the same level of discrimination as logistic regression for binary classification problems. The fact that the errors are not normally distributed, or that there is a high degree of heteroscedasticity, is not relevant. To put it another way, if the resulting model ranks well on an independent validation sample, who cares how it’s derived or what assumptions are broken?[iii] If accuracy as well as discrimination is important, then a calibration exercise can be undertaken; i.e. a transformation applied to the outputs of the linear regression model to align them to a 0 – 1 probability range.
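The discrimination and calibration points above can be illustrated with a minimal Python sketch. It rests on assumptions made purely for illustration: the data is simulated, the helper names (fit_ols, score, auc, calibrate_by_bins) are hypothetical, the benchmark is the score that generated the data rather than a fitted logistic regression, and the calibration step (replacing each score with the observed event rate in its score band) is only one simple choice of transformation.

```python
import math
import random

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def score(beta, X):
    """Raw linear regression output; not guaranteed to lie between 0 and 1."""
    return [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in X]

def auc(scores, y):
    """Chance that a random event case outscores a random non-event case."""
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibrate_by_bins(scores, y, n_bins=10):
    """Map each score to the observed event rate of its score band."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = math.ceil(len(order) / n_bins)
    calibrated = [0.0] * len(scores)
    for start in range(0, len(order), size):
        band = order[start:start + size]
        rate = sum(y[i] for i in band) / len(band)
        for i in band:
            calibrated[i] = rate
    return calibrated

# Simulated data (an assumption): two variables drive a binary outcome.
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(1000)]
true_logit = [1.5 * a - 1.0 * b for a, b in X]
y = [1 if random.random() < 1 / (1 + math.exp(-z)) else 0 for z in true_logit]

beta = fit_ols(X, y)
raw = score(beta, X)
calibrated = calibrate_by_bins(raw, y)

print("AUC, linear regression score:", round(auc(raw, y), 3))
print("AUC, data-generating score:  ", round(auc(true_logit, y), 3))
print("raw score range:       ", round(min(raw), 2), "to", round(max(raw), 2))
print("calibrated score range:", min(calibrated), "to", max(calibrated))
```

In practice the calibration mapping would be built on a development sample and checked on an independent validation sample; everything is done on one sample here only to keep the sketch short.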
Given its speed, linear regression can also be employed very successfully as a “wrapper” method for variable selection[iv]. Therefore, even if you want to utilise a logistic regression (or another) approach, using linear regression for variable selection is worth considering. In fact, linear regression is so fast that one can use it as part of a bootstrap variable selection process, running multiple regressions with different sets of variables and run parameters; for example, forward and backward stepwise procedures with different entry/exit criteria. Variables are then ranked based on a function of their significance in each of the models in which they feature. Those with the highest rank are taken forward to the next stage.
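A bootstrap selection process of this kind might be sketched as follows. This is a simplified illustration, not a definitive procedure: it uses forward stepwise selection only, the entry criterion is a minimum relative reduction in the residual sum of squares rather than a significance test, and variables are ranked by how often they are selected. The function names (rss, forward_select, bootstrap_rank), the thresholds and the simulated data are all assumptions.

```python
import random

def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit on the chosen columns."""
    rows = [[1.0] + [r[c] for c in cols] for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, y))

def forward_select(X, y, min_gain):
    """Add the variable that cuts RSS most until the relative gain is too small."""
    chosen, remaining = [], list(range(len(X[0])))
    current = rss(X, y, chosen)
    while remaining:
        best = min(remaining, key=lambda c: rss(X, y, chosen + [c]))
        new = rss(X, y, chosen + [best])
        if (current - new) / current < min_gain:
            break
        chosen.append(best)
        remaining.remove(best)
        current = new
    return chosen

def bootstrap_rank(X, y, n_runs=20, gains=(0.01, 0.02, 0.03), seed=0):
    """Count how often each variable is selected across bootstrap samples."""
    rng = random.Random(seed)
    counts = [0] * len(X[0])
    n = len(X)
    for run in range(n_runs):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        # Cycle through the entry criteria so runs differ in strictness
        for c in forward_select(Xb, yb, gains[run % len(gains)]):
            counts[c] += 1
    ranking = sorted(range(len(counts)), key=lambda c: -counts[c])
    return ranking, counts

# Simulated data (an assumption): six candidates, only 0 and 3 carry signal.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(300)]
y = [1 if 1.5 * r[0] - 1.5 * r[3] + rng.gauss(0, 1) > 0 else 0 for r in X]

ranking, counts = bootstrap_rank(X, y)
print("selection counts per variable:", counts)
print("variables ranked by selection frequency:", ranking)
```

Selection frequency is the simplest possible ranking function; a weighted measure of each variable's significance across runs would sit closer to the description above.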
Let’s now move on to compare linear models with other machine learning approaches. It’s certainly the case that overall, complex models such as (deep) neural networks and ensembles can deliver solutions that are sometimes 5 – 10% more discriminatory than simple linear models. However, the training algorithms applied to deliver these more advanced forms of model are very computationally demanding. They often require 2-3 orders of magnitude more CPU than logistic regression, and so 3-5 orders of magnitude more CPU than linear regression.
In situations where it may take days to train a complex multi-layered neural network, a linear regression model can be produced in just a few minutes. It’s also the case that with suitable data pre-processing (transformation of the independent variables), the performance of linear modelling approaches can still run the newer approaches pretty close for many common business problems. Linear models provide the added benefit of being easy to understand. You can explain to business leaders and industry regulators which variables contribute to model outputs in simple language. This is something which is becoming increasingly important as automated decision making, based on machine learning, is being rolled out across an ever increasing set of applications that impact people in their everyday lives.
One key factor in maximising the performance of linear (or logistic) regression models is appropriate data pre-processing, in particular transformations that linearize the relationships between each of the independent variables and the target variable before the data is presented to the chosen machine learning algorithm. Simple transformations such as “Weights of Evidence” will do this effectively and with very little loss of information[v], while at the same time dealing with missing data and outliers.
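As a rough illustration of a weights of evidence transformation, the sketch below bands a single variable, gives missing values a band of their own, and replaces each value with the log ratio of the band's share of non-events to its share of events. Several details are assumptions: the function name woe_table, the cut points, the toy data, the sign convention (some texts use the reverse ratio) and the 0.5 added to each count to avoid taking the log of zero.

```python
import math

def woe_table(values, target, cut_points):
    """Weight of evidence per band of one variable; missing gets its own band."""
    def band(v):
        if v is None:
            return "missing"
        for i, cut in enumerate(cut_points):
            if v < cut:
                return i
        return len(cut_points)  # everything at or above the last cut point

    counts = {}  # band -> [non-events, events]
    for v, t in zip(values, target):
        counts.setdefault(band(v), [0, 0])[t] += 1

    total_non = sum(c[0] for c in counts.values())
    total_evt = sum(c[1] for c in counts.values())
    woe = {}
    for b, (non, evt) in counts.items():
        # 0.5 is a smoothing constant (an assumption) so empty cells do not break the log
        woe[b] = math.log(((non + 0.5) / total_non) / ((evt + 0.5) / total_evt))
    return band, woe

# Toy data (an assumption): age with two missing values and an outlier of 999.
ages = [22, 25, 31, 38, 45, 52, 61, 70, None, None, 999, 28, 41, 57]
default = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]

band, woe = woe_table(ages, default, cut_points=[30, 50])
transformed = [woe[band(v)] for v in ages]

for b, w in woe.items():
    print("band", b, "-> weight of evidence", round(w, 3))
```

The outlier of 999 simply falls into the top band and receives the same value as any other older case, while the missing cases are scored on their own observed behaviour, so neither needs separate treatment before modelling.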
A final comment is that once data has been suitably pre-processed many (but not all) business problems become predominantly linear in nature. Therefore, you can get very close to optimal using a linear modelling approach. The temptation may be to jump straight to a complex (non-linear) approach such as a deep neural network, but this may not be needed. Very complex modelling solutions are best suited to very complex problems. A good way to test this is to see how much improvement a relatively simple neural network provides over and above a linear model. If a network with a few neurons in a single hidden layer gives only very marginal benefits (or even none at all), then having a much more complex network with many neurons and hidden layers won’t get you much further; i.e. it’s an indication that you don’t have a highly non-linear problem. Therefore, a simple (linear) modelling approach will suffice.
So to sum up, whatever machine learning algorithm you want to apply, you can do a lot of preliminary work very quickly and easily using simple regression approaches. In doing so, you can potentially save a lot of time and effort down the line, enabling you to get the most from the limited time and resources available to you.
Steven Finlay’s latest book: Artificial Intelligence and Machine Learning for Business. A No-Nonsense Guide to Data Driven Technologies was published in June 2017, by Relativistic.
[i] Finlay, S. (2017). Big Data. Small Laptop: Developing useful predictive models in a resource constrained environment. ODBMS. http://www.odbms.org/2017/06/big-data-small-laptop-developing-useful-predictive-models-in-a-resource-constrained-environment/
Finlay, S. (2017). Big Data. Small Laptop (Part 2 of 3): Developing useful predictive models in a resource constrained environment. ODBMS. http://www.odbms.org/2017/07/big-data-small-laptop-part-2-of-3/
[ii] The time to create a model using logistic regression (like most, but not all machine learning approaches) is linear in relation to the number of observations in the data set, but proportional to the square (or greater) of the number of variables.
[iii] Clearly there are problems where people do care, but for many big data applications a model’s discriminatory performance is the key measure.
[iv] Variable selection methods can broadly be grouped into “wrappers” and “filters.” Filters consider the relationship (correlation) between each data item (independent variable) and the target (dependent) variable, independently of the other data items under consideration. Wrappers on the other hand consider the inter-relationships between the independent variables.
[v] For a description of weights of evidence transformations see: Thomas, L. C., Edelman, D. B. and Crook, J. N. (2002). Credit Scoring and Its Applications, 1st edn. Philadelphia: SIAM.