Developing useful predictive models in a resource constrained environment.
(Part 2 of 3)
Some data points are far, far more valuable than others
By Steven Finlay
Not everyone can afford to invest in Big Data Technologies. Many Data Scientists are stuck using a mediocre desktop or a hand-me-down laptop. Yet, this is no excuse for failing to deliver very good, if not quite optimal, machine learning solutions in situations where there can be many terabytes of data available.
This is the second of three articles describing some practical expedients for approaching large scale machine learning problems when resources are limited. In the previous article I focused on using a hierarchy for prioritising data in conjunction with human domain expertise. In this article, I’m going to discuss sampling (Sample Design) as a method to further reduce problems to something more manageable. As before, the focus is on predictive modelling of individual behaviour, but the same principles apply to many other application domains.
To some, “sampling”, that is, building predictive models using only a small subset of the observations (data points) available, has become something of a dirty word. The argument against sampling is that it leaves too few observations to cater for niche populations within the data; i.e. a model built from a small sample will not be sufficiently granular to capture subtle patterns in the data, and hence will result in a significantly sub-optimal solution. However, if you have, say, 100 million data points, then taking only 100,000 may be sufficient to give you 99% of the accuracy of a model derived from the full population. The trick is being clever about which 100,000 data points you include in your sample.
The first point to make is that some data points are far, far more valuable than others. It is these data points that you should focus on within your sampling strategy. With classification, for example, most problems are highly unbalanced. That is, there are far more cases of some classes than others. Perhaps more importantly, an observation in one class may have much greater worth than observations in other classes. Take target marketing. Often, only around 1 in 100 customer contacts results in a positive event; i.e. a sale. However, that’s OK because a sale generates many dollars of profit whereas a non-sale incurs just a few cents in advertising costs. Consequently, a stratified sampling strategy can work exceedingly well.
A common strategy for target marketing is to sample all of the sale events, but only a small proportion of the “non-events.” A useful rule of thumb is that having between 5 and 10 non-events for every event is sufficient to build models that are so close to optimal that the difference does not matter. This is on the understanding that one applies a suitable approach to balancing and weighting the sample so that it is optimized for the chosen modelling approach, and that model assessment is representative of the full population.
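As a rough illustration of this strategy, the sketch below keeps every event and draws a fixed number of non-events per event, attaching a weight to each sampled case so that the weighted sample still adds up to the full population. The function name, the record layout and the 1% sale rate are assumptions made for the example, not part of any particular tool.

```python
import random


def sample_events_and_non_events(records, is_event, ratio=5, seed=0):
    """Keep every event and a random subset of the non-events.

    Returns (record, weight) pairs. A weight is the number of population
    cases that the sampled record stands for.
    """
    rng = random.Random(seed)
    events = [r for r in records if is_event(r)]
    non_events = [r for r in records if not is_event(r)]

    # Keep `ratio` non-events per event, or all of them if there are fewer.
    n_keep = min(len(non_events), ratio * len(events))
    kept = rng.sample(non_events, n_keep)
    non_event_weight = len(non_events) / n_keep if n_keep else 0.0

    return [(r, 1.0) for r in events] + [(r, non_event_weight) for r in kept]


# Hypothetical population: 100,000 contacts, 1 sale per 100 contacts.
population = [{"id": i, "sale": i % 100 == 0} for i in range(100_000)]
sample = sample_events_and_non_events(population, lambda r: r["sale"], ratio=5)

print(len(sample), "cases sampled from", len(population))
```

The weights can be passed to a modelling routine that accepts case weights, or used when assessing the model, so that results reflect the full population rather than the artificially balanced sample.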
What one also finds is that there are often large homogeneous groups within the non-event population (e.g. dormant customers). Therefore, as long as at least a few hundred of these types of cases are included in the development sample, the rest can be discarded. Likewise, if you know that certain groups in the population are very important, ensure that these groups are well represented in the sample. In this way, it is possible to develop sophisticated sampling strategies which result in a sample that is just a fraction of the size of the full population, yet includes almost all of the most important cases.
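One simple way to act on this is to cap the number of cases taken from each group known to be homogeneous and to keep every other group in full. The sketch below assumes each record carries a segment label; the segment names, sizes and the cap of 300 are hypothetical.

```python
import random
from collections import defaultdict


def cap_homogeneous_groups(records, group_of, caps, seed=0):
    """Cap the groups named in `caps`; keep all other groups whole.

    `caps` maps a group name to the maximum number of cases to keep.
    Returns (record, weight) pairs, where weight = group size / cases kept.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in records:
        groups[group_of(r)].append(r)

    sample = []
    for name, members in groups.items():
        cap = caps.get(name)
        if cap is not None and len(members) > cap:
            chosen = rng.sample(members, cap)
        else:
            chosen = members
        weight = len(members) / len(chosen)
        sample.extend((r, weight) for r in chosen)
    return sample


# Hypothetical non-event population with one large homogeneous segment.
non_events = (
    [{"segment": "dormant"} for _ in range(50_000)]
    + [{"segment": "active"} for _ in range(5_000)]
    + [{"segment": "high_value"} for _ in range(200)]
)
capped = cap_homogeneous_groups(
    non_events, lambda r: r["segment"], caps={"dormant": 300}
)
print(len(capped), "cases kept out of", len(non_events))
```

Which groups count as homogeneous, and how many cases are enough, is a judgement that relies on domain knowledge; the code only applies the decision once it has been made.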
A second point about sampling is that it’s true that larger development samples lead to more accurate models, but it’s very much a case of diminishing returns. For many classification and regression problems, once samples contain more than a few tens of thousands of cases (of each class for classification problems), the benefits of larger samples are marginal. It’s certainly not the case that an infinite sample will lead to perfect predictions. Rather, given a fixed set of predictor (independent) variables, model accuracy is asymptotic in nature. There is a maximum level of accuracy which can be achieved regardless of the size of the development sample employed.
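The diminishing returns can be seen in something much simpler than a full model: the standard error of an estimated rate, which shrinks only with the square root of the sample size. The sketch below is an illustration of that one effect, using a hypothetical 1% event rate. It says nothing about the accuracy ceiling imposed by the predictor variables, which no amount of extra data removes.

```python
import math


def standard_error(rate, n):
    """Standard error of a proportion estimated from n independent cases."""
    return math.sqrt(rate * (1.0 - rate) / n)


RATE = 0.01  # hypothetical event rate, e.g. 1 sale per 100 contacts

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  standard error={standard_error(RATE, n):.5f}")
```

Each tenfold increase in sample size reduces the standard error by a factor of about 3.2, so going from 10,000 to 1,000,000 cases costs 100 times as much data for a tenfold gain in precision.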
A final comment is that a very significant proportion of machine learning based solutions include decision rules. These rules fire when certain thresholds are breached, resulting in action being taken. A predictive model for credit card transaction fraud, for example, will trigger a referral to a bank’s fraud team if the model predicts fraud with, say, a 25% likelihood or more. What this means is that improvements to the overall accuracy of predictions often don’t provide any practical benefit. If one model predicts a case has a 60% chance of being fraudulent, but a better (more accurate) model refines this to 45%, then the decision rule will be triggered in either case. Improvements to model accuracy are only relevant to marginal cases around the threshold values of the decision rule(s), which affect relatively few cases.
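The fraud example can be written out in a few lines. The 25% threshold and the 60% and 45% predictions come from the text above; the function name and the two marginal scores (27% and 22%) are made up for illustration.

```python
REFERRAL_THRESHOLD = 0.25  # refer when predicted fraud likelihood is 25% or more


def refer_to_fraud_team(fraud_probability, threshold=REFERRAL_THRESHOLD):
    """Decision rule: True if the transaction should be referred."""
    return fraud_probability >= threshold


# Far from the threshold: the better model changes the score, not the action.
old_model_score, new_model_score = 0.60, 0.45
same_action = refer_to_fraud_team(old_model_score) == refer_to_fraud_team(new_model_score)

# Near the threshold: a small change in score flips the action.
marginal_old, marginal_new = 0.27, 0.22
different_action = refer_to_fraud_team(marginal_old) != refer_to_fraud_team(marginal_new)

print("Same action far from threshold:", same_action)
print("Different action near threshold:", different_action)
```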
Therefore, if you can identify cases which are likely to lie far from the decision rule threshold(s), these cases can be sampled at a lower rate than cases near the threshold(s), where improvements in accuracy actually change the decision. One way to do this is by using any existing models or ranking tools. To put it another way, if you are upgrading an existing predictive model with a new one, then both the old and new models will tend to generate broadly similar predictions. Consequently, you can use the predictions generated by the old model to help you design your sampling strategy.
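A minimal sketch of this idea, assuming each case already carries a score from the old model: cases whose old score lies within a band around the threshold are all kept, and cases further away are kept with a small probability and weighted up accordingly. The band width, the 5% rate for distant cases and the field name `old_score` are hypothetical choices, not recommendations.

```python
import random


def sampling_rate(old_score, threshold=0.25, band=0.10, far_rate=0.05):
    """Keep everything within `band` of the threshold, `far_rate` elsewhere."""
    return 1.0 if abs(old_score - threshold) <= band else far_rate


def sample_by_old_score(cases, score_of, seed=0, **rate_options):
    """Return (case, weight) pairs, with weight = 1 / sampling rate."""
    rng = random.Random(seed)
    sample = []
    for case in cases:
        rate = sampling_rate(score_of(case), **rate_options)
        if rng.random() < rate:  # always true when rate is 1.0
            sample.append((case, 1.0 / rate))
    return sample


# Hypothetical cases with old-model scores spread evenly between 0 and 1.
cases = [{"old_score": i / 1000} for i in range(1000)]
score_sample = sample_by_old_score(cases, lambda c: c["old_score"])

near = sum(1 for _, w in score_sample if w == 1.0)
print(near, "cases near the threshold,", len(score_sample) - near, "further away")
```

Because the old and new models will not agree exactly, the band should be wide enough to cover cases that the new model might move across the threshold.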
Even if you don’t want to build your final models using samples, at least consider using samples during exploratory analysis. For many methods, such as neural networks and clustering, many test models are required to explore the different options available; for example, deciding how many neurons to use in each hidden layer of a neural network, or the optimum number of clusters. Using samples to help determine these parameters will reduce the overall development time for your solutions.
In my next article I’ll talk about preliminary variable selection, model forms and algorithms.
Steven Finlay’s latest book, Artificial Intelligence and Machine Learning for Business: A No-Nonsense Guide to Data Driven Technologies, was published in June 2017 by Relativistic.
Finlay, S. (2017). Big Data. Small Laptop: Developing useful predictive models in a resource constrained environment. (Part 1) ODBMS.org http://www.odbms.org/2017/06/big-data-small-laptop-developing-useful-predictive-models-in-a-resource-constrained-environment/
Weiss, G.M. (2004). Mining with rarity: A unifying framework. ACM SIGKDD Explorations Newsletter 6(1): 370-91.
Crone, S. and Finlay, S. (2012). Instance sampling in credit scoring: An empirical study of sample size and balancing. International Journal of Forecasting 28(1): 224-38.