For most business problems you can get 95+% of the benefits by analysing only a small proportion of the available data.
By Steven Finlay
The cost of entry into the world of Big Data and data-driven decision making (machine learning / predictive analytics) has fallen dramatically in recent years. However, many data scientists find themselves being told that there is no budget to buy “a few off-the-shelf servers” or to invest in a heavyweight Big Data solution from the likes of IBM, SAS or Teradata. There are cheap cloud-based machine learning services available, for sure, but these are often limited.
They tend to be black box in nature and don’t integrate easily with operational systems. Consequently, for many data scientists it’s a case of just getting on with it: doing the best that you can with a mediocre laptop, an ODBC link to your organisation’s data warehouse and an open-source software package such as R. With such limited resources, how do you even start to analyse the many terabytes of data available? This situation, far from being the exception, is far more common than you might imagine.
In this and the subsequent article, I’m going to talk about a number of practical strategies that can be employed to deliver good quality machine learning solutions in resource constrained environments. The focus is on predictive modelling of individual behaviour, but the same principles apply to many other application domains.
The first thing to say is that it’s easy to fall into the trap of thinking that to deliver a good machine learning based solution you have no choice but to adopt a “brute force” approach: crunching every possible item of data using the most advanced algorithms available. This is a misconception. For most business problems you can get 95+% of the benefits by analysing only a small proportion of the available data, using relatively simple algorithms combined with a bit of common sense and business knowledge to reduce the problem to something more manageable. OK, so it’s going to be a good rather than an optimal solution (in terms of predictive accuracy), but that’s real life. In every other sphere of business, trade-offs between time, money and deliverables are accepted without question as part of the day job – no one ever delivers optimal. Big Data analytics isn’t any different. It’s all about delivering the best solution that you can with the time and resources available to you.
To begin, let’s talk about prioritisation. These days, there are often tens of thousands of data items available for analysis, if not more. What types of data should you consider first and what should you ignore? Consider Table 1.[i]
Table 1. Data Relevance: Data type / Description
• Primary – Data which relates to previous examples of the event/behaviour that you want to predict.
• Secondary – Data relating to previous events/behaviours which are similar (but different) to the predicted behaviour.
• Secondary – Information about the individual/entity for which predictions are being sought. For example, an individual’s age, gender, income and location.
• Tertiary – Other information that has no obvious or direct connection with the behaviour being predicted and/or the entity for which a prediction is required.
• Tertiary – Information about associated individuals/entities. For example, people’s friends and family members.
Let’s take the case where I want to apply machine learning to predict the likelihood that someone buys a can (or cans) of beans when they go shopping next week, so that I can target them with enticing offers for my new brand of beans.
Data with primary relevance is all to do with people’s previous bean buying behaviour. For example, how long since they last bought beans, the average interval between bean purchases, which brands they purchased, average purchase amount, and so on. Data with secondary relevance relates to buying patterns for similar or associated products such as sausages, eggs, bread and so on, as well as data about the individual themselves such as their age and income.
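To make this concrete, here is a minimal sketch of how such primary-relevance features might be derived from a raw transaction log. The column names, the tiny inline data set and the pandas-based approach are my own illustration, not taken from any particular system:

```python
import pandas as pd

# Hypothetical transaction log: one row per bean purchase by one shopper.
purchases = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-03", "2017-02-07", "2017-03-02", "2017-03-30"]),
    "brand": ["BrandA", "BrandA", "BrandB", "BrandA"],
    "amount": [1.20, 1.20, 0.99, 2.40],
})

as_of = pd.Timestamp("2017-04-10")  # the date on which we score the shopper

features = {
    # How long since they last bought beans.
    "days_since_last_purchase": (as_of - purchases["date"].max()).days,
    # Average interval between bean purchases (diff() yields timedeltas).
    "avg_days_between_purchases": purchases["date"].diff().dt.days.mean(),
    # Most frequently purchased brand.
    "favourite_brand": purchases["brand"].mode()[0],
    # Average purchase amount.
    "avg_purchase_amount": purchases["amount"].mean(),
}
print(features)
```

In a real project the same handful of aggregations would simply be run per customer over the full purchase history.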
Tertiary data relates to all the other stuff and often accounts for the majority of the data available. Maybe there are references to beans on social media. Maybe someone drives a BMW, or has green eyes or their mother has a pet cat.
The list is almost endless. Almost anything could have some correlation with bean buying behaviour at some level. However, what one tends to find in many application domains is that data with primary relevance, captured by just a few dozen variables, provides the vast majority of the power of any predictive models which are constructed. Consequently, these are the variables to prioritise first. Secondary data may also add value, but to a lesser degree.
There has been a lot of hype about discovering “hidden patterns” in all that other (tertiary) data out there.
The reality is that these patterns do exist, but they are far rarer and add much less value than you might imagine, over and above the primary and secondary data types described previously. “Over and above” is the key phrase here. Just because a data item is correlated with what you are trying to predict does not mean it will add value to any predictive models that you create. It’s got to provide something additional to the data that you already have.
What one often finds in practice is that a lot of tertiary data correlates with primary and secondary data types. Therefore, even if tertiary data is highly correlated with the target variable, including it in the analysis will provide only a small uplift to your models. In my experience, tertiary data sources only really come into their own when primary and secondary data is sparse or unavailable.
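The “over and above” point can be illustrated with synthetic data: a tertiary variable that looks strongly correlated with the target on its own, but adds almost nothing once the primary variable it tracks is accounted for. This is a hypothetical sketch, not real shopping data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Primary signal, e.g. standardised bean-buying recency.
primary = rng.normal(size=n)
# Target driven almost entirely by the primary variable.
target = primary + 0.3 * rng.normal(size=n)
# Tertiary variable that is itself driven by the primary one
# (say, a social-media signal that tracks shopping habits).
tertiary = 0.9 * primary + 0.4 * rng.normal(size=n)

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

raw = corr(tertiary, target)  # looks impressive in isolation

def residual(x, given):
    # Least-squares regress `given` out of `x` and keep the residual.
    slope = np.dot(x, given) / np.dot(given, given)
    return x - slope * given

# Correlation of the residuals: the tertiary variable's contribution
# over and above what the primary variable already tells us.
partial = corr(residual(tertiary, primary), residual(target, primary))
print(f"raw correlation {raw:.2f}, partial correlation {partial:.2f}")
```

The raw correlation is high, yet the partial correlation is close to zero: adding the tertiary variable to a model that already contains the primary one would buy you very little.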
Two other factors to consider when prioritising what data to include in the machine learning process are recency of data and current business practice. Recent history is a far better predictor of behaviour than what occurred in the distant past.
One way to think about this is that many types of data have a half-life. Their value steadily diminishes over time.
A person’s shopping history in the last month will be far more predictive of bean buying behaviour than details of what they bought last year, which in turn will be more predictive than what they were buying a decade ago.
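If you want this half-life idea to feed directly into your features, one simple encoding is an exponential decay weight. The 30-day half-life below is an arbitrary illustration; the right value depends on the behaviour being modelled:

```python
def recency_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Weight for a data point that is age_days old; halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)

# A purchase a month ago counts half as much as one today;
# a purchase from last year barely registers.
print(recency_weight(0))    # 1.0
print(recency_weight(30))   # 0.5
print(recency_weight(365))
```

Each transaction's contribution to an aggregate feature (total spend, purchase count, and so on) is then multiplied by its weight before summing.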
The second consideration is current business practice. If machine learning is going to be used to replace or supplement manual decision making, then the data used in the current decision-making system will almost certainly be useful. This is because the organisation will have spent a lot of time trying to optimise what it does, and will therefore have considered carefully what data human decision makers require. Consequently, one would be remiss not to consider those data items in any machine learning based solution.
So before crunching the data, sit down with subject matter experts, work your way through the organisation’s data assets and prioritise them. Then limit yourself to no more than, say, 400–500 core data items initially (or whatever your laptop can comfortably deal with). If time and resources permit, you can always go back and explore further. In my experience, using some simple logic incorporating the aforementioned strategies can reduce the amount of data under consideration by several orders of magnitude, before doing any quantitative analytics whatsoever, and will still deliver near-optimal solutions in most cases.
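The outcome of such a session might be captured in a simple data dictionary that you can then filter mechanically. Everything here, from the variable names to the tier assignments and the cap, is hypothetical:

```python
# Hypothetical data dictionary: each candidate variable tagged with a
# relevance tier (1 = primary, 2 = secondary, 3 = tertiary) agreed with
# subject matter experts, plus how recent the underlying data is.
catalogue = [
    {"name": "days_since_last_bean_purchase", "tier": 1, "months_old": 1},
    {"name": "avg_bean_spend_last_quarter",   "tier": 1, "months_old": 3},
    {"name": "sausage_purchases_last_month",  "tier": 2, "months_old": 1},
    {"name": "age",                           "tier": 2, "months_old": 0},
    {"name": "car_brand",                     "tier": 3, "months_old": 24},
    {"name": "eye_colour",                    "tier": 3, "months_old": 60},
]

MAX_VARIABLES = 4  # in practice, whatever your laptop can comfortably handle

# Sort by relevance tier first, then by recency, and cap the list.
ranked = sorted(catalogue, key=lambda v: (v["tier"], v["months_old"]))
shortlist = [v["name"] for v in ranked[:MAX_VARIABLES]]
print(shortlist)
```

Crude as it is, a pass like this discards the tertiary bulk of the data before any modelling starts.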
In my next article I’ll talk about instance sampling, preliminary variable selection and choice of algorithm.
Steven Finlay’s latest book, Artificial Intelligence and Machine Learning for Business: A No-Nonsense Guide to Data Driven Technologies, was published in June 2017 by Relativistic.
[i] Adapted from: Finlay, S. (2014). Predictive Analytics, Data Mining and Big Data: Myths, Misconceptions and Methods. Palgrave Macmillan.