For the series Q&A with Data Scientists: Richard J Self
Richard J Self is a Research Fellow in the Big Data Lab at the University of Derby in the UK. His current area of research and teaching is in Governance and Analytics.
He has been using SAS for over thirty years and introduced it into the University of Derby for teaching four years ago. He also introduced Watson Analytics into the curriculum in the autumn of 2015 and, as a result, the University of Derby has become the first university in the UK, and the second outside the USA, to be able to offer Watson Analytics for Students to all the students in the university.
He is regularly invited to present on the topic of Governance and Big Data Analytics at business conferences across a wide range of business sectors.
He is the main editor of the Big Data and Analytics Educational Resources website at http://computing.derby.ac.uk/bigdatares/
His personal website is at http://computing.derby.ac.uk/c/people-2/richard-j-self/ and his LinkedIn profile is at https://uk.linkedin.com/in/richardselfllm
His email address is r.j.self AT derby.ac.uk
Q1. What should every data scientist know about machine learning?
It all depends on what you mean by the term machine learning and how broad a definition of Data Scientist you want to use.
On one hand, it seems the term “machine learning” has become the key buzzword for the Data Science industry: everything is about machine learning, and the term is even used as a synonym for Artificial Intelligence. The basic, common and very broad concept is of the machine being able to detect patterns in the data. As a result, everything from regression analysis (of all types) through clustering and systems using neural networks to cognitive systems appears to qualify as machine learning in the terms of the new buzzword.
On the other hand, many see it as relating more specifically to systems which are trained to do something. Good examples are autonomous cars and image recognition. Many of these systems are trained through the presentation of a large set of training data, either with human assistance or by using self-learning technologies.
Returning to your question, I will answer in relation to the latter concept and the fact that Data Science is a very broad church indeed. In many respects Data Scientist is as unspecific as the term Mechanical Engineer or Accountant. They all cover very wide ranges of career paths and different blends of required knowledge and useful knowledge.
We have seen this effect in the older term “Computer Science”, which could cover careers in Security, Networks, Data Bases, Computer Games Programming and many others. In Computer Science, we tend to consider that we need to provide all students with a basic, common foundation of Maths, Programming, Data Bases, Security and Governance, and Networks.
After that we then develop more specialised curricula that develop deeper and more specialised knowledge and understanding relating to each of the wide range of specialities.
In this context, an outline of all the forms of machine learning is clearly needed, together with an overview of the strengths and weaknesses of each form in the initial, common curriculum for Data Science. This can then be followed up with much deeper courses for each form, as required by the different specialisms within Data Science.
Q2. Is domain knowledge necessary for a data scientist?
It turns out that deep domain knowledge is critical to becoming an effective data scientist. I will be developing this theme in the remaining questions.
My reason for this rather provocative statement is that in almost all the tasks that data scientists undertake, the nature of the data involved has specific metadata, values, meanings, semantics and ontologies.
It is clearly not possible for undergraduates, or even master's students, to develop any significant domain knowledge during degree programmes and courses. However, it is possible to help students develop some level of understanding of the nature of specific domain knowledge through carefully considered and designed projects that they undertake during their education.
In most cases, the acquisition of adequate domain knowledge will be done during their professional activities. In many cases, domain knowledge will be found from colleagues with considerable experience of the domain, as part of a data science team.
Q3. What is your experience with data blending?
One of the largest and most complicated BI projects that I undertook during the early 1990s required collecting data from a very disparate range of sources in many different formats and with few common keys from across the aerospace industry.
The project required very high levels of domain knowledge just to develop the basic data sources and to make the necessary personal contacts in many organisations to obtain the data, in whatever form and format was available. The level of completeness of the data was rather limited.
The overall data science and modelling part of the project was, however, successful and delivered some very interesting and valuable insights.
Q4. Predictive Modeling: How can you perform accurate feature engineering/extraction?
The term “Predictive Modelling” is an interesting one to me. It has clearly been developed to entice businesses to use modelling tools to gain an answer to the perennial question from business: “what will tomorrow be like?” (or next week, next month, next year, etc.).
When I started financial modelling in the early 1980s, we built the 30-year sales models and then ran them with a wide range of different assumptions and generated the classic star-shaped sensitivity chart. We also developed a range of cost, revenue, cash-flow and margin charts for the project lifetime, typically with min / max / median lines. But we all recognised that these were only simple guides to a range of possible guesses about the future. We would never have dignified this process with the term Predictive Modelling.
Today, the term seems to be based on large-scale Big Data Analytics with machine learning (see above) that can forecast the future by the use of highly complex statistical techniques, often applying a wide range of correlation techniques in an attempt to identify a mathematical formula which has the best fit to the “training data”.
What we need to remember is that these tools are generally based on statistical techniques, which analyse data from the past. Any projections are only valid as long as the conditions remain the same as during the data collection period.
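This limitation can be illustrated with a minimal sketch (entirely hypothetical data, a simple least-squares fit rather than any specific pollster's or vendor's model): a model fitted on past observations projects well only while the underlying regime holds, and goes badly wrong after a structural break.

```python
# Illustrative sketch: a model fitted on past data only stays valid
# while the conditions of the data-collection period persist.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# "Training" period: a stable regime in which y grows by 2 per step.
past_x = list(range(10))
past_y = [2 * x + 1 for x in past_x]
a, b = fit_line(past_x, past_y)

# Projection is exact while the old regime persists...
assert abs((a + b * 10) - 21) < 1e-9

# ...but after a structural break (the slope halves, as with the
# Phillips curve's change of behaviour), the same model over-predicts.
new_actual = 21 + 1 * 5          # 5 steps into the new regime
projected = a + b * 15
error = projected - new_actual   # the model is off by 5
```

The point is not the arithmetic but the governance lesson: the fitted formula encodes the old regime, and nothing inside the model signals that the regime has changed.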
We have several excellent examples which demonstrate just how important this is.
The Phillips curve (see https://en.wikipedia.org/wiki/Phillips_curve#/media/File:Phillips_curve.jpg) connects wage inflation to unemployment levels. However, it is clear that there are two totally different relationships in this curve from 1913 to 1948. A similar pair of behaviours was also found during the period from the 1960s to the 1990s. If an economist were using the curve from the first half of the time-sequence, the projections (predictive models) in the second half would be incomprehensible.
BREXIT Referendum and 2016 USA Election
These two situations have demonstrated very powerfully the limitations of big data and predictive modelling and analytics. Whilst the exact cause of the failures of the pollsters has yet to be fully documented, it is already becoming clear that a primary component of the problem was the very high levels of trust in the predictive models and the sources of data.
There was too much faith in the coverage of the polling data, with a failure to adequately consider the “forgotten” who probably did not use social media: both the rust-belt inhabitants and also, surprisingly, the 20-to-30 age demographic, who do use social media, but mainly Instagram and Snapchat.
Q5. Can data ingestion be automated?
It all depends on the data sources. We need to remember John Easton’s observation from 2012 that “80% of all the data [that we need to use] is of uncertain veracity”. This includes data from all sources, whether internet and social media based, IoT sensor network based or from our core corporate systems.
Those of us who have been involved in major systems implementations remember the major data cleansing exercises which often removed over 70% of all data in the legacy systems. The big surprise to most of us has been the observation that, after about 5 years or so, we needed to carry out new data cleansing exercises on the “new systems”.
In general terms data from corporate core systems can mostly be automatically ingested, subject to some comparatively simple data cleansing to remove some of the rubbish that steadily accretes in any system.
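The “comparatively simple” cleansing step for corporate data might be sketched as follows (field names and rules are purely illustrative, not drawn from any particular system): deduplicate on a key field and drop records whose key is missing.

```python
# Hypothetical sketch of simple automated cleansing at ingestion time:
# drop records with a missing key and remove exact-key duplicates.

def cleanse(records, key_field):
    """Return records with missing-key rows and key duplicates removed."""
    seen = set()
    out = []
    for r in records:
        key = r.get(key_field)
        if key is None:
            continue            # drop rows missing the key
        if key in seen:
            continue            # drop duplicates of an already-seen key
        seen.add(key)
        out.append(r)
    return out

raw = [
    {"id": 1, "name": "Ada"},
    {"id": 1, "name": "Ada"},      # duplicate
    {"id": None, "name": "???"},   # missing key
    {"id": 2, "name": "Grace"},
]
clean = cleanse(raw, "id")
```

Real corporate pipelines layer many more rules on top, but the principle is the same: mechanical rules catch the rubbish that steadily accretes.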
Data from social media can initially be automatically ingested; however, it is often the case that human review will be needed to ensure that the data is adequately cleansed.
I remember an amusing case from a SAS Analytics EMEA conference where a bank in Southern Africa had launched a new product and wanted to track customer reaction to it using a hashtag related to the product’s name. Unfortunately, at that particular time there was an incident featuring a Hippopotamus with the same name as the banking product and the same hashtag. The consequence was that the bank’s team had to invest considerable effort in separating the two streams of social media feeds.
In the case of IoT sensor networks, the fundamental problem is that sensor calibrations drift with time at unpredictable rates. An early example of this occurred in Aerospace during the mid-1980s with Engine Condition Monitoring (ECM) from the three major aero-engine suppliers. For the first year or so, most of the airlines found that they were using the ECM analytical systems to trouble-shoot the sensors, rather than trouble-shooting the engines.
The use of anomaly detection techniques such as clustering and other forms of time-series based techniques can help to detect and adjust for the drift.
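One simple time-series approach of the kind described can be sketched as follows (all readings and thresholds are hypothetical; clustering-based variants follow the same idea of comparing recent behaviour against a calibration baseline):

```python
# Hypothetical sketch: flag sensor-calibration drift by comparing a
# rolling window's mean against the calibration baseline.

def detect_drift(readings, baseline_mean, window=5, threshold=1.0):
    """Return the index at which the rolling mean first departs from
    the baseline by more than `threshold`, or None if it never does."""
    for i in range(window, len(readings) + 1):
        window_mean = sum(readings[i - window:i]) / window
        if abs(window_mean - baseline_mean) > threshold:
            return i - 1        # first reading of a drifted window
    return None

# A sensor calibrated around 20.0 that slowly drifts upward.
stable = [20.0, 20.1, 19.9, 20.0, 20.1]
drifting = [20.5, 21.0, 21.5, 22.0, 22.5]

assert detect_drift(stable + drifting, baseline_mean=20.0) is not None
assert detect_drift(stable, baseline_mean=20.0) is None
```

Once drift is detected, the same rolling statistics can be used to re-baseline the sensor rather than simply discarding its data.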
Q6. How do you ensure data quality?
Data Quality is a fascinating question. It is possible to invest enormous levels of resource into attempting to ensure near perfect data quality and still fail.
The critical question should, however, start from the Governance perspective of questions such as:
- What is the overall business Value of the intended analysis?
- How is the Value of the intended insight affected by different levels of data quality (or Veracity)?
- What is the level of Vulnerability to our organisation (or other stakeholders) if the data is not perfectly correct (see J Easton of IBM comment above) in terms of reputation, or financial consequences?
Once you have answers to those questions and the sensitivities of your project to various levels of data quality, you will then begin to have an idea of just what level of data quality you need to achieve. You will also then have some ideas about what metrics you need to develop and collect, in order to guide your data ingestion and data cleansing and filtering activities.
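Such metrics might be as simple as per-field completeness and validity ratios. A minimal sketch (field names and validation rules are illustrative, not from any specific project):

```python
# Hypothetical sketch: per-field data-quality metrics (completeness
# and validity) to guide ingestion, cleansing and filtering decisions.

def quality_metrics(records, required_fields, validators):
    """Return {field: {"completeness": ratio, "validity": ratio}}."""
    n = len(records)
    report = {}
    for field in required_fields:
        present = [r[field] for r in records if r.get(field) is not None]
        check = validators.get(field, lambda v: True)
        valid = [v for v in present if check(v)]
        report[field] = {
            "completeness": len(present) / n,  # share of records populated
            "validity": len(valid) / n,        # share passing the rule
        }
    return report

records = [
    {"age": 34, "email": "a@example.com"},
    {"age": -5, "email": "b@example.com"},   # invalid age
    {"age": 41, "email": None},              # missing email
]
report = quality_metrics(
    records,
    required_fields=["age", "email"],
    validators={"age": lambda v: 0 <= v <= 120,
                "email": lambda v: "@" in v},
)
```

Tracking such ratios over time shows whether the achieved quality level matches the level the Governance questions above say the project actually needs.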
Q7. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
The answer to this returns to the Domain Expert question. If you do not have adequate domain expertise in your team, this will be very difficult.
Referring back to the USA election, one of the more unofficial pollsters, who got it pretty well right, observed that he succeeded because he actually talked to real people. This is domain expertise and Small Data.
All the official polling organisations have developed a total trust in Big Data and Analytics, because it can massively reduce the costs of the exercise. But they forget that we all lie unremittingly online. See the first of the “All Watched Over by Machines of Loving Grace” documentaries at https://vimeo.com/groups/96331/videos/80799353 to get a flavour of this unreasonable trust in machines and big data.