Q&A with Data Scientists: Claudia Perlich
Q1. What should every data scientist know about machine learning?
I will speak primarily about predictive modeling/supervised learning because this is where my expertise is. Also – I am looking at this question from the perspective of a ‘practical’ data scientist who is looking to solve a specific problem using machine learning, not somebody who is trying to develop new machine learning algorithms – although it would be good to know this too.
In practice, correct evaluation is incredibly difficult – and I am not even talking about in- vs. out-of-sample or validation vs. test set. Those are the table stakes, but not what matters most in applications. Almost anybody can build hundreds if not thousands of models on a given dataset, but being sure that the one you picked is indeed most likely the best for the job is an art! The question is typically not even which algorithm (logistic regression, SVM, decision tree, deep learning), but rather the entire pipeline: sampling a training set, preprocessing, feature representation, labeling, etc. None of this has anything to do with just ‘out-of-sample’ evaluation. So here is your compass for doing it right:
Your evaluation setting has to be as close as possible to the intended USE of your model.
You basically want to come as close as you can to simulating having that model in production and track the impact as far to the bottom line as you can. That means in a perfect world you need to simulate the decision that your model is going to influence. This is often not entirely possible.
Here is an example: You want to evaluate a new model to predict the probability of a person clicking on an ad. The first problem you have is that almost surely you have neither adequate training nor evaluation data … Because until you actually show the ads you have nothing to learn from. So welcome to the chicken-and-egg part of the world, with a lot of literature on exploration vs. exploitation. So already getting a decent dataset to use for evaluation is hard. You can of course consider some ideas from transfer learning and build your model on some other ad campaign and hope for the best – which is fine for learning, but really just adds one more question to your evaluation – which alternative dataset is best suited – and of course you still have no data for evaluation.
But let’s assume for the moment that you have a reasonably suitable dataset. Now you can of course calculate all kinds of things. But again, you have only added to the list of questions – what should you look at: likelihood, AUC, lift (at what percentage), cost per click? And while there are some statistical arguments for one over the other, there is no right answer.
What matters is what you are going to do with the model: Are you using it to select the creative in 100% of all cases? Are you using it to select only the top n percent of most likely opportunities? Do you want to change the bid price in an online auction based on this prediction? Or do you want to understand what makes people click on ads in general? All of those questions can be answered by more or less the same predictive task – predict whether somebody will click. But you need to look at different metrics in each case (in fact there is some correspondence between the above 4 metrics and the 4 questions here) and I would bet that you should select very different models for each of these uses.
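As an illustration of how different uses call for different metrics, here is a minimal Python sketch of two of them – a rank-based AUC and lift at the top fraction. The labels and scores are invented, and the AUC shortcut assumes untied scores:

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC: the fraction of (positive, negative) pairs in which
    the positive example is scored higher. Assumes untied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def lift_at(y_true, scores, frac):
    """Response rate in the top `frac` of scores, relative to the base rate."""
    k = max(1, int(len(scores) * frac))
    top = np.argsort(scores)[::-1][:k]
    return y_true[top].mean() / y_true.mean()

y = np.array([1, 0, 1, 0, 0, 0])               # made-up click labels
s = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])   # made-up model scores

print(auc(y, s), lift_at(y, s, 1/3))
```

Two models can easily swap places depending on which of these numbers you rank them by – which is exactly why the intended use has to pick the metric.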
Finally – have a baseline! It is one thing to know whether you are doing better or worse. But there is still the question – is it even worth it, or is there a simple solution that gets you close? Having a simple solution to compare to is a fundamental component of good evaluation. At IBM we always used ‘Willie Sutton’. He was a bank robber, and when asked why he robbed banks, his answer was because that’s “where the money was”. Any sales model we built was always compared to Willie Sutton – just rank companies by revenue. How much better does your fancy model get than that?
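A sketch of that Willie Sutton comparison – the account data below is entirely synthetic and invented for illustration, but it shows the shape of the check: any candidate model has to beat a plain rank-by-revenue ordering:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic accounts: revenue in $M and whether a sale closed.
# Bigger companies really are more likely to buy (the Willie Sutton effect).
revenue = rng.lognormal(mean=2.0, sigma=1.0, size=1000)
bought = (rng.random(1000) < 0.05 + 0.2 * (revenue > np.median(revenue))).astype(int)

def lift_at_top(y, scores, frac=0.1):
    """Response rate in the top `frac` of scores relative to the base rate."""
    k = max(1, int(len(y) * frac))
    top = np.argsort(scores)[::-1][:k]
    return y[top].mean() / y.mean()

# The baseline every fancy model has to beat: just rank by revenue.
baseline = lift_at_top(bought, revenue)
# A 'model' that scores at random should hover around lift 1.0.
random_model = lift_at_top(bought, rng.random(1000))

print(f"Willie Sutton lift: {baseline:.2f}  random-model lift: {random_model:.2f}")
```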
Q2. Is domain knowledge necessary for a data scientist?
Welcome to a long-standing debate. I was drafted on short notice to a panel back in 2013 at KDD, called ‘The evolution of the expert’, on exactly this topic.
There are many different ways I have tried to answer in the past:
“If you are smart enough to be a good data scientist, you can for most cases probably learn whatever domain knowledge you need in a month or two.”
“Kaggle competitions have shown over and over again that good machine learning beats experts.”
“When hiring data scientists, I am more interested in somebody having worked in many industries than in somebody having experience in mine.”
Or I just let my personal credentials speak for themselves: I have won 5 data mining competitions without being an expert on breast cancer, the yeast genome, CRM in telecom, Netflix movie reviews, or hospital management.
All this might suggest that the answer is no. I in fact would say NO in the usual interpretation of domain knowledge. But here is where things change dramatically:
I do not need to know much about the domain in general, BUT I need to understand EVERYTHING about how the data got created and what it means. Is this domain knowledge? Not really – if you talk to a garden-variety oncologist, he or she will be near useless at explaining the details of the fMRI dataset you just got. The person you need to talk to is probably the technician who understands the machine and all the data processing that is happening in there, including stuff like calibration.
Q3. What is your experience with data blending? (*)
I have to admit that I had never heard of the concept of ‘data blending’ before reading the reference. After some digging, I am contemplating whether it is simply a sales pitch for some ‘new’ capability of a big data solution, or a somewhat general attempt to cover a broad class of feature construction and ‘annotation’ techniques based on some form of ‘fuzzy join’, where you do not have the luxury of a shared key. Giving it the benefit of the doubt, I will go with the second. There are a few ways to look at the need for adding (fuzzy or not) data:
1) On the most abstract level it is a form of feature construction. In my experience, features often trump algorithms – so I am a huge fan of feature construction. And if you are doing the predictive modeling right, the model will tell you whether your blending worked or not. So you really have little to lose, and you can try all kinds of blending, even of the same information. This tends to be the most time-consuming (and, in my case, most fun) part of modeling, so having some tools that simplify this and in particular allow for fuzzy matches and automated aggregation would be neat …
2) Let me put on my philosopher’s hat: All you do with blending is navigating the bias-variance tradeoff in the context of the limitations of the expressiveness of your current model. Most often the need to blend arises around identifiers of events/entities. Say you have a field that is ZIP code (or Name). You might want to blend some actual features of ZIP codes at a certain time – so you are really just dealing with some identity of a combination of time and space. You can add in some census data based on ZIP and date and hope that this improves your model. But in some information theoretical sense, you in fact did not add any data. ZIP and date implicitly contain all you need to know (think of it as a factor). In a world of infinite data you do not need to bring in that other stuff because a universal function approximator can learn all you can bring in directly from the ZIP date combination.
This of course only works in theory. In practice it matters how often that ZIP and date appear in your training set and whether your model can deal naturally with interaction effects, which for instance linear models cannot unless you add them. In order to learn anything from it, it has to appear multiple times. If it does not – blending in information de facto replaces the super high-dimensional identifier space (ZIP and date combination) with a much lower common space of say n features (average income, etc). So in terms of bias variance, you just managed a huge variance reduction but you may also have lost all the relevant information (huge increase of bias): say some hidden feature like the occurrence of a natural catastrophe that was not available in the blend but that ZIP and date as a combination was a good proxy for …
In terms of related experience, I in fact spent a good 3 years (my dissertation) on something very closely related. It was not so much the ‘fuzzy’ part but rather the practical question of how to automatically create features in multi-relational databases. I did assume that the link structure (keys) used to join between tables was known. We published this work and some conceptual thoughts around the role of identifiers in the Machine Learning Journal: Distribution-based aggregation for relational learning with identifier attributes.
And then I spent a good 3 years at IBM wrangling the data annotation problem for company names. We had to build a propensity model for IBM sales accounts. While all kinds of internal info was available for an account, we had no external set of features. Each account was linked somehow to a real company. However, that match was fuzzy at best. What we needed for a model was some information about industry, size, revenue, etc. So in this case, each ‘identifier’ is unique in my dataset and that nice theory gets me nothing. The match between accounts and Dun & Bradstreet entities was something of an n-to-m string matching problem. For a while we used their matching solution and eventually replaced it with our own (that took us a good 2 years).
In the end we were wrong about the match for about 15% of the accounts. The project nevertheless won a good number of awards, internally and externally. We also published a lot of the methodology on the modeling side, but of course the hard matching part was not scientific enough …
Q4. How do you go about feature engineering?
I will take a shot at this from my experience winning KDD Cups back in the day (2007–09, and the publications I have on this). But before I go into the tricks of the trade, beware that what one does for competitions is not necessarily the right thing for building a model ‘in the real world’.
First off, in principle, if you have a universal function approximator (neural networks, decision trees – also read about VC dimensions if you want to know more) – a model that can express anything – and infinite data, you do not really need to worry about feature engineering. This is basically what we see with the advances of deep learning. Decades of research on feature construction from images have become entirely obsolete …
But in reality you usually do not have the luxury of ‘infinite data’. And all of a sudden having a super powerful model class that can express anything is no longer the obvious best choice (see my answer on my preferred algorithm). The reason is the bias-variance tradeoff. With great power comes great responsibility … and, in the world of predictive modeling, overfitting. For instance, you may find yourself in a position where a linear model is the best choice. In fact, I won most of my competitions with smart feature construction for linear models.
The answer on how to construct features depends VERY MUCH on which type of model you want to use and the strategies for trees are very different from those for linear models. Essentially you are trying to make it easy for a model type to find a relationship that is there. Linear models are great at taking differences, trees are terrible at it. Say you want a model to predict whether a company is profitable – this is simply a question whether revenue is greater than cost. If you have both features, this is really easy for a linear model and really hard for a tree to learn. So you can help the tree by adding differences of pairs of numeric features if you suspect they could matter. Linear models on the other hand have a really hard time with nonlinear relationships (I know I am stating the obvious). Say in health you know that both being too heavy and too light is a problem – it is therefore a good idea to include the square of weight. What about interaction effects (the infamous XOR problem)? Same story, you need to include pairwise products in the model to make it easy for a linear model to find.
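This matching of constructed features to model class can be sketched in a few lines of NumPy; all the numbers below are made up for illustration:

```python
import numpy as np

# Made-up raw features for a handful of companies / patients.
revenue = np.array([10.0, 5.0, 8.0, 2.0])      # $M
cost    = np.array([7.0, 6.0, 3.0, 2.5])       # $M
weight  = np.array([50.0, 70.0, 90.0, 110.0])  # kg

# Help a tree take a difference it struggles with:
# profitability is simply margin > 0.
margin = revenue - cost

# Help a linear model with a non-monotonic effect
# (both too heavy AND too light are risky):
weight_sq = weight ** 2

# Help a linear model with an interaction effect (the XOR problem):
rev_x_weight = revenue * weight

X = np.column_stack([revenue, cost, margin, weight, weight_sq, rev_x_weight])
```

Each constructed column makes a relationship that is hard for one model class trivially linear (or a single split) for it.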
That also means that a good amount of domain knowledge comes to bear, first when deciding what might matter and next when thinking about whether your model could easily take advantage of that information if given it. In general – having good hypotheses about what might influence the outcome is a great place to start thinking about features.
In competitions, something else comes into play – can you find weaknesses in the data to exploit? Almost all datasets carry traces of their creation, and often by exploiting those you can get better performance. This is known as leakage. In reality you do not want to exploit leakage, because it does not really improve your generalization performance in the real world, but in competitions everything is fair game. But before you get all excited: these days Kaggle is trying very hard to remove all such leaked information.
So now to a few examples from winning competitions based on feature construction in combination with mostly linear models (logistic regression, linear SVM, etc.). We usually tried all other model classes as well, but often ended up with linear models (and good feature construction) having the best performance.
In a task to predict breast cancer based on 117 features from fMRI data our team was apparently the only one to observe that the patient ID was super predictive. So we added the patient ID range to the model and got a 10% performance increase. Finding accidental information in supposedly random identifiers is a common problem and always suggests that something is wrong in the dataset construction. BTW: we also tried to construct many other features that did not add any value.
On a telecom task with 50K anonymous features, two members of our team independently tried to do a feature ranking, and while we agreed somewhat, there were a number of features that had high mutual information but low AUC as individual contributors. This essentially means a highly non-monotonic relationship. It turned out that in some features somebody had replaced missing values with some method, and you could see spikes in the histogram. The values with spikes were highly predictive, but the linear model could obviously not learn from them – so we used decision trees on such single features as a means of discretizing the numeric feature, and fed the result into the linear models.
Common generic tricks to help linear models: discretize numeric variables, cap extreme values and add an indicator, include interaction effects, include logs/squares, replace missing values by zero and include an indicator as well as an interaction effect.
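These generic tricks can be sketched on a single made-up numeric column (the values, cap, and bin counts are arbitrary choices for illustration):

```python
import numpy as np

x = np.array([1.2, 3.4, np.nan, 250.0, 2.1, np.nan, 4.0])

# Replace missing by zero and keep an indicator, so the model
# can tell "zero" apart from "unknown".
missing = np.isnan(x).astype(float)
x_filled = np.where(np.isnan(x), 0.0, x)

# Cap extreme values and add a "was capped" indicator.
cap = 10.0
capped = (x_filled > cap).astype(float)
x_capped = np.minimum(x_filled, cap)

# Discretize into quartile bins (one-hot), letting a linear model
# fit a non-monotonic relationship.
edges = np.quantile(x_capped, [0.25, 0.5, 0.75])
bins = np.digitize(x_capped, edges)        # bin index 0..3
one_hot = np.eye(4)[bins]

# Plus an interaction effect between the two indicators, as one more example.
features = np.column_stack([x_capped, missing, capped, one_hot, missing * capped])
```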
Another broad class of feature-construction techniques deals with the very common case of relational data, where a large part of the information is either in a network structure or in 1-n or n-m relationships with the main entity for which you are trying to make a prediction. These cases are notoriously difficult, and I spent my entire PhD trying to create methods for automated feature construction.
To conclude, smart feature construction makes much more of a difference than fancy algorithms (that was until deep learning came along, now I am no longer sure …)
Q5. Can data ingestion be automated?
In the day and age of ‘Big Data’, data ingestion has to be automated on some level – anything else is out of the question.
The more interesting question is how best to automate it, and which parts of the data preparation stages can be done during ingestion. I have a very strong opinion on wanting my data as ‘raw’ as possible. So you should, for instance, NOT automate how to deal with missing data. I’d much rather know that a value was missing than have it replaced by the system. Likewise, I prefer that the highest granularity of information be maintained – consider for instance the full URL of a webpage that a consumer went to vs. keeping only the hostname (less desirable, but OK) vs. keeping only some content category. From a privacy perspective there are good arguments against the former – but tools like hashing can mitigate some of these concerns.
So let’s talk about the how: There are 3 really important parts of the automation process:
- Flexibility in sampling if the full data stream is too large: if you are dealing with 50 billion events per day, just stuffing everything into a Hadoop system is nice – but it makes later manipulation tedious. Instead, it is great to have, in addition, a process that ‘fishes out’ events of specific interest. See some of the details in a recent blog we wrote on this.
- Annotation of histories on the fly: having event logs of everything is great, but for predictive modeling I usually need features that capture the entity’s history. Joining over billions of rows every time to create a history is impossible. So part of the ingestion process is an annotation process that appends vital historical information to each event.
- Having statistical tests that evaluate whether the properties of the incoming data flow are changing, and that send alarms if, for instance, some data sources go temporarily dark. Some of this is covered here.
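The on-the-fly history annotation above can be sketched as a single pass over a time-sorted event stream; the event schema (`user`, `ts`) and the two history features are assumptions for illustration:

```python
from collections import defaultdict

def annotate(events):
    """Append per-entity history features to a time-sorted event stream
    in a single pass, without ever joining over the full event log."""
    n_seen = defaultdict(int)   # events seen so far, per entity
    last_ts = {}                # timestamp of the previous event, per entity
    annotated = []
    for ev in events:
        uid, ts = ev["user"], ev["ts"]
        annotated.append(dict(ev,
                              n_prior=n_seen[uid],
                              secs_since_last=ts - last_ts.get(uid, ts)))
        n_seen[uid] += 1
        last_ts[uid] = ts
    return annotated
```

Each event leaves the pass with its history already attached, so the downstream model never has to reconstruct it from billions of raw rows.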
Q6. How do you ensure data quality?
The sad truth is – you cannot. Much is written about data quality and it is certainly a useful relative concept, but as an absolute goal it will remain an unachievable ideal (with the irrelevant exception of simulated data …).
First off, data quality has many dimensions.
Secondly – it is inherently relative: the exact data can be quite good for one purpose and terrible for another.
Third, data quality is a very different concept for ‘raw’ event log data vs. aggregated and processed data.
Finally, and this is by far the hardest part: you almost never know what you don’t know about your data.
In the end, all you can do is your best! Scepticism, experience, and some sense of data intuition are the best sources of guidance you will have.
Q7. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
First off, one should not even have to ask whether the insight is relevant – one should have designed the analysis that led to the insight based on the relevant practical problem one is trying to solve! The answer might be that there is nothing better you can do than the status quo. That is still a highly relevant insight! It means that you will NOT have to waste a lot of resources. A negative answer counts as ‘relevant’ too – if you are running into the issue of the results of data science not being relevant, you are clearly not managing data science correctly. I have commented on this here: What are the greatest inefficiencies data scientists face today?
Let’s look at ‘correct’ next. What exactly does it mean? To me it somewhat narrowly means that it is ‘true’ given the data: did you do all the due diligence and right methodology to derive something from the data you had? Would somebody answering the same question on the same data come to the same conclusion (replicability)? You did not overfit, you did not pick up a spurious result that is statistically not valid, etc. Of course you cannot tell this from looking at the insight itself. You need to evaluate the entire process (or trust the person who did the analysis) to make a judgement on the reliability of the insight.
Now to the ‘good’. To me good captures the leap from a ‘correct’ insight on the analyzed dataset to supporting the action ultimately desired. We do not just find insights in data for the sake of it! (well – many data scientists do, but that is a different conversation). Insights more often than not drive decisions. A good insight indeed generalizes beyond the (historical) data into the future. Lack of generalization is not just a matter of overfitting, it is also a matter of good judgement whether there is enough temporal stability in the process to hope that what I found yesterday is still correct tomorrow and maybe next week. Likewise we often have to make judgement calls when the data we really needed for the insight is simply not available. So we look at a related dataset (this is called transfer learning) and hope that it is similar enough for the generalization to carry over. There is no test for it! Just your gut and experience …
Finally, ‘good’ also incorporates the notion of correlation vs. causation. Many correlations are ‘correct’, but few of them are good for the action one is able to take. The (correct) fact that a person who is sick has a temperature is ‘good’ for diagnosis, but NOT good for prevention of infection. At which point we are pretty much back to relevant! So think first about the problem, and do good work next!
(*) What is data blending, by Oleg Roderick, David Sanchez, Geisinger Data Science, ODBMS.org, November 2015