Q&A with Data Scientists: Wolfgang Steitz
Wolfgang Steitz, Senior Data Scientist, travel audience GmbH.
Wolfgang is currently working on real-time click prediction and user profiling at travel audience, a data-driven travel advertising platform. He holds a PhD from the University of Mainz, where he researched heuristic optimization methods, and a Master’s degree from the University of Mannheim. For more info, visit his LinkedIn.
Q1. Is domain knowledge necessary for a data scientist?
For a successful data-driven project, domain knowledge is definitely required. However, that doesn’t mean a Data Scientist needs to become a domain expert before they can start building machine learning models. It’s enough to have a subject matter expert working closely with the Data Science team. If the goal is to build a high-performing machine learning model, you can obviously do without prior domain knowledge, as proven many times in Kaggle competitions. There, Data Scientists work on (well-defined) model-building tasks in domains most of them are not familiar with. Being a fast learner and interested in the domain obviously helps.
Q2. What should every data scientist know about machine learning?
There is definitely a lot to learn in the machine learning area, and things are progressing fast. Every Data Scientist has a slightly different toolbox, and that’s actually a good thing: you are usually working in a team, and your skills and tools complement one another. Nevertheless, there are some concepts that are common no matter what problem you are solving and which approach you are taking. One of the most important is experimental methodology (How do you evaluate and compare different models? How do you tune your hyperparameters? What are different cross-validation strategies? etc.). In my experience, that’s what makes a good Data Scientist.
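As a minimal sketch of that methodology (assuming scikit-learn, with a synthetic dataset and model choices that are purely illustrative), one might compare candidate models and tune hyperparameters under a single fixed cross-validation scheme:

```python
# Illustrative only: compare two models and tune one of them, keeping the
# same cross-validation scheme throughout so scores stay comparable.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Evaluate and compare candidate models under identical folds.
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(type(model).__name__, round(scores.mean(), 3), "+/-", round(scores.std(), 3))

# Tune hyperparameters with the same folds and metric.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=cv,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```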
Q3. What are the most effective machine learning algorithms?
The choice of algorithm strongly depends on the problem you are trying to solve, the amount of data you can play with, and the computational resources you can access. There is no algorithm to rule them all! Nevertheless, I often recommend using tree-based models, especially gradient boosted trees and the xgboost package. These methods often work quite well, as demonstrated in various Data Science competitions on Kaggle. Besides good performance, boosted trees have some more advantages that make them easy to apply: a) they usually cope with your data as is, with no need to normalize or worry about co-linearity, etc.; b) they are easy to understand and communicate; c) the model is able to find interactions automatically; d) they are quite robust. Overall, this makes boosted trees a good model to start with.
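A minimal sketch of getting started with the xgboost package might look like the following; the dataset and hyperparameters are purely illustrative:

```python
# Illustrative xgboost example on synthetic data. Note there is no feature
# scaling: trees split on raw values, so the data can be used as is.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

print("test AUC:", round(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]), 3))
# Feature importances support the "easy to understand and communicate" point.
print(model.feature_importances_[:5])
```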
Q4. What is your experience with data blending?
Combining different data sources is one of the central tasks when working on a real-world Data Science problem. I don’t remember one project where all the information was contained in one nice data set. Often, there are several data sets you could potentially use, each with its own advantages and drawbacks. For example, one might have a higher granularity while another covers a wider range of cases. And another one might add some additional details that you want to use in your model. Combining all of those should obviously improve whatever you plan to do with the data afterwards. I actually worked on projects where the whole purpose was creating a process to blend data sets ad hoc. However, there are some things one needs to consider when combining data sets. It’s a huge increase in complexity; don’t underestimate the amount of work and maintenance it brings. And while in general your combined data set is an improvement, there are most likely some rare cases where you actually worsened things. If someone spots these, they are usually hard to explain ("Why are you not using the correct number from data set A?"). Overall, I think one needs to trade off the benefits (more accuracy, wider reach, more details, etc.) against the complexity and potential errors it brings.
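A toy illustration of this trade-off, assuming pandas and entirely hypothetical column names and values:

```python
# Blend two hypothetical data sets: one detailed but narrow, one wide but coarse.
import pandas as pd

# Data set A: higher granularity, narrower coverage.
a = pd.DataFrame({"user_id": [1, 2, 3], "clicks": [10, 4, 7]})
# Data set B: wider coverage, fewer details.
b = pd.DataFrame({"user_id": [1, 2, 3, 4, 5], "country": ["DE", "DE", "FR", "ES", "DE"]})

# A left join on B keeps the wide coverage and adds detail where available.
blended = b.merge(a, on="user_id", how="left")

# The added complexity shows up immediately: some rows now lack details,
# and these rare cases are exactly the ones that are hard to explain later.
print(blended[blended["clicks"].isna()])
```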
Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?
Feature engineering is the crucial part of building a successful model. The key is to have a proper evaluation strategy, so you can reliably decide whether a feature is helping or not. Once you have that, you can get creative. Don’t hesitate to try some unintuitive features, as intuitive features are not necessarily the best ones.
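A minimal sketch of that decision, assuming scikit-learn and a hypothetical engineered feature (the product of two raw columns), is to score the model with and without the candidate under the same cross-validation scheme:

```python
# Compare model scores with and without a candidate feature; the feature
# itself (a column product) is purely hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
candidate = (X[:, 0] * X[:, 1]).reshape(-1, 1)
X_plus = np.hstack([X, candidate])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = GradientBoostingClassifier(random_state=0)

base = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
extra = cross_val_score(model, X_plus, y, cv=cv, scoring="roc_auc").mean()
print(f"without feature: {base:.3f}, with feature: {extra:.3f}")
```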
Q6. Can data ingestion be automated?
Yes, data ingestion can be automated, and it should be. The more automated your pipeline, the easier it is to try out new things and to replicate previous experiments. An automated pipeline takes more work to set up, but it definitely pays off during the course of the project. It also makes it easier to productize your project afterwards.
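One simple way to start, sketched here with pandas and hypothetical paths and column names, is to wrap the ingestion step in a small parameterized function that a scheduler (cron, Airflow, etc.) can call, so every run applies the same transformations:

```python
# A repeatable ingestion step; file paths and column names are hypothetical.
import pandas as pd

def ingest(source_path: str, target_path: str) -> pd.DataFrame:
    """Load raw data, apply the same transformations on every run, persist."""
    df = pd.read_csv(source_path, parse_dates=["timestamp"])
    df = df.drop_duplicates().sort_values("timestamp")
    df.to_parquet(target_path, index=False)
    return df

# Calling the same function for every experiment keeps results reproducible:
# ingest("raw/clicks.csv", "clean/clicks.parquet")
```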
Q7. How do you ensure data quality?
It’s good practice to start with some exploratory data analysis before jumping to the modeling part. Plotting some histograms and some time series is often enough to get a feeling for the data and to learn about potential gaps in the data, missing values, data ranges, etc. In addition, you should know where the data is coming from and what transformations it went through. Once you know all this, you can start filling the gaps and cleaning your data. Perhaps there is even another data set you want to take into account. For a model running in production, it’s a good idea to automate some data quality checks. These checks could be as simple as checking whether the values are in the correct range or whether there are any unexpected missing values. And of course, someone should be automatically notified if things go wrong.
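A minimal sketch of such automated checks, with hypothetical column names, value ranges, and alerting:

```python
# Simple automated quality checks: expected ranges and unexpected missing values.
import pandas as pd

def check_quality(df: pd.DataFrame) -> list:
    problems = []
    # Unexpected missing values.
    if df["price"].isna().any():
        problems.append("missing values in 'price'")
    # Values outside the expected range.
    if not df["price"].dropna().between(0, 10_000).all():
        problems.append("'price' outside expected range [0, 10000]")
    return problems

df = pd.DataFrame({"price": [19.99, None, -5.0]})
for problem in check_quality(df):
    # In production, this would trigger a notification (email, Slack, ...).
    print("ALERT:", problem)
```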
Q8. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
Presenting results to domain experts and your customers usually helps. Try to get feedback early in the process to make sure you are working in the right direction and the results are relevant and actionable. Even better, collect expectations first, so you know how your work will be evaluated later on.