Q&A with Data Scientists: Dirk Tassilo Hettich
Dirk Tassilo Hettich holds a PhD in Computer Science, as well as MSc and BSc degrees in Bioinformatics from the University of Tübingen, Germany, where he worked closely with neuroscientists on applied machine learning in the context of human-computer interaction and affective computing. After researching brain-computer interfacing for communication and control for almost 10 years, Dirk Tassilo decided it was time to see how big data and advanced analytics apply in an economic context and joined the Tax Technology & Analytics team at EY Stuttgart, led by Florian Buschbacher, in March 2016. Since then he has applied his software development, machine learning, and visualization expertise in multiple client projects as project and application lead. He is continually amazed by the real-world potential of artificial intelligence.
For further information about Dirk Tassilo’s work at EY visit Tax Technology & Analytics at EY Stuttgart.
Dirk Tassilo tweets @dthettich and you can visit his website or reach out to him on XING or LinkedIn.
Q1. Is domain knowledge necessary for a data scientist?
I get this question a lot and I always answer: "a critical mass of domain knowledge is required". As in physics, a critical mass is the smallest amount required to sustain a (nuclear) chain reaction. To me, exactly this happens with domain knowledge in data science: it facilitates a chain reaction in analyses. Furthermore, domain knowledge is critical for understanding the client's requirements of the data science task at hand.
Q2. What should every data scientist know about machine learning?
Well, there are a lot of points, but the Machine Learning course by Andrew Ng is a wonderful resource for establishing common ground. Try to understand the history, since the math has been around for some time (e.g. Arthur Samuel in 1950 with linear discriminant analysis). Besides that, machine learning itself follows a modular workflow with certain repeating patterns. Get the basics right: supervised and unsupervised learning; classification, regression, clustering, feature analysis (basic statistics), and performance testing and metrics (e.g. accuracy, F1-scores, AUC values).
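To make two of the metrics mentioned above concrete, here is a minimal from-scratch sketch of accuracy and the F1-score for a binary classification task (the labels are made up for illustration; in practice a library such as scikit-learn provides these):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy labels, purely illustrative
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)   # 4 of 6 correct
f1 = f1_score(y_true, y_pred)
```

Accuracy alone can mislead on imbalanced classes, which is why the F1-score (and AUC) belong in the same toolbox.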
Q3. What are the most effective machine learning algorithms?
There is no single most effective algorithm. I guess everybody in the data science community agrees that the "right" features and data are key to successful, effective machine learning. The real question is: what do you want to achieve?
Then choose your tool accordingly. That said, I am a big fan of linear support vector machines for exploratory as well as applied machine learning, because their inner workings are relatively easy to understand (e.g. the relations between features and support vectors). Random forests also work wonderfully on mixed data (i.e. numbers and categories). Since I also have a neuroscience background, deep learning is highly fascinating to me, and the way I see things it is also very promising for the near future.
Q4. What is your experience with data blending? (*)
The ETL process of aggregating multiple data sources can be labor-intensive, because the real world usually is not well organized (e.g. think of different granularity across samples, failing systems, or missing data), yet it is usually required in an industrial context (i.e. data lakes). At the same time, if multiple sources are already available, why not use them as features? I'm not saying the more features the better. On the contrary, I strongly support the parsimony principle and usually seek the smallest set of highly descriptive features.
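As a small illustration of the blending step described above, here is a hedged pandas sketch (the tables and column names are invented): two sources of different granularity are joined, and a left join keeps unmatched rows visible as NaN instead of silently dropping them.

```python
import pandas as pd

# Hypothetical sources: transactions plus a master table that is incomplete.
transactions = pd.DataFrame({
    "account": ["A", "A", "B", "C"],
    "amount": [100.0, 50.0, 200.0, 75.0],
})
master = pd.DataFrame({
    "account": ["A", "B"],            # note: account "C" is missing here
    "segment": ["retail", "corporate"],
})

# Left join: every transaction survives; gaps in the master table
# surface as NaN in the "segment" column and can be handled explicitly.
blended = transactions.merge(master, on="account", how="left")
```

Making the gaps explicit like this is what turns a labor-intensive ETL chore into something that can be checked and automated.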
Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?
Think, relate to the real world, and be practical! (This usually depends on the domain you are in.) As for feature extraction, there are a lot of useful methods; I have had good experiences with the Pearson correlation coefficient.
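A minimal sketch of that idea, with invented data: rank candidate features by their absolute Pearson correlation with the target, keeping only the strongest ones.

```python
import numpy as np

# Synthetic example: one feature carries signal, one is pure noise.
rng = np.random.default_rng(0)
target = rng.normal(size=100)
features = {
    "informative": target * 2.0 + rng.normal(scale=0.1, size=100),
    "noise": rng.normal(size=100),
}

# Absolute Pearson correlation of each candidate feature with the target.
scores = {
    name: abs(np.corrcoef(values, target)[0, 1])
    for name, values in features.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Note that Pearson correlation only captures linear relations; a feature with a strong nonlinear relation to the target could score poorly here.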
Q6. Can data ingestion be automated?
Sure, anything can be automated given current state-of-the-art technological capabilities.
Q7. How do you ensure data quality?
Understand the data at hand by visual inspection. Ideally, browse through the raw data manually, since our brain is a superbly powerful outlier detection apparatus. Do not try to check every value; just get an idea of what the raw data actually looks like! Then look at the basic statistical moments (e.g. summary numbers and boxplots) to get a feeling for how the data is distributed.
Once patterns are identified, parsers can be derived that apply certain rules to incoming data in a productive system.
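Such a rule can be as simple as the interquartile-range fence a boxplot draws. A small sketch with made-up measurements:

```python
import numpy as np

# Hypothetical sensor readings with one obvious glitch.
values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 55.0])

# Basic moments for a first feeling of the data.
mean, std = values.mean(), values.std()

# The same 1.5*IQR rule a boxplot uses to mark outliers.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
```

In a productive system the same check becomes a parser rule: incoming values outside the fence are quarantined for inspection rather than ingested blindly.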
Q8. What are the typical mistakes done when analyzing data for a large scale data project? Can they be avoided in practice?
Define an analysis workflow that consists of distinct modules. Think, implement, try, repeat! Automate anything that should be automated (e.g. ETL, pre-processing, model training/testing, metrics, graphs, logs). Work on representative subsets of the data before going all in. Obviously, such workflow definitions become easier with more experience, so discuss the design with more experienced colleagues or data scientists you can reach.
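The modular-workflow idea above can be sketched in a few lines; the stage names and stages themselves are illustrative assumptions, not a fixed recipe. Each stage is a plain function, so every step can be tested, swapped, or automated on its own.

```python
def extract(raw_rows):
    """ETL stage: keep only rows that parse cleanly (here: drop None)."""
    return [r for r in raw_rows if r is not None]

def preprocess(rows):
    """Pre-processing stage: e.g. simple min-max scaling to [0, 1]."""
    lo, hi = min(rows), max(rows)
    return [(r - lo) / (hi - lo) for r in rows]

def subset(rows, fraction=0.5):
    """Work on a representative subset before going all in."""
    return rows[: max(1, int(len(rows) * fraction))]

def run_pipeline(raw_rows):
    # Distinct modules chained in one place: think, implement, try, repeat.
    return subset(preprocess(extract(raw_rows)))

result = run_pipeline([4.0, None, 8.0, 6.0, None, 2.0])
```

Because the stages are decoupled, replacing the subset step with the full data for the final run is a one-line change rather than a rewrite.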
Q9. How do you know when the data sets you are analyzing are “large enough” to be significant?
Very important! I understand the question like this: how do you know that you have enough samples? There is no single formula for this; in classification it heavily depends on the number and distribution of the classes you are trying to classify. From a performance analysis point of view, one should ask how many samples are required in order to successfully perform n-fold cross-validation. There is also extensive work on permutation testing of machine learning performance results. Of course, Cohen's d for effect size and/or p-values deliver a framework for such an assessment.
Not to advertise too much, but I wrote about exactly this topic in Section 2.5.
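A minimal sketch of the cross-validation angle above, with invented labels: n-fold cross-validation only makes sense if every fold can still contain each class, so check the per-class counts against the number of folds before running it.

```python
import numpy as np

def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for plain n-fold cross-validation."""
    indices = np.arange(n_samples)
    for fold in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, fold)
        yield train, fold

# Toy binary labels; in practice these come from the real dataset.
labels = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])
n_folds = 5

# Sanity check: with fewer samples per class than folds, some test
# folds would inevitably miss a class entirely.
enough = all(count >= n_folds for count in np.bincount(labels))
folds = list(kfold_indices(len(labels), n_folds))
```

For a stricter guarantee one would use stratified folds (e.g. scikit-learn's StratifiedKFold), which balance the class distribution within every fold rather than just checking counts.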
Q10. What are the differences between Data Science and Business Intelligence (BI)?
Data science is required for delivering BI.
Q11. What are the ethical issues that a data scientist must always consider?
“With great power comes great responsibility.” – Linux terminal, sudo session. With access to big data resources and potentially sensitive information comes great responsibility for data engineers and InfoSec, but also for data scientists.