Q&A with Data Scientists: Christopher Schommer
Professor Schommer studied Computer Science at the University of Saarbrücken and at the German Research Centre for Artificial Intelligence. In 2000, he received his PhD degree from the Goethe-University of Frankfurt/Main.
From 1997 until 2003, Professor Schommer worked for IBM Research and Development as an IT Architect in several Business Intelligence projects worldwide. Since 2003, he has been an Associate Professor at the University of Luxembourg and an international expert in Data Science.
Q1. Is domain knowledge necessary for a data scientist?
Absolutely yes! It is needed, for example, in situations where additional (potentially unknown) data needs to be added, where seasonal effects emerge, and where decisions have to be made about the preparation of the data in general. Knowledge about cultural differences is also very important, for example, concerning the color schemes applied in visualisations.
Q2. What should every data scientist know about machine learning?
I personally see Machine Learning as an attractor of the data life cycle and the heart of every data science process. All this data, whether in a raw or in a prepared state, will end up there at some point. From that point of view, I believe that each data scientist should not only be a specialist in his or her own field but should also have a toolkit that makes it possible to put oneself in a Machine Learner’s place.
Moreover, I believe that each data scientist who works with the results of Machine Learning processes should be prepared to understand them.
Q3. What are the most effective machine learning algorithms?
Simplicity, comprehensibility, and power are, in my eyes, the most convincing arguments for effectiveness. For that reason, the Apriori algorithm (Association Discovery), C4.5 (Decision Tree; Classification), and k-means (Clustering) are to me the most effective algorithms. This is also confirmed by a vote among experts, which was published during an ICDM 2006 panel session (organised by Xindong Wu and Vipin Kumar): there, C4.5, k-means and Apriori were ranked among the top four (together with SVM, which appeared in third position).
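To illustrate the simplicity argument, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The function name `kmeans`, the toy points, and the fixed starting centroids are assumptions made for this example and do not come from the interview.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical toy data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

With these starting centroids the sketch separates the two groups of three points each. In practice the result depends on the initial centroids, which is why real implementations usually try several initialisations.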
Q4. What is your experience with data blending? (*)
It is a very tedious, but responsible, process, particularly because this kind of data fusion takes place before the analytics, but after a (potentially) stable data architecture has been constructed. Working with data blending therefore means, to me, risking the rebuilding and changing of a running data system on the one hand, while ensuring data quality for further processing on the other. Working on such a bridge does not seem to be an easy task.
Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?
A certain number of factors should be taken into account. First, predictive modeling should not be performed alone but in a team, which itself should be composed of experts from different fields (domain experts, statisticians, data engineers, machine learning experts). Second, it should be clear that predictive patterns do not necessarily imply causality. It is wise to check the results critically and to involve domain experts (see question 1). Third, a developed predictive model is not necessarily the best one. Instead, alternatives should be developed and tested under different conditions.
Q6. Can data ingestion be automated?
I do not think that automation will be complete, ranging from collecting data to deciding which data can/should be stored and which data can/should be removed. But I believe that, particularly in the age of big data/texts, a symbiosis of human (data) care, high-performance computing, and the right use of AI-related inventions (e.g., robots, self-healing) may become highly effective.
Q7. How do you ensure data quality?
Maintaining data quality is mostly an adaptive process, for example, because provisions of national law may change or because the analytical aims and purposes of the data owner may vary. Therefore, data quality assurance should be performed regularly, should be consistent with the law (data privacy aspects and others), and should be performed jointly by a team of experts with different educational backgrounds (e.g., data engineers, lawyers, computer scientists, mathematicians).
Q8. When is data mining a suitable technique to use and when not?
The serious application of data mining requires a sufficient amount of data. As a reminder, the idea of exploring data has existed for hundreds, even thousands, of years! It is dangerous and wrong to infer and interpret causality if a solid quantity of data is missing. A second aspect is that Data Mining does not follow a straightforward standard procedure but still follows a data-driven principle: from that perspective, each data mining project is new in itself and requires individual treatment.
It should also be remembered that “if there is nothing in the data, then nothing can be found”. This last point, too, should be respected. Apart from that, Data Mining is one of several directions in the Data Life Cycle: analytical results should always be verified with other methods.
Q9. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
In my understanding, an insight is already valuable, evaluated information, which has been obtained after a detailed interpretation and which can be used for any kind of follow-up activity, for example to relocate the merchandise or to dig deeper into clusters showing fraudulent behavior.
However, it is less opportune to rely only on statistical values: an association rule that shows a conditional probability of, e.g., 90% or more may be an “insight”, but if the right-hand side of the rule refers only to a plastic bag (which costs 3 cents, at least in Luxembourg), the discovered pattern might be uninteresting.
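The plastic-bag case can be made concrete with a small sketch. It computes support and confidence (the conditional probability mentioned above) for a rule, plus lift, a standard interestingness measure that is added here as an assumption and is not named in the interview. The basket data and the function name `rule_metrics` are hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    n_consequent = sum(1 for b in baskets if consequent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / n                    # P(antecedent and consequent)
    confidence = n_both / n_antecedent      # P(consequent | antecedent)
    lift = confidence / (n_consequent / n)  # confidence relative to P(consequent)
    return support, confidence, lift


# Hypothetical toy data: every customer also buys a plastic bag.
baskets = [
    {"bread", "milk", "plastic bag"},
    {"bread", "plastic bag"},
    {"milk", "plastic bag"},
    {"bread", "butter", "plastic bag"},
    {"wine", "plastic bag"},
    {"bread", "milk", "butter", "plastic bag"},
    {"cheese", "plastic bag"},
    {"bread", "cheese", "plastic bag"},
    {"milk", "butter", "plastic bag"},
    {"wine", "cheese", "plastic bag"},
]

support, confidence, lift = rule_metrics(baskets, {"bread"}, {"plastic bag"})
```

In this toy data the rule "bread -> plastic bag" reaches a confidence of 100%, yet its lift is 1. Buying bread therefore says nothing about the plastic bag that is not already true of every basket, so the rule is statistically strong but uninteresting.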
Q10. What were your most successful data projects? And why?
The most successful projects I have been involved in are certainly those where an effect could be seen immediately after some kind of action had been taken. In this regard, I remember a business project on the detection of fraud in telco data as well as a number of diverse Market Basket Analysis projects, where the customers’ behavior and profiling patterns were used to improve customer satisfaction.
Q11. What are the typical mistakes made when analyzing data for a large scale data project? Can they be avoided in practice?
I believe that missing expertise, inadequate communication among the team members, and a preference for quick-and-dirty solutions are serious problems. Personally, I am not a friend of sampling. The reason is that interesting data patterns may disappear and that a subset of the data does not necessarily reflect the overall data structure. Also, statistical values should not have the final word (see question 9) and should not be the only reason for an insight. The analysis of data is also a multi-disciplinary and multi-cultural concern and should be performed accordingly.
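The concern about sampling can be quantified with a short sketch. Assuming simple random sampling without replacement, it computes the probability that a sample contains none of the records that carry a rare pattern. The numbers (10,000 records, 5 of them rare, a 1% sample) and the function name `prob_sample_misses` are hypothetical.

```python
from math import comb


def prob_sample_misses(total, rare, sample_size):
    """Probability that a simple random sample (without replacement)
    contains none of the `rare` records, via the hypergeometric distribution."""
    return comb(total - rare, sample_size) / comb(total, sample_size)


# Hypothetical numbers: 5 interesting records among 10,000, and a 1% sample.
p_miss = prob_sample_misses(total=10_000, rare=5, sample_size=100)
print(f"Chance that the sample misses the rare pattern entirely: {p_miss:.1%}")
```

Under these assumptions the sample misses the rare pattern in roughly 95% of cases, which shows how rare but interesting patterns can disappear from a subset of the data.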
Q12. What are the ethical issues that a data scientist must always consider?
Because of the masses of data sensors, which emerge day by day and which affect both the loading and the further use of data, each data scientist bears a kind of responsibility in terms of, e.g., data correctness, data privacy, and data availability. As a central ethical issue, this (new) burden of work should be accepted; each data scientist should be aware of that. Data Science should also be seen as a chance to do something good and meaningful.