Q&A with Data Scientists: Paolo Giudici
Paolo Giudici. Born in Sondrio (Italy, 1965). Master of Science in Economics, Bocconi University of Milan (1989); Master of Science in Statistics, University of Minnesota (1990); Ph.D. in Statistics, University of Trento (1993). Post-doc research periods at the University of Bristol (1996-1997), at the University of Cambridge (1998) and at the Fields Institute, Toronto (1999).
Full Professor of Statistics and Data Science at the Department of Economics and Management of the University of Pavia (since 2007; previously Associate Professor, 2000-2006 and Assistant Professor, 1993-1999). Supervisor of about 150 Master of Science students in Economics and of 12 PhD students in Statistics and Economics. Most of them currently work in financial companies, consulting/IT companies or in academia.
Board director of Credito Valtellinese banking group and, within the board, member of the risk committee (since 2010). Research fellow at the Monetary and Economic Department of the Bank for International Settlements, Basel (since 2016). Principal investigator of several data science projects for the financial industry, in particular with: Accenture, KPMG, SAS Institute, Intesa San Paolo, Unicredit, Banco BPM, UBI, Monte dei Paschi di Siena, Banca Popolare di Sondrio, Credito Valtellinese, Metro, Mondadori, Opera Multimedia, Sky. Currently advisor for the Bank Account Based Blockchain BABB, the Deutsche Bundesbank, the Italian Statistical Institute ISTAT, the Italian Markets Authority CONSOB.
President of the scientific committee and honorary member of the Italian Financial risk management association (since 2013).
Author of about 190 publications, among which 72 in internationally refereed journals, with 2887 total citations and an h-index of 24. The corresponding research profile is that of a data scientist focused on methodological research, especially in Bayesian statistics, Graphical network models and MCMC computational methods, and on its application to financial industry problems such as Customer relationship, Operations quality and Financial risk assessment.
Q1. Is domain knowledge necessary for a data scientist?
Yes, it is necessary, as data science itself is defined at the intersection of: 1) domain knowledge; 2) machine learning and statistics; 3) computer programming.
Q2. What should every data scientist know about machine learning?
The main supervised and unsupervised methods, their statistical rationale, and how to implement them.
Q3. What are the most effective machine learning algorithms?
Unsupervised: association rules (for variables); k-means cluster analysis (for observations);
Supervised: generalised linear models (for continuous variables); classification trees/forests (for categorical variables).
Q4. Predictive Modeling: How can you perform accurate feature engineering/extraction?
In a stepwise fashion: start from the most parsimonious models, then add features only if predictive performance (e.g. measured by AUROC or lift) increases.
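As an illustration only, the stepwise idea can be sketched as a greedy forward selection scored by AUROC. Everything named here is an assumption made for the example and not part of the interview: the function names `auroc` and `forward_select`, the `min_gain` threshold, the toy data, and above all the "model", which is just the sum of the selected columns standing in for whatever model would really be fitted. For brevity the score is computed on the same data; in practice it would be measured on held-out data.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def forward_select(columns, labels, min_gain=0.01):
    """Add one feature at a time, only while AUROC improves by min_gain.

    `columns` maps a feature name to its list of values. The score of an
    observation is the plain sum of its selected features (a toy model).
    """
    selected, best = [], 0.5  # 0.5 is the AUROC of an uninformative model
    remaining = set(columns)
    while remaining:
        trial = {}
        for name in remaining:
            cols = [columns[c] for c in selected + [name]]
            scores = [sum(vals) for vals in zip(*cols)]
            trial[name] = auroc(scores, labels)
        # sorted() makes the choice deterministic when two features tie
        candidate = max(sorted(trial), key=trial.get)
        if trial[candidate] - best < min_gain:
            break  # no remaining feature improves performance enough
        selected.append(candidate)
        best = trial[candidate]
        remaining.remove(candidate)
    return selected, best


# Hypothetical toy data: "a" and "b" carry signal, "noise" does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "a": [0, 1, 2, 3, 2, 3, 4, 5],
    "b": [0, 0, 0, 0, 2, 2, 0, 0],
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],
}
selected, best = forward_select(columns, labels)
print(selected, best)  # ['a', 'b'] 1.0
```

On this toy data the procedure keeps "a" (AUROC 0.875), then "b" (AUROC 1.0), and stops before "noise" because adding it yields no gain.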
Q5. Can data ingestion be automated?
Only if domain knowledge problems are repetitive.
Q6. How do you ensure data quality?
For unsupervised problems: by checking the contribution of the selected data to between-group heterogeneity and within-group homogeneity. For supervised problems: by checking the predictive performance of the selected data.
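One possible way to make the unsupervised check concrete is to decompose the total sum of squares into a between-group and a within-group part, and to see how that split changes with the variables that are kept. The sketch below is an assumption-laden illustration: the function names, the use of squared Euclidean distance, and the four toy points are all hypothetical choices, not something the interview prescribes.

```python
def variance_decomposition(points, groups):
    """Return (between_ss, within_ss) for points assigned to groups."""
    dim = len(points[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    by_group = {}
    for point, group in zip(points, groups):
        by_group.setdefault(group, []).append(point)

    between = within = 0.0
    for rows in by_group.values():
        centre = mean(rows)
        between += len(rows) * sqdist(centre, overall)  # heterogeneity
        within += sum(sqdist(r, centre) for r in rows)  # homogeneity
    return between, within


def explained_share(points, groups):
    """Share of total variation that lies between groups (0 to 1)."""
    between, within = variance_decomposition(points, groups)
    return between / (between + within)


# Hypothetical toy data: two groups separated along the first variable only.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
groups = ["a", "a", "b", "b"]

between, within = variance_decomposition(points, groups)
print(between, within)  # 100.0 4.0

# Contribution of each variable taken on its own.
x_only_share = explained_share([(p[0],) for p in points], groups)
y_only_share = explained_share([(p[1],) for p in points], groups)
print(x_only_share, y_only_share)  # 1.0 0.0
```

Here the first variable accounts for all of the separation between groups while the second contributes only within-group spread, which is the kind of evidence that would argue for dropping it.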
Q7. When is data mining a suitable technique to use, and when not?
When there aren’t firm domain knowledge theories to be tested.
Q8. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
By testing its out-of-sample predictive performance we can check if it is correct. To check its relevance, the insights must be matched with domain knowledge models or consolidated results.
Q9. What were your most successful data projects? And why?
Many scoring models, concerning operations, finance, or marketing, in which the model results considerably improved on the predictive performance of existing methods. Or unsupervised models in which the description of the problem became clearer and more interpretable than before.
Q10. What are the typical mistakes done when analyzing data for a large scale data project? Can they be avoided in practice?
Forgetting data quality and exploratory data analysis, and rushing to apply complex models. Forgetting that pre-processing is a key step, and that benchmarking the model against simpler ones is always a necessary prerequisite.
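A minimal sketch of such a benchmark, under stated assumptions: the "simple" model predicts the training mean, the "richer" model is an ordinary least-squares line, and both are judged by mean squared error on held-out observations. The data, the even/odd split, and the function names are hypothetical and chosen only to keep the example self-contained.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def mse(preds, ys):
    """Mean squared error between predictions and observed values."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Hypothetical data: a linear trend plus small fixed perturbations.
xs = list(range(10))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

# Hold out the odd-indexed observations for testing.
train_x, train_y = xs[0::2], ys[0::2]
test_x, test_y = xs[1::2], ys[1::2]

# Simple benchmark: always predict the training mean.
train_mean = sum(train_y) / len(train_y)
baseline_mse = mse([train_mean] * len(test_y), test_y)

# Richer model: fitted line, evaluated on the same held-out data.
slope, intercept = fit_line(train_x, train_y)
model_mse = mse([slope * x + intercept for x in test_x], test_y)

print(f"baseline MSE = {baseline_mse:.3f}, model MSE = {model_mse:.3f}")
```

The richer model earns its place only if its held-out error is clearly below the baseline's. If the two are close, the simpler model is the safer choice.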
Q11. How do you know when the data sets you are analyzing are “large enough” to be significant?
When estimations and/or predictions become quite stable under data and/or model variations.
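One way to operationalise "stable under data variations" is to perturb the data slightly and watch how much the estimate moves. The sketch below uses a leave-one-out (jackknife) range of the sample mean. This particular perturbation scheme, the tolerance `tol`, and the function names are illustrative assumptions, and bootstrap resampling or refitting alternative models would serve the same purpose.

```python
def jackknife_range(data):
    """Spread of the mean when each observation is left out in turn."""
    n, total = len(data), sum(data)
    estimates = [(total - x) / (n - 1) for x in data]
    return max(estimates) - min(estimates)


def is_stable(data, tol=0.05):
    """Treat the estimate as stable if leaving one point out moves it by less than tol."""
    return jackknife_range(data) < tol


# Hypothetical data: the same repeating pattern at two sample sizes.
small = [i % 10 for i in range(10)]
large = [i % 10 for i in range(1000)]

print(is_stable(small), is_stable(large))  # False True
```

With ten observations, removing a single point can shift the mean by up to 1.0. With a thousand, the shift is below 0.01, so the estimate passes the illustrative tolerance.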
Q12. What are the differences between Data Science and Business Intelligence (BI)?
Data science follows the scientific discovery paradigm, whereas business intelligence does not necessarily do so.
Business intelligence was a buzzword when there was a lot of data, typically internal and not collected for the purpose of the analysis, and the aim was to obtain some new “data mining” insights from it. Data science starts from a domain knowledge problem, formulated as a scientific hypothesis, that can be tested with all available data, not only internal data.
Q13. How do you manage to communicate (if at all) with Data Engineers in common data projects?
By explaining to them the advantages (also in terms of computational efficiency) of good machine learning/statistical models.
Q14. How do you convince/ explain to your managers the results you derived from analyzing (Big) Data?
In terms of predictive performance or descriptive interpretation for a domain knowledge problem they have.
Q15. What is in your opinion the impact of the EU General Data Protection Regulation (GDPR) on Data Science?
Data science must consider that data protection will be embedded in the design of business processes. This implies, for instance, that models should rely more on parsimonious statistical models, defined for classes of individuals, rather than on complex models, with a close mapping to individuals. Also, data, models, algorithms and results from them should be more open and transparent than before, to be reproducible.
Q16. What are the ethical issues that a data scientist must always consider?
Be as transparent as possible in disclosing the data used, the models employed, and the results obtained.
Qx. Anything else you wish to add?
There is a strong need to develop a data science community that goes beyond traditional classifications of science, for example the divisions between statistics, machine learning, operational research and data analysis.