Q&A with Data Scientists: Natalino Busa
Natalino is Head of Data Science at Teradata, where he leads the definition, design and implementation of data-driven financial applications. He has served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.
Prior to that, he worked as a senior researcher at Philips Research Laboratories in the Netherlands on system-on-a-chip architectures, distributed computing and compilers. He is an all-round Technology Manager, Product Developer, and Innovator with a 15+ year track record in research, development and management of distributed architectures, scalable services and data-driven applications.
Q1. Is domain knowledge necessary for a data scientist?
In my opinion it’s not necessary, but it does no harm either. You can produce accurate models without having to understand the domain, but most of the time some domain knowledge will speed up the process of selecting relevant features and will provide a better context for knowledge discovery in the available datasets.
Q2. What should every data scientist know about machine learning?
First of all, I would say that a good basic knowledge of algebra and of matrix and tensor math is an absolute must. Let’s not forget that datasets, after all, can be handled as large matrices! More specifically on the topic of machine learning, a data scientist needs a good understanding of how bias and variance affect the outcome of predictive models, a subject very closely related to the problem of fitting and regularization; of how to make the most of the available data while training a model using cross-validation techniques; and of data bootstrapping and bagging. I also believe that cost-based, gradient iterative optimization methods are a must, as they implement the “learning” for four very powerful classes of machine learning algorithms: GLMs, boosted trees, SVMs and kernel methods, and neural networks. Last but not least, an introduction to Bayesian statistics as many
Q3. What are the most effective machine learning algorithms?
Regularized Generalized Linear Models; ANNs, which are in fact a further generalization of GLMs; and boosted trees and random forests. In general, any method which has strong generalization qualities. I am also very interested in dimensionality reduction and unsupervised machine learning algorithms, such as t-SNE, OPTICS, and TDA.
Q4. What is your experience with data blending?
Blending data from different domains or sensors works well when the blended-in data increases the explanatory power of the model. This is not always easy to assess beforehand, as the blended data could be uncorrelated with the original data.
Data blending can be valuable, but it can also generate spurious correlations. It’s therefore very important to validate the trained model carefully, using cross-validation and other statistical methods such as analysis of variance on the augmented dataset.
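As a rough illustration of that validation step, the sketch below uses k-fold cross-validation to check whether a candidate blended feature actually adds explanatory power. Everything in it is an assumption made for the example: the synthetic data, the names informative and unrelated, and the choice of a one-variable least-squares model. It is not a description of any particular production setup.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0  # constant feature: fall back to the mean
    return mean_y - slope * mean_x, slope

def cv_mse(xs, ys, k=5):
    """Mean squared error on held-out folds (interleaved k-fold split)."""
    n = len(ys)
    squared_error = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train = [i for i in range(n) if i not in held_out]
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        squared_error += sum((ys[i] - (intercept + slope * xs[i])) ** 2
                             for i in held_out)
    return squared_error / n

# Synthetic data (assumption): the target depends on one candidate feature only.
rng = random.Random(0)
n = 100
informative = [rng.uniform(0, 10) for _ in range(n)]  # blended feature that matters
unrelated = [rng.uniform(0, 10) for _ in range(n)]    # blended feature that does not
target = [2.0 * x + rng.gauss(0, 1) for x in informative]

mse_baseline = cv_mse([0.0] * n, target)  # no feature: predicts the training mean
mse_informative = cv_mse(informative, target)
mse_unrelated = cv_mse(unrelated, target)
print(mse_baseline, mse_informative, mse_unrelated)
```

A blended feature is worth keeping only if its held-out error is clearly below the baseline; a feature that merely matches the baseline adds no explanatory power and may invite spurious correlations.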
Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?
Let’s tackle feature extraction and feature engineering separately. Extraction can be as simple as getting a number of fields from a database table, and as complicated as extracting information from a scanned paper document using OCR and image processing techniques. Feature extraction can easily be the hardest task in a given data science engagement.
Extracting the right features and raw data fields usually requires a good understanding of the organization, the processes and the physical/digital data building blocks deployed in a given enterprise. It’s a task which should never be underestimated, as the predictive model is usually only as good as the data used to train it.
After extraction comes feature engineering, which usually consists of a number of data transformations, oftentimes dictated by a combination of intuition, data exploration, and domain knowledge. Engineered features are usually added to the original sample features and fed as input to the predictive model. Before the renaissance of neural networks and hierarchical machine learning, feature engineering was actually required, as the models were too shallow to properly transform the input data in the model itself. For instance, decision trees can only split data areas along the features’ axes; therefore, to correctly classify donut-shaped classes you will need feature engineering to transform the space to polar coordinates.
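The donut example can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: the data is synthetic (an inner disc surrounded by a ring), and a single axis-aligned threshold (a decision stump, the shallowest possible tree) stands in for a decision tree. The helper names make_donut and best_stump_accuracy are hypothetical.

```python
import math
import random

def make_donut(n_per_class=200, seed=0):
    """Inner disc (class 0) surrounded by a ring (class 1)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (r_min, r_max) in enumerate([(0.0, 1.0), (2.0, 3.0)]):
        for _ in range(n_per_class):
            r = rng.uniform(r_min, r_max)
            theta = rng.uniform(0.0, 2.0 * math.pi)
            points.append((r * math.cos(theta), r * math.sin(theta)))
            labels.append(label)
    return points, labels

def best_stump_accuracy(values, labels):
    """Best accuracy achievable with one threshold on a single feature."""
    n = len(values)
    best = 0.0
    for threshold in sorted(set(values)):
        # Rule: predict class 1 when value > threshold; also try the flipped rule.
        correct = sum((v > threshold) == bool(y) for v, y in zip(values, labels))
        best = max(best, correct / n, (n - correct) / n)
    return best

points, labels = make_donut()
xs = [p[0] for p in points]
ys = [p[1] for p in points]
radius = [math.hypot(x, y) for x, y in points]  # engineered polar feature

acc_x = best_stump_accuracy(xs, labels)
acc_y = best_stump_accuracy(ys, labels)
acc_r = best_stump_accuracy(radius, labels)
print(acc_x, acc_y, acc_r)
```

No single split along x or y separates the two classes, while one split on the engineered radius does. A deeper tree could approximate the ring with many axis-aligned cuts, but the polar feature makes the problem trivial.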
In recent years, however, models usually have multiple layers, and machine learning experts are deploying increasingly deep models. Those models can usually “embed” feature engineering as part of the internal state representation of data, rendering manual feature engineering less relevant. For some examples applied to text, check the section “Visualizing the predictions and the “neuron” firings in the RNN” in The Unreasonable Effectiveness of Recurrent Neural Networks.
These models are also usually referred to as “end-to-end” learning, although this definition is still vague and not unanimously accepted in the AI and Data Science communities. So what about feature engineering today? Personally, I do believe that some feature engineering is still relevant to build good predictive systems, but it should not be overdone, as many features can now be learned by the model itself, especially in the audio, video, text, and speech domains.
Q6. Can data ingestion be automated?
Yes, but beware of metadata management. In particular, I am a big supporter of “closed loop” analytics on metadata, where changes in the data source format or semantics are detected by means of analytics and machine learning on the metadata itself.
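A very small version of this idea is to profile each incoming batch and compare its metadata with a reference profile. The sketch below only detects format changes (added, removed, or retyped fields); detecting semantic changes would need statistics on the values themselves. The function names and the sample records are hypothetical and chosen purely for illustration.

```python
def profile(records):
    """Map each field name to the set of Python type names observed."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

def schema_drift(reference, current):
    """Compare the metadata profile of a new batch with a reference batch."""
    ref, cur = profile(reference), profile(current)
    return {
        "added": sorted(cur.keys() - ref.keys()),
        "removed": sorted(ref.keys() - cur.keys()),
        "retyped": sorted(f for f in ref.keys() & cur.keys() if ref[f] != cur[f]),
    }

# Hypothetical batches from the same source, before and after a format change.
reference_batch = [{"id": 1, "amount": 9.5, "country": "NL"}]
current_batch = [{"id": "1", "amount": 9.5, "currency": "EUR"}]

drift = schema_drift(reference_batch, current_batch)
print(drift)
```

In a closed loop, a non-empty drift report would pause or reroute the automated ingestion until the change has been reviewed.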
Q7. How do you ensure data quality?
I tend to rely on the “wisdom of the crowd” by implementing similar analyses using multiple techniques and machine learning algorithms. When the results diverge, I compare the methods to gain insight into the quality of both the data and the models. This technique also works well to validate the quality of streaming analytics: in this case the batch historical data can be used to double-check the result in streaming mode, providing, for instance, end-of-day or end-of-month reporting for data correction and reconciliation.
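The batch-versus-streaming check can be sketched as follows, assuming the streaming job maintains an incrementally updated mean and the batch job recomputes the same statistic from stored history. The class RunningMean, the function reconcile, and the event values are illustrative assumptions.

```python
import math
import statistics

class RunningMean:
    """Streaming estimate of the mean, updated one event at a time."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count

def reconcile(stream_value, batch_value, rel_tol=1e-9):
    """True when the streaming and batch results agree within tolerance."""
    return math.isclose(stream_value, batch_value, rel_tol=rel_tol)

events = [12.5, 7.0, 9.25, 30.0, 4.75]  # hypothetical end-of-day history

# Healthy stream: every event was processed.
healthy = RunningMean()
for value in events:
    healthy.update(value)

# Faulty stream: the last event was dropped somewhere in the pipeline.
faulty = RunningMean()
for value in events[:-1]:
    faulty.update(value)

batch_mean = statistics.mean(events)
print(reconcile(healthy.mean, batch_mean), reconcile(faulty.mean, batch_mean))
```

A failed reconciliation does not say which side is wrong, only that the two computations disagree and that the data or the pipeline deserves a closer look.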
Q8. What techniques can you use to leverage interesting metadata?
Fingerprinting is definitely an interesting field for metadata generation. I have worked extensively in the past on audio and video fingerprinting. However, this technique is very general and can be applied to any sort of data. For instance, fingerprinting can be used to summarize web pages retrieved by users or to define the nature and the characteristics of data flows in network traffic for cybersecurity prevention strategies. I also often work with time (event time, stream time, capture time), network data (IP/MAC addresses, payloads, etc.) and geolocated information to produce rich metadata for my data science projects and tutorials.
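As one possible illustration of fingerprinting text such as web pages, the sketch below builds a bottom-k sketch over hashed word shingles and estimates the Jaccard similarity between two documents from their fingerprints alone. This is a generic near-duplicate detection technique chosen for the example, not necessarily the approach used in the audio and video work mentioned above; the function names, the shingle size, and the sample sentences are assumptions.

```python
import hashlib

def shingles(text, size=3):
    """Set of overlapping word n-grams in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def fingerprint(text, k=64):
    """Bottom-k sketch: the k smallest 64-bit hashes of the word shingles."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest(), "big")
        for s in shingles(text)
    }
    return sorted(hashes)[:k]

def similarity(fp_a, fp_b, k=64):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    set_a, set_b = set(fp_a), set(fp_b)
    smallest = sorted(set_a | set_b)[:k]  # bottom-k of the union
    if not smallest:
        return 0.0
    return sum(1 for h in smallest if h in set_a and h in set_b) / len(smallest)

page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"
page_c = "data pipelines need careful metadata management"

fp_a, fp_b, fp_c = fingerprint(page_a), fingerprint(page_b), fingerprint(page_c)
print(similarity(fp_a, fp_b), similarity(fp_a, fp_c))
```

The fingerprint has a fixed maximum size regardless of document length, so it can be stored as metadata next to each record and compared cheaply later.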
Q9. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
Most of the time I interact with domain experts for a first review of the results. Subsequently, I make sure that the model is brought into “action”. Relevant insights, in my opinion, can always be assessed by measuring their positive impact on the overall application. Most of the time, as human interaction is part of the loop, the easiest method is to measure the impact of the relevant insight on the users’ digital journey.
Q10. What were your most successful data projects? And why?
1. Geolocated data pattern analysis, because of its application to fraud prevention and personalized recommendations. 2. Time series analytics for anomaly detection and forecasting of temporal signals, in particular for enterprise processes and KPIs. 3. Converting images to features, because it allows images/videos to be searched and classified using standard BI tools.
Q11. What are the typical mistakes done when analyzing data for a large scale data project? Can they be avoided in practice?
Aggregating too much will most of the time “flatten” the signals in large datasets. To prevent this, try using more features, and/or provide a finer segmentation of the data space. Another common problem is “burying” signals provided by a limited number of samples under those of a dominating class. Models discriminating unbalanced classes tend to perform worse as the dataset grows. To solve this problem, try to rebalance the classes by applying stratified resampling, weighting the results, or boosting on the weak signals.
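Two of these remedies, weighting and resampling, can be sketched with the standard library alone. The weighting formula used here (number of samples divided by number of classes times the class count) is a common "balanced" heuristic, and the resampling shown is plain oversampling of the minority class with replacement; both are illustrative choices, and the function names and the 95/5 toy dataset are assumptions.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}

def oversample(samples, labels, seed=0):
    """Resample every class (with replacement) up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels

# Toy dataset: 95 samples of the dominating class, 5 of the weak signal.
labels = [0] * 95 + [1] * 5
samples = list(range(100))

weights = balanced_class_weights(labels)
resampled_samples, resampled_labels = oversample(samples, labels)
print(weights, Counter(resampled_labels))
```

Resampling should be applied to the training split only; evaluating on rebalanced data would misrepresent how the model behaves on the real class distribution.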
Q12. What are the ethical issues that a data scientist must always consider?
Respect individual privacy and, where possible, enforce it algorithmically. Be transparent and fair about the use of the provided data with the legal entities and the individuals who have generated the data. Avoid, as much as possible, building models that discriminate and score based on race, religion, sex, age, etc., and be aware of the implications of reinforcing decisions based on the available data labels. On this topic, there is an interesting idea around “equal opportunity” machine learning, which is visually explained on the Google Research page “Attacking discrimination with smarter machine learning”, based on a recent paper by Hardt, Price, and Srebro.
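The “equal opportunity” criterion asks that individuals who truly qualify for the positive outcome have the same chance of being predicted positive, whatever group they belong to; in other words, the true positive rate should be equal across groups. A minimal way to measure the gap is sketched below. The labels, predictions, and group names are made-up toy values, and the function name is hypothetical.

```python
from collections import defaultdict

def true_positive_rate_by_group(y_true, y_pred, groups):
    """Share of actual positives predicted positive, computed per group."""
    positives = defaultdict(int)
    hits = defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            hits[group] += int(prediction == 1)
    return {group: hits[group] / positives[group] for group in positives}

# Toy data: two groups of six individuals, four actual positives in each.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0]
groups = ["a"] * 6 + ["b"] * 6

tpr = true_positive_rate_by_group(y_true, y_pred, groups)
gap = max(tpr.values()) - min(tpr.values())
print(tpr, gap)
```

A gap close to zero indicates that the classifier satisfies equal opportunity on this data; a large gap means qualified members of one group are being missed more often than those of another.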