Q&A with Data Scientists: Carlos Andre Reis Pinheiro
Carlos Andre Reis Pinheiro is a Principal Analytical Training Consultant at SAS and a visiting professor at Data ScienceTech Institute, France. He has been working in analytics since 1996, first for a Brazilian shipping company and then for some of the largest telecommunications providers in Brazil, such as Embratel, Brasil Telecom and Oi.
He worked as a Senior Data Scientist for Teradata and EMC2. Dr. Pinheiro has examined business problems in data mining and machine learning across a wide range of departments, including IT, Marketing, Contact Center, Sales, Revenue Assurance and Fraud. Dr. Pinheiro holds a D.Sc. in Engineering from Federal University of Rio de Janeiro.
He has also completed several postdoctoral research terms, in Optimization (IMPA, Brazil, 2005-2006), Social Network Analysis (Dublin City University, Ireland, 2008-2009), Human Mobility (Universite de Savoie, France, 2012), Human Mobility and Dynamic Social Networks (Katholieke Universiteit Leuven, Belgium, 2013-2014) and Urban Mobility and Multi-Modal Traffic (Fundacao Getulio Vargas, Brazil, 2014-2015).
Q1. Is domain knowledge necessary for a data scientist?
Definitely. One of the key success factors in analytical approaches for business applications is exactly this: the combination of technical skills to develop unsupervised and supervised models and solid business knowledge to understand how to optimize and deploy the models based on particular business needs and goals. Domain knowledge can actually drive the data scientist’s approach in terms of which models to build, how to combine exploratory analysis with unsupervised and supervised models, how to use structured and unstructured data, and how to emphasize the outcomes and explain the results and possible practical solutions to the business departments.
Q2. What should every data scientist know about machine learning?
All analytical models count when we need to solve business problems. Statistical models such as regressions and generalized linear models, or simple decision trees, each have their importance in addressing business issues.
Machine learning comprises more complex models such as ensembles of decision trees, like forests and gradient boosting, as well as neural networks and support vector machines. These models may require more computational power but they can deliver more accurate results. At the end of the day, the possibility to try all of them in solving a particular business problem can make the real difference. All these models are based on data. And all these data and information describe the business markets, the customers, the competitors and the company itself, in terms of strengths, weaknesses, threats and opportunities.
As the markets evolve, the data changes, and so do the models’ results. Having the opportunity to apply all these models in a single environment is definitely a big differentiator. As human beings we all have our own preferences. But it is good to know how these models work and then have a chance to apply them to solve business problems. It is good to know that some models work better for classification and others for estimation. Some handle missing values in the inputs properly, some do not. Some are affected by outliers or great variance in the inputs, some are not. Some are affected by outliers in the target, some are not. Understanding the data and the characteristics of each model can lead you to a good solution for a particular business issue.
Q3. What are the most effective machine learning algorithms?
I really like neural networks, forests and gradient boosting. Because they are universal estimators, you don’t need to care about the form of the input data. You can concentrate more on the data preparation and the model development, including training and validation. However, I believe that the choice of a particular model or a set of models relies on personal preferences, available tools, and infrastructure, among others. I also like techniques for unstructured data, like social network analysis. You can gain very good insights from an approach like that. Nevertheless, it is important to know which model to apply in each case.
The type of the input data and the target can guide you in this choice. If you are handling a binary classification problem, you can develop a logistic regression, a decision tree, a forest, a gradient boosting model, a neural network, or even a support vector machine, and see which one performs better for the particular data and problem you have. The important thing here is that if you have the opportunity to develop different models, then as your data changes over time you can monitor the models’ performance and deploy the best one for a particular timeframe.
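The idea of monitoring several candidate models and deploying the best one for a given timeframe can be sketched in a few lines. This is only an illustration under stated assumptions: the model names, the monthly validation AUC values and the `pick_champion` helper are all hypothetical, and the rule used here (best average score over the most recent periods) is one simple choice among many.

```python
from statistics import mean

def pick_champion(scores_by_model, window=3):
    """Return the model with the best average score over the last `window` periods."""
    recent = {
        name: mean(scores[-window:])
        for name, scores in scores_by_model.items()
    }
    return max(recent, key=recent.get), recent

# Hypothetical validation AUC per month for three candidate models.
monthly_auc = {
    "logistic_regression": [0.78, 0.78, 0.77, 0.77],
    "gradient_boosting": [0.84, 0.82, 0.79, 0.75],
    "neural_network": [0.80, 0.81, 0.81, 0.82],
}

# Champion based on the two most recent months.
champion, recent_scores = pick_champion(monthly_auc, window=2)

# Champion as it would have looked after the first two months only.
early_auc = {name: scores[:2] for name, scores in monthly_auc.items()}
early_champion, _ = pick_champion(early_auc, window=2)
```

With these made-up numbers the early champion is the gradient boosting model, while the most recent window favors the neural network, which is the kind of shift the monitoring is meant to catch.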
Q4. What is your experience with data blending?
Working for large telecommunications companies over the past years, I got used to handling many different sources of data, in multiple formats, with missing values, inconsistencies, etc. Sometimes the same information is processed in distinct transactional systems, and it might have different values, one in each system. Quite often we have to join demographic information to personal information, and they are at totally different granularities. The process of cleansing, formatting, filtering, imputing, and standardizing the data is very important during the model development. It can make all the difference in the end.
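Three of the steps mentioned above (resolving conflicting values across systems, imputing missing values, and standardizing) can be sketched as follows. This is a minimal sketch, not a description of any particular project: the system names, the priority rule, the median imputation and the sample values are all assumptions chosen for illustration.

```python
from statistics import mean, median, pstdev

def reconcile(values_by_system, priority):
    """Pick one value for a field: the first non-missing value in priority order."""
    for system in priority:
        value = values_by_system.get(system)
        if value is not None:
            return value
    return None

def impute_and_standardize(values):
    """Fill missing entries with the median, then rescale to zero mean and unit variance."""
    known = [v for v in values if v is not None]
    fill = median(known)
    filled = [fill if v is None else v for v in values]
    mu, sigma = mean(filled), pstdev(filled)
    # A constant column has sigma == 0; map it to zeros instead of dividing by zero.
    return [(v - mu) / sigma if sigma else 0.0 for v in filled]

# Hypothetical revenue for one customer as recorded in two systems.
revenue = reconcile({"billing": None, "crm": 120.0}, priority=["billing", "crm"])

# Hypothetical input column with one missing value.
z_scores = impute_and_standardize([100.0, None, 140.0, 120.0])
```

Which system wins a conflict, and whether the median is the right imputation, are business decisions; the code only shows where those decisions plug in.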
Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?
You can use business knowledge, both your own as a data scientist and that of business analysts. You can remove, change or merge some of the input variables according to that knowledge and expertise. But you can also use some techniques to perform feature selection, like regressions and decision trees. Particularly in predictive modeling, once you have a target, regressions and decision trees can work pretty well in selecting the important variables for your model based on the target. These models can select the predictors, the input variables that contribute most to the prediction. For instance, it is quite useful to run a decision tree to select the inputs and then pass them along to a neural network. In the production environment, this would save a lot of time and resources during the data preparation and scoring stages.
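The selection step described above can be illustrated with the criterion a decision tree typically uses for its first split. The sketch below is a simplification, not a full tree: it scores each categorical input by information gain against the target and keeps the top `k`, which would then be passed to the downstream model. The column names and the tiny data set are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature, target):
    """Entropy reduction obtained by splitting the target on a categorical feature."""
    total = len(target)
    groups = {}
    for x, y in zip(feature, target):
        groups.setdefault(x, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(target) - remainder

def select_features(columns, target, k):
    """Return the k inputs with the highest information gain, plus all gains."""
    gains = {name: information_gain(col, target) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k], gains

# Hypothetical churn flag and three candidate inputs for eight customers.
churn = [1, 1, 1, 0, 0, 0, 1, 0]
inputs = {
    "contract": ["m", "m", "m", "y", "y", "y", "m", "y"],
    "region": ["n", "s", "n", "s", "n", "s", "n", "s"],
    "segment": ["a", "a", "b", "b", "a", "a", "b", "b"],
}

selected, gains = select_features(inputs, churn, k=2)
```

Only the columns in `selected` would need to be prepared and scored in production, which is where the savings in time and resources come from.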
Q6. Can data ingestion be automated?
In terms of deployment, once the model is trained and validated, you have to put it into production. This is the purpose.
In this case, the data ingestion must be automated. The entire process, from data acquisition to model scoring, should be automated. In terms of exploratory analysis, if you have, for instance, a specific environment to perform all sorts of experiments and data analyses before actually developing a model, it would be very nice to have the data ingestion automated too. It is clear that in an experimental environment, new analyses may demand new data, and from new sources. But as the process evolves, all these new data ingestion tasks should be added to the automated workflow. That would make the exploratory analyses faster and more effective.
Q7. How do you ensure data quality?
Data quality is a whole process and it needs to be performed in the proper way. It may completely change the accuracy of the predictive models. It may completely change the insight from unsupervised models. In addition to data cleansing, something very important, particularly for predictive modeling, is identity resolution. Identifying multiple instances of the same entity can dramatically change the model performance. Assume you are developing a predictive model to detect fraud.
Fraudsters can use multiple identities, different addresses, distinct accounts, and so forth. The ability to identify that in some cases all these multiple instances are actually assigned to the same entity can take the predictive model in a totally different direction.
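A very simplified view of identity resolution is to link records that share any identifying attribute and treat each connected group as one entity. The sketch below does this with a union-find structure. It assumes exact matches on the chosen keys, which real identity resolution usually relaxes with fuzzy matching; the record ids, field names and values are hypothetical.

```python
def resolve_identities(records, keys):
    """Group record ids that share a value on any of the given keys."""
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # (key, value) -> first record id carrying that value
    for rid, rec in records.items():
        for key in keys:
            value = rec.get(key)
            if value is None:
                continue
            token = (key, value)
            if token in first_seen:
                union(rid, first_seen[token])
            else:
                first_seen[token] = rid

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), set()).add(rid)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical account records: A1 and A2 share a phone, A2 and A3 share an address.
accounts = {
    "A1": {"name": "J. Silva", "phone": "555-0101", "address": "Rua A 10"},
    "A2": {"name": "Joao S.", "phone": "555-0101", "address": "Rua B 22"},
    "A3": {"name": "J. S.", "phone": "555-0199", "address": "Rua B 22"},
    "A4": {"name": "M. Costa", "phone": "555-0123", "address": "Rua C 5"},
}

entities = resolve_identities(accounts, keys=["phone", "address"])
```

Here three accounts under different names collapse into a single entity, so features such as number of accounts per entity can be computed before the fraud model is trained.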
Q8. When is data mining a suitable technique to use and when not?
I see data mining techniques as unsuitable only when they are used in the wrong way: applying a linear regression to predict a binary target, or feeding multiple input variables with lots of missing values to a neural network, or using a target with outliers in a decision tree, etc. Those are cases where the technique will not be suitable. In all other cases, when data mining techniques are deployed to gain knowledge about the business, as in clustering models, or to classify or estimate a particular business event, as in predictive models, they will always be suitable. More knowledge about any subject matter is always good; it is always beneficial.
Q9. What techniques can you use to leverage interesting metadata?
As I mentioned before, techniques associated with feature selection can be used to build interesting metadata, comprising the input variables that are important to particular predictive models. In this case, regressions and decision trees would be good techniques to leverage interesting metadata. You can have variables ranked by importance and associated with different targets. For instance, aging is 0.8 for churn but 0.1 for cross-sell.
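Such metadata can be kept as a simple mapping from target to variable importance. In the sketch below only the two `aging` scores come from the answer above; the `monthly_usage` variable, its scores and the helper functions are hypothetical and serve only to show how the ranking per target could be queried.

```python
# Importance of each input variable per target.
# The "aging" scores are from the text; "monthly_usage" is a made-up second variable.
importance = {
    "churn": {"aging": 0.8, "monthly_usage": 0.5},
    "cross_sell": {"aging": 0.1, "monthly_usage": 0.6},
}

def rank_inputs(metadata, target):
    """Input variables for one target, most important first."""
    scores = metadata[target]
    return sorted(scores, key=scores.get, reverse=True)

def targets_for(metadata, variable, threshold):
    """Targets for which a variable reaches at least the given importance."""
    return [t for t, scores in metadata.items() if scores.get(variable, 0.0) >= threshold]

churn_ranking = rank_inputs(importance, "churn")
aging_targets = targets_for(importance, "aging", threshold=0.5)
```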
Q10. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
The most straightforward approach to evaluating the outcomes of analytical models is, in the first place, to check and confirm them with business analysts and, secondly, to put the models into production and monitor how they work for real. The feedback from business analysts can change the models, as can the feedback from production, when the company is taking business actions according to the models’ results.
Q11. What were your most successful data projects? And why?
It is hard to tell. I have developed several analytical projects in my career and I hope to do many more. The ones I liked most were based on social network analysis. I can select two. One was in Brazil, where we developed a set of models to understand portability in telecommunications and how some customers can affect others, initiating events in cascade. The network analysis detected the communities and identified the core customers, the ones who could act as influencers and affect others. Predictive models identified the customers’ likelihood to churn. Different business actions were deployed based on a combination of churn scores and customer influence. The other one was in Turkey. In addition to the social network analysis to detect the communities and to identify the customers’ roles, we created a set of equations to classify the type of the community according to its overall behavior. Communities were labeled as friend, business and family, and then sub-clustered accordingly. Each group, considering the proportions of the distinct types, could drive particular campaigns, bundles, products, services, relationships, etc.
Q12. What are the typical mistakes made when analyzing data for a large scale data project? Can they be avoided in practice?
The problem in analyzing massive data is often the time it consumes. Some shortcuts, such as designing the experiments in advance and working properly with samples, can minimize the time needed to try different analytical approaches.
Q13. What are the differences between Data Science and Business Intelligence (BI)?
There is a big difference to me. Business Intelligence is more associated with queries and reports. Even though you have flexibility to manipulate queries and reports on the fly by using OLAP cubes, you are limited to the information previously prepared and published. All the questions have to be asked beforehand and the answers have to be published in the form of queries and reports. You can definitely analyze a particular fact under multiple dimensions. But both fact and dimensions should be there, computed in some way. Data Science involves structured analysis, such as that performed by unsupervised and supervised models, and unstructured analysis, as in social network analysis, text mining, sentiment analysis, image analysis, among others. It also comprises experimental and exploratory analysis, in relation to specific business issues.
It can also include optimization algorithms depending on the type of the problem, as in supply chain, transportation, network, resources, and others. You may have the question but you need to find the answer from the available data. And sometimes, you may not have the question, which makes the process even more fun. However, neither is better or more important. Both processes are quite relevant and should be valued by companies.
Q14. How do you convince/explain to your managers the results you derived from analyzing (Big) Data?
This totally relies on the maturity of the company in terms of analytics. For mature companies, with analytics already in place and working well, you may need just to present the model’s rules, the relevant variables, the training and validation assessment, and so on. For instance, sharing the business rules derived from a decision tree or a regression, plus the model performance, the decision matrix, the ROC curve, the lift chart and the cumulative captured response can suffice to convince the managers. For companies in early stages of analytics, control groups for campaigns, comparing a random selection of customers or prospects with targets selected by the model, might be necessary. Particularly for large companies, small implementations considering specific products, customers or geographic locations can also help to speed up the process and mitigate mistakes.
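The control-group comparison mentioned above boils down to comparing the response rate of customers selected by the model with the response rate of a random selection. A minimal sketch, with entirely hypothetical campaign numbers and helper names:

```python
def response_rate(responses):
    """Share of contacted customers who responded (responses are 0/1 flags)."""
    return sum(responses) / len(responses)

def lift(model_group, control_group):
    """How many times better the model-selected group responds than the random control."""
    control_rate = response_rate(control_group)
    if control_rate == 0:
        raise ValueError("control group has no responses; lift is undefined")
    return response_rate(model_group) / control_rate

# Hypothetical campaign: 1,000 customers picked by the model, 1,000 picked at random.
model_selected = [1] * 120 + [0] * 880   # 12% responded
random_control = [1] * 40 + [0] * 960    # 4% responded

campaign_lift = lift(model_selected, random_control)
```

With these made-up figures the model-selected group responds about three times as often as the random one, a single number that is usually easier to present to managers than a ROC curve.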
Q15. What data infrastructure do you use to support your work?
I am a big fan of SAS. Currently I am working with SAS® Viya™, a new open platform, cloud-ready, reliable, scalable, secure, distributed, and in-memory environment to perform machine learning tasks, from data preparation to model building, assessment and scoring.