Q&A with Data Scientists: Partha Deka

Partha Deka is a Staff Data Scientist with GE Digital. He has 10 years of experience as a Big Data Engineer and Data Scientist, with a demonstrated history of working in the electrical, electronic, manufacturing and retail industries. Partha has a Bachelor's degree in Electrical Engineering from the National Institute of Technology, Silchar, India, and a Master's in Electrical and Computer Engineering from Wichita State University, Kansas, with a focus on control theory, probability theory, discrete math and advanced data mining.

Q1. Is domain knowledge necessary for a data scientist?

The answer to this question is not straightforward; there is no absolute yes or no. The techniques of statistics are generic in nature. For feature analysis in a supervised classification problem, for example, one would use a technique like Linear Discriminant Analysis, which helps to identify the features that maximize the discrimination between classes. Another example is Principal Component Analysis, which is used in an unsupervised learning problem to understand how each feature contributes to the directions of maximum variance and to get an idea of the underlying clusters.
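As a rough illustration of the PCA idea mentioned above, the sketch below finds the first principal component with power iteration, using only the Python standard library. The function name, the sample data and the choice of power iteration are assumptions made for this example; in practice a library implementation would normally be used, and features would usually be scaled first.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return a unit vector along the direction of maximum variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d) of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: features 0 and 1 rise together, feature 2 is nearly constant.
data = [
    [1.0, 2.1, 0.50],
    [2.0, 3.9, 0.52],
    [3.0, 6.2, 0.49],
    [4.0, 7.8, 0.51],
    [5.0, 10.1, 0.50],
]
loadings = first_principal_component(data)
print(loadings)
```

Each entry of `loadings` shows how much the corresponding feature contributes to the direction of maximum variance. Here the nearly constant third feature gets a loading close to zero.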

These techniques may not provide an accurate analysis of the features on their own. In a traditional supply chain setting where we want to optimize inventory, for example, we need experienced domain experts. They play a crucial role in building the customer's trust. There could be gaps in the model if we don't consult a domain expert to understand the business importance of the features used. A feature identified as having low importance through statistical analysis could still be important to the domain expert.

An experienced domain expert can provide great insight into deriving custom features, imputing missing data and so on, which helps in building a good predictive model. At the same time, consider a generic short-term housing price prediction problem for a city or county in the US, where the features are the size of the house, square footage, number of bedrooms, school district rating, average house prices in the area, neighborhood crime rate, employment rate etc. Here the role of a domain expert would not be relevant in building a good predictive model, as the data speaks for itself. Similarly, when applying deep learning, computer vision and reinforcement learning (policy-based learning) in the autonomous car industry, consulting a domain expert might not be critical.

Q2. What should every data scientist know about machine learning?

Statistics, probability theory, discrete math and data mining have always been part of a Math, Statistics or Engineering degree curriculum. But in the last decade we have seen the emergence of applied machine learning in real-life use cases. The advent of distributed, scalable, massively parallel computing platforms such as Hadoop/MapReduce and Spark, GPU-based computing infrastructure, TensorFlow for deep learning and others has contributed to applied machine learning.

First of all, a data scientist must be an expert in various supervised learning techniques, clustering techniques, regression, reinforcement learning based models etc. Every problem is different. For supervised learning, the ability to try out various data pre-processing methods, outlier removal methods, and feature extraction, selection and scaling techniques, and to tune the hyperparameters to create the best possible model, is an art.

Data science is a vast field. Learning from historical data to predict something in the future, building segments of similar data points from unlabeled datasets, and building self-learning agents that find the best policy for behaving in their environment (a self-driving car is an example) are all different forms of machine learning.

In a nutshell, a data scientist should be well versed in the techniques and algorithms of three fields: supervised learning, unsupervised learning and reinforcement learning. Good expertise in scalable architecture is nice to have.

Q3. What are the most effective machine learning algorithms?

There is no single model that is effective for all problems. Finding the most effective algorithm for a particular type of dataset, a particular problem, a particular set of dimensions etc. is an art.

Let's take supervised learning as an example. Say we want to build a model for stock price prediction. We might want to try out linear regression, polynomial regression, KNN-based regression, ARIMA and exponential smoothing to track the trend of the stock.
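To make one of these options concrete, here is a minimal sketch of simple exponential smoothing for tracking the trend of a price series. The function name, the price values and the smoothing factor `alpha=0.5` are hypothetical, chosen only for illustration.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: blend each new value with the previous estimate."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # the first estimate is the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical daily closing prices.
prices = [100.0, 102.0, 101.0, 105.0, 107.0]
trend = exponential_smoothing(prices, alpha=0.5)
print(trend)  # [100.0, 101.0, 101.0, 103.0, 105.0]
```

A larger `alpha` follows recent prices more closely, while a smaller one gives a smoother trend that reacts more slowly.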

An example of a classification problem is building a student intervention system that predicts whether a student will graduate or not. For this case, we might want to try out models such as Gaussian Naive Bayes, Decision Trees, Ensemble Methods (Bagging, Boosting, Random Forest, Gradient Boosting), K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Logistic Regression etc. Generally, a boosting- or bagging-based ensemble of decision trees is very effective for this type of binary classification problem.

For clustering, K-means is used in most clustering applications in the industry. Unlike K-means, DBSCAN can form good clusters even in datasets that have outliers, and it can handle clusters of different sizes and shapes. MeanShift clustering aims to discover clusters in a smooth density of data points; it is a centroid-based algorithm.
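For readers who have not seen K-means written out, the following is a minimal sketch of its two alternating steps (assign each point to the nearest centroid, then move each centroid to the mean of its points). The function name, the sample points and the explicit `initial_centroids` argument are assumptions for this example; library implementations choose the starting centroids more carefully.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Plain K-means with squared Euclidean distance and caller-supplied starting centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point takes the index of its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because every point is always assigned to some centroid, an outlier would pull its centroid toward itself, which is the weakness relative to DBSCAN noted above.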

Collaborative filtering is popularly used for Recommender systems.
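As one possible illustration, the sketch below shows user-based collaborative filtering: a missing rating is predicted as a similarity-weighted average of other users' ratings for that item. The user names, ratings and helper functions are hypothetical, and cosine similarity over co-rated items is just one of several common choices.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two users, computed over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        numerator += similarity * other_ratings[item]
        denominator += abs(similarity)
    return numerator / denominator if denominator else None

# Hypothetical ratings on a 1-5 scale; alice has not rated item "C".
ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 5, "C": 5},
    "carol": {"A": 1, "B": 5, "C": 1},
}
predicted = predict_rating(ratings, "alice", "C")
print(predicted)
```

Since alice's ratings resemble bob's more than carol's, the prediction leans toward bob's rating of item "C".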

Dimensionality reduction techniques like PCA and LDA are also used effectively.

Convolutional Neural Networks are very popular for image classification. If we have lots of data with hidden structure, or the relationship between the features and the target variable is non-linear, neural networks are used for regression and classification problems.

Additionally, hyperparameter tuning for a certain model plays a great role in finding the sweet spot between bias and variance.

Model evaluation is crucial. For a regression problem, the R-squared value is usually preferred. For a classification problem, we use the confusion matrix, which is based on true positives, false positives and false negatives and provides the precision and recall scores.
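The evaluation measures mentioned above can be computed directly from predictions. The sketch below uses the standard definitions of R-squared, precision and recall; the function names and the sample values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """R-squared: 1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall from true positive, false positive and false negative counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical regression targets and predictions.
score = r_squared([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.0])

# Hypothetical binary labels and predictions: 3 TP, 1 FP, 1 FN.
precision, recall = precision_recall([1, 1, 1, 0, 0, 1], [1, 0, 1, 1, 0, 1])
print(score, precision, recall)
```

Precision is the share of predicted positives that are correct, while recall is the share of actual positives that were found.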

The chart from scikit-learn provides a very good overview:

[Figure: overview chart from scikit-learn]

Q4. Can data ingestion be automated?

My answer is yes. Data ingestion has been automated in the industry with technologies like Sqoop, Kafka, Hadoop, Spark etc. The caveat is that if there are major changes in the source system, the data ingestion pipeline breaks and needs to be updated.
