Vikas Rathee works as a Staff Data Scientist with GE Digital. He has 10 years of experience building data-based software products at top product companies and startups. Vikas holds Bachelor's and Master's degrees in Computer Science and Engineering. His areas of expertise include algorithms, machine learning, big data, GIS, economics, and cloud platforms.
Q1. Is domain knowledge necessary for a data scientist?
Domain knowledge is not strictly necessary for a data scientist; however, it is an important element in creating value for the business, the product, and the technology being applied. Think of a data scientist as someone who has the skills to solve a problem by following research methodologies, but who, in order to create something useful, will look for a domain expert to help him understand the relevant systems and processes. In the absence of such an expert, a data scientist will need to spend time learning the inner workings of the domain, and that time is not always available. Without domain knowledge, a data scientist is little more than locked potential.
I have been involved with both kinds of projects: those where a domain expert was available and those where one was not. Where a domain expert was available, there was clear guidance on what the data meant and how it could be used, and the business owner was able to describe the ideal solution they were looking for. This brought clarity from the very start. Where no domain expert was available, a huge effort was needed first to understand the problem and define data collection strategies, followed by a large investment of time in understanding the data.
Q2. What should every data scientist know about machine learning?
Machine learning is a great tool available to data scientists. It is an amalgamation of mathematics, statistics, and computational techniques.
To solve the problems at hand, a data scientist should have a deep understanding of machine learning algorithms, their computational complexity, and their limitations in application. A data scientist can then debate which algorithm is better suited to a problem and why their solution would work and give better results.
What every data scientist should always keep in mind is that a machine learning algorithm is only as good as its data, and that having more data can compensate for not having used a great algorithm.
Q3. What is your experience with data blending?
Data blending is part of the feature engineering stage of solving a data science problem. I have been doing data blending for a long time now. The goal is to understand how different data sources relate to each other and how information from them can be used in training machine learning algorithms. It is a challenge because we need to understand the structure of each dataset and the primary keys on which two datasets can be joined. This requires heavy data exploration.
Recently we have seen the emergence of software tools that help with feature engineering (data cleaning and data blending). They can be very helpful in big data problems, as a lot of the heavy exploratory work can be done automatically.
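As a minimal sketch of the join step described above, two sources can be blended on a shared primary key with pandas (the column names and data here are hypothetical):

```python
import pandas as pd

# Two hypothetical data sources that share a key column.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "east"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 4],
    "amount": [100.0, 50.0, 75.0, 20.0],
})

# Blend on the shared primary key; a left join keeps every customer,
# and `indicator=True` flags rows that found no match in `orders`.
blended = customers.merge(orders, on="customer_id", how="left", indicator=True)
print(blended)
```

Exploration like this is what surfaces mismatched keys early: here customer 3 has no orders and order-only customer 4 is silently dropped, which is exactly the kind of structural detail data blending must account for.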
Q4. Predictive Modeling: How can you perform accurate feature engineering/extraction?
Feature engineering is the most important stage of a data science project. The problem is that no one knows in advance the best way to do it. To solve a problem, we first need to look at all the data that might be relevant. Once a reasonable amount of data related to the problem has been collected, we do correlation analysis. For more complex problems we might need to do feature transformations or look for higher-dimensional features that are not obvious. This is a highly iterative process.
Future feature requirements depend on the prediction accuracy of the machine learning algorithms. If, after feature engineering, we find the models are not accurate enough to be used, we might decide to invest more time and effort in features that are hard to collect; sometimes the data has not been collected at all, and data collection strategies need to be defined.
This process iterates until we get the desired results. The deep learning algorithms that have started to emerge try to generalize the learning, and much of the heavy work of feature engineering is not required with them.
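The correlation-analysis step mentioned above can be sketched as follows; the feature names and the synthetic data are purely illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical raw features for a predictive problem: one feature
# drives the target, another is pure noise.
df = pd.DataFrame({"temperature": rng.normal(20, 5, n)})
df["energy_use"] = 3.0 * df["temperature"] + rng.normal(0, 2, n)
df["noise"] = rng.normal(0, 1, n)

# Correlation analysis: rank candidate features by the strength of
# their linear relationship with the target before investing in modeling.
corr = df.corr()["energy_use"].drop("energy_use").abs().sort_values(ascending=False)
print(corr)
```

A ranking like this only captures linear relationships, which is why the text above notes that complex problems may need feature transformations before a relationship becomes visible.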
Q5. Can data ingestion be automated?
I believe data ingestion can be automated to a large extent using available software technologies, as long as we are able to collect the data. However, there can be technical, physical, or other limitations that prevent us from using the data in real time, so its actual use may be delayed.
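A minimal sketch of automated batch ingestion, assuming a simple drop-folder layout (the directory structure and file naming are hypothetical):

```python
import csv
from pathlib import Path

def ingest_new_files(drop_dir: Path, seen: set[str]) -> list[dict]:
    """Load rows from any CSV files in drop_dir not ingested before."""
    rows = []
    for path in sorted(drop_dir.glob("*.csv")):
        if path.name in seen:
            continue  # already ingested on a previous run
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
        seen.add(path.name)  # mark as processed
    return rows
```

In production a function like this would run on a schedule (cron, Airflow, and similar tools are common choices), with the `seen` set persisted between runs; that scheduling layer is what turns collection into the automated ingestion described above.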
Q6. How do you ensure data quality?
Data quality is very important for making sure the analysis is correct and that any predictive model we develop from the data is good. Very simply, I would do some statistical analysis on the data, create some charts, and visualize the information. I would also clean the data by making choices at data preparation time. This is part of the feature engineering stage that must be done before any modeling.
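The statistical checks described above might look like the following sketch; the sensor readings, the z-score threshold, and the cleaning choices are all illustrative assumptions:

```python
import pandas as pd

# Hypothetical sensor readings with typical quality problems:
# a missing value, an implausible spike, and an inconsistent label.
df = pd.DataFrame({
    "reading": [10.2, 9.8, None, 10.5, 10.1, 9.9, 10.4, 10.0, 9.7, 500.0],
    "unit": ["C", "C", "C", "c", "C", "C", "C", "C", "C", "C"],
})

# Quick statistical profile to spot anomalies at a glance.
print(df["reading"].describe())

# Missing values: a preparation-time choice between dropping and imputing.
print("missing:", df["reading"].isna().sum())

# Outliers flagged by a simple z-score rule (the threshold is a judgment call).
z = (df["reading"] - df["reading"].mean()) / df["reading"].std()
print("outliers:", df.loc[z.abs() > 2, "reading"].tolist())

# Inconsistent categorical values normalized during preparation.
df["unit"] = df["unit"].str.upper()
```

Each of these checks corresponds to a cleaning decision the data scientist has to make explicitly, which is why they belong in the feature engineering stage rather than being deferred to modeling.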
Q7. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
Getting insights is what makes the job of a data scientist interesting. To make sure the insights are good and relevant, we need to continuously ask ourselves what problem we are trying to solve and how the insights will be used.
In simpler words, to improve an existing process we need to understand the process and where improvement is needed or of most value. For predictive modeling cases, we need to ask how the output of the predictive model will be applied and what additional business value can be derived from it. We also need to convey what the model's output means, to avoid incorrect interpretation by non-experts.
Once the context around a problem has been defined, we proceed to implement the machine learning solution. The immediate next stage is to verify whether the solution will actually work.
There are many techniques for measuring prediction accuracy by testing on historical data samples: k-fold cross-validation, the confusion matrix, R-squared, absolute error, MAPE (mean absolute percentage error), p-values, and so on. We can then choose, from among many candidate models, the ones that show the most promising results. There are also ensemble algorithms that generalize the learning and avoid overfit models.
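A minimal sketch of this model-comparison step using scikit-learn's k-fold cross-validation; the public dataset and the two candidate models are stand-ins, not a recommendation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A public dataset stands in for historical data samples here.
X, y = load_breast_cancer(return_X_y=True)

# Compare candidate models with 5-fold cross-validation and keep
# whichever shows the most promising mean accuracy.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Because every model is scored on held-out folds rather than its training data, the comparison guards against exactly the overfitting that ensemble methods and cross-validation are meant to expose.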