Q&A with Data Scientists: Anya Rumyantseva
Anya Rumyantseva is a Data Scientist with Pentaho, a Hitachi Group Company. Her current projects focus on using IoT data to improve business operations in the rail and manufacturing industries. Anya has a BS/MS in Physics from Lomonosov Moscow State University, Russia. Her PhD thesis at the University of Southampton, UK, focused on autonomous underwater vehicles.
Q1. Is domain knowledge necessary for a data scientist?
Yes, it always helps. However, I think the ability of data scientists to collaborate with people within an organization is important as well. Even if a data scientist is not specialized in a specific domain, (s)he can speak to a domain expert or business consultant to discuss business use cases, identify which data sets are relevant, perform feature engineering and evaluate whether the developed data science solution has an impact on business problems. Therefore, a lack of domain expertise can be offset by excellent communication and teamwork skills.
Q2. What should every data scientist know about machine learning?
A solid background in Linear Algebra, Statistics, Discrete Math and other relevant academic disciplines is important for implementing machine learning and understanding the performance of different algorithms. It is also important to know the pros and cons of widely used machine learning algorithms for clustering, classification and regression: for example, which algorithms are useful when there are many features, and whether an algorithm requires a significant amount of hyperparameter tuning.
In addition, Data Science is a broad field and is not limited to machine learning. Data Science problems can involve mathematical optimization, graph theory and other disciplines.
Q3. What are the most effective machine learning algorithms?
In my opinion, it really depends on the specific problem. No single algorithm can solve all problems with the same efficiency. In classification tasks, the Logistic Regression model is fast to train and easy to understand; Support Vector Machines perform well on relatively small data sets with many features; Decision Trees do not require much hyperparameter tuning and deal efficiently with missing values. Deep learning algorithms (e.g. artificial neural networks) can show better performance on relatively large data sets and can model very complex relationships in data, but require substantial expertise to implement. The general approach is to define a performance metric and start with simple models.
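As a minimal sketch of that general approach, the Python snippet below defines a metric first and then scores the simplest possible model against it. Accuracy as the metric, the majority-class predictor as the "simple model", and the toy labels are all illustrative assumptions, not something taken from the interview.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def majority_baseline(y_train):
    """Return a predictor that always outputs the most common training label."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return lambda rows: [most_common_label for _ in rows]


# Hypothetical toy data: labels are 0 (normal) and 1 (fault).
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
X_test = [[0.2], [0.9], [0.4], [0.8]]
y_test = [0, 1, 0, 0]

predict = majority_baseline(y_train)
baseline_score = accuracy(y_test, predict(X_test))
print(f"baseline accuracy: {baseline_score:.2f}")
```

Any more complex model (logistic regression, a decision tree, a neural network) would then have to beat this baseline score on the same metric to justify its extra cost.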
Q4. What is your experience with data blending? (*)
Data aggregation, blending and cleansing are time-consuming tasks in data science. They involve dealing with data coming from different sources and in different formats. This is why at Pentaho we focus a lot on developing toolsets that accelerate data preparation for analytics and machine learning. Pentaho Data Integration is designed to make data cleansing, transformation and preparation for machine learning simpler and reproducible.
Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?
First, I would start by brainstorming the problem, then evaluate which data sources are relevant and can benefit model performance. After that, it all comes down to data cleansing and preparation. Multiple methods exist for evaluating feature importance in the data set and eliminating irrelevant features. These techniques can help to extract a subset of features that best represents the underlying problem. In general, feature engineering is paramount for achieving good performance of an algorithm on a given task.
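One of the simplest such methods is a filter that ranks features by the strength of their linear correlation with the target. The sketch below is only an illustration of that idea: the choice of Pearson correlation, the function names and the sensor-style column names are assumptions, and correlation will miss non-linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical feature columns and target values.
columns = {
    "vibration": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temperature": [20.0, 21.0, 19.0, 22.0, 20.0],
    "site_id": [7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.8]

ranking = rank_features(columns, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Features at the bottom of the ranking, such as the constant column here, are candidates for elimination, subject to review with a domain expert.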
Q6. Can data ingestion be automated?
The wide range of information gathered by a business is rarely stored in a single database or format. Data integration means consolidating information from multiple data sources for use in a single application. Pentaho Data Integration allows users to extract information from any data source for preparation and delivery to a data warehouse or Hadoop cluster.
Q7. How do you ensure data quality?
The quality of data has a significant effect on the results and efficiency of machine learning algorithms. Data quality management can involve checking for outliers and inconsistencies, fixing missing values, and making sure that the values in each column are accurate and within a reasonable range. All of this can be done during the data pre-processing and exploratory analysis stages.
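The checks listed above can be sketched for a single numeric column as follows. This is a minimal illustration, assuming missing values are represented as None, the valid range is supplied by a domain expert, and outliers are flagged with the common 1.5 x IQR rule; the function name, thresholds and readings are hypothetical.

```python
import statistics


def quality_report(values, low, high):
    """Summarize missing, out-of-range and outlying entries in one column."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    # Range check against limits assumed to come from domain knowledge.
    out_of_range = [v for v in present if not (low <= v <= high)]

    # Outlier check using the 1.5 * IQR rule (an assumed convention).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < lower or v > upper]

    return {"missing": missing, "out_of_range": out_of_range, "outliers": outliers}


# Hypothetical temperature readings in degrees Celsius.
readings = [20.1, 19.8, None, 20.4, 95.0, 20.0, -300.0, 19.9]
report = quality_report(readings, low=-50.0, high=60.0)
print(report)
```

Whether flagged values should be dropped, corrected or kept is a separate decision that depends on the use case.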
Q9. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
I would suggest constantly communicating with the other people involved in a project. They can relate insights from data analytics to defined business metrics. For instance, if a developed data science solution decreases the shutdown time of a factory from 5% to 4.5%, this is not that exciting for a mathematician. But for the factory owner it can mean the difference between going bankrupt or not!