Q&A with Data Scientists: Andrei Lopatenko
Andrei Lopatenko was born in Odessa, Ukraine. He received a master's degree in Applied Mathematics and Physics from the Moscow Institute of Physics and Technology, Russia, in 1997 and a PhD in Computer Science from the University of Manchester, UK, in 2007. Before coming to the USA he lived in Ukraine, Russia, Austria, the United Kingdom, Italy, and Switzerland (plus Australia and Canada, counting month-long extended stays there).
After completing his PhD he joined Google (2006), where he worked on key search algorithms for Google Web Search. He was a founding engineer of Apple Maps. After working twice for Google and twice for Apple, he led the search team at Walmart Labs responsible for search improvements at walmart.com, where his teams in California and Bangalore brought significant improvements to the company's revenue.
Currently he works as a Director of Engineering at the Recruit Institute of Technology (RIT), a part of Recruit Holdings Co., where he works on conversational interfaces, natural language processing, and data integration. RIT is located in Mountain View, CA.
Q1. Is domain knowledge necessary for a data scientist?
Domain knowledge is useful for many data scientist roles, but it is not necessary for many others. There are roles where a data scientist is expected to deliver gains through data science in a specific business.

For example, a data scientist may be asked to improve sales of a certain product or to increase user engagement with a certain service the business provides. Understanding the business, what key factors affect customers and how, what data must be obtained for analysis, and what outcomes are expected helps deliver good results quickly.

But frequently data scientists can obtain valuable new results without deep domain knowledge, simply by digging into the data.

In big companies data scientists must frequently address the needs of the company's very different businesses, and nobody has deep domain knowledge of all of them.
Q2. What should every data scientist know about machine learning?
Data scientists must know the key concepts of machine learning toolboxes and how to use them in daily data science work. A data scientist must certainly be good at feature engineering and feature extraction, and must understand how to evaluate new features. A data scientist must have a good knowledge of how to evaluate machine learning methods, namely cross-validation and model selection, must understand how to avoid overfitting and how different loss functions affect training, and must know how to deal with missing values in data and with noise in labels.

A data scientist must understand the key principles of ensemble models and model aggregation. I expect a good data scientist to have a solid knowledge of the foundations of machine learning and a deep understanding of its methods. I do not expect a data scientist to be very knowledgeable about the computational aspects of machine learning algorithms or implementation details.
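To make the cross-validation and model-selection point concrete, here is a minimal sketch in plain Python. The synthetic one-feature dataset, the threshold "model", and the majority baseline are illustrative assumptions, not anything from the interview: k-fold cross-validation scores each candidate on held-out folds, and the candidate with the best average score is selected.

```python
import random
random.seed(0)

# Tiny synthetic binary dataset: one feature, class means at 0 and 1.
data = [(random.gauss(cls, 0.5), cls) for cls in (0, 1) for _ in range(50)]
random.shuffle(data)

def kfold(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    fold = n // k
    for i in range(k):
        test = set(range(i * fold, (i + 1) * fold))
        yield [j for j in range(n) if j not in test], sorted(test)

def fit_threshold(train):
    # "Model": classify by the midpoint between the two class means.
    m0 = [x for x, y in train if y == 0]
    m1 = [x for x, y in train if y == 1]
    t = (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2
    return lambda x: 1 if x > t else 0

def fit_majority(train):
    # Baseline: always predict the most frequent training label.
    label = 1 if sum(y for _, y in train) * 2 >= len(train) else 0
    return lambda x: label

def cv_accuracy(fit, k=5):
    # Average held-out accuracy over the k folds.
    accs = []
    for tr, te in kfold(len(data), k):
        model = fit([data[i] for i in tr])
        accs.append(sum(model(data[i][0]) == data[i][1] for i in te) / len(te))
    return sum(accs) / len(accs)

scores = {"threshold": cv_accuracy(fit_threshold), "majority": cv_accuracy(fit_majority)}
best = max(scores, key=scores.get)
```

The same loop generalizes to any set of candidate models; the point is that selection is based on held-out performance, never on training accuracy.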
Q3. What are the most effective machine learning algorithms?
It depends on the task. Frequently one can go surprisingly far with an algorithm such as logistic regression combined with good feature engineering and parameter tuning. Random forests are very effective for a broad set of tasks.

Convolutional Neural Networks are very effective for many image problems. Recurrent Neural Networks have proved very effective for a large selection of sequential problems, in particular Natural Language Processing tasks.

Ensemble methods have proved very effective for a large set of real-world problems, for example Gradient Boosted Regression Trees and similar methods for learning-to-rank tasks.
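Why ensembles help can be illustrated with a tiny simulation (the numbers are entirely hypothetical, not from the interview): averaging the predictions of many independent, noisy models shrinks the variance of the combined prediction, so the ensemble's average error falls well below a single model's.

```python
import random
random.seed(1)

TRUE = 1.0  # the quantity the models try to predict

def weak_model():
    # Stand-in for one model's noisy prediction of the true value.
    return TRUE + random.gauss(0, 0.5)

def ensemble(size):
    # Average the predictions of `size` independent weak models.
    return sum(weak_model() for _ in range(size)) / size

trials = 2000
err_single = sum(abs(ensemble(1) - TRUE) for _ in range(trials)) / trials
err_avg25 = sum(abs(ensemble(25) - TRUE) for _ in range(trials)) / trials
```

With independent errors, averaging 25 models cuts the standard deviation by a factor of five; real ensembles gain less because model errors are correlated, but the direction of the effect is the same.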
Q4. What is your experience with data blending?
I think there is no industrial data scientist without data blending experience. In most cases data are extracted or acquired from many different sources, and one has to blend them to get clean data. Typically this is heavy programming work. Most data sources are noisy, and one has to work to detect errors in the data. Then the data must be blended and entities must be linked. In most cases there are no training sets for that, and one has to build them. But training sets must be representative, which requires deep digging into the data to build a good set. There is no good off-the-shelf software for data blending, so one has to build one's own.
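As a toy illustration of the entity-linking step (the business records and the normalization rule are invented for the example), one can blend two feeds by reducing each name to a crude canonical key, so records describing the same entity under different spellings end up merged:

```python
import re

def entity_key(name):
    # Crude canonical key: lowercase, strip punctuation, collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", "", name.lower())).strip()

# Two hypothetical feeds describing the same business under different spellings.
source_a = [("Joe's Pizza, Inc.", {"phone": "555-0100"})]
source_b = [("joes  pizza inc", {"city": "Mountain View"})]

blended = {}
for raw_name, record in source_a + source_b:
    # Records whose names normalize to the same key are merged into one entity.
    blended.setdefault(entity_key(raw_name), {"name": raw_name}).update(record)
```

Real entity linkage needs far more than string normalization (addresses, phones, fuzzy matching, learned matchers), which is exactly why the training sets mentioned above have to be built by hand.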
Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?
It is very important to have fast experimental pipelines, so that once a feature is extracted it can be put into an experiment as quickly as possible and you get the results of experiments and feature evaluation as soon as possible. This is usually easy for projects where you have full training sets available and you need to fit a model and cross-validate, but many projects require live evaluation: plugging your feature in as, for example, a search signal in a live search engine and measuring the impact and its quality. A fast experimental framework is crucial for effective feature engineering in such cases.

It is good to have a software library to combine features, validate features, and check how they correlate with the target data, so that the process of feature engineering is very fast, since it requires a lot of experiments and trials of new ideas.
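A minimal version of such a feature-validation check (the feature and target values below are made up for illustration) is to compute each candidate feature's Pearson correlation with the target before investing in a full experiment:

```python
def pearson(xs, ys):
    # Pearson correlation between a candidate feature and the target.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

target        = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_good  = [2.1, 3.9, 6.2, 7.8, 10.1]   # tracks the target closely
feature_noise = [3.0, 1.0, 4.0, 1.0, 5.0]    # unrelated values

r_good = pearson(feature_good, target)
r_noise = pearson(feature_noise, target)
```

Correlation is only a first filter, since it misses nonlinear and interaction effects, but it is cheap enough to run on every new feature idea before a live experiment.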
Q6. How do you ensure data quality?
Data quality cannot be taken for granted; it must be checked automatically. In real-world applications it rarely happens that you get data once; frequently you get a stream of data. If you build an application about local businesses, you get a stream of data from providers of data about businesses. If you build an e-commerce site, then you get regular data updates from merchants and other data providers. The problem is that you can almost never be sure of the data quality. In most cases the data are dirty.

You have to protect your customers from dirty data. You have to work to discover what problems you might have with the data. Frequently the problems are not trivial. Sometimes you can see them by browsing the data directly; frequently you cannot.

For example, in the case of local businesses, latitude and longitude coordinates might be wrong because the provider has a bad geocoding system. Sometimes you do not see problems with the data immediately, but only after using them to train some model, where errors accumulate and lead to wrong results, and you have to trace back what went wrong.

To ensure data quality, once I understand what problems may happen, I build data quality monitoring software. At every step of a data processing pipeline I embed tests; you may compare them with unit tests in traditional software development, which check the quality of the data. They may check the total amount of data, the existence or non-existence of certain values, anomalies in the data, compare the data to the previous batch, and so on. It requires significant effort to build data quality tests, but it always pays back: they protect against errors in data engineering, data science, incoming data, and some system failures.
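Such unit-test-style checks on incoming batches can be sketched as follows (the record schema, the price rule, and the volume-drop threshold are all hypothetical, chosen only to show the pattern):

```python
def quality_checks(batch, prev_count):
    """Run unit-test-style checks on one batch of incoming records."""
    problems = []
    # Volume anomaly: compare this batch against the previous one.
    if prev_count and len(batch) < 0.5 * prev_count:
        problems.append("volume anomaly: batch shrank by more than half")
    for i, rec in enumerate(batch):
        # Existence and validity of required values.
        if rec.get("price") is None:
            problems.append(f"record {i}: missing price")
        elif rec["price"] <= 0:
            problems.append(f"record {i}: non-positive price")
        if not rec.get("id"):
            problems.append(f"record {i}: missing id")
    return problems

batch = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": -1.0},   # bad value slipped in upstream
    {"id": 3, "price": None},   # missing value
]
problems = quality_checks(batch, prev_count=100)
```

In a real pipeline a non-empty `problems` list would block the load or page an engineer, which is what keeps dirty data away from customers.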
In my experience almost every company builds a set of libraries and similar code to ensure data quality control. We did it at Google, we did it at Apple, we did it at Walmart.

At the Recruit Institute of Technology we work on the BigGorilla tool set, which will include our open source software and references to other open source software that may help companies build data quality pipelines.
Q7. What techniques can you use to leverage interesting metadata?
First, metadata must be discoverable. The biggest problem is that when you go to large scale with many data providers, it is hard to find metadata about them. So, to leverage metadata, the first step is to make it discoverable. For example, at WalmartLabs we had a huge win in search quality by promoting higher-quality items, after we discovered certain data by accident while looking into metadata. If we had had better techniques for finding the appropriate metadata, we could have done it earlier.

Second, metadata must be connected. You need a technique to understand that data described by metadata A are connected to data described by metadata B. This connection is itself meta-information about the data and belongs in the metadata, but frequently it is not there because the metadata for A and B were written separately by different teams or different companies.
There are open tools for discovering metadata: Apache Atlas; Goods, a system Google developed, uses internally, and has published results about; and Usagi, an open source system for metadata search that we are working on ("rabbit" in Japanese, because a rabbit has almost 360° vision).
Q8. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
Most frequently companies have some important metrics that describe the company's business. It might be the average revenue per session, the conversion rate, the precision of the search engine, etc. Your data insights are only as good as the improvements they bring to these metrics. Assume that in an e-commerce company the main metric is average revenue per session (ARPS), and you work on a project to improve the extraction of a certain item attribute, for example from unstructured text.

Questions to ask yourself: will it help improve ARPS by improving search, because it will increase relevance for queries with color intent or faceted queries by color, by providing better snippets, or by still other means? Sometimes one metric does not describe the company's business and many numbers are needed to understand it, and your data projects might be connected to other metrics. What is important is to connect your data insight project to metrics that are representative of the company's business, so that improving those metrics has a significant impact on the business. Such a connection makes a good project.
Q9. What were your most successful data projects? And why?
I've significantly improved the quality of Google search in several projects. With my collaborators I built Apple Maps search and significantly improved the quality of App Store search. At WalmartLabs, my team made a huge company-level improvement in the ATC (Add To Cart) metric by improving the search quality on walmart.com.

How do you make a project successful? Good knowledge helps. Working on spell correction, one must understand the different methods of spell correction, certainly the noisy channel model and other foundational methods such as Latent Dirichlet. Working on Type Ahead (AutoSuggest), one needs to understand certain sequential methods well. So good practical knowledge of the foundations is very important for success in data science projects.

One must be brave. When I started one project, I knew that six people had failed to do it before, but I started it nevertheless and it was successful. Frequently someone has already tried to address the same problem you are addressing, but you need to take it even further; projects in data science are never complete, and you want to squeeze more from the data and get more improvements. So being brave and persistent is important.
It is important to be inventive, both in problem seeking and in solution seeking.

Many of my successful projects were ones I found myself; management did not ask for them. To get a successful project one must find it by looking into the data and seeking opportunities. The same applies to finding a solution: standard methods rarely work, and you have to find your own way.
Q10. What are the typical mistakes done when analyzing data for a large scale data project? Can they be avoided in practice?
A typical mistake is assuming that the data are clean. Data quality should be examined and checked.
Q11. What are the ethical issues that a data scientist must always consider?
Privacy is a very important one. Most data are consumer data; they represent user interests, and frequently those interests are private.

Metrics should take ethical issues into consideration. Assume you compute just the click-through rate (CTR) for web search.

Then pornographic results might have a higher CTR, even when they are not relevant to the query and even for audiences you certainly do not want to show them to. Pushing them up in the ranking will increase your metric. If you define CTR as the main optimization criterion, then data science projects may optimize intensively for that criterion, which might include the (unintentional) promotion of pornographic results. Your metrics and experiment evaluation systems must be aware of such issues.
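One way to sketch such a guardrail (the numbers and the "flagged" field are invented for illustration, not a method from the interview) is to compute the metric so that clicks on policy-violating results cannot inflate it:

```python
results = [
    # clicks / impressions per result; "flagged" marks policy-violating content
    {"clicks": 30, "impressions": 100, "flagged": False},
    {"clicks": 60, "impressions": 100, "flagged": True},
]

def naive_ctr(rs):
    # The raw metric: total clicks over total impressions.
    return sum(r["clicks"] for r in rs) / sum(r["impressions"] for r in rs)

def guarded_ctr(rs):
    # Count clicks on flagged results as zero so they cannot inflate the metric.
    return sum(0 if r["flagged"] else r["clicks"] for r in rs) / \
           sum(r["impressions"] for r in rs)
```

Under the naive metric the flagged result looks like a win; under the guarded metric its clicks contribute nothing, so an optimizer has no incentive to promote it.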