Q&A with Data Scientist: Vikas Agrawal
Dr. Vikas Agrawal works as a Senior Principal Data Scientist in Cognitive Computing for Advanced Analytics for Oracle Corporation. Vikas has created activity context-aware Virtual Personal Assistants, Real-Time Asset Management systems, Reliability Risk Management and Predictive Maintenance systems for manufacturing, mining, healthcare, insurance, pharma, retail and investment banks while at Intel Corporation and Infosys Limited. Vikas received a BTech in Electrical Engineering from the Indian Institute of Technology, New Delhi, an MS in Computer Science and a PhD in Computational Modeling from the University of Delaware (Newark, DE), and conducted post-doctoral research at the California Institute of Technology (Caltech), Pasadena, CA, in PDE-based modeling of stem-cell population development and differentiation with Jet Propulsion Laboratory (JPL) researchers for NSF FIBR. For additional details, please see here.
Q1. What should every data scientist know about machine learning?
Every data scientist must be aware of the limitations of machine learning, and that the models only apply within boundary conditions.
0. Before talking about any algorithms, tools or techniques, one must define the business problem being solved, the technology options available, and the cost of potential solutions vs. the business benefits. The validation of machine learning models depends crucially on the purpose of the models. These models are good for specific purposes (perhaps somewhat extensible), come with specific precision, recall and boundary conditions, and are based on specific prior datasets. So, one can call a model validated and useful if it performs at the level of precision, recall and accuracy that is necessary in business use, for a solid return on the investment in modelling.
1. A false positive rate of 5%, a false negative rate of 10% or a confidence level of 95% may be too weak for many practical large scale applications, especially for industrial control, supply chain optimization, manufacturing quality and reliability models. Depending on the training data size, and the expected volume of data the model will encounter, one may need significance thresholds of 0.001 rather than 0.05, and confidence levels of 99.9% or better, where one wants to make serious decisions based on the models.
2. Before one does any machine learning, one needs to get a well-engineered and cleaned version of the data, and creating that data pipeline may take a large chunk of time.
3. At one level, machine learning is a way to summarize past data, and simulations from that learning simply reflect the outcomes from past data upon new data. Sometimes the assumption that the new data will look like the past does not hold true, and one must have a good feedback mechanism for when errors begin to exceed acceptable thresholds. With that feedback, one can improve the models.
4. The choice of features, feature-engineering, choice of training/test samples and sample sizes are critical to the discriminating quality in the end results.
5. Overfitting can become a serious issue. Some algorithms, such as decision trees, will produce a reasonable-looking fit no matter what data they are given.
6. Black-box vs. Representational/White-Box Models: While it is sometimes fashionable to lean towards neural-network-based or “black-box” type models, which have no one-to-one correspondence between the model parameters and the world they intend to model, these are good only if one does not need to explain the models or make specific directed changes to them.
One needs to understand if the models make physical sense in the domain.
7. Often one needs a combination of domain knowledge driven expert systems, statistical models, partial differential equations and black-box models to solve a single problem end to end. Very rarely will a single algorithm suffice.
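To make the point in item 1 concrete, here is a minimal arithmetic sketch of what a 5% false positive rate means at scale. The workload figures (one million readings per day, one real fault per thousand readings) are hypothetical values chosen for illustration and are not from the interview.

```python
def expected_false_alarms(n_events, fault_rate, false_positive_rate):
    """Expected count of healthy events wrongly flagged as faults."""
    healthy_events = n_events * (1.0 - fault_rate)
    return healthy_events * false_positive_rate


def alarm_precision(fault_rate, false_positive_rate, false_negative_rate):
    """Fraction of raised alarms that correspond to real faults."""
    true_alarms = fault_rate * (1.0 - false_negative_rate)
    false_alarms = (1.0 - fault_rate) * false_positive_rate
    return true_alarms / (true_alarms + false_alarms)


# Hypothetical workload: 1,000,000 readings per day, 1 in 1,000 is a real fault.
N_EVENTS = 1_000_000
FAULT_RATE = 0.001

daily_false_alarms = expected_false_alarms(N_EVENTS, FAULT_RATE, 0.05)
precision_loose = alarm_precision(FAULT_RATE, 0.05, 0.10)   # 5% FPR, 10% FNR
precision_tight = alarm_precision(FAULT_RATE, 0.001, 0.10)  # 0.1% FPR, 10% FNR

print(f"false alarms per day at 5% FPR: {daily_false_alarms:,.0f}")
print(f"precision at 5% FPR:   {precision_loose:.3f}")
print(f"precision at 0.1% FPR: {precision_tight:.3f}")
```

Under these assumed numbers, a 5% false positive rate produces roughly 50,000 false alarms a day, and fewer than 2 in 100 alarms point at a real fault. Tightening the false positive rate to 0.1% brings that to a little under half.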
Q2. Is domain knowledge necessary for a data scientist?
Domain knowledge is particularly important in the following cases:
1. Where there are well-known and well-established physical relationships between variables. For instance, one can already express certain relationships in terms of equations in physics, chemistry, signal processing etc. It is a waste of time to re-infer those relationships, and a mistake to ignore them. Even if one is trying to discover novel relationships or disprove well-known relationships, it really helps to formulate a well-defined null hypothesis.
2. General purpose vs. domain specific solutions: Sometimes we try to create general purpose anomaly detection algorithms or create platforms that allow others to create such solutions. It is important to realize peculiarities of certain datasets.
For instance, streaming data from sensors on pumps, motors, trucks and pipelines may arrive as high velocity time series datasets, which may need to go through FFTs / DFTs / Wavelet transforms before one can do a reasonable Kernel Density Estimation and estimate Kullback-Leibler Divergence using Monte-Carlo or Variational approximation methods. Without frequency domain information, it is hard to make sense of time series data. Similarly, how one deals with electric current data will be different from how one processes particulate density in motor oils.
3. Where there are behavioral, psychological, and other human factors to actions and their intents. For instance, we may know that people managing very large portfolios are unlikely to send out emails with negative sentiments or are likely to keep the conversations on a professional level, concealing their potential intent to move their portfolio. So, if they send out hundreds of positive emails, but just one negative email, that negative signal must have very high recall in our algorithms, even at the risk of low precision for positive signals, if we are trying to predict risk of flight of the portfolio holder.
4. Where we can use domain knowledge to determine the requirements such as precision and recall tradeoffs. In legal applications high recall is the norm, in advertising high precision is the norm, and in engineering both high precision and recall are required.
5. Especially for highly regulated environments, one may need to choose algorithms that are well-accepted and that can be interrogated. Therefore, one has to be very sensitive to “traditional” usage, taking a two phase approach, where one may begin with the most effective algorithms, find the key pieces of information, and then translate the models back to a regulatory-compliant set of algorithms. For example, in fraud detection at banks, one has to create deep neural network based models for what is normal, find deviations from the normal using distribution divergence techniques, and then use these models to find tighter or optimal bounds for expert systems rules that are acceptable to regulators.
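The Kernel Density Estimation and Monte-Carlo Kullback-Leibler step mentioned in the sensor example of item 2 can be sketched as follows. This is only an illustration under stated assumptions: a one-dimensional Gaussian kernel with a fixed, hand-picked bandwidth, synthetic normal samples standing in for a sensor feature, and hypothetical function names. The frequency-domain transform that would normally come first is left out here.

```python
import math
import random


def gaussian_kde(samples, bandwidth):
    """Return a density function built from a Gaussian kernel on each sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return pdf


def kl_monte_carlo(p_samples, q_samples, bandwidth, n_draws, rng):
    """Monte-Carlo estimate of KL(p || q) between two KDE-smoothed samples."""
    p = gaussian_kde(p_samples, bandwidth)
    q = gaussian_kde(q_samples, bandwidth)
    tiny = 1e-300  # floor to avoid log(0) if a density underflows
    total = 0.0
    for _ in range(n_draws):
        # Drawing from a Gaussian KDE: pick a sample, add kernel noise.
        x = rng.choice(p_samples) + rng.gauss(0.0, bandwidth)
        total += math.log(max(p(x), tiny)) - math.log(max(q(x), tiny))
    return total / n_draws


rng = random.Random(0)
baseline = [rng.gauss(0.0, 1.0) for _ in range(200)]  # "normal" behavior
same = [rng.gauss(0.0, 1.0) for _ in range(200)]      # new data, unchanged
shifted = [rng.gauss(2.0, 1.0) for _ in range(200)]   # new data, drifted

kl_same = kl_monte_carlo(baseline, same, 0.3, 300, rng)
kl_shift = kl_monte_carlo(baseline, shifted, 0.3, 300, rng)
print(f"KL(baseline || same)    ~ {kl_same:.2f}")
print(f"KL(baseline || shifted) ~ {kl_shift:.2f}")
```

The drifted sample should score a clearly larger divergence than the unchanged one. Where to set the alarm threshold between the two is a domain decision, in line with the precision and recall tradeoffs of item 4.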
Q3. What is your experience with data blending?
I have worked with several datasets where one needed to bring together email, phone call recordings, account KPIs, customer notes, social media data (or temperature, pressure, vibration, particulate count, current drawn) etc. These were collected at different time grains and across different dimensions, made it complex to find common identities for the people and transactions on which to join the datasets, and had multiple levels of missing information. The key challenge in data blending is to start from the top level data-set schema required, along with the dimensional granularity needed to create machine learning based models. Domain knowledge of which specific transformations turn raw data into features with meaningful impact on dependent variables is crucial.
For instance, one may need to convert time series data into chunked-time data, summarized as frequency domain, and then use a time series of the frequency domain summaries. Or one may need to extract entities, relationships, and sentiments expressed about them from emails and phone call recordings, order them by time/user, and aggregate them to highlight them in directed search instances. Or one may need to begin from the KPIs and marry every other signal back to the KPIs which occur in structured datasets. These steps require multiple levels of summarization, keeping algorithmic provenance of all data transformations, multiple levels of inference along with natural language processing, time series analytics, and physics-of-the-domain based models.
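The first transformation described above (time series, to time chunks, to a frequency-domain summary per chunk, to a time series of those summaries) can be sketched with a plain discrete Fourier transform. The choice of summary (the dominant frequency), the chunk size, the sample rate and the synthetic signal are all assumptions made for illustration. A real pipeline would more likely use an FFT library and a richer summary.

```python
import cmath
import math


def dft_magnitudes(chunk):
    """Magnitudes of the DFT bins from 0 up to the Nyquist bin."""
    n = len(chunk)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(chunk)
        )
        magnitudes.append(abs(acc) / n)
    return magnitudes


def dominant_frequency(chunk, sample_rate_hz):
    """Frequency in Hz of the strongest non-DC bin in one chunk."""
    magnitudes = dft_magnitudes(chunk)
    k = max(range(1, len(magnitudes)), key=lambda i: magnitudes[i])
    return k * sample_rate_hz / len(chunk)


def chunked_frequency_summary(signal, chunk_size, sample_rate_hz):
    """One dominant-frequency value per non-overlapping chunk."""
    return [
        dominant_frequency(signal[i:i + chunk_size], sample_rate_hz)
        for i in range(0, len(signal) - chunk_size + 1, chunk_size)
    ]


# Synthetic vibration signal: 2 s at 5 Hz, then 2 s at 12 Hz, sampled at 64 Hz.
SAMPLE_RATE_HZ = 64
CHUNK_SIZE = 64  # one-second chunks
signal = [math.sin(2 * math.pi * 5 * t / SAMPLE_RATE_HZ) for t in range(128)]
signal += [math.sin(2 * math.pi * 12 * t / SAMPLE_RATE_HZ) for t in range(128)]

summaries = chunked_frequency_summary(signal, CHUNK_SIZE, SAMPLE_RATE_HZ)
print(summaries)
```

The resulting list is itself a short time series, one value per second, in which the change of regime from 5 Hz to 12 Hz shows up as a step.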
Q4. Predictive Modeling: How can you perform accurate feature engineering/extraction?
Accurate predictive modeling depends crucially on the set of predictors that one finds as statistically significant.
1. Often good feature engineering depends on strong knowledge of entities in the domain and their relationships. It also depends on transformations of the existing features into versions that better discriminate between the cases in ways meaningful to the quantities being predicted. For instance, rate of change of current or temperature or pressure or velocity may be predictive of a potential failure. Or the accumulation of stress-inducing flexes on a board might be more predictive of its fracture. Or a combination of temperature, pressure and particulates in a pipeline may best predict its risk rather than just one of them.
2. Beyond domain knowledge, experimentation with variants and invariants in the areas of image recognition, speech recognition, and handwriting recognition has yielded interesting features.
3. More recently, I have found success with large scale feature generation which amounts to large scale hypothesis generation, such that one can pick out variable combinations and transformations that are more likely to be predictive based on their success or lack thereof in explaining variation in the source data.
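As a toy illustration of items 1 and 3, the sketch below generates candidate features (raw signals, their rates of change, and pairwise products) and ranks them by how strongly each tracks the quantity being predicted. Ranking by absolute Pearson correlation is a stand-in chosen for brevity, not the method used in the work described. The sensor readings and the risk score are made up, and the function names are hypothetical.

```python
import itertools
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rate_of_change(series):
    """First difference, padded with 0.0 so the length is unchanged."""
    return [0.0] + [b - a for a, b in zip(series, series[1:])]


def generate_candidates(columns):
    """Raw columns, their rates of change, and all pairwise products."""
    candidates = dict(columns)
    for name, values in columns.items():
        candidates[f"d_{name}"] = rate_of_change(values)
    for (a, va), (b, vb) in itertools.combinations(columns.items(), 2):
        candidates[f"{a}*{b}"] = [x * y for x, y in zip(va, vb)]
    return candidates


def rank_candidates(candidates, target):
    """Candidates sorted by absolute correlation with the target, best first."""
    scored = [(abs(pearson(v, target)), name) for name, v in candidates.items()]
    return sorted(scored, reverse=True)


# Made-up readings; the risk score is constructed to depend on the combination.
columns = {
    "temperature": [20.0, 22.0, 25.0, 21.0, 30.0, 28.0, 24.0, 35.0],
    "pressure": [1.0, 1.5, 0.8, 2.0, 1.2, 0.9, 1.7, 1.1],
}
risk = [t * p for t, p in zip(columns["temperature"], columns["pressure"])]

ranked = rank_candidates(generate_candidates(columns), risk)
for score, name in ranked:
    print(f"{name:22s} {score:.3f}")
```

Because the made-up risk score is built from the product of the two signals, the combined feature ranks first, ahead of either signal alone. When screening a large number of generated candidates in this way, some will look predictive by chance, so the surviving features still need validation on held-out data.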
Q5. Can data ingestion be automated?
As long as there is a need for humans to provide credentials to access the datasets, and as long as there are fields with complex mappings that cannot be inferred based on machine learning or statistical techniques, data ingestion will remain semi-automated.
For a given set of data sources this can be automated assuming those sources do not change significantly over time.
Q6. How do you ensure data quality?
There are four steps to ensuring data quality:
1. We do not admit any spurious data source into our repository. If we do admit the spurious source, we must clearly keep track of the origin, the dates of ingestion and the specific parts that have low quality information. If low quality is discovered at a later stage, it must be back-propagated to the source system and its owners. It is much more expensive to remove spurious data after it has been included.
2. For each data source, we must keep an algorithmically reproducible provenance of the data, including the owner/producer, and the exact series of transformations in a currently executable programming language with dates.
3. We run data quality metrics on all sources, flag missing data and poorly distributed datasets, and highlight physically impossible, inaccurate or incorrect datasets based on models of the domain.
4. Have every element of the data pipeline peer-reviewed; there is no substitute for having smart creative colleagues!
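A minimal sketch of the kind of check described in step 3 might look like the following. It counts missing values and values outside physically possible limits for each column of one source. The column names, the limits and the report structure are hypothetical. Real limits would come from models of the domain, and checks on distribution shape are omitted.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    source: str          # where the rows came from, kept for provenance
    n_rows: int
    missing: dict        # column -> count of missing values
    out_of_range: dict   # column -> count of physically impossible values


def check_quality(source, rows, physical_limits):
    """Count missing and physically impossible values per column."""
    missing = {col: 0 for col in physical_limits}
    out_of_range = {col: 0 for col in physical_limits}
    for row in rows:
        for col, (low, high) in physical_limits.items():
            value = row.get(col)
            if value is None:
                missing[col] += 1
            elif not (low <= value <= high):
                out_of_range[col] += 1
    return QualityReport(source, len(rows), missing, out_of_range)


# Hypothetical limits: nothing is colder than absolute zero, and an
# absolute pressure cannot be negative.
limits = {
    "temperature_c": (-273.15, 1000.0),
    "pressure_kpa": (0.0, 10000.0),
}
rows = [
    {"temperature_c": 85.0, "pressure_kpa": 310.0},
    {"temperature_c": None, "pressure_kpa": 305.0},
    {"temperature_c": -400.0, "pressure_kpa": -5.0},
]

report = check_quality("pump_17_sensor_feed", rows, limits)
print(report)
```

Keeping the source name on the report is one simple way to trace a low-quality finding back to the source system and its owners, as step 1 requires.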
Q7. What techniques can you use to leverage interesting metadata?
Metadata can take multiple forms, ranging from basic schema-level information to statistical summaries, indexes, frequency of usage by users, occurrence of usage in conjunction with other sets, lists of similar sources, links to the original source, and the algorithmically reproducible transformations that the data has undergone. In addition, metadata can be extracted from sources like text of documents, emails, notes, call audio and meeting video to be made available for directed search.
1. Natural language processing based metadata can be leveraged for enterprise search with Apache Solr and Lucene.
2. Transformations metadata can be used to ensure data quality and provenance at any point.
3. Indices and frequency of usage or co-usage can be used as episodic memory to derive semantic memory for generalizations that can help predict user intent, and help guide the user with chatbots and intelligent context-aware speech enabled virtual personal assistants.
Q8. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain? Thank you. RVZ
Here are my criteria for an insight being “correct”, “good” or “relevant”:
1. Right Granularity, Real Time Availability and Relevance: The insight is relevant if it is at a level that the business user can directly use for decision making. For instance, information about vibration levels at a pump is likely irrelevant to a site supervisor, but information that this pump will fail in four weeks is useful, and furthermore if the model can tell the supervisor that if this pump fails it will cost $11M in down-time for the plant, and it will cost $58K to ship a part from Switzerland to repair it overnight, then this is information one can directly act upon.
The data should be processed as quickly as feasible, and insights made available alongside the transaction user interfaces where decisions are made. Traditional warehouse driven BI systems (OLAP) do not provide this information at the time of online transactions (OLTP). If the insight is found after the time for action has passed, it is obviously good, but not really useful; as they say, hindsight is always 20/20.
2. Accuracy/Dependability and Correctness: The insight is “correct” if it is high precision, high recall, and has a low margin of error, based on the defined business needs. Getting higher accuracy than the business need at the pre-defined cost is always good, but may be superfluous if it is at a very high expense of resources or time. This is usually true of hybrid models that use a combination of domain knowledge based mathematical models, statistical boundary conditions and machine learning models. Of course, it is better to have no insights, than to have incorrect or misleading so-called insights.
3. Degree of Improvement or Comparability to Human Judgement: The models included for insight generation are good for the business if I can show the human experts that, even if they put in all their effort and expertise, took the time, and analyzed the data we have, they could not come up with better decision-driving information given the time and data available.
We were able to demonstrate this for multiple cases where probably approximately correct results provided at the time of transactions turned out to be better than fully correct hindsight in a month or incorrect insights available instantly.
This is often trivially true for business analytics and aggregations, done at scale with expert systems like rules for alerts, and also for certain kinds of image processing problems. This is also true for customer recommendations systems at online retailers. This needs to be made true for industrial systems like airplanes, chemical plants, mines, supply chain systems and hospitals.
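The pump example in item 1 can be turned into a simple expected-cost comparison, which is one way to see why such an insight is directly actionable. The two dollar figures come from the example above. The failure probabilities are hypothetical, and the sketch assumes that the overnight repair fully averts the downtime.

```python
DOWNTIME_COST = 11_000_000  # cost of plant down-time if the pump fails
SHIPPING_COST = 58_000      # cost of shipping the part overnight


def should_ship_part(p_failure, downtime_cost, shipping_cost):
    """Ship when the expected down-time loss exceeds the shipping cost."""
    return p_failure * downtime_cost > shipping_cost


# Failure probability above which shipping the part pays for itself.
break_even_probability = SHIPPING_COST / DOWNTIME_COST

print(f"break-even failure probability: {break_even_probability:.4f}")
print(should_ship_part(0.60, DOWNTIME_COST, SHIPPING_COST))   # likely failure
print(should_ship_part(0.001, DOWNTIME_COST, SHIPPING_COST))  # very unlikely
```

With these figures the break-even point is a failure probability of about 0.5%, so even a modestly confident prediction of failure justifies ordering the part.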