Q&A with Data Scientists: Slava Akmaev
Slava Akmaev, Ph.D. is Senior Vice President and Chief Analytics Officer at Berg, LLC.
Dr. Akmaev is a leader in developing artificial intelligence applications in healthcare, drug development and diagnostics. Dr. Akmaev is the chief architect of the proprietary bAIcis™ platform, highly parallelized Bayesian network learning software that is typically deployed on an HPC cluster and used to analyze extremely large data sets. He works closely with the drug and biomarker development teams and directs research informatics and health IT analytics within Berg. Prior to joining Berg, Slava was Vice President of Scientific Affairs at a Big Data analytics company, Scientific Associate Director at Genzyme Genetics and a Bioinformatics Investigator at Genzyme. Dr. Akmaev holds a Ph.D. in Applied Mathematics from the University of Colorado at Boulder. He has published over 20 peer-reviewed scientific manuscripts and presented his work at numerous scientific and commercial meetings.
Q1. Is domain knowledge necessary for a data scientist?
I believe domain knowledge is critical in certain industries. I spend most of my time in Life Sciences working with high-throughput molecular data such as genomics, proteomics, and metabolomics. I cannot overemphasize the importance of understanding the data source, as it defines the systematic and technical trends in variation that need to be accounted for and normalized.
For example, when working with metabolomics data from a mass spectrometer, it is understood that there is a sensitivity threshold below which the signal is lost and accurate estimation of the analyte concentration becomes impossible, leading to missing data. Researchers often impute the missing information at the lowest observed levels with empirical noise estimates. When we talk about proteomics, most of the time label-based high-throughput technologies are used to generate the data. The data points presented for analysis are the relative analyte concentrations between the study sample and a reference sample used for the entire study. In this instance, missing data does not imply a low concentration; it may simply indicate issues with the analyte in the reference biological sample, and the imputation procedure must be more sophisticated and suited to this scenario.
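To make this concrete, here is a minimal sketch of the left-censored imputation idea in Python with numpy and pandas; the function name, noise-scale parameter, and toy data are illustrative assumptions, not the exact procedure used at Berg:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def impute_left_censored(df, noise_scale=0.1):
    """Impute missing intensities assuming values are missing because
    they fell below the instrument's sensitivity threshold
    (left-censored data).

    Each missing value is replaced by the column's observed minimum
    perturbed with empirical noise, so imputed points sit at the low
    end of the distribution rather than at the column mean.
    """
    imputed = df.copy()
    for col in imputed.columns:
        observed = imputed[col].dropna()
        floor = observed.min()                   # lowest observed level
        noise_sd = noise_scale * observed.std()  # empirical noise estimate
        n_missing = imputed[col].isna().sum()
        draws = rng.normal(loc=floor, scale=noise_sd, size=n_missing)
        imputed.loc[imputed[col].isna(), col] = np.clip(draws, 0.0, None)
    return imputed

# Toy metabolomics matrix: rows are samples, columns are analytes.
data = pd.DataFrame({
    "metabolite_a": [5.2, 4.8, np.nan, 6.1],
    "metabolite_b": [np.nan, 2.9, 3.4, np.nan],
})
print(impute_left_censored(data))
```

For the label-based proteomics scenario, where a missing ratio does not signal a low concentration, a detection-floor rule like this would be the wrong choice, which is precisely the distinction drawn above.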
Another aspect to think about is information flow. We often deal with data frames or matrices that represent variables and observations. Sometimes it is important to understand the signal flow among those variables. In the context of molecular biology, a data frame may contain information about genes or DNA, mRNA expression, protein expression, and, further downstream, metabolite concentrations. In most scenarios, we would not be looking to predict gene variability from any of the downstream data because, in biology, we understand that DNA drives the molecular signaling cascade in the cell. It is worth mentioning, though, that domain knowledge can sometimes lead to biases in data analysis. I often recommend taking a fresh look at some of the “well-known” hypotheses and keeping an open mind toward data-driven methodology and analysis outcomes.
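As one way to make the information-flow constraint concrete, the sketch below builds a “blacklist” of edges that would point upstream against the central dogma and hands it to a structure-learning routine. It assumes the HillClimbSearch/BicScore API of the open-source pgmpy library, and the input file and variable names are hypothetical; this is not the bAIcis implementation:

```python
from itertools import product

import pandas as pd
from pgmpy.estimators import BicScore, HillClimbSearch

# Molecular layers ordered by biological information flow:
# DNA -> mRNA -> protein -> metabolite.
layers = {
    "gene_variant": 0,
    "mrna_expr": 1,
    "protein_expr": 2,
    "metabolite_conc": 3,
}

# Forbid any edge that would point "upstream" against the central
# dogma, e.g. metabolite_conc -> gene_variant.
black_list = [
    (a, b) for a, b in product(layers, repeat=2) if layers[a] > layers[b]
]

data = pd.read_csv("omics_profiles.csv")  # hypothetical input table
search = HillClimbSearch(data)
model = search.estimate(scoring_method=BicScore(data), black_list=black_list)
print(sorted(model.edges()))
```

Note that the constraints encode only the direction of information flow; whether each permitted edge is present at all is still decided by the data, which keeps the domain knowledge from hard-coding any particular hypothesis.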
Q2. What should every data scientist know about machine learning?
Machine learning algorithms are powerful tools for exploratory data analysis. With the advancements in computing, a tremendous variety of these algorithms is now available as user-friendly software that can be run on a conventional laptop computer and is capable of analyzing relatively large data sets. Oftentimes the software is plug-and-play, where a deep understanding of the underlying mathematics is not required and the analysis can be performed by a data analyst or even a domain expert. I believe it is helpful for data scientists who work with machine learning tools to understand the limitations of their data and to have a general idea of the potential pitfalls of the machine learning methodology. It is very common these days to use deep learning or, in more scientific terminology, multi-layer neural networks for pattern recognition. Neural network approaches have been very successful in image analysis, text search, and consumer experience. On the other hand, these methods have not been very successful in Life Sciences research and, specifically, in finding reproducible pathology patterns in high-throughput molecular data such as genomics.
Q3. What are the most effective machine learning algorithms?
In my opinion, there is no “one size fits all” in data science and machine learning. The choice of a specific algorithm depends on the data and its properties.
Aspects such as normalization and systematic variability, technical deviations and variance fluctuation, and the normality assumption and parametric modeling play a critical role in the algorithmic approach. What may be even more important is the data “shape”, i.e., whether the data set is deep, with a large number of observations and a relatively small number of factors/variables, or long, where the number of variables greatly exceeds the number of observations. For the latter case, I strongly prefer techniques such as Bayesian networks (BNs). BNs offer a data-driven approach to the inference of probabilistic causality. Through network learning, or effectively through optimization of a global variable structure model, one can identify critical factors driving a specific outcome, decipher explicit causality mechanisms, and create parameterized mathematical models of real-world processes.
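As a sketch of what such network learning might look like on “long” data, the following again assumes pgmpy's hill-climbing search with a BIC score and then reads off the learned parents of an outcome node; the file name, outcome column, and in-degree cap are assumptions for illustration:

```python
import pandas as pd
from pgmpy.estimators import BicScore, HillClimbSearch

# "Long" data: far more variables (columns) than observations (rows).
data = pd.read_csv("molecular_profiles.csv")  # hypothetical input
n_obs, n_vars = data.shape
assert n_vars > n_obs  # the regime discussed above

# Score-based structure learning: hill-climb over DAGs, optimizing a
# global BIC score. Capping the in-degree is one pragmatic way to keep
# the search tractable when every node has thousands of candidate parents.
search = HillClimbSearch(data)
dag = search.estimate(scoring_method=BicScore(data), max_indegree=3)

# The factors most directly "driving" the outcome are its parents
# in the learned network.
outcome = "disease_status"  # hypothetical outcome column
print("Direct drivers:", list(dag.get_parents(outcome)))
```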
Q4. What is your experience with data blending? (*)
I often work with a myriad of data types in healthcare. It can be a mix of molecular data and digital health IT data such as claims information, patient records, lab results, pharmacy records, etc. It is important to use tools that can create inferences in a data-agnostic manner. Some machine learning techniques are “blending-ready”, for example, Bayesian networks. Other tools might require more time spent upfront on data normalization, discretization, variance stabilization, and other data processing tasks. In general, the ability to analyze disparate data is a profound advantage in data science, as predictive power is greatly enhanced by combining data streams and data modalities.
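Here is a minimal sketch of that upfront processing, blending hypothetical lab, claims, and proteomics tables into a single discretized, patient-level frame with pandas; all file and column names are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical disparate healthcare tables keyed by patient ID.
labs = pd.read_csv("lab_results.csv")        # continuous lab values
claims = pd.read_csv("claims.csv")           # categorical diagnosis codes
proteomics = pd.read_csv("proteomics.csv")   # relative abundance ratios

def discretize(series, bins=3, labels=("low", "mid", "high")):
    """Quantile-discretize a continuous measurement so it can sit
    alongside categorical variables in one data-agnostic model."""
    return pd.qcut(series, q=bins, labels=labels)

# Variance stabilization for ratio-scale proteomics (log transform),
# then discretization to the shared categorical representation.
proteomics["abundance"] = discretize(np.log2(proteomics["ratio"]))
labs["creatinine"] = discretize(labs["creatinine"])

# Blend the modalities into one observation-per-patient frame,
# ready for, e.g., Bayesian network learning.
blended = (
    labs[["patient_id", "creatinine"]]
    .merge(claims[["patient_id", "diagnosis_code"]], on="patient_id")
    .merge(proteomics[["patient_id", "abundance"]], on="patient_id")
)
print(blended.head())
```

Once every modality shares a categorical representation, a “blending-ready” method such as a Bayesian network can treat the combined columns uniformly.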
Q5. Can data ingestion be automated?
This is a great question, and I guess the simple answer is yes, data ingestion can and should be automated. Unfortunately for all of us data scientists, it is very rarely automated. Here is my rule of thumb for any given project: the project team is likely to spend three times as much time “ingesting” customer data as analyzing it.
What are the issues here? While we hear more and more about data warehousing and all types of cloud solutions to streamline data acquisition and storage, the reality is that a lot of real-world healthcare data is disorganized and is stored and managed by legacy software. Some of the data formats are archaic; they date back to the 1980s. It is unclear whether this is a temporary issue related to the latest advances in IT or a permanent cycle in the IT innovation curve. There is no guarantee that future generations of data scientists will not be struggling to “ingest” Hadoop data into the most modern data management platform.
Q6. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
In a data-rich domain, the correctness of an insight is evaluated either by applying the mathematical model to new “unseen” data or by using cross-validation. This process is more complicated in human biology. As we have learned over the years, a promising cross-validation performance may not be reproducible in subsequent experimental data. The fact of the matter is that, in life sciences, laboratory validation of computational insight is mandatory. The community perspective on computational or statistical discovery is generally skeptical until the novel analyte, therapeutic target, or biomarker is validated in additional confirmatory laboratory experiments, pre-clinical trials, or human fluid samples.
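For the data-rich case, here is a minimal scikit-learn sketch of both evaluation routes, cross-validation on the training data and a truly held-out “unseen” set; the model and the synthetic data are placeholders, and, as stressed above, in life sciences neither route replaces laboratory validation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # synthetic molecular features
y = rng.integers(0, 2, size=200)  # synthetic binary outcome

# Hold out truly "unseen" data before any model selection happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)

# Cross-validation estimates performance from the training data alone...
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv)
print("CV accuracy:", cv_scores.mean())

# ...while the held-out set is the closer analogue of a confirmatory
# experiment on new data.
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```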
Q7. What were your most successful data projects?
I would like to refer the reader to our recent publication in Artificial Intelligence in Medicine. It is a good example of a successful data project in which we used a data-driven approach to discover a novel drug-outcome relationship in a hypothesis-free manner.
We discovered a strong relationship between kidney disease and asthma diagnoses, which puzzled us at first. However, after some digging into the scientific and clinical research on kidney disease, it became clear that the asthma patient population makes high use of certain therapeutic agents that have been scientifically proven, in biological models, to have the potential to induce kidney damage.
This is a novel hypothesis that needs to be validated in confirmatory epidemiological studies. Nevertheless, it is a remarkable story of how artificial intelligence methods can assist the human researcher in an unbiased and strictly data-driven discovery.