On Big Data Analytics. Interview with Anthony Bak
“The biggest challenge facing data analytics is how to turn complex data into actionable information. One way to think about complexity is that there are many stories happening simultaneously in the data – some relevant to the problem being solved but most irrelevant. The goal of Big Data Analytics is to find the relevant story, reducing complexity to actionable information.”–Anthony Bak
On Big Data Analytics, I have interviewed Anthony Bak, Data Scientist and Mathematician at Ayasdi.
Q1. What are the most important challenges for Big Data Analytics?
Anthony Bak: The biggest challenge facing data analytics is how to turn complex data into actionable information. One way to think about complexity is that there are many stories happening simultaneously in the data – some relevant to the problem being solved but most irrelevant. The goal of Big Data Analytics is to find the relevant story, reducing complexity to actionable information. How do we sort through all the stories in an efficient manner?
Historically, organizations extracted value from data by building data infrastructure and employing large teams of highly trained data scientists who spent months, and sometimes years, asking questions of the data to find breakthrough insights. The probability of discovering these insights is low because there are too many questions to ask and not enough data scientists to ask them.
Ayasdi’s platform uses Topological Data Analysis (TDA) to automatically find the relevant stories in complex data and operationalize them to solve difficult and expensive problems. We combine machine learning and statistics with topology, allowing for ground-breaking automation of the discovery process.
Q2. How can you “measure” the value you extract from Big Data in practice?
Anthony Bak: We work closely with our clients to find valuable problems to solve. Before we tackle a problem, we quantify both its value to the customer and the outcome that would deliver that value.
Q3. You use so-called Topological Data Analysis. What is it?
Anthony Bak: Topology is the branch of pure mathematics that studies the notion of shape.
We use topology as a framework combining statistics and machine learning to form geometric summaries of Big Data spaces. These summaries allow us to understand the important and relevant features of the data. We like to say that “Data has shape and shape has meaning”. Our goal is to extract shapes from the data and then understand their meaning.
While there is no complete taxonomy of all geometric features and their meanings, there are a few simple patterns that we see in many data sets: clusters, flares and loops.
Clusters are the most basic property of shape a data set can have. They represent natural segmentations of the data into distinct pieces, groups or classes. For example, an analysis might find two clusters of doctors committing insurance fraud.
Having two groups suggests that there may be two types of fraud represented in the data. From the shape we extract meaning or insight about the problem.
That said, many problems don’t naturally split into clusters, and we have to use other geometric features of the data to get insight. We often see a core of data points that are all very similar, representing “normal” behavior, and coming off the core we see flares of points. Flares represent ways and degrees of deviation from the norm.
An example might be gene expression levels for cancer patients where people in various flares have different survival rates.
Loops can represent periodic behavior in the data set. An example might be patient disease profiles (clinical and genetic information) where they go from being healthy, through various stages of illness and then finally back to healthy.
The loop in the data is formed not by a single patient but by sampling many patients in various stages of disease. Understanding and characterizing the disease path potentially allows doctors to give better, more targeted treatment.
Finally, a given data set can exhibit all of these geometric features simultaneously as well as more complicated ones that we haven’t described here. Topological Data Analysis is the systematic discovery of geometric features.
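As a rough illustration of how clusters and loops can be read off a point cloud, the sketch below connects nearby points into a neighborhood graph and then counts its connected components (clusters) and independent cycles (loops). This is a simplified stand-in, not Ayasdi's method: the function names, the toy data and the distance threshold `eps` are all assumptions made for the example, and the cycle count of a plain graph over-counts once `eps` is large enough to create triangles, which is why fuller TDA tools work with richer structures than graphs.

```python
import math
from itertools import combinations


def neighborhood_graph(points, eps):
    """Connect every pair of points closer than eps (Euclidean distance)."""
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if math.dist(points[i], points[j]) < eps
    ]


def count_components(n, edges):
    """Count connected components with a simple union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})


def shape_summary(points, eps):
    """Return (clusters, independent_cycles) of the neighborhood graph."""
    edges = neighborhood_graph(points, eps)
    clusters = count_components(len(points), edges)
    # Cycle rank of a graph: edges - vertices + components.
    cycles = len(edges) - len(points) + clusters
    return clusters, cycles


# Toy data (hypothetical): two separated groups of points on short segments.
two_groups = [(0.1 * k, 0.0) for k in range(5)] + [(5.0 + 0.1 * k, 0.0) for k in range(5)]

# Toy data (hypothetical): 12 points evenly spaced on a unit circle.
circle = [
    (math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12))
    for k in range(12)
]

print(shape_summary(two_groups, eps=0.15))  # (2, 0): two clusters, no loop
print(shape_summary(circle, eps=0.6))       # (1, 1): one piece containing one loop
```

The thresholds were picked by hand so that only adjacent points connect. In practice the choice of scale is itself part of the analysis.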
Q4. The core algorithm you use is called “Mapper”, developed at Stanford in the Computational Topology group by Gunnar Carlsson and Gurjeet Singh. How has your company, Ayasdi, turned this idea into a product?
Anthony Bak: Gunnar Carlsson, co-founder and Stanford University mathematics professor, is one of the leaders in a branch of mathematics called topology. While topology has been studied for the last 300 years, it’s in just the last 15 years that Gunnar has pioneered the application of topology to understand large and complex sets of data.
Between 2001 and 2005, DARPA and the National Science Foundation sponsored Gunnar’s research into what he called Topological Data Analysis (TDA). Tony Tether, the director of DARPA at the time, has said that TDA was one of the most important projects DARPA was involved in during his eight years at the agency.
Tony told the New York Times, “The discovery techniques of topological data analysis are going to have a huge impact, and Gunnar Carlsson is at the forefront of this research.”
That led to Gunnar teaming up with a group of others to develop a commercial product that could aid the efforts of life sciences, national security, oil and gas and financial services organizations. Today, Ayasdi already has customers in a broad range of industries, including at least 3 of the top global pharmaceutical companies, at least 3 of the top oil and gas companies and several agencies and departments inside the U.S. Government.
Q5. Do you have some use cases to share where Topological Data Analysis has been applied?
Anthony Bak: There is a well-known, 11-year-old data set representing a breast cancer research project conducted by the Netherlands Cancer Institute-Antoni van Leeuwenhoek Hospital. The research looked at 272 cancer patients covering 25,000 different genetic markers. Scientists around the world have analyzed this data over and over again. In essence, everyone believed that anything that could be discovered from this data had been discovered.
Within a matter of minutes, Ayasdi was able to identify new, previously undiscovered populations of breast cancer survivors. Ayasdi’s discovery was recently published in Scientific Reports, a Nature Publishing Group journal.
Using connections and visualizations generated from the breast cancer study, oncologists can map their own patients’ data onto the existing data set to custom-tailor triage plans. In a separate study, Ayasdi helped discover previously unknown biomarkers for leukaemia.
You can find additional case studies here.
Q6. Query-Based Approach vs. Query-Free Approach: could you please elaborate on this and explain the trade off?
Anthony Bak: Since the creation of SQL in the 1970s, data analysts have tried to find insights by asking questions and writing queries. This approach has two fundamental flaws. First, all queries are based on human assumptions and bias. Second, query results only reveal slices of data and do not show relationships between similar groups of data. While this method can uncover clues about how to solve problems, it is a game of chance that usually results in weeks, months, or even years of iterative guesswork.
Ayasdi’s insight is that the shape of the data – its flares, clusters and loops – tells you about natural segmentations, groupings and relationships in the data. This information forms the basis of a hypothesis to query and investigate further. The analytical process no longer starts with coming up with a hypothesis and then testing it; instead, we let the data, through its geometry, tell us where to look and what questions to ask.
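To make “letting the geometry tell us where to look” more concrete, here is a toy sketch in the spirit of the Mapper algorithm mentioned in Q4. It is a minimal illustration under stated assumptions, not Ayasdi's implementation. It projects the points through a lens function, covers the lens values with overlapping intervals, clusters the points falling in each interval, and links clusters that share points. The parameter names (`lens`, `n_intervals`, `overlap`, `eps`), the simple distance-threshold clustering and the circular toy data are all choices made for the example.

```python
import math
from itertools import combinations


def single_linkage(indices, points, eps):
    """Group indices into clusters: points closer than eps end up together."""
    clusters = []
    unvisited = set(indices)
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            p = frontier.pop()
            near = {q for q in unvisited if math.dist(points[p], points[q]) < eps}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_graph(points, lens, n_intervals, overlap, eps):
    """Mapper-style summary: nodes are local clusters, edges mean shared points."""
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    nodes = []
    for k in range(n_intervals):
        # Widen each interval by overlap * width on both sides so neighbors overlap.
        a = lo + k * width - overlap * width
        b = lo + (k + 1) * width + overlap * width
        members = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(single_linkage(members, points, eps))
    edges = [
        (i, j)
        for i, j in combinations(range(len(nodes)), 2)
        if nodes[i] & nodes[j]
    ]
    return nodes, edges


# Toy data (hypothetical): 12 points evenly spaced on a unit circle.
circle = [
    (math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12))
    for k in range(12)
]

# Lens: the x-coordinate of each point (an arbitrary choice for this sketch).
nodes, edges = mapper_graph(circle, lens=lambda p: p[0], n_intervals=3, overlap=0.3, eps=0.6)
print(len(nodes), len(edges))  # 4 4
```

With these settings the summary graph has four nodes joined in a ring, so the loop in the data shows up without anyone having queried for it. An analyst could then ask what distinguishes the points in each node, which is the hypothesis-after-shape workflow described above.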
Q7. Anything else you wish to add?
Anthony Bak: Topological data analysis represents a fundamentally new framework for thinking about, analyzing and solving complex data problems. While I have emphasized its geometric and topological properties, it’s important to point out that TDA does not replace existing statistical and machine learning methods.
Instead, it forms a framework that utilizes existing tools while gaining additional insight from the geometry.
I like to say that statistics and geometry form orthogonal toolsets for analyzing data: to get the best understanding of your data, you need to leverage both. TDA is the framework for doing just that.
Anthony Bak is currently a Data Scientist and mathematician at Ayasdi. Prior to Ayasdi, Anthony was at Stanford University where he worked with Ayasdi co-founder Gunnar Carlsson on new methods and applications of Topological Data Analysis. He did his Ph.D. work in algebraic geometry with applications to string theory.
– Extracting insights from the shape of complex data using topology
P. Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson & G. Carlsson
Nature, Scientific Reports 3, Article number: 1236 doi:10.1038/srep01236, 07 February 2013