Yanpei Chen is a data-scientist-product-manager at Splunk, in charge of looking at internal and customer data about Splunk to build better products. He holds a computer science PhD with MBA minor from UC Berkeley, where he was a member of the AMP Lab. In a past life as a software engineer at Cloudera, Yanpei contributed to several industry standard performance benchmarks on Hadoop, SQL-on-Hadoop, and machine learning.
Q1. What were your most successful data projects? And why?
Two examples of my data projects come to mind – each is outlined below.
The first project is my dissertation research work at UC Berkeley AMP Lab. We developed a new method to capture, scale, and replay real-life big data workloads. Using this method, we addressed several non-trivial system design challenges.
The “data” for this project consisted of real-life, production system traces from several different high-tech companies.
The “analysis” involved some nuanced application of statistical theory and empirical modeling. Insights from the project made their way into the first-generation industry standard big data benchmarks from the TPC consortium, and remain relevant today.
The second project is a part of my current work at Splunk. In my role, I work with our product management and engineering organizations to develop a series of metrics based on product instrumentation data from customer deployments. Using these metrics, we seek to improve our product along various customer-visible areas such as quality, usability and scalability.
Our leadership team reviews these metrics at a regular cadence. Multiple product enhancements have been launched to address various customer challenges illuminated by the data, which then led to timely, innovative improvements in our products and our internal operations.
Both projects share three main success factors:
The projects took place within a fantastic, supportive culture. Both the Splunk leadership team and the AMP Lab faculty nurtured open environments, where it is assumed that the data-driven world view is an integral part of running the business or doing computer science research. Within both environments, there is a community of peers able and willing to check each other’s blind spots, and enhance each other’s work with additional information.
The projects benefited from collaborative domain and numerical expertise. At Splunk, I have had everyone from members of executive staff to line engineers share with me highly technical and highly detailed perspectives and feedback related to my work. At AMP Lab, of course all the students and faculty simultaneously possess numerical expertise and domain expertise in computer system design. The coexistence of domain and numerical expertise allows for a rapid translation from the data to its implications.
The projects further benefited from the readily available and easily accessible technical infrastructure. At AMP Lab, we had a blank check to use cloud computing resources as we saw fit. At Splunk, our product telemetry data resides in an enterprise-wide deployment of our own product. This infrastructure allows us to do the entirety of our analysis, integrate product data with other business-critical data, and develop live dashboards to automate data presentation.
Q2. Is domain knowledge necessary for a data scientist?
Yes, for a project team, domain knowledge helps in several ways:
• Identifying high impact problems
• Connecting domain nuances with numerical nuances
• Assessing whether the data is a direct measure or a proxy for the behaviors of interest
• Understanding whether the data crosses the confidence threshold required for the decision at hand
In my experience, many successful domain experts and data scientists are characterized by openness to learning and genuine intellectual curiosity. All domain experts already know that X+1 amount of a “good thing” is better than X amount of the same thing. Many data scientists pick their professional domain expertise based on existing interest. People already have the raw ingredients to be both domain experts and data experts.
Whether people become simultaneously domain experts and data scientists depends a lot on the environment.
Across the industry, we genuinely need more support from organizational leaders – we need their help to foster environments where openness and learning can thrive. People should be willing to ask “dumb questions” without risking disrespect, or to offer alternate data interpretations without risking confrontation. Definitions of success should include room for exploration that potentially leads to dead-ends. Incentives should be created to record discoveries and share experiences – both positive and negative.
Q3. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant”?
Here’s a list of things I watch for:
• Proxy measurement bias. If the data is an accidental or indirect measurement, it may differ from the “real” behavior in some material way.
• Instrumentation coverage bias. The “visible universe” may differ from the “whole universe” in some systematic way.
• Analysis confirmation bias. Often the data will generate a signal for “the outcome that you look for”. It is important to check whether the signals for other outcomes are stronger.
• Data quality. If the data contains many NULL values, invalid values, duplicated data, missing data, or if different aspects of the data are not self-consistent, then the weight placed on the analysis should be appropriately moderated and communicated.
• Confirmation of well-known behavior. The data should reflect behavior that is common and well-known. For example, credit card transaction volumes should peak around well-known times of the year. If not, conclusions drawn from the data should be questioned.
My view is that we should always view data and analysis with a healthy amount of skepticism, while acknowledging that many real-life decisions need only directional guidance from the data.
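As an illustration of the data quality item in the list above, here is a minimal sketch in Python of how NULL, invalid, and duplicated values might be counted before any weight is placed on an analysis. The function name `data_quality_report`, the record layout, and the example validator are all hypothetical, chosen only for illustration; they do not describe any particular product or the tooling used in the projects discussed here.

```python
from collections import Counter


def data_quality_report(records, fields, validators=None):
    """Summarize NULL, invalid, and duplicate rates for a list of dict records.

    `fields` lists the keys to inspect; `validators` optionally maps a field
    name to a function that returns True when a non-NULL value is valid.
    """
    validators = validators or {}
    total = len(records)
    null_counts = Counter()
    invalid_counts = Counter()
    seen = Counter()

    for rec in records:
        # Two records count as duplicates when every inspected field matches.
        seen[tuple(rec.get(f) for f in fields)] += 1
        for f in fields:
            value = rec.get(f)
            if value is None:
                null_counts[f] += 1
            elif f in validators and not validators[f](value):
                invalid_counts[f] += 1

    # Every occurrence after the first of an identical record is a duplicate.
    duplicates = sum(n - 1 for n in seen.values())

    def fraction(count):
        return count / total if total else 0.0

    return {
        "total": total,
        "null_fraction": {f: fraction(null_counts[f]) for f in fields},
        "invalid_fraction": {f: fraction(invalid_counts[f]) for f in fields},
        "duplicate_fraction": fraction(duplicates),
    }


# Hypothetical transaction records: one NULL amount, one negative amount,
# and one exact duplicate.
records = [
    {"id": 1, "amount": 20.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": -5.0},
    {"id": 1, "amount": 20.0},
]
report = data_quality_report(
    records,
    fields=["id", "amount"],
    validators={"amount": lambda v: v >= 0},  # assumed rule: amounts are non-negative
)
print(report)
```

A report like this does not say whether the analysis is right or wrong. It only gives a number to communicate alongside the conclusions, so that the audience can moderate the weight they place on them.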
Q4. What should every data scientist know about machine learning?
In my view, data scientists should think of machine learning as an advanced calculator, another tool in their tool belt.
The value of a calculator comes from using it to solve high-impact problems, and feeding it data specific to the problem at hand. The focus should be on the problems that it can solve, and not entirely on the calculator itself.
Data scientists need to understand and communicate the nuances of the buttons on this calculator. A calculator will always spit out an answer when you feed it data, press any button, and trigger any algorithm. Assessing whether the answer is appropriate, or even relevant, will demand a greater understanding of machine learning beyond merely “pressing the button” or “getting an answer”.
Most problems in the world only require using a small subset of the buttons available on the “machine learning calculator”. Think about splitting the dinner bill or doing the household budget – we may well launch the calculator app on our phones, but likely we do not need to use the trigonometric or exponential functions. Machine learning should be approached with the same mindset. There are some problems that absolutely need the latest and greatest algorithms. For most problems, simple numerical discipline will take you a long way.
Calculators have become increasingly easy to use, and the same is true of machine learning. There are many tools available today. Speaking as a partial observer, I do find Splunk’s Machine Learning Toolkit to be very easy to use. It genuinely allows me to focus more attention on exploring the problem at hand, rather than on how to configure and then hand-crank the calculator.
Q5. Can data ingestion be automated?
Again, speaking with obvious allegiance, Splunk does it well. Some of our largest customers ingest data from hundreds of thousands of machine data sources. Ingestion *must* be automated at that scale.
One more biased comment on automation. Splunk integrates data ingestion, storage, analysis, and presentation within a single platform. This means that I do not need to connect together and configure multiple different technologies in the ingestion-to-presentation data pipeline. It is a different kind of automation. Speaking from first-hand experience, the consolidated functionality makes a huge difference. It frees data scientists to spend more time iterating on the data, which translates to faster time to value and higher return on investment.
Q6. What are the typical mistakes made when analyzing data for a large-scale data project? Can they be avoided in practice?
Our discussions elsewhere already touched upon some issues:
Technical mistakes – The “data calculator” is not good enough. It does not scale to the required data size, diversity, and complexity. Data ingestion is not automated. Everyone spends their time gluing together technology instead of focusing on the actual analysis.
Cultural mistakes – Defensiveness and grandstanding replace openness and learning. Data is viewed as a niche asset and not a top-level corporate or institutional asset. Data is only allowed to confirm existing bias, and not allowed to alter existing policy, business, or scientific decisions. Decision makers are unaware that they are making blind decisions, or, at the other extreme, they are paralyzed by over-analyzing the problem.
Organizational mistakes – There is no dedicated infrastructure to store the data and to run the analysis. There is insufficient budget to create and operate such infrastructure at scale. Legal infrastructure is not in place to seek consent to collect relevant data. Internal data security and governance policies are not in place. Incentives from different parts of the organization are not aligned. Management silos prevent collaboration across different departments and functions.
To an extent, it falls on data scientists to recognize and account for these types of problems in their work. Data scientists have the skill set needed to quantify, estimate, and attribute causes. They should also be able to leverage these skills to quantify the cost of such mistakes and prioritize avoiding mistakes with the greatest negative impact. They should take pains to communicate objectively and diplomatically, so that the discussion can rise above the tensions that come from an imperfectly executed project.
Q7. What are the ethical issues that a data scientist must always consider?
Data scientists should confront and address ethical issues as they arise. They should participate in broader discussions and not entirely defer to others for the answers. Data science is similar to other disciplines – the greater the potential for positive impact, the greater the potential for ethical issues.
There is one overarching consideration. In their raw form, statistical aggregates are weighted in favor of the “common case”. Any unmoderated optimization for the “average” will, by definition, benefit the majority and the “mainstream”. Additionally, people become visible through data when they have a level of engagement with technology, which is already one definition of the well-resourced “mainstream”.
In contrast, across different cultures, we constructed many of our institutions and values specifically to protect the minority, the under-resourced, and the invisible. We specifically encourage and benefit from expressions of individualism and diversity within the “mainstream”. Thus, the raw “data-favoritism for the majority” should be tempered by broader definitions of economic surplus and social wealth. Striking this balance requires data scientists to work alongside our colleagues in government, law, finance, healthcare, social care, and other areas.
Ultimately, data science should provide an objective, shared understanding of the world, highlighting the bright spots and the blemishes. Data science should focus precious human attention on the most important topics, and enrich the discussion with appropriate data.