Data Wisdom for Data Science
Bin Yu, Departments of Statistics and EECS, University of California at Berkeley
(Invited note for ODBMS.ORG–The Resource Portal for Big Data, New Data Management Technologies and Data Science)
In the era of big data, much of the research in academia and development in industry is about how to store, communicate, and compute (via statistical methods and algorithms) on data in a scalable and efficient fashion. These areas are no doubt important. However, big (and small) data can only be turned into true knowledge and useful, actionable information if we value “data wisdom” just as much. In other words, with all the excitement over big data, it is necessary to recognize that the size of the data has to be adequate relative to the complexity of the problem in order to get a reliable answer out of big data. Data wisdom skills are crucial for us to extract useful and meaningful information from data, and to ensure that we do not misuse expanding data resources.
I admit that “data wisdom” is a re-branding of essential elements of the best of applied statistics as I know it. These elements are more eloquently expressed in the writings of great statisticians (or data scientists) such as John W. Tukey and George Box.
Data wisdom is a necessary re-branding because, to a first approximation, it conveys these elements to people outside the community better than the term “applied statistics” does. An informative name such as data wisdom is a good step towards recognizing the importance of the best applied statistics skills in data science.
Revising the first sentence of Wikipedia’s entry on “wisdom”, I would like to say:
data wisdom is the ability to combine domain, mathematical, and methodological knowledge with experience, understanding, common sense, insight, and good judgment in order to think critically about data and to make decisions based on data.
Data wisdom is a mix of mathematical, scientific, and humanistic abilities. It combines science with art. It is something that is best learned by working with someone who has it.
It is very difficult to learn by reading a book without guidance from experienced practitioners.
That said, there are questions that one can ask to help form or cultivate data wisdom. Here are 10 basic sets of questions that I encourage one to ask before embarking on and during any data analysis project. These questions are naturally sequential in the beginning, but their order does not have to be respected during the iterative process of data analysis.
These questions are not meant to be exhaustive, but give the flavor of data wisdom.
The beginning of a data science problem is always something outside of statistics or data science. For example, a question in neuroscience: how does the brain work? Or a question in banking: to which group of customers should a bank promote a new service?
Associated with such a question are domain experts that a statistician or a data scientist needs to interact with. These experts help provide a broader picture of the question, domain knowledge, prior work, and a reformulation of the question if necessary. It takes strong interpersonal skills to establish relationships with (most likely very busy) domain experts.
This interaction is indispensable for the success of the data science project to come.
With the abundance of data, it often happens that questions are not precisely formulated before data collection. We find ourselves in the game of “exploratory data analysis (EDA)”, as Tukey called it. We fish for questions to ask and enter the iterative process of statistical investigation (as Box discussed in the paper linked above). We have to be vigilant not to overfit, that is, not to interpret patterns that are merely due to noise. For instance, overfitting can happen when the same data are used to formulate a question and again to validate the answer to that question. A good rule of thumb is to split the data, while respecting the underlying structures (e.g. dependence, clustering, heterogeneity), so that both parts are representative of the original data. Use one part to fish for a question and the other part to find the answer via, for example, prediction or modeling.
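The split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are taken to be clustered (for example, repeated visits by the same customer), and the function name `split_by_group`, the `group_of` callable, and the toy records are hypothetical, not part of the original note.

```python
import random
from collections import defaultdict


def split_by_group(records, group_of, explore_fraction=0.5, seed=0):
    """Split records into (explore, confirm) parts, keeping each group whole.

    `group_of` is a hypothetical callable mapping a record to its cluster
    label (e.g. a customer id), so that dependent records stay together.
    """
    groups = defaultdict(list)
    for record in records:
        groups[group_of(record)].append(record)

    labels = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(labels)

    n_explore = round(len(labels) * explore_fraction)
    explore = [r for label in labels[:n_explore] for r in groups[label]]
    confirm = [r for label in labels[n_explore:] for r in groups[label]]
    return explore, confirm


# Toy data: 6 customers with 3 visits each; visits by one customer are dependent.
visits = [{"customer": c, "visit": v} for c in range(6) for v in range(3)]
explore, confirm = split_by_group(visits, lambda r: r["customer"])
```

Splitting whole clusters handles only one kind of structure. Data with time dependence would more plausibly be split into contiguous blocks, and heterogeneous data might call for splitting within strata.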
What are the most relevant data to collect to answer the question in (1)?
Ideas from experimental design (a subfield of statistics) and active learning (a subfield of machine learning) are useful here. The above question is good to ask even if the data have already been collected, because understanding the ideal data collection process might reveal shortcomings of the actual data collection process and shed light on the analysis steps to follow.
The questions below are useful to ask: How were the data collected? At what locations? Over what time period? Who collected them? What instruments were used? Have the operators and instruments changed over the period? Try to imagine yourself at the data collection site physically.
What does a number mean in the data? What does it measure? Does it measure what it is supposed to measure? How could things go wrong? What statistical assumptions is one making by assuming things didn’t go wrong? (Knowing the data collection process helps here.)
Can the data collected answer the substantive question(s) in whole or in part? If not, what other data should one collect? The points made in (2) are pertinent here.
How should one translate the question in (1) into a statistical question regarding the data to best answer the original question? Are there multiple translations? For example, can we translate the question into a prediction problem or an inference problem regarding a statistical model? List the pros and cons of each translation relative to answering the substantive question before choosing a model.
Are the data units comparable or normalized so that they can be treated as if they were exchangeable? Or are apples and oranges being combined? Are the data units independent? Are two columns of data duplicates of the same variable?
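The last of these questions lends itself to a quick mechanical check. The sketch below assumes a table stored as a dict of column name to list of values; the function name `find_duplicate_columns` and the column names are hypothetical.

```python
from itertools import combinations


def find_duplicate_columns(table):
    """Return pairs of column names whose values are identical row by row.

    `table` is assumed to be a dict mapping column name -> list of values.
    """
    return [
        (a, b)
        for a, b in combinations(table, 2)
        if table[a] == table[b]
    ]


table = {
    "weight_kg": [70.0, 82.5, 64.0],
    "mass_kg": [70.0, 82.5, 64.0],  # same variable under another name
    "height_cm": [175.0, 180.0, 162.0],
}
duplicates = find_duplicate_columns(table)
```

Exact equality catches only literal copies. A duplicate recorded in different units (say, kilograms and pounds) would need a different check, such as looking for near-perfect correlation, and domain knowledge is still required to decide whether two columns really measure the same thing.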
Look at data (or subsets of them). Create plots of 1- and 2-dimensional data. Examine summaries of such data. What are the ranges? Do they make sense? Are there any missing values? Use color and dynamic plots. Is anything unexpected? It is worth noting that 30% of our cortex is devoted to vision, so visualization is highly effective for discovering patterns and unusual things in data. Often, to bring out patterns in big data, visualization is most useful after some model building, for example, to obtain residuals to visualize.
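Before plotting, the range and missing-value questions above can be answered with a small summary per column. This is a minimal sketch that assumes missing values are encoded as `None`; the function name `summarize_column` and the toy ages are hypothetical.

```python
def summarize_column(values):
    """Summarize one column: count, missing count, and range of observed values.

    Missing values are assumed to be encoded as None.
    """
    observed = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(observed),
        "min": min(observed) if observed else None,
        "max": max(observed) if observed else None,
    }


ages = [34, 51, None, 27, 212, 45]  # 212 is an implausible age worth a second look
summary = summarize_column(ages)
```

A summary like this only flags candidates. Whether a maximum of 212 is a typo, a unit mix-up, or a sentinel code is a question for the people who collected the data.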
Statistical inference concepts such as p-values and confidence intervals rely on randomness. What does randomness mean in the data? Make the randomness in the statistical model as explicit as possible. What domain knowledge supports such a statistical or mathematical abstraction or the randomness in a statistical model?
One of the best examples of explicit randomness in statistical modeling is the random assignment mechanism in the Neyman-Rubin model for causal inference (also used in A/B testing).
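When random assignment is the only source of randomness, a p-value can be computed by re-drawing the assignment itself. The sketch below is one way to illustrate this, assuming a completely randomized design with fixed group sizes and the sharp null hypothesis of no effect for any unit; the function name `randomization_p_value` and the toy outcomes are hypothetical.

```python
import random
from statistics import mean


def randomization_p_value(treated, control, n_draws=2000, seed=0):
    """Monte Carlo two-sided randomization p-value for a difference in means.

    Under the sharp null of no effect for any unit, outcomes are fixed and
    only the assignment labels are random, so we re-draw the labels.
    """
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    rng = random.Random(seed)

    at_least_as_extreme = 0
    for _ in range(n_draws):
        rng.shuffle(pooled)  # one hypothetical re-assignment of the same units
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            at_least_as_extreme += 1
    return at_least_as_extreme / n_draws


treated = [12.1, 11.8, 13.0, 12.6, 12.9]
control = [10.2, 10.9, 10.4, 11.1, 10.7]
p_effect = randomization_p_value(treated, control)
p_null = randomization_p_value([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

The p-value here is justified by the physical act of randomization rather than by a distributional assumption, which is what makes the randomness explicit. If the actual assignment was not random, the same computation has no such justification.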
What off-the-shelf method will you use? Do different methods give the same qualitative conclusion? Perturb the data, for example, by adding noise or by subsampling if the data units are exchangeable (in general, make sure the subsamples respect the underlying structures, e.g. dependence, clustering, heterogeneity, so that the subsamples are representative of the original data). Do the conclusions still hold? Only trust those that pass the stability test, which is an easy-to-implement first line of defense against overfitting or too many false positive discoveries.
It is one form of reproducibility (for more information on the importance of stability, see my paper at http://projecteuclid.org/euclid.bj/1377612862).
Reproducibility has recently drawn much attention in the scientific community; see a special issue of Nature.
Marcia McNutt, the Editor-in-Chief of Science, pointed out that “reproducing an experiment is one important approach that scientists use to gain confidence in their conclusions.” Similarly, business and government entities should require that the conclusions drawn from their data analyses be reproducible when tested with new and similar data.
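The subsampling version of the stability test described above can be sketched as follows. This is an illustration only: it assumes the data units are exchangeable, and the names `stability_of_conclusion` and `b_beats_a`, the 80% subsample fraction, and the toy data are all hypothetical choices.

```python
import random
from statistics import mean


def stability_of_conclusion(conclusion, units, n_subsamples=200, fraction=0.8, seed=0):
    """Fraction of random subsamples on which `conclusion` matches the full-data answer.

    `conclusion` is a hypothetical callable mapping a list of units to a
    qualitative answer (e.g. True/False). Units are assumed exchangeable.
    """
    rng = random.Random(seed)
    full_answer = conclusion(units)
    size = max(1, round(len(units) * fraction))
    agree = sum(
        conclusion(rng.sample(units, size)) == full_answer
        for _ in range(n_subsamples)
    )
    return agree / n_subsamples


# Toy units: (group, outcome) pairs.
units = [("A", x) for x in (1.0, 1.2, 0.9, 1.1, 1.0, 0.8)]
units += [("B", x) for x in (2.0, 2.3, 1.9, 2.1, 2.2, 1.8)]


def b_beats_a(sample):
    """Qualitative conclusion: group B has the higher mean outcome."""
    a = [y for g, y in sample if g == "A"]
    b = [y for g, y in sample if g == "B"]
    return mean(b) > mean(a)


score = stability_of_conclusion(b_beats_a, units)
```

A score near 1 says the conclusion survives this particular perturbation, not that it is true. For dependent or clustered data, the plain `rng.sample` call would need to be replaced by a scheme that subsamples whole clusters or blocks.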
How does one know one’s data analysis job is well done? What is the performance metric? Consider validation with other kinds of data or prior knowledge. New data might need to be collected to validate.
There are many more questions to ask, but I hope the above list gives you a sense of what it takes to gain data wisdom. A statistician or data scientist has to find the answers to these questions OUTSIDE statistics and data science. Sources of useful information for finding reliable answers include the “dead” (e.g. scientific literature, written reports, books) and the “living” (e.g. people). Excellent interpersonal skills make it much easier to find the right sources to dig into, even if one is after a “dead” information source. The abundance and availability of information make people skills ever more important since, in my experience, knowledgeable people almost always provide the best pointers.