On Data, Exploratory Analysis, and R. Q&A with Ronald K. Pearson
Q1. What is exploratory data analysis (EDA)?
Exploratory data analysis (EDA) is the art of applying data visualization tools and a variety of computational data characterizations to achieve a better understanding of what is contained in a dataset: what variables are included, what ranges of values they take, how they relate to one another, and what kinds of unusual features may be present.
A very good practical description of the art of exploring data is that offered by Stanford University Mathematics and Statistics Professor and 1982 MacArthur Fellow Persi Diaconis, cited in the first chapter of my book:
We look at numbers or graphs and try to find patterns. We pursue leads suggested by background information, imagination, patterns perceived, and experience with other data analyses.
The aspects of exploratory data analysis emphasized in my book are data visualization (i.e., what do we look at and how?); simple data characterizations like the mean, median, standard deviation, and MAD scale estimator; important data anomalies like outliers, inliers, missing data, and metadata errors; and simple association measures and modeling tools like correlations and regression analysis, with an introduction to some newer methods like outlier-resistant correlation measures and machine learning models.
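As a rough illustration of the simple data characterizations mentioned above, the following Python sketch compares the mean and standard deviation with the median and the MAD scale estimator on a small made-up sample. The data values and the helper name `mad_scale` are assumptions introduced here for illustration; the factor 1.4826 is the usual scaling that makes the MAD comparable to the standard deviation for Gaussian data.

```python
import statistics

def mad_scale(values):
    """MAD scale estimate: 1.4826 * median(|x - median(x)|)."""
    center = statistics.median(values)
    abs_dev = [abs(v - center) for v in values]
    return 1.4826 * statistics.median(abs_dev)

# Hypothetical measurements; the second list replaces the last value
# with a gross error (e.g., a misplaced decimal point).
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
contaminated = clean[:-1] + [103.0]

for name, data in [("clean", clean), ("contaminated", contaminated)]:
    print(
        f"{name:>12}: mean={statistics.mean(data):.3f} "
        f"sd={statistics.stdev(data):.3f} "
        f"median={statistics.median(data):.3f} "
        f"MAD scale={mad_scale(data):.3f}"
    )
```

In this sample the single bad value moves the mean and standard deviation substantially, while the median and MAD scale estimate are essentially unchanged.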
Q2. What are the best techniques to find “interesting” features in data, and to distinguish the good from the bad and the ugly?
This question actually has two aspects, a computational one that can be automated and a non-computational one that can’t. Specifically, we can use a variety of computational tools to look for “surprising” data features, in the sense that they violate our initial expectations of how the data values should behave. A specific and important example is outliers, defined by Barnett and Lewis in their book on the subject essentially as “data points that are inconsistent with the majority of the other data points.”
Various tools have been developed to detect outliers, and I discuss several of them at length in the book (e.g., the well-known but often ineffective “three-sigma edit rule,” along with various robust alternatives like the Hampel identifier and the boxplot rule).
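The three rules named above can be sketched in a few lines of Python. This is a minimal sketch rather than the book's implementation: the function names, the sample data, and the threshold of 3 for the Hampel identifier are assumptions chosen here, and the quartiles come from the default method of `statistics.quantiles` (Python 3.8 or later), which may differ slightly from the quartile convention a given boxplot routine uses.

```python
import statistics

def three_sigma(values, t=3.0):
    """Indices of points more than t standard deviations from the mean."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > t * sd]

def hampel(values, t=3.0):
    """Indices of points more than t MAD scale units from the median.

    If the MAD scale is zero (more than half the values are identical),
    every point that differs from the median is flagged.
    """
    med = statistics.median(values)
    scale = 1.4826 * statistics.median([abs(v - med) for v in values])
    return [i for i, v in enumerate(values) if abs(v - med) > t * scale]

def boxplot_rule(values, k=1.5):
    """Indices of points beyond k interquartile ranges from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lo or v > hi]

# Hypothetical sample: seven plausible values and one gross error.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 103.0]

print("three-sigma:", three_sigma(data))
print("Hampel:     ", hampel(data))
print("boxplot:    ", boxplot_rule(data))
```

On this sample the three-sigma rule flags nothing, because the outlier inflates the very standard deviation used to judge it, while the two robust rules both flag the last point. This is one way to see why the three-sigma rule is often ineffective.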
In contrast, the interpretation of outliers as “good, bad, or ugly” is not a mathematical problem. Gross measurement errors represent one important source of “bad or ugly” outliers, but “good” outliers can represent extremely valuable discoveries: two examples include Lord Rayleigh’s discovery of the noble gas argon, which led to his receiving the Nobel Prize in Physics in 1904, and giant magnetoresistance, an effect in materials whose discovery laid the foundation for hard disk technology and was the basis for the 2007 Nobel Prize in Physics.
Other “interesting” data features include inliers, corresponding to data points that are consistent with the overall distribution of values, but which are not what they appear (e.g., numerical codes for missing data observations), misalignment errors, where data values are recorded in the wrong place in the dataset, and metadata errors, where the numbers may be right but their stated or assumed characteristics are wrong (e.g., their measurement units).
As with outliers, it is often possible to detect the “surprising” character of these data features using computational procedures, but some of these features require real care to find and must be interpreted using judgement and critical thinking. An example of an extremely “bad” data anomaly was the metadata error that caused the $125 million Mars Climate Orbiter to burn up in the Martian atmosphere nine months after it was launched, before it could send back any of the data it was designed to collect. The problem was that key course-correction data values were provided in one set of units (pound-seconds) but used assuming a different set of units (Newton-seconds).
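To make the size of such a units error concrete, the sketch below converts an impulse value from pound-force seconds to newton-seconds and compares it with the same number read naively as newton-seconds. The `Impulse` class, the unit labels, and the value 100.0 are hypothetical and chosen only for illustration; they do not describe the actual spacecraft software. The conversion factor follows from 1 lbf = 4.4482216152605 N.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # newton-seconds per pound-force second

@dataclass(frozen=True)
class Impulse:
    """A numeric value that carries its unit as explicit metadata."""
    value: float
    unit: str  # "N*s" or "lbf*s" (labels assumed for this sketch)

    def to_newton_seconds(self):
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown impulse unit: {self.unit!r}")

reported = Impulse(100.0, "lbf*s")      # hypothetical reported value
naive = reported.value                  # number used as if it were N*s
correct = reported.to_newton_seconds()  # number after honoring the unit

print(f"naive reading:   {naive:.2f} N*s")
print(f"correct reading: {correct:.2f} N*s")
print(f"ratio:           {correct / naive:.3f}")
```

The numbers themselves are "right" in both readings; only the assumed metadata differs, which is what makes this kind of error hard to spot from the values alone.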
Q3. How easy is it to craft data stories?
A data story involves two essential components: the results of our data analysis, and text that explains to others what we have learned from that analysis. For those who struggle with either the analysis or the writing, crafting an effective data story can be challenging, but over the years, I have learned two tricks that I think help a lot. The first is a strategy that Bob McClure, my data analysis mentor at the DuPont Company, taught me: he would write a draft of his analysis summary before he had actually started the analysis. Obviously, this draft couldn’t include the results that he didn’t have yet, but he found it a useful way to clarify the analysis objectives from the outset. In particular, this approach forced him to describe what he intended to do, why he thought he needed to do it, what data he needed to collect, and what he expected to learn from analyzing it. Then, after he collected and analyzed the data, he could fill in the parts that were consistent with his expectations, and revise (sometimes drastically) those that were not.
The other trick that I find useful is to build a data story around a set of figures. That is, I will usually start by constructing a set of figures that highlight the main findings of my analysis, create short, informative captions for them, and then explain each one with a paragraph or two. In the course of doing that, questions come up that make me realize what I haven’t yet explained in the document, and answers to those questions provide the basis for drafting or revising the introductory sections of the document. Also, it is important to keep the audience for the data story clearly in mind while drafting it, because that ultimately determines how much detail and which details need to be included in the summary.
Q4. What is better for EDA: Python or R?
This is a really loaded question because these two languages are so different: Python is much broader in scope than R, with add-on packages that support a much wider range of applications than R does; as of 3/26/2018, the Python Package Index (PyPI) listed 133,424 available packages, while CRAN listed 12,362 available R packages.
Conversely, R is a much more focused language, concentrated on data visualization, analysis and modeling. I rate these two languages as far and away my favorites out of all the computer languages I have ever had any real exposure to (a list that includes FORTRAN, PL/I, APL, Matlab, its open-source equivalent Octave, C++, Perl, Pascal, several flavors of Basic, SAS, and a number of assembly languages). Because of its focus, though, I prefer R for exploratory data analysis: it provides a broad and growing set of graphical and computational tools that are well suited to data exploration. For data visualization, I am thinking of graphical tools like beanplots, sunflower plots, stacked barplots, QQ-plots and extensions like Poissonness plots, correlation matrix plots like those available in the corrplot package, and others like the partial dependence plots provided by the plotmo package.
In addition, extremely useful computational packages include the robust statistics package robustbase that provides a lot of tools for characterizing data in the presence of outliers, the size-corrected binomial probability and confidence interval estimators available in the PropCIs package, and the mixture distribution modeling tools available in the mixtools package. Certainly, all of these tools could be built in Python, but as far as I am aware, most of them are not available as built-in modules at present. This point is important because developing new implementations of sophisticated tools like these and verifying that they are performing correctly can be an enormous amount of work.
Ronald K. Pearson holds the position of Senior Data Scientist with GeoVera, a property insurance company in Fairfield, California, and he has previously held similar positions in a variety of application areas, including software development, drug safety data analysis, and the analysis of industrial process data. He holds a PhD in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology and has published conference and journal papers on topics ranging from nonlinear dynamic model structure selection to the problems of disguised missing data in predictive modeling. Dr. Pearson has authored or co-authored books including Exploring Data in Engineering, the Sciences, and Medicine (Oxford University Press, 2011) and Nonlinear Digital Filtering with Python. He is also the developer of the DataCamp course on base R graphics and is an author of the datarobot and GoodmanKruskal R packages available from CRAN (the Comprehensive R Archive Network).
Exploratory Data Analysis Using R
Ronald K. Pearson
May 4, 2018 Forthcoming by Chapman and Hall/CRC
Textbook – 548 Pages – 11 Color & 132 B/W Illustrations
ISBN 9781138480605 – CAT# K348778