Underhyped – Big Data as an Advance in the Scientific Method
Yanpei Chen, Performance Engineering Team, Cloudera, December 18, 2014.
Big data is underhyped. That’s right. Underhyped. The steady drumbeat of news and press coverage talks about big data only as a transformative technology trend. It is as if big data’s impact goes only as far as creating tremendous commercial value for a select few vendors and their customers. This view could not be further from the truth.
Big data represents a major advance in the scientific method. Its impact will be felt long after the technology trade press turns its attention to the next wave of buzzwords.
I am fortunate to work at a leading data management vendor as a big data performance specialist. My job requires me to “make things go fast” by observing, understanding, and improving big data systems. Specifically, I am expected to assess whether the insights I find represent solid information or partial knowledge. These processes of “finding out about things” and “separating what you know from what you don’t know” lie at the heart of the scientific method.
My work gives me some perspective on an under-appreciated aspect of big data that I will share in the rest of the article.
Past advances in the scientific method
There have been only a small number of true breakthroughs in how we “find out about things”. We rarely think about them, because each of these advances has so thoroughly improved our understanding of the world and transformed our daily lives. Here are two such breakthroughs.
An initial breakthrough is the adoption of empirical, quantitative measurement. Galileo pioneered this approach when he pointed his telescope at the skies and convinced himself that the moons of Jupiter were real. Newton formalized the method when he used equations to describe exactly how fast those moons moved. Quantitative measurement jump-started modern science. It helped early scientists abandon the unguided trial-and-error of alchemy and the unexplained occult forces of metaphysics. The early scientists started doing experiments that we would recognize today.
A second breakthrough is the understanding of probability and multivariate statistics. This approach found an early application in agriculture. Crop yield is understood to be affected by factors including soil quality, irrigation, fertilizer, weather, seasons, and pests. Some of these factors are controllable; others vary across the crop’s geographic cultivation range. Multivariate statistics allowed us to understand the combination of soil, irrigation, and fertilizer that gives a high probability of a good yield. The fact that we can enjoy every kind of produce in every city regardless of season owes a large part to multivariate statistics.
These past advances set the stage for big data to make the next breakthrough. Let us look at two recent big data stories, so that we can see how exactly big data represents a new way of “finding out about things”.
Big data enabling a new kind of health care
Children’s Healthcare of Atlanta has implemented an application that allows them to visualize vital sign changes during eye exams on their neonatal patients so they can tell when the patients are responding well to treatments or when they are in pain. Unlike adults, infants in the neonatal care unit can’t easily communicate with practitioners to let them know when something hurts, so understanding these physiological factors is imperative to monitoring their care. The data is then correlated with the health and recovery of the newborn patients. Upon seeing the data, the doctors’ jaws dropped. There were immediate, no-brainer steps to improve care, simply by adjusting previously unseen factors indicating how well the patients were responding to procedures.
Implicit in the story is the application of empirical measurement: to quantify the patients’ physiological response, as well as the patients’ recovery. Multivariate statistics also played a part, in computing the probabilistic correlation between multiple physiological factors and multiple care outcomes.
What is new, and uniquely associated with big data, is the ability to monitor all the physiological factors with high granularity, as well as the ability to quickly and economically identify the probabilistic correlation across a large group of variables.
Had the study been performed ten years ago, it likely would have involved caregivers manually recording the patients’ vital signs at regular intervals of seconds or even minutes. This monitoring method has a critical drawback – it likely misses the bursts and transitions in physiological responses that indicate the caregiver is causing pain.
Further, the probabilistic correlation would have been done on a desktop computer. It would have used data from a small number of known factors in order to complete in reasonable time. This analysis method would have missed a key outcome of the study – the discovery of previously unseen factors that indicate how well the patients are responding to care.
The success of the study required the new capabilities brought by big data: fine-grained, automated monitoring, combined with the ability to analyze quickly and economically large volumes of data involving many variables. Without big data, the study simply would not have been possible.
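The sampling argument above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the heart-rate values, the burst timing, the 60-second charting interval, and the detection threshold are invented and do not come from the Children’s Healthcare of Atlanta study.

```python
# Hypothetical trace: one heart-rate reading per second for 10 minutes.
# Baseline is 140 bpm, with a 20-second burst to 190 bpm starting at t=125 s.
# All numbers are made up for illustration.
def synthetic_heart_rate(duration_s=600, burst_start=125, burst_len=20):
    return [
        190 if burst_start <= t < burst_start + burst_len else 140
        for t in range(duration_s)
    ]


def sample(trace, interval_s):
    """Keep one reading every interval_s seconds, as a manual chart would."""
    return trace[::interval_s]


def detects_burst(readings, threshold=170):
    """Return True if any retained reading exceeds the assumed threshold."""
    return any(r > threshold for r in readings)


trace = synthetic_heart_rate()
fine = sample(trace, 1)     # automated, per-second monitoring
coarse = sample(trace, 60)  # manual recording once a minute

print("per-second sees burst:", detects_burst(fine))
print("per-minute sees burst:", detects_burst(coarse))
```

In this toy trace the burst falls entirely between two once-a-minute readings, so only the per-second record shows it.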
Big data helping cut residential electricity use
Opower, a smart meter data visualization company, is working with utilities across the country to graph smart meter data for residential consumers. The graphs appear on each household’s online electricity bill. They show the household’s electricity usage down to a granularity of minutes. An aggregate view also shows each household’s historical use compared with all households of similar size, and with efficient households as defined by the top few percentiles of lowest-use households.
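The aggregate comparison described above could be computed along the following lines. This is only a sketch under stated assumptions, not Opower’s actual method: the function name `compare_household`, the 20% cutoff for “efficient” households, and the usage figures are all hypothetical.

```python
from statistics import mean


def compare_household(usage_kwh, peer_usage_kwh, efficient_fraction=0.2):
    """Compare one household's monthly use with similar households.

    efficient_fraction is an assumed cutoff: the lowest-use 20% of peers
    are treated as the "efficient" group.
    """
    ranked = sorted(peer_usage_kwh)
    n_efficient = max(1, int(len(ranked) * efficient_fraction))
    return {
        "you": usage_kwh,
        "all_similar": mean(ranked),
        "efficient": mean(ranked[:n_efficient]),
    }


# Made-up monthly usage (kWh) for ten households of similar size.
peers = [300, 350, 400, 450, 500, 550, 600, 650, 700, 750]
summary = compare_household(520, peers)
print(summary)
```

The three numbers returned correspond to the three bars a household would see on its bill: its own use, the average of similar households, and the average of the efficient group.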
Personally speaking, I have pored over these graphs from my own electricity bills. They helped me identify minor adjustments in how I use electric appliances that yielded substantial reductions in my monthly bill. Beyond my household, Opower has helped cut US annual residential electricity usage by 1-3%. This is significant, as it potentially neutralizes the historical annual growth rate in US residential electricity use.
In this story, empirical measurement played its part long ago – we used it to figure out how to generate electricity and how to use it. A generator this large and this heavy will need this much torque to rotate at this angular velocity, which in turn produces this much voltage and current. A light bulb of this many ohms, connected at this many volts, will draw this many amps of current and give off this many lumens of light.
Multivariate statistics gave us the electrical grid: There are this many generators, running on coal, hydro, wind, solar, and natural gas. A number of these are stand-by generators that are more costly to operate. Some are intermittent sources that produce different power levels with different probabilities. On the consumer side, there is a combination of residential, commercial, and industrial users. Each user has different demands based on the time of day, the day of week, and the weather across a range of geographies. The system needs to balance generation and demand, constrained by transmission capacities, while optimizing for both monetary and environmental cost.
The part of this story uniquely associated with big data is the ability to automatically log electricity use at fine time granularity for each household in the electric grid. Another contribution of big data is to allow utility operators to store, compute over, and display the data at affordable cost. This affordability is no small achievement, as utility operators are expected to be especially cost-efficient in their business.
Again, big data is essential to this success story. Ten years ago, smart meters were not yet widely deployed. For the few trial deployments, the data was heavily aggregated. The ability to identify efficient households would have been there. However, without visualizing the data at small time granularity, a household would have few clear action items to reduce its electricity use.
A new way of “finding out about things” through big data
Across the two stories, we see the same new capabilities enabled by big data: to measure and record data with granularity, thoroughness, and quantity that were previously not possible; to compute statistics and probabilities over more variables and larger data volumes than previously economical. These new capabilities represent a fundamental advance in our ability to “find out about things” – we can see more things at finer detail, and make sense of what we see quickly and economically.
Simply put, big data enables studies that could not have been done with previous scientific methods. As such, big data represents a new step in the scientific method.
Viewing big data as a new scientific method also gives us some insights into the future of big data. The current trade press predominantly talks about big data technologies and tools. Tools matter less in the most recent big data success stories. Rather, the focus is more on the new knowledge discovered and the improved goods and services enabled by the new knowledge. Just as we hardly ever talk about the tools we use to do quantitative measurement or multivariate statistics, we should expect that big data tools will soon “disappear” and become an accepted part of knowledge discovery.
For big data tools to truly “disappear”, several things still need to happen. First, there need to be more success stories of big data enabling knowledge discovery in natural science, social science, behavioral science, and other disciplines. These stories are steadily emerging.
Second, big data tools need to improve to the point where they “just work”. Think about how easy it is to do arithmetic on a calculator or to enter data into a spreadsheet. Big data tools will one day become equally straightforward. Imagine the day when middle school students learn about big data alongside an introduction to science.
Closing – looking beyond commercial success
As big data professionals, we should be grateful that we are all bringing about a historic advance in how knowledge is discovered. Yes, the trade press necessarily covers topics of commercial interest. Yes, the commercial interests provide ample incentives for our work. At the same time, we should realize that our work has meaning far beyond the success of this or that vendor. Our colleagues will remember the big data tools we build. The world at large will remember the patients cured and kilowatt-hours saved through big data.