Who Invented Big Data (and Why Should We Care)?
By Shomit Ghose, General Partner, ONSET Ventures
Despite the current level of visibility and frenetic activity surrounding Big Data, it turns out the concept was first pioneered in the 1940s by Hari Seldon, professor of mathematics. At Streeling University. On the planet Trantor.
In Isaac Asimov’s Foundation science fiction trilogy. The premise underlying Asimov’s books was that Professor Seldon had developed a branch of probabilistic mathematics that allowed the future to be accurately predicted. This is, as it turns out, exactly the promise of Big Data: predicting what will happen next based on analysis of enormous volumes of historical data.
Hari Seldon’s “work” illustrates why Big Data is of such keen interest today among Silicon Valley entrepreneurs and investors. Big Data promises to be for the 21st Century what oil was to the 20th: the fuel driving all that we do. This fuel is being created in fantastically large volumes, with 4.4 zettabytes (4.4 x 10^21 bytes) of data produced in 2013, and that volume growing to 44 zettabytes of data produced in the year 2020, according to EMC’s annual Digital Universe study. In 2013, 22% of this annual data volume had semantic value, but only 5% of it was actually analyzed, according to EMC. By 2020, 35% of the 44 ZB of data will have semantic value. The challenge and opportunity of Big Data will be to find a way to make sense of all of that valuable data.
Although data is being produced in astoundingly large quantities and at an incredible rate, it is neither the volume nor the velocity that best defines the opportunity of Big Data. Rather, the opportunity in Big Data is best defined by the sheer variety of the data being created, and the ability to understand the connections between (i.e., correlate) that disparate data; it’s all about using Big Data to find the patterns we seek in nature and society.
It’s a little like viewing a painting by the pointillist French painter, Georges Seurat. We cannot be transfixed by staring closely at three tiny lavender dots of paint. Only by drawing back and viewing the canvas of data in its entirety are we able to discover the pattern that those tiny dots of lavender paint are helping communicate.
As a venture capitalist focused on Big Data investing, I draw from this a few observations about how Big Data might best be harnessed. First of all, in the bygone Small Data era, it used to be that we had a large volume of questions relative to the volume of data. Consequently, we had the luxury of being able to define and ask the questions against that data. (Ahh, those simple, halcyon days of decisions driven by SQL queries.) Today, the volume of data far outstrips our ability to know what questions to ask. It is now incumbent upon the data to tell us what questions we need to ask. Thus, it is only through the further development of unsupervised machine learning techniques that we will be able to understand what the data is trying to tell us. And it goes without saying that it will be increasingly challenging to detect patterns and anomalies in data when those patterns and anomalies are being rendered ever more statistically insignificant due to the accretion of ever more data.
A second observation is that if data is good, more data is even better. But the correlation of this data shouldn’t be limited to just what’s available inside your organization. It needs to include everything you can possibly get your hands on, with the further corollary that data is always more valuable if it is shared. But how can you stitch together and correlate all of this disparate data? In a multi-source data world there are no longer common keys to tie data together. In such a world probabilistic joins will provide the solution for combining and correlating data that is seemingly varied and discordant.
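To make the idea of a probabilistic join concrete, here is a minimal sketch in Python. It is an illustration only, under loudly stated assumptions: the two record sets, the field names, and the 0.75 threshold are all hypothetical, and a plain string-similarity score from the standard library stands in for the match probability that a real record-linkage system would estimate statistically.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def probabilistic_join(left, right, fields, threshold=0.75):
    """Pair each left record with its most similar right record.

    The score is the average per-field string similarity. It is only a
    stand-in for a true match probability; the threshold is an assumption.
    """
    matches = []
    for l_rec in left:
        best, best_score = None, 0.0
        for r_rec in right:
            score = sum(similarity(l_rec[f], r_rec[f]) for f in fields) / len(fields)
            if score > best_score:
                best, best_score = r_rec, score
        # Keep the pair only if the best candidate is similar enough.
        if best is not None and best_score >= threshold:
            matches.append((l_rec, best, best_score))
    return matches


# Hypothetical records from two sources that share no common key.
crm = [
    {"name": "Jonathan Smith", "city": "Palo Alto"},
    {"name": "Maria Garcia", "city": "San Jose"},
]
social = [
    {"name": "Jon Smith", "city": "Palo Alto, CA"},
    {"name": "Wei Chen", "city": "Oakland"},
]

matches = probabilistic_join(crm, social, fields=["name", "city"])
for l_rec, r_rec, score in matches:
    print(l_rec["name"], "<->", r_rec["name"], f"(score {score:.2f})")
```

In this toy data, "Jonathan Smith" is paired with "Jon Smith" even though no field matches exactly, while "Maria Garcia" finds no partner above the threshold. The nested loop compares every pair of records, which is fine for a sketch but would need blocking or indexing at Big Data scale.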
The final observation is that “data exhaust” – i.e., the incidental data that is produced while we’re busily engaged in some primary task – contains as much information as the primary data itself, if not more. A simple example is the time and geography tags accompanying photos taken on your cell phone. Needless to say, this data exhaust can be used for both beneficial and malicious purposes, with all of the attendant ethical issues this raises.
Data exhaust is particularly powerful in that it can provide information on our “real” behavior versus our “represented” (self-reported) behavior. An excellent example of how data exhaust can be applied was the Flap system implemented at the University of Rochester by Sadilek, Kautz and Bigham. Flap was able to geo-locate Twitter users, even when they weren’t geo-locating their tweets, simply by geo-locating the tweets of the friends of the Twitter user.
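The intuition behind inferring a location from friends’ data can be sketched in a few lines of Python. This is emphatically not the Flap system’s actual method; it is a simple majority vote over friends’ geotags, and the user names, friendship graph, and locations below are all hypothetical.

```python
from collections import Counter


def infer_location(user, friends_of, tagged_locations):
    """Guess a user's location as the most common geotag among their friends.

    friends_of: dict mapping a user to a list of friends (hypothetical data).
    tagged_locations: dict mapping a user to the locations of their
    geotagged posts. Returns None when no friend has any geotagged post.
    """
    votes = Counter()
    for friend in friends_of.get(user, []):
        votes.update(tagged_locations.get(friend, []))
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# "alice" never geotags her own posts, but her friends do.
friends_of = {"alice": ["bob", "carol", "dave"]}
tagged_locations = {
    "bob": ["Rochester", "Rochester", "Buffalo"],
    "carol": ["Rochester"],
    "dave": ["Syracuse"],
}

guess = infer_location("alice", friends_of, tagged_locations)
print(guess)  # prints "Rochester"
```

Even this crude vote shows why data exhaust is so revealing: the user disclosed nothing about her own location, yet her friends’ incidental data points to it.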
So where do future entrepreneurial opportunities lie in the Big Data world? Everywhere, as it turns out. 3D printing? Merely an exercise in personalized manufacturing. Where does the personalization come from? Data. An abundance of data that allows companies to cater to markets of one. How about ERP? ERP is no longer constrained by the “small data” supply and demand information contained within its own database. We can now infer customer demand by extracting trends from the Big Data contained in social media.
IT security in a post-Target and post-Snowden world? This is answered by a Big Data-focused approach of continuous authentication of the individual user based on keystroke dynamics and patterns of file access. Static “authenticate once” security mechanisms are clearly obsolete.
The Internet of Things? Wearables? Continuous (versus episodic) healthcare? All founded exclusively on the massive volumes of data produced at the edge of the network and analyzed as Big Data in the Cloud. And the world’s biggest and fastest growing source of data going forward is the mobile phone, which has succeeded in instrumenting virtually the entire human population.
Finally, I should mention that data quality is a significant issue in a world that will ultimately be completely data-driven. That is, if a machine can be “learned” to do the right thing, it can equally easily be “mis-learned” to do the wrong thing. Security mechanisms to prevent this kind of data bombing are a crucial area of future need.
The range of opportunities within Big Data for computer scientists, entrepreneurs and investors is exciting and vast.
We are at the dawn of a new industrial age founded entirely on data, and have just invented our first Spinning Jenny. Start-ups that combine a disruptive business model with an intense focus on data – leveraging the good uses, or preventing the bad – are well positioned to be The Next Big Thing. Hari Seldon’s psychohistory, a.k.a. computational social science in today’s parlance, may not be a goal we wish to attain. But Big Data already gives us the power to address and solve many of the problems affecting business and society today.