Big data, big trouble

By Bernardo A. Huberman
Director, Mechanisms and Design Lab
HP Labs
1501 Page Mill Rd.
Palo Alto, CA 94304

The interactive nature of the web has created research opportunities exploited by a number of researchers from the social and information sciences. Patterns that were hard to discern when operating with limited data sets have become apparent as enormous repositories of data collected by large services such as Twitter, Facebook, and Google are accessed by researchers and business professionals.

There is, however, a serious problem with many of these studies. As pointed out by Ravetz (Nature 481, 25 (2012)), science is unique in that peer review, publication and replication are essential to its progress. And yet, many of the big data results that are coming out are obtained from private sources that are not accessible to researchers beyond the authors of the work.

Even worse, in some cases the source of the data itself remains hidden, raising questions not only about verification but also about the generality of the results.

Ideally the authors would share the data themselves; at the very least, the data sources should be accessible to others so that the findings can be verified. This is common practice within the physical and biological communities.

More importantly, we need to recognize that these results will only be meaningful if they are universal, in the sense that many other data sets reveal the same behavior. This actually uncovers a deeper problem. If another set of data does not validate results obtained with private data, how do we know whether that is because the results are not universal or because the authors made a mistake? Moreover, as many practitioners of social network research are starting to discover, many of the results are becoming part of a “cabinet de curiosités” devoid of much generality and hard to falsify.

Besides the potential for fraud, if this trend continues we’ll see a small group of scientists with access to private data repositories enjoy an unfair amount of attention in the community at the expense of equally talented researchers whose only flaw is the lack of the right “connections” to private data.
