Challenges in Evaluating Big Social Media Data

Fred Morstatter and Huan Liu, Ira A. Fulton Schools of Engineering, Arizona State University

Big data has created opportunities for researchers across almost all disciplines. One area that has particularly benefited from the massive amounts of available data is computational social science, where researchers have used big data to gain a deeper understanding of their questions and to answer questions previously thought unanswerable. While this big social data allows for new forms of analysis, it also raises questions that must be answered to ensure the veracity of results. In this article we discuss two challenges: the representativeness of the data we are studying, and the ways in which we can scale human-centric methods to the needs of big data.

1 Studying Bias in Big Data Streams

Twitter sits at the forefront of big data research. One reason for Twitter’s massive adoption in the research community is its open data sharing policy. By distributing a sizable portion of its messages, or “tweets”, for free each day, Twitter has enabled anyone with an internet connection to accrue a big social media dataset. This data outlet has been widely adopted for big data research and used across many tasks, such as predicting human behavior, studying mass protests, and measuring the public reaction to products or ideas. In addition to its free APIs, Twitter also offers access to the entire population of tweets through its “Firehose”, albeit for a substantial price. This cost barrier has prevented many researchers from adopting the Firehose, so the free Streaming API is still the standard for social media research.

While this data outlet has proven useful, recent research has found indications of bias in the way it distributes its data [1, 2]. In Morstatter et al. 2013 [2], the authors compare the free Streaming API with the full Firehose and examine how well certain measures hold up across the two data outlets. Their findings indicate that the way in which Twitter’s APIs distribute the data may in fact change the results of common measures. For example, when the authors inspected the top hashtags in the data, they found that the top hashtags coming through each outlet were significantly different. Biases such as these can cause researchers to find patterns that don’t truly exist in the data, but only appear due to technical artifacts in the way the data is distributed.
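To make the hashtag comparison concrete, here is a minimal Python sketch of one way to check how much the top hashtags from two outlets agree. The function names, the regular expression, and the use of Jaccard overlap are assumptions chosen for illustration; they are not necessarily the measure used in [2].

```python
import re
from collections import Counter

# Simplified hashtag pattern (an assumption; real tweets carry
# structured hashtag entities that would be more reliable).
HASHTAG = re.compile(r"#(\w+)")


def top_hashtags(tweets, k):
    """Return the k most frequent hashtags (lowercased) in a list of tweet texts."""
    counts = Counter(
        tag.lower() for text in tweets for tag in HASHTAG.findall(text)
    )
    return [tag for tag, _ in counts.most_common(k)]


def jaccard_overlap(a, b):
    """Share of items common to both lists: 1.0 is identical, 0.0 is disjoint."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy data standing in for the two outlets (hypothetical tweets).
firehose_tweets = ["#a #b", "#a #c", "#b"]
streaming_tweets = ["#c #d", "#c"]

overlap = jaccard_overlap(
    top_hashtags(firehose_tweets, 2), top_hashtags(streaming_tweets, 2)
)
```

A low overlap between the two top-k lists would suggest that the sample distorts what appears to be popular; a rank correlation could be substituted when the ordering of the hashtags matters as well.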

Researchers have attempted to correct for this bias. Morstatter et al. 2014 [1] hypothesize that the bias in the stream is a function of the amount of data in the stream during a particular time period, and attempt to find unbiased time periods in the data. To find these periods, the authors compare the Streaming API with another Twitter data outlet, the Sample API, which returns a uniformly random sample of 1% of all public tweets on Twitter. By comparing the data returned by the Streaming API with that returned by the Sample API, an analyst can identify and remove periods of bias from their data.
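The comparison between the two outlets can be sketched as follows. This is a simplified illustration, not the procedure from [1]: it assumes we already have per-window counts of one hashtag from each outlet, normalizes each series so that only the shape of the trend is compared, and flags windows where the two shapes diverge by more than a hypothetical `tolerance`.

```python
def normalize(counts):
    """Scale a series of counts so that it sums to 1 (all zeros if empty)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def flag_biased_windows(stream_counts, sample_counts, tolerance=0.05):
    """Return indices of time windows where the Streaming API trend
    departs from the Sample API trend by more than `tolerance`.

    Both inputs are per-window counts of the same hashtag; `tolerance`
    is an illustrative threshold, not a value taken from the literature.
    """
    if len(stream_counts) != len(sample_counts):
        raise ValueError("both series must cover the same time windows")
    stream = normalize(stream_counts)
    sample = normalize(sample_counts)
    return [
        i
        for i, (s, r) in enumerate(zip(stream, sample))
        if abs(s - r) > tolerance
    ]


# Toy series: the stream spikes in window 2 while the sample stays flat.
suspect = flag_biased_windows(
    [10, 10, 40, 10, 10], [2, 2, 2, 2, 2], tolerance=0.1
)
```

In practice a fixed threshold is crude; resampling the Sample API data to obtain a confidence band around its trend would give a more principled test of whether a window is out of line.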

2 Evaluation at Scale

One salient problem with social media data is that it often lacks the labels crucial for many supervised machine learning tasks. For example, in the context of sentiment analysis, most social media posts do not come with an explicit label saying that the post is either “happy” or “sad”. Instead, researchers wishing to build a sentiment classifier have to collect these labels themselves in order to establish a training dataset.

Amazon has created a tool called Amazon Mechanical Turk (AMT), which allows researchers to have their data labeled by humans quickly. Researchers submit labeling tasks, called HITs, which are then solved by non-expert humans for a small reward. The labels produced by these non-expert AMT workers are surprisingly accurate. In Snow et al. 2008 [3], the authors find that, when aggregated, the results of AMT workers rival those of domain experts in Natural Language Processing tasks.
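As an illustration of what aggregating worker labels can look like, the sketch below applies a simple majority vote per item. Majority voting is an assumption made here for clarity and is not necessarily the aggregation scheme used in [3]; the item IDs, worker IDs, and labels are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Aggregate (item_id, worker_id, label) triples into one label per item.

    Ties are broken in favor of the label that was submitted first.
    """
    labels_by_item = defaultdict(list)
    for item_id, _worker_id, label in annotations:
        labels_by_item[item_id].append(label)
    return {
        item_id: Counter(labels).most_common(1)[0][0]
        for item_id, labels in labels_by_item.items()
    }


# Three hypothetical workers label two posts.
annotations = [
    ("post1", "w1", "happy"),
    ("post1", "w2", "happy"),
    ("post1", "w3", "sad"),
    ("post2", "w1", "sad"),
    ("post2", "w2", "sad"),
    ("post2", "w3", "happy"),
]
aggregated = majority_vote(annotations)
```

Collecting several labels per item is what makes aggregation possible, which is also why the cost of labeling grows quickly with the size of the dataset.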

While Mechanical Turk provides an effective way to produce labeled data, it can become costly in terms of both time and money when the data to be labeled is big. Given that simple algorithms trained on more data often outperform complex models, we face a quandary: Mechanical Turk is useful, but its cost alone may prevent us from extracting the most information from our big data.

Consider the task of topic modeling. In this task, a topic modeling algorithm is run over a corpus of data, and topics are produced automatically. One drawback is that many of the topics produced will not be useful. To address this, topics are often shown to AMT workers and evaluated based on how well the workers can interpret them. Topics that are judged interpretable are kept, and the rest are discarded. However, the number of HITs required for this process grows with the number of topics generated, which poses a scalability problem. To get around this issue, researchers have discovered and evaluated automatic measures that correlate with the answers obtained from the AMT workers. By replacing the AMT responses with these measures, researchers are able to evaluate their topics at scale.
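The article does not name the specific measures, so the following is only a sketch of one common family: a co-occurrence-based coherence score, which rewards topics whose top words tend to appear in the same documents. The function name, the smoothing constant `eps`, and the toy corpus are assumptions for illustration.

```python
import math
from itertools import combinations


def coherence(top_words, documents, eps=1.0):
    """Co-occurrence coherence of a topic's top words (higher is better).

    top_words: words ordered from most to least probable in the topic.
    documents: iterable of tokenized documents (lists of words).
    eps: smoothing constant that keeps the logarithm defined when two
         words never co-occur.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing every one of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for higher, lower in combinations(top_words, 2):
        denom = doc_freq(higher)
        if denom == 0:
            continue  # word absent from the corpus; skip the pair
        score += math.log((doc_freq(higher, lower) + eps) / denom)
    return score


# Toy corpus: "cat" and "dog" co-occur, "cat" and "market" never do.
docs = [
    ["cat", "dog"],
    ["cat", "dog"],
    ["cat", "fish"],
    ["stock", "market"],
]
coherent = coherence(["cat", "dog"], docs)
incoherent = coherence(["cat", "market"], docs)
```

Because a score like this needs only document counts, it can be computed for every topic without issuing any HITs; whether it is a suitable stand-in for human judgment on a given corpus still has to be validated against a small set of human ratings.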

3 Summary

Evaluation on big social media data is challenging due to the lack of ground truth information associated with the data, as well as issues that may arise during collection. We have provided an overview of some of the evaluation challenges that arise on social media, along with illustrative solutions. Researchers need to be diligent to ensure the quality of the data used in their work.


[1] Fred Morstatter, Juergen Pfeffer, and Huan Liu. When is it Biased?: Assessing the Representativeness of Twitter’s Streaming API. In WWW 2014, pages 555–556, 2014.
[2] Fred Morstatter, Juergen Pfeffer, Huan Liu, and Kathleen M. Carley. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. In ICWSM, pages 400–408, 2013.
[3] Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. Cheap and Fast—But is it Good?: Evaluating Non-Expert Annotations for Natural Language Tasks. In EMNLP, pages 254–263. Association for Computational Linguistics, 2008.
