Beating the news using social media: the case study of American Idol
© Ciulla et al.; licensee Springer 2012
Received: 11 June 2012
Accepted: 13 July 2012
Published: 31 July 2012
We present a contribution to the debate on the predictability of social events using big data analytics. We focus on the elimination of contestants in the American Idol TV show as an example of a well-defined electoral phenomenon that each week draws millions of votes in the USA. This event can be considered a basic test, in a simplified environment, of the predictive power of Twitter signals. We provide evidence that Twitter activity during the time span defined by the airing of the TV show and the voting period following it correlates with the contestants' ranking and allows the voting outcome to be anticipated. Twitter data from the show and the voting period of the season finale were analyzed in an attempt to predict the winner ahead of the airing of the official result. We also show that the fraction of tweets that contain geolocation information allows us to map the fanbase of each contestant, both within the US and abroad, showing that strong regional polarizations occur. The geolocalized data are crucial for the correct prediction of the final outcome of the show, pointing out the importance of considering information beyond the aggregated Twitter signal. Although American Idol voting is just a minimal and simplified version of complex societal phenomena such as political elections, this work shows that the volume of information available in online systems permits the real-time gathering of quantitative indicators that may be able to anticipate the future unfolding of opinion formation events.
The recent global surge in the use of technologies such as social media, smart phones and GPS devices has fundamentally changed the way we live our lives. Our use of such technologies also has a much less visible, but no less significant, consequence: the massive-scale collection of extremely detailed data on social behavior provides a unique and unprecedented opportunity to observe and study social phenomena in a completely unobtrusive way. The public availability of such data, although limited, has already ignited a flurry of research into the development of indicators that can act as distributed proxies for what is occurring around the world in real time. In particular, search engine queries and posts on microblogging systems such as Twitter have been used to forecast the spread of epidemics, stock market behavior and election outcomes [3–6] with varying degrees of success. However, as many authors have pointed out, there are several challenges one must face when dealing with data of this nature, such as intrinsic biases and uneven sampling across locations of interest [7–10].
In this paper we intend to assess the usefulness of open source data for the forecasting of societal events. We analyze in depth the microblogging activity surrounding the voting on the contestants in American Idol, one of the most viewed American TV shows, as a simple case study of this type of process. In this program, the audience is asked to choose which contestants go forward in the competition by voting for their favorites. The well-delineated time frame (a period of just a few hours) and frequency (every week) over an extended period (an entire TV season) provide a close-to-ideal testing ground for the study of electoral outcomes, as many of the assumptions implicitly used in the analysis of social phenomena are easier to defend, if not trivially true, in the case of the American Idol competition. In particular, we assume that:
The demographics of users tweeting about American Idol are representative of the voting pool.
The self-selection bias, according to which the people discussing politics on Twitter are likely to be activists who are scarcely representative of the average voter, seems to become almost a positive discrimination factor in the case of a TV show where the voters are by definition self-selected.
Voting fans are the most motivated subset of the audience (the population we are trying to probe) that are willing to make an extra effort for no personal reward, and, crucially, they are allowed to vote multiple times.
Users are not malicious, and engage only in conversations they have a particular interest in.
The influence of incumbency, which strongly affects the outcome of political elections, is not a factor determining the outcome of American Idol.
For the above reasons we can consider TV show competitions as a case study for the use of open source indicators to achieve predictive power about social phenomena, or simply to beat the news. It is thus not surprising that other attempts to use open source indicators in this context have been proposed in the past. Here, however, we benefit from the constant growth of Twitter, which makes it easier to collect a statistically significant sample of the population. Furthermore, TV shows now leverage Twitter and other social platforms, which are becoming in all respects a mainstream part of the show. This amplifies the importance of the indicators one can extract from these media in monitoring the competition. Finally, the increasing use of smartphones and mobile devices produces geolocalized information about Twitter activity that we can mine. We show that including the geographical information is a key ingredient in achieving predictive power. This final consideration clearly indicates that the prospective use of Twitter data for predicting social events in other settings should involve analyses that go well beyond the aggregated number of tweets.
The first episode of the 11th season of American Idol was aired on January 18, 2012 with a total of 42 contestants. After an initial series of eliminations made by the judges, a final set of 13 participants was selected. All further eliminations were decided by the audience through a simple voting system. During this final phase of the competition, two episodes are aired each week. On Wednesday the participants perform on stage and the public is invited to vote for two hours after the show ends. Voting can take one of three forms: toll-free phone calls, texting and online voting. The rules of the competition only allow votes cast by residents of the US, Puerto Rico and the US Virgin Islands. There is no limit to the number of messages or calls each person can make, while online votes are limited to 50 per computer, as identified by its unique IP address. Every week, hundreds of millions of votes are counted and the contestant who gathers the fewest is eliminated. The show airs at 8.00 PM local time on each coast. As a result of the three-hour time zone difference between the East and West coasts, the total voting window between the first and last possible vote is 10.00 PM-3.00 AM EST. For the season's final performance episode the voting period is extended to four hours after the show airs, resulting in an extended voting window of 9.00 PM-4.00 AM EST.
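The voting-window arithmetic above can be checked with a short sketch. It is only an illustration under stated assumptions: the fixed UTC offsets, the function name `voting_window` and the show end times (10.00 PM for a regular episode, 9.00 PM for the final performance episode, both inferred from the windows quoted in the text) are not taken from the original analysis.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets keep the sketch self-contained; only the
# three-hour gap between the two coasts matters for the result.
EAST = timezone(timedelta(hours=-5), "EST")
WEST = timezone(timedelta(hours=-8), "PST")

def voting_window(air_date, show_end_hour, voting_hours):
    """Return (first, last) possible voting instants, both in East coast time."""
    year, month, day = air_date
    east_end = datetime(year, month, day, show_end_hour, tzinfo=EAST)
    west_end = datetime(year, month, day, show_end_hour, tzinfo=WEST)
    opens = east_end                                   # East coast show ends first
    closes = west_end + timedelta(hours=voting_hours)  # West coast voting ends last
    return opens, closes.astimezone(EAST)

# Regular performance episode: show ends at 10 PM local, two hours of voting.
regular = voting_window((2012, 5, 16), show_end_hour=22, voting_hours=2)
# Final performance episode: show ends at 9 PM local, four hours of voting.
finale = voting_window((2012, 5, 22), show_end_hour=21, voting_hours=4)

for label, (opens, closes) in (("regular", regular), ("finale", finale)):
    print(label, opens.strftime("%I:%M %p"), "to", closes.strftime("%I:%M %p"), "EST")
```

Under these assumptions the two windows come out as five and seven hours long, matching the 10.00 PM-3.00 AM and 9.00 PM-4.00 AM intervals quoted in the text.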
Text markers in support of each of the contestants
Phillip Phillips: #phillipphillips #philatic #philatics
Jessica Sanchez: @JSanchezAI11 @TeamJSanchez @TeamJaaySanchez #blujays #teambluJay #TeamJessicaSanchez #BBchez #JessicaSanchez
Joshua Ledet: @JLedetAI11 @JoshuaLedetNet @TeamJoshua @WeLuvLedet @TeamLedet #jjewels #JoshuaLedet #teamjosh #ledet #teamjoshei #JoshuaIsLegend
Hollie Cavanagh: #holliepop #teamhollie #holliepopsfamily #holliecavanagh
Skylar Laine: #skoutlaw #skoutlaws #skollie #skylarlaine
Heejun Han: @HHanAI11 @MadiHeartHeejun @HeeHangels
The main dataset was obtained by extracting matching tweets from the raw Twitter feed used by Truthy for the entire duration of the current season of American Idol. The feed is a sample of about 10% of all tweets and provides a statistically significant, real-time view of the topics discussed within the Twitter ecosystem. This allowed us to make a post-event analysis of the last 9 eliminations. We complemented this dataset by automatically querying the Twitter search API every 10 minutes for tweets containing one or more of the keywords we identified as related to American Idol. The search API data cover the period since May 16, giving us a more detailed view of the last elimination before the season's finale.
Tweets in our dataset often contain georeferenced location information. This can be GPS coordinates, assigned automatically by smart phones, or a self-reported location. We consider both. Users with smart phones may also use their accounts on other devices, so geographical coordinates may be present in only a fraction of their tweets. In order to increase the number of geolocalized tweets we analyzed the whole set of Twitter data collected since the beginning of 2012. We mined the data to find the geographical coordinates of all users with at least one geolocalized tweet in the dataset. Using this information we were able to assign geographical references also to tweets that did not contain them, as long as they were sent by a user we had previously geolocalized.
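The propagation step described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the tweet representation (dictionaries with `user` and `coords` keys), the function name `propagate_locations` and the rule of keeping the first location seen for each user are all assumptions, since the text does not say how several locations for one user are reconciled.

```python
def propagate_locations(tweets):
    """Give each tweet a location, borrowing from the same user's geolocated tweets.

    Tweets whose author was never geolocated are dropped from the result.
    """
    # First pass: remember one known location per user (assumption: the first seen).
    user_location = {}
    for tweet in tweets:
        if tweet.get("coords") is not None:
            user_location.setdefault(tweet["user"], tweet["coords"])

    # Second pass: keep a tweet's own coordinates, else fall back on its author's.
    located = []
    for tweet in tweets:
        coords = tweet.get("coords")
        if coords is None:
            coords = user_location.get(tweet["user"])
        if coords is not None:
            located.append({**tweet, "coords": coords})
    return located

# Hypothetical toy input: user "a" is geolocated once, user "b" never.
tweets = [
    {"user": "a", "text": "first", "coords": (40.7, -74.0)},
    {"user": "a", "text": "second", "coords": None},
    {"user": "b", "text": "third", "coords": None},
]
located = propagate_locations(tweets)
print(len(located), "of", len(tweets), "tweets now carry a location")
```

In this toy input the second tweet inherits the coordinates of the first, while the tweet from the never-geolocated user is left out.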
45 ± 4    64.2 ± 2.2    92.8 ± 1.9
15 ± 3    9.8 ± 1.3    1.4 ± 0.9
40 ± 4    26 ± 2.0    5.8 ± 1.7
Our fundamental, and seemingly naive, assumption is that the number of votes each contestant receives is proportional to the number of tweets that mention her. In other words, the larger the number of tweets referring to a contestant (the Twitter volume), the larger the number of votes she will get. This gives a natural measure by which to rank the contestants. It is important to note that this is a very simple measure, and that we intentionally choose not to take into account many of the factors that in principle might affect the results, such as the presence of negative or neutral tweets, or attempts to directly affect the counts by spamming the system with automatically generated tweets. In fact, one of the goals of this paper is to test whether a minimal set of measures applied to Twitter data can be a good indicator of the actual voting outcome. Past attempts have met with mixed results, and we are interested in testing the limits of this naive approach by building an unsophisticated prediction system assembled in less than one week.
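As an illustration of the Twitter volume measure, the sketch below counts how many tweets mention each contestant and ranks the contestants by that count. It is a sketch under assumptions, not the system built for the paper: the marker lists are a small subset of the text markers listed earlier, the example tweets are invented, and matching on whole lower-cased tokens is one possible matching rule among many.

```python
from collections import Counter

# Assumption: a subset of the text markers, lower-cased for matching.
MARKERS = {
    "Phillip": ["#phillipphillips", "#philatics"],
    "Jessica": ["#jessicasanchez", "@jsanchezai11"],
    "Joshua": ["#joshualedet", "@jledetai11"],
}

def tweet_volume(texts, markers):
    """Count, for each contestant, the tweets containing at least one marker."""
    counts = Counter({name: 0 for name in markers})
    for text in texts:
        # Compare whole tokens so that one marker cannot match inside another.
        tokens = {tok.strip(".,!?:;").lower() for tok in text.split()}
        for name, tags in markers.items():
            if tokens & set(tags):
                counts[name] += 1
    return counts

def rank_by_volume(counts):
    """Contestants ordered from most to least mentioned."""
    return sorted(counts, key=counts.get, reverse=True)

# Invented example tweets, for illustration only.
texts = [
    "Go #PhillipPhillips!",
    "voting #philatics all night",
    "love @JSanchezAI11",
    "#JessicaSanchez sings tonight",
    "#JoshuaLedet #phillipphillips duet",
]
volume = tweet_volume(texts, MARKERS)
ranking = rank_by_volume(volume)
print(ranking, dict(volume))
```

A tweet that mentions two contestants counts once for each of them, and every matching tweet is counted regardless of its tone, in line with the choice to ignore sentiment.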
While our dataset spans the entire duration of the current season, we focus only on the top-ten phase of the show, when just 10 contestants remained, and test the predictive power of the Twitter proxy against the last 9 eliminations. For 7 of those, the ‘bottom-three’ contestants, that is, the three contestants who received the fewest votes (2 in the elimination of May 3rd), were revealed during the iconic part of the show: elimination day. We consider success not just in predicting the contestant who will be eliminated but also in predicting the three who received the fewest votes.
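One way to score a single week against these two criteria is sketched below. The function names, the contestant labels A to E and the tweet counts are hypothetical placeholders, not data or code from the study.

```python
def predicted_bottom(volume, k=3):
    """Names of the k contestants with the fewest tweets, least mentioned first."""
    return sorted(volume, key=volume.get)[:k]

def score_week(volume, actual_bottom, eliminated):
    """Compare the Twitter-based prediction with the outcome announced on the show."""
    bottom = predicted_bottom(volume, k=len(actual_bottom))
    return {
        # How many of the announced least-voted contestants were identified.
        "bottom_overlap": len(set(bottom) & set(actual_bottom)),
        # Whether the least mentioned contestant is the one who was eliminated.
        "elimination_hit": bottom[0] == eliminated,
    }

# Hypothetical week: tweet counts per contestant and the announced outcome.
volume = {"A": 500, "B": 120, "C": 340, "D": 90, "E": 210}
result = score_week(volume, actual_bottom=["D", "E", "C"], eliminated="D")
print(result)
```

Passing a two-element `actual_bottom` covers weeks in which only two contestants were revealed.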
Our methodology considers the number of tweets as the main indicator of the popularity of each contestant. By construction, n tweets posted by n different users and n tweets posted by the same user count equally. This might introduce biases in the results, since simple tweet counting could in principle be affected by bots or spam campaigns. Therefore, in order to test our measure, we also analyzed the data by ranking each contestant by the number of unique users. The results are unchanged within the statistical errors.
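The robustness check described above amounts to computing two rankings from the same data, as in the following sketch. The input format (one `(user_id, contestant)` pair per matching tweet), the function name and the toy data are assumptions. The toy input is deliberately built so that a single overactive user flips the ranking, which is the distortion the check is meant to detect; in the data analyzed here the two rankings agree within the statistical errors.

```python
from collections import defaultdict

def rank_by_unique_users(mentions):
    """Rank contestants by distinct users and, for comparison, by raw tweet count.

    mentions: iterable of (user_id, contestant) pairs, one per matching tweet.
    """
    users = defaultdict(set)
    tweets = defaultdict(int)
    for user, contestant in mentions:
        users[contestant].add(user)
        tweets[contestant] += 1
    by_users = sorted(users, key=lambda c: len(users[c]), reverse=True)
    by_tweets = sorted(tweets, key=tweets.get, reverse=True)
    return by_users, by_tweets

# Hypothetical input: user u1 posts five tweets about contestant A.
mentions = (
    [("u1", "A")] * 5
    + [("u5", "A")]
    + [("u2", "B"), ("u3", "B"), ("u4", "B")]
)
by_users, by_tweets = rank_by_unique_users(mentions)
print("by unique users:", by_users, "| by tweets:", by_tweets)
```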
On May 23 the last episode of the program was aired. Around 10.00 PM EST, the winner of the 11th season was announced: Phillip.
Three days before, on May 20, we submitted to arXiv the first draft of this paper, containing the methodology and the post-event analysis of the previous nine eliminations. At that moment, two more episodes of the show were still to be aired: the final performances on Tuesday, May 22, and the season finale on the next day.
We made our predictions based on the data collected on May 22, between the beginning of the show on the East Coast at 8.00 PM EST and the end of the voting period on the West Coast at 4.00 AM EST. They were discussed in an updated version of the paper submitted to arXiv hours before the official announcement.
Our analysis shows the importance of geolocated signals. Far from being an additional piece of information, the geographical origin of the tweets turned out to be essential in gaining a clear understanding of the situation we were addressing. This is likely to be a general message, valid for any election-like process, where global popularity risks being a poor indicator and might induce wrong interpretations and forecasts.
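A small sketch can make this point concrete. It compares each contestant's share of geolocated tweets computed over all regions with the share computed only over regions whose residents may vote. The region labels, contestant labels and counts are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def shares_by_region(located_mentions):
    """Per-region share of tweets for each contestant.

    located_mentions: iterable of (region, contestant) pairs, one per geolocated tweet.
    """
    counts = defaultdict(Counter)
    for region, contestant in located_mentions:
        counts[region][contestant] += 1
    return {
        region: {name: n / sum(c.values()) for name, n in c.items()}
        for region, c in counts.items()
    }

def overall_shares(located_mentions, regions=None):
    """Shares over all tweets, or only over tweets from the given regions."""
    total = Counter(
        contestant
        for region, contestant in located_mentions
        if regions is None or region in regions
    )
    n = sum(total.values())
    return {name: k / n for name, k in total.items()} if n else {}

# Invented counts: contestant B is popular abroad, where votes are not allowed.
mentions = [("US", "A")] * 6 + [("US", "B")] * 4 + [("abroad", "B")] * 8
global_view = overall_shares(mentions)
eligible_view = overall_shares(mentions, regions={"US"})
print("all tweets:", global_view)
print("eligible regions only:", eligible_view)
```

In this toy input contestant B leads the aggregated signal but trails once the tweets are restricted to the region where votes can be cast, which is the kind of reversal that an aggregate count alone would hide.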
We have shown that the open source data available on the web can be used to make educated guesses about the outcome of societal events. Specifically, we have shown that extremely simple measures quantifying the popularity of the American Idol participants on Twitter strongly correlate with their performance in terms of votes. A post-event analysis shows that the least-voted competitors can be identified with reasonable accuracy (Table 3) by looking at the Twitter data collected during the airing of the show and in the hours immediately following it.
It is worth noting that our analysis aims to be extremely simple in order to establish a valid baseline for what can be deduced from social media. As such, we purposefully do not consider a number of refinements and techniques that could improve the accuracy of our predictions. Distortions due to overactive users can be controlled by evaluating the number of unique users tweeting about each contestant. The text of the tweets could be scrutinized using sentiment analysis techniques to select and compare only specific positive or negative tweets as a proxy for success or failure. Corrections for the demographic composition of Twitter users could be considered. All these techniques have been or are being developed in the analysis of a wealth of social phenomena and could be tested in a very clear and simple setting such as that of American Idol or similar shows.
Furthermore, we have illustrated that open source data can provide deeper insight into the composition of the audience, with the possibility of pointing out sources of anomalous behavior. A geographical projection of the data reveals a non-uniform distribution of the basins of fans, and likely of voters, for the different participants. Interestingly, the same inspection also highlights that strong activity concerning some of the candidates may come from non-US countries, whose audiences are not allowed to vote.
Finally, our work sounds a note of warning about the possible feedback between competitive TV shows and social media. Indeed, while the former rely more and more on the online voting of the audience, and the votes are kept secret and revealed only at the end of the show, all of the data necessary to monitor and even forecast the outcome of these shows are publicly available on the web. Given the large economic interests that lie behind such programs, such as the revenues of betting agencies and the major contracts of the show participants, this situation can clearly lead to a number of undesirable outcomes. For example, the audience could be induced to alter their behavior according to the situation they observe, and the job of betting agencies could be dramatically simplified. More generally, our results highlight that the aggregate preferences and behaviors of large numbers of people can nowadays be observed in real time, or even forecast, through open source data freely available on the web. The task of keeping them private, even for a short time, has therefore become extremely hard (if not impossible), and this trend is likely to become more and more evident in the coming years.
The authors would like to thank Duygu Balcan for generating the cartograms used in this manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.