
Design and analysis of tweet-based election models for the 2021 Mexican legislative election

Abstract

Modelling and forecasting real-life human behaviour using online social media is an active endeavour of interest in politics, government, academia, and industry. Since its creation in 2006, Twitter has been proposed as a potential laboratory that could be used to gauge and predict social behaviour. During the last decade, the user base of Twitter has been growing and becoming more representative of the general population. Here we analyse this user base in the context of the 2021 Mexican Legislative Election. To do so, we use a dataset of 15 million election-related tweets in the six months preceding election day. We explore different election models that assign political preference to either the ruling parties or the opposition. We find that models using data with geographical attributes determine the results of the election with better precision and accuracy than conventional polling methods. These results demonstrate that analysis of public online data can outperform conventional polling methods, and that political analysis and general forecasting would likely benefit from incorporating such data in the immediate future. Moreover, the same Twitter dataset with geographical attributes is positively correlated with results from official census data on population and internet usage in Mexico. These findings suggest that we have reached a period in time when online activity, appropriately curated, can provide an accurate representation of offline behaviour.

1 Introduction

In 1824, newspapers reported arguably the earliest public opinion polls in the context of the United States presidential election campaign of the same year [1]. Since then, diverse forms of media, such as newspapers, radio, and television, have been systematically used to track and discuss elections all over the world. The more recent dawn of social media led to the rapid development of the online arena as the preferred venue for large-scale political discussions and campaigning [26]. The microblogging and social networking service Twitter has become increasingly relevant in political discussions, both from the perspective of engaging citizens [57] and campaigning politicians [8, 9]. In the early days of Twitter, imbalanced representation and population biases were a major concern for modelling and forecasting [10, 11]. The increase in internet users worldwide [12, 13], together with hints of increasing homogenization among Twitter users [14, 15], highlights the potential of social media as a powerful tool for political and societal analysis [16, 17].

Twitter data has been used to study a wide range of topics, including troll activity [18], cognitive reflection [19], digital trace data to study migration and migrants [20], expressed sentiment alterations during the COVID-19 pandemic [21, 22], misinformation spread during earthquakes [23], and spatial analysis of gunshot reports [24]. In the context of politics, research has focused on the influence of fake news [7, 25], campaigning [8, 9, 26], echo chambers [27], polling [28], and elections [29]. For online polling and election predictions, the preferred methods of analysis are based on sentiment [6], volume [5], or social networks [28], and hybrid methods are the exception rather than the rule [29]. These analyses have led to a wide variety of results, including those where Twitter opinion predicts the results from aggregated polls [28] or outperforms polls [30], but also those where forecasting failed to predict the broad outcome of an election [6] (see the recent surveys by [29, 31–33] for a thorough review of the use of social media data for election predictions).

In this paper we gauge representativeness of online activity with respect to real-life behaviour. We compare Twitter data with conventional aggregated polls [34] and official results [35] of the 2021 Mexican legislative election (Sect. 2). In Sect. 3 we present our methodology, particularly data mining (Sect. 3.1), allegiance determination (Sect. 3.2), and election models (Sect. 3.3). In Sect. 4 we present our results, which are discussed in Sect. 5. Finally, Sect. 6 presents the concluding remarks of this paper.

2 Background

The 2021 Mexican legislative election was a federal election that took place in Mexico on June 6, 2021 (Election Day). In this election, 500 seats were elected and allocated in the Chamber of Deputies. This election takes place every three years and is organised and overseen by the National Electoral Institute (INE, for its acronym in Spanish). The INE is an autonomous, public agency in charge of organising and reporting the results of federal elections. Additionally, the INE is in charge of approving coalitions made among parties. In Mexico, coalitions among parties have become increasingly common in the last decades.

For the 2021 Mexican legislative election, there were two major coalitions, of three parties each, and four additional individual parties. The coalition “Juntos Hacemos Historia” comprised the parties MORENA, PT, and PVEM. At the time of the election, MORENA was the party which held the majority of seats in the Chamber of Deputies. Moreover, the president of Mexico, who took office on 1 December 2018, has been affiliated with MORENA since the gestation of the party in 2011. Therefore, here and hereafter, we refer to the coalition Juntos Hacemos Historia as the ruling parties. The coalition “Va por México” comprised the strong alliance of the (historically opposing) parties PAN, PRI, and PRD. In order to analyse this multi-party election, we opt for a bipartisan model where the coalition “Va por México” and the remaining, smaller parties (MC, PES, FxM, and RSP) are agglomerated as the opposition.

3 Methods

We queried Twitter via the Twitter API for Academic Research to retrieve election-related tweets (Sect. 3.1.1). Our goal is to compare Twitter data to both official results and aggregated polls (Sect. 3.1.2). We process the Twitter data in order to determine the allegiance of each tweet with respect to the ruling parties or the opposition (Sect. 3.2). We then construct nine different models of the election (Sect. 3.3).

3.1 Data mining

3.1.1 Twitter

We retrieved data from Twitter using the Twitter API for Academic Research via the twarc Python library [36]. Between November 2021 and February 2022, we queried and retrieved the pertinent tweets posted between December 1st, 2020 and May 31st, 2021 at 05:00:00 UTC, which corresponds to midnight in Mexico City. This corresponds to the 6 complete months preceding Election Day on June 6, 2021. We make individual, tailored queries for each of the 10 individual parties in the 2021 Mexican legislative election [35]. We agglomerate the individual parties in a bipartisan model: the ruling parties vs the opposition. The ruling parties comprise 3 individual parties with the following acronyms: MORENA, PT, and PVEM. The opposition comprises 7 individual parties with the following acronyms: PAN, PRI, PRD, MC, PES, FxM, and RSP. We tried to keep the queries as simple and concise as possible (Footnote 1). All queries contain at least the name of the party and, when available, the verified Twitter handle. Most queries also included hashtags; these could be as simple as a reference to the queried party (e.g. #PAN for PAN) or include hashtags frequently used by the official Twitter accounts (e.g. #SomosPT as used by PT). The case of MORENA is a peculiar one, as the acronym (intentionally) means “brunette” in Spanish. This query requires some additional filters, determined by doing manual tests on small data samples and excluding keywords from the query, so that it retrieves almost exclusively tweets associated with the election. As a result, this query is ad hoc and long [37], but within the allowed number of characters. Finally, during post-processing (i.e. not during the extraction via the Twitter API) all queries were restricted to results in Spanish. This results in ≈15 million tweets from ≈723,000 unique users posted between December 2020 and May 2021.

3.1.2 Official results and polling aggregates

We use the official vote count as provided by the INE [35]. For the polling aggregates, we use publicly available data as provided by Oraculus, a website which specializes in political analyses [34]. The data provided by Oraculus contains public polls from news outlets and market research agencies. The data includes a total of 26 polls between December 2020 and May 2021, comprising both phone and in-person polls. We use the effective data, which excludes undecided and unresponsive interviewees. Oraculus uses the effective data from November 2018 to May 2021 to build a Bayesian model.

3.2 Allegiance determination

Our modeling analysis relies on determining the allegiance of each individual tweet, including retweets, with respect to the query from where it was retrieved. Figure 1 presents the allegiance distributions of the ruling parties and the opposition, in the form of violin plots, for different subsets of the data of May 2021 (the month preceding Election Day). We determine the allegiance in the following way. First, we create a matrix of tweets that we tokenize using the Tokenizer Package from the Natural Language Toolkit (NLTK) [38]. Then we convert the text into a matrix of token counts using the CountVectorizer function from the scikit-learn Python library [39]. We proceed to apply the multinomial Naive Bayes classifier, from the scikit-learn library, to the tokenized text data to estimate the probability that the tweet has a positive connotation. This probability is close to 0 (1) when the tweet expresses a negative (positive) connotation. For all parties we used supervised machine learning with manually classified tweets. The manually classified tweets are categorized as either positive (p) or negative (n). We manually categorize a set of tweets pertinent to the election, use 85% of this labelled set to train our model, and use the remaining 15% to test the trained model. Each party was classified using an individual training set with 100–1000 messages. Table 1 shows the summary of our training models. We have also tested the effectiveness of a pre-trained transformer model [40]; however, we have found that our Naive Bayes classifier is better at correctly classifying tweets that simultaneously discuss the ruling parties and the opposition. The output of our algorithm is a matrix that, for each tweet, indicates the following attributes: tweet ID, user ID, region, country, party-of-interest (e.g., PAN), estimated allegiance (\(0.0\leq\mathcal{A} \leq1.0\)), date, and coalition. 
This matrix is the main output we use to perform the analyses in this manuscript.
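
To make the allegiance score concrete, the following is a minimal, dependency-free sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is not the pipeline used in this work, which relies on NLTK and scikit-learn; the class name `TinyMultinomialNB`, the regular-expression tokenizer, and the four English training messages are hypothetical and serve only to illustrate how word counts become a probability between 0 and 1.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word-like tokens, keeping # and @ prefixes."""
    return re.findall(r"[#@]?\w+", text.lower())


class TinyMultinomialNB:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.word_counts = {c: Counter() for c in self.classes}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def log_joint(self, text, c):
        """log P(c) + sum of log P(token | c) over known tokens."""
        total = sum(self.word_counts[c].values())
        n_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[c] / n_docs)
        for tok in tokenize(text):
            if tok in self.vocab:  # tokens never seen in training are ignored
                count = self.word_counts[c][tok]
                score += math.log((count + 1) / (total + len(self.vocab)))
        return score

    def allegiance(self, text):
        """Posterior probability of the positive class 'p', in [0, 1]."""
        scores = {c: self.log_joint(text, c) for c in self.classes}
        top = max(scores.values())
        weights = {c: math.exp(s - top) for c, s in scores.items()}
        return weights["p"] / sum(weights.values())


# Hypothetical, manually labelled toy messages (p = positive, n = negative).
train_texts = [
    "great proposal from the party",
    "proud to support this party",
    "terrible corrupt party",
    "worst government ever corrupt",
]
train_labels = ["p", "p", "n", "n"]

model = TinyMultinomialNB().fit(train_texts, train_labels)
positive_score = model.allegiance("proud of this great proposal")
negative_score = model.allegiance("corrupt and terrible")
```

In this toy setting `positive_score` lies above 0.5 and `negative_score` below it, mirroring the convention that an allegiance near 1 expresses approval and an allegiance near 0 expresses disapproval.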

Figure 1

Violin plots comparing the determined allegiance (\(\mathcal{A}\)) of unique users to the ruling parties (red) and the opposition (blue), where \(\mathcal{A}\approx 0\) means disapproval and implies a negative allegiance, while \(\mathcal{A}\approx 1\) means approval and implies a positive allegiance. The data is from May 2021, the month preceding Election Day. For all the distributions, the bulk of the data is around \(\mathcal{A}\approx 0\) and there are local maxima around \(\mathcal{A}\approx 1\). The number of tweets used to construct each plot decreases from top to bottom. For the ruling parties(opposition), there is a total of 2.9(2.3) million messages, which encompass 304(201) thousand unique users, decreasing to 11(13) thousand messages with geodata posted by 2.4(2.7) thousand unique users. We denote the quartiles in each distribution with white dashed lines

Table 1 Summary of the testing and training accuracy of our allegiance determination model. For each party/coalition, we show the total number of tweets, total number of messages of the training set, as well as the number of them that have a negative (n) or a positive (p) connotation. We also provide the \(F_{1}\) and ROC AUC score

3.3 Election models

In order to compare our Twitter data with official results and estimates from polls, we explore different election models from the literature. These models capture different nuances in their assessment of when and how a tweet can be representative of a vote. Arguably, the simplest approach is to use a volumetric tweet-based model (VT) where each tweet that mentions a party is assigned a vote for that party’s coalition [5]. This model has obvious shortcomings; for example, a single enthusiastic user can easily generate a large number of tweets which are reflected as votes for a party. This problem is alleviated by using a volumetric user-based [41] model (VU) where each account that mentions parties in one coalition more often than those in the other coalition is counted as a voter for that coalition. However, both volumetric approaches ignore tweet sentiment. A tweet mentioning a party could be expressing a negative sentiment; Fig. 1 shows that a large proportion of the tweets in our dataset fall in this category. Consequently, we introduce a suite of models that incorporate results from sentiment analysis into their vote-share predictions [6]. First, an allegiance score, \(\mathcal{A}^{(y)}\), between 0 and 1 is computed and assigned to each tweet that mentions a party in coalition y. Here \(\mathcal{A}^{(y)}=0\), \(\mathcal{A}^{(y)}=0.5\), and \(\mathcal{A}^{(y)}=1\) correspond to negative, neutral, and positive attitudes towards y, respectively. We then compute vote-share predictions based on tweets (AT) and users (AU). The former approach simply sums the allegiance scores for each coalition, and the ratio of these sums is used to estimate the vote shares. The user-based approach computes an average allegiance for each user and for each coalition. These averages are then totalled, and the ratio of these totals is used to estimate the vote shares.
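
The four basic models described above can be sketched in a few lines. This is an illustrative reading rather than the code used for the analysis: the record layout `(user_id, coalition, allegiance)`, the function names, the toy data, and the decision to skip users who mention both coalitions equally often in the VU model are all assumptions. The vote share is taken here as the ruling-party total divided by the sum of both totals.

```python
from collections import defaultdict

# Hypothetical toy records: (user_id, coalition, allegiance).
# coalition 0 = ruling parties, 1 = opposition; allegiance lies in [0, 1].
tweets = [
    ("u1", 0, 0.9), ("u1", 0, 0.8), ("u1", 1, 0.1),
    ("u2", 1, 0.7), ("u2", 1, 0.6),
    ("u3", 0, 0.2), ("u3", 1, 0.4),
    ("u4", 0, 0.1),
]


def share(ruling, opposition):
    """Ruling-party vote share as a fraction of the two-bloc total."""
    return ruling / (ruling + opposition)


def volumetric_tweets(records):
    """VT: every tweet mentioning a coalition counts as one vote for it."""
    counts = [0, 0]
    for _, y, _ in records:
        counts[y] += 1
    return share(*counts)


def volumetric_users(records):
    """VU: each user votes for the coalition they mention more often."""
    per_user = defaultdict(lambda: [0, 0])
    for user, y, _ in records:
        per_user[user][y] += 1
    votes = [0, 0]
    for n0, n1 in per_user.values():
        if n0 != n1:  # assumption: tied users are not counted
            votes[0 if n0 > n1 else 1] += 1
    return share(*votes)


def allegiance_tweets(records):
    """AT: sum the allegiance scores per coalition and take the ratio."""
    totals = [0.0, 0.0]
    for _, y, a in records:
        totals[y] += a
    return share(*totals)


def allegiance_users(records):
    """AU: average allegiance per user and coalition, then total the averages."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(lambda: [0, 0])
    for user, y, a in records:
        sums[user][y] += a
        counts[user][y] += 1
    totals = [0.0, 0.0]
    for user in sums:
        for y in (0, 1):
            if counts[user][y]:
                totals[y] += sums[user][y] / counts[user][y]
    return share(*totals)


vt = volumetric_tweets(tweets)
vu = volumetric_users(tweets)
at = allegiance_tweets(tweets)
au = allegiance_users(tweets)
```

Even on this small example the four estimates differ, because user `u1` tweets often and mostly positively about the ruling parties: counting tweets, counting users, and weighting by allegiance each treat that activity differently.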

Both the volumetric and sentiment-based approaches fail to consider if Twitter users are actually representative of the general electorate. We propose two modifications to the sentiment approach to address this issue. The first alternative modifies the previously described models by only considering the subset of tweets that carry geolocation data, referred to as geodata hereafter. This allows us to compare the geographical distribution of people with our tweets and thus quantify one aspect of representativeness. In our results, models labeled with a G prefix use this subset while those labeled with a C prefix use the complete dataset. The second alternative model assumes that the subset of users posting predominantly positive tweets are more representative of the electorate; we will find this alternative model produces a good estimate of the election results.

The second alternative model does not rely on geodata and focuses on the online positive allegiance, per month, of unique users to a party. Vote share estimates in this model are constructed as follows. First, we label the tweets depending on whether they reference the ruling parties (\(y= 0\)) or the opposition (\(y= 1\)); tweets which reference both coalitions simultaneously receive both labels. For each label we average the allegiance of tweets per unique user in the form \(\overline{\mathcal{A}}^{(y)}_{i} := \sum_{n=1}^{N_{i}^{(y)}}\mathcal{A}^{(y)}_{i,n}/N_{i}^{(y)}\), where i is the index of a unique user, and \(N_{i}^{(y)}\) is the number of tweets posted by user i with label y. We then focus on users that express a positive connotation, exclusively, to either the ruling parties or the opposition; i.e., if a user expresses a positive attitude towards both the ruling parties and the opposition, they are excluded from the subsequent analysis. A positive connotation is defined by setting lower (\(x_{\mathrm{{low}}}\)) and upper (\(x_{\mathrm{{upp}}}\)) bounds for user allegiances and considering only the users within those limits. Therefore, if \(x_{\mathrm{{low}}} \leq\overline{\mathcal{A}}^{(0)}_{i} \leq x_{\mathrm{{upp}}}\) and \(\overline{\mathcal{A}}^{(1)}_{i}< x_{\mathrm{{low}}}\), the model determines that user i will vote in favour of the ruling parties; the same condition with the labels exchanged assigns the vote to the opposition. The default model uses \(x_{\mathrm{{low}}}=0.6\) and \(x_{\mathrm{{upp}}}=1\); however, we also present results where other values have been used (in the ranges \(0.1 \leq x_{\mathrm{{low}}} \leq0.7\) and \(0.7 \leq x_{\mathrm{{upp}}} \leq1\)).
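
A minimal sketch of the positive-allegiance rule follows, using the default bounds of 0.6 and 1. The function names and toy records are hypothetical, and one point is an assumption not stated in the text: a user who never mentions the other coalition is treated as having no positive allegiance towards it.

```python
from collections import defaultdict


def mean_allegiance(records):
    """Return {user: {coalition: mean allegiance}} from (user, coalition, score)."""
    acc = defaultdict(lambda: defaultdict(list))
    for user, y, a in records:
        acc[user][y].append(a)
    return {u: {y: sum(v) / len(v) for y, v in by_y.items()}
            for u, by_y in acc.items()}


def supports(means, y, x_low, x_upp):
    """True if the user is positive towards coalition y and not towards the other."""
    other = 1 - y
    in_band = y in means and x_low <= means[y] <= x_upp
    # Assumption: no tweets about the other coalition counts as "below x_low".
    other_below = other not in means or means[other] < x_low
    return in_band and other_below


def positive_allegiance_share(records, x_low=0.6, x_upp=1.0):
    """Return (ruling-party share, number of users counted as voters)."""
    votes = [0, 0]
    for means in mean_allegiance(records).values():
        for y in (0, 1):
            if supports(means, y, x_low, x_upp):
                votes[y] += 1
    total = sum(votes)
    return (votes[0] / total if total else None), total


# Hypothetical toy records; coalition 0 = ruling parties, 1 = opposition.
records = [
    ("u1", 0, 0.9), ("u1", 0, 0.7), ("u1", 1, 0.1),  # positive to ruling only
    ("u2", 1, 0.7), ("u2", 1, 0.9),                  # positive to opposition only
    ("u3", 0, 0.7), ("u3", 1, 0.8),                  # positive to both: excluded
    ("u4", 0, 0.2),                                  # not positive: excluded
    ("u5", 0, 0.65),                                 # positive to ruling only
]

ruling_share, n_voters = positive_allegiance_share(records)
```

Here three of the five users are counted as voters, two of them for the ruling parties. Scanning a grid of `x_low` and `x_upp` values with the same function yields one sub-model per pair of limits.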

4 Results

4.1 Election model comparison

Figure 2 shows the estimated vote share for the ruling parties for all of our models during May 2021. For each of our models, we do 1000 random samplings with replacement on the monthly data, estimate the mean value for each sample, and present the vote-share distribution as box plots defined by the median and the lower and upper quartiles. Additionally, we present the reported official results from the election (44.37% for the ruling parties) and from aggregated polls (49% for the ruling parties). Throughout the six months preceding the election, the vote-share estimate fluctuates (Fig. 3). Some models differ significantly, with preference for the ruling parties being as low as 34.0% in December 2020 according to the alternative model and as high as 86.6% in March 2021 for the CVT model. However, some features are shared among all models. The political preference for the ruling parties increases until March or April 2021 and then drops and converges. For polling aggregates, this drop happens earlier, in February 2021, resulting in convergence since March 2021. This fluctuating evolution of political preference could be, at least in part, due to the large number of swing voters, i.e. voters who decide how to vote late in the election [42]. Models which use geodata and the alternative positive-allegiance analysis differ from the official results by 3.6 percentage points. They are more accurate than conventional polling methods, which differ from the official results by ≈4.6 percentage points, and significantly more accurate than models based on the complete data. The precision of all models (<5.5 percentage points), i.e., the width of the distribution or the size of the boxplot, is better than that of aggregated polls (≈5.8 percentage points). Figure 4 shows the monthly evolution of the number of tweets depending on the model. 
Our models suggest that 100,000 users (63,500 for the alternative model) are representative of voting intention in Mexico, and even as few as 10,000 users (5200 in the GVU and GAU models) from a Twitter dataset with geodata can be used to model elections. For comparison, in the voting booth the total number of votes was ≈49 million out of ≈93 million registered voters [35]. Examining the model vote-share predictions, we observe that the tweet-based (T) predictions are closer to the actual vote share than the user-based (U) predictions. Similarly, the allegiance-based models (A) tend to outperform the volume-based models (V). In particular, the alternative allegiance-based model performs much better than most other models based on the complete data set. Most importantly, the differences between these models become much smaller and the predictions tend to become more accurate when geodata is used.
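
The resampling step that produces the box plots can be sketched as follows. This is a simplified illustration, assuming that each model has already been reduced to a list of 0/1 flags (1 for a vote assigned to the ruling parties); the function name `bootstrap_share` and the toy input are hypothetical.

```python
import random
import statistics


def bootstrap_share(votes, n_resamples=1000, seed=0):
    """Bootstrap the ruling-party share (in percent) with replacement.

    votes: list of 0/1 flags, 1 meaning a vote for the ruling parties.
    Returns the lower quartile, median, and upper quartile of the
    resampled means, i.e. the numbers that define a box plot.
    """
    rng = random.Random(seed)
    n = len(votes)
    shares = []
    for _ in range(n_resamples):
        sample = rng.choices(votes, k=n)  # sampling with replacement
        shares.append(100.0 * sum(sample) / n)
    q1, median, q3 = statistics.quantiles(shares, n=4)
    return q1, median, q3


# Hypothetical toy input: 450 of 1000 votes go to the ruling parties.
votes = [1] * 450 + [0] * 550
q1, median, q3 = bootstrap_share(votes)
```

The median plays the role of the model estimate, and the spread of the resampled shares is one way to quantify the precision quoted in percentage points.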

Figure 2

Election models and results of the 2021 Mexican legislative election. The figure shows the vote-share for the ruling parties during May 2021 according to all models, as well as the official results (black thick dashed line) and results of polling aggregates (black thin dotted line) with reported uncertainties (grey shaded region). The election models are shown as box plots defined by the median and the lower and upper quartiles. Models using geodata (blue and red) outperform conventional polls. Additionally, we present an alternative model (yellow) which considers positive-allegiance tweets exclusively (see Sect. 3.3 for more details). Model nomenclature is defined with three letters as follows. The first letter determines if the database considered is complete (C) or only accounts for tweets with geodata (G). The second letter determines if the analysis is volumetric (V) or focuses on determined allegiances (A). The third letter determines if the analysis is performed in all tweets (T) or focuses on individual users (U)

Figure 3

Vote-share monthly analysis of the 2021 Mexican legislative election. Color and model nomenclature are the same as presented in Fig. 2

Figure 4

Monthly tweets related to the 2021 Mexican legislative election. Model nomenclature is the same as presented in Fig. 2

The alternative positive-allegiance model proposed here outperforms aggregated polls and most election models, including some that rely exclusively on geodata (Fig. 2). For this model, we perform a detailed analysis of the uncertainties in accuracy, precision, and volume for the data of May 2021 (Fig. 5). For each pair of limits, \(\{ x_{\mathrm{{low}}}, x_{\mathrm{{upp}}} \}\), we obtain a sub-model of the election. The accuracy of our sub-models improves when increasing both limits, which implies that the more we focus on users discussing the election positively, the better we are able to match the results of the election. The uncertainty of the alternative model is 5 percentage points for all sub-models. The volume of unique users increases proportionally with the width set by the limits, spanning from ≈12,000 to ≈169,000 unique users. For the alternative model, with \(\overline{\mathcal{A}}\geq0.6\), 63,500 unique users result in an accurate and precise \(0.4\pm2.3\) percentage points difference with respect to the official results of the election.

Figure 5

Analysis of the alternative election model for May 2021. In this model, we determine political preference based on positive allegiance, which we define with respect to a lower limit (\(x_{\mathrm{{low}}}\), horizontal axis) and an upper limit (\(x_{\mathrm{{upp}}}\), vertical axis). Our default sub-model assumes \(\{ x_{\mathrm{{low}}}, x_{\mathrm{{upp}}} \}=\{0.6, 1.0\}\) and therefore \(\mathcal{A} \geq 0.6\). Panel a shows accuracy as the percentage difference between our model and the election results. Panel b shows precision as the maximum bootstrapping uncertainties, in percentage points. Panel c shows volume as the number of thousands of unique users

4.2 Geo-analysis

Our results suggest that tweets with geodata may have been representative of voter intention during the 2021 Mexican legislative election (Fig. 2). We proceed to perform a quantitative geographical analysis of our geodata and compare it to results from the official census to assess the representativeness of our dataset. Figure 6 shows a barplot of the geographical analysis of the population of Mexico and our Twitter data. We compare the population [43], number of internet users [44], and location of Twitter users for each of the 31 states and the capital city (Mexico City) comprising Mexico. In our data, 72,000 tweets out of a total of 15 million have geodata, i.e. ≈0.5%. The number of tweets with geodata exclusively from Mexico is ≈65,000, which corresponds to ≈8000 unique users; we focus on these unique users for our estimates. The distributions of both internet and Twitter users across the 31 states show good agreement with the state populations. For Mexico City, the percentages of inhabitants (7.3%) and internet users (8.6%) are quite close, while Twitter users from our geodata are over-represented (20.4%). In order to quantify this level of agreement, we calculate Pearson’s correlation coefficient (r) between the different populations. The value between the percentage of the population and the percentage of internet users is \(r=0.98\), and decreases to \(r=0.67\) when comparing to the Twitter data (the Pearson coefficient between the populations of internet and Twitter users is \(r=0.73\)). This value increases to \(r=0.96\) when we combine the data from Mexico City (MX), Hidalgo (HG) and the State of Mexico (MC), which encompasses the conurbation around Mexico City known as Greater Mexico City. This confirms that the population and internet users are highly positively correlated, and the Twitter population from our data is positively correlated. 
We have also examined the residuals for each region by subtracting the percentage of Twitter or internet users from the percentage of the overall population. These residuals show that in most, but not all, of the states the data is under-represented. This is an expected feature of the Twitter data, in which Mexico City is over-represented, as the other states need to be under-represented for the total to add up to 100%. Outside Greater Mexico City, the internet and Twitter residuals are within 1.6 and 3.0 percentage points, respectively.
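
The correlation and residual calculations described above can be sketched as follows. The four regions and their percentages are invented for illustration and are not census or Twitter values; `pearson_r` and `residuals` are hypothetical helper names.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def residuals(population_pct, users_pct):
    """Population share minus user share, per region, in percentage points.

    A negative value marks a region that is over-represented among users.
    """
    return [p - u for p, u in zip(population_pct, users_pct)]


# Hypothetical percentages for four regions (each list sums to 100).
population_pct = [40, 30, 20, 10]
twitter_pct = [55, 25, 15, 5]

r = pearson_r(population_pct, twitter_pct)
res = residuals(population_pct, twitter_pct)
```

In this toy case one region is over-represented by 15 percentage points, so the remaining three are each under-represented by 5 points, and the residuals sum to zero. This is the same bookkeeping that makes most states appear under-represented once Mexico City is over-represented.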

Figure 6

Quantitative geographical analysis using official census results from 2020 and the subset (≈0.5%) of all of our Twitter data that contains geodata. Panel a shows the percentage of the population of Mexico (green), internet users in Mexico (red) and Twitter users in our data (blue) for the 32 federal entities of Mexico (31 states and Mexico City). Mexico City (MX) is clearly over-represented and an outlier in the Twitter sample. The Pearson’s correlation coefficient (r) between the percentage of the population of Mexico and percentage of internet users in Mexico is \(r=0.98\), and decreases to \(r=0.67\) when comparing to our Twitter data. This value increases to \(r=0.96\) when we combine the data from MX, Hidalgo (HG) and the State of Mexico (MC), which encompasses the conurbation around Mexico City known as Greater Mexico City. The data are correlated. Panel b shows the residuals of the population of Mexico with respect to internet and Twitter users. This is done in a range where Mexico City is not visible. For the bulk of the data, the percentage of internet users is representative of the population within \(\lesssim 1.6\%\), while Twitter data is representative within \(\lesssim 3.0\%\)

We perform an additional analysis on the data to explore representativeness in the geodata. Figure 7 explores whether each model with geodata holds up when it is forced to reproduce the actual population distribution of each state. To do so, we use a subset of the model data to sample 1000 users which follow the real distribution. We repeat this process 1000 times (bootstrapping with replacement) to get a mean of the vote share for the ruling parties and create distributions to explore the uncertainties. These are presented as boxplots, similar to the presentation of our main results (Fig. 2). The original data and the geo-corrected distributions are in very good agreement. Uncertainties in models with the geo-corrected distributions (1000 users) are larger than in the original data (1000–\(10{,}000\) users) given that they are a subset of it (Fig. 4).
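
One possible way to implement such a geo-corrected resampling is sketched below. The procedure shown (draw a state according to its population share, then draw a user from that state with replacement) is an assumption, since the text does not specify the sampling scheme; the function name, the two regions, and their numbers are contrived to make the effect of the correction visible.

```python
import random


def geo_corrected_shares(users_by_state, population_pct,
                         n_users=1000, n_repeats=1000, seed=0):
    """Resample users so that states follow the population distribution.

    users_by_state: {state: list of 0/1 votes, 1 = ruling parties}
    population_pct: {state: share of the national population}
    Returns one ruling-party share (in percent) per repetition.
    """
    rng = random.Random(seed)
    states = list(users_by_state)
    weights = [population_pct[s] for s in states]
    shares = []
    for _ in range(n_repeats):
        # Draw states in proportion to population, then one user per draw.
        drawn_states = rng.choices(states, weights=weights, k=n_users)
        votes = [rng.choice(users_by_state[s]) for s in drawn_states]
        shares.append(100.0 * sum(votes) / n_users)
    return shares


# Contrived toy data: region "A" dominates the sample but not the population.
users_by_state = {
    "A": [1] * 240 + [0] * 560,  # 800 users, 30% for the ruling parties
    "B": [1] * 120 + [0] * 80,   # 200 users, 60% for the ruling parties
}
population_pct = {"A": 20.0, "B": 80.0}

shares = geo_corrected_shares(users_by_state, population_pct, n_repeats=200)
mean_share = sum(shares) / len(shares)
```

In this deliberately skewed example the uncorrected share is 36%, while the geo-corrected mean is close to 54%. When the sample already tracks the population, as reported here for the geodata, the two values stay close to each other.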

Figure 7

Analysis of the results of the 2021 Mexican legislative election. Here we present only the models which include geodata. For each model, we present the results as solid black circles. The boxplots correspond to alternative realizations of the original distribution of each model, where 1000 users are drawn to match the population distribution of Mexico (Fig. 6). Results are in very good agreement, highlighting the correlation of the Twitter geodata with respect to the population. Color and model nomenclature are the same as presented in Fig. 2

5 Discussion

Here we have shown that, for the 2021 Mexican legislative election, models based on a positive-allegiance analysis or using geodata perform very well, while models using the complete data perform poorly. The positive-allegiance model is simple to implement and is also more intuitive: if a user, on average, expresses themselves positively towards a party, they are likely supporting that party and would vote for it if they had the chance. However, the discrepancy between the models using complete data or geodata is more difficult to explain. For both datasets, the bulk of the allegiances lies close to zero (Fig. 1), implying that most tweets are expressing something negative about the pertinent party. However, the same models using complete data or geodata differ by 10 or 15 percentage points (Fig. 2). These results could suggest that geodata (i) is more representative of reality, (ii) provides more trustworthy user opinions, or (iii) filters out bots and trolls. We find it difficult to pinpoint the reason why models with geodata outperform those which use the complete data, and therefore refrain from making any conclusive remarks.

5.1 Polling, Twitter, and open-source intelligence (OSINT)

Concerns about the accuracy of conventional polling have been appearing in major news outlets in recent years [45, 46]. For example, phone-based polling now faces challenges that were either weaker or absent in previous decades. Our study used archival Twitter data to gauge representativeness of online activity with real-life decisions. The Twitter data was used to determine political preference and regional volumetric activity, which we then compared to the official election results and census. Our main finding is that bipartisan models of the Mexican election using geodata are more accurate and more precise than conventional polling methods, including Bayesian modeling of aggregated polls [34]. These findings build on previous evidence that Twitter data, used in the context of elections, not only matches national polling aggregates, but precedes their results by days [28], highlighting the power of real-time communication. The advantages of online polling are well known: it is quicker, cheaper, and reaches a large number of people [28, 42]. Our approach differs from polling: instead of directly asking someone their opinion, we build a contextual query of the topic of interest and retrieve the pertinent data; this is more similar to OSINT methods. The main criticism of modeling and forecasting using online resources, in this case Twitter, concerns the known and unknown biases in the data [47, 48]. However, OSINT-like methods can be useful to get around some biases from conventional polling [30], such as social desirability bias [49, 50], which is a tendency of survey respondents to answer in a way that they consider socially favourable rather than with their actual opinion. Other systematic uncertainties are shared between conventional polling and OSINT-like methods, such as over-reporting, which occurs when people respond to a poll or engage in online political discussions but do not end up voting [51]. 
Another known problem has been the use of a pure volumetric approach where more tweets are assumed to be directly correlated to more votes [5] and where the quality of predictions has been mixed. In our study, we have also included models which estimate vote share based on the inferred voting intent of unique accounts, an approach that weighs equally the average opinion of very active accounts and those that are less active. Finally, the demographics of users from Twitter or other social media is an ongoing topic of debate. Most of the analyses around demographics have focused on Western societies, and particularly on the United States. Americans that do not use the internet are, for the most part, 65 years or older, earn less than 30,000 USD per year, and did not go to university [13]; however, the number of Americans that are offline has decreased from \(\approx50\%\) in 2000 to 7% in 2021 [13]. There is evidence that ideological segregation in social-media usage between left and right has been overestimated [27], that densely populated areas tend to be underrepresented consistently in non-spatial models [52], and that cognitive reflection correlates with behaviour [19], which could be linked to socio-demographic variables such as age, income, and education. Analyses on representativeness have been conducted in other countries, even if they are less common. In Japan it has been pointed out that “In the early days of the Internet in Japan, there was a temporary “liberal bias” because users were skewed toward the urban and highly educated segments [53]. However, this gap has now disappeared, and conservative parties such as the incumbent Liberal Democratic Party have adapted better to the online environment [14].” [54]. In Mexico, representativeness has been explored in the context of slacktivism [55] and data voids [56], leaving room for future analyses on a more general representativeness.

5.2 Bots

Nowadays, the presence and activity of automated accounts (bots) is one of the main topics of debate regarding Twitter [57]. In the past, Twitter has reported that a few percent (≈5%) of all accounts are bots. However, different academic studies suggest that the percentage of bots may be as high as ≈15% [58, 59]. The unequivocal classification of bot-like behaviour and bot accounts is non-trivial. Moreover, not all bot accounts are malicious, simultaneously active, or participating in political propaganda. Studies have found that bots that do participate in political propaganda [60] tend to be spread across the whole political spectrum [61] and play a role in amplifying an exchange of content rather than creating a horde of partisan followers [62]. There is tentative evidence that verified accounts play a stronger role than bots during contentious political events [63]. We did not incorporate detection and filtering of bots in our analysis. Focusing on unique users and neglecting extremely positive or negative allegiances is useful to marginalise bots; both of these approaches are incorporated in our alternative model.

5.3 Comparison with the literature

The role of social media in elections has been a topic of interest and investigation for more than a decade now [3, 4, 64]. Most analyses have been performed in the context of the most powerful (and digitally connected) economies of the world, such as the United States [5], the United Kingdom [6] and Germany [2]. One of the seminal studies that explored the role of Twitter in Latin America made predictions, via a volumetric analysis, for elections in Venezuela, Paraguay, and Ecuador [41]. In Mexico, the connection between politics and social media has been investigated in the context of militias [65], civic engagement and slacktivism [55], and disinformation on political topics in the media [56], but predictions and quantitative analyses of Mexican elections are rarely found in the literature. Recently, a machine learning approach was used to predict four presidential elections in Latin America in 2018 [66]. This method collected 65,000 Facebook, Twitter, and Instagram posts that were jointly analysed with polls. For Argentina, Brazil, and Colombia, the analysis was successful, with predictions similar to the results of the elections. However, that was not the case for Mexico, where the winning candidate was underpredicted by more than 10 percentage points. The authors were not able to identify the reason for this, but they do suggest that it could be related to a lack of data. Specifically for Mexico, they collected 9843 posts, 7146 of which were from Twitter, spanning a similar timescale to that of our study; however, our geodata and our complete data are, respectively, one and three orders of magnitude larger than the data collected for their study. Additionally, they combine the data from different social media; while they consider that this homogenises their sample, we consider that it introduces more uncertainties in the joint analysis and adds more caveats to the interpretation of their results.

5.4 Methodological choices and their challenges

Here we address some of the nuances and challenges related to election modelling with social media data. Data mining is arguably the most difficult aspect of our analysis. It involves generating queries with keywords, analysing the retrieved data, and then further refining the original query to exclude unrelated tweets. In our pipeline, selecting only tweets in Spanish greatly reduced the number of tweets that had nothing to do with the election.
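The iterative query refinement described above can be sketched as a small helper that assembles a search string. This is an illustration only: the function name and the keywords are hypothetical, and the actual query strings used in this manuscript are the ones published via Zenodo [37]. The sketch assumes the Twitter API v2 query syntax, in which OR combines alternatives, a leading minus sign negates a term, and lang: restricts the language.

```python
def build_query(include, exclude=(), lang="es"):
    """Assemble a search query string from keyword lists.

    include: terms of which at least one must match (joined with OR)
    exclude: terms that must not appear (each negated with '-')
    lang:    language code used to restrict the results
    """
    if not include:
        raise ValueError("at least one keyword to include is required")

    def quote(term):
        # Multi-word phrases are wrapped in double quotes for exact matching.
        return f'"{term}"' if " " in term else term

    parts = ["(" + " OR ".join(quote(t) for t in include) + ")", f"lang:{lang}"]
    parts.extend(f"-{quote(t)}" for t in exclude)
    return " ".join(parts)


# Hypothetical keywords: a first query, then a refinement that drops
# an unrelated topic found while inspecting the retrieved tweets.
first_query = build_query(["elecciones2021", "voto 2021"])
refined_query = build_query(["elecciones2021", "voto 2021"], exclude=["futbol"])
```

Keeping the inclusion and exclusion lists separate makes each refinement step explicit: unrelated topics found in the retrieved data are appended to the exclusion list and the query is rebuilt.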

For our analysis we decided to perform a bipartisan study instead of an independent party study. In a multi-party election that allows for party coalitions, the most straightforward way to perform the analysis is per party or per coalition. With tweets, one needs to be careful, as a single tweet can refer to multiple parties, multiple coalitions, or a mix of parties and coalitions. If the analysis is done per party, tweets that exclusively mention the coalition or mention more than one party in the coalition should be weighted or ignored. If the analysis is done per coalition, the role of individual parties without coalitions must be assessed. A bipartisan analysis allowed us to adequately incorporate the role of coalitions and provided a simple framework for comparing a broad range of election models. In the literature, a frequent approach for multi-party election studies is to perform an analysis of all parties or candidates, but focus on presenting the results for the most dominant party or candidate [66]. Future studies on elections, particularly in countries like Mexico, should explore multi-party election models; these could then be compared to simpler models like the one presented in this study.
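The ambiguity discussed above, where a single tweet may mention several parties or coalitions, can be made concrete with a short sketch of a two-bloc assignment. The grouping below follows the 2021 federal coalitions (Juntos Hacemos Historia: Morena, PT, PVEM; Va por México: PAN, PRI, PRD), but the function, the labels, and the handling of mixed or unaligned mentions are assumptions for illustration and not necessarily the rules applied in this study.

```python
# Assumed grouping of parties and coalition names into two blocs.
RULING = {"morena", "pt", "pvem", "juntos hacemos historia"}
OPPOSITION = {"pan", "pri", "prd", "va por méxico"}


def classify_bloc(mentions):
    """Assign a tweet to a bloc from the parties or coalitions it mentions.

    Returns "ruling", "opposition", "mixed" (both blocs mentioned), or
    "unassigned" (no mention of either bloc, e.g. a party outside both).
    """
    mentioned = {m.lower() for m in mentions}
    hits_ruling = bool(mentioned & RULING)
    hits_opposition = bool(mentioned & OPPOSITION)

    if hits_ruling and hits_opposition:
        return "mixed"
    if hits_ruling:
        return "ruling"
    if hits_opposition:
        return "opposition"
    return "unassigned"


# A tweet naming two parties of the same coalition is unambiguous at bloc level.
example = classify_bloc(["Morena", "PT"])
```

At the bloc level, a tweet naming two parties of the same coalition needs no weighting, which is the simplification the bipartisan framework provides; only tweets that mention both blocs remain ambiguous.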

6 Conclusions

In this paper, we have explored nine election models in the context of the 2021 Mexican legislative election. Four of these models are variants of common approaches based on tweet volume and sentiment. We have examined a further four models which utilise geodata, and we have also introduced a new positive-allegiance model. Most of the literature modelling elections relies on at most a few election models per study [29]; here we implement several of the most commonly used models and present direct comparisons of their predictions. Importantly, we have used bootstrapping statistics to include uncertainties in our vote-share estimates, a method that can be easily incorporated in other studies that rely on data from social media. We find that models which exclusively use geodata outperform all models that use the complete data except for the new positive-allegiance model. We find that the positive-allegiance model, which uses the full dataset, performs extremely well. The positive-allegiance model is intuitive, simple to implement, and does not rely on geodata, which can be difficult to gather or might not be available at all. Our results suggest that it would be beneficial to use the positive-allegiance model in future election analyses. Additionally, we propose that analysis with geodata should be seriously considered in future studies, as it might provide revealing insight into a problem using only a fraction of the data; for Twitter today, that means an analysis that is 100 times faster (based on the size of the geodata subset relative to the complete data).
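The bootstrapping mentioned above can be sketched in a few lines of Python. This is a generic percentile bootstrap under stated assumptions rather than the procedure used in this paper: the function name, the number of resamples, the 95% interval, and the toy labels are all illustrative choices.

```python
import random


def bootstrap_share(labels, target, n_resamples=1000, seed=0):
    """Percentile-bootstrap estimate of the share of `target` in `labels`.

    Resamples the labels with replacement, recomputes the share each time,
    and returns (mean, lower, upper), where lower and upper bound an
    approximate 95% interval of the resampled shares.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(labels)
    shares = []
    for _ in range(n_resamples):
        sample = rng.choices(labels, k=n)  # sampling with replacement
        shares.append(sample.count(target) / n)

    shares.sort()
    mean = sum(shares) / n_resamples
    lower = shares[int(0.025 * n_resamples)]
    upper = shares[int(0.975 * n_resamples) - 1]
    return mean, lower, upper


# Hypothetical inferred preferences: 600 accounts for one bloc, 400 for the other.
toy_labels = ["ruling"] * 600 + ["opposition"] * 400
mean, lower, upper = bootstrap_share(toy_labels, "ruling")
```

The width of the interval shrinks as the number of labelled accounts or tweets grows, which gives a direct way to attach an uncertainty to any of the vote-share estimates.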

The number of online users worldwide has been increasing since the creation of the internet [12]. In the last decade, the online population of adults in the United States has increased from 74% to 93% [13]. In Mexico, the percentage of internet users aged 6 or more grew from 57% in 2015 to 72% in 2020, and is likely larger in 2022 [44]. While Mexico has not yet reached the online presence of, e.g., South Korea, the United Kingdom, and Sweden, and is more similar in this aspect to, e.g., Italy and Brazil, it has been steadily growing and becoming more connected [12, 44]. The number of social media users has increased alongside internet coverage, and many studies have highlighted the role of social media in politics [5, 6, 8, 28, 67]. Our study shows that positive-allegiance analysis and geo-analysis of Mexican elections via Twitter data are accurate, precise, and robust, and that there is a positive correlation between the population of Mexico, the number of internet users in Mexico, and the number of Twitter users discussing the 2021 Mexican legislative election. This is particularly enlightening given that, while the majority of the population in Mexico is connected online, the percentage of online adults has not yet passed 90%. Our study finds a positive correlation between the online activity of Twitter users in Mexico and the real-life behaviour of Mexican citizens. This suggests that countries similarly or more connected than Mexico have already transitioned into a period when online activity can be used to model and predict real-life behaviour.
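As an illustration of the kind of regional comparison summarised above, the following sketch computes a correlation coefficient between two per-region series. Pearson's coefficient is assumed here for simplicity, which this section does not specify, and the numbers are made-up placeholders, not census or Twitter data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values for four hypothetical regions (arbitrary units).
population = [1.0, 2.5, 4.0, 8.0]
twitter_users = [0.2, 0.4, 0.9, 1.5]

r = pearson_r(population, twitter_users)  # positive when both grow together
```

A value of r close to 1 indicates that regions with larger populations also have proportionally more users discussing the election; a rank-based coefficient could be substituted if the relationship is monotonic but not linear.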

Availability of data and materials

The tweet IDs and user IDs we used are available at 10.5281/zenodo.7877001 [37]. The scripts used to perform the queries and data extraction for this manuscript, via the Twitter API for Academic Research, are available via GitHub at avigna/twitter-analysis. The Python module used for the data extraction itself is available via GitHub at DocNow/twarc [36].

Notes

  1. The string query for each search done to analyze the data in this manuscript is publicly available via Zenodo [37].

Abbreviations

API: Application Programming Interface

NLTK: Natural Language Toolkit

OSINT: Open-source Intelligence

References

  1. Tankard JW Jr (1972) Public opinion polling by newspapers in the presidential election campaign of 1824. Journal Mass Commun Q 49(2):361–365

  2. Tumasjan A, Sprenger T, Sandner P, Welpe I (2010) Predicting elections with Twitter: what 140 characters reveal about political sentiment. In: Proceedings of the international AAAI conference on web and social media, vol 4, pp 178–185. https://doi.org/10.1609/icwsm.v4i1.14009

  3. O’Connor B, Balasubramanyan R, Routledge B, Smith N (2010) From tweets to polls: linking text sentiment to public opinion time series. AAAI Publications

  4. Bond RM, Fariss CJ, Jones JJ, Kramer AD, Marlow C, Settle JE, Fowler JH (2012) A 61-million-person experiment in social influence and political mobilization. Nature 489(7415):295–298

  5. DiGrazia J, McKelvey K, Bollen J, Rojas F (2013) More tweets, more votes: social media as a quantitative indicator of political behavior. PLoS ONE 8(11):79449

  6. Burnap P, Gibson R, Sloan L, Southern R, Williams M (2016) 140 characters to victory?: using Twitter to predict the UK 2015 general election. Elect Stud 41:230–233. https://doi.org/10.1016/j.electstud.2015.11.017

  7. Bovet A, Makse HA (2019) Influence of fake news in Twitter during the 2016 US presidential election. Nat Commun 10(1):1–14

  8. Dimitrova DV, Matthes J (2018) Social media in political campaigning around the world: theoretical and methodological challenges. Sage, Los Angeles

  9. Grusell M, Nord L (2020) Setting the trend or changing the game? Professionalization and digitalization of election campaigns in Sweden. J Polit Mark 19(3):258–278. https://doi.org/10.1080/15377857.2016.1228555

  10. Kohut A, Keeter S, Doherty C, Dimock M, Christian L (2012) Assessing the representativeness of public opinion surveys. Pew Research Center, Washington

  11. Barberá P, Rivero G (2015) Understanding the political representativeness of Twitter users. Soc Sci Comput Rev 33(6):712–729

  12. ITU (2021) Measuring digital development: facts and figures 2021. https://www.itu.int/itu-d/reports/statistics/facts-figures-2021/. Online; accessed 26-May-2022

  13. Perrin A, Atske S (2021) 7% of Americans don’t use the internet. Who are they? https://www.pewresearch.org/fact-tank/2021/04/02/7-of-americans-dont-use-the-internet-who-are-they/. Online; last modified 02-April-2021

  14. Nishida R (2018) Politics armed with information. Kadokawa

  15. Wojcik S, Hughes A (2019) Sizing up Twitter users. https://www.pewresearch.org/internet/2019/04/24/sizing-up-twitter-users/. Online; last modified 24-April-2019

  16. Conover MD, Gonçalves B, Flammini A, Menczer F (2012) Partisan asymmetries in online political activity. EPJ Data Sci 1:6

  17. Mussi Reyero T, Beiró MG, Alvarez-Hamelin JI, Hernández L, Kotzinos D (2021) Evolution of the political opinion landscape during electoral periods. EPJ Data Sci 10(1):31. https://doi.org/10.1140/EPJDS/S13688-021-00285-8

  18. Alizadeh M, Shapiro JN, Buntain C, Tucker JA (2020) Content-based features predict social media influence operations. Sci Adv 6(30):5824

  19. Mosleh M, Pennycook G, Arechar AA, Rand DG (2021) Cognitive reflection correlates with behavior on Twitter. Nat Commun 12:921. https://doi.org/10.1038/s41467-020-20043-0

  20. Armstrong C, Zook M, Ruths D, Soehl T (2021) Challenges when identifying migration from geo-located Twitter data. https://doi.org/10.1140/epjds/s13688-020-00254-7

  21. Jing E, Ahn YY (2021) Characterizing partisan political narrative frameworks about COVID-19 on Twitter. EPJ Data Sci 10(1):53. https://doi.org/10.1140/EPJDS/S13688-021-00308-4. arXiv:2103.06960

  22. Wang J, Fan Y, Palacios J, Chai Y, Guetta-Jeanrenaud N, Obradovich N, Zhou C, Zheng S (2022) Global evidence of expressed sentiment alterations during the COVID-19 pandemic. Nat Hum Behav 6(3):349–358

  23. Flores-Saviaga C, Savage S (2021) Fighting disaster misinformation in Latin America: the #19S Mexican earthquake case study. Pers Ubiquitous Comput 25:353–373

  24. García-Tejeda E, Fondevila G, Siordia OS (2021) Spatial analysis of gunshot reports on Twitter in Mexico city. ISPRS Intl J Geo-Inf 10(8):540

  25. Grinberg N, Joseph K, Friedland L, Swire-Thompson B, Lazer D (2019) Fake news on Twitter during the 2016 US presidential election. Science 363(6425):374–378

  26. Bright J, Hale S, Ganesh B, Bulovsky A, Margetts H, Howard P (2020) Does campaigning on social media make a difference? Evidence from candidate use of Twitter during the 2015 and 2017 U.K. elections. Commun Res 47(7):988–1009. https://doi.org/10.1177/0093650219872394

  27. Barberá P, Jost JT, Nagler J, Tucker JA, Bonneau R (2015) Tweeting from left to right: is online political communication more than an echo chamber? Psychol Sci 26(10):1531–1542. https://doi.org/10.1177/0956797615594620. PMID: 26297377

  28. Bovet A, Morone F, Makse HA (2018) Validation of Twitter opinion trends with national polling aggregates: Hillary Clinton vs Donald Trump. Sci Rep 8(1):1–16

  29. Khan A, Zhang H, Boudjellal N, Ahmad A, Shang J, Dai L, Hayat B (2021) Election prediction on Twitter: a systematic mapping study. Complexity 2021:5565434

  30. Zhenkun Z, Matteo S, Luciano C, Guido C, Makse HA (2021) Why polls fail to predict elections. J Big Data 8:137

  31. Chauhan P, Sharma N, Sikka G (2021) The emergence of social media data and sentiment analysis in election prediction. J Ambient Intell Humaniz Comput 12:2601–2627

  32. Brito KDS, Filho RLCS, Adeodato PJL (2021) A systematic review of predicting elections based on social media data: research challenges and future directions. IEEE Trans Comput Soc Syst 8(4):819–843. https://doi.org/10.1109/TCSS.2021.3063660

  33. Santos JS, Bernardini F, Paes A (2021) A survey on the use of data and opinion mining in social media to political electoral outcomes prediction. Soc Netw Anal Min 11:1–39

  34. Oraculus (2021) Elección para la Cámara de Diputados 2021. https://oraculus.mx/diputados2021/. Online; last modified 02-June-2021

  35. INE (2021) Cómputos Distritales 2021 Elecciones Federales. https://computos2021.ine.mx/votos-ppyci/grafica. Online; last modified 11-June-2021

  36. Summers E, Brigadir I, Hames S, van Kemenade H, Binkley P, tinafigueroa, Ruest N, Walmir, Chudnov D, recrm, celeste, Lin H, Chosak A, McCain RM, Milligan I, Segerberg A, Shahrokhian D, Walsh M, Lausen L, Woodward N, Münch FV, eggplants, Ramaswami A, Hereñú D, Milajevs D, Elwert F, Westerling K, rongpenl, Costa S, Shawn (2022) DocNow/twarc: v2.10.4. Zenodo. https://doi.org/10.5281/zenodo.6503180

  37. Vigna-Gomez A (2022) Dataset from: design and analysis of tweet-based election models for the 2021 Mexican legislative election. Zenodo. https://doi.org/10.5281/zenodo.7877001

  38. Bird S, Klein E, Loper E (2009) Natural language processing with python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.

  39. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830

  40. Canete J, Chaperon G, Fuentes R, Ho J-H, Kang H, Pérez J (2020) Spanish pre-trained BERT model and evaluation data. PML4DC at ICLR

  41. Gaurav M, Srivastava A, Kumar A, Miller S (2013) Leveraging candidate popularity on Twitter to predict election outcome

  42. Hargittai E, Karaoglu G (2018) Biases of online political polls: who participates? Socius 4:2378023118791080

  43. INEGI (2020) Censo de Población y Vivienda 2020. https://www.inegi.org.mx/programas/ccpv/2020/. Online; last modified 16-March-2021

  44. INEGI (2020) Encuesta Nacional sobre Disponibilidad y Uso de Tecnologías de la Información en los Hogares (ENDUTIH) 2020. https://www.inegi.org.mx/programas/dutih/2020/. Online; last modified 22-June-2021

  45. Delkic M (2018) What it takes to make 2.8 million calls to voters. The New York Times. Online; accessed 14-Oct-2022

  46. Cohn N (2022) Who in the world is still answering pollsters’ phone calls? The New York Times. Online; accessed 14-Oct-2022

  47. Holbrook AL, Krosnick JA (2010) Social desirability bias in voter turnout reports: tests using the item count technique. Public Opin Q 74(1):37–67

  48. Buskirk TD, Blakely BP, Eck A, Mcgrath R, Singh R, Yu Y Sweet tweets! Evaluating a new approach for probability-based sampling of Twitter. EPJ Data Sci https://doi.org/10.1140/epjds/s13688-022-00321-1

  49. Crowne DP, Marlowe D (1960) A new scale of social desirability independent of psychopathology. J Consult Clin Psychol 24(4):349

  50. Fisher RJ (1993) Social desirability bias and the validity of indirect questioning. J Consum Res 20(2):303–315

  51. Silver BD, Anderson BA, Abramson PR (1986) Who overreports voting? Am Polit Sci Rev 80(2):613–624

  52. Petutschnig A, Resch B, Lang S, Havas C (2021) Evaluating the representativeness of socio-demographic variables over time for geo-social media data. ISPRS Intl J Geo-Inf 10(5):323. https://doi.org/10.3390/ijgi10050323

  53. Kobayashi T (2007) Socialization of Internet use and its political implications. In: Political reality and social psychology: dynamics of heisei koizumi politics, pp 229–263

  54. Yoshida M, Sakaki T, Kobayashi T, Toriumi F (2021) Japanese conservative messages propagate to moderate users better than their liberal counterparts on Twitter. Sci Rep 11(1):1–9

  55. Howard PN, Savage S, Saviaga CF, Toxtli C, Monroy-Hernández A (2016) Social media, civic engagement, and the slacktivism hypothesis: lessons from Mexico’s “el bronco”. J Int Aff 70(1):55–73

  56. Flores-Saviaga C, Feng S, Savage S (2022) Datavoidant: an ai system for addressing political data voids on social media. In: Proceedings of the ACM on human-computer interaction 6 (CSCW2), pp 1–29

  57. Woolley SC (2016) Automating power: social bot interference in global politics. First Monday 21(4). https://doi.org/10.5210/fm.v21i4.6161

  58. Varol O, Ferrara E, Davis C, Menczer F, Flammini A (2017) Online human-bot interactions: detection, estimation, and characterization. Proc Int AAAI Conf Web Soc Media 11:280–289

  59. Rodríguez-Ruiz J, Mata-Sánchez JI, Monroy R, Loyola-González O, López-Cuevas A (2020) A one-class classification approach for bot detection on Twitter. Comput Secur 91:101715. https://doi.org/10.1016/j.cose.2020.101715

  60. Forelle M, Howard P, Monroy-Hernández A, Savage S (2015) Political bots and the manipulation of public opinion in Venezuela. arXiv preprint. arXiv:1507.07109

  61. Bruno M, Lambiotte R, Saracco F (2022) Brexit and bots: characterizing the behaviour of automated accounts on Twitter during the UK election. https://doi.org/10.1140/epjds/s13688-022-00330-0

  62. Caldarelli G, De Nicola R, Del Vigna F, Petrocchi M, Saracco F (2020) The role of bot squads in the political propaganda on Twitter. Commun Phys 3(1):1–15

  63. González-Bailón S, De Domenico M (2021) Bots are less central than verified accounts during contentious political events. Proc Natl Acad Sci 118(11):2013443118

  64. Karpf D (2012) The MoveOn effect: the unexpected transformation of American political advocacy. Oxford University Press, London. https://doi.org/10.1093/acprof:oso/9780199898367.001.0001

  65. Savage S, Monroy-Hernández A (2015) Participatory militias: an analysis of an armed movement’s online audience. In: Proceedings of the 18th ACM conference on computer supported cooperative work & social computing, pp 724–733

  66. Brito K, Adeodato PJL (2023) Machine learning for predicting elections in Latin America based on social media engagement and polls. Gov Inf Q 40(1):101782

  67. Radicioni T, Saracco F, Pavan E, Squartini T (2021) Analysing Twitter semantic networks: the case of 2018 Italian elections. Sci Rep 11(1):1–22

Acknowledgements

We thank Twitter API for Academic Research, under the project “Representativeness of Social Behaviour Trends based on Twitter Data”, for archival access to data. We thank D. D’Orazio, I. Mandel, and I. Rivadeneyra for useful discussions. We thank J. Naiman, J. Vigna-Gómez, and E. Wisbech for their comments on the manuscript.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Contributions

AV-G and JM conceived the project. AV-G, MR and IM performed the analysis and prepared the figures. AB assisted with the data mining. PKR contributed to the analysis and interpretation of the data. AV-G wrote most of the manuscript. All authors contributed to the analysis, discussion, and writing of the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Alejandro Vigna-Gómez.

Ethics declarations

Competing interests

JM, MR, AB, and IM are current employees of Metrics. AV-G and PKR report no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Vigna-Gómez, A., Murillo, J., Ramirez, M. et al. Design and analysis of tweet-based election models for the 2021 Mexican legislative election. EPJ Data Sci. 12, 23 (2023). https://doi.org/10.1140/epjds/s13688-023-00401-w

  • DOI: https://doi.org/10.1140/epjds/s13688-023-00401-w

Keywords