
Evaluating Twitter’s algorithmic amplification of low-credibility content: an observational study

Abstract

Artificial intelligence (AI)-powered recommender systems play a crucial role in determining the content that users are exposed to on social media platforms. However, the behavioural patterns of these systems are often opaque, complicating the evaluation of their impact on the dissemination and consumption of disinformation and misinformation. To begin addressing this evidence gap, this study presents a measurement approach that uses observed digital traces to infer the status of algorithmic amplification of low-credibility content on Twitter over a 14-day period in January 2023. Using an original dataset of ≈ 2.7 million posts on COVID-19 and climate change published on the platform, this study identifies tweets sharing information from low-credibility domains, and uses a bootstrapping model with two stratifications, a tweet’s engagement level and a user’s followers level, to compare any differences in impressions generated between low-credibility and high-credibility samples. Additional stratification variables of toxicity, political bias, and verified status are also examined. This analysis provides valuable observational evidence on whether the Twitter algorithm favours the visibility of low-credibility content, with results indicating that, on aggregate, tweets containing low-credibility URL domains perform better than tweets that do not across both datasets. However, this effect is largely attributable to a difference in high-engagement, high-followers tweets, which are very impactful in terms of impressions generation and are more likely to receive amplified visibility when containing low-credibility content. Furthermore, high-toxicity tweets and those with right-leaning bias see heightened amplification, as do low-credibility tweets from verified accounts. Ultimately, this suggests that Twitter’s recommender system may have facilitated the diffusion of false content by amplifying the visibility of high-engagement, low-credibility content generated by very influential users.

1 Introduction

The emergence of social media platforms has brought about a significant transformation in global patterns of information dissemination and consumption, as a large number of internet users now rely on these channels as their primary sources of information acquisition [1–4]. The rapid growth in social media membership, and consequently of the digital traces circulating in these platforms, has been accompanied by a progressive rise in the importance of artificial intelligence (AI)-based recommender systems – content pre-selection, ranking and suggestion systems used to customise users’ online experiences [5, 6].

The integration of AI-based recommender systems into social media platforms has led to a fundamental shift in the way users consume and interact with online information [7], significantly increasing the level of automated content curation while limiting users’ freedom of independent content discovery [8]. This paradigm shift towards the machine-learning-based hyper-personalisation of social media content raises concerns regarding potential impacts on the quality and diversity of information available to users, with clear implications for the integrity of knowledge acquisition processes. Several recent studies have analysed these risks, concluding that engagement-based recommender systems – which form the majority of recommendation engines currently deployed within social media platforms [9, 10] – may be prone to bias [11, 12], to user-manipulating behaviour [13], to the creation of echo chambers [5, 14], and to the amplification of false or misleading content [15, 16].

Despite their status as critical infrastructure of social media platforms – and arguably of information circulation at a societal level – the internal architectures and practical functioning of recommender systems remain only superficially understood [17], and while several platforms have previously released white papers with information on their functioning [18–20], limited evidence exists on the characteristics that guide their deployment. This is also the case for Twitter (now X Corp.), which recently made parts of its recommender system public, providing a window into the functioning of a social media content suggestion system [21]. However, while this release does provide new information on the system’s architecture, perhaps the most central part of the system – a ‘heavy ranker’, a deep neural network used to make recommendation predictions [22] – cannot be replicated with currently available information, limiting the possibility of testing the behaviour of this recommender system. This lack of evidence is a clear obstacle towards evaluating the magnitude of any form of algorithmic bias in content suggestion, and in particular, towards understanding whether, in their drive to maximise user engagement with a platform, recommender systems are acting as significant drivers of the diffusion of online disinformation and misinformation. This is crucial, as understanding how false and low-credibility content propagates within social media platforms is key to improving the safety of societal information commons.

To address the existing evidence gap, this study introduces a measurement approach that leverages existing digital traces to empirically observe – analysing recommendation outcomes through impression counts – the state of the promotion of low-credibility content on Twitter in a two-week period in January 2023. The motivation for choosing Twitter as the object of this study is threefold. First, Twitter has a large user base with global reach, with approximately 450 million monthly users and over 300 million daily tweets [23], making it one of the largest and most influential social media platforms globally. Second, Twitter has often been criticised for platforming and amplifying extremist content, disinformation and misinformation [24, 25], and studying its recommender systems may provide additional insights into how false information spreads on the platform. Third, Twitter was, at the time of data collection, the only platform which provided data on impressions (or views) through its API. While Twitter’s recommender system has undergone significant changes since January 2023, this study aims to offer a snapshot of the recommender system’s behaviour at a crucial juncture in the platform’s evolution, before the recent system-wide overhaul. Although the findings may not directly translate to the current recommender system, they may provide unique insights into a pivotal transitional moment, revealing baseline tendencies in how the system influenced the propagation of low-credibility content.

To map the behaviour of Twitter’s recommender system, this study uses an original dataset of ≈ 2.7 million tweets discussing COVID-19 and climate change published on the platform in a 14-day period in January 2023, and extracts tweets containing information from low-credibility and high-credibility URL domains [26], testing whether low-credibility information receives amplified visibility on the platform. Data on impressions – defined in the API documentation as the count of how many times a Tweet has been seen [27] – was initially made available through the Twitter API in January 2023 and, as a passive metric of how many users have been exposed to a tweet, provides a powerful window into the content that the Twitter recommender system tailors to users. Through this process, it is possible to estimate whether tweets sharing information from low-credibility domains generate exceptional impressions, which may point to a recommendation bias towards low-credibility content, as well as a general lack of functioning integrity signals [6]. By analysing the visibility of low-credibility content – which can be used as a baseline for false or misleading information – this research seeks to provide insights into the dynamics that drive user exposure to false or misleading information, with potential implications for social media platforms and the broader digital information ecosystem.

2 Data and methods

2.1 Data

2.1.1 Twitter data

The data utilised in this study was collected from the Twitter API v2 in a 14-day period between January 15th and January 29th, 2023. Given the research objective of assessing whether the Twitter recommender system amplifies the visibility of false or misleading information, this work uses data from discussions on COVID-19 and climate change, two debates that are often considered publicly divisive and subject to a significant circulation of false content [28–30]. Data discussing these topics was collected through an English-language keyword search using the R package academictwitteR [31], which, at the time of collection, was chosen for its exclusive capability of collecting data on a tweet’s impressions. The data was collected at regular intervals, ensuring similar uptime for each day of publication, and the resulting dataset comprises a total of ≈ 2.1m original tweets – hence excluding retweets – on COVID-19, and ≈ 600k original tweets on climate change. Data collected from the Twitter API is quite granular and comprises 21 variables, encompassing several tweet-level and user-level attributes, such as engagement metrics, verification statuses, locations and users’ profile images.

2.1.2 URL domains classification

In order to extract tweets that are likely to contain false or misleading information, this study relies on URL domain credibility ratings, an approach often used in the literature on disinformation and misinformation [32–34]. While several domain credibility datasets exist, such as NewsGuard and IffyNews, this study draws information reliability scores from the aggregate ratings of [26], which used principal component analysis to derive combined scores from the major available rating sets. This dataset provides credibility scores for 11,520 domains, where 0 represents the lowest credibility and 1 the highest. In line with the values used by major credibility rating providers such as NewsGuard, tweets with a credibility score lower than or equal to 0.4 are considered low-credibility, and tweets with a credibility score equal to or higher than 0.6 are considered high-credibility [35]. This process results in a total of 87,769 tweets from low-credibility sources and 187,643 tweets from high-credibility sources, with low-credibility domains present in 3.77% of tweets on COVID-19 and 1.69% of tweets on climate change. The most common low-credibility and high-credibility domains for both datasets are shown in Table 1, while their distribution split is shown in Fig. 1.
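As a minimal illustration of this classification step, the sketch below merges a table of tweets with the aggregate credibility ratings and applies the 0.4/0.6 thresholds described above. The file and column names ('domain', 'score') are illustrative assumptions, not the study’s actual pipeline.

```python
import pandas as pd

# Assumed inputs: a ratings table with columns 'domain' and 'score'
# (the aggregate [0, 1] credibility score from Lin et al. [26]), and a
# tweet table with the registered domain already extracted from each URL.
ratings = pd.read_csv("domain_ratings.csv")   # columns: domain, score
tweets = pd.read_csv("tweets.csv")            # columns: tweet_id, domain, ...

merged = tweets.merge(ratings, on="domain", how="inner")

# Thresholds as described above: <= 0.4 is low-credibility, >= 0.6 is
# high-credibility; tweets in the ambiguous 0.4-0.6 band are excluded.
low_cred = merged[merged["score"] <= 0.4]
high_cred = merged[merged["score"] >= 0.6]

print(len(low_cred), "low-credibility;", len(high_cred), "high-credibility")
```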

Figure 1. Distribution of tweets with low-credibility and high-credibility domains as a percentage of the full data in each dataset under analysis

Table 1. Distribution of the five most prevalent high-credibility and low-credibility domains for both datasets

2.2 Measuring amplification

2.2.1 Baseline amplification benchmark

Measuring recommender-driven amplification is a notoriously difficult task, which requires clearly defined objectives and robust benchmarks to identify potential patterns of amplification. In this study, amplification is defined as a condition where tweets with similar characteristics drawn from two different groups – low-credibility and high-credibility – exhibit a significant difference in the outcome variable, the number of impressions obtained. Amongst the metrics currently provided by the Twitter API, two main features make data on impressions the most suited for the study of recommender-based amplification. First, impressions directly measure the visibility of a tweet, providing a direct way to assess how often content is organically displayed in users’ feeds, a characteristic that is crucial to understanding the behaviour of recommender systems. Second, unlike metrics such as likes, retweets, or comments, impressions data is a passive metric of exposure independent of user engagement, and as such, it is expected to be more effective than alternative metrics in characterising the behaviour of recommender systems [36].

To produce a clear measure of amplification, it is therefore important to establish a robust benchmarking procedure to compare the two samples under analysis. For this purpose, this study compares the two previously described samples of high-credibility and low-credibility tweets through bias-corrected and accelerated (BCa) bootstrapping, where the mean difference between the two samples is measured across 1000 randomly resampled iterations. BCa enhances traditional bootstrapping by introducing two key modifications: a bias-correction factor and an acceleration factor. The bias-correction factor adjusts for bias in the bootstrap distribution, ensuring that the resampling process more accurately reflects the true nature of the data, while the acceleration factor corrects for the skewness of the bootstrap distribution, which is particularly important in data with asymmetrical distributions [37, 38]. The substantial number of iterations used in the BCa method further enhances the robustness of this approach, providing confidence that the effect measured reflects a real divergence between the two samples. Lastly, as a non-parametric statistical approach, BCa requires fewer assumptions about the distribution of the data to hold, which makes it particularly suited to the study of social media data.
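As an illustrative sketch of this comparison, the snippet below applies SciPy’s bootstrap implementation with the BCa method to the mean impression difference between two samples. The lognormal impression counts are placeholders for the study’s matched samples, and multi-sample BCa requires a relatively recent SciPy release.

```python
import numpy as np
from scipy import stats

def mean_diff(low, high):
    # Statistic of interest: mean impression difference between the samples.
    return np.mean(low) - np.mean(high)

# Placeholder impression counts; the study uses the matched tweet samples.
rng = np.random.default_rng(0)
low_imp = rng.lognormal(mean=5.0, sigma=2.0, size=5000)
high_imp = rng.lognormal(mean=4.8, sigma=2.0, size=5000)

# Bias-corrected and accelerated bootstrap of the mean difference over
# 1000 resamples (multi-sample BCa needs SciPy >= 1.9).
res = stats.bootstrap(
    (low_imp, high_imp), mean_diff,
    n_resamples=1000, method="BCa", vectorized=False, random_state=rng,
)
print(res.confidence_interval)
```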

However, given the existing limitations of impressions data, a simple bootstrapping benchmark is not sufficient to reliably determine whether one sample consistently received more impressions than the other, as it neglects potential user-level and tweet-level factors that could influence the number of impressions obtained by a tweet. To remedy this shortcoming, the baseline benchmark bootstrap comparison is performed with two stratifications, resampling the data by a tweet’s engagement level and the user’s number of followers. Engagement was selected as a baseline stratification variable because engagement-based recommender systems are known to highly value a tweet’s engagement performance [9], and high-engagement tweets are likely to be shown more than low-engagement tweets. Followers count was selected as a baseline stratification variable because, within any networked recommender system, the number of followers a tweet’s creator has will likely have a significant impact on how many people are exposed to that tweet. While these two variables alone may not account for the entirety of tweet-level and user-level factors that are likely to influence impressions, adding an excessive number of stratification variables with limited explanatory power is likely to be counterproductive, as it would significantly reduce the number of matched samples. Considering these limitations, stratifying the baseline benchmark by levels of engagement and followers appears the most effective strategy to maximise the accuracy and validity of the results.

As both engagement and followers count are discrete variables with a large range of values, the complexity of these variables is reduced by assigning the data to discrete clusters using quantile-based discretization, an approach that groups the data into similar-sized buckets based on quantile rankings [39]. This approach was tested alongside more traditional clustering approaches such as HDBSCAN and k-means, and consistently provided a more effective grouping of the data. Following an exploratory analysis of the distribution of both variables, the number of discrete groups was set to 4, a number that preserves the original variability of the data without placing undue restrictions on the bootstrapping process, producing a total of 16 combinations of engagement and followers strata. To guarantee consistent results in the bootstrapping stage, quantile-based discretization is applied to the combined low-credibility and high-credibility data for each distinct dataset under investigation, and the values of both engagement and followers were log-scaled in the process.
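The following sketch illustrates this discretization and how it feeds a stratified resample. The column names ('engagement', 'followers') and the placeholder data are assumptions, not the study’s code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Placeholder data; the study uses the combined low- and high-credibility
# tweets for each dataset so both groups share the same bin edges.
combined = pd.DataFrame({
    "engagement": rng.lognormal(2, 1.5, 10_000).astype(int),
    "followers": rng.lognormal(6, 2, 10_000).astype(int),
})

n_bins = 4  # the study settled on 4 quantile groups per variable
for col in ("engagement", "followers"):
    # log1p handles the zero counts common in social media data.
    combined[col + "_bin"] = pd.qcut(np.log1p(combined[col]), n_bins,
                                     labels=False, duplicates="drop")

# 4 x 4 = 16 strata, labelled e.g. (3, 3) for highest engagement + followers.
combined["stratum"] = list(zip(combined["engagement_bin"],
                               combined["followers_bin"]))

def stratified_resample(df, seed=None):
    # One bootstrap iteration: resample with replacement within each
    # stratum, preserving stratum sizes.
    return (df.groupby("stratum", group_keys=False)
              .apply(lambda g: g.sample(len(g), replace=True,
                                        random_state=seed)))
```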

2.2.2 Additional stratification variables

After developing a method to compute the baseline level of amplification across the two datasets, we can add further individual stratification variables to test their influence across subgroups. For this purpose, each additional stratification variable is separately added to the baseline benchmark, computing the difference between the baseline level of amplification and the amplification observed after the addition of the new variable. At this stage, we test amplification across three additional stratification variables: toxicity scores, political bias and verified status.
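As a trivial illustration of how these differences are reported (see also the note under Fig. 4), the change is computed as a raw difference in percentage points, not a relative change:

```python
def raw_change(baseline_pct, stratified_pct):
    """Raw difference in percentage points between the baseline amplification
    and the amplification after adding a stratification variable: a move from
    +10% to +12% is reported as +2, not +20%."""
    return stratified_pct - baseline_pct

print(raw_change(10.0, 12.0))  # +2.0 percentage points
```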

Toxicity scores are obtained through the Perspective API by Jigsaw [40], which leverages a machine learning model trained on millions of Wikipedia comments to predict how likely it is that an input text will be perceived as toxic by a reader. Like tweets, Wikipedia comments are largely short and informal, making this model well suited to the analysis of Twitter data [41, 42]. The Perspective API produces a toxicity score ranging from 0 to 1 for each input tweet, where 0 indicates a negligible probability of the text being perceived as toxic and 1 a very high probability. To avoid creating an excessive number of categories during the stratification process, the toxicity scores obtained from the Perspective API are grouped into 3 clusters of toxicity levels using k-means clustering, allowing for the stratification of the data according to the degree of language toxicity.
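A minimal sketch of this scoring and clustering step is shown below. The request format follows the public Perspective API documentation, but the helper name, placeholder key and example texts are illustrative assumptions; scoring millions of tweets in practice also requires batching and rate-limit handling.

```python
import requests
import numpy as np
from sklearn.cluster import KMeans

PERSPECTIVE_URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
                   "comments:analyze")

def toxicity_score(text, api_key):
    """Return the Perspective API TOXICITY summary score in [0, 1]."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key},
                         json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Placeholder tweet texts; the study scores every tweet in both datasets.
tweet_texts = ["You are a disgrace and a liar!", "Lovely weather today."]
scores = np.array([toxicity_score(t, "YOUR_API_KEY") for t in tweet_texts])

# Group the scores into three toxicity levels (low / medium / high)
# with k-means, as used for the stratification.
levels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    scores.reshape(-1, 1))
```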

Further, the political bias of the URL domains under analysis is obtained by annotating data through a zero-shot classifier leveraging the GPT-4 API. While the use of large language models in data annotation tasks is a recent development, the literature has extensively analysed the performance of GPT-3.5 and GPT-4 in data labelling tasks, including political stance identification, with both models exhibiting high accuracy [43, 44]. To maximise the usability and interpretability of the data, the model is asked to classify the political bias of the input domain into one of five categories: far-left, left, no bias, right and far-right. The model is also prompted to return a value of −1 whenever it does not have information on a domain, or if the domain is non-political. To validate the annotations obtained from GPT-4, the labels for the 20 most common domains (covering more than 100k tweets) are compared with static labels of political bias obtained from Media Bias/Fact Check, showing 95% agreement on macro political areas between the two sources (see Note 1). Through this process, 5596 political domains are identified – around 10% of all domains in the data – which are then used as a stratification variable during bootstrapping to assess whether political bias has an influence on the amplification of low-credibility content. It should also be noted that no far-left sources were identified among high-credibility domains, and to maximise comparability, the data was grouped into two categories, right-leaning and left-leaning.
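A sketch of this annotation step follows, using the current OpenAI Python client. The exact prompt used in the study is not published, so the wording below is an illustrative reconstruction of the five-category scheme and the −1 fallback described above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt; not the study's published wording.
PROMPT = (
    "Classify the political bias of the news domain '{domain}' into exactly "
    "one of: far-left, left, no bias, right, far-right. If you have no "
    "information on this domain, or it is non-political, return -1. "
    "Respond with the label only."
)

def classify_domain(domain: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic labels for annotation
        messages=[{"role": "user", "content": PROMPT.format(domain=domain)}],
    )
    return resp.choices[0].message.content.strip()

print(classify_domain("example-news-site.com"))
```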

The third and last stratification variable used in this work is a user’s verified status, which is used to assess whether verification contributes to the amplification of low-credibility content on Twitter. Data on users’ verified status, a binary variable with values of ‘true’ for verified accounts and ‘false’ for unverified ones, was obtained via the Twitter API. However, it is crucial to clarify that this data refers to the legacy verified status – a verification badge Twitter formerly awarded to prominent users as a safeguard against impersonation. As of November 2022, Twitter began phasing out this legacy verification in favour of Twitter Blue, a paid subscription service that enables users to purchase verification. While legacy blue ticks have largely been removed at the time of writing, information on legacy verified status was still available from the Twitter API at the time of data collection, and these accounts mainly include public or institutional profiles with a large following. To validate the verification statuses in the data, a random sample of 20 verified and 20 unverified users was compared against their current verification statuses on Twitter. This step showed that users labelled as verified in the data were only those verified before November 2022, while some users marked as unverified had since become verified through Twitter’s paid verification program. This discrepancy confirms that the verification labels in the data reflect legacy verification status prior to Twitter’s verification changes in November 2022.

3 Results

3.1 Baseline amplification analysis

Figure 2 illustrates the findings from the baseline amplification analysis, showing the mean percentage difference obtained from each fold of the bias-corrected and accelerated (BCa) bootstrapping procedure [45]. Here, results reveal that, on average, across 1000 bootstrap samples stratified by engagement level and followers count, samples of low-credibility tweets generate more impressions than high-credibility samples across both datasets, with low-credibility tweets on COVID-19 receiving a baseline impressions amplification of +19.2% (median = +17.3%) and low-credibility tweets on climate change generating +95.8% impressions (median = +90.1%). In absolute values, this amounts to a mean difference of +113.7 impressions (median = +111.4) for COVID-19 tweets, and +474.6 impressions (median = +447.2) for climate change tweets. This observation suggests that, at an aggregate level across 1000 balanced samples of high-credibility and low-credibility tweets, the latter experience heightened amplification within these two issue domains, with greater amplification in the context of climate change, indicating that, in aggregate, Twitter users were more likely to view low-credibility information. Results also show that this behaviour is observed quite consistently, as 84.4% of COVID-19 samples and 97.9% of climate change samples have a positive mean difference, suggesting that it is very rare for high-credibility samples to outperform low-credibility samples in impression counts.

Figure 2. Raincloud plot illustrating the average percentage difference in impressions between high-credibility and low-credibility tweets, based on 1000 resamples from each dataset under study

However, when dealing with skewed distributions such as those of social media impressions, looking at the aggregate mean may not be sufficient to fully explain an amplification effect. Rather, we must also assess inter-stratum breakdowns of variability, which are shown in Fig. 3, containing heatmaps of the mean differences in impressions across all 16 strata combinations as well as the size of each stratum. This step of the analysis delivers a more nuanced understanding of the results, showing that within bootstrapped samples, the observed difference in impressions is primarily generated by a difference in the highest-engagement, highest-followers stratum (3,3). For COVID-19, this amounts to a mean difference of +3148 impressions between low-credibility and high-credibility tweets, while for the climate change dataset, it amounts to +9197. While the amplification within this stratum appears extensive in absolute terms, it is more moderate in relative terms – +30.2% for the COVID-19 stratum (3,3) and +129% for the climate change data within the same stratum. Qualitatively, tweets from this stratum appear to be largely conspiratorial tweets from large right and far-right outlets. For example, the low-credibility tweet with the highest impressions in this stratum cites an article from the Daily Mail, and states: “A shadowy Army unit secretly spied on British citizens who criticised govt’s Covid lockdown policies… artificial intelligence deployed to ‘scrape’ social media for keywords”.

Figure 3. Heatmaps illustrating the percentage difference between low-credibility and high-credibility tweets in each of the 16 strata under analysis. Yellow represents positive values; blue represents negative values

This finding on the distribution of intra-stratum variability is crucial for the interpretation of the bootstrapping results, as it shows that while on aggregate users on Twitter were more likely to be exposed to low-credibility content, this effect is largely attributable to a difference in high-engagement, high-followers tweets, which are very impactful in terms of impressions generation and are more likely to receive amplified visibility when containing low-credibility content. This is consequential, as this minority of tweets is responsible for a large share of impressions generated on Twitter. This finding also aligns with the broader understanding of Twitter activity dynamics, where the majority of engagement and impressions are typically generated by a limited number of highly engaging posts, adding the further insight that this subgroup of posts is likely to achieve greater amplification when containing low-credibility content.

3.2 Analysis of additional stratifications

Building on these observations, Fig. 4 illustrates the impact of adding additional stratification variables – in this case toxicity, political bias and verified status – to the amplification levels observed in the base model. Here, results show the average raw change in the percentage amplification value of the baseline model after adding each additional stratification variable. This step provides several insights into how additional stratification variables impact the amplification of low-credibility content.

Figure 4. Raw impact of each value of the additional stratifications on the baseline model of low-credibility amplification. Yellow represents positive values; blue represents negative values. Raw values mean that if a stratum had an amplification of +10% in the base model and the addition of a third stratification pushed this to +12%, the result is reported as a +2% difference, not +20%; values are thus raw differences between percentages

The first stratification variable added to the baseline model is a tweet’s toxicity profile, which provides insights into how the presence of inflammatory language influences the algorithmic amplification of low-credibility content. Examples of high-toxicity phrasing include strong insults, profanity, threats, and intentionally harmful or misleading rhetoric. These types of inflammatory expressions tend to appear in conspiracy theories and highly controversial claims, precisely the types of questionable information examined in this study. Here, results show no consistent relationship between amplification and toxicity for low- and medium-toxicity tweets across both the climate change and COVID-19 datasets. However, high-toxicity tweets exhibit heightened algorithmic reach versus the baseline model for both datasets, at values of +9.9 and +15.2 respectively. This is important, as it suggests that tweets containing overtly negative, rude or disrespectful language see greater visibility on the platform when containing low-credibility domains. By showing a relationship between high toxicity and algorithmic amplification across both datasets, these results support the understanding that content that is emotionally charged, especially of a negative or controversial nature, may benefit from algorithmic amplification within engagement-based recommender systems.

Furthermore, the addition of political bias as a stratification variable provides insights into the role of political partisanship in recommender-based amplification. Here, results show that for both the COVID-19 and climate change datasets, tweets expressing right-leaning political bias see heightened amplification compared to the baseline model, and compared to left-leaning tweets. For COVID-19, right-leaning content shows an increase of +6 from the baseline for low-credibility versus high-credibility content, compared to a change of +0.03 for left-leaning tweets. For climate change, right-leaning tweets exhibit an additional amplification of +24.3, compared to a value of +6 for left-leaning tweets. These results indicate that, at the time this analysis was carried out, tweets containing domains with a right-leaning political bias were more likely to be amplified in discussions on both COVID-19 and climate change, particularly when compared with tweets with a left-leaning bias. This finding emerges as a crucial aspect of the current analysis, as it suggests that the Twitter recommender system, in its pursuit of maximising user engagement, may inadvertently expose users to content characterised by right-wing political biases. This observation gains particular significance in the context of ongoing debates surrounding filter bubbles, polarised information ecosystems, and the potential consequences of such dynamics on the public sphere. As the Twitter recommender system grants increased visibility to content with pronounced right-wing political leanings, there is a clear risk that users may be driven towards more radical viewpoints – a trend that has previously been identified in YouTube’s recommender system [46, 47] – exacerbating existing social and political divisions while undermining the platform’s capacity to serve as a space for diverse and open discourse.

Lastly, to provide more in-depth insight into user-level factors contributing to the performance of low-credibility tweets, the third additional layer of stratification is the verification status of a tweet’s author. Here, results clearly indicate that low-credibility tweets from users with a legacy verified status obtained evident amplification, with a change over the baseline model of +155% for COVID-19 data and +138% for climate change data. In contrast, tweets from unverified authors see minimal change compared to baseline amplification levels. This effect is very large, particularly when compared with the previous two stratifications, and indicates that low-credibility information spread by verified users was far more likely to be amplified on Twitter than low-credibility content shared by non-verified users, suggesting that the legacy checkmark, which acted as a credibility signal within the algorithm, may have been weaponised to amplify the reach of false or misleading content. This finding has important implications regarding the role of status cues and authority in algorithmic amplification, as it demonstrates that peripheral credibility signals like verification status can override actual content quality in driving engagement and reach.

4 Discussion and conclusions

The recent addition of impressions data to the Twitter API offers a unique opportunity to investigate the role of Twitter’s recommender system in promoting the circulation of disinformation and misinformation. While this opportunity has now been limited by restrictions to the Twitter API, this study presents an initial attempt to use an inferential approach based on impressions data to assess any differences between low-credibility and high-credibility tweets on Twitter. The main analysis of this work revealed that tweets containing URLs from low-credibility domains achieve higher visibility on the platform, with an average difference after bootstrapping of +19.2% for COVID-19 data, and +95.8% for climate change data. However, results also show that within bootstrapped samples, the majority of this effect comes from the overperformance of low-credibility tweets among high-engagement tweets from high-followers users, which account for the majority of Twitter’s impressions.

This work also set out to uncover notable features that may explain any impressions-based amplification, and found that toxicity scores obtained from Jigsaw’s Perspective API showed a clear pattern where high-toxicity tweets exhibited heightened amplification when containing low-credibility content, confirming the existing understanding that toxic content may be more easily amplified by engagement-based recommender systems. These results support findings from prior works, particularly those showing clear connections between negative emotional content and viral misinformation spread [48]. By confirming the role of engagement-based recommender systems in amplifying toxic content, these findings highlight the need for alternative approaches to social media recommendation, such as bridging-based recommender systems, which focus on connecting users with diverse perspectives and high-quality information rather than maximising engagement alone [49]. While developing recommender systems that balance engagement, equality and diversity is indeed challenging, and further research is needed on this topic, this work lends support to the notion that alternatives to engagement-based recommendation are worth pursuing to improve online discourse and reduce the amplification of misinforming, toxic content.

Furthermore, one of the most notable results from the additional stratification analysis is the substantial amplification observed for politically biased low-credibility tweets, especially those with right-leaning partisanship. This finding speaks to growing concerns about political polarisation and the rise of filter bubbles on social platforms. It suggests that in its pursuit of maximising engagement, the version of Twitter’s recommender system under analysis may have inadvertently amplified tweets that express partisan viewpoints, regardless of their truthfulness. While this finding alone does not show the formation of echo chambers – as cross-partisan exposure may still occur – it is concerning, given the vulnerability of high-profile socio-political discussions like climate change or COVID-19 to manipulation by misinformation campaigns. The fact that right-leaning tweets see substantially more amplification of false claims on climate change points to the ability of partisan disinformation to successfully exploit algorithmic loopholes on Twitter, heightening concerns about social media deepening societal divides and potentially favouring hyper-partisan information. Lastly, results showed that low-credibility tweets from legacy verified accounts enjoy significant amplification on Twitter. While legacy verified accounts are only a minority of the platform’s population, it is important to note that Twitter recently changed its verification policy, and the blue tick can now be purchased as part of a monthly subscription plan. While data on paid-for verified users is not available via the Twitter API, this finding is concerning, as it suggests that the blue tick, which confers amplified visibility within the recommender system, may be weaponised by malicious actors to spread false and misleading information, particularly during highly emotionally loaded emergencies such as that of COVID-19. This point is crucial, and deserves further attention in future research.

Finally, this research offers initial empirical evidence on the promotion of low-credibility content on Twitter, revealing key factors that may contribute to a significant impressions overperformance of tweets containing false or misleading information. Given ongoing concerns about Twitter’s impacts on information quality and integrity under its new management, this historical assessment of recommendation patterns can also offer an important benchmark for future analysis. Moreover, the study’s methodological approach of leveraging impression counts to empirically analyse recommendation outcomes remains broadly relevant. Evaluating the visibility afforded to different types of content on social platforms is key to understanding how algorithmic curation shapes online information landscapes. By offering a framework to probe the role of recommender systems in propagating low-credibility information, this research retains value for inspiring further investigation – both of Twitter’s evolving dynamics and other influential networks. Capturing past system behaviours can provide a foundation to monitor emerging risks and work towards safer, more accountable social information ecosystems.

However, it is important to note that the results presented in this work have limitations, which should form the basis for future research into recommender-based amplification of low-credibility content. Firstly, while impressions data may currently be the best available metric to test the behaviour of recommender systems, this metric is still imperfect, as it may also be influenced by exogenous factors that can be difficult to control, such as the level of public interest in a specific topic like climate change or COVID-19. Furthermore, despite the mitigating measures based on data stratification, it is important to acknowledge the cyclical nature of the relationship between engagement, followers count, and impressions, where each metric may, in practice, compound the others. This interplay adds complexity to the interpretation of the findings, and the conclusion that Twitter’s algorithm favours low-credibility content should be considered with an understanding of the potentially reciprocal relationship between these key variables. Finally, this work analyses Twitter’s recommender system as a static system – both temporally and in terms of behaviour – using a frequentist statistical approach. In practice, any recommender system operates on top of expressed and implied human preferences, and future analyses may implement different approaches to account for this, such as Bayesian analysis.

Recommender systems are arguably the most prevalent application of AI and machine learning globally, affecting billions of internet users daily. However, despite their ubiquitous nature, there is an evident absence of regulation on their large-scale deployment. The results of this study highlight a need, as a minimum, for increased transparency in the field of social media recommender systems, as only through system-wide access will it be possible to produce clearer causal explanations for the effects observed in this work. However, as the recent release of parts of the Twitter algorithm eloquently shows, transparency alone is not enough in principle. Rather, it is crucial that transparency is accompanied by the possibility of full replication of a recommender system, which would allow for the definition of protocols for comprehending, auditing, and stress-testing these systems to prevent the unintentional promotion of disinformation, to curb the perpetuation of harmful biases, and to safeguard the security of knowledge dissemination and acquisition processes.

Addressing this challenge is admittedly difficult, as recommender systems are proprietary, high-value assets at the heart of social media platforms. Nevertheless, it is vital to stress the significance of transparency and oversight for systems that mediate access to information and shape public discourse on a global scale. This study contributes to emphasising these imperatives, while also offering initial insights into how Twitter’s recommender system may have specifically amplified misleading and false information. Future research building on these findings should further investigate the dynamics of disinformation circulation on social media and explore potential interventions to curb the unintentional spread of misinformation by recommender systems. Ultimately, continued research in this domain is essential to ensure the responsible design and implementation of AI systems.

Data availability

All corresponding result files are made available at https://osf.io/gewr4/, while the Python and R code for the analysis can be found at https://github.com/giuliocorsi/Twitter-Amplification-Eval. However, to comply with Twitter’s rules on data collection and publication, only the tweet IDs are provided in the raw data. For more information on the data used, please contact the author directly.

Notes

  1. The labels provided by MBFC are different from the ones obtained through GPT-4, as MBFC uses several sub-labels such as right-center and left-center. For this reason, it is more appropriate to compare performance on macro-political areas (left or right), which is also the format used later in the analysis. The validation sheet is available in Additional file 1.

Abbreviations

AI: Artificial Intelligence
API: Application Programming Interface
BCa: Bias-corrected and accelerated bootstrapping
COVID-19: Coronavirus Disease 2019
URL: Uniform Resource Locator

References

  1. Pentina I, Tarafdar M (2014) From “information” to “knowing”: exploring the role of social media in contemporary news consumption. Comput Hum Behav 35:211–223


  2. Shearer E, Gottfried J (2017) News consumption across social media in 2017. https://www.pewresearch.org/journalism/2017/09/07/news-use-across-social-media-platforms-2017/. Accessed: 20-Apr-2023

  3. Walker M, Matsa KE (2021) News consumption across social media in 2021. https://www.pewresearch.org/journalism/2021/09/20/news-consumption-across-social-media-in-2021/. Accessed: 20-Apr-2023

  4. Cinelli M, De Francisci Morales G, Galeazzi A, Quattrociocchi W, Starnini M (2021) The echo chamber effect on social media. Proc Natl Acad Sci 118(9):e2023301118


  5. Anandhan A, Shuib L, Ismail MA, Mujtaba G (2018) Social media recommender systems: review and open research issues. IEEE Access 6:15608–15628


  6. Thorburn L (2022) How platform recommenders work. Understanding Recommenders. Accessed: 20-Apr-2023

  7. Santos FP, Lelkes Y, Levin SA (2021) Link recommendation algorithms and dynamics of polarization in online social networks. Proc Natl Acad Sci 118(50):e2102141118


  8. Pariser E (2011) The filter bubble: how the new personalized web is changing what we read and how we think. Penguin, Baltimore


  9. Narayanan A (2023) Understanding social media recommendation algorithms


  10. Milli S, Belli L, Hardt M (2021) From optimizing engagement to measuring value. In: Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp 714–722


  11. Islam R, Keya KN, Pan S, Foulds J (2019) Mitigating demographic biases in social media-based recommender systems. KDD (Social Impact Track)

  12. Bhadani S (2021) Biases in recommendation system. In: Proceedings of the 15th ACM conference on recommender systems, pp 855–859


  13. Burns P (2023) What TikTok’s Secret Heating Button Reveals About Virality Online. https://medium.com/feedium/algorithmic-heating-virality-is-a-choice-and-the-game-is-rigged-150307f1032a

  14. Alatawi F, Cheng L, Tahir A, Karami M, Jiang B, Black T, Liu H (2021) A survey on echo chambers on social media: Description, detection and mitigation. arXiv preprint arXiv:2112.05084

  15. Kaiser J, Rauchfleisch A, Córdova Y (2021) Comparative approaches to mis/disinformation: fighting Zika with honey: an analysis of YouTube’s video recommendations on Brazilian YouTube. Int J Commun 15:19


  16. Giansiracusa N (2021) How algorithms create and prevent fake news: exploring the impacts of social media, deepfakes, GPT-3, and more. Springer, Berlin


  17. Leerssen P (2020) The soap box as a black box: regulating transparency in social media recommender systems. Eur J Law Technol 11(2)

  18. Liu Z, Zou L, Zou X, Wang C, Zhang B, Tang D, Zhu B, Zhu Y, Wu P, Wang K (2022) Monolith: real time recommendation system with collisionless embedding table. arXiv preprint arXiv:2209.07663

  19. Zhao Z, Hong L, Wei L, Chen J, Nath A, Andrews S, Kumthekar A, Sathiamoorthy M, Yi X, Chi E (2019) Recommending what video to watch next: a multitask ranking system. In: Proceedings of the 13th ACM conference on recommender systems, pp 43–51


  20. Lada A, Wang M, Yan T (2021) How does News Feed predict what you want to see? Personalized ranking with machine learning. Facebook/Meta engineering blog

  21. Twitter (2023) The Twitter algorithm: TweepCred. https://github.com/twitter/the-algorithm/blob/main/src/scala/com/twitter/graph/batch/job/tweepcred/README


  22. Wang Z, She Q, Zhang J (2021) Masknet: introducing feature-wise multiplication to ctr ranking models by instance-guided mask. arXiv preprint arXiv:2102.07619

  23. Pfeffer J, Matter D, Jaidka K, Varol O, Mashhadi A, Lasser J, Assenmacher D, Wu S, Yang D, Brantner C (2023) Just another day on Twitter: a complete 24 hours of Twitter data. In: Proceedings of the international AAAI conference on web and social media, vol 17, pp 1073–1081


  24. Bovet A, Makse HA (2019) Influence of fake news in Twitter during the 2016 US presidential election. Nat Commun 10(1):7


  25. Kouzy R, Abi Jaoude J, Kraitem A, El Alam MB, Karam B, Adib E, Zarka J, Traboulsi C, Akl EW, Baddour K (2020) Coronavirus goes viral: quantifying the COVID-19 misinformation epidemic on Twitter. Cureus 12(3)

  26. Lin H, Lasser J, Lewandowsky S, Cole R, Gully A, Rand DG, Pennycook G (2023) High level of correspondence across different news domain quality rating sets. PNAS Nexus 2(9):pgad286

  27. Twitter (2017) Using Deep Learning at Scale in Twitter’s Timelines. https://blog.twitter.com/engineering/en_us/topics/insights/2017/using-deep-learning-at-scale-in-twitters-timelines

  28. Treen KMd, Williams HT, O’Neill SJ (2020) Online misinformation about climate change. Wiley Interdiscip Rev Clim Change 11(5):e665


  29. Brennen JS, Simon FM, Howard PN, Nielsen RK (2020) Types, sources, and claims of COVID-19 misinformation. Reuters Institute for the Study of Journalism, Oxford

  30. Graham T, Bruns A, Zhu G, Campbell R (2020) Like a virus: the coordinated spread of coronavirus disinformation

  31. Barrie C, Ho JC-T (2021) academictwitteR: an R package to access the Twitter Academic Research Product Track v2 API endpoint. J Open Source Softw 6(62):3272


  32. Pierri F, DeVerna MR, Yang K-C, Axelrod D, Bryden J, Menczer F (2022) One year of COVID-19 vaccine misinformation on Twitter. arXiv preprint arXiv:2209.01675

  33. Falkenberg M, Galeazzi A, Torricelli M, Di Marco N, Larosa F, Sas M, Mekacher A, Pearce W, Zollo F, Quattrociocchi W (2022) Growing polarization around climate change on social media. Nat Clim Change, 1–8

  34. Resnick P, Ovadya A, Gilchrist G (2018) Iffy quotient: a platform health metric for misinformation. Cent Soc Media Responsib 17:1–20


  35. NewsGuard (2023) Score and Rating Levels. https://www.newsguardtech.com/ratings/rating-process-criteria/

  36. Thorburn L, Stray J, Bengani P (2023) Making Amplification Measurable. https://medium.com/understanding-recommenders/making-amplification-measurable-2be548e5986c

  37. Efron B, Tibshirani RJ (1994) An introduction to the bootstrap. CRC Press, Boca Raton


  38. Jung K, Lee J, Gupta V, Cho G (2019) Comparison of bootstrap confidence interval methods for gsca using a Monte Carlo simulation. Front Psychol 10:2215


  39. Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. Elsevier, Amsterdam, pp 194–202


  40. Lees A, Tran VQ, Tay Y, Sorensen J, Gupta J, Metzler D, Vasserman L (2022) A new generation of perspective API: efficient multilingual character-level transformers. In: Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pp 3197–3207


  41. Saveski M, Roy B, Roy D (2021) The structure of toxic conversations on Twitter. In: Proceedings of the web conference 2021, pp 1086–1097


  42. Cuthbertson L, Kearney A, Dawson R, Zawaduk A, Cuthbertson E, Gordon-Tighe A, Mathewson KW (2019) Women, politics and Twitter: using machine learning to change the discourse. arXiv preprint arXiv:1911.11025

  43. Törnberg P (2023) ChatGPT-4 outperforms experts and crowd workers in annotating political Twitter messages with zero-shot learning. arXiv preprint arXiv:2304.06588

  44. Gilardi F, Alizadeh M, Kubli M (2023) ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056

  45. Diciccio TJ, Romano JP (1988) A review of bootstrap confidence intervals. J R Stat Soc, Ser B, Stat Methodol 50(3):338–354


  46. Haroon M, Chhabra A, Liu X, Mohapatra P, Shafiq Z, Wojcieszak M (2022) YouTube, the great radicalizer? Auditing and mitigating ideological biases in YouTube recommendations. arXiv preprint arXiv:2203.10666

  47. Ribeiro MH, Ottoni R, West R, Almeida VA, Meira W Jr (2020) Auditing radicalization pathways on YouTube. In: Proceedings of the 2020 conference on fairness, accountability, and transparency, pp 131–141


  48. Brady WJ, Wills JA, Jost JT, Tucker JA, Van Bavel JJ (2017) Emotion shapes the diffusion of moralized content in social networks. Proc Natl Acad Sci 114(28):7313–7318


  49. Ovadya A, Thorburn L (2023) Bridging systems: open problems for countering destructive divisiveness across ranking, recommenders, and governance. arXiv preprint arXiv:2301.09976


Acknowledgements

Not applicable.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Author information


Contributions

The author declares that they have carried out every step of this work, including data collection and analysis. The author read and approved the final manuscript.

Corresponding author

Correspondence to Giulio Corsi.

Ethics declarations

Competing interests

The author declares no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

(PDF 614 kB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article


Cite this article

Corsi, G. Evaluating Twitter’s algorithmic amplification of low-credibility content: an observational study. EPJ Data Sci. 13, 18 (2024). https://doi.org/10.1140/epjds/s13688-024-00456-3

