Hate is a social phenomenon whose characteristics we highlight in Sect. 2.1, with emphasis on how its complexities relate to the difficulties faced in accurately and fairly detecting online hate using ML systems. Existing research on automated hate detection systems is then described in greater depth in Sect. 2.2, underlining both progress and persistent challenges in the accuracy of such systems. Finally, Sect. 2.3 provides background on the fairness challenges faced by such systems, highlighting relevant research on the nature, implications, and possible solutions of these challenges.
2.1 The social dynamics of online hate
Online hate is a contested and context-dependent concept, and despite growing concerns about its harmful effects, platforms, governments and researchers have been unable to reach a common definition [60]. Academic work displays a fragmented understanding of online hate [37], although most definitions share three common elements: content, intent, and harm [38]. Content relates to the textual, visual or other modes used to proliferate hate. Intent and harm are more difficult to discern, and relate to the thoughts and experiences of the perpetrator and victim [47, 49]. In line with much previous work, we follow Davidson et al. [12] in defining online hate as ‘language that is used to expresses hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group.’
Social psychological work on prejudice and abuse proposes that the expression of hate is strongly affected by intergroup dynamics, such as contact and conflict between competing social groups [9]. Following Crandall and Eshelman’s Justification–Suppression model of prejudice, Day [13] finds that racism on Twitter flourishes when the offender is motivated by a group of like-minded racist peers, when prejudice against targets can be socially justified, and when capable guardians are not present to suppress it. Such community ties have also been suggested to operate within far-right online communities, motivating and deepening individuals’ commitment to extremist politics [5, 15].
Hateful users are often homophilous and organize themselves into densely connected and clustered networks where hateful ideas and content proliferate quickly [26, 53]. Johnson et al. [26] argue that content removal proves largely ineffective in such settings: the sources of the content (hate-spewing, highly resilient clusters of users who are committed to their goals) live on, growing larger with time due to consolidation of smaller groups or attraction of new recruits. Further, the growing user bases of platforms that have lenient content moderation policies and/or were created with the stated goal of protecting freedom of expression, such as Gab, Parler, and Bitchute, reflect the need to understand hateful networks and communities, rather than hateful individuals in isolation. Other forms of homophily also exist online, which could be used to understand the spread of online hate. Wimmer and Lewis [66] find that African–American individuals, for example, are more likely to associate themselves with other African–Americans on social media platforms, creating a “high degree of racial homogeneity” in social networks [66].
2.2 Automated methods of online hate detection
Hateful content detection has been extensively studied, with numerous papers presenting new datasets, models and architectures for the task, including several subtasks such as multilingual, multimodal and multi-target hate detection [29, 61, 68]. Notably, incorporating sentence-level context has been shown to improve hateful content detection, with several papers deploying transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) to better distinguish between hateful and non-hateful content, even when they have lexical similarities [37, 57]. Indeed, as Mullah and Zainon [46] write in their comprehensive review of ML methods for automated hate speech detection, deep learning techniques leveraging such language models can considerably improve how context-dependent hate speech is detected. Liu et al. [36], for example, leverage transfer learning by fine-tuning a pre-trained BERT model to produce the most accurate classifier for offensive language in the SemEval 2019 Task 6 competition. The model, however, was less effective in identifying the type of offence or its target. Chiril et al. [8], meanwhile, focus on the target-oriented and topical focus of hate speech using a multi-target perspective, building classifiers that simultaneously categorize both the hateful nature of a text and its topical focus. They find that leveraging BERT under such a multi-task approach outperforms single-task models for detecting both the hatefulness of a text and its focus. Mishra et al. [43] also apply multi-task learning, back-translation, and multilingual training to the problem of hateful content detection.
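To make this transfer-learning setup concrete, the sketch below fine-tunes a pre-trained BERT model for binary hateful-content classification. It is a minimal illustration of the general approach discussed above rather than a reproduction of any cited system; it assumes the Hugging Face transformers library and PyTorch, and the tiny `texts`/`labels` lists stand in for a real labelled dataset.

```python
# Minimal sketch: fine-tuning a pre-trained BERT model for binary
# hateful-content classification (illustrative only, not a cited system).
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = not hateful, 1 = hateful

# Placeholder data standing in for a labelled training set.
texts = ["example tweet one", "example tweet two"]
labels = [0, 1]

enc = tokenizer(texts, padding=True, truncation=True, max_length=128,
                return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a small number of epochs is typical for fine-tuning
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask,
                    labels=y)          # returns cross-entropy loss and logits
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```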
However, even with such advanced contextual language models, several problems emerge. Liu et al. [36] cite the imbalanced nature of the target (hateful speech is much rarer than non-hateful speech) as well as the morphological nature of language as challenges to achieving higher performance. Meanwhile, data imbalance in the form of bias towards certain topics and targets can also hamper performance while adding the possibility of unfair classification [8]. Moreover, despite improving performance overall, such contextual language models are often still unable to distinguish complex forms of hate, such as understanding hateful uses of polysemous words related to minority groups [45]. Part of the challenge is that the meaning of such terms depends on the identity of the speaker and how they are used (e.g., a black person using the term “nigga” is fundamentally different to a white person using it). With short text statements (e.g., tweets) it can be difficult to capture and represent these different uses (particularly if the labelled training dataset is small), even with transformer-based models.
More recently, creative ways of dealing with such polysemy have been proposed that better leverage the social characteristics of online hate. Alorainy et al. [1] highlight the risk of false negatives when detecting subtle forms of hate, as traditional embedding techniques still depend heavily on the occurrence of certain words or phrases in classifying hate. Instead, they leverage social science findings on how hate is conveyed, specifically the phenomenon of ‘othering’, whereby “hateful language is used to express divisive opinions between the in-group (‘us’) and the out-group (‘them’)”. By augmenting traditional embedding algorithms with features specifically related to ‘othering’, such as pronoun patterns and verb–pronoun combinations, they achieve a significant improvement in accuracy [1].
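As a rough illustration of how such ‘othering’ signals might be operationalized, the sketch below computes a handful of in-group/out-group pronoun features that could be appended to a standard text representation. The pronoun lists, verb–pronoun patterns and feature set are illustrative assumptions, not the exact features used by Alorainy et al. [1].

```python
# Illustrative extraction of simple 'othering' features (in-group vs.
# out-group pronoun usage). The word lists and patterns are assumptions
# for illustration, not those of any cited paper.
import re

IN_GROUP = {"we", "us", "our", "ours"}
OUT_GROUP = {"they", "them", "their", "theirs", "those"}

def othering_features(text: str) -> list[float]:
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)
    in_count = sum(t in IN_GROUP for t in tokens)
    out_count = sum(t in OUT_GROUP for t in tokens)
    # Crude proxy for divisive verb-pronoun framing such as "send them".
    combos = len(re.findall(r"\b(send|keep|throw|ban)\s+them\b", text.lower()))
    return [in_count / n, out_count / n,
            float(in_count > 0 and out_count > 0), float(combos)]

# Such features would be concatenated with word or character embeddings
# before training a downstream classifier.
print(othering_features("We need to send them back where they came from"))
```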
Previous research also includes classifying online hate by either using traits at the user level or directly classifying hateful users. Given the vast body of research showing how online hate is often propagated by a small, highly connected group of users (see [26]), there has been increased attention on leveraging user-level information for the task. At the 2021 Conference and Labs of the Evaluation Forum (CLEF), for instance, 66 academic teams participated in the task of directly detecting whether a Twitter user is likely to spread hate or not [51]. While most participants still focused primarily on the textual elements of these users’ tweets, some researchers also leveraged innovative strategies to extract and use richer information directly at the user level in the classification task. Irani et al. [24], for example, learned user-level embeddings with unsupervised and semi-supervised techniques, before combining them with textual features to more accurately classify hateful users in the provided dataset. However, few, if any, of the participants in the task incorporated network-level information when learning user-level embeddings, despite the inherently networked nature of social media [51].
Though still focused primarily on automated hateful content classification, a small but growing body of research leverages users’ networks, using simple features such as in- and out-degree or shallow embedding methods such as node2vec [7, 41]. Fehn Unsvåg and Gambäck [17] create a classifier in which textual features are augmented with user-specific ones, such as gender, activity and network connectivity (number of followers and friends). Using Logistic Regression (LR), they report a 3-point increase in the F1 score with the addition of user-specific features. Mishra et al. [41] use a variant of the word2vec skipgram model by Mikolov et al. [40] to create network (node2vec) embeddings of users based on their network positions and those of their neighbours. They concatenate the generated user network vectors with character n-gram count vectors and use an LR classifier, achieving a 4-point increase in the F1 score compared to the same model without the network embeddings.
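A simplified version of this kind of pipeline, concatenating pre-computed user network embeddings with character n-gram counts before fitting a logistic regression classifier, might look as follows. The `user_embeddings` lookup is hypothetical (standing in for the output of a separate node2vec run), scikit-learn is assumed, and this is a sketch of the general recipe rather than the exact setup of Mishra et al. [41].

```python
# Sketch: combine pre-computed user network embeddings with character
# n-gram counts, then train a logistic regression classifier.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

tweets = ["you people ruin everything", "lovely weather today"]
user_ids = ["u1", "u2"]
labels = [1, 0]  # 1 = hateful, 0 = not hateful (toy labels)

# Hypothetical {user_id: vector} lookup, e.g. produced by node2vec.
user_embeddings = {"u1": np.random.rand(64), "u2": np.random.rand(64)}

# Character n-gram count features (within word boundaries).
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))
text_features = vectorizer.fit_transform(tweets).toarray()

# Concatenate textual and network representations per user.
network_features = np.vstack([user_embeddings[u] for u in user_ids])
X = np.hstack([text_features, network_features])

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```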
One drawback of node2vec embeddings, however, is that they can only incorporate the structural and community information of a user’s network; the textual content of tweets written by other users in this network is not incorporated. This may, as Mishra et al. [42] demonstrate, lead to misclassification of a normal user’s content because of their hateful network neighbourhood through ‘guilt-by-association’. This is a problem for using node2vec models in the real world, where such mistakes would spark considerable backlash and opposition. Most importantly, from a practical perspective, shallow embeddings such as node2vec may be infeasible, as the embeddings need to be re-generated for the entire graph every time a new user enters the network [21]. The large and constantly changing user bases of social media platforms render such methods computationally impractical.
Moving from detecting hateful content to detecting hateful users, meanwhile, is not straightforward due to challenges in sampling [12, 53, 62] and in accessing network and other user attributes [25]. Geometric deep learning offers a very promising avenue for hateful user detection, and has been used to leverage network features in other classification problems. Ribeiro et al. [53] were the first to use geometric deep learning for hateful user detection, using GraphSAGE, a semi-supervised learning method for graph structures such as networks. It incorporates a user’s neighbourhood network structure as well as the features of their neighbours to learn node embeddings for unseen data [21]. Ribeiro et al. [53] achieve an overall unbalanced accuracy of 84.77% and an F1 score of 54%, a substantial improvement over their baseline model. However, one limitation of their work is the relative simplicity of the baseline, a decision tree that lacks any network information. This is not a like-for-like comparison, as the geometric deep learning model has access to additional features.
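To illustrate the core idea, the sketch below implements a single, simplified GraphSAGE-style layer with a mean aggregator in PyTorch: each user’s embedding is computed from their own features concatenated with the average of their neighbours’ features. A full model such as that of Ribeiro et al. [53] would additionally sample neighbourhoods, stack layers and train against labelled users; this is only a minimal sketch under those assumptions.

```python
# Simplified single GraphSAGE layer (mean aggregator) in PyTorch, showing
# how a user's own features and those of their neighbours are combined.
import torch
import torch.nn as nn

class SAGEMeanLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        # Weight applied to the concatenation of self and neighbour features.
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_nodes, in_dim) node feature matrix
        # adj: (num_nodes, num_nodes) binary adjacency matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh_mean = adj @ x / deg                # mean of neighbour features
        h = torch.cat([x, neigh_mean], dim=1)     # concatenate self + neighbours
        return torch.relu(self.linear(h))

# Toy example: 3 users with 4-dimensional features and a small follower graph.
x = torch.rand(3, 4)
adj = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
embeddings = SAGEMeanLayer(4, 8)(x, adj)
print(embeddings.shape)  # torch.Size([3, 8])
```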
Further, Ribeiro et al. [53] do not consider the implications of geometric deep learning beyond accuracy alone: network-level features may, theoretically at least, be more susceptible to the ‘guilt-by-association’ problem because of how they exploit homophily in social networks. This is a key concern; if this potentially fatal limitation of incorporating network features cannot be overcome, then the method cannot be reliably used in real-world content moderation.
2.3 Fairness in automated hate detection
Attention in online hate research has increasingly shifted from solely considering performance to also evaluating models based on their fairness. This is particularly important for online hate detection as biased models could risk perpetuating the very problems (of social discrimination and unfairness) that they are designed to challenge.
Fairness in online hate detection has been addressed in several prior studies. Thus far, fairness has largely been explored in relation to content, although it is likely that it also affects hateful user detection algorithms, presenting the potential for new sources of harm. To our knowledge, fairness in hateful user detection has not been systematically studied in previous work. Chung [10] shows that Google’s Perspective toxicity classifier disparately impacts African–American users by disproportionately misclassifying texts in African–American Vernacular English (AAVE) as toxic. Learning exclusively from textual elements, the model identifies relationships between words such as ‘black’, ‘dope’, or ‘ass’ and toxicity, as such words appear frequently in hateful posts. Given the frequent use of such words in AAVE, African–American users bear the negative implications of such erroneous learning.
Davidson et al. [11] use machine learning to identify AAVE tweets and find substantial and statistically significant racial disparity in the five datasets they investigate. AAVE tweets are, for models trained on some datasets, more than twice as likely to be associated with hate compared to Standard American English (SAE) tweets. They investigate how tweets containing ‘nigga’, a term re-appropriated by the African–American community but still frequently used to perpetrate hate, are classified. They find that a tweet’s dialect, independent of whether it uses ‘nigga’, strongly influences how it is classified: AAVE tweets are more likely to be classed as hate-related than SAE tweets, even where both contain the term ‘nigga’ [11].
Davidson et al. [11] argue that annotation errors in the training data, where black-aligned language may be more likely to be labelled as hateful, may explain some of these differences. Sap et al. [55] have confirmed this possibility, exploring two key hate speech datasets and finding that AAVE tweets are up to two times more likely to be labelled as offensive by human annotators compared to non-AAVE tweets. Models trained on such datasets learn and propagate these annotation biases, amplifying disparate impact. Even BERT-based classifiers can confuse the meaning of words such as ‘women’, ‘nigga’ or ‘queer’, falsely flagging non-hateful content that uses such words innocuously or self-referentially [45].
Fairness, alongside other concepts such as bias and discrimination, is contested in the ML literature. Mehrabi et al. [39] provide a comprehensive review of work on ‘Fair ML’, and define fairness as “the absence of any prejudice or favouritism towards an individual or a group based on their intrinsic or acquired traits in the context of decision-making”. However, they acknowledge that fairness can have varying notions both within and across disciplines, depending on how it is operationalized. Barocas and Hardt [2] have categorized notions of fairness into two groups: those centred around ‘disparate treatment’ and those around ‘disparate impact’. Disparate treatment characterizes decision-making in which a subject’s sensitive attributes, such as race, partly or fully influence the outcome. In contrast, disparate impact occurs where certain outcomes are disproportionately skewed towards certain sensitive groups [2]. Elsewhere, Kleinberg et al. [30] show that the constraints imposed by different definitions of fairness on ML systems are incompatible: in most practical applications, ML models cannot optimize for a particular definition of fairness without also causing unfair outcomes under a different definition. Thus, it is important to establish the notion of fairness being prioritized before models are optimized.
Trying to maximize algorithmic fairness can lead to a ‘fairness–accuracy trade-off’, where “satisfying the supplied fairness constraints is achieved only at the expense of accuracy” [65]. However, Wick et al. [65] have also shown, both theoretically and practically, that such a trade-off may be overcome in cases where systemic differences in the outcome variable (e.g., online hate) are not caused by the sensitive attribute (i.e., race). This is a plausible assumption in the current setting: nothing suggests that people of a particular race are intrinsically more hateful than those of another, especially after accounting for other factors. As such, it could be possible to optimize for both fairness and accuracy, given an effective model.
Fairness can be operationalized in different ways, each of which could be justified in certain situations [39]. One notion of fairness is demographic parity, which requires that a model’s predictions are the same across different sensitive groups [22]. While applicable in certain contexts, this may be suboptimal for hate detection tasks. The social reality of online hate means that certain groups are more likely to perpetrate it; white supremacists, for example, are more likely to do so than other users (of any race) [11]. Moreover, because demographic parity does not incorporate the accuracy of classifications (i.e., whether predicted outcomes match actual labels), it could theoretically be ‘achieved’ despite many misclassifications.
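In formal terms (with notation assumed here for illustration: $\hat{Y}$ denotes the model’s prediction and $A$ a sensitive attribute such as race), demographic parity requires

$$P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b) \quad \text{for all groups } a, b,$$

that is, the rate of positive (hateful) predictions must be equal across sensitive groups, irrespective of the true labels.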
The notion of ‘predictive equality’ instead assesses the balance of false positive error rates across groups, and can also be used to evaluate fairness [58]. Predictive equality is particularly appropriate for studying online hate, where false positives have historically fallen disproportionately on minority groups. Such errors also cause magnified harm compared to false negatives, given that they penalize the very groups that hate classifiers are meant to protect [14].
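Using the same notation, and writing $Y$ for the true label, predictive equality requires equal false positive rates across groups:

$$P(\hat{Y} = 1 \mid Y = 0, A = a) = P(\hat{Y} = 1 \mid Y = 0, A = b),$$

so that non-hateful content (or users) from different groups is equally likely to be wrongly flagged as hateful.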
A previously unstudied way of effectively and efficiently addressing fairness in online hate detection could be to incorporate user-specific information by leveraging their social networks. Given homophily in such networks, ML models incorporating network information might learn user characteristics without these being manually specified. Indeed, new approaches to ‘Fairness without Demographics’ in ML have sought to exploit correlations between observed features and unobserved sensitive group attributes to boost fairness [32]. When social networks are effectively incorporated into ML models, this homogeneity can be exploited to infer contextual information around an individual’s identity, without directly extracting and feeding such features into the model. This may lead not only to more scalable models for online hate detection, but also to fairer and more accurate ones. For instance, the text of a user can be contextualized against their identity, helping resolve ambiguities stemming from the analysis of text alone, such as with polysemous words like ‘nigga’. This approach also has the clear advantage of minimizing the amount of sensitive data that needs to be harvested, assessed, and stored, reducing the risk of privacy invasions.