LEIA: Linguistic Embeddings for the Identification of Affect

The wealth of text data generated by social media has enabled new kinds of analysis of emotions with language models. These models are often trained on small and costly datasets of text annotations produced by readers who guess the emotions expressed by others in social media posts. This affects the quality of emotion identification methods due to training data size limitations and noise in the production of labels used in model development. We present LEIA, a model for emotion identification in text that has been trained on a dataset of more than 6 million posts with self-annotated emotion labels for happiness, affection, sadness, anger, and fear. LEIA is based on a word masking method that enhances the learning of emotion words during model pre-training. LEIA achieves macro-F1 values of approximately 73 on three in-domain test datasets, outperforming other supervised and unsupervised methods in a strong benchmark that shows that LEIA generalizes across posts, users, and time periods. We further perform an out-of-domain evaluation on five different datasets of social media and other sources, showing LEIA’s robust performance across media, data collection methods, and annotation schemes. Our results show that LEIA generalizes its classification of anger, happiness, and sadness beyond the domain it was trained on. LEIA can be applied in future research to provide better identification of emotions in text from the perspective of the writer.


Introduction
Automatic identification of emotion in text is a valuable tool to study affect through social media and other text digital traces [1]. Word-based methods enabled the study of mood expressions on Twitter [2] in relation to daylight oscillations [3] and of collective emotions in social resilience [4]. Rule-based methods allowed the quantification of emotion contagion on Twitter [5] and the dynamics of emotions after affect labeling on social media [6]. More advanced classification methods trained on labeled data in various languages have been used to test the effect of air pollution on happiness in Weibo posts [7], to study the expression of emotions on Twitter about Black Lives Matter [8], and to validate social media emotion macroscopes against survey data [9,10]. Beyond research, emotion detection from social media text has clinical potential to identify users at mental health risk [11] and can help platforms to detect abusive language [12].
Despite its potential, the use of emotion detection from social media text faces important challenges. Dictionary methods applied to social media text provide user-level metrics that are weakly correlated with answers to affective questionnaires [13]. Furthermore, dictionary-based emotion analysis methods have weak correlations with population-level emotion prevalence [14], but the same study shows that more advanced supervised methods show promise for capturing well-being. One source of problems when applying social media text to the study of emotions is the sensitivity of methods to particular domains. For example, [15] applied out-of-the-box sentiment analysis in a benchmark of different domains and found that methods are very sensitive to the medium and text source. This is part of a general problem in which language model performance degrades with distribution shifts [16], weakening the validity of emotion detection from text in out-of-domain (OOD) settings.
A source of error in emotion detection in social media is the way in which training labels are produced. While the target of applications is often to infer a subjective emotional state of the author of a social media post, the labels of training data are frequently produced by readers and not by the authors of the posts. Crowdsourcing can contribute to this problem. Gathering several annotations per text alleviates it, but the noise caused by readers not understanding the emotional state of writers always remains. For example, a comparison between reader and writer annotations shows that they disagree 25% of the time [17]. To avoid this problem, experience sampling can be used to generate self-annotated emotion labels. For example, [18] gathered anxiety scores at the time when individuals posted tweets and compared self-reported anxiety with emotion text analysis. The resulting correlations are at most 0.24, calling for studies that can leverage large datasets to identify emotional states more accurately.
New platforms to share emotional experiences with other users offer the possibility to gather large-scale datasets with emotion self-annotations. Vent is an example that offers a particularly good source of self-annotated data, as the dataset available for researchers has millions of posts [19] and the platform is designed precisely for sharing emotions, rather than offering this as a minor feature as other platforms do. Recent research on Vent has shown the difficulty of predicting Vent's precise mood labels from text [20], but it remains to be explored how Vent can be used to infer coarser emotion labels that match discrete emotion classes from psychological research. In this work, we focus on a subset of Vent tags that can be mapped to standard emotional states, with the goal of training a better and more robust emotion detection model that can be applied to other text sources, especially from other social media. In the following, we present the design and development of LEIA, followed by an empirical analysis in a benchmark of in-domain and out-of-domain tests. We further analyze examples of classification errors and outputs of LEIA to understand its limitations and paths for improvement.

Related Work
Emotion classification models mainly follow feature-based or neural approaches. Feature-based methods [21] employ handcrafted features built from resources such as emotion lexica. Neural approaches often rely on pre-trained representations such as word embeddings and contextual language models (LMs). The use of transformer-based LMs has been shown to yield state-of-the-art performance on natural language processing benchmarks. For emotion classification, recent research works have achieved better performance using pre-trained LMs [22,23,24].

Learning representations for affect
A number of existing works learn representations for affective tasks. DeepMoji [25] is a neural network trained for predicting emoji in tweets using a large distant-labeled dataset considering 64 emojis as labels. Sentiment-specific word embeddings [26] encode sentiment information into the vector representation of words for sentiment analysis. Sentiment-aware language representation learning (SentiLARE) [27] incorporates part-of-speech and word polarity to enhance representation learning of a contextual language model for sentiment analysis tasks. Another effective strategy in several natural language processing tasks is to pre-train transformer models on a large collection of text and then fine-tune the model for other downstream tasks [28], including tasks in the social media domain [24,22]. In this strategy, the adaptation step often relies on the masked language modeling objective where random tokens are masked and the model is trained to predict the masked tokens. Alternative masking strategies have been proposed to improve the pre-training task either by masking important words [29] or masking words relevant for a given downstream task. Recently, emotion masked language modeling (eMLM) was proposed in [30] to preferentially mask emotion words for contextual language representation learning. Similar to SentiLARE, eMLM also relied on existing lexical resources by masking emotional words more frequently when training a Bidirectional Encoder Representations from Transformers (BERT) model from scratch, yielding improvements in downstream affect-related tasks. Motivated by these results, we employ eMLM in the design of LEIA as we explain below.

Fine-tuning strategies and model generalization
Supervised models can show a performance drop when faced with domain shifts, i.e. when they are applied to text from a domain that is not the same as the domain of their training data [16]. A recent result in computer vision [31] showed that this performance gap across domains can be mitigated with a fine-tuning strategy that first performs linear probing to align the features of the prediction head with the pre-trained base model and then fine-tunes all model parameters. This approach is similar to those proposed in [32] and provides a further theoretical basis as well as empirical validation. Linear probing is a non-destructive and computationally cheap approach that freezes the parameters of the base model and only updates the parameters of the prediction head during training. In this work, we consider this strategy in the context of text classification for the identification of emotion.

Emotion classification datasets
Supervised models are trained and evaluated against emotion text datasets that are either constructed by manual labeling or automatically by using additional data sources and structures. Manually-labeled datasets are usually comparatively small while automatically-constructed datasets are built by identifying emotion-bearing patterns of expression such as hashtags in the case of Twitter. The annotation of emotion datasets can also be divided into reader-labeled and writer-labeled datasets. Reader-labeled datasets are assigned labels by the annotators post-hoc based on their perception of the emotions expressed by a given content. On the other hand, writer-labeled datasets are usually self-annotated by the writer of the message to reflect their emotion.
Most of the existing work on emotion classification has drawn on manually annotated, automatically constructed, and reader-labeled datasets. Recently, large-scale writer-labeled datasets have been introduced [19,33], but they are yet to become part of the benchmarks of emotion detection tasks. A notable example is the Vent dataset [19], which is produced by a specialized social media platform with the goal of encouraging people to write about their feelings and provide a tag. The quality of the self-annotated emotion data drawn from Vent was examined and led to the conclusion that the tagged emotional expressions are indicative of emotional content [34]. Furthermore, the distinction between reader-labeled and writer-labeled datasets was analyzed in [20], with the findings indicating that classifying the emotion labels of these datasets is a hard task when considering all available labels in the platform. As supervised methods tend to perform better than unsupervised ones and gathering manual annotations is time-consuming and expensive, this kind of self-annotated dataset offers a potential alternative beyond indirect self-annotations within the text, as in Twitter hashtags.

Experimental Setup
We illustrate our experimental setup in Figure 1. Next, we describe this setup in more detail, starting with the datasets for training and evaluating our models, followed by details on the implementation of our proposed models and baselines.

Datasets
The Vent dataset [19] consists of 33 million posts from the Vent social media app. Each post is annotated by its author with an emotion tag as a way to express their emotional state to others. While the dataset has 705 emotion tags, many are temporary tags about seasonal events that do not express a clear emotional state and the most frequent tags are used on the vast majority of posts. Since Vent was designed to provide a nuanced expression of emotions rather than text classification, we mapped Vent emotion tags to a list of emotional states consistent with individual emotions from the affective science literature [35]. This way, we map emotion tags with words close in dimensional models of emotion [36] into the same label, for example, mapping the tags angry and annoyed into the same label of Anger. The precise mapping can be found in Table 1. Four of these emotion labels map to linguistic classes that have been consistently identified in emotional expression in text [37]: Sadness, Anger, Fear, and Happiness. We added a fifth category, Affection, which occurs more frequently than Happiness and shows a social orientation of the expression of positive emotions on social media.
We pre-process the Vent dataset to generate a cleaner dataset of posts in English that were labeled by their authors with one of the tags of Table 1. We remove non-English posts using three language identification tools, including gcld3. For a post to be included in our analysis, at least two out of the three methods had to agree on detecting it as English. After that, we remove duplicates and tag memes (invitations for a challenge to answer a question), following the approach in [34]. We remove posts with fewer than three words, excluding placeholders for links and user mentions in the word count. We also normalize the text by replacing multiple whitespaces with a single occurrence. We remove tab, new line and carriage return characters as well as Hypertext Markup Language codes. The resulting dataset contains more than nine million posts with metadata including the emotion labels, pseudonymized user ids, and timestamps when the post was written.

In-domain evaluation datasets
An overview of this study can be seen in Figure 1, including data sources and data splits for in-domain evaluation. We split the pre-processed Vent dataset into a training/development/test split with three disjoint test datasets to assess the capability of the model to generalize emotion identification. The random test set contains a uniformly random selection of 10% of all posts in the Vent dataset. The user test set consists of all posts written by a random sample of 10% of the users. This way, no post in the training set has been written by any of the users in the user test set. The temporal test set contains the last 10% of the posts according to their timestamp, thus allowing us to evaluate the model with future data with respect to its training set. We additionally extracted another 10% random set from the remaining posts as a development set to guide model design before the final run of all tests. All these subsets are disjoint and the three tests allow us to evaluate if and how the model generalizes across posts, users, and time. The resulting exact counts of posts and emotion labels in all splits can be found in Table 2.
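The three in-domain test sets described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Post` record, the `split_posts` function, and the order in which the subsets are carved out (temporal first, then users, then random and development samples) are hypothetical choices, since the text specifies only the sizes and the disjointness of the subsets.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    # Hypothetical record layout; the real dataset has more metadata.
    post_id: int
    user_id: int
    timestamp: int  # any sortable time value


def split_posts(posts, frac=0.10, seed=0):
    """Return disjoint train/dev/random/user/temporal subsets of `posts`."""
    rng = random.Random(seed)
    n = int(len(posts) * frac)  # size of each post-level sample

    # 1) Temporal test set: the latest `frac` of posts by timestamp.
    ordered = sorted(posts, key=lambda p: p.timestamp)
    cut = len(ordered) - n
    temporal, pool = ordered[cut:], ordered[:cut]

    # 2) User test set: all remaining posts of a random `frac` of users,
    #    so that none of these users appears in the training set.
    users = sorted({p.user_id for p in posts})
    held_out = set(rng.sample(users, int(len(users) * frac)))
    user_test = [p for p in pool if p.user_id in held_out]
    pool = [p for p in pool if p.user_id not in held_out]

    # 3) Random test set and development set, drawn without replacement.
    rng.shuffle(pool)
    random_test, dev, train = pool[:n], pool[n:2 * n], pool[2 * n:]
    return {"train": train, "dev": dev, "random": random_test,
            "user": user_test, "temporal": temporal}
```

With this ordering, every post lands in exactly one subset, the temporal test set is strictly later than all training posts, and users in the user test set never contribute to training.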

Out-of-domain evaluation datasets
To evaluate if models learn about emotional expression beyond the domain of Vent as a social platform, we include five OOD datasets with emotion labels and texts associated with the emotions. The OOD datasets are the following:
• enISEAR [17] is a dataset of emotional event descriptions in English using the International Survey on Emotion Antecedents and Reactions (ISEAR) approach [38] via crowdsourcing. Annotators generated event-focused emotion descriptions using the template: "I felt [emotion] when/because [situation]". While the study included annotations by readers, we only use the annotation of the author of the text to evaluate models. The dataset consists of 1001 instances for seven emotions, four of which match our emotion labels to provide an out-of-domain test. We design the task as a prediction of the text in which we have replaced the emotion word with the placeholder mask, which is a special token common in language models to denote a missing word. enISEAR is generated by asking participants to describe an emotion-inducing situation, a design that limits its external validity with respect to social media but that has the highest standard of internal validity with text annotations produced in a controlled setup. We consider enISEAR as the out-of-domain dataset most relevant to test the psychological validity of the emotion detection of models, while other datasets from social media are necessary to evaluate models in other domains once this psychological validity level is clear.
• GoEmotions [23] is a corpus of English comments extracted from Reddit with manual annotations for multiple emotions. It is a reader-labeled emotion dataset with labels assigned when at least three annotators gave the same label to a comment. For our out-of-domain test, we include the subset of the test split with a single label from among the Ekman categories of the dataset, thus having Sadness, Anger, Fear, and Joy as a general positive emotion label.
• TEC [39] is a corpus of tweets posted between Nov. 15, 2011 and Dec. 6, 2011 that are self-labeled for emotions through emotion-word hashtags. The hashtags serve as the emotion label for classification and are removed from the tweet texts. We sample 10% of the dataset at random as our out-of-domain test set. Since the hashtags are assigned by the authors of the tweets, the dataset can be considered labeled from the perspective of the writer.
• Universal Joy [33] is a collection of anonymized public Facebook posts in 18 languages labeled with five emotions: anger, anticipation, fear, joy, and sadness. The labels are derived from the Facebook "feelings tag" provided by the writers of the posts. We use the English subset of the test set for our analysis.
• SemEval [40] is a collection of tweets in three languages from 2016 and 2017 collected from Twitter using emotion keywords as queries. Subsequently, matching tweets were annotated by crowdworkers for emotion intensity, valence, and basic emotion classes. This dataset was the benchmark data for the competition about affect detection in SemEval. Here, we use the test data by including only instances with a single label that corresponds to one of the labels in our model.
Note that for four of the OOD datasets (GoEmotions, TEC, Universal Joy, and SemEval), we use only the test sample for OOD evaluation and exclude other training or development samples. We do this to provide an evaluation that can be compared to previous and future supervised methods that use the training samples. Based on our selection criteria, we find only 11 tweets with the Affection label in the SemEval dataset. We therefore treat Happiness and Affection as a single Happiness label, which limits the nuance with which we can assess classifications within positive emotions in out-of-domain settings but still enables a wider differentiation between general positive emotions and three negative emotions. Descriptive statistics of the counts and proportions of labels in the five datasets can be found in Table 3.
We use the in-domain and OOD datasets to evaluate the performance of models in our experimental setup. We calculate the macro-averaged F1 score over all emotion labels and report results with the F1 score of each of the emotion labels, as their frequencies greatly differ in several of the datasets we use for evaluation.
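As a reminder of how the macro-averaged F1 score treats rare and frequent labels equally, the following minimal sketch computes the per-label F1 and its unweighted mean. The function names are illustrative, and the convention of scoring 0 for a label with no true or predicted instances is an assumption. Scores are on a 0 to 1 scale here, whereas the results in this paper are reported multiplied by 100.

```python
def f1_per_label(y_true, y_pred, labels):
    """F1 score of each label, computed one-vs-rest from the label pairs."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no support and no predictions gets 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    per_label = f1_per_label(y_true, y_pred, labels)
    return sum(per_label.values()) / len(labels)


labels = ["Anger", "Fear", "Sadness"]
y_true = ["Anger", "Anger", "Fear", "Sadness"]
y_pred = ["Anger", "Fear", "Fear", "Sadness"]
print(f1_per_label(y_true, y_pred, labels))
print(macro_f1(y_true, y_pred, labels))  # 7/9, about 0.778
```

Because each label contributes equally to the mean, a model cannot reach a high macro-F1 by performing well only on the most frequent label.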

Models
Model design and pre-training
Pre-trained language models have shown state-of-the-art performance on many natural language processing tasks. We expect language models pre-trained on social media data to perform better on the Vent dataset. In preliminary experiments using performance on the development set, we test three pre-trained models based on the Robustly optimized BERT approach (RoBERTa) architecture and pre-training: RoBERTa-base [41], Twitter-RoBERTa [22], and BERTweet-base [24]. BERTweet-base had the best performance on the development set and thus we chose to continue our work with BERTweet-base and its large version, BERTweet-large, in all our experiments. BERTweet-base and BERTweet-large are transformer models pre-trained on 850M tweets with 12 and 24 layers, respectively. BERTweet-base has a maximum sequence length of 128 (sub)words while BERTweet-large has a maximum sequence length of 512 (sub)words [24]. Before training a classifier on the training set, we pre-train BERTweet-base (BERTweet-large) on the text of Vent posts in the training set, ignoring all emotion labels. We perform task-adaptive pre-training [28] by preferentially masking emotion words using eMLM. We use the emotion terms in the NRC emotion lexicon [42,43] as it is one of the most extensive emotion lexicons available. We set the probability of masking emotion words to 0.5 following previous work [30]. We train with the eMLM objective for 100K steps using the AdamW optimizer [44], a learning rate of 5 × 10^-5, and a batch size of 128. We name the resulting models LEIA-LM-base and LEIA-LM-large, i.e. the result of our pre-training of BERTweet-base and BERTweet-large respectively. On an NVIDIA RTX8000 GPU, pre-training takes approximately a week for the base model and a month for the large model.
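The idea of preferentially masking emotion words can be illustrated with a simplified sketch. It is not the eMLM implementation of [30]: the function `emlm_mask`, the whole-word matching against a toy lexicon, and the base rate of 0.15 for non-emotion tokens (the usual value in BERT-style masking) are assumptions for illustration, and the original method may additionally control the overall share of masked tokens. Only the emotion-word probability of 0.5 comes from the text above.

```python
import random

MASK = "<mask>"  # placeholder token; the real tokenizer defines its own


def emlm_mask(tokens, emotion_lexicon, p_emotion=0.5, p_other=0.15, rng=None):
    """Mask emotion words with a higher probability than other tokens.

    Returns the masked sequence and the prediction targets, where None
    marks positions that would be ignored by the language modeling loss.
    """
    rng = rng or random.Random()
    masked, targets = [], []
    for tok in tokens:
        p = p_emotion if tok.lower() in emotion_lexicon else p_other
        if rng.random() < p:
            masked.append(MASK)
            targets.append(tok)   # the model is trained to recover this token
        else:
            masked.append(tok)
            targets.append(None)  # position not used as a training target
    return masked, targets


# Toy lexicon standing in for the emotion terms of a real lexicon.
lexicon = {"happy", "angry", "afraid"}
print(emlm_mask("i am so happy today".split(), lexicon, rng=random.Random(1)))
```

Over many posts, emotion words end up as prediction targets far more often than under uniform random masking, which is the property the pre-training step relies on.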

Model fine-tuning with labeled data
We implement a multiclass classifier for the five emotion labels: Anger, Fear, Sadness, Happiness, and Affection. We train classifiers starting from LEIA-LM-base and LEIA-LM-large using a two-step approach. First, we perform linear probing to initialize the classifier head and then full fine-tuning of the model. For linear probing, only the classifier head is randomly initialized and trained on the training dataset while the remaining model parameters are fixed. This initial step can be seen as a way to align the features of the prediction head and the base model to minimize feature distortion [31]. In the subsequent full fine-tuning step, the prediction head is initialized from the parameters learned from the initial linear probing step. We also fine-tune a BERTweet-base and a BERTweet-large model without the eMLM step. To improve model generalization, we average model weights [45] of the two model variants (one with eMLM and one without eMLM) for each of the base and large architectures. The resulting models are respectively named LEIA-base and LEIA-large. We show the performance of the intermediate model variants on the in-domain and OOD test sets as an appendix in Tables 6 and 7.
For the linear probing step, we use a learning rate of 5 × 10^-4 and train only the classifier head while the other layers are frozen for 1000 steps. For fine-tuning, we set the learning rate to 10^-5 with a constant learning rate schedule, embedding dropout of 0.1, weight decay factor of 0.01, and a label smoothing factor of 0.1. We train for 5 epochs using the AdamW optimizer with an effective batch size of 256 and a maximum sequence length of 128. We jointly optimize a supervised contrastive loss and a cross-entropy loss [46]. The supervised contrastive loss ensures that the model captures the similarity between examples within a class while contrasting them with examples from other classes. This approach has been shown to aid model generalization. Following prior work [46], we set the weight of the contrastive loss to 0.9 and the temperature parameter to 0.3. The fine-tuning process takes approximately 24 hours for the base-sized model and 60 hours for the large-sized model on an NVIDIA RTX8000 GPU with 48GB memory.
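To make the joint objective concrete, the sketch below combines a cross-entropy term with a supervised contrastive term on a single batch. It is a plain-Python illustration under several assumptions: the two terms are combined as a convex combination, (1 - weight) times cross-entropy plus weight times the contrastive loss, the contrastive term is averaged over anchors that have at least one same-class example in the batch, embeddings are already L2-normalized, and label smoothing is omitted. Only the weight of 0.9 and the temperature of 0.3 come from the text; the exact formulation used in training may differ.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def supervised_contrastive(embeddings, labels, temperature=0.3):
    """Supervised contrastive loss over one batch of normalized embeddings."""
    n = len(labels)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without same-class partners is skipped
        # Temperature-scaled similarity between the anchor and all others.
        sims = {j: dot(embeddings[i], embeddings[j]) / temperature
                for j in range(n) if j != i}
        log_denominator = math.log(sum(math.exp(s) for s in sims.values()))
        # Pull same-class examples together relative to all other examples.
        total -= sum(sims[j] - log_denominator for j in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


def joint_loss(probs, embeddings, labels, weight=0.9, temperature=0.3):
    """Assumed convex combination of the two losses."""
    ce = cross_entropy(probs, labels)
    scl = supervised_contrastive(embeddings, labels, temperature)
    return (1 - weight) * ce + weight * scl
```

In this sketch, a batch whose embeddings cluster by class yields a lower contrastive term than the same embeddings with shuffled labels, which is the behavior the contrastive loss rewards.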

Baselines
As baselines, we use the popular Linguistic Inquiry and Word Count (LIWC) dictionary approach [47] and a Naive Bayes Support Vector Machine (NBSVM) as a supervised baseline. For the LIWC approach, we map the score for the relevant LIWC categories to emotion labels as follows: emo_anger to Anger, emo_anx to Fear, emo_sad to Sadness, and emo_pos to Happiness. We did not find a category that can be mapped to Affection in the LIWC categories, thus considering only 4 classes for the dictionary-based baseline. We convert the multiclass result of LIWC to a binary classification task for each emotion label using the "one-vs-rest" setting. Taking the Sadness category as an example, we consider instances within the Sadness category as having a label of 1 while all other examples are assigned a label of 0. We first compute the base frequency as the average of the LIWC score for each emotion category on the Vent development set and divide the LIWC score for each post in the test set by the base frequency. If the quotient is greater than 1, we predict that the respective emotion is present in the post, and otherwise that it is absent. We choose this option to handle cases where the LIWC dictionaries do not match any word in a given post, which results in a score of 0 across all labels.
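The thresholding rule of the dictionary baseline can be sketched as follows. The per-post scores are assumed to be already computed by the dictionary tool, the function names are hypothetical, and the category keys are written with underscores as an assumption about how the tool names its output columns.

```python
# Mapping from dictionary categories to emotion labels, as described above.
CATEGORY_TO_LABEL = {
    "emo_anger": "Anger",
    "emo_anx": "Fear",
    "emo_sad": "Sadness",
    "emo_pos": "Happiness",
}


def base_frequencies(dev_scores):
    """Average score of each category over the development posts."""
    return {c: sum(post[c] for post in dev_scores) / len(dev_scores)
            for c in CATEGORY_TO_LABEL}


def predict_presence(post_scores, base):
    """One-vs-rest decision per label: 1 if score / base frequency > 1."""
    return {CATEGORY_TO_LABEL[c]: int(base[c] > 0 and post_scores[c] / base[c] > 1)
            for c in CATEGORY_TO_LABEL}


# Toy development scores (made-up numbers, for illustration only).
dev = [
    {"emo_anger": 2.0, "emo_anx": 0.0, "emo_sad": 0.0, "emo_pos": 1.0},
    {"emo_anger": 0.0, "emo_anx": 1.0, "emo_sad": 4.0, "emo_pos": 1.0},
]
base = base_frequencies(dev)
print(predict_presence({"emo_anger": 3.0, "emo_anx": 0.0, "emo_sad": 0.0, "emo_pos": 0.5}, base))
# A post with no dictionary matches scores 0 everywhere and gets no label.
print(predict_presence({"emo_anger": 0.0, "emo_anx": 0.0, "emo_sad": 0.0, "emo_pos": 0.0}, base))
```

Because each label is decided independently, a post can receive several labels or none, which is why this baseline is evaluated as four binary tasks rather than one multiclass task.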
We use NBSVM [48] as a supervised baseline. NBSVM is a strong baseline for text classification that uses Naive Bayes features for unigrams as input representation. We use the implementation in Ktrain [49] with a vocabulary size of 64K.

Results and Analysis
In this section, we report the performance of LEIA-base and LEIA-large in both in-domain and out-of-domain scenarios. We include the macro-F1 score and bootstrapping confidence intervals obtained from 10,000 bootstrap samples. We provide an error analysis on a sample of incorrect model predictions. We end by assessing the salient features on selected examples of model predictions.
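A bootstrap confidence interval for the macro-F1 score can be obtained by resampling test instances with replacement, as in the sketch below. The percentile method and the 95% level are assumptions, since the text states only the number of bootstrap samples, and the function names are illustrative.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of one-vs-rest F1 scores (0 for labels without support)."""
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0
    return total / len(labels)


def bootstrap_ci(y_true, y_pred, labels, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for macro-F1 over resampled instances."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample instance indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx],
                              [y_pred[i] for i in idx], labels))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Two methods can then be compared by checking whether their intervals overlap, or more directly by bootstrapping the difference of their scores on the same resampled indices.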

In-domain results
Table 4 shows that LEIA-base and LEIA-large outperform all other models in all three Vent test samples, achieving a macro-F1 of about 73 on random posts, text from unseen users, and different time periods. Model performance is comparable across all three test sets, which indicates that its F1 score is not achieved by exploiting biases of user activity or high-volume time periods. The dictionary approach has the lowest macro-F1 scores, being significantly outperformed by LEIA-base and LEIA-large. The supervised approach of NBSVM achieves macro-F1 scores of about 60 but is still substantially and significantly outperformed by LEIA-base and LEIA-large.
Figure 2 shows a breakdown of F1 per emotion class in the in-domain test samples. LEIA-base and LEIA-large show consistently high F1 scores for all emotion classes. This shows that the general performance of LEIA-base and LEIA-large is not the result of higher performance on the majority class. The only class that has a slightly lower F1 is Fear, but LEIA-base and LEIA-large still outperform all other methods on it. One observation is that NBSVM also performs slightly worse for Fear than for other emotions, in contrast with LIWC, which obtains a comparatively better performance in the Fear category.

[Figure 2: F1 score per emotion class (Affection, Anger, Fear, Happiness, Sadness) in the in-domain test samples]

Out-of-domain results
Our out-of-domain benchmark shows that LEIA can detect emotional states in other types of text and social media platforms beyond Vent. Table 5 shows the macro-F1 scores for the five out-of-domain test sets. LEIA-base and LEIA-large have significantly higher F1 scores than all other methods when evaluated on 4 out of the 5 OOD datasets. The NBSVM has a comparable performance in the GoEmotions dataset, where the F1 of NBSVM and of LEIA-base are not significantly different. We also observe that a larger model does not necessarily lead to better performance on OOD datasets, as LEIA-large only shows a substantially different performance on the enISEAR dataset.
Figure 3 shows the F1 score for each class on the OOD datasets. LEIA generally outperforms the baselines across labels. LEIA is significantly better than the baselines for Happiness and Sadness in the Universal Joy and TEC datasets, for all emotions in the enISEAR dataset, and for all emotions except Fear and Sadness in the SemEval dataset. On the GoEmotions dataset, LEIA is tied with NBSVM as the best method to detect Anger, as their F1 scores are not significantly different. The Fear class evaluation poses some challenges in this OOD evaluation since evaluation samples for this class can be very small (e.g., 11 posts in Universal Joy and 77 in GoEmotions). In the case of Fear, the dictionary approach performs significantly better than the supervised approaches on GoEmotions, SemEval, and TEC. Recall that the dictionary approach is based on a binary classification setting, which is easier than a multiclass classification setting. Despite this, the performance of the dictionary approach is significantly lower for Happiness, even reaching an F1 score of 0 on the enISEAR dataset. This trend is similar to the performance observed on the in-domain test sets.
We can conclude that LEIA shows a good generalization beyond the domain it was trained on, first by achieving very high performance in enISEAR, the test closest to psychological methodology, but also by achieving good performance for datasets that include posts from other social media such as Twitter and Facebook. The lower performance recorded for Fear on the out-of-domain test sets is not surprising, as the model performance on this category tends to be lower on the in-domain test sets too. LEIA achieves a consistently high score for Happiness on the out-of-domain test sets despite the fact that it is one of the least frequent categories in the training set. This suggests that it constitutes an easier category for the model to recognize across domains than more nuanced negative emotions.

Error analysis
We examine a random sample of 50 incorrect predictions from the user test split (10 per label) of the Vent dataset. We find that the majority of errors in the sample can be categorized into the following cases:
1. Messages conveying an expectation of a positive outcome while the self-assigned label has negative valence (e.g., I need a good online game). These cases represent situations where the text is very similar to positive texts but subtle signals point toward negative states.
2. Expressions of both positive and negative emotions at the same time. These are assigned a single label by design but other labelling schemes could cope with mixed emotions.
3. Use of figurative expressions such as humor or sarcasm that the model does not recognize.
4. Very short posts that do not contain indications about the emotional state of the author (e.g., going for a coffee) where additional context is required.
5. A few instances where we find the model prediction more plausible than the assigned label.

Feature attributions
We examine the salient features that contribute to the predictions made by LEIA-base on a set of examples from the enISEAR dataset. We apply the Local Interpretable Model-agnostic Explanations (LIME) method for model interpretability [50], an attribution method for identifying salient features as n-grams of the classified text. Figure 4 shows four examples, one for each class of emotions in the enISEAR test set. The first column shows the model confidence scores for each class supported by LEIA-base and the text is colored according to which words contribute to the prediction. We observe that for the first example, the model incorrectly predicts Affection as the most likely label where the true label is Happiness, which is an error of a weaker kind since enISEAR does not have an Affection label and both emotions are close in terms of valence. The second highest class is Happiness and the prediction is positively based on words expressing high arousal and valence (e.g., "incredible") and negatively based on the word "worrying". In the second example, the model also seems to use relevant words linked to each other (e.g., "children" and "lied") to make the correct prediction. The model correctly predicts Sadness for the third example building on negative words, including terms linked to property damage that caused an emotional loss. We observe that the scores for Fear and Sadness are very close and much higher than for other classes. This seems plausible as the first sentence in this example could be a fearful situation. The model prediction is Happiness in the fourth example instead of Fear, which was the true label. Even though the prediction relies on relevant features, the model seems to lack the commonsense knowledge that cycling down a mountain can be scary and not necessarily a pleasant experience.
The last two cases of the error analysis suggest that the emotion tag of some posts is used as the main medium to express the emotion, leaving the text to add other information. This is one of the limitations of using Vent as a training dataset, as labels are part of the communication and may complement the text of a post rather than describe it.

Discussion
We present LEIA, a language model in two sizes (LEIA-base and LEIA-large) that leverages approaches for adapting pre-trained language models for emotion identification. We show that using an emotion lexicon with task-adaptive pre-training, in this case focusing on emotion words, is effective for improving model performance using BERTweet-base and BERTweet-large language models. LEIA generalizes beyond the Vent posts it was trained on, as it also performs better than other methods on texts written by users not included in its training data and on future time periods. It achieves a balanced performance across emotion labels despite their imbalance in training data, and this performance is also seen on out-of-domain texts for the considered emotions except for Fear. These results are in part possible thanks to focusing on a small set of emotions suggested by psychological research, as classifying the larger set of mood labels in Vent [20] is a substantially harder task we did not tackle here. In addition, although Vent is not as large a platform as those common in research, e.g., Twitter and Reddit, the Vent dataset is large enough for the models to learn a broad range of emotional expressions.
The performance of LEIA-base is comparable to that of LEIA-large across tests in our benchmark, with one notable exception: LEIA-large is substantially better on the enISEAR dataset. This dataset is especially important given the psychological methodology used to generate it, which allows us to compare the results of machine learning methods with self-reported labels in a controlled setup. LEIA's performance on enISEAR is especially high, reaching an F1 of 70 for LEIA-base and 79 for LEIA-large. This shows a high level of psychological validity, especially when compared to the other methods in the benchmark, which achieve at most 55. LIWC generally achieves a low F1 in all tests except SemEval, which warrants two remarks. First, SemEval was generated by searching for tweets with emotion-bearing terms, easing the task of LIWC when classifying emotions based on similar word lists. Second, LIWC was not designed as an emotion classification method at the scale of a single social media post, but as a more general text analysis method that should be applied to longer texts and not necessarily for classification. We added LIWC as a contrast with common methods applied in the field, but our comparison overstretches the applications for which LIWC was designed.
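For readers less familiar with the metric, the following minimal sketch shows how per-label F1 and macro-F1 can be computed from gold and predicted labels. It is an illustration under stated assumptions, not the evaluation code of the paper: the function names and the toy label sequences are hypothetical, and we assume that macro-F1 is the unweighted mean of per-label F1 over the labels of the test set. Scores here are on a 0 to 1 scale, whereas the values reported in the text are on a 0 to 100 scale.

```python
from typing import Dict, Sequence


def per_label_f1(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> Dict[str, float]:
    """F1 for each label, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A label that is never present nor predicted gets F1 = 0 by convention.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> float:
    """Unweighted mean of per-label F1, so every label counts equally."""
    scores = per_label_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy data: the test set has no Affection label, but the
    # classifier can still predict it, which counts as an error for Happiness.
    test_labels = ["Happiness", "Sadness", "Anger", "Fear"]
    gold = ["Happiness", "Happiness", "Sadness", "Anger", "Fear", "Fear"]
    pred = ["Happiness", "Affection", "Sadness", "Anger", "Happiness", "Fear"]
    print(per_label_f1(gold, pred, test_labels))
    print(macro_f1(gold, pred, test_labels))
```

Because the mean is unweighted, a rare label such as Fear influences macro-F1 as much as a frequent one, which is why the metric is a natural choice when label frequencies are imbalanced.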
Limitations. While we show that our proposed models are effective, our experiments span only two model sizes with the same architecture. Future research should conduct experiments on other pre-training approaches beyond masking as well as on more efficient training techniques. In addition, we rely mostly on hyperparameter settings from the literature, and optimizing them could lead to better performance. However, this is computationally expensive and there might be unfavorable trade-offs between model performance and resources. Another limitation is our focus mainly on English posts, which provides no evidence of the potential of this approach for other languages. Furthermore, we study five emotion labels guided by psychological research, but several competing representation models for emotion are available. Humans are able to classify a larger number of basic emotions and can also quantify emotions in dimensional spaces, two open areas that can be explored with more nuanced labeling schemes. While self-annotated datasets have the potential to become the new gold standard beyond crowdworkers, the Vent dataset is still produced with visible mood labels rather than with private reports that do not constitute part of online communication. This is still closer to general emotion expression than automatic labeling with emoji or hashtags, but models like LEIA-base or LEIA-large can be substantially improved with psychological methods like experience sampling with validated psychological scales [18].
Broader impact and ethical considerations. This work shares the same ethical concerns as other emotion recognition systems, as highlighted in [51]. Emotion detection models should be used responsibly, and special care should be taken when they are applied in new scenarios, not only because their performance may be lower but also because privacy expectations with respect to emotions may differ. We must note that we have no way of estimating the demographic diversity of Vent users, and it is very likely that the model misses idiosyncrasies of emotional expression in minority groups and in cultures not represented in the dataset. We acknowledge that we only consider one type of model evaluation, focusing on accuracy, while there are several aspects such as bias, fairness, and robustness that should be considered before a model is used in practice, especially when guiding any decision-making.

Conclusion
LEIA is an emotion detection method that achieves a balanced performance across emotions and generalizes across posts, users, and time. It shows satisfactory performance in out-of-domain tests, especially when evaluated against self-annotated texts produced with psychological methods. Beyond our validations, the language models within LEIA can be used as pre-training resources for future applications that employ annotated data in other domains, for example tweets in particular contexts.
We named LEIA after Princess Leia from Star Wars, following the tradition of emotion method names set out by LIWC [52] (pronounced Luke, as in Luke Skywalker) and VADER [53] (as in Darth Vader). These three methods have a similar purpose but very different approaches that align with concurrent developments in text analysis. We have openly published our models on HuggingFace (https://huggingface.co/LEIA), including both the classifiers (LEIA-base and LEIA-large) and the corresponding emotion-aware language models, with the hope that they can be used in future work on emotion detection from text.

Figure 1: Overview of data sources, training steps, models, and evaluation tests.

Figure 2: Results within the Vent dataset in the three test samples. Error bars show bootstrap 95% confidence intervals and may be too small to be visible due to the large sample sizes.

Figure 3: F1 score for each label for the out-of-domain datasets. Error bars represent confidence intervals computed using bootstrapping with replacement. Missing bars correspond to an F1 of 0.

Figure 4: LIME explanations showing the feature importance for the LEIA-base predictions on four examples taken from the enISEAR dataset. The mask token is <mask>, shown with vertical lines in the figure.

Table 1: Mapping of Vent categories to emotion labels.

Table 2: Frequency of occurrence of the labels in the data splits of the Vent dataset after pre-processing. The proportion of the total number of instances within the sample is given in parentheses.

Table 3: Frequency of occurrence of the labels in the test sets of the out-of-domain datasets.