Large scale analysis of gender bias and sexism in song lyrics

We employ Natural Language Processing techniques to analyse 377,808 English song lyrics from the "Two Million Song Database" corpus, focusing on the expression of sexism across five decades (1960-2010) and the measurement of gender biases. Using a sexism classifier, we identify sexist lyrics at a larger scale than previous studies, which used small samples of manually annotated popular songs. Furthermore, we reveal gender biases by measuring associations in word embeddings learned on song lyrics. We find that sexist content increases over time, especially in songs by male artists and in popular songs appearing in Billboard charts. Songs are also shown to contain different language biases depending on the gender of the performer, with male solo artist songs containing more and stronger biases. This is the first large scale analysis of this type, giving insights into language usage in such an influential part of popular culture.


Introduction
Music allows the expression of emotions, feelings and ideas through verbal and auditory language. Multiple messages can be conveyed by means of this language both at an emotional level, mostly through sounds, and at a verbal level through song lyrics [1]. In particular, musical lyrics can contain verbal messages of various nature, through which ideas of individuals and social groups have been spread in society. Song lyrics are also a popular expression of culture, and they reflect or maybe even emphasize many social phenomena. They hold ideas that have occurred over time about different social issues, such as gender discrimination and sexism [2,3,4]. As stated by Davis in [5], song lyrics "are more than mere mirrors of society; they are a potent force in the shaping of it". As such, song lyrics are an important albeit underexplored data source to observe and measure societal changes.
Natural Language Processing (NLP) techniques are an ideal tool for analyzing song lyrics due to their textual nature. These techniques have proven to be effective in improving results for companies, producers, and songwriters in the music industry [6]. However, there is growing concern that NLP models may inadvertently learn and perpetuate existing biases encoded in the training data when trained on large text corpora [7]. This raises the possibility of biases being amplified in language models and systems that use them [8,9].
However, this undesired property of NLP models can also be turned into an opportunity when it comes to the measurement of linguistic biases present in text corpora. Previous studies have demonstrated that word embeddings, which represent words as high-dimensional vectors [10], are capable of capturing linguistic biases related to human biases [11], as well as reflecting stereotypes towards women and ethnic minorities [12,13]. The magnitude of such biases can vary depending on the domain of the text corpus [14,15]. These findings provide evidence of the effectiveness of measuring linguistic biases in text through word embeddings.
NLP techniques have also shown promise in identifying hate speech and other forms of discriminatory language in text corpora [16,17]. These techniques involve models trained on annotated datasets that have been labeled for various concepts related to hate speech, such as sexist content. The resulting models can then be used to automatically detect and quantify for example the presence of sexist content in new text data.
In this study we take advantage of some of these NLP techniques and analyse English song lyrics regarding two aspects receiving increasing attention: gender-related language bias and sexism in language. We consider two main research questions:
RQ1: Is there evidence of sexist content in song lyrics? Can we find differences with respect to artist gender and musical genres across time?
RQ2: Are gender and social role stereotypes reflected in song lyrics? Are there differences in the songs with respect to artist gender?
To answer these research questions, we use Song Lyrics Data from the "Two Million Song Database" of the WASABI project [18] to obtain a large corpus of song lyrics and song related information. We enrich the songs' metadata with Billboard chart performance to get the popularity trend of songs that spans from 1960 to 2010.
Then, we apply and adapt the sexism classifier of Samory et al. [19] to explore the presence of sexism in song lyrics. This approach allows us to analyze the evolution of sexism in song lyrics of the WASABI dataset across time. We find an increase over time of sexist content in popular song lyrics by male solo artists, and that Hip hop and R&B and Soul songs have a higher fraction of sexist lyrics when compared to the other analyzed genres.
Finally, we use different word embedding association tests to detect (gender) language bias in song lyrics [11,20,21]. Differentiating the analysis by gender, we show the significant presence of language biases in the WASABI dataset, in particular for male solo artists, in contrast to the more gender-neutral lyrics of female solo artists.
The main contributions of this work consist in providing (to the best of our knowledge):
• the first large scale and longitudinal exploratory data analysis employing an automatic method to identify song lyrics containing sexist content,
• the first extensive study of language bias in song lyrics segregating by artist gender.

Related work
The analysis of gender stereotypes and sexism expressed in language with NLP methods has been a growing area of research in recent years [22]. These methods are closely entangled with the detection and mitigation (i.e. removal or reduction) of gender bias in NLP models and their output (see for example Sun et al. [23] for a review on mitigation). A promising approach for both detection and mitigation relies on measuring gender bias from the association between word vectors, for example through their cosine similarity. Bolukbasi et al. [12] proposed to first identify a gender direction defined by the subspace spanned by gendered words (e.g., she and he, woman and man). Then, they quantify gender bias as the projection of ideally gender-neutral words (e.g., job and profession names) onto the gender subspace.
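The projection idea can be illustrated with a minimal sketch. It is a simplification, not the procedure of [12]: here the gender direction is assumed to be the normalized average of difference vectors between gendered word pairs, and the toy two-dimensional embeddings and the function names are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gender_direction(pairs, emb):
    """Average the normalized (female - male) difference vectors.

    Assumption: a simple stand-in for a subspace spanned by gendered words.
    """
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for female, male in pairs:
        diff = unit([a - b for a, b in zip(emb[female], emb[male])])
        total = [t + d for t, d in zip(total, diff)]
    return unit(total)


def gender_projection(word, direction, emb):
    """Cosine between a word vector and the gender direction."""
    return sum(a * b for a, b in zip(unit(emb[word]), direction))


# Toy embeddings (hypothetical values, for illustration only).
emb = {
    "she": [1.0, 1.0], "he": [-1.0, 1.0],
    "woman": [0.8, 1.0], "man": [-0.8, 1.0],
    "nurse": [0.5, 1.0], "engineer": [-0.5, 1.0], "table": [0.0, 1.0],
}
direction = gender_direction([("she", "he"), ("woman", "man")], emb)
```

In this toy setting a positive projection points towards the female end of the direction and a negative one towards the male end; a word such as "table" projects to zero.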
This idea, combined with averaging methods and hypothesis testing, later led to the development of the Word Embedding Association Test (WEAT) [11]. The significance and sensitivity of WEAT have been analyzed by [24], and it has been successfully employed for example in [20] to quantify gender stereotypes in language corpora, adding a single category version (SC-WEAT) to the analysis. WEAT has later been extended further to SWEAT [21] to compare the relative polarization of two corpora (like male and female authored lyrics in our case). We will exploit WEAT, SC-WEAT and SWEAT here and describe them in more detail in Section 3.2. Besides word embeddings, similar approaches have been used to measure gender bias in large language models [15,25].
Part of our work furthermore builds on the results of a sexism classifier developed by Samory et al. [19]. Since the exact definition of misogyny and sexism may be under discussion [26], the authors of [19] took into account different dimensions of sexism to increase model validity and furthermore improved the model reliability by including adversarial examples. Other noteworthy approaches to sexism detection use support vector machines, sequence-to-sequence models and a FastText classifier [27], employ a BERT-based architecture to detect misogyny and aggression simultaneously on social media [28], or compare different models for identifying misogyny across languages and domains on Twitter data [29]. We refer to the following reviews for a more comprehensive overview of automatic misogyny and hate speech detection [16,17].
In relation to our research questions, most of the works on gender stereotypes and sexism in song lyrics have targeted popular songs using manual content analysis, which allows studying fine-grained constructs related to sexism such as objectification and sexualization [30,31]. These studies analysed popular songs from the 1960s to the 2000s, showing that sexualization and mentions of sexual desire increased dramatically only after the 1990s, while mentions of love decreased. This is interpreted as a signal of the increase of sexism in songs because lust in the absence of love is likely to objectify the object of the desire [30]. Other works consider both the differences between male and female artists and the gender of the target being objectified. In [32], the authors analyze the same time period considered in our work, finding female artists more likely to sing about love, while men are more likely to objectify others (both men and women), with a stronger emphasis towards women. Flynn et al. [33] confirmed these findings, adding that women were more likely than men to objectify themselves. Moreover, Rap, Hip hop and R&B are the genres with the most objectification. These works rely on different aspects of the concept of sexism, which can be coded through manual content analysis. Although our methodology cannot distinguish between such nuances, it allows us to analyse song lyrics at a larger scale.
Here we are more interested in studies using large datasets like the one from [34], which analyses bias in half a million song lyrics using WEAT scores [11]. This study segregates its results neither by gender nor along the temporal dimension, and finds that bias in songs is strongest in relation to gender stereotypes and career paths. However, gender biases in relation to Math vs Arts and Science vs Arts are also found, meaning that both math and science words are more closely related to male terms while female terms are more closely related to art. All these biases are similar (albeit a bit smaller) to what can be observed in a large internet crawl of texts [11], thus indicating that the biases present in song lyrics mirror the biases that exist in society. Recently, a study performed in parallel to ours found that in song lyrics men are more likely than women to be associated with traits depicting them as competent, even though this bias becomes weaker going forward in time [35]. The authors of this work, albeit incorporating a temporal dimension, used a simpler, non-standardized metric of gender bias and measured only a single trait, whereas our analysis investigates several traits and uses different association tests that have also been used in other published works [11,20,21].
Song lyrics have also been used for song mood [36] and sentiment [37] classification, or together with audio features for genre classification [38] and song popularity [39]. The performance in the latter two tasks with lyrics alone has been explored in [40].

Data collection and filtering
The WASABI database is a knowledge base that includes data about 77K artists, 200K albums, and more than two million songs [18]. We queried the database for all solo artists and groups having published more than 10 songs, and collected all their English song lyrics published between 1960 and 2009. The database contains information about solo artists and band members, including their gender. We assigned the gender label "male" ("female") to bands composed of only male (female) members, and "mixed" to bands with both male and female members. We discarded solo artists without gender information, as well as groups with at least one member without it. The database does not include other gender identities, except for 9 artists whose gender is labeled as "Other", accounting for a total of 44 songs. We thus decided to consider artists' gender as binary, although we acknowledge that this limitation prevents us from taking into account other gender identities. Furthermore, we filtered out all the songs whose lyrics or publication year are unavailable, and those whose lyrics are shorter than 10 words or composed of fewer than 4 lines. After removing duplicate lyrics (details in Appendix B), our final dataset consists of 377,808 song lyrics: 244,146 are performed by 7,131 solo artists and the remaining 133,662 by 4,294 groups. Finally, we also retrieved the song genres, which were not available for 46,482 songs (12.3%). The number of unique genres was reduced by replacing them with their corresponding top-level genres.
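The labeling and filtering rules above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, any gender value other than "male" or "female" is treated as missing, and line and word counts are computed on whitespace, which the text does not specify.

```python
def group_gender(member_genders):
    """Label a band 'male', 'female' or 'mixed'; None means discard.

    Assumption: any value other than 'male'/'female' counts as missing.
    """
    if not member_genders:
        return None
    if any(g not in ("male", "female") for g in member_genders):
        return None  # at least one member without usable gender information
    kinds = set(member_genders)
    return kinds.pop() if len(kinds) == 1 else "mixed"


def keep_song(lyrics, year, min_words=10, min_lines=4):
    """Apply the availability, period and length filters to one song."""
    if not lyrics or year is None:
        return False  # lyrics or publication year unavailable
    if not 1960 <= year <= 2009:
        return False  # outside the collected period
    lines = [ln for ln in lyrics.splitlines() if ln.strip()]
    return len(lyrics.split()) >= min_words and len(lines) >= min_lines
```

For instance, `group_gender(["male", "female"])` yields "mixed", and a two-line lyric is rejected by `keep_song` regardless of its year.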
To identify popular songs in the WASABI dataset, we retrieved the Billboard Hot 100 weekly charts, composed of all-genre weekly song charts released since 1958 [41]. We furthermore extracted the top 10 songs from all the charts. We were able to map 10,798 out of 24,180 unique songs (44.7%) of the Billboard, and 2,608 out of 4,348 unique songs (60.0%) of the Billboard top 10 charts by using approximate string matching to match both the title and the artist of songs (details in Appendix C). In the following, when referring to songs in Billboard or in Billboard top 10 charts, we refer to these sets of songs that we mapped to the main WASABI dataset.
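As an illustration of how chart entries could be mapped by approximate string matching on both title and artist, the sketch below uses the standard-library `difflib.SequenceMatcher`. The actual procedure is described in Appendix C; the normalization step, the 0.9 threshold and the function names here are assumptions made for this example.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop punctuation and collapse whitespace."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_song(chart_entry, candidates, threshold=0.9):
    """Return the best (title, artist) candidate, or None if no match.

    Both the title and the artist similarity must reach the threshold
    (hypothetical value; not taken from the paper).
    """
    title, artist = chart_entry
    best, best_score = None, 0.0
    for cand_title, cand_artist in candidates:
        t = similarity(title, cand_title)
        a = similarity(artist, cand_artist)
        if t >= threshold and a >= threshold and t + a > best_score:
            best, best_score = (cand_title, cand_artist), t + a
    return best


candidates = [("Billie Jean", "Michael Jackson"), ("Beat It", "Michael Jackson")]
```

Requiring both fields to match guards against cover versions and common titles shared by different artists.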
The WASABI database provides the language of song lyrics through the "language detect" field.
We used the taxonomy of the Wikipedia page of popular genres https://en.wikipedia.org/wiki/List_of_music_genres_and_styles.

Sexism detection
We fine-tuned a BERT classifier to detect sexist passages in texts using the dataset and the code provided by Samory et al. [19]. The dataset contains texts manually labeled following a scale-based codebook that operationalizes the concept of sexism. The codebook consists of four non-overlapping categories resulting from the review of psychological scales measuring sexism and related constructs. These include a broad range of sexist content that can be present in song lyrics such as behavioral expectations, stereotypes & comparisons, endorsements of inequality, and denying inequality & rejection of feminism. In addition to sexist content, the codebook also includes categories that take into account sexist phrasing, in order to distinguish sexist content from texts containing only uncivil content or common profanity. This helps the classifier perform well on out-of-domain data, as does the inclusion of human-written adversarial examples. We refer to Samory et al. [19] for further details on the codebook.
Since the original dataset consists of short texts, we adapted the representation of the input to classify song lyrics. In detail, lyrics are divided into batches composed of groups of four lines, each of them sharing two lines with the previous and following group. A batch is considered to contain sexist content if the model outputs a probability higher than a certain threshold. Whenever a song lyric contains at least one batch identified as sexist, we propagate the sexist label to the song. To verify the performance on song lyrics, we evaluated the classifier on an external dataset [42] of 190 lyrics, 40.5% of which are considered to contain sexist content. For the optimal classification threshold of 0.725, the classifier achieves a precision of 0.73. We discuss in Appendix D the results obtained for different classification thresholds and requiring a larger number of batches labeled as sexist to propagate the label to the whole song, showing that our main findings are not affected by these choices.
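The batching and label propagation described above can be sketched as follows. The classifier itself is abstracted as a `score_fn` callable returning a probability, which is a placeholder and not the fine-tuned BERT model; how trailing lines are handled when the line count is odd is an assumption of this sketch.

```python
def four_line_windows(lyrics, size=4, step=2):
    """Split lyrics into overlapping groups of `size` lines.

    With step=2, each group shares two lines with its neighbours.
    """
    lines = [ln.strip() for ln in lyrics.splitlines() if ln.strip()]
    if len(lines) <= size:
        return ["\n".join(lines)] if lines else []
    windows = [
        "\n".join(lines[start:start + size])
        for start in range(0, len(lines) - size + 1, step)
    ]
    # Assumption: add a final window so that a trailing line is not lost.
    if (len(lines) - size) % step:
        windows.append("\n".join(lines[-size:]))
    return windows


def song_is_sexist(lyrics, score_fn, threshold=0.725, min_batches=1):
    """Propagate the label to the song if enough batches exceed the threshold."""
    flagged = sum(score_fn(w) > threshold for w in four_line_windows(lyrics))
    return flagged >= min_batches
```

Setting `min_batches` above 1 corresponds to the stricter propagation rule examined in Appendix D.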

Language biases from word embeddings
Words can be represented by vectors by leveraging the co-occurrence of nearby words in such a way that vectors that are close to each other represent words sharing a similar semantic meaning [43]. This representation of words is able to encode word analogies (e.g., "King" is to "Man" as "Queen" is to "Woman"), but it may also encode language biases and stereotypical analogies that are present in the training corpus [11]. For example, from a higher co-occurrence of pleasant words next to music instrument than to weapon words, we can likely expect word embeddings of unpleasant words to be closer to weapon than musical instrument words. We measured language biases in three different subsets of the lyrics corpus: all lyrics of solo artists, lyrics of male solo artists, and lyrics of female solo artists. This allowed us to compare the biases that are present in song lyrics performed by male and female artists. In the following, we list the methodologies we used to measure these kinds of associations in word embeddings.
https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1
Mapped to WASABI using the same approach used for the Billboard charts, described in Appendix C.
WEAT: The Word Embedding Association Test (WEAT) measures the association between two sets of target words X and Y (e.g., music instruments and weapons), and two sets of attribute words A and B (e.g., pleasant and unpleasant words). The WEAT tests the null hypothesis that there is no difference between the two sets of target words in terms of their relative similarity to the two sets of attribute words [11]. The similarity between two words a and b is defined as the cosine similarity cos(a, b) between their representations in a vector space. The effect size is defined as:
$$\text{effect size} = \frac{\operatorname{mean}_{x \in X}\, s(x, A, B) - \operatorname{mean}_{y \in Y}\, s(y, A, B)}{\text{pooled std dev}_{w \in X \cup Y}\; s(w, A, B)}$$
where
$$s(w, A, B) = \operatorname{mean}_{a \in A} \cos(w, a) - \operatorname{mean}_{b \in B} \cos(w, b) \qquad (1)$$
and pooled std dev is the pooled standard deviation [44]. From this definition, positive values of the effect size indicate that the words in X are more similar to the words in A than in B, and the words in Y are more similar to the words in B than in A. Recalling the above example, we would expect a positive effect size. The significance is computed in the following way. The test statistic is defined as:
$$s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B)$$
Given 1,000 random partitions of equal size for the union of the two target sets, $\{(X_i, Y_i)\}_{i=1,\dots,1000}$, the one-sided P value is computed as:
$$P = \Pr_i\left[\, s(X_i, Y_i, A, B) > s(X, Y, A, B) \,\right]$$
We found different implementations of the WEAT, which compute the effect size using either the standard deviation or the pooled standard deviation (i.e. a weighted average of standard deviations). We opted for the latter solution because it reflects the definition of the effect size in terms of Cohen's D defined in Caliskan et al. [11].
SC-WEAT: Since the WEAT condenses into one effect size the relative associations of attribute sets against two target sets, the similarity against one single target set is lost. We used the Single Category Word Embedding Association Test (SC-WEAT) to disentangle this relative association [20]. The SC-WEAT measures the association between one set of target words W and two sets of attribute words A and B. If the SC-WEAT score is positive (negative), words in W are more (less) similar to the words in A compared to B. The procedure to compute the significance is analogous to the one described for the WEAT. Note that the WEAT and the SC-WEAT are not redundant because the former measures the relative association of X and Y with A and B, whereas the SC-WEAT evaluates the association of one target set at a time, and their combination can add information to the language bias analysis.
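A minimal, standard-library sketch of the WEAT computation may help to make the procedure concrete. It follows the definitions of Caliskan et al. [11] with a pooled standard deviation in the denominator; the exact pooling formula (sample variances weighted by degrees of freedom) and all function names are assumptions of this sketch, not the implementation used for the reported results.

```python
import math
import random
from statistics import mean, variance


def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def s_word(w, A, B):
    """Association of one word vector with attribute sets A and B."""
    return mean(cos(w, a) for a in A) - mean(cos(w, b) for b in B)


def pooled_std(xs, ys):
    """Pooled standard deviation (assumed form, weighted by n - 1)."""
    num = (len(xs) - 1) * variance(xs) + (len(ys) - 1) * variance(ys)
    return math.sqrt(num / (len(xs) + len(ys) - 2))


def weat_effect_size(X, Y, A, B):
    sx = [s_word(x, A, B) for x in X]
    sy = [s_word(y, A, B) for y in Y]
    return (mean(sx) - mean(sy)) / pooled_std(sx, sy)


def weat_statistic(X, Y, A, B):
    return sum(s_word(x, A, B) for x in X) - sum(s_word(y, A, B) for y in Y)


def weat_p_value(X, Y, A, B, n_perm=1000, seed=0):
    """One-sided P value from random equal-size partitions of X and Y."""
    rng = random.Random(seed)
    observed = weat_statistic(X, Y, A, B)
    union = list(X) + list(Y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(union)
        Xi, Yi = union[:len(X)], union[len(X):]
        hits += weat_statistic(Xi, Yi, A, B) > observed
    return hits / n_perm


# Toy 2-d vectors (hypothetical): X leans towards A, Y towards B.
A = [(1.0, 0.0), (0.9, 0.1)]
B = [(0.0, 1.0), (0.1, 0.9)]
X = [(1.0, 0.1), (0.9, 0.2)]
Y = [(0.1, 1.0), (0.2, 0.9)]
```

Swapping the two target sets flips the sign of the effect size, which is a quick sanity check for any implementation.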
SWEAT: The methods described above measure associations between word sets whose embeddings are learnt from a single text corpus. To compare the association between two corpora, we employed the Sliced Word Embedding Association Test (SWEAT) [21]. Given the word vectors learnt on two corpora $D_1$ and $D_2$, the SWEAT score is defined as:
$$S(W, A, B, D_1, D_2) = \sum_{w \in W} s(w, A, B, D_1) - \sum_{w \in W} s(w, A, B, D_2)$$
where s(w, A, B, D) has the same form as in Equation 1, but word vectors are taken from the distributional representation D. A positive SWEAT score indicates that the word vectors of a target set in $D_1$ are relatively more associated with the words in A than in B, and the word vectors in $D_2$ are relatively more associated with the words in B than in A. Again, significance is computed in the same way as for the WEAT. Here, $D_1$ and $D_2$ refer to the male and female corpus respectively.
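The comparison of two corpora can be sketched in the same spirit. The code assumes the SWEAT score is the difference between the summed word-level associations computed in each corpus, following [21]; embeddings are represented as plain dictionaries from word to vector, and the toy values and names are hypothetical.

```python
import math
from statistics import mean


def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def s_word(w, A, B, D):
    """Association of word w with attribute words A and B inside corpus D."""
    return mean(cos(D[w], D[a]) for a in A) - mean(cos(D[w], D[b]) for b in B)


def sweat_score(W, A, B, D1, D2):
    """Summed association in D1 minus summed association in D2."""
    return sum(s_word(w, A, B, D1) for w in W) - sum(s_word(w, A, B, D2) for w in W)


# Toy embeddings: "guitar" sits next to "bad" in D1 and next to "good" in D2.
D1 = {"guitar": (0.0, 1.0), "good": (1.0, 0.0), "bad": (0.0, 1.0)}
D2 = {"guitar": (1.0, 0.0), "good": (1.0, 0.0), "bad": (0.0, 1.0)}
score = sweat_score(["guitar"], ["good"], ["bad"], D1, D2)
```

In this toy case the score is negative, because the target word is closer to the B attribute in the first corpus and closer to the A attribute in the second.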
Selection of attribute and target words: Table D.7 in the Appendix reports the sets of target and attribute words used, borrowed from [11] and [14]. We slightly modified the word sets to account for rare or missing words. For this, we first removed words occurring less than five times in at least one of the three corpora. Then, whenever pairs of attribute (target) sets contain a different number of words, we removed the least frequent words from the larger set until the two sets have the same size. In doing this, we defined the frequency of a word as the minimum of its frequencies in the three corpora under analysis. For the list of male and female proper names, we used the most frequent proper names in the lyrics corpora.
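The word set adjustment described above can be sketched as follows, assuming each corpus is summarized by a dictionary of word counts. The function names and the toy counts are hypothetical, and ties between equally frequent words are broken by the original order, which the text does not specify.

```python
def min_frequency(word, corpus_counts):
    """Frequency of a word, defined as its minimum count over all corpora."""
    return min(counts.get(word, 0) for counts in corpus_counts)


def balance_word_sets(set_a, set_b, corpus_counts, min_count=5):
    """Drop rare words, then trim the larger set to the size of the smaller."""
    def frequent(words):
        return [w for w in words if min_frequency(w, corpus_counts) >= min_count]

    def trim(words, size):
        by_rarity = sorted(words, key=lambda w: min_frequency(w, corpus_counts))
        dropped = set(by_rarity[:len(words) - size])
        return [w for w in words if w not in dropped]

    a, b = frequent(set_a), frequent(set_b)
    size = min(len(a), len(b))
    return trim(a, size), trim(b, size)


# Toy counts for three corpora (hypothetical values).
counts = [
    {"love": 50, "home": 10, "rare": 2, "work": 30, "boss": 7, "job": 6},
    {"love": 40, "home": 12, "rare": 9, "work": 25, "boss": 5, "job": 8},
    {"love": 60, "home": 9, "rare": 9, "work": 20, "boss": 6, "job": 6},
]
family, career = balance_word_sets(
    ["love", "home", "rare"], ["work", "boss", "job"], counts
)
```

Here "rare" is removed because it occurs fewer than five times in one corpus, and "boss" is then trimmed as the least frequent word of the larger set.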
Learning word vectors: We used the Gensim implementation of Word2Vec [45] to learn word embeddings from scratch separately for the three corpora of song lyrics. To account for the potential variability that arises from different initializations of the word vectors and to ensure the robustness of our results, we conducted the association tests on word embeddings obtained from five independent runs of the Word2Vec algorithm. We then report the score obtained from the first run and deem the result to be statistically significant only if all five runs yielded a significant score at a given level.
Table 1 reports the number of songs for each combination of artist type and gender. Male solo artists are more represented than female ones, with more than twice as many songs. This imbalance is exacerbated for groups, where female and mixed groups account for 1.7% and 6.2% of the songs. This observation holds for the songs in Billboard and Billboard top 10 charts as well, where the gender inequality in English song charts is well documented in the literature [46,47].
The distribution of publication years of the songs in our dataset is shown in the left plot of Figure 1. The total number of songs per year is not uniformly distributed across time. There are about 3,000 songs per year from the 1970s until the 1990s with a subsequent steady increase reaching around 20,000 songs per year in the late 2000s. However, the relative fraction of songs performed by male and female artists remains relatively constant, as can be seen in the right plot of Figure 1. Indeed, male artists (in groups or as solo artists) contribute a fraction of songs between 70% and 80% over the whole time span. Figure 2 shows the increase of the fraction of songs by female solo artists across time in all the subsets of the dataset.
In particular, the Billboard and Billboard top 10 charts show a sharp increase between 1980 and 1990 where the fraction of female artist songs increases by around 10%, reaching 40% and 50% respectively. Then, during the 2000s this fraction decreased again to values of the mid-80s. We refer to Appendix A for the fraction of songs segregated by genre across time.
Table 2 shows the proportion of songs containing sexist passages identified for each artist type and gender. The classifier identifies 89,462 (23.7%) lyrics in the WASABI dataset as containing sexist passages. Artists and groups have different fractions of sexist lyrics: 30% of male solo artist songs are classified as sexist, compared to 16% to 20% of songs of the other groups of artists. The fraction of sexist songs in Billboard and Billboard top 10 charts is at least 10% higher than the one of the whole WASABI dataset. This observation suggests that popular songs are more likely to contain sexist content than an average song, even more so if they reach a top position in the charts. This finding is consistent for different classification thresholds (details in Appendix D). Furthermore, a larger fraction of male solo artist songs contains sexist content when compared to those of female solo artists.
We observe that male and female artists display different trends in their percentage of songs classified as sexist over time, but male solo artists consistently publish relatively more sexist songs in the whole time span. Figure 3 shows the fraction of songs containing sexist passages for each artist type and gender. The share of female solo artist songs in the WASABI dataset with sexist lyrics remains relatively constant over time at around 20%, whereas the share of sexist songs by male solo artists becomes larger going forward in time. This trend is even stronger for popular songs, for which we can identify a sharp increase starting around the mid-80s.
At the end of the time span under analysis, more than 60% of male solo artist songs on Billboard are found to contain sexist passages. The share of female solo artist songs on Billboard with sexist lyrics increases over time as well, but during the 2000s this share is 20% lower when compared to male solo artists. In contrast, group artists, regardless of members' gender, do not display a notable increase of sexist songs. Figure 3 shows only the male group artists to avoid overloading the figure, but the female and mixed group artists display a similar trend. Interestingly, male solo artists perform relatively more sexist songs than male group artists. This might signal a difference between male solo artists and groups in terms of subjects and themes covered in their songs. As observed previously, the figure also shows that Billboard charts, compared to the WASABI dataset, attracted more songs with sexist content starting from the 1990s.
After this aggregate analysis we now inspect how different genres contribute to the observed trends. In particular, we focus on the four most popular genres in our dataset: Pop, Rock, Hip hop, and R&B and soul. The changes of their frequency over time in the dataset are depicted in Figure A.5 in the Appendix, while Figure 4 shows the fraction of songs containing sexist passages of male and female solo artists as well as male only groups for Pop, Rock, Hip hop, and R&B and soul. A majority of the Hip hop songs are classified as sexist, regardless of artists' gender. Pop and Rock songs do not show a substantial difference between male and female solo artists, although male artists perform relatively more sexist songs than female artists for the Pop genre. In both cases, these fractions are lower than for the Hip hop genre. Lastly, R&B and soul songs display a steady increase of sexist lyrics over time, with male solo artists having a higher fraction of sexist songs than female artists.

Language bias in song lyrics
Here we investigate the corpora of song lyrics for potential language bias, showing the results of the different word embedding association tests defined in Section 3.2: SC-WEAT (measures the difference between two sets of attribute words and a target set), WEAT (measures the relative similarity of the two target sets with two sets of attribute words), and SWEAT (compares the differences for a target set in two corpora, i.e. male and female corpora).
The corresponding results are shown in Table 3 with positive WEAT effect sizes (last column) for all the tests, indicating that words in the target set X (e.g., male names) are more similar to words in A (e.g., career) than B (e.g., family), and words in the target set Y (e.g., female names) are more similar to words in B than A (e.g., family vs career). However, the magnitudes and significance levels of these paired associations are different. We now explain the detailed results grouped by the tested sets of words.
Pleasant vs. Unpleasant words: These associations are considered to be universally accepted stereotypes for humans [48]: pleasant words to flowers and musical instruments, and unpleasant words to insects and weapons. Experiments involving human subjects also found these associations [49]. We thus expect to find the same in song lyrics as well, regardless of the gender of the artist. Indeed, we found all these associations to be statistically significant for both pairs of target sets in all the three corpora (positive and statistically significant WEAT effect size). The fact that the trained word vectors capture these trivial stereotypes makes us confident that the word vectors have learned meaningful associations (see Appendix E for a more detailed discussion of these results). Besides that, we observe a negative and statistically significant SWEAT score for Musical instruments, meaning that Musical instrument words are closer to Unpleasant than Pleasant words in the male corpus while being closer to Pleasant than Unpleasant words in the female corpus. This might hint at a deeper difference in the way male and female artists refer to music or the tools (instruments) they use to produce it.
Note that the genre information is missing in 382 (3.5%) and 65 (2.5%) songs in Billboard and Billboard top 10 charts respectively.
Again we omitted female and mixed groups from this analysis.
Career vs. Family: We find that Career words are significantly closer to Male than to Female proper names in the male lyrics (SC-WEAT X = 2.76), as well as that Female names are significantly closer to Family than to Career words (SC-WEAT Y = −1.04). This is not significant if we consider nouns and pronouns instead of names and may be explained by song lyrics being more gender-stereotypical when songs address a specific person mentioned by their proper name.
Interestingly, we do not observe the same bias in female lyrics, where the SC-WEAT score for Female names is close to 0 and we therefore observe a significant difference between the two corpora (SWEAT = −0.92 for Female names). Furthermore, the SC-WEAT score for Career is much smaller (0.45), but this does not translate into a significant numerical difference between male and female corpora. However, a significant difference is also found for Male names (SWEAT = −0.68). Although the corresponding SC-WEAT scores are not significant, we observe a positive value (0.94) in female lyrics and a negative score in male lyrics (−0.24). In other words, Male names are relatively closer to Career than to Family terms in female artist lyrics, more so than in lyrics from male artists.
We thus observe that male solo artists associate Female names more closely with Family, and female solo artists associate Male names more closely with Career, while the opposite is not the case. There are no biases in female lyrics of Female names towards Family terms and in male lyrics of Male names towards Career terms (if anything, it would be in the opposite direction, i.e. a negative SC-WEAT score of −0.24). However, when considering Career words alone there is a bias in the male corpus of Career terms towards Male names (SC-WEAT X = 2.76). So Career terms are more likely to be mentioned together with Male names, while if Male names are mentioned there, it is slightly more likely to be in relation to Family words.
MatSci vs. Arts: When analysing a potential gender bias of Mathematics or Science vs. Arts terms we observe that female lyrics show a positive and significant WEAT score (1.29), and a negative and significant single category association for Arts, indicating a higher association of Arts words with Female than with Male nouns and pronouns (SC-WEAT Y = −1.26). None of the other tests is statistically significant for the male corpus or for the corpus of all lyrics, and neither is the comparison of the male and female corpora through SWEAT.
Intelligence vs. Appearance and Strength vs. Weakness: Male solo artists associate Strength words significantly more to Male than Female nouns and pronouns (WEAT = 1.02) with a positive and significant single category association for Strength words (SC-WEAT X = 1.24). Although we also find a similar significant relation for female artists lyrics (SC-WEAT X = 0.70), the direct comparison with the male corpus gives a positive and significant SWEAT score for Strength (0.32) terms. This indicates the association of Strength words to Male terms is stronger in lyrics of male than female solo artists.
Finally, we find a positive and slightly significant WEAT score (0.71) comparing the Intelligence and Appearance words in male lyrics. Although the single category scores are not significant by themselves, their opposite sign shows a signal for male artists to associate Intelligence words with Male terms and Weakness words with Female terms. This is different in female lyrics where the corresponding SC-WEAT scores are close to zero indicating no bias in this regard and translates into a slightly significant result for the SWEAT score for Intelligence (0.18).
To summarise, this analysis shows that songs of male solo artists contain more and often stronger gender biases than those of female solo artists, which are closer to gender neutrality. The only exception to this observation is a bias in female lyrics relating females more closely to art terms. Male artist songs emphasize men as stronger and focused on career, at the expense of women, who are depicted as less strong and closer to family-related terms.

Discussion
This section discusses our findings in relation to the research questions and findings of previous works, some potential shortcomings in our study design, and potential paths for future research.
Sexism (RQ1): The first research question investigates the presence of sexist content in song lyrics and whether there is any difference between artist gender, type, and musical genre. We found that almost 25% of song lyrics express sexist content, but this share is not uniformly distributed across artist gender and type. Male solo artists have the highest share of sexist lyrics in all three subsets of the WASABI dataset (all songs, songs in the Billboard charts, and songs in top 10 positions). Interestingly, this share is much lower for groups composed of only male members. This difference is worth mentioning because it might indicate a deeper divergence between songs performed by male solo and group artists that, to our knowledge, has not been reported in the literature previously. Another observation is that the relative number of lyrics containing sexist passages is higher in Billboard charts than in the whole WASABI dataset, independently of the gender and type of the artist, a trend that becomes stronger over time. Other works made similar observations, even though they used different definitions or focused on specific aspects related to sexual content. For example, some studies report an increase in sexual content and objectification of women in popular songs during the last five decades [32,30]. In a study limited to the year 2009, songs in the top 10 charts were shown to be more likely to contain sexual content than songs from the same album by the same artist that did not enter the top 10 [50].
Table 3: Results of the association tests performed on three different lyrics corpora: male solo artists, female solo artists and their union (all). Target sets X or Y resulting in statistically significant SWEAT scores (comparing the male with the female corpus) are highlighted in bold with their corresponding score (* p < 0.10, ** p < 0.05).
In line with previous studies, we also found a higher fraction of songs containing sexist content among Hip hop and R&B and soul songs [51,33], although other genres are not exempt from this trend [52]. Hip hop and R&B and soul were also found to display an increasing trend in associating competence more frequently with men than with women [35].
Although we obtained aggregated results in line with findings from previous works, we emphasize that these results rely on a sexism classifier trained on out-of-domain data, which might have led to poor generalization when applied to song lyrics. However, the classifier was trained on a combination of diverse corpora of sexist texts, which strengthens its ability to generalize to out-of-domain data [19]. We have also validated our model on a dataset of sexist lyrics, showing good performance and robust results across different classification thresholds. We believe that this exploratory data analysis can encourage further work aimed at identifying sexism in song lyrics, which may require extensive manual labeling in order to train a classifier on the specific domain.
Language bias (RQ2): The second research question focuses on the differences in language bias between male and female solo artist song lyrics. Our results extend the ones obtained in [34], where all the WEAT scores are positive. We enrich this work in two different ways. First, we measure language biases separately in the subsets of male and female solo artist songs, obtaining all positive WEAT scores as well. Some of these associations (e.g., male with career and female with family) align with the ones observed in a large-scale crawl of the Internet [11], indicating how music reflects societal biases. Second, we enrich the analysis through other association tests, which measure associations against a single target set (SC-WEAT) and between the two corpora (SWEAT). Our results show that the biases affecting these two corpora are different. In particular, male lyrics contain larger gender biases. For instance, Female proper names are more associated with Family than with Career words, while Career words are closer to Male than to Female names. At the same time, Strength words are more associated with Male than with Female terms. The female corpus does not contain these two types of biases. Similarly, other works have found song lyrics of male artists to contain stronger gender biases than those of female artists, in particular depicting men as more competent than women [35].
In both corpora, we find that Strength words are closer to Male than to Female terms, but this association is again stronger in male lyrics. A bias that is only present in female lyrics is that Female terms are closer to Art than to Mathematics and Science words. Interestingly, in addition to a weak association of Intelligence words with Male terms in the male corpus, there is no significant association between Appearance words and gendered terms. This might be related to the observation that objectification does not target women exclusively, but is also fairly common for men, even though not to the same extent [33]. Besides these gender biases, our word embeddings capture expected associations, such as Pleasant with Flower words and Unpleasant with Insects or Weapons, that do not depend on the gender of the artist.
Our results lack an analysis of gender biases across time and genres, as done in Boghrati and Berger [35], and a comparison between songs of solo artists and groups. However, restricting the analysis to these subsets of the dataset would have reduced the training data used to learn word embeddings, thus compromising their quality.
Other limitations concern the dataset. We cannot claim any generalizability of our conclusions because there is no guarantee that WASABI contains a representative sample of English songs. Moreover, Billboard charts add a US bias in the selection of popular songs. The choice of the variables used to stratify the results is another limitation. It is worth noting that we split artists into males and females taking into account only the performer of the song, thus ignoring the gender of the songwriter. In addition, we can consider only two genders, thus neglecting non-binary gender identities, as only binary gender data is available in the WASABI database. It would be interesting to address these gaps in future work, but this will require additional effort for accurate data collection. Despite these limitations, the size of the WASABI database, together with its open access, makes it a relevant object of study by itself, and our contribution may be valuable for future users of this resource.

Conclusion
In this work, we have exploited the WASABI database to describe how sexist content in music varied over five decades, from 1960 to 2010, and to what extent word embeddings learned from the song lyrics corpus contain language biases. The former analysis, stratified by artist gender and type, shows that popular song lyrics of male solo artists become more sexist over the years, while this behavior is less noticeable for the other categories of artists. The genres (among the ones analyzed) that have the highest fraction of songs containing sexist content are Hip hop and R&B and soul, independently of the gender of the artist. Regarding language biases, we find the lyrics of male solo artists to contain more gender bias than those of female solo artists. This is true, for instance, for the stereotype depicting men as stronger and focused on success, and women as closer to family. The former bias is present in female solo artist songs as well, even though the association is weaker than for male solo artists. Our results show different ways to extract meaningful metrics about language usage and bias in song lyrics, as well as how to analyse such an important and heavily consumed expression of popular culture, one that influences how listeners see the world and reflects how artists perceive it.

Appendix A. Genre Popularity
We show here in Figure

Appendix B. Duplicate lyrics detection
The WASABI database was created by collecting the discography of all the artists, including the text of song lyrics when available. This latter information was collected from LyricWiki, an online database of lyrics and encyclopedia curated by users [18]. During this process, the same song lyrics may be collected multiple times (e.g., the same song is present in several albums of the same artist), and it is desirable to reduce the presence of duplicates in the dataset as much as possible. To do that, we used techniques of approximate string matching.
Given a set of song lyrics, we represented each lyric as the set of 3-grams it is composed of, following a bag-of-words approach. Then, we considered two lyrics to be duplicates if the Jaccard index of their 3-gram representations is higher than 0.80. This threshold allowed us to detect groups of songs whose lyrics are identical up to slight variations. We then considered the song with the earliest publication date as the original song and the other songs of the group as its duplicates.
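The duplicate detection described above can be sketched as follows. This is a minimal illustration and not the code used in the study: we assume word-level 3-grams (the text does not specify word or character n-grams), whitespace tokenisation, and single-linkage grouping of all pairs above the threshold. The field names `id`, `year` and `lyrics` are hypothetical.

```python
from itertools import combinations


def word_trigrams(text):
    """Set of word 3-grams of a lyric (assumption: word-level, whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; taken to be 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(songs, threshold=0.80):
    """Map each duplicate song id to the id of the earliest song in its group."""
    grams = {s["id"]: word_trigrams(s["lyrics"]) for s in songs}
    parent = {s["id"]: s["id"] for s in songs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair whose 3-gram sets overlap more than the threshold.
    for a, b in combinations(grams, 2):
        if jaccard(grams[a], grams[b]) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in songs:
        groups.setdefault(find(s["id"]), []).append(s)

    duplicates = {}
    for members in groups.values():
        original = min(members, key=lambda s: s["year"])  # earliest publication
        for s in members:
            if s is not original:
                duplicates[s["id"]] = original["id"]
    return duplicates
```

The pairwise loop is quadratic in the number of songs, so this sketch is only practical for small samples; a corpus of the size considered here would likely need an indexing scheme on top of it.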
In case two duplicate songs were performed by two distinct artists, we refer to the song published later as a cover song. We removed only the duplicate songs from the dataset and kept the cover versions. We identified 82,531 duplicate songs and 7,524 cover songs.

Besides the optimal classification threshold, we also considered the results for classification thresholds at 0.50 and 0.90, where the latter favors precision on the sexist class over recall, and minimises false positives. Indeed, with the 0.90 threshold the precision for the sexist class reaches 0.78, while the macro F1-score is 0.69 and the recall drops to 0.45. At the same time, the performance on the non-sexist class reaches a recall of 0.91 with a precision of 0.71. The corresponding ROC curve is shown in Figure D.6. We also note that the F1-score on the sexist class is stable for other choices of N_B (N_B = 2: F1-score at 0.69 for an optimal threshold of 0.575; N_B = 3: F1-score at 0.65 for an optimal threshold of 0.525).
Now we come back to discuss the results of the sexism classifier on the WASABI dataset. Figure D.7 shows the distribution of sexist batches for songs classified as sexist according to the optimal classification threshold and N_B = 1. Almost half of the lyrics have more than 2 sexist batches, while 25% and 29% of song lyrics have 1 and 2 sexist batches respectively.
To verify that the conclusions we drew do not depend on the optimal classification threshold, we report the fraction of sexist lyrics for all three thresholds in Table D.5. Here, we can observe the same trend for all thresholds, showing that (i) male solo artists perform more sexist songs compared to the other groups, and (ii) Billboard charts contain a higher fraction of sexist songs than the whole WASABI dataset, regardless of the gender and type of the performer. Table D.6 shows that these considerations hold for N_B = 2 and N_B = 3 as well.
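The song-level decision rule discussed above can be illustrated with a short sketch. It assumes that the classifier has already produced one score in [0, 1] per 4-line batch, and that a batch counts as sexist when its score is greater than or equal to the threshold (whether the comparison is strict is our assumption). The function names are hypothetical; the default of 0.725 is the threshold reported in the caption of Table D.6.

```python
def count_sexist_batches(batch_scores, threshold=0.725):
    """Number of 4-line batches whose classifier score reaches the threshold."""
    return sum(score >= threshold for score in batch_scores)


def is_sexist_song(batch_scores, threshold=0.725, n_b=1):
    """A song is flagged when at least n_b of its batches are classified as sexist."""
    return count_sexist_batches(batch_scores, threshold) >= n_b


def sexist_share(songs_scores, threshold=0.725, n_b=1):
    """Fraction of songs flagged as sexist in a collection of per-song batch scores."""
    flagged = sum(is_sexist_song(s, threshold, n_b) for s in songs_scores)
    return flagged / len(songs_scores)
```

Raising either the threshold or N_B can only reduce the number of flagged songs, which is why the comparison across settings is a useful robustness check.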
Appendix E. Word embedding association tests
Table D.7 shows the lists of words corresponding to each set of target and attribute words, and we report the references from which each word set was borrowed in the right column. We applied slight modifications to some of those sets to take into account rare or missing words. First, we removed words if they occur fewer than 5 times in one of the three corpora. Then, whenever pairs of attribute (target) sets contain a different number of words, we remove the least frequent words from the larger set until the two sets have the same size. In doing this, we define the frequency of a word as the minimum of its frequencies among the three corpora under analysis. We used a different procedure to select proper names. We downloaded the yearly counts of names of newborn babies from 1879 until 2020, and searched for them in song lyrics. The male and female name word sets are thus composed of the most frequent names in the lyrics corpora.
Learning word vectors: We used the Gensim implementation of Word2Vec [45] to learn word embeddings with window length 5, embedding dimension 300, and 40 training steps. Words occurring fewer than 5 times were discarded.
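The pruning of the word sets described above can be sketched as follows. This is an illustration under assumptions: each corpus is represented by a hypothetical dictionary of word counts, and ties between equally frequent words are broken by their original order in the list, which the text does not specify.

```python
def min_frequency(word, corpus_counts):
    """Frequency of a word, defined as its minimum count across all corpora."""
    return min(counts.get(word, 0) for counts in corpus_counts)


def filter_rare(words, corpus_counts, min_count=5):
    """Drop words occurring fewer than min_count times in at least one corpus."""
    return [w for w in words if min_frequency(w, corpus_counts) >= min_count]


def balance_sets(set_a, set_b, corpus_counts, min_count=5):
    """Filter rare words, then trim the larger set to the size of the smaller one."""
    a = filter_rare(set_a, corpus_counts, min_count)
    b = filter_rare(set_b, corpus_counts, min_count)
    size = min(len(a), len(b))

    def most_frequent(words):
        # Keep the `size` most frequent words, i.e. remove the least frequent ones.
        ranked = sorted(words, key=lambda w: min_frequency(w, corpus_counts), reverse=True)
        return ranked[:size]

    return most_frequent(a), most_frequent(b)
```

Using the minimum count across corpora means that a word is kept only if it is well represented in every corpus, so the same word sets can be used for the male, female and combined embeddings.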
https://www.ssa.gov/oact/babynames/limits.html (accessed 12/01/2022)
Table D.6: Percentage of sexist songs identified for each artist type and gender. We report how these percentages change for different values of N_B (i.e. the minimum number of 4-line batches classified as sexist to consider the whole song to contain sexist content). The classification threshold is 0.725 and values in bold are used in the main text. Percentages correspond to the fraction of sexist lyrics within the artist type and gender.
Detailed results on Pleasant vs. Unpleasant words: Results are shown in Table 3. The coupled association measured by the WEAT returns positive and statistically significant scores for all three corpora and both pairs of target sets, namely Flowers/Insects and Musical instruments/Weapons. The single category associations (SC-WEAT) of the former are significant as well, indicating that, if kept separately, Flower words are more associated with Pleasant than with Unpleasant words and vice versa for Insect words. For the latter pair, only the female corpus shows a significant single category association between Musical instrument and Pleasant words. This is reflected in the comparison between the male and female corpus through the SWEAT, which returns a negative and statistically significant score, i.e., Musical instrument words are closer to Unpleasant than to Pleasant words in the male corpus while being closer to Pleasant than to Unpleasant words in the female corpus. The fact that the learned word embeddings encode these associations makes us confident of the quality of the learned word vectors.
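To make the coupled and single category associations more concrete, the sketch below computes effect sizes in the form commonly used for WEAT-style tests: the difference in mean cosine similarity of a target word to two attribute sets, standardised over the target words. It is an assumption that this matches the exact definitions and normalisation used for the scores in Table 3; in particular, the use of the sample standard deviation is our choice, the significance tests are omitted, and `vec` is a hypothetical mapping from words to embedding vectors.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, A, B, vec):
    """Mean similarity of word w to attribute set A minus its mean similarity to B."""
    sim_a = mean(cosine(vec[w], vec[a]) for a in A)
    sim_b = mean(cosine(vec[w], vec[b]) for b in B)
    return sim_a - sim_b


def weat_effect_size(X, Y, A, B, vec):
    """Coupled association: positive when X is closer to A and Y is closer to B."""
    sx = [association(x, A, B, vec) for x in X]
    sy = [association(y, A, B, vec) for y in Y]
    return (mean(sx) - mean(sy)) / stdev(sx + sy)


def sc_weat_effect_size(X, A, B, vec):
    """Single category association of one target set X with A versus B."""
    sx = [association(x, A, B, vec) for x in X]
    return mean(sx) / stdev(sx)
```

With this sign convention, a positive score means that the target set X lies closer to the first attribute set A, and swapping X and Y flips the sign of the coupled score.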
Career: corporation, professional, career, office, business [11,14]
Family: family, marriage, wedding, children, home [11,14]
Female attributes: girl, hers, her, aunt, daughter, sister, female, mother, she, grandmother, woman [11,14]
Male attributes: brother, grandfather, his, son, father, man, male, uncle, he, him, boy [11,14]
Female names: kim, rose, mary, eve, kelly, jane, lisa, juliet, jean, annie, trina, sarah, sally, betty, lucy, taylor, bonnie, marie, jenny, dolly, julia, anna, jill, angie *
Male names: john, jack, joe, johnny, james, david, paul, billy, jimmy, simon, mark, romeo, bill, peter, bob, lee, jim, bobby, tom, jackson, sam, michael, charlie, adam *
Flowers: lilac, bluebell, violet, crocus, buttercup, iris, rose, tulip, daisy, marigold, daffodil, orchid, carnation, magnolia, lily, poppy, clover