Support the underground: characteristics of beyond-mainstream music listeners

Music recommender systems have become an integral part of music streaming services such as Spotify and Last.fm to assist users navigating the extensive music collections offered by them. However, while music listeners interested in mainstream music are traditionally served well by music recommender systems, users interested in music beyond the mainstream (i.e., non-popular music) rarely receive relevant recommendations. In this paper, we study the characteristics of beyond-mainstream music and music listeners and analyze to what extent these characteristics impact the quality of music recommendations provided. To this end, we create a novel dataset consisting of Last.fm listening histories of several thousand beyond-mainstream music listeners, which we enrich with additional metadata describing music tracks and music listeners. Our analysis of this dataset shows four subgroups within the group of beyond-mainstream music listeners that differ not only with respect to their preferred music but also with respect to their demographic characteristics. Furthermore, we evaluate the quality of the music recommendations that four different recommendation algorithms provide to these subgroups and find significant differences between the groups. Specifically, our results show a positive correlation between a subgroup’s openness towards music listened to by members of other subgroups and recommendation accuracy. We believe that our findings provide valuable insights for developing improved user models and recommendation approaches to better serve beyond-mainstream music listeners.


Introduction
In the digital era, users have access to continually increasing amounts of music via music streaming services such as Spotify and Last.fm. Music recommender systems have become an essential means to help users deal with content and choice overload as they assist users in searching, sorting, and filtering these extensive music collections. Simultaneously, both music listeners and artists benefit from the employed segmentation and personalization approaches that are typically leveraged in music recommendation approaches [1]. As a result, users with different preferences and needs can be targeted in various ways with the goal that all users are presented the information and content that they need or prefer. This also means that current recommendation techniques should serve all users equally well, independent of their inclination to popular content.
Figure 1 Recommendation accuracy measured by the mean absolute error (MAE) of a non-negative matrix factorization-based approach (i.e., NMF [10]) and a neighborhood-based approach (i.e., UserKNN [11]) for mainstream and beyond-mainstream user groups in Last.fm. We see that beyond-mainstream users receive a substantially lower recommendation quality (i.e., higher MAE) compared to mainstream music listeners. Thus, for recommender systems, it is harder to provide high-quality recommendations to beyond-mainstream than to mainstream listeners

Present work
In the paper at hand, we focus on music consumers who listen to music beyond the mainstream (i.e., users who listen to non-popular music) in the music streaming platform Last.fm. 1 As highlighted in Fig. 1, current recommender systems do not work well for consumers of beyond-mainstream music (see Sect. 3.5 for details on this analysis). In contrast, music consumers who listen to popular music seem to get better recommendations. This finding is not fundamentally new. In fact, it is a widely-known problem that recommender systems (and those based on collaborative filtering, in particular) are prone to popularity bias, so that long-tail items (i.e., items with few user interactions) have little chance of being recommended. This phenomenon is also present across different application domains such as movies [2] or music [3].
Our previous work [4] has shown that users interested in beyond-mainstream music tend to have larger user profile sizes (i.e., individual users have listened to a higher number of distinct artists) compared to users interested in mainstream music. The observation that beyond-mainstream music listeners produce a substantial amount of digital footprints motivates the need to improve the recommendation quality for this group. However, although related research has already studied the long-tail recommendation problem (e.g., [5][6][7][8]; see Sect. 2 for a more detailed discussion of related work), it is still a fundamental challenge to understand and identify the characteristics of beyond-mainstream music and beyond-mainstream music listeners. Additionally, related work [9] has shown that the group-specific concepts of openness and diversity influence recommendation quality, where openness is defined as across-group diversity (i.e., do users of one group listen to the music of other groups?) and diversity is defined as within-group variability (i.e., how dissimilar is the music listened to by users within groups?). Thus, we are also interested in the correlation between the characteristics of beyond-mainstream music and music listeners with openness and diversity patterns as well as with recommendation quality. Concretely, our work is guided by the following research question:
RQ: What are the characteristics of beyond-mainstream music tracks and music listeners, and how do these characteristics correlate with openness and diversity patterns as well as with recommendation quality?
To address this research question, we create, provide, and analyze a novel dataset called LFM-BeyMS, which contains complete listening histories of more than 2000 beyond-mainstream music listeners mined from the Last.fm music streaming platform. In addition, our dataset is enriched with acoustic features and genres of music tracks. Using this enriched dataset, we identify different types of beyond-mainstream music via unsupervised clustering applied to the acoustic features of music tracks. We then characterize the resulting music clusters using music genres. Next, we assign beyond-mainstream users to the clusters to further divide the beyond-mainstream users into subgroups. Finally, we study how the characteristics of these beyond-mainstream subgroups correlate with openness and diversity patterns as well as with recommendation quality measured through prediction accuracy.

Findings and contributions
We identify four clusters of beyond-mainstream music in our dataset: (i) C_folk, music with high acousticness such as "folk", (ii) C_hard, high energy music such as "hardrock", (iii) C_ambi, music with high acousticness and high instrumentalness such as "ambient", and (iv) C_elec, music with high energy and high instrumentalness such as "electronica". By assigning users to these clusters, we get four distinct subgroups of beyond-mainstream music listeners: (i) U_folk, (ii) U_hard, (iii) U_ambi, and (iv) U_elec. We also find that these groups differ considerably with respect to the accuracy of recommendations they receive, where group U_ambi gets significantly better recommendations than U_hard. When relating our results to openness and diversity patterns of the subgroups, we find that U_ambi is the most open but least diverse group, while U_hard is the least open but most diverse group. This is in line with related research [9], which has shown that openness is more strongly correlated with accurate recommendations than diversity. This means that users are more likely to accept recommendations from different groups (i.e., openness) rather than varied within a group (i.e., diversity).
In summary, our contributions are:
• We identify more than 2000 beyond-mainstream music listeners on the Last.fm platform and enrich their listening profiles with acoustic features and genres of music tracks listened to (Sects. 3.1-3.4).
• We validate related research by showing that beyond-mainstream music listeners receive a significantly lower recommendation accuracy than mainstream music listeners (Sect. 3.5).
• We identify four clusters of beyond-mainstream music using unsupervised clustering and characterize them with respect to acoustic features and music genres (Sect. 4.1).
• We define four subgroups of beyond-mainstream music listeners by assigning users to the music clusters and discuss the relationship between openness, diversity, and recommendation quality of these groups (Sect. 4.2).
• To foster reproducibility of our research, we make available our novel LFM-BeyMS dataset via Zenodo 2 and the entire Python-based implementation of our analyses via GitHub. 3
We believe that our findings provide useful insights for creating user models and recommendation algorithms that better serve beyond-mainstream music listeners.

Related work
We identify three strands of research that are relevant to our work: (i) modeling of music preferences, (ii) long-tail recommendations, and (iii) popularity bias in music recommender systems.
Modeling of music preferences A multitude of factors [12] influences musical tastes and musical preferences of users. Characteristics of music listeners and music preferences have been studied in various research domains [13], ranging from music sociology [14] and psychology [15] to music information retrieval and music recommender systems [1]. Studies on music listening behavior showed that personal traits and long-term music preferences are correlated as people tend to prefer music styles that align with their personalities [16,17]. Furthermore, related work found a relationship between music and motivation [18], music and emotion [19][20][21][22] or both personality and emotion [23]. Openness, a personality trait from the Five Factor Model [24], has also been shown to positively influence a user's preference for music recommendations [9]. Specifically, the authors of [9] found that people tend to prefer recommendations from different kinds of music (i.e., openness) rather than varied within a specific kind of music (i.e., diversity). Others showed that familiarity has a positive influence on music preferences [25,26] and that music preferences may change over time [27]. Another strand of research on modeling users' music preferences leverages content features, e.g., acoustic features. It has been shown that the distribution of acoustic features of a user's preferred genre substantially influences the user's choice of music within other genres [28]. Also, acoustic features have been utilized to model users' preferences under different contextual conditions, in order to refine recommendation quality [29]. Based on tracks' acoustic features, the authors of [30] identified several types of music, and subsequently modeled each user by linearly combining the acoustic features of the music types. In contrast to these works, we focus on using acoustic features of music tracks for modeling and clustering beyond-mainstream music. 
Additionally, we link these beyond-mainstream music clusters to music genres and users in our Last.fm data sample.
Long-tail recommendations Related research [6,7] has found that individual music consumption is biased towards popular music and that usage data for less popular music is scarce. Due to this scarcity problem, items with no or few ratings (i.e., long-tail items) have little chance of being recommended [5]. As a consequence, users who particularly favor items with few ratings or interactions are less likely to be recommended the items that they like [3]. That is problematic because many users, from time to time, prefer niche music [8]. Such users are not well served as a result of their preference for less popular items. This has been attributed to popularity bias, which corresponds to an over-representation of popular items in the recommendation lists [31][32][33]. Abdollahpouri et al. [2] studied popularity bias in a dataset of movies (i.e., the MovieLens 1M dataset [34]) from the user perspective. Their study showed that commonly used recommendation techniques tend to deliver worse recommendations to users who prefer less popular movies. In our work [4], we found evidence for popularity bias in a Last.fm dataset and showed that traditional personalized recommendation algorithms such as collaborative filtering deliver worse recommendations for consumers of niche music. In the present work, we aim to gain a deeper understanding of the behavior and preferences of this beyond-mainstream user group. Thus, in contrast to existing works in long-tail recommendations, we focus on the user rather than the item perspective.
Popularity bias in music recommender systems Music recommender systems [1] are crucial tools in online streaming services such as Last.fm, Pandora, or Spotify. They help users find music that is tailored to their preferences. Music recommender systems are based on user models derived from users' listening behavior, user properties such as personality (e.g., [35]), content features of music, or hybrid combinations of these, e.g., [36][37][38][39]. As discussed earlier, due to insufficient amounts of usage data for less popular items, many music recommendation algorithms do not provide useful recommendations for consumers of less popular and niche items. As a remedy, in [40], an approach is suggested that divides music consumers into experts and novices according to the long-tail distribution in their playlists. These experts are then converted to nodes with bidirectional links connecting all the experts. These links are created to perform link analysis on the graph and to assign fine-grained weights to songs. The presented approach helps add music from the long tail to the recommendation list. In our previous research [41,42], we have used a framework [43] that employs insights from human memory theory to design a music recommendation algorithm that provides more accurate recommendations than collaborative filtering-based approaches for three groups of users, i.e., low-mainstream, medium-mainstream, and high-mainstream users. While the awareness of popularity bias in music recommender systems increases (e.g., [44]), the characteristics of music consumers whose preferences lie beyond popular, mainstream music are still not well understood. In the present work, we shed light on the characteristics of such beyond-mainstream music consumers and relate them to openness and diversity patterns as well as recommendation quality. With this, we aim to provide useful insights for creating novel music recommendation models that mitigate popularity bias.

Preliminaries
We investigate the characteristics of beyond-mainstream music listeners in a dataset mined from Last.fm, a popular music streaming platform. We characterize the tracks in our dataset with acoustic features. In addition, we compare the recommendation accuracy of beyond-mainstream music listeners with that of mainstream music listeners to motivate our subsequent analysis of the characteristics of beyond-mainstream music listeners.

Acoustic music features
For our analyses, we characterize music tracks using acoustic features that describe the content of a given track. Following the lines of, e.g., [30,45,46,47], we rely on acoustic features provided by the Spotify API as a compact characterization of tracks. 4 The following eight features are extracted from the audio signal of a track:
Danceability captures how suitable a track is for dancing and is computed based "on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity".
Energy describes the perceived intensity and activity of a track and is based on the dynamic range, perceived loudness, timbre, onset rate, and general entropy of a track.
Speechiness captures the presence of spoken words in a track. High speechiness values indicate a high degree of spoken words (e.g., an audiobook), whereas medium values indicate tracks with both music and speech (e.g., rap music). Low values represent typical music tracks.
Acousticness measures the probability that the given track only contains acoustic instruments.
Instrumentalness quantifies the probability that a track contains no vocals, i.e., the track is instrumental.
Tempo measures the rate of the track's beat in beats per minute.
Valence describes the "emotional positiveness" conveyed by a track (i.e., cheerful and euphoric tracks reach high valence values).
Liveness measures the probability that a track was performed live, i.e., whether an audience is present in the recording.

Enriched dataset of music listening events
To study characteristics of beyond-mainstream users and their listening preferences, we create a novel dataset called LFM-BeyMS that contains the required information for such analyses. We base our work on a dataset gathered from the Last.fm music platform, which we considerably enrich with the music tracks' acoustic features (see Sect. 3.1) [48]. Additionally, we combine this data with mainstreaminess information of Last.fm users (see Sect. 3.3) as well as music genre information to identify beyond-mainstream listeners and music (see Sect. 3.4). An overview of our new LFM-BeyMS dataset and its data sources is depicted in Fig. 2. As shown, the starting point for our new dataset is the publicly available LFM-1b dataset 5 of music listening information shared by users of the online music platform Last.fm [49]. LFM-1b contains listening histories of 120,322 users; their listening records (or "listening events") have been created between January 2005 and August 2014. They sum up to over 1.1 billion listening events (LEs), where each LE is described by an (anonymous) user identifier, the artist name, the album name, the track name, and the timestamp of the listening event. Also, the LFM-1b dataset includes demographics of some users (i.e., country, age, and gender).
To enrich the LFM-1b dataset to suit our task, we utilize our previously created CultMRS music recommendation dataset [50]. This dataset contains 55,191 users, who have listened to a total of 26,022,625 distinct tracks, captured by a total of 807,890,921 LEs [48].
To further enrich the dataset with music acoustic features, we gather the acoustic features described in Sect. 3.1 for the tracks remaining in the dataset after the filtering described above. To this end, we rely on the Spotify API to gather content-based acoustic features for each track. Particularly, we search tracks using the (track, artist, album) triples extracted from the LFM-1b dataset using the Spotify search API 6 to gather the Spotify track URI of each track by using all three parts of the triple in a conjunctive query. In total, this allowed gathering 4,326,809 Spotify URIs. For the remainder of the tracks, we were not able to retrieve a URI. We attribute this to two factors: either the searched track is not provided by Spotify or the track, artist, and album information cannot be matched to a Spotify track unambiguously. Subsequently, we use the obtained track URI to query the acoustic features API, 7 which returns the acoustic features of a given track (cf. Sect. 3.1). In a subsequent cleaning step, we remove all tracks for which the Spotify API did not provide the full set of acoustic features.
Figure 2 LFM-BeyMS contains BeyMS, i.e., data to study the beyond-mainstream user group, and Recommendation, i.e., data to conduct recommendation experiments of beyond-mainstream and mainstream music listeners
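The conjunctive search over all three parts of a triple can be sketched as follows. This is a minimal sketch and not the authors' implementation: it assumes that the query is expressed with the `track:`, `artist:`, and `album:` field filters of the Spotify search endpoint, and the helper names `build_search_query` and `build_search_url` are hypothetical. The sketch only assembles the request URL; authentication and the HTTP call itself are left out.

```python
from urllib.parse import urlencode

# Public Spotify Web API search endpoint.
SEARCH_ENDPOINT = "https://api.spotify.com/v1/search"


def build_search_query(track, artist, album):
    """Combine all three parts of a triple into one conjunctive query string.

    Assumption: each part is passed as a field filter, so a result has to
    match the track name, the artist name, and the album name together.
    """
    fields = (("track", track), ("artist", artist), ("album", album))
    return " ".join(f"{name}:{value.strip()}" for name, value in fields)


def build_search_url(track, artist, album):
    """Return the URL of a track search request for one triple."""
    params = {
        "q": build_search_query(track, artist, album),
        "type": "track",  # only track results are of interest here
        "limit": 1,       # hypothetical choice: keep the best match only
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


query = build_search_query("Doxy", "Miles Davis", "Bags' Groove")
url = build_search_url("Doxy", "Miles Davis", "Bags' Groove")
```

A triple for which such a request returns no item, or no unambiguous item, would remain without a URI, which corresponds to the two failure cases named above.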
That procedure provides us with a set of 3,478,399 unique tracks and their acoustic features. Within the LFM-1b dataset, this amounts to 13.36% of the distinct tracks. Overall, these account for as much as 48.89% of all listening events (i.e., the tracks listened to by users) of the LFM-1b dataset. The resulting dataset, now enriched by acoustic music descriptors, comprises a total of approximately 394 million listening events of 55,149 users. In Table 1 (column "CultMRS"), we provide further descriptive statistics of the CultMRS dataset. We refine this dataset to create our new LFM-BeyMS dataset (column "BeyMS" in Table 1), which consists of BeyMS, i.e., data to study the characteristics of beyond-mainstream music listeners, and Recommendation, i.e., data to conduct recommendation experiments of beyond-mainstream and mainstream music listeners.

Identifying beyond-mainstream music listeners
To identify beyond-mainstream music listeners, for each user, we compute a mainstreaminess score, which is generally defined as the overlap between a user's individual listening history and the aggregated listening history of all Last.fm users in the dataset. In this vein, the mainstreaminess score reflects a user's inclination to music listened to by the Last.fm mainstream listeners (i.e., the "average" Last.fm listener in the dataset). In [51], several measures of user mainstreaminess are defined. Out of these, we choose the M-global-R-APC definition since it yielded good results in context-based music recommendation experiments for the LFM-1b dataset, as evidenced in [51]. The M-global-R-APC measure approximates a user's mainstreaminess score by computing Kendall's τ [52] rank correlation between the user's vector of artist play counts and the global vector of artist play counts (aggregated over all users in the dataset). This definition also explains the name of the measure, where "M" refers to mainstreaminess, "global" indicates the global perspective, "R" stands for rank correlation, and "APC" refers to artist play counts. Next, we describe how we identify our beyond-mainstream users via filtering the users by the number of listening events (see Fig. 3 and Sect. 3.3.1) and by mainstreaminess scores (see Fig. 4 and Sect. 3.3.2).
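To make the M-global-R-APC idea concrete, the following minimal sketch computes Kendall's τ between a user's artist play counts and the global artist play counts. It is an illustration under stated assumptions, not the implementation of [51]: it uses the tie-aware τ-b variant, it compares the two vectors over all artists of the global vector, and it treats artists the user never listened to as play count 0. The function names are hypothetical, and the quadratic pair loop is only suitable for small examples.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation of two equally long sequences."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue            # tied in both sequences
            if dx == 0:
                ties_x += 1         # tied only in x
            elif dy == 0:
                ties_y += 1         # tied only in y
            elif dx * dy > 0:
                concordant += 1     # pair ordered the same way in x and y
            else:
                discordant += 1     # pair ordered in opposite ways
    pairs = concordant + discordant
    denominator = sqrt((pairs + ties_x) * (pairs + ties_y))
    return (concordant - discordant) / denominator if denominator else 0.0


def mainstreaminess(user_apc, global_apc):
    """Rank correlation between user and global artist play counts (APC)."""
    artists = sorted(global_apc)
    user_vector = [user_apc.get(artist, 0) for artist in artists]  # 0 = never played
    global_vector = [global_apc[artist] for artist in artists]
    return kendall_tau_b(user_vector, global_vector)


# Toy data: four artists, ordered from globally popular to globally niche.
global_apc = {"a": 100, "b": 50, "c": 10, "d": 1}
mainstream_user = {"a": 9, "b": 5, "c": 2, "d": 1}
niche_user = {"a": 1, "b": 2, "c": 5, "d": 9}
```

In this toy example, the first user ranks the artists exactly like the global population and the second user ranks them in reverse, so the two scores sit at the opposite ends of the range of τ.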

Filtering users by the number of listening events
For our study, we select the users so that listeners of different levels of "listening activity" are equally represented. We conduct a Gaussian kernel density estimation (KDE) [53] on the distribution of listening events over users to estimate the continuous probability density function (PDF) [54]. However, KDE estimates the PDF via discrete bins and hence, it is necessary to approximate the gradient via the principle of finite differences. The gradient of the PDF helps us identify regions of increasing or decreasing probability. Figure 3 shows that two large subsets of users exist that exhibit either very few or an abundance of listening events. For our analysis, we consider only users who fall into neither of these two subsets. On the one hand, we exclude users with too little data available for studying their listening behavior; on the other hand, we exclude so-called power listeners who might bias our analyses. Furthermore, such users with a very high number of listening events are often radio stations, which do not contribute reliable data to our investigations.
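The combination of a Gaussian KDE evaluated on a discrete grid and a finite-difference gradient can be sketched as follows. This is a minimal sketch under assumptions that the text does not specify: a fixed, hand-picked bandwidth, an equally spaced evaluation grid, and one-sided differences at the two grid borders. All names and the toy numbers are hypothetical.

```python
from math import exp, pi, sqrt


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE with fixed bandwidth."""
    norm = 1.0 / (len(samples) * bandwidth * sqrt(2.0 * pi))

    def pdf(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return pdf


def finite_difference_gradient(values, step):
    """Approximate the derivative of equally spaced function values.

    Interior points use second-order accurate central differences; the two
    border points fall back to one-sided differences.
    """
    n = len(values)
    gradient = [0.0] * n
    gradient[0] = (values[1] - values[0]) / step
    gradient[-1] = (values[-1] - values[-2]) / step
    for i in range(1, n - 1):
        gradient[i] = (values[i + 1] - values[i - 1]) / (2.0 * step)
    return gradient


# Toy data: number of listening events of seven hypothetical users.
listening_events = [120, 150, 900, 950, 1000, 1100, 5000]
pdf = gaussian_kde(listening_events, bandwidth=200.0)

step = 50.0
grid = [i * step for i in range(121)]           # 0, 50, ..., 6000
density = [pdf(x) for x in grid]
gradient = finite_difference_gradient(density, step)
```

Grid points with a positive gradient lie in regions of increasing density and grid points with a negative gradient in regions of decreasing density, which is the information used to delimit the user subsets.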
Hence, we define lower and upper bounds regarding the number of users' listening events to include in our study, such that the rate of change in terms of the number of listening events is minimal and stable within these boundaries. That requires the gradient of the region within the lower and upper bound to be near zero (i.e., $\pm 10^{-6}$). By computing the second-order accurate central differences [55], we obtain an approximation of the [...]
Figure 4 illustrates the mainstreaminess distribution of the 12,814 users that we have extracted based on the number of listening events. Here, mainstreaminess is defined according to the M-global-R-APC definition taken from [51] (explained in Sect. 3.3). By setting an appropriate upper bound, we aim to exclude mainstream music listeners. In other words, we aim to set the upper bound to the beginning of the distribution's bulk, which is motivated as follows: Firstly, the first inflection point (i.e., maximal gradient) of a Gaussian distribution is found at E[X]-std(X), where E[X] is the expectation, and std(X) is the standard deviation of the Gaussian random variable X. Secondly, the first inflection point of a Gaussian distribution corresponds to the 15.9th percentile. By setting the mainstreaminess threshold to this point, we intend to omit the majority of users and hence, only consider the 15.9% of users with the lowest mainstreaminess scores. Utilizing this upper bound on the mainstreaminess score, we obtain a set of 2074 beyond-mainstream users. Furthermore, the Gaussian assumption is supported by the observation that the 2074 beyond-mainstream users represent 16.19% of the 12,814 users, which is close to the expected 15.9%. In the remainder of this paper, we refer to this set of beyond-mainstream music listeners as BeyMS.
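The mainstreaminess threshold at E[X]-std(X) can be illustrated with a few lines of Python. This is a minimal sketch with hypothetical names and toy scores; it assumes the population standard deviation, which the text does not specify.

```python
from statistics import NormalDist, mean, pstdev


def beyms_threshold(scores):
    """Upper bound on mainstreaminess: mean minus one standard deviation."""
    values = list(scores.values())
    return mean(values) - pstdev(values)  # assumption: population std


def select_beyms(scores):
    """Return the users whose mainstreaminess lies below the threshold."""
    threshold = beyms_threshold(scores)
    return {user for user, score in scores.items() if score < threshold}


# Share of a Gaussian that lies below its first inflection point.
gaussian_share = NormalDist().cdf(-1.0)

# Toy mainstreaminess scores of five hypothetical users.
toy_scores = {"u1": 0.1, "u2": 0.5, "u3": 0.5, "u4": 0.5, "u5": 0.9}
toy_beyms = select_beyms(toy_scores)
```

For an exactly Gaussian distribution, the selected share equals `gaussian_share`, i.e., roughly 15.9%; the share observed on real data deviates from that value to the extent that the distribution is not Gaussian.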

Identifying beyond-mainstream music
We aim to study beyond-mainstream listeners in terms of their music taste. We characterize music via its acoustic features, as described in Sect. 3.1, and also investigate genres as an alternative way to describe a music track via conventional categories. As the LFM-1b dataset does not contain genre annotations of tracks and the Spotify API only provides genres on artist level, 8 we leverage the tags assigned to each track by Last.fm users to identify genre annotations. To obtain these tags, we use the respective Last.fm API endpoint. 9 After having fetched the tags for each track, we de-capitalize them and remove all non-alpha-numeric characters. Since not all tags used by Last.fm users correspond to actual music genres (e.g., the "seenlive" tag is used to indicate that a user has seen an artist performing this track live), we use a fine-grained music genre taxonomy consisting of 3034 genres that are also utilized by Spotify, which we gather from the EveryNoise service (2019-07-24). 10 Specifically, for each track listened to by any of our BeyMS users, we remove all tags that are not part of the EveryNoise genre taxonomy, using a case-insensitive matching approach.
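The tag cleaning and the matching against the genre taxonomy can be sketched as follows. This is a minimal sketch, assuming that "non-alpha-numeric" refers to ASCII letters and digits and that the taxonomy entries are normalized in the same way as the tags before matching; the function names and the toy taxonomy are hypothetical.

```python
import re


def normalize_tag(tag):
    """Lowercase a tag and drop every character that is not a-z or 0-9."""
    return re.sub(r"[^a-z0-9]", "", tag.lower())


def genre_tags(tags, taxonomy):
    """Keep only those tags of a track that match a genre of the taxonomy."""
    known_genres = {normalize_tag(genre) for genre in taxonomy}
    return [tag for tag in map(normalize_tag, tags) if tag in known_genres]


# Toy data: user tags of one track and a two-genre taxonomy.
track_tags = ["Hard Rock", "seen live", "Indie-Rock"]
toy_taxonomy = ["hard rock", "indie rock"]
kept = genre_tags(track_tags, toy_taxonomy)
```

In this toy example, the "seen live" tag is normalized to "seenlive" and dropped because it does not correspond to a genre of the taxonomy.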
We note that Last.fm users tend to assign very general genre tags to a large number of tracks, such as "pop" or "rock". To remove these coarse-grained genres and to identify fine-grained beyond-mainstream music genres, we calculate the inverse document frequency (IDF) [56] metric of our genre-track distribution by treating genres as terms and tracks as documents, i.e., $\mathrm{IDF}(g) = \log_{10} \frac{|T|}{|\{t \in T : g \in G_t\}|}$. More precisely, the IDF-score of genre g is determined by relating the number of all tracks |T| to the number of tracks annotated with genre g, where $G_t$ is the set of genres assigned to track t. This way, a coarse-grained genre receives a small IDF-score, while a fine-grained genre receives a high IDF-score. Figure 5 shows the IDF-score distribution of the top-100 genres in ascending order (i.e., from coarse-grained to fine-grained). Here, we identify two groups of genres, where the first group consists of 6 genres with small IDF-scores, and the second group consists of 94 genres with high IDF-scores. The visual inspection of Fig. 5 indicates that the lower bound of 0.90 serves as a discriminant between these two groups of coarse-grained and fine-grained genres. Consequently, we remove the 6 coarse-grained genres (i.e., "rock", "pop", "electronic", "metal", "alternativerock", "indierock") from the genre assignments of our tracks, which leads to 157,444 out of 799,659 tracks listened to by BeyMS users with at least one remaining genre. In total, these tracks are annotated with 1418 unique genre identifiers.
Figure 5 IDF-score distribution of the top-100 genres in ascending order (i.e., from coarse-grained to fine-grained). The 6 coarse-grained genres below the lower bound of 0.90 are removed from the genre assignments, i.e., "rock", "pop", "electronic", "metal", "alternativerock", "indierock"
8 https://developer.spotify.com/documentation/web-api/reference-beta/#endpoint-get-an-artist
9 https://www.last.fm/api/show/track.getTopTags
10 http://everynoise.com/
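The IDF-based removal of coarse-grained genres can be illustrated with a short sketch. It treats genres as terms and tracks as documents and uses the lower bound of 0.90 from the text; the function names and the toy data are hypothetical, and the sketch makes no claim to mirror the published implementation.

```python
from math import log10


def idf_scores(track_genres):
    """Map each genre g to log10(|T| / number of tracks annotated with g)."""
    n_tracks = len(track_genres)
    track_counts = {}
    for genres in track_genres.values():
        for genre in set(genres):
            track_counts[genre] = track_counts.get(genre, 0) + 1
    return {genre: log10(n_tracks / count) for genre, count in track_counts.items()}


def drop_coarse_genres(track_genres, lower_bound=0.90):
    """Remove genres with an IDF-score below the bound; drop emptied tracks."""
    idf = idf_scores(track_genres)
    filtered = {
        track: {genre for genre in genres if idf[genre] >= lower_bound}
        for track, genres in track_genres.items()
    }
    return {track: genres for track, genres in filtered.items() if genres}


# Toy data: ten tracks tagged "rock", one of them also tagged "darkambient".
toy_tracks = {f"t{i}": {"rock"} for i in range(10)}
toy_tracks["t0"] = {"rock", "darkambient"}

toy_idf = idf_scores(toy_tracks)
toy_filtered = drop_coarse_genres(toy_tracks)
```

In this toy example, "rock" is assigned to every track and therefore receives the smallest possible IDF-score, whereas "darkambient" is assigned to a single track and survives the filter.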
We are aware of the fact that our track filtering procedure leads to incomplete listening profiles of users. Since we rely on genres to describe beyond-mainstream music, these filtering steps are necessary for our study. To ensure that the BeyMS users' reduced listening profiles are still representative of their music preferences, we further investigate the consequences of the filtering procedure. Here, we find that a user's listening history (i.e., the entirety of a user's listening events) is reduced to 61% on average. However, we also find that there are only 62 of the 2074 BeyMS users for whom the listening history is reduced to less than 20%. For these users most affected by the filtering, we compare the acoustic feature distributions of their listened tracks before and after the filtering steps, and find that filtering only marginally affects the acoustic feature distributions (i.e., average change in mean = 0.0098 ± 0.0148). This means that the acoustic feature distribution contained in a user's profile is highly robust against the filtering. The statistics of BeyMS are summarized in column "BeyMS" in Table 1.

Recommendations for beyond-mainstream music listeners
In order to compare the accuracy of recommendations received by the users of our BeyMS group and by mainstream users, we construct a dataset consisting of BeyMS's listening events and the listening events of an equally-sized group of mainstream users. To this end, we define the MS user group as 2074 (i.e., the size of our BeyMS group) randomly-chosen users with a mainstreaminess score that is higher than the upper bound for low mainstreaminess, identified in Fig. 4. Furthermore, the MS users are also in between the lower and upper bounds for listening events identified in Fig. 3. As shown in [...]
We use the Python-based open-source recommendation library Surprise 11 to compute and evaluate recommendations. One advantage of using Surprise is that it provides built-in recommendation algorithms as well as a standardized evaluation pipeline, which enhances the reproducibility of our research. Since Surprise is focused on rating prediction, we formulate our music recommendation scenario also as a rating prediction problem, in which we predict the preference of a target user u for a target track t. As done in [57], we model the preference of u for t by scaling the play count (i.e., number of listening events) of t by u to a range of [1, 1000] using min-max normalization. We perform this normalization on the individual user level to ensure that all users share the same preference value ranges. Thus, with this method, we ensure that each user's most listened track has a preference value of 1000, while their least listened track has a preference value of 1. To ensure that this min-max normalization procedure does not disrupt the play count distribution of our users, we compare the original play count distribution with the normalized distribution and find that both distributions are strongly right-skewed. Specifically, we find very similar distributions for large amounts of our play count data.
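The per-user min-max normalization of play counts to the range [1, 1000] can be sketched as follows. This is a minimal sketch with hypothetical names; how users whose tracks all have the same play count are handled is not stated in the text, so mapping them to the lower bound is an assumption made only for this example.

```python
def scale_play_counts(play_counts, low=1.0, high=1000.0):
    """Min-max scale the play counts of ONE user to the range [low, high].

    play_counts maps each track of the user to its number of listening events.
    """
    minimum = min(play_counts.values())
    maximum = max(play_counts.values())
    if maximum == minimum:
        # Assumption: no spread in play counts, so every track gets `low`.
        return {track: low for track in play_counts}
    scale = (high - low) / (maximum - minimum)
    return {
        track: low + (count - minimum) * scale
        for track, count in play_counts.items()
    }


# Toy data: one user with three tracks.
toy_user = {"track_a": 1, "track_b": 51, "track_c": 101}
toy_preferences = scale_play_counts(toy_user)
```

In this toy example, the least listened track receives a preference value of 1, the most listened track a value of 1000, and the track in between a value of 500.5.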
We utilize a selection of Surprise's built-in recommendation methods consisting of one baseline approach (i.e., UserItemAvg), two neighborhood-based approaches (i.e., UserKNN and UserKNNAvg), and one matrix factorization-based approach (i.e., NMF). Specifically, UserItemAvg predicts the average play count in the dataset by also accounting for deviations of u and t, for example, if a user u tends to have more listening events than the average Last.fm user [58]. UserKNN [11] is a user-based collaborative filtering approach and is calculated using k = 40 nearest neighbors and the cosine similarity metric, which are the default settings of Surprise. UserKNNAvg is an extension of UserKNN [11] that also takes the average rating of target user u into account. Finally, NMF, i.e., non-negative matrix factorization [10], is calculated using 15 latent factors, which is the default parameter in the Surprise library. As shown in our previous work [4], NMF is also capable of recommending non-popular items from the long tail and should therefore especially be of interest for our beyond-mainstream recommendation setting.
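For orientation, the four prediction rules can be written in their common textbook form. The equations below are a sketch under the assumption that the four methods correspond to the standard baseline, user-based k-nearest-neighbor, mean-centered k-nearest-neighbor, and non-negative matrix factorization formulations; the notation is introduced here for illustration only. $\mu$ denotes the global mean preference value, $b_u$ and $b_t$ the deviations of user $u$ and track $t$ from it, $\mu_u$ the mean preference value of user $u$, $N_t^k(u)$ the $k = 40$ most similar users of $u$ who have listened to $t$, $\mathrm{sim}$ the cosine similarity, and $p_u$, $q_t$ the non-negative latent factor vectors of user and track.

```latex
\begin{align}
\hat{r}_{ut}^{\text{UserItemAvg}} &= \mu + b_u + b_t \\
\hat{r}_{ut}^{\text{UserKNN}} &= \frac{\sum_{v \in N_t^k(u)} \mathrm{sim}(u,v)\, r_{vt}}{\sum_{v \in N_t^k(u)} \mathrm{sim}(u,v)} \\
\hat{r}_{ut}^{\text{UserKNNAvg}} &= \mu_u + \frac{\sum_{v \in N_t^k(u)} \mathrm{sim}(u,v)\,(r_{vt} - \mu_v)}{\sum_{v \in N_t^k(u)} \mathrm{sim}(u,v)} \\
\hat{r}_{ut}^{\text{NMF}} &= q_t^{\top} p_u, \qquad p_u, q_t \in \mathbb{R}_{\geq 0}^{15}
\end{align}
```

Written this way, UserKNNAvg differs from UserKNN only in that the preference values are centered on each user's mean before they are aggregated over the neighbors.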
We use Surprise's default parameters and refrain from performing any hyperparameter tuning since we are only interested in assessing (relative) performance differences between the two user groups BeyMS and MS, and not in outperforming any state-of-the-art algorithm. This is also the reason why we focus on traditional algorithms instead of investigating the most recent deep learning architectures, which would also require a much higher computational effort.
The resulting mean absolute error (MAE) values are reported in Table 2 (and correspond to the ones already shown in Fig. 1). We favor MAE over the commonly used root mean squared error (RMSE) because RMSE has several pitfalls, especially regarding the comparison of groups with different numbers of observations [59]. Here, we perform 5-fold cross-validation, leading to 5 different 80/20 train-test splits, and average the MAE over the 5 folds. NMF clearly outperforms UserItemAvg as well as the two neighborhood-based methods (i.e., UserKNN and UserKNNAvg), both for the two user groups separately (see rows "BeyMS" and "MS" in Table 2) and overall without distinguishing between the user groups (see row "Overall"). Additionally, we conduct a one-tailed Mann-Whitney-U test (α = 0.0001), where we define the null hypothesis as the MAE for MS being larger than or equal to the MAE for BeyMS. Results marked with *** indicate that the null hypothesis was rejected for every fold. Thus, all algorithms (including NMF) provide a significantly larger error for BeyMS than for MS. In other words, recommendation quality is significantly better for users with mainstream taste than for users who prefer beyond-mainstream music across all recommendation approaches. These initial results underpin the need to study the characteristics of the BeyMS user group that receives worse recommendations. The corresponding experiments are presented in the next section.
Table 2 Mean absolute error (MAE) results for the two user groups MS and BeyMS of different mainstreaminess and a selection of standard recommendation algorithms. A one-tailed Mann-Whitney-U test (α = 0.0001) provides significant evidence, indicated by ***, that all algorithms perform worse on BeyMS than on MS in terms of MAE. Furthermore, NMF (as shown in bold) outperforms the other three approaches UserItemAvg, UserKNN, and UserKNNAvg
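The evaluation protocol (5-fold cross-validation with the MAE reported per user group and averaged over the folds) can be sketched as follows. This is a simplified stand-in written in plain Python, not the Surprise pipeline used in the paper: the helper names, the random toy ratings, and the global-mean predictor `fit_global_mean` are all hypothetical and serve only to make the sketch self-contained.

```python
import random
from collections import defaultdict
from statistics import mean


def k_fold_indices(n_items, k=5, seed=42):
    """Shuffle item indices and split them into k disjoint test folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_global_mean(train):
    """Hypothetical stand-in predictor: always predicts the training mean."""
    global_mean = mean(value for _user, _track, value in train)
    return lambda user, track: global_mean


def group_mae(ratings, group_of, fit, k=5, seed=42):
    """Return {group: MAE averaged over folds} for (user, track, value) triples."""
    per_fold = defaultdict(list)
    for test_idx in k_fold_indices(len(ratings), k, seed):
        held_out = set(test_idx)
        train = [r for i, r in enumerate(ratings) if i not in held_out]
        predict = fit(train)
        abs_errors = defaultdict(list)
        for i in test_idx:
            user, track, value = ratings[i]
            abs_errors[group_of[user]].append(abs(value - predict(user, track)))
        # A group contributes to a fold only if it has test ratings in that fold.
        for group, errors in abs_errors.items():
            per_fold[group].append(mean(errors))
    return {group: mean(values) for group, values in per_fold.items()}


# Toy data: two users per group, five tracks each, preference values in [1, 1000].
rng = random.Random(0)
group_of = {"ms1": "MS", "ms2": "MS", "bey1": "BeyMS", "bey2": "BeyMS"}
ratings = [(u, f"t{j}", rng.uniform(1, 1000)) for u in group_of for j in range(5)]
result = group_mae(ratings, group_of, fit_global_mean)
```

Computing the error separately per group on the same train-test splits is what makes the per-fold MAE values of MS and BeyMS directly comparable in a subsequent significance test.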

Characteristics of beyond-mainstream music and listeners
We identify the types of beyond-mainstream music using unsupervised clustering and characterize these types with respect to acoustic features and music genres. In addition, we detect subgroups of beyond-mainstream music listeners by assigning users to these clusters and evaluate the recommendation quality obtained for these subgroups. Finally, we discuss the recommendation quality with respect to openness and diversity. For this, we rely on the definitions given by [9]: openness is the across-groups diversity (or categorical diversity) and describes whether users of one group also listen to the music of other groups. Diversity is the within-groups diversity (or thematic diversity) and describes the dissimilarity of music listened to by users within groups. Based on the findings of [9], we would expect that subgroups with high openness should receive more accurate recommendations than subgroups with high diversity.

Clustering and characterizing beyond-mainstream music
To study the different types of music listened to by the users in our BeyMS group, we conduct a cluster analysis. Specifically, we cluster the 157,444 tracks listened to by BeyMS users, where each track is described by the eight acoustic features danceability, energy, speechiness, acousticness, instrumentalness, tempo, valence, and liveness (see Sect. 3.1). We scale the value ranges of these features to [0, 1] using min-max normalization. The use of latent representations of musical elements such as tracks was shown to be efficient in the area of music information retrieval [30,60,61]. Furthermore, for visually analyzing the obtained music clusters and decreasing computation time, we favor a reduction of dimensionality to two dimensions.
We conduct experiments with a broad range of dimensionality reduction methods, i.e., linear and nonlinear principal component analysis (PCA) [62], locally linear embedding [63], multidimensional scaling [64], Isomap [65], spectral embedding [66], t-distributed stochastic neighbor embedding (t-SNE) [67], and uniform manifold approximation and projection (UMAP) [68]. We visually inspect the 2-dimensional feature spaces created by these methods with regard to the clustering quality and obtain the visually most homogeneous results with UMAP. Moreover, UMAP has already been successfully used in the music domain [30]; thus, we use it for the remainder of our experiments. Specifically, we utilize the open-source implementation of UMAP [69], which requires four parameters: (i) the distance metric M in the input space, (ii) the number of latent dimensions D, (iii) the minimum distance of points in the latent space d_min, and (iv) the number of neighbors of a point N. Based on experimentation and related literature (e.g., [69]), we set the distance metric M to the Euclidean distance, the number of latent dimensions D to 2, the minimum distance d_min to 0.1, and the number of neighbors N to 15.
In the next step, we perform clustering on the dimensionality-reduced acoustic features of tracks. Again, we conduct experiments with various clustering methods, i.e., DBSCAN [70], K-Means [71], Gaussian mixture models [72], affinity propagation [73], spectral clustering [74], hierarchical agglomerative clustering [75], OPTICS [76], and HDBSCAN* [77]. Here, we obtain the best results with respect to cluster cohesion and separation using HDBSCAN*. Furthermore, HDBSCAN* has already been used by related work to cluster music items [78]. We employ the open-source implementation of HDBSCAN* [79], which requires four parameters: (i) the minimum cluster size s_min, which defines the minimum size of a group of points to be considered a cluster, (ii) the minimum number of samples in the neighborhood of a core point N_min, which quantifies how conservative the clustering is, (iii) ε, which enables the recovery of DBSCAN clusters if the s_min value is not reached, and (iv) the scaling of the distance α, which is another measure of the clustering's conservativeness. In detail, α scales the distance between two points, which determines whether these points are merged into a cluster. This scaling is used in the construction of HDBSCAN*'s hierarchy of clusterings. Again, we find the best-suited parameters based on experimentation and related literature (e.g., [77]). Specifically, we require each cluster to comprise a sufficiently large number of tracks to increase the level of significance of our subsequent experiments. We expect the existence of very small music clusters and thus search for the optimal value of the minimum cluster size s_min in the search space {1000, 1025, ..., 1475, 1500}, where we obtain the best results with respect to the within-cluster variance for s_min = 1375. Furthermore, tightly packed clusters without any contribution of noise should be favored. In other words, all points within a cluster should be within the neighborhood of at least one core point. Thus, we set the minimum number of samples in the neighborhood N_min = s_min = 1375. The remaining two parameters are set to their default values, i.e., ε = 0 and α = 1.
Figure 6 shows the results of the clustering process using HDBSCAN* and UMAP for the 2-dimensional mapping. This process leads to four music clusters. Here, the green cluster (hatch: +) is the largest one with 92,798 tracks, followed by the pink cluster (hatch: x) with 30,379 tracks and the blue cluster (hatch: /) with 12,148 tracks. The smallest cluster is the orange one (hatch: o), as it contains 7629 tracks. The remaining 14,490 of our 157,444 BeyMS tracks have not been assigned to a cluster and thus will not be included in further analyses and interpretations. Next, we describe how we name these clusters based on their music genre distributions.

Genre distributions
In Fig. 7, we illustrate the top-10 genres of the four music clusters. For this, we refer to the genre IDF-scores presented in Sect. 3.4 and weight each genre assigned to a track in a cluster with its corresponding IDF-score. For example, if a genre with an IDF-score of 1.4 is assigned to 1000 tracks in a cluster, it is visualized as an aggregated genre IDF-score of 1400 in the corresponding plot of Fig. 7. Based on the genre distributions, we label each cluster according to its top genre.
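The aggregation behind Fig. 7 amounts to multiplying each genre's IDF-score by the number of tracks in the cluster that carry the genre. The following minimal sketch illustrates this under stated assumptions: the function name `aggregated_genre_scores`, the toy cluster, and the IDF-score of "drone" are hypothetical, while the value 1.4 mirrors the worked example in the text.

```python
from collections import Counter


def aggregated_genre_scores(track_genres, genre_idf, top_n=10):
    """Sum IDF-weighted genre assignments over the tracks of one cluster.

    track_genres: iterable of genre lists, one list per track in the cluster.
    genre_idf: dict mapping genre -> IDF-score.
    Returns the top_n (genre, aggregated score) pairs, highest score first.
    """
    # Count in how many tracks of the cluster each genre occurs.
    counts = Counter(genre for genres in track_genres for genre in set(genres))
    scores = {g: genre_idf[g] * n for g, n in counts.items() if g in genre_idf}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical cluster: 1000 tracks tagged "ambient", 400 of them also tagged "drone".
cluster_tracks = [["ambient", "drone"]] * 400 + [["ambient"]] * 600
toy_idf = {"ambient": 1.4, "drone": 2.0}  # illustrative IDF-scores
top_genres = aggregated_genre_scores(cluster_tracks, toy_idf)
```

In this toy cluster, "ambient" ranks first although "drone" has the higher IDF-score, because the aggregated score also reflects how many tracks carry each genre.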
With respect to the blue cluster (hatch: /) in Plot (a), we find top genres such as "folk" and "singersongwriter", which typically reflect music with high acousticness. In the remainder of this paper, therefore, we refer to this cluster as C folk . The top genres of the green cluster (hatch: +) in Plot (b) are typical high energy music genres such as "hardrock", "punk", "poprock", and "hiphop". Based on this, we name this cluster C hard .
For the orange cluster (hatch: o) in Plot (c), we find genres that reflect music with high acousticness and high instrumentalness such as "ambient", "experimental", "newage", and "postrock". As "ambient" clearly dominates the genre distribution for this cluster, we name this cluster C ambi . Similarly to C folk , this cluster contains music with high acousticness; yet, while C folk is characterized by low instrumentalness music, C ambi is characterized by a high level of instrumentalness. Finally, Plot (d) shows the genre distribution of the pink cluster (hatch: x) with "electronica" as the top genre, which leads to the name C elec for this cluster.
Thus, both C elec and C hard consist of high energy music, but in contrast to C hard, C elec also comprises high instrumentalness values. This also makes sense when looking at other top genres of C elec such as "deathmetal" and "blackmetal", where guttural vocal techniques are often mistakenly classified as another type of instrument [80].
To compare the genre distributions among the four music clusters, we illustrate the relative genre frequency distribution of the clusters in Fig. 8. The relative frequency of a genre g depicts the fraction of listening events of tracks within a cluster c that are annotated with g. Here, we only show genres with a minimum relative genre frequency of 0.1. We see that there are clearly dominating genres in C folk and C ambi, whereas the genre distributions in C hard and C elec are more evenly distributed. When relating this finding to the findings of Fig. 7, we clearly see that the results correspond to each other: C hard and C elec contain a more diverse genre spectrum (e.g., "hardrock" and "hiphop" are both part of C hard's top genres) than C folk and C ambi (e.g., in C ambi's top genres, we find "ambient" and "darkambient").
Figure 7 Top-10 genres of the four music clusters C 1 to C 4 according to the aggregated genre IDF-scores. We name the clusters according to the top genre, i.e., (a) blue (hatch: /) → C folk ("folk"), (b) green (hatch: +) → C hard ("hardrock"), (c) orange (hatch: o) → C ambi ("ambient"), and (d) pink (hatch: x) → C elec ("electronica")
Figure 8 Relative genre frequency distribution of the four music clusters. While there are dominating genres in C folk and C ambi, the genre distribution is more diverse in C hard and C elec

Acoustic feature distributions
To understand the musical content of these four music clusters, we analyze the acoustic feature distributions of the four music clusters using boxplots in Fig. 9. This visualization does not show any obvious differences with respect to danceability and tempo among the four clusters. For the acoustic features energy, speechiness, acousticness, valence, and liveness, there are similar values for the cluster pairs C folk and C ambi , and C hard and C elec . We observe differences between these two cluster pairs with respect to energy and acousticness. While C hard and C elec provide high energy values and small acousticness values, C folk and C ambi feature small energy values and high acousticness values.
In contrast, for instrumentalness, we see similar values for the cluster pairs C folk and C hard as well as for C ambi and C elec . We observe very high values for C ambi and C elec , and very small values for C folk and C hard . This difference is also visible in Fig. 6 in the form of the gap between C folk and C hard on the left, and C ambi and C elec on the right.
Summing up, in C folk, we find music with low energy, high acousticness, and low instrumentalness; C hard contains music with high energy, low acousticness, and low instrumentalness; in C ambi, we observe music with low energy, high acousticness, and high instrumentalness; and in C elec, we find high energy, low acousticness, and high instrumentalness. Thus, these findings are in line with the genre distributions presented in Fig. 7.
Figure 9 Distribution of the eight acoustic features for the four music clusters. While the clusters do not show obvious differences with respect to danceability and tempo, we find large differences with respect to energy, acousticness and instrumentalness

Assigning and studying beyond-mainstream music listeners
In the next step, we assign the 2074 BeyMS users to the four music clusters to categorize them into four distinct beyond-mainstream subgroups for further analyses.
For each user u, we count the number of listening events $LE_{u,c}$ that u has contributed to the tracks in each cluster c, where $c \in C = \{C_{folk}, C_{hard}, C_{ambi}, C_{elec}\}$. In principle, u could then be assigned to the cluster c for which the number of contributed listening events $LE_{u,c}$ is the highest. However, because we have varying cluster sizes, the probability of u listening to a track t of the two larger clusters C hard and C elec is much higher than for the two smaller clusters C folk and C ambi, although C folk and C ambi could be more representative choices for u. Thus, similar to the IDF distribution of genres (see Fig. 5), we take advantage of the IDF scoring to reduce the influence of the larger clusters and to assign higher weights to the smaller clusters. Specifically, these cluster IDF-scores are given by $\mathrm{IDF}(c) = \log_{10} \frac{|T|}{|\{t \in T : c_t = c\}|}$, i.e., by relating the number of all tracks $|T|$ to the number of tracks in cluster c, where $c_t$ is the music cluster assigned to track t. This lets us define the user-cluster weight $w_{u,c}$ for user u and cluster c as $w_{u,c} = \mathrm{IDF}(c) \cdot LE_{u,c}$.
Consequently, users are assigned to the highest weighted music cluster, and thus a subgroup $U_c$ for cluster c is given by $U_c = \{u \in U : \arg\max_{c' \in C} w_{u,c'} = c\}$.
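The assignment rule can be illustrated with the cluster sizes reported for Fig. 6. The sketch below is an illustration, not the authors' implementation: it assumes that |T| refers to all 157,444 BeyMS tracks (including the unclustered ones), and the function names and the example user are hypothetical.

```python
import math

# Cluster sizes as reported for the four music clusters.
CLUSTER_SIZES = {"folk": 12148, "hard": 92798, "ambi": 7629, "elec": 30379}
# Assumption: |T| counts all BeyMS tracks, including those without a cluster.
N_TRACKS = 157444


def cluster_idf(cluster_sizes, n_tracks):
    """IDF(c) = log10(|T| / number of tracks in cluster c)."""
    return {c: math.log10(n_tracks / size) for c, size in cluster_sizes.items()}


def assign_user(listening_events, idf):
    """Return (assigned cluster, weights) with w_{u,c} = IDF(c) * LE_{u,c}.

    listening_events: dict mapping cluster -> number of listening events of one user.
    Users without listening events in any cluster remain unassigned (None).
    """
    weights = {c: idf[c] * listening_events.get(c, 0) for c in idf}
    if not any(weights.values()):
        return None, weights
    return max(weights, key=weights.get), weights


idf = cluster_idf(CLUSTER_SIZES, N_TRACKS)
# Hypothetical user with more raw listening events in the large cluster C_hard.
assigned, weights = assign_user({"hard": 300, "ambi": 80}, idf)
unassigned, _ = assign_user({}, idf)
```

For this hypothetical user, the raw counts favor the large cluster, but the IDF weighting shifts the assignment to the much smaller cluster, which is the effect the weighting is meant to achieve.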
Out of the 2074 BeyMS users, we can assign 2073 users to these subgroups. Thus, only 1 user listened solely to tracks not contained in any cluster in Fig. 6. Similar to the naming scheme of music clusters, we label the subgroups according to the name of their assigned music cluster. Hence, we obtain four subgroups U folk, U hard, U ambi, and U elec. Table 3 provides basic descriptive statistics of these four resulting subgroups. Here, U hard is the largest subgroup with |U| = 919 users, followed by U elec with |U| = 642 users, U folk with |U| = 369 users, and U ambi with |U| = 143 users. The differences with respect to the number of users also correspond to the differences regarding the number of artists |A|, the number of tracks |T|, and the number of listening events |LE| contained in the subgroups. In the case of the number of genres |G|, this differs slightly because the users in the smaller U ambi subgroup listen to more genres (i.e., 918) than the users in the bigger U folk subgroup (i.e., 811). This indicates that the users in U ambi listen to a broader set of music than the users in U folk.
Considering the average number of listening events per user (i.e., |LE u|) and the average number of tracks per user (i.e., |T u|), we see that, while there is little difference between U hard and U elec with respect to |LE u|, |T u| is much higher for U elec (i.e., 670.402) than for U hard (i.e., 557.470). This indicates that, although the number of listening events is nearly the same, users of U elec tend to listen to a wider set of tracks than users of U hard. With respect to the average age of the users (Age), we see that the users of U folk and U ambi are the oldest ones, and users of U hard and U elec are the youngest ones. However, it is worth noting that the group with the highest average age (i.e., U ambi) also shows by far the highest standard deviation of age (i.e., 14.138 years). In Fig. 10, we show the contribution of each music cluster to each subgroup in the form of a radar plot. For this, we use the user-cluster weights w u,c introduced before and calculate the average weight for each cluster c over all users of a subgroup. One consequence of the IDF scoring applied to w u,c is that the weight contributions of a user group to the four clusters do not sum to 1, which influences the interpretation of the values shown in Fig. 10. However, in return, these values account for the varying cluster sizes and can also be interpreted as preference weights of a user group towards a specific music cluster.
Table 3 Descriptive statistics of the four subgroups. Here, |U| is the number of users, |A| is the number of artists, |T| is the number of tracks, |LE| is the number of listening events, |G| is the number of genres, |LE u| is the average number of listening events per user, |T u| is the average number of tracks per user and Age is the average age (along with the standard deviation) of users in the group
Figure 10 Radar plot illustrating the contribution of each music cluster to a subgroup. While the weight distribution of U hard and U elec is rather narrow, it is broader in the case of U folk and U ambi, suggesting that these groups are more open to music outside their own music cluster
We observe that the weight distribution of the two larger subgroups U hard and U elec is rather narrow, which indicates that these users do not listen to many tracks of other clusters. Contrary to that, the weights of the two smaller subgroups U folk and U ambi are more broadly distributed over the four music clusters. This suggests that users of U folk and U ambi are more open to music outside of their own music cluster than users of U hard and U elec .

Correlation of music clusters and beyond-mainstream subgroups
To better understand the correlations and connections between the music clusters and subgroups, we plot the Pearson correlation matrix of the four music clusters as a heatmap in Fig. 11. Here, we represent each music cluster c by a 2073-dimensional vector (i.e., one entry for each user) consisting of the user-cluster weights w u,c introduced before. Each element in the matrix is then calculated using the Pearson correlation measure based on these cluster vectors. For example, if there is a positive correlation between two clusters, we assume that a user who enjoys music from one cluster likely also enjoys music from the other cluster. This can also give us an indication of the openness of a subgroup for music mainly listened to by other subgroups. Specifically, for C folk, we see a positive correlation between C folk and C ambi, and a negative correlation between C folk and both C hard and C elec. Users listening to the music of C hard seem to represent the most closed subgroup, as C hard has solely negative correlations with all other clusters, especially with C ambi and C elec. In contrast, users listening to the music of C ambi seem to represent the most open subgroup, as C ambi has positive correlations with two other clusters, i.e., C folk and C elec. The fourth cluster, C elec, is negatively correlated with C folk and especially with C hard, and positively correlated with C ambi. These results are also in line with the ones shown in Fig. 10, in which we identify the users of U ambi as more open music listeners than the ones of U hard.
Figure 11 Pearson correlation matrix of the four music clusters. While C hard has solely negative correlations with all other clusters, and thus, listeners of C hard seem to be the most closed subgroup, C ambi has positive correlations with C folk and C elec, and thus, listeners of C ambi seem to be the most open subgroup
Figure 12 Boxplots showing the average pairwise user similarity of the four subgroups using the cosine similarity calculated on the users' genre distributions. While the users in U hard and U elec exhibit a more diverse listening behavior, users in U folk and U ambi tend to listen to more similar, i.e., less diverse, music genres
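The correlation matrix can be computed directly from the per-cluster weight vectors. The following sketch implements the Pearson correlation in plain Python; the function names and the toy vectors for four users are hypothetical and do not reproduce the values of Fig. 11.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def correlation_matrix(cluster_vectors):
    """cluster_vectors: dict cluster -> list of user-cluster weights (one per user)."""
    clusters = list(cluster_vectors)
    return {
        a: {b: pearson(cluster_vectors[a], cluster_vectors[b]) for b in clusters}
        for a in clusters
    }


# Hypothetical user-cluster weights for four users (entry i belongs to user i).
toy_vectors = {
    "folk": [5.0, 4.0, 0.0, 1.0],
    "ambi": [4.0, 5.0, 1.0, 0.0],
    "hard": [0.0, 1.0, 6.0, 5.0],
}
matrix = correlation_matrix(toy_vectors)
```

In the toy data, users with high weights for "folk" also have high weights for "ambi" but low weights for "hard", which yields a positive folk-ambi correlation and a negative folk-hard correlation.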
To relate the openness of the subgroups to the diversity of the users within the subgroups, we calculate the average pairwise user similarity using the cosine similarity metric computed on the users' genre distributions, i.e., the number of listening events per genre. Figure 12 shows the resulting boxplots for the four identified subgroups (i.e., U folk, U hard, U ambi, and U elec). We see that users in U hard and U elec have a rather small average pairwise user similarity and thus exhibit a more diverse listening behavior, whereas users in U folk and U ambi tend to listen to more similar music genres and thus have a narrow listening behavior within the group. In summary, we find pronounced differences with respect to openness and diversity across the subgroups. Although U ambi is the most open subgroup (i.e., also listens to music of other subgroups), it is also the least diverse subgroup (i.e., the users within the group listen to very similar music). That observation is in line with what is shown in Figs. 7 and 8. Here, we see that C ambi, i.e., the music cluster most tightly connected to U ambi, contains the dominating genre "ambient" as well as genres that are strongly associated with this dominating genre (e.g., "darkambient"). For U hard, we observe the opposite. While it is the least open subgroup, it is also the most diverse one (e.g., it contains "hardrock" as well as "hiphop" listeners).
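A possible way to compute this within-group similarity is sketched below. It assumes that the value shown per user is the mean cosine similarity to all other users of the same subgroup; the exact aggregation is not spelled out in the text, and the function names and toy genre counts are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two sparse genre distributions (dict genre -> count)."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_pairwise_similarity(users):
    """users: dict user -> {genre: listening events}.

    Returns dict user -> mean cosine similarity to all other users in the group.
    """
    sims = {u: [] for u in users}
    for u, v in combinations(users, 2):
        s = cosine(users[u], users[v])
        sims[u].append(s)
        sims[v].append(s)
    return {u: sum(vals) / len(vals) for u, vals in sims.items() if vals}


# Hypothetical subgroups: one with similar, one with dissimilar genre profiles.
narrow_group = {
    "a": {"ambient": 10, "darkambient": 5},
    "b": {"ambient": 8, "darkambient": 4},
    "c": {"ambient": 6, "darkambient": 3},
}
diverse_group = {
    "d": {"hardrock": 10},
    "e": {"hiphop": 10},
    "f": {"punk": 5, "hardrock": 5},
}
narrow_sims = average_pairwise_similarity(narrow_group)
diverse_sims = average_pairwise_similarity(diverse_group)
```

A subgroup whose users share proportional genre profiles yields similarities close to 1 (low diversity), whereas users with disjoint genres pull the average towards 0 (high diversity).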

Recommendations for beyond-mainstream user subgroups
In Sect. 3.5, we have shown that the recommendation accuracy of four personalized recommendation algorithms is significantly worse for BeyMS users than for MS users. Now, we extend this analysis and evaluate the recommendation accuracy of these algorithms for the four subgroups (i.e., U folk, U hard, U ambi, and U elec). Table 4 shows our results with respect to the mean absolute error (MAE). Additionally, we analyze these results with respect to statistically significant differences in Table 5 by performing ANOVA (α = 0.01) and a subsequent Tukey-HSD test (α = 0.05). Here, we report pairwise differences as significant (marked with **) if both ANOVA and Tukey-HSD were significant across all five folds (see Sect. 3.5 for details on the experimental setup).
We see that, across all algorithms, the significantly worst accuracy results (i.e., the highest MAE scores) are obtained for the U hard subgroup. U folk, U ambi, and U elec reach significantly better results (i.e., lower MAE scores) than U hard for all algorithms. However, there is no statistically significant difference between the recommendation accuracy of U folk and U elec. The overall best accuracy results (i.e., lowest MAE scores) are reached for the U ambi subgroup. These results are also statistically significant when compared with the other subgroups for the NMF algorithm. NMF also gives the overall best accuracy results for all subgroups, which is in line with our results presented in Sect. 3.5 and in our previous work [4]. Furthermore, we find a relationship between openness, diversity, and recommendation quality. Here, U hard is the least open but most diverse subgroup and gets the worst recommendations, while U ambi is the most open but least diverse subgroup and gets the best recommendations. This is in line with the findings of [9], who have shown that users are more likely to accept recommendations from different groups (i.e., openness) than varied recommendations within a group (i.e., diversity). Thus, we find a relationship between the quality of recommendations provided to beyond-mainstream music listeners and the openness as well as diversity patterns of these users.
Finally, in Fig. 13, we visually compare the MAE scores reached by the best performing approach NMF for the four subgroups. Additionally, we depict the MAE score for BeyMS as a black dashed line and the MAE score for MS as a gray dashed line. We see that U hard reaches worse results than BeyMS while U folk and U elec reach slightly better results than BeyMS. Interestingly, U ambi not only reaches better results than BeyMS but also better results than MS. Although this improvement over MS is not statistically significant (according to a one-tailed Mann-Whitney-U test with α = 0.0001), it shows that there is a large variety among BeyMS users, where specific subgroups (i.e., U hard ) are disadvantaged in terms of recommendation accuracy by recommendation algorithms while others (i.e., U ambi ) are not.

Conclusions and future work
In this paper, we shed light on the characteristics of beyond-mainstream music and music listeners. As our first contribution, we identified 2074 beyond-mainstream music listeners (i.e., BeyMS) in the Last.fm platform, and subsequently created a novel dataset called LFM-BeyMS based on the listening histories of these users. We further enriched this dataset with (i) acoustic features of music tracks gathered from Spotify, and (ii) genre information of tracks derived from Last.fm tags and matched with the Spotify microgenre taxonomy. Additionally, for reasons of comparability, LFM-BeyMS contains data of 2074 Last.fm users listening to mainstream music. Using this dataset, as our second contribution, we validated related research by showing that beyond-mainstream music listeners receive a significantly lower recommendation accuracy than mainstream music listeners by four standard recommendation algorithms (i.e., UserItemAvg, UserKNN, UserKNNAvg and NMF).
As our third contribution, we applied the clustering algorithm HDBSCAN* to the acoustic features of tracks listened to by BeyMS users and identified four clusters of beyond-mainstream music: (i) C folk, music with high acousticness such as "folk", (ii) C hard, high energy music such as "hardrock", (iii) C ambi, music with high acousticness and instrumentalness such as "ambient", and (iv) C elec, music with high energy and instrumentalness such as "electronica".
As our fourth contribution, we mapped these clusters to our BeyMS users, which led to four beyond-mainstream subgroups: (i) U folk, (ii) U hard, (iii) U ambi, and (iv) U elec. We analyzed these subgroups with respect to their openness (i.e., across-groups diversity: do users of one group listen to music of other groups?) and diversity (i.e., within-groups diversity: how dissimilar is the music listened to by users within groups?). Here, we found large differences between U hard and U ambi. Although U hard is the most closed subgroup (i.e., users do not listen to music of other subgroups), it is also the most diverse subgroup (i.e., users listen to a diverse set of genres such as "hardrock" and "hiphop"). For U ambi, we get opposite results: while it is the most open subgroup (i.e., users listen to music of other subgroups as well), it is also the least diverse one (i.e., the users within the group listen to very similar music such as "ambient" and "darkambient"). We related these characteristics of the subgroups to the recommendation quality of the four recommendation algorithms UserItemAvg, UserKNN, UserKNNAvg and NMF. Here, we found that U hard received music recommendations with the lowest accuracy, while U ambi received music recommendations with the highest accuracy. This is in line with related research [9], which has shown that openness is more strongly correlated with accurate recommendations than diversity. U ambi even received better recommendations than the group of mainstream music listeners. This result highlights that there are large differences between the subgroups of beyond-mainstream music listeners. Finally, to foster reproducibility of our research, we provide our novel LFM-BeyMS dataset via Zenodo as well as our source code via GitHub.
We believe that our findings provide useful insights for creating user models and recommendation algorithms that better serve beyond-mainstream music listeners. As it was shown in [4], beyond-mainstream music listeners tend to have larger user profile sizes than users interested in mainstream music, which means that they provide a substantial amount of listening interaction data for services such as Last.fm and Spotify. We assume that improving the recommendation quality for this active user group also leads to another effect, namely a more prominent exposure of (long-tail) music artists due to a better-connected recommendation network [81]. We leave such investigations to future work.
Limitations Despite the merits of this work, we are aware of its limitations. The first limitation we recognize is that our analyses are based on a sample of the Last.fm community. The extent to which their listening behavior is representative of the Last.fm community at large, or similar music streaming communities such as Spotify, needs further investigation.
Next, since we conducted a comparative study of the accuracy of recommender system algorithms (and were therefore not interested in beating state-of-the-art algorithms), we focused on traditional algorithms (e.g., KNN-based collaborative filtering) instead of investigating the most current deep learning architectures, which would also require a much higher computational effort. Furthermore, an award-winning paper by Dacrema et al. [82] has recently shown that traditional algorithms are able to outperform almost all deep learning architectures.
Future work While our work serves as a first milestone towards better characterizing beyond-mainstream music and listeners of such music, future work should focus on user modeling techniques to individually target the different subgroups, for example by integrating knowledge about openness and diversity. With respect to analyzing openness and diversity of users and user groups, we would also like to work on a more formal definition of these dimensions, which would not only allow us to measure them more precisely but also to integrate them into the recommendation calculation process.
Additionally, since previous research has shown that the listener's cultural background impacts the quality of music recommendations [48], we plan to compare the cultural and socioeconomic aspects of beyond-mainstream and mainstream music listeners. We plan to capture these aspects by means of Hofstede's cultural dimensions [83] and the World Happiness Report [84].
Finally, another avenue for future work is the research in the area of fair music recommender systems. Here, we plan to build user models that are capable of accounting for the complex characteristics of beyond-mainstream music listeners presented in this paper. While we believe that more specialized user models could help to provide better recommendations for users who currently receive worse recommendations (e.g., the U hard subgroup identified in this paper), we also aim to highlight that such user models still need to be generalizable to avoid any unfair treatment of other users. Hence, future research should work on achieving a specialization-generalization trade-off in music recommender systems. We hope that our open LFM-BeyMS dataset as well as our source code will be of use to the scientific community for subsequent analyses.