Predictive modeling to study lifestyle politics with Facebook likes

“Lifestyle politics” suggests that political and ideological opinions are strongly connected to our consumption choices, music and food taste, cultural preferences, and other aspects of our daily lives. With the growing political polarization this idea has become all the more relevant to a wide range of social scientists. Empirical research in this domain, however, is confronted with an impractical challenge; this type of detailed information on people’s lifestyle is very difficult to operationalize, and extremely time consuming and costly to query in a survey. A potential valuable alternative data source to capture these values and lifestyle choices is social media data. In this study, we explore the value of Facebook “like” data to complement traditional survey data to study lifestyle politics. We collect a unique dataset of Facebook likes and survey data of more than 6500 participants in Belgium, a fragmented multi-party system. Based on both types of data, we infer the political and ideological preference of our respondents. The results indicate that non-political Facebook likes are indicative of political preference and are useful to describe voters in terms of common interests, cultural preferences, and lifestyle features. This shows that social media data can be a valuable complement to traditional survey data to study lifestyle politics.


Introduction
Voting is central to the democratic process and the well-functioning of our political system. Therefore, voting behavior and party preference are well-studied by a broad range of social scientists who try to explain how and why decisions are made by the electorate. Traditionally, to understand individuals' votes, scholars relied on socio-structural factors such as group identification, religious affiliation, and socio-economic status. Electoral pioneers from the Michigan and Columbia school both stressed the importance of these long-term factors in explaining voting behavior [1,2]. In the US context party identity was key, while in many European countries, more religious and class-based cleavages were driving voters [3]. As these stable factors started losing importance, scholars gradually started devoting more attention to short-term factors such as concrete issues and popular candidates [4,5]. Studies showed that a better educated electorate became more volatile and more affected by issue priorities [6] and candidate evaluations [7]. "Lifestyle politics" represents an alternative way to understand political preference and voting behavior. Authors such as Bennett [8] and Giddens [9] argue that in our postmodern society, personal identity is replacing collective identity and that individuals increasingly let their personal politics depend on lifestyle choices. Politics focuses more and more on broad issues like identity, values, and moral and social orientation [9]. According to DellaPosta et al. [10] (p. 1474) we "are increasingly likely to find our local communities and social networks populated by individuals with similar aesthetic tastes, leisure activities, consumer preferences, moral practices, and ways of life. " Our attitudes on climate change, religion, migration, sexual minorities or workers' welfare are strongly connected to our consumption choices and daily lifestyle [11]; in turn, lifestyle is becoming more important in understanding individuals' political opinions and voting behaviors. In recent studies, political preference has been associated with leisure activities and personal tastes [10], brand choice in supermarkets [12], sustainable behaviors [13], food and music taste [11,14] and movie preferences [15]. These lifestyle choices reflect deep-seated personality traits that not only drive the culture wars over issues like race, or sexuality but also influence how citizens think about classic economic issues [16] Although the linkage between politics and many aspects of our personal life were documented, what is driving this alignment between ideological preferences and seemingly unrelated lifestyle dimensions is less clear. While some suggest that these lifestyle choices are a way to express one's (political) identity [15], others rather stress the deeper role of social and cultural homogeneity [10]. Yet others point out the political relevance of seemingly apolitical domains like entertainment news, novels, films, television sitcoms, and popular culture more broadly to distribute news and stimulate political debate [17]. Although a clear connection between lifestyle and political preference is not carved in stone, politicians seem well aware of their lifestyle choices when acting in the public sphere. For instance, studies show that politicians prefer certain types of food on the campaign trail, in particular to relate to the common man or to show cultural sensitivity. In this way politicians, subtly present or frame themselves in line with the cultural preferences of their potential voters [18]. Finally, "branded communication" refers to the idea to attach political messages to so-called lifestyle brands to capture the attention of individuals, and by which the alignment between ideological preferences and lifestyle is further strengthened [19].
The growing political polarization renewed the issue of "lifestyle enclaves" as a prime interest for a wide range of social scientists [20,21]. Furthermore, we know relatively little on whether these "cultural fault lines" that are clearly visible in the two party system in the US [14] are as present and determining in a multi-party system with a much broader variety of ideological players. Yet, detailed information on individuals' lifestyles is very difficult to collect, which complicates empirical and comparative studies in this domain. The options of leisure activities, movies, music, cultural activities, etc. that could be included in survey questions on lifestyle politics are endless, therefore, including these type of questions will not only occupy a lot of survey space and time but will also result in nonexhaustive option lists. Moreover, it implies that researchers are able to deductively make a selection of the lifestyle indicators (i.e. cultural places, movies, products) that should be included to best explain or predict vote choice.
A potential valuable alternative data source to capture these cultural and lifestyle practices is social media data. Today, thanks to the Internet and social media, an unseen amount and granularity of data are available. People visiting webpages or liking Facebook content leave little "bread crumbs" behind in the digital world that are indicative of their interests and personality. From this fine-grained behavioral data, inferring unknown information about a user is possible by applying predictive modeling techniques [22]. We explore the potential of Facebook "like" data for capturing lifestyle and predicting political and ideological preference. With Facebook "likes" we refer to the mechanism used by Facebook users to express their positive association with public Facebook pages of products, sport clubs, musicians, books, restaurants, etc.
Digital trace data (such as Facebook likes) seem to offer valuable insights in political attitudes and behavior [23], and several scholars have combined survey data with digital traces to study and understand political behavior. For example, Bond and Messing [24] estimate the ideology of politicians and their supporters using individual citizens' Facebook likes of political figures to study the relationship between ideology and age, social relationships and ideology, and the degree of polarization among the electorate. Eady et al. [25] apply a method developed by Barberá [26], to quantify the ideological distributions of users' online political and media environments on Twitter and study the extent to which liberals and conservatives live in so-called "echo-chambers". Some other examples include comparing individual exposure to news and politics content [27] and understanding news sharing behavior [28,29].
Our study adds to this existing work by using non-political Facebook pages to study lifestyle politics. As most previous literature mainly focused on the inference of ideology based on the accounts or pages from political actors or news media, the inclusion of nonpolitical Facebook pages is innovative.
To explore the value of digital trace data to complement classical survey research, we gathered Facebook likes and survey data of more than 6500 participants in Flanders, the Dutch-speaking part of Belgium. The data gathering was of course done with user consent and a clear privacy statement. Based on these Facebook likes, we build models to predict political and ideological preference, and compare this to predictive models based on traditional survey data. The interpretation of these models can be used to gain insights into voter profiles. The main contributions of this study are twofold. First, we explore the use of Facebook like data to complement and improve traditional survey data to study lifestyle politics, with an explicit focus on the contribution of non-political likes. Second, we collect a unique dataset of Facebook likes and survey data to gain insights into voter profiles in Belgium, a fragmented multi-party system.

Predictive modeling with Facebook likes
Facebook likes 1 express a certain positive attitude or interest in a public Facebook page, similar to visiting a webpage, purchase behavior, payment or location data. For example, observing a users' Facebook likes related to books provides similar information to watching someone's book closet or a list of purchased books online. Liking a Facebook page signals a user's desire to see more posts from the page's publisher. Moreover users can actively interact with the page creator as well as other people that have liked the page, by commenting, liking, and sharing content [30].
Additionally, a Facebook like has been conceptualized as a form of social endorsement, since these likes are publicly observable by friends in one's network [24]. Since like behavior is observable, it is used as a form of self-expression, and in line with the theory of Goffman [31] it is believed that users are building an idealized version of themselves on social platforms. At the same time, this "ideal" self should remain congruent with how one is perceived in the offline world, as Facebook networks are often grounded in offline relationships [32]. Indeed, in a study on consumer identity, Hollenbeck [33] finds that users like Facebook pages (of brands) to present-an ideal version of-themselves on the platform. Given these characteristics, Facebook likes have been proven powerful to infer interests and psychological traits, and became invaluable for user profiling and personalized advertising applications [34]. More concretely, Piazza et al. [35] investigated the relationship between Facebook likes and individual lifestyle, and they found a correlation between the activity, interests, and opinions of an individual and their like information. Given these characteristics, we believe Facebook likes can be used to operationalize lifestyle preferences.
We argue that Facebook likes could be a valuable addition to survey data for electoral research because of at least four different reasons: (1) through the direct measurement of actual behavior we avoid social desirability-though this might, to some extent, also be present in like behavior, see below-, recall error, and subjectivity 2 that are specific to answering survey questions [37], (2) it requires less effort from the respondents, and thus it is less time-consuming and costly than collecting survey data, (3) they can provide an unfiltered look and unique information on interests and lifestyle that we cannot grasp (fully) with survey-based data and (4) they can inductively help to identify the most important indicators of a phenomenon (such as lifestyle politics in the case of our study), which can then later be included in a survey. 3 Social media data allow to measure real behavior rather than self-indicated behavior or attitudes.
Of course, we must keep in mind that online data is a proxy for the real behavior under study, and thus, drawing causal inference and generalizing results from this type of data must be performed with caution [38,39]. For example, what you like on Facebook is not necessarily what you like in real-life and thus true behavior can only be studied indirectly [39]. More specifically, two potential hidden biases need to be considered. First, Facebook like behavior may be affected by user induced biases such as social desirability and intentional misrepresentation. Second, Facebook profile data are affected by the mechanics of the platform, such as the personalization by Facebook algorithms [40]. The pages that you like are at least partially influenced by what is recommended to you. These recommendations are based on your previous behavior on the platform, your profile attributes, and who you are connected to [40]. Therefore, connections between lifestyle and political preference are possibly reinforced owing to algorithmic curation and recommendation systems. Furthermore, behavioral data can suffer from exhibiting a low signal-to-noise ratio, since the behavior we capture can be unrelated to the target question. Surveys, if properly designed and implemented, provide better quality controls and allow for more targeted questions and responses [41]. A last issue with behavioral data is privacy. Since these data are often very personal and sensitive, the data must be collected and stored with respect for users' privacy, addressing challenges on user consent, data anonymization, secure storage, etc. [42]. 4 Yet, because of the enormous amount of information available in behavioral data, there is much to learn from it when following a rigorous research approach [38,39].
We will use a predictive modeling approach to study Facebook likes and political preference. Although most empirical political science research relies on explanatory modeling to test causal explanations [43], predictive modeling is more suitable to uncover complex patterns from data that might lead to the generation of new hypotheses [44]. The distinction between explanatory and predictive modeling exhibits some practical implications to each step of the modeling process [45].
At the very essence, the scientific goals that they are aimed at are different: explaining or predicting. Explanatory modeling is theory-driven: given a causal theoretical model, statistical models are applied to observational data to test hypotheses. In contrast, in predictive modeling statistical models are used for generating good predictions of new values, irrespective of the true underlying causal relationship [45]. Since the goal of prediction and explanation is different, they lead to different modeling approaches. Explanatory modeling requires interpretable statistical models; however, this is not a strict requirement for good prediction, although interpretability of the model is desired in our case. Moreover, some methods exist that are very useful for prediction, but not for explanation. When building prediction models, we will make use of regularization: an extra constraint introduced to the optimization function that penalizes the weights of coefficients. This is especially useful when modeling Facebook likes: because they are sparse data, simply optimizing the standard objective function would lead to estimates that are not unique and can overfit the data.
Finally, validating a causal model involves goodness-of-fit tests (e.g., normality tests) and model diagnostics such as residual analysis. Prediction performance is evaluated on a part of the data that was not used for training the model (a holdout-or test set). The evaluation metric we will use is Area Under the ROC Curve (AUC). AUC is a frequently used metric in data science to measure the performance of a classification model, independent of the frequency of the classes. It can be interpreted as the probability that the model ranks a random positive example higher than a random negative example [46]. A perfect model would achieve an AUC of 100%, while an AUC of 50% indicates a random model.

Data collection
This study is part of a broader research project that focuses on (social) media and electoral preferences, and was approved by the Ethics Committee for the Social Sciences and Humanities (EA SHW), at the University of Antwerp. The data collection started in March 2018 and focused on Flanders, the Dutch-speaking part of Belgium, representing around 60% of the population. Belgium is a federal country with a fragmented multiparty system and large coalition governments. A detailed survey with questions on sociodemographics, media consumption, political preference and attitudes was sent to 4500 respondents online. Surveying was done by Dynata/SSI, who distributed the survey among their own online panel, which consists of a diverse collection of citizens in terms of age, gender, and educational level. 5 Targets were set on these socio-demographic characteristics to increase the representativeness of the sample. 6 Of these respondents, 524 agreed to provide us access to their Facebook like data via Facebook Login (See Appendix A.1.4). In May-June 2018, a second round of data collection was conducted, where we disseminated a shorter survey and Facebook Login through the online webpages of popular Flemish newspapers. We asked people to use our tool, which would predict their ideological position based on their personal Facebook likes. An additional 6209 respondents agreed to provide access to their Facebook data, and they completed 12 survey questions about their media consumption and political preference. The Facebook likes and survey questions were collected with user consent and stored anonymously and securely on a local server. The data are used for scientific purposes only and will under no circumstances be shared with other institutions or companies. Results will be shown on an aggregate level only and participants have the right to stop their collaboration at any time and ask for their data to be removed. More details about the data collection and a full discussion of the privacy and ethical concerns can be found in Appendix A.1.
Per user, we stored the name and timestamp of all public Facebook pages they liked. This resulted in a total of 595,994 unique Facebook pages. For privacy reasons (see Appendix A.1.5), only pages that are liked by a minimum of 50 respondents in our dataset will be reported later in the analysis, which results in 5357 pages. For these pages we searched for the category that was assigned to them on Facebook. In total, Facebook shows over 1300 different page categories. 7 We started from the categorization that Facebook uses to classify public pages, but we refined and adjusted it into 20 categories (see Table 1). Note that our categories are not mutually exclusive; for example, an art festival can be included in the category Arts & Culture as well as in the category Event & Festival.
The survey data (see Appendix A.1.2) include some of the basic variables that are generally included in models of voting behavior in Belgium, see for instance [47,48]. Specifically, we added several of the most important structural determinants of ideology and voting behavior (gender, age and education), as well as the use of different media sources (TV, newspaper, social media, etc.), interest in the news (sports, culture, home affairs, foreign affairs, etc.), and general interest in politics.
Two survey questions will be used as target variables: (a) Ideological leaning. The participants positioned themselves on a scale of 0 (most left) to 10 (most right). In the analysis we consider the numbers 0 to 3 as "left", 4,5 and 6 as "center", and numbers 7 and higher as "right", and (b) Party preference. The participants indicated how likely they are to ever vote for each of the seven main Flemish political parties on a scale of 1 (never vote for party) to 10 (definitely vote for party): the worker's party (PVDA), the green party (Groen), the social democratic party (Sp.a), the Christian democratic party (CD&V), the liberal party (Open VLD), the Flemish nationalist party (N-VA), and the extreme right party (Vlaams Belang, VB). The preferred party per participant is the one with the highest score. In case of a tie between two or more parties, all tie-parties are considered equally important. The number of participants per target variable can be found in Table 4 in the Appendix. 5 For more information on Dynata and their panel see https://www.dynata.com/. 6 With regard to age our sample of 4500 respondents is slightly older than the population. 7 An overview of all possible Facebook categories can be found on https://www.facebook.com/pages/category/. Our sample of Facebook users consists of a diverse mix of users in terms of gender, age and education levels. Yet, our sample contains less females, less participants older than 55 and less lower educated participants compared to the general population. Weights were applied to our survey samples (see Appendix A.1.3) but this did not influence the results, 8 therefore we will only report the unweighted results hereafter. Moreover, through the selfselection of participants, Facebook users with a higher political interest are overrepresented. Likewise, some parties are over-or underrepresented in our sample. However, since the goal of our study is not to predict aggregated election results but rather to gain insights into voter groups, this is not a particular stumbling block in this study. Nevertheless, it is likely that we will achieve more accurate results for political parties that are well-represented in our sample.

Methods
As discussed in Sect. 2, our methodology consists of predictive modeling to study lifestyle politics. We apply predictive models to infer the political and ideological preference of our respondents, based on their Facebook likes. 9 The goal of these models is to optimize prediction accuracy at the individual level. Yet, the coefficients of these models reveal which likes are predictive for a certain political preference, providing insight in the interconnection between lifestyle and politics. To show Facebook likes are not just capturing socio-demographics in an indirect way, we compare the models built on Facebook likes to models built on survey data and demonstrate the potential added value of combining the predicted Facebook likes models with traditional survey data.
More specifically, we compare the predictive performance of Facebook like data to the survey data in five different set-ups: models based on all Facebook likes (M1), models based solely on non-political Facebook likes (M2), models based on the survey data (M3), 8 The spearman rank correlation between the coefficients of the models with and without survey weights applied was 0.94 9 Facebook pages are encoded as dummy variables where the value 1 indicates the user has liked the page and 0 indicates the user has not liked the page. models based on a combination of all Facebook data and survey data (M4), and models based on a combination of non-political Facebook likes and survey data (M5) (see Table 2). As explained in Sect. 2, prediction models are evaluated based on out-of-sample predictive accuracy. We report the average AUC over ten folds. 10 Political leaning is divided into three classes (left, center, and right, see Sect. 3). For each of the three classes, we transfer the variable into a binary classification problem (one-vsall) and train three binary Logistic Regression (LR) models. 11,12 Next, we calculate the weighted average AUC for the three classes. Similarly, we built seven binary classifiers for the seven political parties and calculate weighted average AUC. Though multinomial LR (one-vs-one) could also be used in this case (and is in fact more common in electoral research), we prefer binary classification to be able to create general distinct profiles for each party electorate in comparison to all other citizens, rather than compare the voter profiles to only one single reference group. However, both methods (binary and multinominal LR) show similar predictive performance in our analysis.
Additionally, we want to know how predictive each category of pages is for political leaning. In other words, how accurate can political leaning be predicted when using only the Facebook likes of the concerning category. This tells us which aspects of our social lives are most related to politics. To report the predictive performance of each category independent of the amount of pages per category, we randomly sample (with replacement) 100 pages per category and use only those pages as features. 13 This procedure (of random sampling) is repeated 10 times, and the average AUC is reported.
Finally, we compare the insights in ideological position and party preference based on Facebook likes versus survey data. Analyzing the coefficients of a logistic regression and their "p-values" is common practice in traditional explanatory and predictive modeling with dense data [50]. However, when using a regularization term, assumptions about the asymptotic distribution of parameters do not apply, and therefore, different methods for significance testing are needed and suggested in literature [51,52]. We will follow the bootstrap procedure as described in [52]. From the original dataset we take a random sample with replacement and built a model from this dataset to estimate the betas. This step is repeated 1000 times to obtain 1000 values for each beta. For each beta we estimate the  probability density function using a Gaussian kernel 14 to calculate the probability (p-value) that the parameter is less than or equal to zero. Facebook pages will be ranked based on the mean coefficients over 1.000 bootstraps while the p-value indicates significance on a α = 0.05 level. Table 3 compares the predictive performance for models based on Facebook likes and survey data for all users that participated in our study (M1)-(M5). Figure 1 shows the boxplots for the predictive performance of these models. In Appendix A.2 we perform a corrected paired differences t-test [55] and a non-parametric Wilcoxon signed-ranks test [56] (based on 10-fold cross-validation results) to compare the AUC of the different models and apply the Bonferonni correction for multiple pairwise comparison [57,58]. Both for ideological leaning and party preference, nearly all pairwise differences between the models are significant at the α = 0.05 level, 15 see Appendix A.2. The Facebook likes (M1) exhibit a higher predictive power than the survey data (M3). Still, the survey ques- tions do seem to capture some information that Facebook likes do not, since the combined model (M4) achieves the highest predictive power. One might argue that the high predictive performance of Facebook likes is due to the presence of political Facebook pages. Naturally, if someone likes the Facebook page of a certain party or politician this is very indicative of their political preference. Therefore, to assess the predictive value present in non-political Facebook pages, we built the same prediction models on the Facebook data without political pages (M2). As expected, the predictive performance decreases when excluding explicit political content. However, the AUC is still higher than the survey data (M3). Moreover, when combining non-political likes and survey data (M5), the performance is higher than for the survey data alone, which indicates that purely non-political lifestyle choices can augment surveys in predicting political preferences.

Results
The high predictive performance of non-political Facebook likes is an indication that non-political interests are interrelated with political preferences. The fact that this performance is higher than for the available survey data indicates that indeed additional information is captured with Facebook data that was not covered in the survey questions. To further explore which lifestyle categories are connected to political preference, we calculate the predictive performance per Facebook category. Figure 2 shows which categories of Facebook pages are most predictive for ideological preference. A prediction model built on only political Facebook pages achieves, not surprisingly, the highest AUC, followed closely by Civil Society and News & Media, which could be considered as semi-political categories. However, also the non-political categories of Arts & Culture and Communities are very telling for ideological preference. Conversely, the categories Movies, Sports & Health and Travel are the least predictive for ideological preference. This indicates that some aspects of our social lives are more tied to our ideological views than others.
Finally, we note the non-political likes perform better at discriminating between left versus non-left people and between right versus non-right than between center versus non-center (see Table 6 in Appendix for the t-test results). This might indicate that a less clear pattern is present in the characteristics and behavior of people with a center political leaning and that they exhibit a less distinguishable profile. Similarly, voters for some parties can be classified more accurately than others (e.g. the party N-VA, see Table 7 in the Appendix). From the survey data (see Table 11 in the Appendix), we learn that traditional variables, such as gender, age, education, and interest in certain topics help to predict people's vote. For instance, left voters are often highly educated women, between the age of 25 and 55, and interested in news about culture, arts, and international politics. Conversely, right voters are more often male than female and generally demonstrate a strong interest for financial and economic news. What additional insights do Facebook likes provide? To answer this question, we examine the most related pages when excluding political pages (see Table 10 in the Appendix and Fig. 3 for a summary).
With Facebook likes, analyzing cultural taste and differences in lifestyle between left and right voters is possible at a very detailed level. The pages most related to a left political leaning are (alternative) media outlets that are considered to be leaning more toward the left, and nonprofit organizations for climate and human rights, which can be considered as outspoken left-wing topics. For example, all else equal, the odds of demonstrating a left political ideology are almost 40% higher for someone who liked the website of the left-leaning newspaper De Morgen than for who did not, and they are almost 20% higher for someone who liked the nonprofit organization Amnesty International. Right pages are dominated by Flemish nationalistic content and memes. The odds of demonstrating a right political leaning are almost 20% higher for someone who liked the alternative right-wing news website SCEPTR or the popular mainstream paper HLN.be than for someone who did not. Similarly to the left pages, most of these pages carry a subtle, or sometimes an outspoken, reference to political ideology. In contrast, center voters like less explicit or implicit political pages on Facebook.
Similarly, to predict political parties, politically loaded pages are ranked high, such as newspapers or organizations with a certain ideology. Bluntly summarized, voters for the workers'-and social democratic party like solidarity content such as refugee or thirdworld organizations, green voters mainly like environmental pages (e.g. Greenpeace) Christian democratic voters are interested in religious organizations and the royal family, liberal voters in financial news and businesses, and finally the Flemish nationalists and extreme rightists tend to like Flemish nationalistic and identity content on Facebook. At first sight, most of these likes seem to be in line with the broader ideological or issue profile of the party family.
When delving deeper into aspects of lifestyle and cultural preferences our methodology allows us to focus on the most related pages per specific Facebook category. For example, when considering only the pages in the category Movies, left voters more often like adventure, romance and drama movies, such as The Hunger Games (Odds Ratio (OR) = 1.06) or 500 Days of Summer (OR = 1.06), and visit arthouse cinemas in larger cities. In contrast, action movies such as Scarface (OR = 1.04) and Fast & Furious (OR = 1.04) are mainly liked by voters on the right. In the category Music, the genres of alternative rock, blues, and experimental music are most related to a left political leaning (e.g. Bob Dylan (OR = 1.08) or Tom Waits (OR = 1.07)) whereas popular (hard) rock music (e.g. AC/DC (OR = 1.05)) is often liked by right voters, next to the genres of techno and electronic dance music (e.g. Justice (OR =1.04)). The same analysis can be done for other categories such as books, food, brands, sports, etc. Clearly, the lower odds ratios of a single movie or rock band compared to those of a civil-society organization or alternative news outlet indicate that their connection with ideological preference is present, but much more modest.

Conclusion and future research
The starting point of this study is that lifestyle and politics are closely related. Next to classical variables such as socio-demographics and issue preferences, our personal values and lifestyle choices correlate with our political preferences. The fact that (non-political) Facebook likes achieve high predictive accuracy (in addition to survey data) shows that they are indeed capturing additional information, which we argue are related to values and lifestyle.
Consequently, we looked into which aspects of our social life are most predictive for our political preference by analyzing the predictive performance of Facebook categories (e.g. movies, music, food, etc.) separately. Non-surprisingly, politicians, media and civil society are most predictive for political leaning, but also arts, culture, entertainment and books help to predict where people stand politically, whereas, in particular sports and travel are less predictive. This raises the question why some aspects of our social lives are more connected to political preference than others. For instance, our study seems to suggest that Belgian citizens are more likely to meet people with different political convictions when cheering for their favorite sports team than when going to a music festival. However, the extent to which these results can be generalized to other contexts or countries is uncertain. For instance, the study of [14] found that for Twitter users in the US musical preference was less connected to political preference than following sports teams. To reach insights that travel across countries and time periods, future research needs to be comparative or at least more similar in terms of data and methods.
In contrast with most previous research on lifestyle politics in the (polarized) two-party system in US, we examined lifestyle politics in a multi-party system, with much more subtle ideological differences between parties. Our study indicates that Facebook likes are less predictive for center voters and for traditional political parties. A less clear pattern is present in the characteristics and behavior of those voters compared to voters with a more outspoken ideological position. We find, for instance, that our social media data are much better in predicting who votes for an extreme-right party compared to the social democrats. Is this simply due to the more radical or straightforward ideological position of these parties, or do certain politicians link their ideology or party platform more to lifestyle choices?
The same goes for placement on the left-right scale. Our model is better at predicting positions to the left or right on this scale than predicting the center position. Again, this might simply be because center voters have less distinguishable profiles, but it may also be driven by the fact that respondents who in surveys position themselves at the center tend to be less politicized. Studies have demonstrated that at least part of these respondents have no clear political attitudes and tend to have lower political interest and less political knowledge [59][60][61]. In this sense, these scales cannot fully disentangle respondents who are politically neutral from those who are non-political. Nevertheless, the fact that it is also more difficult to predict actual votes for center parties-although to a lesser extent than the center ideological position-suggests that at least part of the difference can likely be attributed to the fact that the ideological position at the center is less connected to distinct lifestyle profiles. How to explain these differences and why certain ideological and political opinions are more connected to lifestyle than others, are interesting follow-up questions for social scientists, and electoral scholars in multi-party systems in particular. At this stage it remains unclear whether an ideological position leads to a specific lifestyle, or rather whether certain lifestyle choices facilitate the adoption of certain political attitudes. Future research should try to dig deeper into the causal mechanism of this intriguing finding.
Our analyses showed that with Facebook data, different interest categories can easily be analyzed and compared to improve our understanding of public opinion and voter behavior. As such, digital trace data can complement survey data. Although we believe that it is theoretically possible to achieve the same predictive power with targeted survey questions, we think it is not practically feasible in the context of exploratory research for two reasons. First, capturing this amount of detailed information using traditional surveys is very difficult, as it would require very long question batteries on different categories related to lifestyle and consumer choices. The survey of DellaPosta et al. [10] is a rare example of such an extensive survey, but requires a great effort from respondents. In addition, it would request scholars to define a priori the most important political lifestyle indicators. Therefore, we believe digital trace data are invaluable for exploratory research, as they can capture many different lifestyle options with relatively little effort. The patterns extracted from behavioral data can then help scientist to refine research questions and formulate targeted survey questions.
An important aspect of future research is then how to summarize the insights gained from Facebook likes in such a way that they can support theory building. Our classification using Facebook categories is one way. Other possibilities include dimensionality reduction or clustering Facebook likes to learn a certain structure from the data. For instance, movies or music of the same genre could be grouped together, or food of the same cuisine. These dimensions or clusters can help us to identify overarching cultural taste or lifestyle patterns which can thereafter be transformed into a series of survey questions and used for explanatory analysis.
At the same time, we do not argue that social media data can replace survey research. Probably both advanced survey research and social media data are needed to understand how lifestyle and political preferences influence each other. For instance, DellaPosta and colleagues (2015) explain the puzzling association between lifestyle preferences and political affiliation by the self-reinforcing effect of homophily and social influence. In short, they argue that people make lifestyle choices and opt for places where they meet likeminded people (homophily), in those (online) places people's attitudes are affected by exposure to each other (social influence). However, to better understand how this self-reinforcing effect takes place, further research is needed. By using a survey panel design [62] or by simultaneously analyzing friend networks on Facebook [63], the relationships between homophily and social influence could be further explored. Interviews or surveys could seek for explanations for this association and its direction by asking questions such as "Do you like to go to events where people have the same opinions as you?" or "Do you talk about politics or societal issues at music concerts?".
We argue that combing social media and survey data is not only useful to understand where "the cultural fault lines" are most persistent, but it can simultaneously provide insights in which (type of ) organizations, brands or events "unite" people with different political views. For example, our data indicate that people from all sides of the ideological spectrum "like" a political comedy show from the public broadcaster (De Ideale Wereld). From the perspective of the recently growing polarization of the audience for late night comedy in the US [64], this is not a trivial finding. Therefore, this type of knowledge may be increasingly relevant and useful, as ever more countries are facing increasing levels of (affective) polarization [20,21]. Furthermore, the growing affective polarization in society might make people more eager to express their political identity online by liking pages or following actors that are seen in line with their political views and cultural tastes. This would imply that similar factors drive both online and offline lifestyle choices. Further research could tackle this, among others by studying the linkage between social media data and political preferences over time.
In conclusion, we argue that fine-grained behavioral data are useful to discover unknown patterns or better understand existing relationships and are therefore equally valuable for political and social science research. However, the mining of behavioral and online data has raised new and unexpected ethical and regulatory questions related to online collection, storage, and use of human subjects' data [65]. We therefore underline the importance of an ethical discussion of the research design in journal manuscripts, which we included in Appendix A.1.5. Unfortunately, Facebook restricted data collection through the APIs of Facebook, Instagram, and other platforms it owns. While this intervention certainly is positive for the privacy protection of their users, it is also locking out third parties and diminishing transparency of the platform. Social scientists depend on the data collected through the API to investigate the impact of such platforms on our media and society. A potential promising avenue of Facebook is to work in a coordinated way with researchers. 16 However, some scholars question whether this initiative will provide sufficient support for free and independent scientific research [66]. Nonetheless, the potential of other social media data (e.g. Twitter) or other types of behavioral data (e.g. location data, payment data or browsing data) needs to be further explored as they complement the insights of survey research into different aspects of citizens' social and political life.

A.1 Data collection
The data collection took place in two waves in March and June 2018. In the first wave (March 2018), a detailed survey with questions on socio-demographics, media consump-

Figure 4
Overview of the front-and back end of the webpage that was created to collect Facebook like data for our study tion, political preference and attitudes was sent to a representative panel of around 4500 respondents. From these respondents, 524 agreed to give us access to their Facebook like data, via Facebook Login (for more details about Facebook Login see Appendix A.1.4). Based on the data of this small set of respondents, models were built to predict gender, age, political leaning and party preference based on Facebook likes. These initial prediction models were used to develop a tool that shows participants which characteristics can be inferred about them based on their Facebook likes. The goal of this tool was two-fold: (1) to convince people to participate in our study and (2) to create awareness about the personal information that you might disclose about yourself on Facebook. In the second wave (May-June 2018), this tool was disseminated through the online webpages of popular Flemish newspapers to reach a broad audience. Via an online webpage, users could give consent to collect their Facebook likes and they were also asked to complete 12 survey questions about their media consumption and political preference. In return for completing the survey, people could see a prediction of their gender, age, ideology and party preference based on the Facebook likes they provided. The Facebook likes and survey questions were anonymously and securely stored on a PostgreSQL database server. An overview of the front-and back end of our webpage can be found in Fig. 4.

A.1.1 Survey invitation
In wave 1, this survey invitation was sent to a representative panel (translated from Dutch to English): "People increasingly keep themselves up to date on the news via social media such as Twitter and Facebook. Therefore, as researchers, we are very interested in the public pages you follow (like) on Facebook and the posts you share or post on your timeline. In addition, we want to examine what companies (such as Cambridge Analytica) can deduct from the like behavior of Facebook users and provide users insight into this.
For this we need your help. By participating in our study, you allow us access to your Facebook data and will need to answer ten questions. We will use those questions to determine what we can predict of people on the basis of their page likes on Facebook.
Of course, all data will be processed and stored fully anonymously. This research pursues scientific aims only, and data will not be shared with others. Your privacy is of high importance to is. Read our privacy policy here.
If you have any further questions about this research you can read our frequently asked questions section. If you still have any questions left you can contact us via nws-data@uantwerpen.be. " In wave 2, this text was added to the previous invitation: 17 "Based on your Facebook likes we use data mining models to predict your gender, age, ideological leaning and party preference. We will also show you which specific likes (i.e.

Survey weights
We used Iterative proportional fitting (IPF) 18 to adjust survey weights to reflect the overall population distribution in terms of gender, age and education levels (see Table 5). However, the weighting process did not influence the outcomes all that much (e.g. the spearman rank correlation between the coefficients of the models with and without survey weights  applied was 0.94) and therefore the unweighted results will be reported in the manuscript.

A.1.4 Facebook login
Facebook Login 19 is a developers' tool that allows users to authenticate using their Facebook credentials and log into an app. Facebook Login also enables to ask for permissions to access items of user data. These permissions, if granted by the user, provide access to a subset of that person's data stored on Facebook. To be able to ask users for such permissions, passing a review process (called App Review 20 ) is necessary, so that Facebook can make sure that the data are not misused. 21 Researchers that wish to access a user's personal Facebook data thus need to obtain permission from Facebook as well as from the user himself. After we passed the review process, we integrated the Facebook Login button on our webpage. Respondents could click on this button, authenticate with their Facebook credentials, and grant us permission to their Facebook likes (see a mockup example in Fig. 5).

A.1.5 Privacy and ethical concerns
Facebook and other online environments provided tremendous opportunities for social science research to study human behavior. At the same time, the rapid increase in available online data and technological progress raised new and unexpected ethical questions related to online collection, storage, and use of human subjects' data [68]. These questions become urgent as the data and research possibilities move well beyond those typical in the social sciences, to more directly address sensitive aspects of human behavior and our daily lives. Yet, the variety in data sources, research topics, and methodological approaches complicates the draft of universally applicable ethical guidelines. Rather, all researchers engaging with (data from) human subjects, have the ethical responsibility to minimize potential harm. Even seemingly benign data can contain sensitive or private information and has the potential to impact people's lives [68]. From a legal standpoint, the introduction of the European Union's General Data Protection Regulation (GDPR) in May 2018, offers a regulatory framework for researchers collecting (online) personal data. Next to that, various instances are updating ethical guidelines for Internet research and human subjects [69,70] We base an ethical discussion of our research and data collection on the work of [65] that unites GDPR regulation and moral concerns into one concise data science framework. We present a version of this framework, tailored to our study in Fig. 6 and will briefly discuss its elements below.
The data needed for this study consist of survey questions and Facebook like data. All data were collected with informed consent. We clearly stated what data would be gathered and for what purpose. Respondents were required to agree with our privacy policy to participate in our study, and have the right to stop their collaboration at any time and ask for their data to be removed. Furthermore, no other data than the data needed for our study were collected. The data are anonymized immediately, meaning that all personal identifiers (i.e. name and Facebook id) are permanently removed. Yet, we are aware of the risk of reidentification [71], especially since like behavior is quite unique per individual. It is increasingly hard to transform the data in such a way that it protects sensitive information [72]. Therefore the -anonymized-data will under no circumstances be made public or shared with other instances and are stored locally and encrypted for optimal security. The data are stored and used for scientific purposes only and will be removed permanently after ten years. Results will be shown on an aggregate level only and will never contain individual examples, nor will we explicitly mention pages that were liked by less than 50 participants. Albeit absolutely necessary to protect privacy of our respondents, our privacy policy demonstrates two important implications for scientific generalizability. First, informed consent automatically induces self-selection or consent bias, and generalizing our results to the total population is difficult. Secondly, since we can only share aggregated results, this complicates the reproducibility of our study. However, the complete computer code and aggregated data can be made publicly available to assess scientific quality. Finally, a last important aspect of ethical research conduct is clear communication with data subjects (participants) before and after the study. Before the start of our data collection we compiled a clear privacy statement, in collaboration with the university's data protection officer and with approval by the Ethics Committee for the Social Sciences and Humanities (EA SHW). 22 The full privacy statement was communicated with participants and can be found on our website, 23 where we have also communicated a concise summary of our results. To conclude, we believe that clear communication about ethical research design in scientific journals benefits the ethical debate and framework for online human subject research. We therefore encourage all researchers and journals to include such discussion.

A.2 Statistical comparison of models
We use a corrected paired differences t-test (based on 10-fold cross-validation results) to compare the AUC of the different models [55] and apply the Bonferonni correction for multiple pairwise comparison [57,58]. The results can be found in Table 6 (ideological leaning) and Table 7 (party choice). For leaning, the number of models that is being compared is 3 (the number of ideologies) times 10 (the number of folds per binary model), for party choice this is 7 (number of parties) times 10 (number of folds per binary model). Also the predictive performance of non-political Facebook likes for the ideologies and parties are compared, here the number of models is 10 (number of folds). Secondly, in Table 8 and 9 we show the results for a Wilcoxon signed-rank test, a non-parametric alternative to the paired t-test [56]. The results for this non-parametric test largely correspond to the parametric t-test.