Analysis and classification of privacy-sensitive content in social media posts

User-generated content often contains private information, even when it is shared publicly on social media and on the web in general. Although many filtering and natural language processing approaches for automatically detecting obscenities or hate speech have been proposed, determining whether a shared post contains sensitive information is still an open issue. The problem has been addressed by assuming, for instance, that sensitive content is published anonymously, on anonymous social media platforms, or with more restrictive privacy settings, but these assumptions are far from realistic, since the authors of posts often underestimate or overlook their actual exposure to privacy risks. Hence, in this paper, we address the problem of content sensitivity analysis directly, by presenting and characterizing a new annotated corpus of around ten thousand posts, each annotated as sensitive or non-sensitive by a pool of experts. We characterize our data with respect to the closely related problem of self-disclosure, pointing out the main differences between the two tasks. We also present the results of several deep neural network models that outperform previous naive attempts at classifying social media posts according to their sensitivity, and show that state-of-the-art approaches based on anonymity and lexical analysis do not work in realistic application scenarios.


Introduction
The Web is pervaded with user-generated content, as Internet users have multiple and increasing ways to express themselves. They can post reviews of products, businesses, services and experiences; they can share their thoughts, pictures and videos through different social media platforms; they can reply to surveys, forums and newsgroups; and some of them have their own blogs and web pages. Many companies encourage this behavior, because user-generated content attracts other users more than professional content does, and this increases their engagement on web platforms. However, texts, photos and videos posted by users may harm their own and others' privacy, thus exposing themselves (and other users) to many risks, from discrimination or cyberbullying to fraud and identity theft. Although user-generated content is often subject to moderation, also adopting automated recognition techniques such as the detection of inappropriate content [1], hate speech [2] and cyberbullying [3], there is no control on the sensitivity of posted content. It is worth noting that social media and forums are not the only platforms that store and publish private content. Surveys and contact/helpdesk forms are other examples where users are free to enter any type of text and other content, together with other, more structured, personal information. Often, such data need to be transferred to third parties to be analyzed, and the lack of control on free-text fields could put the privacy of respondents at risk. A common quick solution consists in removing all such fields entirely or sanitizing them automatically or by hand. However, existing automatic sanitization approaches [4][5][6] try to replace sensitive terms belonging to specific domains (e.g., medical or criminal records) with more general ones, and rely on existing knowledge bases and natural language processing techniques such as named entity recognition and linking.
In some cases, sanitization techniques destroy the informativeness (and sometimes the very meaning) of the text.
Self-disclosure, i.e., the act of revealing personal information to others [7], is a social phenomenon that has also been extensively studied in relation with online forums [8], online support groups [9] and social media [10]. Although self-disclosure is closely related to content sensitivity, it has often been investigated in the context of intrinsically sensitive topics, such as forums related to health issues, intimate relationships, or sex life, or forum sections explicitly devoted to people searching for support from strangers [11]. In these settings, the identity of the users is often masked by pseudonyms or kept entirely anonymous. Instead, general purpose social media platforms usually encourage the use of real identities, although this does not prevent their users from disclosing very private information [12][13][14]. Moreover, the sensitivity of social media texts is harder to detect, because the context of a post plays a fundamental role as well. Finally, social media posts are sometimes very short; yet, they may disclose a lot of private information.
To better understand the problem, let us consider the post in Fig. 1: it does not mention any sensitive term or topic, but it discloses information about the author and his friend Alice Green, and it contains hidden spatiotemporal references that are immediately clear from the context (the author is about to leave for a journey, which implies that he will be away from home for a month, a potentially sensitive piece of information). On the other hand, there may exist posts that contain very sensitive terms but are not sensitive at all when contextualized correctly. An example is given by the post in Fig. 2, where several sensitive terms (struggling, suffering, COVID-19) and topics (health, economic crisis) are mentioned, but no private information is disclosed about any specific person. In these cases, the automatic assessment of text sensitivity could save a lot of rich information and help automate the sanitization process. Furthermore, an automatic warning system able to detect the true potential sensitivity of a post may help a user decide whether to share it or not.
Indeed, the problem of assessing and characterizing the sensitivity of content posted in general purpose social media has already been studied, but, due to the unavailability of specifically annotated text corpora, it has been tackled through the lens of anonymity, by assuming that sensitive contents are posted anonymously [15,16], on anonymous platforms [17], or with more restrictive privacy settings [18], while non sensitive ones are posted by identifiable users and/or made available to everyone. However, as we pointed out in [19], anonymity and sensitivity are not straightforwardly related to each other. The decision of posting anonymously could be determined uniquely by the sensitivity of the topic, but not by the sensitivity of the posted content itself. Analogously, many non anonymous social media posts contain very private information, simply because their sensitivity [12] or their visibility [14] is underestimated by the content authors. These considerations make what we call the "anonymity assumption" too simplistic, or even unrealistic in practice. Other existing annotated corpora concern posts extracted from Reddit [11] and from support groups for cancer patients [8,9]. Unfortunately, these corpora focus on very specific (and intrinsically sensitive) topics or give a very restrictive interpretation of self-disclosure: in [11], for instance, only posts disclosing personal information or feelings about the authors are annotated as sensitive; moreover, that corpus has a strong focus on mutually supportive communities and intimate relationships. To cope with these problems, we have very recently introduced content sensitivity analysis, a more general machine learning task aimed at assigning a sensitivity score to content [19]. However, in that preliminary work, we modeled the problem as a simple bag-of-words classification task on a very small text dataset (fewer than 700 social media posts) with mild accuracy results (just above the majority classifier).
In this paper, we address the limitations of previous works by analyzing a new large corpus of nearly 10,000 text posts, all annotated as sensitive or non sensitive by humans, without assuming any implicit and forced link between anonymity and privacy. We provide an in-depth analysis of sensitive and non sensitive posts, and introduce several sequential deep neural network models that outperform bag-of-words classifiers. We also show that models trained according to the anonymity assumption do not work properly in realistic scenarios. Moreover, we study how the problem of self-disclosure is related to ours and show that existing text corpora are not adequate for analyzing the sensitivity of posts shared in general purpose social media platforms. To the best of our knowledge, this is the first work addressing the problem of directly and efficiently evaluating the real sensitivity of short text posts. It thus has the potential to become a new gold standard in content sensitivity analysis and self-disclosure, and could open new research opportunities for improving users' awareness of privacy and for performing privacy risk assessment or sanitization on data containing free-text fields.
Our paper is organized as follows. In Sect. 2, we review some closely related works and discuss their limitations. We formally define our concept of privacy-sensitive content, describe how we constructed our annotated corpus, and present the datasets used in our analysis in Sect. 3. Section 4 contains an in-depth analysis of the lexical features characterizing sensitive content in the different datasets, while, in Sect. 5, we report the results of multiple classification tasks conducted under different settings. In Sect. 6, we discuss the results of the experiments in more detail and draw some general conclusions. Finally, Sect. 7 concludes by also presenting some future research perspectives.

Related work
With the success of online social networks and content sharing platforms, understanding and measuring the exposure of user privacy on the Web has become crucial [20,21]. Thus, many different metrics and methods have been proposed with the goal of assessing the risk of privacy leakage in posting activities [22,23]. Most research efforts, however, focus on measuring the overall exposure of users according to their privacy settings [24,25] or their position within the network [14]. Instead, the problem of characterizing and detecting the sensitivity of user-generated content has been the subject of very few studies in the last decade. One of the first works in this direction tried to address the problem with a lexicographic approach [26,27]. Indeed, similarly to sentiment analysis or emotion detection, linguistic resources may help identify sensitive content in texts. In their work, Vasalou et al. leverage prototype theory and traditional theoretical approaches to construct and evaluate a dictionary intended for content analysis. Using an existing content analysis tool applied to several text corpora, they evaluate dictionary terms according to privacy-related categories. Interestingly, the same authors note that there is no consistent and uniform theory of privacy-sensitivity.
To bypass this problem, several authors adopt a simplification: they assume that the sensitivity of content is strictly related to the choice of posting it anonymously. This also makes the construction of annotated corpora easier, because one just needs to consider content posted anonymously as sensitive, while posts shared with identifiable information can be considered non sensitive. Peddinti et al., for instance, adopt this strategy to analyze anonymous and non anonymous posts on a famous question-and-answer website [15]. They evaluate different basic machine learning models to predict whether a particular answer will be written anonymously. Similarly, Correa et al. define the sensitivity of a social media post as the extent to which users think the post should be anonymous [17]. They compare content posted on anonymous and non-anonymous social media sites both in terms of topics and from the linguistic point of view, and conclude that sensitivity is often subjective and may be perceived differently according to several aspects. Very recently, the same authors have published a sanitized version of nearly 90 million posts downloaded from Whisper, an anonymous social media platform [28]. Biega et al. conduct a similar study, but restrict the analysis to sensitive topics with the aim of measuring the privacy risks of users [29]. It is worth noting that all these studies conclude that sensitivity is subjective.
Content sensitivity has been associated to privacy settings as well: similarly to anonymity, contents posted with restricted visibility are deemed sensitive. Yu et al. analyze sensitive pictures by learning the object-privacy correlation according to privacy settings to identify categories of privacy-sensitive objects using a deep multi-task learning architecture [18]. They also use their model to customize privacy settings automatically and to sanitize images by blurring sensitive objects.
Text sanitization is another close research field, whose goal is to find and hide personally identifiable information while preserving text utility. To this purpose, Jiang et al. present an information theoretic approach that hides sensitive terms behind more general but semantically related terms [30]. Similarly, Sanchez et al. propose several information theoretic approaches that detect and hide sensitive textual information while preserving its meaning by exploiting knowledge bases [4,31,32]. Iwendi et al., instead, focus on unstructured medical datasets and propose a framework to completely anonymize textual clinical records by exploiting regular expressions, dictionaries and named entity recognition. Their method sanitizes the detected protected health information with its available generalization, according to a well-known medical ontology [5]. Finally, Hassan et al. use word embeddings to evaluate the disclosure caused by textual terms on the entity to be protected, according to the similarity between their vector representations [6]. All the above mentioned methods rely on the identification of named entities or quasi-identifying terms, and try to replace them with semantically close, although more general, terms. Hence, they all leverage some kind of knowledge base or ontology, and work well on specific domains (e.g., medical documents, criminal records and so on). Instead, we address a more general notion of sensitivity, which also includes texts that may reveal sensitive or simply private habits, feelings or characteristics of users.
A closely related concept is the so-called self-disclosure, defined as the act of revealing personal information to others [7]. Self-disclosure has been widely studied well before the advent of modern social media, in particular for its implications in online support groups, online discussion boards and forums. For instance, Barak et al. study, among other things, the reciprocity of self-disclosure in online support groups and discussion forums, showing that there are substantial differences in how people behave in these two media types [8]. Yang et al., instead, analyze the differences in the degree of positive and negative self-disclosure in public and private channels of online cancer support groups [9]. They show that people tend to self-disclose more in public channels than in private ones. Moreover, negative self-disclosure is also more present in public online support channels than in private chats or emails. To achieve these results, the authors study lexical, linguistic, topic-related and word-vector features of a relatively small annotated corpus using support vector machines. Ma et al. conduct a questionnaire-based mixed-factorial survey experiment to answer several questions concerning the relationships that regulate anonymity, intimacy and self-disclosure in social media [10]. They show, for instance, that intimacy always regulates self-disclosure, while anonymity tends to increase the level of self-disclosure and decrease its regulation, in particular for content of negative valence. Differently from the previous works, Jaidka et al. directly address the problem of self-disclosure detection in texts posted in online forums, by reporting the results of a challenge concerning a relatively large annotated corpus made up of top posts collected from Reddit [11]. Contrary to [28], in this corpus all posts are directly annotated according to their degree of informational and emotional self-disclosure.
The authors also intend to investigate the emotional and informational supportiveness of posts and to model the interplay between these two variables. Unfortunately, this corpus is not entirely suited to our purpose (i.e., detecting the sensitivity of text content in general purpose social media platforms) for four main reasons: first, the focus is on self-disclosure, although a post may reveal sensitive information about other people as well; second, posts on Reddit are published under pseudonyms, while general purpose social media foster the use of real identities; third, a large part of the posts has been extracted from a subreddit explicitly devoted to people searching for other users' support; last but not least, all posts concern intimate relationships by design.
In conclusion, in our work, we do not make any "anonymity" or "privacy settings" assumption, since it has been shown that users tend to underestimate or simply overlook their privacy risks [12][13][14]. Consequently, we analyze and characterize sensitive posts directly. In a very preliminary version of our work, we tried to give a more generic definition of sensitivity [19]. However, our model was trained on very few posts using simple bag-of-words classifiers, thus achieving mild accuracy results. In this work, we construct a much larger and more reliable dataset of social media posts, directly annotated according to their sensitivity, and use more sophisticated and accurate models to decide whether a post is sensitive or not. Additionally, we provide further lexical and semantic insights about sensitive and non sensitive texts.

An annotated corpus for content sensitivity
In this section, we introduce the data that we use in our study. We first provide a conceptualization of "content sensitivity" also in relation with existing similar concepts; then, we describe how we construct our annotated corpus and provide some characterization of it.

Privacy-sensitive content
Content sensitivity is strictly related to the concept of self-disclosure [7], a communication process by which one person reveals any kind of personal information about themselves (e.g., feelings, goals, fears, likes, dislikes) to another. It has been described within social penetration theory as one of the main factors enabling relationship development [33,34]. Due to the peculiarities of online communication (and its differences w.r.t. face-to-face communication), the social and psychological implications of self-disclosure on the Internet have been extensively studied as well [35]. For its implications on user privacy, self-disclosure has also been investigated in relation with privacy awareness, policies and control [36], and some rule-based detection techniques for self-disclosure in forums have been proposed [37], leading to some relatively large annotated corpora [11].
In this paper, we refer to content sensitivity as a more general concept than self-disclosure. In [19], we gave a preliminary, subjective and user-centric definition of privacy-sensitive content. In that work, we stated that a generic user-generated content object is privacy-sensitive if it makes the majority of users feel uncomfortable in writing or reading it, because it may reveal some aspects of their own or others' private life to unintended people. This definition is motivated by the fact that each social media platform has its own peculiarities, and the amount and quality of social ties also play a fundamental role in regulating self-disclosure [10]. However, it has many drawbacks, since it relies on the subjective perception of users and on a notion of discomfort that can also be driven by other external factors. This also conditioned the preliminary annotation of a corpus, leading to poor detection results. Consequently, in this paper, we adopt a more objective definition of privacy-sensitive content.
Definition 1 (Privacy-sensitive content) A generic user-generated content object is privacy-sensitive if it discloses, explicitly or implicitly, any kind of personal information about its author or other identifiable persons.
Differently from the concept of self-disclosure, our definition explicitly mentions the disclosure of information concerning persons other than the author of the content. Furthermore, it clearly includes content that implicitly reveals personal information of any kind. For instance, the sentence "There's nothing worse than recovering from COVID-19" is apparently neutral. However, it is very likely that the person who makes this claim has personally experienced the effects of a SARS-CoV-2 infection.

Datasets
Most previous attempts at sensitivity analysis on text content assume that sensitive posts are shared anonymously, while non sensitive posts are associated to real social profiles. Other available corpora do not make that distinction explicitly, but have been collected in very specific domains (e.g., health support groups [9]) or focus on limited types of self-disclosure (e.g., intimate/family relationships [11]). Hence, we will consider a new generic dataset with explicit "sensitive/non-sensitive" annotations. To this purpose, we first need a corpus of mixed sensitive and non-sensitive posts. Twitter is not the most suitable source for that, because most public tweets are of limited interest to our analysis, while tweets with restricted access cannot be downloaded. Moreover, it is well known that users are significantly more likely to provide a more "honest" self-representation on Facebook [38,39]. Consequently, Facebook posts are better suited to our purposes, but content posted on personal profiles cannot be downloaded, while public posts and comments published in pages do not fit the bill as they are, in general, non sensitive. Furthermore, they would require a huge sanitization effort in order to make them available to the research community. Fortunately, one of the datasets described in [40], and released publicly, has all the required characteristics. It is a sample of 9917 Facebook posts (status updates) collected for research purposes in 2009-2012 within the myPersonality project [41], by means of a Facebook application that implemented several psychological tests. The application obtained consent from its users to record their data and use it for research purposes. All the posts have been sanitized manually by their curators: each proper name of a person (except for famous ones, such as "Chopin" and "Mozart") has been replaced with a fixed string. Famous locations (such as "New York City" and "Mexico") have not been removed.
Almost all posts are written in English, with an average length of 80 characters (the minimum and maximum lengths are, respectively, 2 and 435 characters). Since the recruitment was carried out on Facebook, the dataset suffers from the typical sample bias of the Facebook environment (some groups of people might be under- or over-represented). However, the same problem applies to other datasets as well [9,11,28].

Table 1 Guidelines and examples for the annotations

Sensitive
Guidelines: A post is "sensitive" if the text is understandable, i.e., written in clear English, and the annotator is certain that it contains information that violates a person's privacy, not necessarily that of the author of the post. A text violates a person's privacy if it contains the following types of information (non-exhaustive list):
• current or upcoming moves;
• information on events in the private sphere;
• information on health or mental status;
• information about one's habits;
• information that can help geolocalize the author of the post or other people mentioned;
• information on the sentimental status;
• considerations that may hint at the political orientation or religious belief of a mentioned person.
In general, given the subjectivity of the topic, a post can be sensitive if the person reading it feels discomfort due to the private content it contains (and not to other moral considerations).
Examples: "...heading to the gym with *PROPNAME*, *PROPNAME* and my sista!!"; "is feeling uninspired and unmotivated. Can someone else please pay her bills and move her into her new apartment?"; "is very sore and very tired..."; "Just wanted to thank everyone for all the support (and great tips) yesterday, it meant a lot! made it through yesterday without smoking at all...and still going strong! :)"; "Lazy day around the house after the family has left."; "ARGH. 2 whole years! Congratulations, *PROPNAME*! You've tolerated me for a total of 730 days! Plus 'getting to know you' time... hahaha!"; "is shaking his head wondering when some of his conservative christian friends became so hate filled that they will join any anti-obama group on facebook."

Non-sensitive
Guidelines: A post is "non-sensitive" if the text is understandable, i.e., written in clear English, and the annotator is sure that it does not contain information that violates privacy, according to the indications of the "sensitive" category.
Examples: "Fabulous weekend :-)"; "When we are no longer able to change a situation - we are challenged to change ourselves. Viktor E. Frankl"; "loves summer evenings"

Unknown
Guidelines: A post is of "unknown sensitivity" if the text is understandable, i.e., written in clear English, but the annotator is unable to tell whether it contains privacy-sensitive information, because (non-exhaustive list of motivations):
• the context is not sufficient to understand the sensitivity of the message;
• the post is incomplete, i.e., the text does not contain the whole post, and from the available portion one is unable to understand its sensitivity;
• the post contains a reference to a media item (an image, a link, a GIF) which is considered essential for understanding the message, and the text alone is not sufficient to understand its sensitivity.
Examples: "black"; "Goodbye *PROPNAME*. :("; "I know 6 sick people at the moment, and now I'm..."; "Check out what I've got written for The Book of *PROPNAME*. [link]"

Unintelligible
Guidelines: A post can be marked as "unintelligible" when:
• it is written with slang/abbreviations or a grammar that does not render it understandable from a lexical point of view;
• the post is written in a language other than English.
Examples: "hooked on PBS"; "fml"; "wahhhh,. di na ko. hurot na jud ako kwarta aning AI. huhuhu"; "Pas de mauvaise nouvelle pour l'instant! Je presume donc que c'est une bonne chose!"

All 9917 posts have been proposed to a pool of 12 volunteers (7 males and 5 females, aged from 24 to 41 years, mainly postgraduate/Ph.D. students and researchers), so as to have exactly three annotations for each post. Hence, we have formed four groups, each consisting of three annotators; every group has been assigned from 2479 to 2485 posts. For each post, the volunteers had to say whether they thought the post was sensitive, non-sensitive, or of unknown sensitivity. The choices also include a fourth option, unintelligible, used, for instance, to tag posts written in a language other than English. For each category, the annotators were given precise guidelines and examples (see Table 1). According to our guidelines, a post is "sensitive" if the text is understandable and the annotator is certain that it contains information that violates a person's privacy (not necessarily of the author of the post), because it contains, for instance: information about current or upcoming moves, on events in the private sphere, or on health or mental status; information about one's habits or that can help geolocalize the author of the post or other people mentioned; information on the sentimental status; or considerations that may hint at the political orientation or religious belief.
At the end of the period allowed for the annotation, all volunteers had accomplished their assigned task, and we computed some statistics regarding their agreement. In detail, for each group, we computed the Fleiss' κ statistic [42], which measures the reliability of agreement between a fixed number of annotators. The results (reported in Table 2) show fair to moderate agreement in all groups, also considering that the number of possible categories is four. This result also demonstrates that the task of deciding whether a post is sensitive or not is not straightforward, as shown by the percentage of identical annotations in each group: overall, at least 93.91% of posts have at least two identical annotations, but the percentage drops to 42.97% if we look for perfect agreement (three unanimous annotators). Apparently, there are differences among the four groups, but they are smoothed out by only considering posts with at least two "sensitive" or "non-sensitive" tags, as we will clarify later.
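The agreement statistic above can be computed directly from the raw annotation counts. The following is a minimal sketch of Fleiss' κ for a fixed number of annotators per post; function and variable names are ours, for illustration, and do not come from the original study.

```python
from collections import Counter

def fleiss_kappa(annotations, categories):
    """Fleiss' kappa for a fixed number of annotators per item.

    `annotations` is a list of items, each a list of category labels
    (one per annotator); every item must have the same number of labels.
    """
    n = len(annotations[0])          # annotators per item (3 in our setting)
    N = len(annotations)             # number of annotated items
    # n_ij: how many annotators assigned category j to item i
    counts = [Counter(item) for item in annotations]
    # P_i: extent of agreement on item i
    P_i = [(sum(c * c for c in cnt.values()) - n) / (n * (n - 1))
           for cnt in counts]
    P_bar = sum(P_i) / N             # mean observed agreement
    # p_j: overall proportion of assignments to category j
    p = [sum(cnt[cat] for cnt in counts) / (N * n) for cat in categories]
    P_e = sum(pj * pj for pj in p)   # expected agreement by chance
    return (P_bar - P_e) / (1 - P_e)
```

With our four categories, κ is computed once per annotator group over its 2479-2485 assigned posts.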
In Table 3 we report the details of the annotations. Each column reports the number of posts that received exactly one, two or three annotations for each class. From this table it emerges that the majority (7923) of posts have been annotated at least once as non-sensitive, while the number of posts that have received at least one "sensitive" annotation is much smaller (5826). In addition, the number of posts with unknown sensitivity drops drastically from 1529 to 7 when the number of annotations considered increases from one to three. This means that, for almost all posts (except unintelligible ones), at least one annotator was able to determine their sensitivity.
Starting from all the annotations, we generate two datasets. The first one, which we call SENS2, contains all posts that received at least two "sensitive" or "non-sensitive" annotations. The second, called SENS3, contains all posts that received exactly three "sensitive" or "non-sensitive" annotations. With this choice, we automatically exclude all posts that have been annotated as "unknown" or "unintelligible" by at least two annotators. Notice that the proportion of sensitive posts is almost the same in both samples. The details of these two datasets are reported in Table 4. The average length of the posts (in terms of number of words) is relatively small (15 words, on average), a typical characteristic of social media text content, but there is a high variability (some posts are more than 85 words long). For comparison purposes, we also use two additional datasets. The first consists of top posts extracted from two subreddits on Reddit [11]: "r/CasualConversations", a subcommunity where people are encouraged to share what's on their mind about any topic, and "r/OffmyChest", a mutually supportive community where deeply emotional things are shared. By design, all posts mention one of the following terms: boyfriend, girlfriend, husband, wife, gf, bf. The annotators were required to annotate each post according to the amount of emotional and informational disclosure it contains. Here, we consider all posts that do not disclose anything as "non sensitive"; all remaining posts are tagged as "sensitive", in accordance with the choices made for annotating our dataset. We consider all the 12,860 labeled training data samples and the 5000 labeled test data samples. Overall, 10,793 posts are labeled as "sensitive", and 7067 as "non sensitive". All the details are given in Table 4. The reader is referred to [11] for further details about this dataset.
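The construction of SENS2 and SENS3 from the raw annotations amounts to a simple majority/unanimity filter. A minimal sketch follows; the data layout and function names are illustrative, not taken from the original pipeline.

```python
from collections import Counter

def build_datasets(annotated_posts):
    """Split annotated posts into the SENS2 and SENS3 samples.

    `annotated_posts` maps each post text to its three labels, e.g.
    {"post text": ["sensitive", "sensitive", "non-sensitive"], ...}.
    """
    sens2, sens3 = {}, {}
    for text, labels in annotated_posts.items():
        counts = Counter(labels)
        n_sens = counts["sensitive"]
        n_nonsens = counts["non-sensitive"]
        # SENS2: at least two identical "sensitive" or "non-sensitive"
        # tags; this automatically drops posts tagged "unknown" or
        # "unintelligible" by two or more annotators.
        if n_sens >= 2:
            sens2[text] = "sensitive"
        elif n_nonsens >= 2:
            sens2[text] = "non-sensitive"
        # SENS3: unanimous agreement on one of the two classes.
        if n_sens == 3:
            sens3[text] = "sensitive"
        elif n_nonsens == 3:
            sens3[text] = "non-sensitive"
    return sens2, sens3
```

By construction, SENS3 is a subset of SENS2, which is consistent with the similar class proportions observed in the two samples.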
The second dataset is an anonymity-based corpus built following the example of [17], where sensitive posts consist of anonymous posts shared on Whisper (a popular social media platform allowing its users to post and share photo and video messages anonymously), while non-sensitive posts are taken from Twitter. Here, we generate ten samples, each consisting of a subset of 3336 sensitive posts selected randomly from a large collection of sanitized Whisper posts [28], and a subset of 5429 non-sensitive posts randomly picked from a large collection of tweets [43]. The numbers of sensitive and non-sensitive posts have been chosen to mimic the distribution observed in dataset SENS2. We filter out posts containing retweets or placeholders, as well as posts that are shorter than 9 characters or not written in English (according to the fastText model [44]). Then, from each remaining post, we remove any mention and hashtag, in order to obtain samples of posts similar to the ones in SENS2 and SENS3. The ten samples are needed to limit any sampling bias.
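The cleaning steps above can be sketched as a single per-post filter. In this sketch, the language check is a pluggable predicate (in our pipeline it is the fastText language-identification model [44]), and the retweet filter is a simplified stand-in; all names are illustrative.

```python
import re

# Matches Twitter-style mentions and hashtags, e.g. "@alice", "#fun".
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")

def clean_post(text, is_english=lambda t: True, min_len=9):
    """Normalize one post, or return None if it should be filtered out.

    `is_english` is a pluggable language check (fastText in the paper);
    `min_len` reproduces the 9-character minimum-length filter.
    """
    if text.startswith("RT ") or "RT @" in text:  # crude retweet filter
        return None
    cleaned = MENTION_OR_HASHTAG.sub("", text)    # drop mentions/hashtags
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    if len(cleaned) < min_len or not is_english(cleaned):
        return None
    return cleaned
```

Applying this filter to each random draw, until the target sample sizes (3336 and 5429) are reached, yields one of the ten WH+TW samples.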

Understanding sensitivity
In this section, we analyze our data in detail with the aim of characterizing sensitive and non-sensitive posts from a linguistic point of view. The goal of this analysis is to understand whether lexical features may help distinguish sensitive and non-sensitive content.

Analysis of the words
As a first analysis, we extract the most relevant terms from each class of posts in all datasets considered in our study. To this purpose, all terms are first stemmed. Then, we compute the total number of occurrences of each word and its relative frequency for each class (sensitive and non-sensitive), i.e., the number of occurrences of the word in that class divided by its total number of occurrences. To avoid any bias, the number of occurrences and the relative frequency are computed on 10 random samples consisting of 500 sensitive and 500 non-sensitive posts. The results confirm the limitation already observed for the approach of [17], i.e., the fact that using different sources to gather anonymous and non-anonymous posts introduces a bias also in terms of discussion topics. Moreover, Table 7 shows the intrinsic bias of dataset OMC: the most prominent words for the sensitive class are related to friendship and personal feelings and wishes (e.g., friend, feel, would).
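The per-class relative frequency can be computed in a few lines (a simplified sketch: we tokenize by whitespace and skip the stemming step for brevity; in the paper the counts are averaged over the 10 random samples):

```python
from collections import Counter

def relative_frequencies(sensitive_posts, non_sensitive_posts):
    """Relative frequency of each term in the sensitive class:
    occurrences in sensitive posts / total occurrences of the term."""
    def count(posts):
        c = Counter()
        for post in posts:
            c.update(post.lower().split())
        return c

    sens = count(sensitive_posts)
    nonsens = count(non_sensitive_posts)
    vocab = set(sens) | set(nonsens)
    return {w: sens[w] / (sens[w] + nonsens[w]) for w in vocab}
```

Terms with a relative frequency close to 1.0 are characteristic of the sensitive class, those close to 0.0 of the non-sensitive class.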

Analysis of the lexical features
As in [17], we categorize all words contained in each post into the different dictionaries provided by LIWC [45]. LIWC is a hierarchical linguistic lexicon that classifies words into meaningful psychological categories: for each post, LIWC counts the percentage of words that belong to each category. In addition, we also consider another, more specific, lexical resource, the Privacy Dictionary [26,27]. It consists of dictionary categories derived using prototype theory according to traditional theoretical approaches to privacy. The categories, together with some example words, are presented in Table 9. Given 10 random samples consisting of 500 sensitive and 500 non-sensitive posts, we calculate the average percentage of sensitive and non-sensitive posts that contain words belonging to each dictionary, as well as the sensitive-to-non-sensitive ratio for each dictionary. For the psychological categories, we only list the dictionaries whose ratio exceeds 1.3 (i.e., over-represented in sensitive posts) or is below 0.7 (i.e., under-represented in sensitive posts) in each dataset. The results are shown in Table 10 (categories with a high sensitive-to-non-sensitive ratio are presented in bold), while the ratios for privacy-related categories are all reported in Table 9. It is worth noting that the number of relevant dictionaries in Table 10 differs significantly from one dataset to another: it is minimum in SENS2 and maximum in WH+TW. Interestingly, some categories are relevant in all datasets (e.g., some personal pronouns, family, friends and female), while others are specific to individual corpora (anxiety and feelings appear only in OMC and WH+TW, money only in SENS2 and SENS3). Overall, lexical features seem to discriminate better on the OMC and WH+TW datasets than on ours, and this observation is even more evident for the Privacy Dictionary (Table 9).
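The dictionary-based representation can be sketched as follows (a simplified illustration: `categories` is a hypothetical mapping from category names to word sets, standing in for the actual LIWC and Privacy Dictionary resources, which are distributed separately):

```python
def dictionary_features(post, categories):
    """Represent a post as the percentage of its words falling in each
    dictionary category (LIWC-style counting).

    `categories` maps a category name to a set of words; real LIWC
    categories also use stem patterns, which we omit here.
    """
    words = post.lower().split()
    if not words:
        return {name: 0.0 for name in categories}
    return {
        name: 100.0 * sum(w in vocab for w in words) / len(words)
        for name, vocab in categories.items()
    }
```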
In our data, with the exception of the Law and Intimacy categories, almost all privacy categories are less represented in sensitive posts than in non-sensitive ones (ratios below one). Instead, almost all privacy categories are over-represented in sensitive posts belonging to WH+TW. In OMC, ratios are in general closer to one. These results confirm that relying on the anonymity of sources may introduce a strong lexical bias, while addressing sensitivity directly yields fewer distinguishing lexical properties. This consideration is confirmed by a further experiment conducted to verify whether lexical features can help discriminate sensitive posts from non-sensitive ones. To this purpose, we set up a simple binary classification task, using a logistic regression (LR) classifier, a support vector machine (SVM) classifier with linear kernel, and a random forest (RF) classifier, all with default parameters. Each dataset is randomly divided into training (75%), validation (15%) and test (10%) sets: the same sets are employed in each experiment presented in this paper. Here, the training set is used for training the model and the test set for performance evaluation. We train and test the classifiers on different feature sets: the one including all dictionaries, the one including only psychological dictionaries, and the one consisting only of privacy categories. Each post is then represented by a vector whose values are the percentages of words in the post belonging to each dictionary. Values are standardized to have zero mean and unit variance. According to the results presented in Table 11, WH+TW seems to take greater advantage of lexical features than all other datasets (in particular, OMC and the equally-sized SENS2). Another important observation concerns the impact of privacy categories on classification.
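The split and the standardization step can be sketched as follows (a plain-Python illustration; in practice the same operations are provided by scikit-learn's `train_test_split` and `StandardScaler`, with the scaler fit on the training data only):

```python
import random
import statistics

def split_dataset(items, seed=0):
    """Randomly split into training (75%), validation (15%), test (10%)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(0.75 * n), int(0.15 * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def standardize(train_column, column):
    """Zero-mean, unit-variance scaling of one feature column, with the
    mean and standard deviation fit on the training data only."""
    mu = statistics.mean(train_column)
    sigma = statistics.pstdev(train_column) or 1.0  # guard constant columns
    return [(x - mu) / sigma for x in column]
```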
Apparently, some classification results are penalized by these features and, when the classifier is trained on privacy categories only, the performance drops drastically to that of the majority classifier. One explanation is that this dictionary is built upon technical documents and is not intended as a general-purpose lexical resource, although some categories also apply to our data (e.g., Intimacy). This is also confirmed by the fact that this feature space is very sparse (non-zero values are around 2% in all datasets). Nevertheless, in this analysis we have only considered aggregate classification performance.

In-depth analysis of dictionary-based classification results
To better understand the behavior of the classifiers, we analyze in detail the performance on the two classes (the sensitive and the non-sensitive one), in terms of F1-score and for each dataset, considering the best performing classifiers according to the macro-averaged F1-score (see Table 11). The results are reported in Table 12. As expected, the majority class (the non-sensitive one for every dataset except OMC) is the one for which the classifiers are the most accurate. However, from the classification point of view, WH+TW is the easiest dataset to analyze, as the two classes are better identified than in any other dataset, while on SENS2 and OMC the best classifiers achieve similar performance, only slightly better than the majority-vote classifier on the most frequent class. For such datasets, using dictionaries does not provide a reliable way to differentiate the two classes. Finally, we inspect the logistic regression classifier to identify the most relevant features for the sensitive class in each dataset. In Table 13 we report the top-20 relevant features together with the corresponding coefficients (the logarithms of the odds ratios). The results seem to confirm the conclusions reached with the previous experiments (feature names with capital initials are from the Privacy Dictionary [26,27]). As a further analysis, we compute Spearman's rank correlation coefficient (referred to as ρ in the following) among the different feature coefficient vectors, in order to investigate the similarities among the different models. The results of this analysis show that, not surprisingly, the two most similar logistic regression models are those computed on SENS2 and SENS3 (ρ = 0.757). More interestingly, however, the model computed on WH+TW is more similar to the one computed on OMC (ρ = 0.25118) than to those computed on SENS2 and SENS3 (ρ = 0.1165 and ρ = -0.0007).
This shows that the types of sensitivity captured by OMC and WH+TW have something in common: this is probably due to the fact that the content of sensitive posts in both datasets is mostly related to family and intimate relationships. Finally, it is worth noting that the coefficients computed on OMC are more correlated with those computed on SENS3 (ρ = 0.3277) than with those returned for SENS2 (ρ = 0.1461). This can be explained by the fact that the annotators' agreement on SENS3 is the highest: as a consequence, only highly sensitive posts (such as the ones tagged as sensitive in OMC, by construction) are marked as such. However, as already stated, we are interested in a more general concept of content sensitivity, one that does not rely only on the most personal and intimate aspects of human life.
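For reference, Spearman's ρ can be computed by ranking the two coefficient vectors (average ranks for ties) and taking the Pearson correlation of the ranks. A plain-Python sketch follows; in practice, `scipy.stats.spearmanr` implements the same computation:

```python
def spearman_rho(x, y):
    """Spearman's rank correlation between two equally long vectors."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # Extend the block while values are tied.
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1          # average rank of the tied block
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```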

Classifying posts according to their sensitivity
In this section, we provide the details of the experiments conducted within different classification scenarios, where the learning algorithms are applied directly to (embeddings of) text data. Our goal is to measure the possible gain of applying recent state-of-the-art text classification techniques that treat text as sequences, over the usage of features extracted from dictionaries. In particular, we compare several convolutional and recurrent neural network architectures and a transformer-based neural network; in addition, we also consider some baselines consisting of standard classifiers applied to bag-of-words representations of the datasets, similarly to our previous work [19]. In more detail, we apply four different classifiers to each dataset: a one-dimensional Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN) with gated recurrent unit (GRU) nodes, an RNN with long short-term memory (LSTM) nodes, and BERT [46], a pre-trained transformer-based network designed for learning language representation models. The CNN models have an embedding layer, followed by one or two one-dimensional convolutional layers (all with kernel size 8, Rectified Linear Unit as activation function, batch normalization and global average pooling), one or two dense layers, and a final dense layer of 2 nodes with softmax activation. The exact number of nodes per layer of each model is reported in Table 14. The RNN models consist of one embedding layer, followed by one or two recurrent layers, one or two dense layers, and, finally, one dense layer of 2 nodes with softmax activation. The number of layers and nodes of each model is reported in Table 15. The embedding layer projects each word of the input text into a word vector space: we use two different word embeddings pre-trained on Twitter data with GloVe [47]. Each recurrent layer is bidirectional and has a dropout equal to 0.5.
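The embedding layer of the CNN and RNN models is initialized from the pre-trained GloVe vectors. The following sketch (a simplified illustration with hypothetical `vocab` and `pretrained` mappings; out-of-vocabulary handling varies across implementations) shows how the weight matrix and the fixed-length input sequences are built:

```python
def build_embedding_matrix(vocab, pretrained, dim):
    """Build the weight matrix of an embedding layer from pre-trained
    word vectors (GloVe-style); out-of-vocabulary words get zero vectors.

    `vocab` maps each word to its integer index (0 reserved for padding);
    `pretrained` maps words to lists of `dim` floats.
    """
    matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]
    for word, idx in vocab.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
    return matrix

def encode(text, vocab, maxlen):
    """Map a post to a fixed-length sequence of word indices (0-padded)."""
    ids = [vocab.get(w, 0) for w in text.lower().split()][:maxlen]
    return ids + [0] * (maxlen - len(ids))
```

In Keras, such a matrix would typically be passed to an `Embedding` layer as its initial weights.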
For each dataset, BERT is trained with a learning rate of 5 · 10⁻⁵ and early stopping on the accuracy of the validation set, with patience equal to 5. Finally, the bag-of-words (BoW) models consist of standard classifiers trained on tf-idf features extracted from the text after stemming and stopword removal. We use the same classifiers as in Sect. 4.2, i.e., a logistic regression (LR) classifier, a support vector machine (SVM) classifier with linear kernel, and a random forest (RF) classifier. In our experiments, we use the Python implementations of the algorithms provided by the Keras, scikit-learn, and ktrain [48] libraries. All experiments are executed on a server with 32 Intel Xeon Skylake cores running at 2.1 GHz, 256 GB RAM, and one NVIDIA Tesla T4 GPU.
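The BoW representation can be sketched as follows (a minimal tf-idf variant using raw term counts and an unsmoothed idf; scikit-learn's `TfidfVectorizer`, used in our experiments, additionally applies idf smoothing and L2 normalization):

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """tf-idf representation of tokenized documents:
    weight(t, d) = tf(t, d) * log(N / df(t))."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))        # document frequency of each term
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors
```

Terms occurring in every document receive weight zero, which is why stopword removal and stemming are applied beforehand.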
The results of the classification on the test sets are reported in Table 16. The results for WH+TW are averaged over the ten samples. We also compute the percentage gain of CNNs, RNNs and BERT w.r.t. the best bag-of-words classifier for each dataset. From these results, it emerges that the datasets that benefit most from recurrent neural networks and language models are SENS2, SENS3 and OMC (the gain is between 10.26% and 14.71%), while the maximum improvement for WH+TW is 8.80%. It is worth noting that the performance of BERT on OMC is similar to that achieved using the dictionary-based features (see Sect. 4.2) and significantly lower than that achieved by the same model on SENS2 and SENS3. One possible explanation for this phenomenon is that the posts in this dataset deal, by construction, with a limited number of very specific topics. We recall, in fact, that its posts have been extracted from targeted subreddits and mention a few very specific terms (see Sect. 3.2). As a consequence, a language representation model like BERT does not help improve classification results to a great extent. SENS3, instead, also has the highest F1-score using BERT (0.89), but it is worth recalling that this dataset has fewer than half the posts of all other datasets. The high performance achieved by BERT on WH+TW can also be explained by the fact that sensitive and non-sensitive posts are derived from two different microblogging platforms. Although this point is out of the scope of our work, the choice of a particular social media platform (especially when it promotes anonymous content) may have an impact on both the lexicon and the language style adopted by its users. Finally, CNNs are less effective than RNNs and BERT. On WH+TW, they perform similarly to or even worse than the bag-of-words models. More detailed classification results for BERT are given in Table 17.
To measure the generalization strength of the classification models, we conduct the following additional experiment: we train the classification models on the training set of one dataset and evaluate them on the test sets of the others. To prevent any bias, when using SENS3 (resp. SENS2) as the test set, instances that are also present in the training set of SENS2 (resp. SENS3) are removed. In Table 18 we report the macro-averaged F1-scores computed on the test sets reported in the columns using the training sets reported in each row. We only show the results for the SVM trained on the bag-of-words representation and for BERT. Interestingly, when BERT is trained on SENS2, its performance is good when tested on SENS3 too. This is not that surprising, because SENS3 is a subset of SENS2 with less uncertainty on the class labels provided by the annotators (we recall that, in SENS3, the annotators' agreement is maximum). However, the most interesting results are those obtained by the classifier trained on SENS2 and tested on WH+TW, and vice versa. In these cases, the training and test sets come from completely different sources, and BERT trained on WH+TW performs even worse than the bag-of-words model when tested on SENS2. Instead, BERT trained on SENS2 achieves noticeably higher performance. It is worth noting that this difference in performance is the highest among all pairs of distinct datasets: the F1-score is 13% higher for BERT trained on SENS2 and tested on WH+TW than for the opposite configuration. With BERT, the model trained on OMC performs noticeably worse on WH+TW than the one trained on SENS2, although its performance on SENS2 and SENS3 is higher than that obtained by our datasets on the entire OMC dataset. This could be the consequence of the better representation provided by the training set of OMC, in particular for the sensitive class.
In fact, the F1-score for the sensitive class is 0.56 when the instances of SENS2 are predicted with BERT trained on the training set of OMC, while, for the opposite configuration, the F1-score is 0.39. For the pair composed of SENS3 and OMC, the same scores are, respectively, 0.57 and 0.36. It is worth noting that BERT trained on WH+TW achieves noticeably higher performance when tested on OMC than on SENS2 or SENS3. This confirms that the types of sensitivity captured by OMC and WH+TW are similar. For further analysis, we also conduct the same experiment with dictionary-based features (see Sect. 4.2), using the random forest classifier (DICT-RF in Table 18). The results show that the models trained on OMC and WH+TW do not perform well on our datasets (the F1-scores are between 0.19 and 0.33). Instead, the same models achieve better performance on each other's test sets (macro-averaged F1-scores of 0.52 and 0.54), confirming that those datasets address similar problems (i.e., a more specific concept of self-disclosure than ours).
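The bias-prevention step of the cross-dataset experiment reduces to a set difference. A minimal sketch (assuming posts can be compared directly, e.g. as normalized strings or ids):

```python
def cross_test_instances(test_set, other_training_set):
    """Remove from a test set the posts that also appear in the training
    set of the other corpus (needed because SENS3 is a subset of SENS2,
    so the two corpora share instances)."""
    seen = set(other_training_set)
    return [post for post in test_set if post not in seen]
```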

Discussion of the results
In this section, we discuss the results of the experiments described in Sects. 4 and 5 in more detail and draw some general conclusions. In this paper, we have performed many different data analysis tasks with the aim of investigating whether state-of-the-art approaches to self-disclosure detection in texts, and the related text corpora that have been made available to the public, are suitable for identifying privacy-sensitive posts shared on general-purpose social media. Our main target is the typical social media post, which, in principle, may deal with arbitrary topics and is communicated to different kinds of audiences, both in terms of extension (the number of profiles that can read the post) and type (close friends, acquaintances, general public). So far, the problem has been addressed by assuming that sensitive posts are published anonymously [15][16][17], or by considering a less general problem called self-disclosure [11].
In the experiments, not only have we shown the limitations of both approaches, but we have also pointed out the drawbacks of existing text corpora that might be used to train classification models capable of determining whether a given text is sensitive or not. Such corpora, in fact, are extracted from very specific sections of microblogging or forum platforms (e.g., those dealing with family life or intimate relationships). As a result, they are not able to capture sensitive content with wider topic coverage. Furthermore, we have created a new text corpus, consisting of around ten thousand Facebook posts, each annotated by three experts. In our corpus, sensitivity has a broader definition than self-disclosure, and we think that this better captures the actual privacy-sensitive content that can be found in general-purpose social media. Moreover, we do not make any anonymity assumption, in line with recent studies on the privacy paradox [12] and privacy fatigue [13] showing that many users tend to underestimate or simply overlook their privacy risk when posting on social media platforms.
All our experiments confirm that tackling the problem of content sensitivity by leveraging anonymity solves a less general problem than ours. By addressing sensitivity directly, we show that dictionary-based or bag-of-words approaches are not that effective. Sequential models such as recurrent neural networks and language models, instead, lead to more accurate analysis and predictions and, more interestingly, introduce a significant performance gain on text annotated according to criteria that are not mediated by the lens of anonymity. Interestingly, OMC, a dataset that is specifically annotated according to self-disclosure [11], does not benefit from RNNs or BERT to such a great extent: the results of these deep learning algorithms are comparable with those obtained by random forests trained on lexical features. The generally mild performance of all types of classifiers on this dataset could be explained by the overrepresentation of the sensitive class (corresponding to posts containing some form of self-disclosure). Unfortunately, this is by design, also because the dataset was published with a different objective (i.e., the study of affect in response to interactive content). More interestingly, the posts extracted according to the anonymity criterion (WH+TW) and those extracted following the classic definition of self-disclosure (OMC) share some common properties, as testified by the cross-classification results (Table 18) and the mild correlation of the relevant features for the logistic regression classifier (Table 13). This is probably the result of the particular choice of sources for the posts composing the sensitive class of those corpora (a subreddit on family relationships for OMC and Whisper for WH+TW). Finally, our experiments have shown that, for our datasets, only RNNs and BERT provide a significant performance boost.
This phenomenon can be explained by the fact that, in general-purpose social media, the context of a word or sentence (well captured by transformer-based language models) is better suited to modeling the sensitivity of a post than simple lexical features. It is worth noting that BERT achieves good performance on WH+TW too. However, in this case, its performance could be biased by the fact that sensitive and non-sensitive posts are extracted from two different social media platforms; consequently, the network may not be learning how to detect the sensitivity of a post, but rather its source. Although this point deserves further investigation, we leave it for future work.
Although the results obtained and their analysis largely confirm our hypotheses, the extent of our work is in part limited by the fact that we did not control data acquisition, but instead relied on a corpus of Facebook posts collected ten years ago for different research purposes (i.e., predicting some psychological traits of users according to their behavior on the well-known social network). Currently, it is not possible to collect such data, as Facebook has been limiting the amount of information that can be obtained through its API since 2015. Nevertheless, it is the only available dataset composed of so-called profile status updates. Other available Facebook posts are crawled from public pages, so they could hardly fit our objectives. Moreover, although we think that our work could foster further research on related topics, its impact is mitigated by the rapid changes in the social media world. Currently, the most popular social platforms (e.g., Instagram, TikTok) are designed for sharing multimedia content such as pictures and short videos. Although many results on textual content presented in this paper (and in other similar research works) can be adapted or transferred to multimedia posts, new efforts are needed to detect sensitive content in pictures and videos accurately.

Conclusion
With the final goal of supporting privacy awareness and risk assessment, we have introduced a new way to address the problem of sensitivity analysis of user-generated content without relying on the so-called anonymity assumption. We have shown that the "lens of anonymity" can indeed distort the actual sensitivity of text posts. Consequently, differently from state-of-the-art approaches, we have measured sensitivity directly, collecting reliable sensitivity annotations for an existing corpus of around ten thousand social media posts. In our experiments, we have shown that our problem is more challenging than anonymity-driven ones, as lexical features are not sufficient for discriminating between sensitive and non-sensitive content. Moreover, we have also investigated how the problem of self-disclosure is related to content sensitivity analysis, and shown that existing text corpora are not adequate for analyzing the sensitivity of posts shared on general-purpose social media platforms. Instead, recent sequential deep neural network models can help achieve good accuracy. Our work could represent a new gold standard in content sensitivity analysis and could be used, for instance, in privacy risk assessment procedures involving user-generated content. On the other hand, our analysis has also pointed out that predicting content sensitivity by simply classifying text cannot capture the many facets of privacy sensitivity with high accuracy; more complex and heterogeneous models should therefore be considered. An accurate content sensitivity analysis tool should probably consider lexical, semantic and grammatical features: topics are certainly important, but sentence construction and lexical choices are also fundamental. Therefore, reliable solutions would consist of a combination of computational linguistic techniques, machine learning algorithms and semantic analysis.
Finally, the success of picture and video sharing platforms (such as Instagram and TikTok) implies that any successful content sensitivity analysis tool should also be able to cope with audiovisual content and, in general, with multimodal/multimedia objects (an open problem in sentiment analysis as well [49]).