Analysis and classification of privacy-sensitive content in social media posts

Bioglio, Livio; Pensa, Ruggero G.

doi:10.1140/epjds/s13688-022-00324-y

Regular article
Open access
Published: 03 March 2022

EPJ Data Science volume 11, Article number: 12 (2022)

Abstract

User-generated content often contains private information, even when it is shared publicly on social media and on the web in general. Although many filtering and natural language approaches for automatically detecting obscenities or hate speech have been proposed, determining whether a shared post contains sensitive information is still an open issue. The problem has been addressed by assuming, for instance, that sensitive content is published anonymously, on anonymous social media platforms or with more restrictive privacy settings, but these assumptions are far from realistic, since the authors of posts often underestimate or overlook their actual exposure to privacy risks. Hence, in this paper, we address the problem of content sensitivity analysis directly, by presenting and characterizing a new annotated corpus of around ten thousand posts, each one annotated as sensitive or non-sensitive by a pool of experts. We characterize our data with respect to the closely related problem of self-disclosure, pointing out the main differences between the two tasks. We also present the results of several deep neural network models that outperform previous naive attempts at classifying social media posts according to their sensitivity, and show that state-of-the-art approaches based on anonymity and lexical analysis do not work in realistic application scenarios.

1 Introduction

The Web is pervaded with user-generated content, as Internet users have multiple and increasing ways to express themselves. They can post reviews of products, businesses, services and experiences; they can share their thoughts, pictures and videos through different social media platforms; they reply to surveys, forums and newsgroups, and some of them have their own blogs and web pages. Many companies encourage this behavior, because user-generated content attracts other users more than professional content does, and this increases their engagement on web platforms. However, texts, photos and videos posted by users may harm their own and others’ privacy, thus exposing themselves (and other users) to many risks, from discrimination or cyberbullying to fraud and identity theft. Although user-generated content is often subject to moderation, including through automated techniques for detecting inappropriate content [1], hate speech [2] and cyberbullying [3], there is no control over the sensitivity of posted content. It is worth noting that social media and forums are not the only platforms that store and publish private content. Surveys and contact/helpdesk forms are other examples where users are free to enter any type of text and other content, together with other more structured personal information. Often, such data need to be transferred to third parties to be analyzed, and the lack of control over free-text fields could put the privacy of respondents at risk. A common quick solution consists in removing all such fields entirely or sanitizing them automatically or by hand. However, existing automatic sanitization approaches [4–6] try to replace sensitive terms belonging to specific domains (e.g., medical or criminal records) with more general ones, and rely on existing knowledge bases and natural language processing techniques such as named entity recognition and linking. In some cases, sanitization techniques destroy the informativeness (and sometimes the meaning itself) of the text.

Self-disclosure, i.e., the act of revealing personal information to others [7], is a social phenomenon that has also been extensively studied in relation to online forums [8], online support groups [9] and social media [10]. Although self-disclosure is closely related to content sensitivity, it has often been investigated in the context of intrinsically sensitive topics, such as in forums related to health issues, intimate relationships, sex life, or forum sections explicitly devoted to people searching for support from strangers [11]. In these settings, the identity of the users is often masked by pseudonyms or entirely anonymous. Instead, general purpose social media platforms usually encourage the usage of the real identity, although this does not prevent their users from disclosing very private information [12–14]. Moreover, the sensitivity of social media texts is harder to detect, because the context of a post plays a fundamental role as well. Finally, social media posts are sometimes very short; yet, they may disclose a lot of private information.

To better understand the problem, let us observe the post in Fig. 1: it does not mention any sensitive term or topic, but discloses information about the author and his friend Alice Green, and contains hidden spatiotemporal references that are immediately clear from the context (the author is about to leave for a journey, which implies that he will be far from home for a month, thus disclosing potentially sensitive information). On the other hand, there may exist posts that contain very sensitive terms, but are not sensitive at all when contextualized correctly. An example is given by the post in Fig. 2, where several sensitive terms (struggling, suffering, COVID-19) and topics (health, economic crisis) are mentioned, but no private information is disclosed about any specific person. In these cases, the automatic assessment of text sensitivity could save a lot of rich information and help automate the sanitization process. Furthermore, an automatic warning system able to detect the true potential sensitivity of a post may help a user decide whether to share it or not.

Indeed, the problem of assessing and characterizing the sensitivity of content posted in general purpose social media has already been studied, but, due to the unavailability of specifically annotated text corpora, it has been tackled through the lens of anonymity, by assuming that sensitive content is posted anonymously [15, 16], on anonymous platforms [17], or with more restrictive privacy settings [18], while non sensitive content is posted by identifiable users and/or made available to everyone. However, as we pointed out in [19], anonymity and sensitivity are not straightforwardly related to each other. The decision to post anonymously could be determined solely by the sensitivity of the topic, rather than by the sensitivity of the posted content itself. Analogously, many non anonymous social media posts contain very private information, just because their sensitivity [12] or their visibility [14] is underestimated by the content authors. These considerations make what we call the “anonymity assumption” too simplistic, or even unrealistic in practice. Other existing annotated corpora concern posts extracted from Reddit [11] and support groups for cancer patients [8, 9]. Unfortunately, these corpora focus on very specific (and intrinsically sensitive) topics or give a very restrictive interpretation of self-disclosure: in [11], for instance, only posts disclosing personal information or feelings about the authors are annotated as sensitive. Moreover, that corpus has a strong focus on mutually supportive communities and intimate relationships. To cope with these problems, very recently, we have introduced a more general task called content sensitivity analysis, a machine learning task aimed at assigning a sensitivity score to content [19]. However, in that preliminary work, we modeled the problem as a simple bag-of-words classification task on a very small text dataset (fewer than 700 social media posts) with mild accuracy results (just above the majority classifier).

In this paper, we address all the limitations of previous works by analyzing a new large corpus of nearly 10,000 text posts, all annotated as sensitive or non sensitive by humans, without assuming any implicit and forced link between anonymity and privacy. We provide an in-depth analysis of sensitive and non sensitive posts, and introduce several sequential deep neural network models that outperform bag-of-words classifiers. We also show that models trained according to the anonymity assumption do not work properly in realistic scenarios. Moreover, we study how the problem of self-disclosure is related to ours and show that existing text corpora are not adequate to analyze the sensitivity of posts shared in general purpose social media platforms. To the best of our knowledge, this is the first work addressing the problem of directly and efficiently evaluating the real sensitivity of short text posts. It thus has the potential to represent a new gold standard in content sensitivity analysis and self-disclosure, and could open new research opportunities for improving users’ awareness of privacy and performing privacy risk assessment analysis or sanitization on data containing free text fields.

Our paper is organized as follows. In Sect. 2, we review some closely related work and discuss its limitations. In Sect. 3, we formally define our concept of privacy-sensitive content, describe how we have constructed our annotated corpus, and present the datasets used in our analysis. Section 4 contains an in-depth analysis of the lexical features characterizing sensitive content in the different datasets, while, in Sect. 5, we report the results on multiple classification tasks conducted under different settings. In Sect. 6 we discuss the results of the experiments in more detail and draw some generalized conclusions. Finally, Sect. 7 concludes by also presenting some future research perspectives.

2 Related work

With the success of online social networks and content sharing platforms, understanding and measuring the exposure of user privacy on the Web has become crucial [20, 21]. Thus, many different metrics and methods have been proposed with the goal of assessing the risk of privacy leakage in posting activities [22, 23]. Most research efforts, however, focus on measuring the overall exposure of users according to their privacy settings [24, 25] or position within the network [14]. Instead, the problem of characterizing and detecting the sensitivity of user-generated content has been the subject of very few studies in the last decade. One of the first works in this direction tried to address this problem using a lexicographic approach [26, 27]. Similarly to sentiment analysis or emotion detection, in fact, linguistic resources may help identify sensitive content in texts. In their work, Vasalou et al. leverage prototype theory and traditional theoretical approaches to construct and evaluate a dictionary intended for content analysis. Using an existing content analysis tool applied to several text corpora, they evaluate dictionary terms according to privacy-related categories. Interestingly, the same authors note that there is no consistent and uniform theory of privacy-sensitivity.

To bypass this problem, several authors adopt a simplification: they assume that the sensitivity of content is strictly related to the choice of posting it anonymously. This also makes the construction of annotated corpora easier, because one just needs to consider content posted anonymously as sensitive, while posts shared with identifiable information can be considered as non sensitive. Hence, for instance, Peddinti et al. adopt this strategy for analyzing anonymous and non anonymous posts in a famous question-and-answer website [15]. They analyze different basic machine learning models to predict whether a particular answer will be written anonymously. Similarly, Correa et al. define sensitivity of a social media post as the extent to which users think the post should be anonymous [17]. They compare content posted on anonymous and non-anonymous social media sites both in terms of topics and from the linguistic point of view, and conclude that sensitivity is often subjective and may be perceived differently according to several aspects. Very recently, the same authors have published a sanitized version of nearly 90 million posts downloaded from Whisper, an anonymous social media platform [28]. Biega et al. conduct a similar study, but restrict the analysis to sensitive topics with the aim of measuring the privacy risks of the users [29]. It is worth noting that all these studies conclude that sensitivity is subjective.
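The labeling shortcut described above can be illustrated with a minimal Python sketch. It is not the procedure used in any of the cited studies: the `Post` class, the function names and the four toy posts are all invented for illustration. It only shows how labels derived from anonymity can be compared with labels assigned directly by humans.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical record: the field names are assumptions for this sketch."""
    text: str
    anonymous: bool    # whether the post was shared anonymously
    human_label: str   # "sensitive" or "non-sensitive", from direct annotation


def anonymity_label(post: Post) -> str:
    """Label a post under the anonymity assumption: anonymous means sensitive."""
    return "sensitive" if post.anonymous else "non-sensitive"


def disagreement_rate(posts) -> float:
    """Fraction of posts whose anonymity-derived label differs from the human label."""
    if not posts:
        return 0.0
    mismatches = sum(anonymity_label(p) != p.human_label for p in posts)
    return mismatches / len(posts)


# Invented toy posts, not taken from any real corpus.
toy_posts = [
    Post("Leaving tomorrow for a month abroad with my sister!", False, "sensitive"),
    Post("Which laptop is best for programming?", True, "non-sensitive"),
    Post("I was diagnosed with diabetes last week.", True, "sensitive"),
    Post("The city council approved the new budget.", False, "non-sensitive"),
]

print(disagreement_rate(toy_posts))
```

In this toy example the first two posts are mislabeled by the shortcut: a non anonymous post reveals travel plans, while an anonymous question discloses nothing personal.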

Content sensitivity has been associated with privacy settings as well: similarly to anonymity, content posted with restricted visibility is deemed sensitive. Yu et al. analyze sensitive pictures by learning the object-privacy correlation according to privacy settings to identify categories of privacy-sensitive objects using a deep multi-task learning architecture [18]. They also use their model to customize privacy settings automatically and to sanitize images by blurring sensitive objects.

Text sanitization is another closely related research field whose goal is to find and hide personally identifiable information while preserving text utility. To this purpose, Jiang et al. present an information theoretic approach that replaces sensitive terms with more general but semantically related terms to protect sensitive information [30]. Similarly, Sanchez et al. propose several information theoretic approaches that detect and hide sensitive textual information while preserving its meaning by exploiting knowledge bases [4, 31, 32]. Iwendi et al., instead, focus on unstructured medical datasets and propose a framework to completely anonymize textual clinical records by exploiting regular expressions, dictionaries and named entity recognition. Their method is aimed at sanitizing the detected protected health information with its available generalization, according to a well-known medical ontology [5]. Finally, Hassan et al. use word embeddings to evaluate the disclosure caused by the textual terms on the entity to be protected according to the similarity between their vector representations [6]. All the above-mentioned methods rely on the identification of named entities or quasi-identifying terms, and try to replace them with semantically close, although more general, terms. Hence, they all leverage some kind of knowledge base or ontology, and work well on some specific domains (e.g., on medical documents, criminal records and so on). Instead, we address a more general notion of sensitivity, which also includes texts that may reveal sensitive or simply private habits, feelings or characteristics of the user.

A closely related concept is the so-called self-disclosure, defined as the act of revealing personal information to others [7]. Self-disclosure has been widely studied well before the advent of modern social media, in particular for its implications in online support groups, online discussion boards and forums. For instance, Barak et al. study, among other things, the reciprocity of self-disclosure in online support groups and discussion forums, showing that there are substantial differences in how people behave in these two different media types [8]. Yang et al., instead, analyze the differences in the degree of positive and negative self-disclosure in public and private channels of online cancer support groups [9]. They show that people tend to self-disclose more in public channels than in private ones. Moreover, negative self-disclosure is also more present in public online support channels than in private chats or emails. To achieve these results, the authors study lexical, linguistic, topic-related and word-vector features of a relatively small annotated corpus using support vector machines. Ma et al. conduct a questionnaire-based mixed-factorial survey experiment to answer several questions concerning the relationships that regulate anonymity, intimacy and self-disclosure in social media [10]. They show, for instance, that intimacy always regulates self-disclosure, while anonymity tends to increase the level of self-disclosure and decrease its regulation, in particular for content of negative valence. Differently from the previous works, Jaidka et al. directly address the problem of self-disclosure detection in texts posted in online forums, by reporting the results of a challenge concerning a relatively large annotated corpus made up of top posts collected from Reddit [11]. Contrary to [28], in this corpus, all posts are directly annotated according to their degree of informational and emotional self-disclosure. The authors also intend to investigate the emotional and informational supportiveness of posts and to model the interplay between these two variables. Unfortunately, this corpus is not entirely suited to our purpose (i.e., detecting the sensitivity of text content in general purpose social media platforms), mainly for four different reasons: first, the focus is on self-disclosure, although a post may reveal sensitive information about other people as well; second, posts on Reddit are published using pseudonyms, while general purpose social media foster the usage of real identities; third, a large part of the posts has been extracted from a subreddit explicitly devoted to people searching for other users’ support; last but not least, all posts concern intimate relationships by design.

In conclusion, in our work, we do not make any “anonymity” or “privacy settings” assumption, since it has been shown that users tend to underestimate or simply overlook their privacy risk [12–14]. Consequently, we analyze and characterize sensitive posts directly. In a very preliminary version of our work, we tried to give a more generic definition of sensitivity [19]. However, our model was trained on very few posts and by using simple bag-of-words classifiers, thus achieving mild accuracy results. In this work, we construct a much larger and more reliable dataset of social media posts, directly annotated according to their sensitivity, and use more sophisticated and accurate models to help decide whether a post is sensitive or not. Additionally, we provide further lexical and semantic insights about sensitive and non sensitive texts.

3 An annotated corpus for content sensitivity

In this section, we introduce the data that we use in our study. We first provide a conceptualization of “content sensitivity” also in relation with existing similar concepts; then, we describe how we construct our annotated corpus and provide some characterization of it.

3.1 Privacy-sensitive content

Content sensitivity is strictly related to the concept of self-disclosure [7], a communication process by which one person reveals any kind of personal information about themself (e.g., feelings, goals, fears, likes, dislikes) to another. It has been described within the social penetration theory as one of the main factors enabling relationship development [33, 34]. Due to the peculiarities of online communication (and its differences w.r.t. face-to-face communication), the social and psychological implications of self-disclosure on the Internet have been extensively studied as well [35]. For its implications on user privacy, self-disclosure has also been investigated in relation to privacy awareness, policies and control [36], and some rule-based detection techniques for self-disclosure in forums have been proposed [37], leading to some relatively large annotated corpora [11].

In this paper, we refer to content sensitivity as a more general concept than self-disclosure. In [19] we gave a preliminary, subjective and user-centric definition of privacy sensitive content. In that work, we stated that a generic user-generated content object is privacy-sensitive if it makes the majority of users feel uncomfortable in writing or reading it because it may reveal some aspects of their own or others’ private life to unintended people. This definition is motivated by the fact that each social media platform has its own peculiarities and the amount and quality of social ties also play a fundamental role in regulating self-disclosure [10]. However, it has many drawbacks, since it relies on the subjective perception of users and on a notion of uncomfortableness that can also be driven by other external factors. This also conditioned the preliminary annotation of a corpus, leading to poor detection results. Consequently, in this paper, we adopt a more objective definition of privacy-sensitive content.

Definition 1

(Privacy-sensitive content)

A generic user-generated content is privacy-sensitive if it discloses, explicitly or implicitly, any kind of personal information about its author or other identifiable persons.

Differently from the concept of self-disclosure, our definition explicitly mentions the disclosure of information concerning persons other than the author of the content. Furthermore, it also clearly includes content that implicitly reveals personal information of any kind. For instance, the sentence “There’s nothing worse than recovering from COVID-19” is apparently neutral. However, it is very likely that the person who expresses this claim has also personally experienced the effects of SARS-CoV-2 infection.

3.2 Datasets

Most previous attempts at sensitivity analysis on text content assume that sensitive posts are shared anonymously, while non sensitive posts are associated with real social profiles. Other available corpora do not explicitly require that distinction, but have been collected in very specific domains (e.g., health support groups [9]) or focus on limited types of self-disclosure (e.g., intimate/family relationships [11]). Hence, we consider a new generic dataset with explicit “sensitive/non-sensitive” annotations. To this purpose, we first need a corpus made up of mixed sensitive and non-sensitive posts. Twitter is not the most suitable source for that, because most public tweets are of limited interest to our analysis, while tweets with restricted access cannot be downloaded. Moreover, it is well known that users are significantly more likely to provide a more “honest” self-representation on Facebook [38, 39]. Consequently, Facebook posts are better suited to our purposes, but content posted on personal profiles cannot be downloaded, while public posts and comments published in pages do not fit the bill as they are, in general, non sensitive. Furthermore, they would require a huge sanitization effort in order to make them available to the research community. Fortunately, one of the datasets described in [40], and released publicly, has all the required characteristics. It is a sample of 9917 Facebook posts (status updates) collected for research purposes in 2009-2012 within the myPersonality project [41], by means of a Facebook application that implemented several psychological tests. The application obtained consent from its users to record their data and use it for research purposes. All the posts have been sanitized manually by their curators: each proper name of a person (except for famous ones, such as “Chopin” and “Mozart”) has been replaced with a fixed string. Famous locations (such as “New York City” and “Mexico”) have not been removed, either. Almost all posts are written in English, with an average length of 80 characters (the minimum and maximum lengths are, respectively, 2 and 435 characters). Since the recruitment has been carried out on Facebook, the dataset suffers from the typical sample bias due to the Facebook environment (some groups of people might be under- or over-represented). However, the same problem applies to other datasets as well [9, 11, 28].
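As a rough illustration, the length statistics quoted above (average, minimum and maximum number of characters) could be computed for any list of posts as in the Python sketch below. The placeholder `*NAME*` is a hypothetical stand-in for the fixed string that replaces proper names, since the actual string is not given here, and the three posts are invented.

```python
# Hypothetical placeholder: the real fixed string used by the curators
# is not specified in the text above.
NAME_PLACEHOLDER = "*NAME*"


def length_stats(posts):
    """Return character-length statistics for a non-empty list of posts."""
    if not posts:
        raise ValueError("at least one post is required")
    lengths = [len(p) for p in posts]
    return {
        "count": len(lengths),
        "mean": sum(lengths) / len(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


def masked_fraction(posts, placeholder=NAME_PLACEHOLDER):
    """Fraction of posts that contain the name placeholder at least once."""
    if not posts:
        return 0.0
    return sum(placeholder in p for p in posts) / len(posts)


# Invented toy posts, not taken from the corpus.
toy_posts = ["ok", "Happy birthday *NAME*!", "Back in New York City"]

print(length_stats(toy_posts))
print(masked_fraction(toy_posts))
```

Counting the posts that contain the placeholder gives a quick indication of how often other persons are mentioned, which is relevant because the definition of sensitivity adopted here also covers people other than the author.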

All 9917 posts have been submitted to a pool of 12 volunteers (7 males and 5 females, aged from 24 to 41 years, mainly postgraduate/Ph.D. students and researchers), so as to have exactly three annotations for each post. Hence, we have formed four groups, each consisting of three annotators; every group has been assigned from 2479 to 2485 posts. For each post, the volunteers had to say whether they thought that the post was sensitive, non-sensitive, or of unknown sensitivity. The choices also include a fourth option, unintelligible, used, for instance, to tag posts written in a language other than English. For each category, the annotators were given precise guidelines and examples (see Table 1). According to our guidelines, a post is “sensitive” if the text is understandable and the annotator is certain that it contains information that violates a person’s privacy (not necessarily that of the author of the post), because it contains, for instance: information about current or upcoming moves, on events in the private sphere, on health or mental status; information about one’s habits or that can help geolocalize the author of the post or other people mentioned; information on the sentimental status; considerations that may hint at the political orientation or religious belief.
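Since each post receives exactly three annotations chosen among four options, the three votes have to be combined into a single label at some point. The text above does not state the aggregation rule, so the Python sketch below assumes a simple majority vote (a label is kept when at least two of the three annotators agree); this rule and all names in the code are assumptions made for illustration.

```python
from collections import Counter

# The four options offered to the annotators, as described in the text.
LABELS = {"sensitive", "non-sensitive", "unknown", "unintelligible"}


def aggregate(annotations):
    """Combine three annotations by majority vote (an assumed rule).

    Returns the label chosen by at least two annotators, or None when
    all three annotators disagree.
    """
    if len(annotations) != 3:
        raise ValueError("exactly three annotations are expected per post")
    invalid = set(annotations) - LABELS
    if invalid:
        raise ValueError(f"unexpected labels: {sorted(invalid)}")
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= 2 else None


def full_agreement_rate(all_annotations):
    """Fraction of posts on which the three annotators gave the same label."""
    if not all_annotations:
        return 0.0
    unanimous = sum(len(set(a)) == 1 for a in all_annotations)
    return unanimous / len(all_annotations)


# Invented annotation triples, one per post.
toy = [
    ("sensitive", "sensitive", "unknown"),
    ("non-sensitive", "non-sensitive", "non-sensitive"),
    ("sensitive", "non-sensitive", "unknown"),
]

print([aggregate(a) for a in toy])
print(full_agreement_rate(toy))
```

Returning `None` for three-way disagreement keeps such posts distinguishable from those explicitly tagged as unknown, so that they can be inspected or discarded separately.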

Table 1 Guidelines and examples for the annotations
