2.1 Ego network construction
We started our data collection by producing a set of ego users whose information constitutes the ground truth to evaluate predictions. To generate an initial unbiased random sample of users, we applied the Random Digit Search method [24, 25]: we generated random Twitter user ids in the range between 1 and 30 billion, looked them up through the Twitter REST API,Footnote 2 and saved the basic user information of the valid sampled users. To avoid celebrities and spammers, we filtered out users with a ratio of followers to friends below 0.1 or above 10, as well as users with fewer than 50 friends or followers. To have a homogeneous sample for biographical data analysis, we included only users whose Twitter account language is English. This process generated a set of 1,017 ego users, which are the starting point of a larger dataset that includes their social contacts and their activity on Twitter.
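To make the sampling and filtering step concrete, the following is a minimal sketch of the Random Digit Search procedure in Python. It assumes a hypothetical `lookup_users` helper that wraps the Twitter REST API users/lookup endpoint and returns the basic fields of valid accounts (followers_count, friends_count, lang); the filter thresholds are the ones described above.

```python
import random

# Hypothetical helper: wraps the Twitter REST API users/lookup endpoint and
# returns user dicts (followers_count, friends_count, lang) for valid ids.
from twitter_client import lookup_users  # assumed wrapper, not a real package

MAX_ID = 30_000_000_000  # ids sampled uniformly between 1 and 30 billion

def sample_ego_candidates(n_ids):
    """Random Digit Search: draw random ids and keep only valid, filtered users."""
    ids = [random.randint(1, MAX_ID) for _ in range(n_ids)]
    egos = []
    for user in lookup_users(ids):            # only existing accounts are returned
        followers = user["followers_count"]
        friends = user["friends_count"]
        if friends == 0 or followers == 0:
            continue
        ratio = followers / friends
        if not (0.1 <= ratio <= 10):           # filter out celebrities and spammers
            continue
        if friends < 50 or followers < 50:     # filter out low-activity accounts
            continue
        if user.get("lang") != "en":           # keep English-language accounts only
            continue
        egos.append(user)
    return egos
```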
We collect the timeline of tweets of each ego user, up to 3,200 tweets.Footnote 3 Based on those timelines, we identify alter users as those mentioned at least four times by an ego user, thus following friendship links that capture communication rather than mere followership or retweeting [26]. We use these links as an approximation to the underlying social network between Twitter users that is revealed when users share their contact lists through mobile phone apps or importing tools. In this way we generate a set of 68,447 alter users, also collecting their timelines of tweets and biographical information. As a result, our dataset contains a total of 157,408,012 tweets from both ego and alter users.
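A possible implementation of the alter-identification step, assuming timelines are available as lists of tweet objects in the standard Twitter REST API JSON format (mentions listed under entities → user_mentions):

```python
from collections import Counter

MENTION_THRESHOLD = 4  # an alter must be mentioned at least four times by the ego

def extract_alters(ego_timeline):
    """Count mentions in an ego's timeline and return the alter user ids."""
    mention_counts = Counter()
    for tweet in ego_timeline:
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            mention_counts[mention["id"]] += 1
    return [uid for uid, count in mention_counts.items() if count >= MENTION_THRESHOLD]
```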
2.2 User analysis
We identify the location of users by combining geographical data from their tweets with their self-disclosed location and biographical text. Our dataset contains more than 5.6 million geotagged tweets with precise geographical coordinates. We process those coordinates with the Google Maps Geocoding APIFootnote 4 to identify the municipality in which they are located, which we refer to as their city. For each user with at least one geotagged tweet, we label their location as the most frequent city in which their tweets have been geotagged. For users without geotagged tweets, we process their self-reported location and biographical text through Google’s Text Analysis API.Footnote 5 As a result, we located 630 ego users and 38,936 alters, taking these data as an approximation to the better location information that Twitter has access to. The top panel of Figure 1 shows the locations of users in the dataset, illustrating that users come from a wide variety of countries but are generally located in countries where Twitter adoption is high [27, 28]. The lower panel of Figure 1 shows the ego network using only preceding alters, as explained below, with nodes colored according to the user’s country. A clear assortativity pattern can be observed, which is the foundation of the unsupervised predictors explained below.
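The modal-city labelling can be sketched as follows, assuming each geotagged tweet has already been reverse-geocoded into a `city` field (a step performed here with the Google Maps Geocoding API):

```python
from collections import Counter

def modal_city(geocoded_tweets):
    """Label a user's location as the most frequent municipality among their
    geotagged tweets; returns None if the user has no usable geotags."""
    cities = [t["city"] for t in geocoded_tweets if t.get("city")]
    if not cities:
        return None
    return Counter(cities).most_common(1)[0][0]
```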
We processed the biographical text provided by each user in the dataset by removing stop words that appear in the NLTK stopword listFootnote 6 and stemming its tokens with Porter’s stemming algorithm [29]. For our analysis, we consider only those users with three or more tokens in their biographical text after this step (49,576 alters and 676 ego users). Over these texts, we applied a 100-dimensional Doc2Vec [30] model trained on a separate corpus of 1.7 million biographical texts generated in previous research [31]. Doc2Vec fits a language model that represents documents (in this case bios) as vectors such that semantically and linguistically similar documents are close in the representation space. We chose a dimensionality of 100 to follow previous applications of similar models [32]. To further reduce the dimensionality of the dataset, we applied Principal Component Analysis (PCA) and took the two most informative components as a quantification of the content of biographical texts. More details about the PCA are presented in Additional file 1. As a result, we obtain a 2-dimensional biographical vector for each user, such that semantically similar texts have similar vector orientations and dissimilar texts point in very different directions.
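A sketch of the biographical-text pipeline, assuming a pretrained gensim Doc2Vec model stored at a placeholder path `bio_doc2vec.model` and simple whitespace tokenization (the exact tokenizer is an assumption, not specified above):

```python
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from gensim.models.doc2vec import Doc2Vec
from sklearn.decomposition import PCA

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess_bio(text):
    """Tokenize, remove NLTK stop words, and Porter-stem a biographical text."""
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]
    return [stemmer.stem(w) for w in tokens]

# Placeholder path for the 100-dimensional Doc2Vec model trained on the
# separate corpus of biographies.
model = Doc2Vec.load("bio_doc2vec.model")

def bio_vectors(bio_texts):
    """Infer 100-d Doc2Vec vectors for bios with >= 3 tokens, then project
    them onto the two most informative principal components."""
    processed = [preprocess_bio(t) for t in bio_texts]
    kept = [p for p in processed if len(p) >= 3]
    vectors = np.array([model.infer_vector(p) for p in kept])
    return PCA(n_components=2).fit_transform(vectors)
```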
The Twitter API provides a source field for each tweet that identifies the way the tweet was produced. Among these sources are the mobile phone applications for Android and iPhone, allowing us to identify which users installed one of these applications and shared their contact lists with Twitter.Footnote 7Footnote 8 We mark as disclosing alters all the alters that produced at least one tweet with the source “Twitter for iPhone” or “Twitter for Android”. In this way we identify 54,658 disclosing users (934 ego users and 53,724 alters), which amount to more than 78% of the users in our dataset.
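Marking disclosing users from the source field might look as follows; note that the REST API returns the source as an HTML anchor, so the sketch matches on the app name appearing inside it:

```python
DISCLOSING_SOURCES = ("Twitter for iPhone", "Twitter for Android")

def is_disclosing(timeline):
    """A user is marked as disclosing if at least one tweet in their timeline
    was posted from one of the official mobile apps. The 'source' field looks
    like '<a href="...">Twitter for iPhone</a>', hence the substring match."""
    return any(
        any(app in tweet.get("source", "") for app in DISCLOSING_SOURCES)
        for tweet in timeline
    )
```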
2.3 The shadow profile problem
We adapt the problem formulation of shadow profiles for Facebook [22] and Friendster [10] to the case of Twitter. The left panel of Figure 2 shows a schema of the ego-centered data we use: we have the connections between ego and alter users, as well as the locations and biographical vectors of the users who shared that data on Twitter. The right panel of Figure 2 shows the problem of constructing a shadow profile for the ego user, in which only a historical subset of the data is used to evaluate whether the information provided by users (alters) was predictive of the information of non-users (ego users) at the time.
In the shadow profile problem, all alters that joined Twitter later are excluded from the prediction of ego data, since the information of these alters was not available to Twitter before the ego created an account. Out of the alters that joined Twitter before the ego user, only a subset were disclosing alters, i.e. they shared their contact lists with Twitter, for example through the mobile phone app. The shadow profile problem in Twitter is to generate predictions of the location and biographical vectors of ego users based only on the information of their disclosing alters who joined Twitter earlier. Therefore, we go over the history of Twitter and evaluate predictions based only on data that was available before the ego user joined.
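The temporal and disclosure filtering can be expressed as a small helper, assuming user objects with parsed `created_at` timestamps and a precomputed set of disclosing user ids:

```python
def eligible_alters(ego, alters, disclosing_ids):
    """Keep only alters that had joined Twitter before the ego user and that
    shared their contact lists (disclosing alters); only their data enters
    the shadow profile prediction."""
    return [
        alter for alter in alters
        if alter["created_at"] < ego["created_at"] and alter["id"] in disclosing_ids
    ]
```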
We study the conditions that drive the quality of shadow profiles in two analysis scenarios. First, we perform an empirical shadow profile analysis by applying the predictors explained below over the set of disclosing alters found through the source of their tweets. We evaluate the predictions against the ground truth of ego users and compare them against a Null Model to test the shadow profile hypothesis. Furthermore, we analyze the relationship between the quality of predictions and the number of disclosing alters of each ego user, in order to evaluate whether profiles are more accurate for users with more friends who had already joined Twitter.
Second, we perform a disclosure tendency analysis to study how the tendency of users to share their contact lists can affect the quality of shadow profiles. In this analysis scenario, instead of using the set of disclosing alters, we randomly sample subsets of all alters that joined Twitter before the ego user. We define the disclosure parameter ρ as the probability that an alter shares its contact list with Twitter, and analyze how the quality of shadow profiles depends on disclosure tendencies. For each value of ρ between 0.1 and 0.9 in increments of 0.1, we draw 1000 samples and generate predictions based on each sampled subset of the data. In addition, we record the number of alters sampled this way to evaluate whether any relationship between prediction quality and number of friends also appears in this sampling scenario.
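A sketch of the disclosure-tendency sampling, where each alter who joined before the ego is included independently with probability ρ:

```python
import random

def sample_disclosing(preceding_alters, rho, n_samples=1000, seed=None):
    """Draw n_samples random subsets of the alters who joined before the ego
    user, each alter being included independently with probability rho."""
    rng = random.Random(seed)
    return [
        [alter for alter in preceding_alters if rng.random() < rho]
        for _ in range(n_samples)
    ]

# Example sweep: rho from 0.1 to 0.9 in increments of 0.1
# for rho in [round(0.1 * k, 1) for k in range(1, 10)]:
#     samples = sample_disclosing(preceding_alters, rho)
```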
2.4 Unsupervised predictors and evaluation
We apply two unsupervised predictors for location and biographical vectors to evaluate the shadow profile hypothesis on Twitter. To predict the location of ego users, we take the locations of all disclosing alters and identify the most frequent city among them, i.e. the modal predictor. We use this location as the unsupervised prediction to be compared against the ground truth location of the ego user. We evaluate the quality of the prediction by measuring the Haversine distance in km between the predicted point and the ground truth. We predict the biographical vector of each ego user as the average vector of its disclosing alters. We evaluate this prediction through the cosine similarity of the predicted and ground truth vectors; therefore, a high similarity means a high accuracy of the predictor.
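Both unsupervised predictors and their evaluation metrics are simple enough to sketch directly:

```python
from collections import Counter
from math import radians, sin, cos, asin, sqrt
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between predicted and ground-truth points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def predict_location(alter_cities):
    """Modal predictor: the most frequent city among the disclosing alters."""
    return Counter(alter_cities).most_common(1)[0][0]

def predict_bio_vector(alter_vectors):
    """Average the 2-d biographical vectors of the disclosing alters."""
    return np.mean(alter_vectors, axis=0)

def cosine_similarity(u, v):
    """Similarity between predicted and ground-truth biographical vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```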
We compare both predictors against a Null Model that constructs a prediction from a uniformly random sample of all users. For each shadow profile prediction, we generate 100 Null Model predictions by sampling the same number of users from the whole dataset. By comparing the Null Model with the shadow profile predictions we ensure that our results are not an artefact of limited data samples or an uneven distribution of locations and biographical data.
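The Null Model can be sketched as a resampling routine that reuses the same predictor functions on uniformly random samples of matching size, where `all_values` stands for the list of attribute values (cities or biographical vectors) of all users in the dataset:

```python
import random

def null_model_predictions(all_values, n_alters, predictor, n_null=100, seed=None):
    """Null Model: repeatedly sample the same number of attribute values
    uniformly at random from the whole dataset and apply the same predictor
    (modal city or mean biographical vector) used for the shadow profiles."""
    rng = random.Random(seed)
    return [predictor(rng.sample(all_values, n_alters)) for _ in range(n_null)]
```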