In this section we will describe our approach to modelling commuter flow with Twitter data. We will also describe the census data which we use as a ground truth for evaluating the accuracy of the Twitter model, and the radiation model which we use as a benchmark.
Twitter data consists of messages, known as ‘tweets’, that members of the social networking site send when using the service. Only a small percentage of these tweets are ‘geotagged’, by which we mean that they come with metadata containing the location from which the tweet was sent [17] (geotagging often occurs when people send tweets from their mobile phone [18]). Although earlier studies of geotagged tweets have made use of exact co-ordinate data, the majority of geotagged data currently produced by the social media network is relatively coarse, accurate only to the city or municipality level (Footnote 1).
Geolocated tweets indicate, of course, where a user currently is, rather than anything about any journey they may make. However, it seems reasonable to assume that, whilst making use of the social network, many people will tweet from both their home and work locations. Hence a pattern of geotagged tweets, over a period of time, ought to contain information about patterns of commuting. Of course, there will be a certain amount of noise in the data as people will tweet from other locations: on their way to/from work, from restaurants, on holiday, etc. Furthermore, not all Twitter users will have a job, nor will all jobs require regular commuting. One of the central questions in this article is the extent to which commuting patterns can be inferred in spite of this noise.
Going from geolocated tweets to commuting patterns requires a heuristic for deciding, for each user, which location is a ‘home’ location and which is a ‘work’ location, based on a pattern of geolocated tweets which may come from a variety of areas. There is a growing literature on the best way of inferring these locations from digital trace data such as tweets or mobile phone calls [19–22]. We make use of arguably the simplest of these: a frequency count. (In Section 4 we experiment with ways of improving on this simple method by making use of temporal information, but it is helpful to first observe the amount of signal that can be extracted from the simplest possible heuristic.) Hence, the area that a user most frequently tweets from is assumed to be their home location, and the second most frequent is assumed to be their work location (users who have a tie for home and work location are discarded). All other locations are assumed to be areas visited which are unrelated to either living or working. To account for users who live and work in the same area, we use a threshold λ, expressed as a proportion: if more than a proportion λ of the user’s tweets are sent from the same area, we assume the user both lives and works in that area. In our results section, we experiment with different values of λ to test the sensitivity of the model to this threshold.
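The frequency-count heuristic can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name `assign_home_work`, the input format (one area identifier per tweet) and the default value of `lam` are assumptions made for the example. The text only specifies that ties for home and work are discarded, so discarding users with a tie for second place is a further assumption.

```python
from collections import Counter


def assign_home_work(areas, lam=0.9):
    """Return (home, work) for one user, or None if the user is discarded.

    areas: one area identifier per geolocated tweet (hypothetical input format).
    lam:   threshold as a proportion of the user's tweets (illustrative default).
    """
    if not areas:
        return None
    ranked = Counter(areas).most_common()
    top_area, top_count = ranked[0]

    # More than lam of all tweets in one area: the user lives and works there.
    if top_count / len(areas) > lam:
        return top_area, top_area

    if len(ranked) < 2:
        return None  # only reachable when lam >= 1

    second_area, second_count = ranked[1]
    # A tie between the two most frequent areas makes home and work ambiguous.
    if top_count == second_count:
        return None
    # Assumption: a tie for second place is also treated as ambiguous.
    if len(ranked) > 2 and ranked[2][1] == second_count:
        return None
    return top_area, second_area
```

For example, a user with six tweets in area A, three in B and one in C would be assigned home A and work B, while a user with 19 of 20 tweets in A would be assigned A for both under `lam=0.9`.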
Having assigned users to home and work locations, we can construct a commuter flow matrix T:
$$\begin{aligned} \mathbf {T}_{ij} = tw_{ij}, \end{aligned} \tag{1}$$
where \(tw_{ij}\) is the number of users who have their home in location i and their work in location j.
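As an illustration, the matrix can be held as a sparse mapping from (home, work) pairs to user counts. The sketch below assumes each user has already been reduced to a `(home, work)` tuple, or `None` if discarded; the function name `commuter_flows` and the toy area labels are hypothetical.

```python
from collections import Counter


def commuter_flows(assignments):
    """Count users per (home, work) pair.

    assignments: iterable of (home, work) tuples, with None for discarded users.
    Returns a Counter whose value at (i, j) plays the role of tw_ij.
    """
    return Counter(pair for pair in assignments if pair is not None)


# Toy input: two users commute A -> B, one lives and works in B, one is discarded.
flows = commuter_flows([("A", "B"), ("A", "B"), ("B", "B"), None, ("C", "A")])
```

Pairs that never occur are simply absent, and a `Counter` reports them as zero, which suits a matrix in which most entries are empty.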
We take as our ground truth dataset commuting data from the 2011 UK census (Footnote 2). The census gives data on commuting volumes within and between ‘local authorities’, which are administrative regions within the UK (of which there are 378 in total; Footnote 3). By comparing estimates from our Twitter model to actual numbers derived from the census, we can assess the accuracy of commuting predictions derived from Twitter. The census also provides information on commuting volumes across different demographic groups, which allows us to assess demographic variation in the accuracy of our results.
We also want to observe how the accuracy of estimates generated by the Twitter model compares to the accuracy of existing methods of commuting flow estimation. There are, currently, two main estimation methods which are used within the literature: the widely used gravity model (e.g. [14, 23, 24]), and the more recent radiation model [25]. In this paper, we opt to use the radiation model as a benchmark, as it has been shown to outperform the gravity model. It is also well-suited to our purposes since it is parameter-free and only requires basic information about each area. Hence it offers a reasonable comparison to our context, where the aim is to infer commuting patterns with only a minimal amount of observational data.
The standard radiation model estimates the commuter flow matrix T using:
$$\begin{aligned} \mathbf {T}_{ij} &= c_{i}\operatorname{Prob}(\mathrm{work}=j|\mathrm{home}=i) \\ &=c_{i}\frac{n_{i}n_{j}}{(n_{i}+s_{ij})(n_{i} + n_{j} + s_{ij})}, \end{aligned}$$
where:
- \(\mathbf {T}_{ij}\) is the ijth entry of T, the number of commuters who live in area i and work in area j;
- \(c_{i}\) is the total number of outward commuters who live in area i;
- \(n_{i}\) is the population in area i and \(n_{j}\) is the population in area j;
- \(s_{ij}\) is the population within a circle centered at area i and with a radius equal to the distance between areas i and j (the populations of areas i and j are not included).
The model assumes that the number of outward or external commuters from an area is proportional to its population. Hence, \(c_{i}=C(n_{i}/N)\) where C is the total number of outward commuters in the population and N is the total population (given by \(N=\sum_{i}n_{i}\)). If C is unknown, the model can estimate \(\operatorname{Prob}(\mathrm{work}=j|\mathrm{home}=i)\), but not absolute commuter numbers. It is worth noting that the radiation model does not offer a prediction for internal commuting, i.e. the number of people who live and work in the same area, a point to which we will return below.
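The parameter-free model can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: each area is represented by a single point (for instance a centroid) in planar co-ordinates, distances are Euclidean, and the function name `radiation_flows` and its arguments are hypothetical.

```python
import math


def radiation_flows(pop, xy, total_commuters):
    """Estimate T_ij with the parameter-free radiation model.

    pop:             list of populations n_i
    xy:              list of (x, y) points, one per area (assumed planar)
    total_commuters: C, the total number of outward commuters
    """
    n_areas = len(pop)
    total_pop = sum(pop)  # N
    dist = [[math.dist(xy[i], xy[j]) for j in range(n_areas)]
            for i in range(n_areas)]

    flows = [[0.0] * n_areas for _ in range(n_areas)]
    for i in range(n_areas):
        c_i = total_commuters * pop[i] / total_pop
        for j in range(n_areas):
            if i == j:
                continue  # the model gives no prediction for internal commuting
            # s_ij: population within distance d_ij of area i, excluding i and j.
            s_ij = sum(pop[k] for k in range(n_areas)
                       if k != i and k != j and dist[i][k] <= dist[i][j])
            flows[i][j] = (c_i * pop[i] * pop[j]
                           / ((pop[i] + s_ij) * (pop[i] + pop[j] + s_ij)))
    return flows


# Toy example: three areas on a line.
T = radiation_flows([100, 200, 300], [(0, 0), (1, 0), (3, 0)], total_commuters=60)
```

The diagonal is left at zero here purely as a placeholder, reflecting the fact that the model says nothing about people who live and work in the same area.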
Yang et al. [26] have introduced a 1-parameter variant of the radiation model, which has been shown to outperform the parameter-free version. We hence also include this model in our analysis. The estimated flow for the 1-parameter model is given by:
$$\begin{aligned} \mathbf {T}_{ij} = c_{i}\frac{ [{(a_{ij} + n_{j})}^{\alpha}- a_{ij}^{\alpha}](n_{i}^{\alpha}+ 1)}{(a_{ij}^{\alpha}+1) [{(a_{ij} + n_{j})}^{\alpha}+1 ]}, \end{aligned} \tag{2}$$
where \(a_{ij} = n_{i} + s_{ij}\). Yang et al. construct the parameter α so that it varies with the average size of regions for which commuting is being estimated. In particular, they calculate α using:
$$ \alpha= { \biggl(\frac{l}{36~[\text{km}]} \biggr)}^{1.33}, \tag{3} $$
where l is the mean ‘length’ of regions under consideration, defined as the square root of their area (for our particular case of UK local authorities, l is equal to 19 km).
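Equations (2) and (3) translate directly into code. The sketch below is a plain transcription for a single pair of areas; the function names `alpha_from_length` and `one_param_flow` are hypothetical, and the inputs \(c_{i}\), \(n_{i}\), \(n_{j}\) and \(s_{ij}\) are assumed to have been computed beforehand.

```python
def alpha_from_length(l_km):
    """Equation (3): alpha as a function of the mean region 'length' in km."""
    return (l_km / 36.0) ** 1.33


def one_param_flow(c_i, n_i, n_j, s_ij, alpha):
    """Equation (2): estimated flow from area i to area j."""
    a_ij = n_i + s_ij
    numerator = ((a_ij + n_j) ** alpha - a_ij ** alpha) * (n_i ** alpha + 1)
    denominator = (a_ij ** alpha + 1) * ((a_ij + n_j) ** alpha + 1)
    return c_i * numerator / denominator


# For UK local authorities, l = 19 km, which gives alpha of roughly 0.43.
alpha_uk = alpha_from_length(19)
```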
We will now give details of the Twitter data collected for the study. Our data covers a one-year period from June 1, 2015 to May 31, 2016. This time period is not ideal, of course, as we are comparing patterns of Twitter data to the census, which took place in 2011. Nevertheless, it is the best data available for addressing the question. Using the filter stream of the Twitter API with public, ‘spritzer’ level access, we collected all geotagged tweets from within a bounding box around the British Isles (Footnote 4). It is worth remarking that the choice of a full-year period is significant. We expect (and indeed we found) that user engagement with the platform is bursty [27], meaning that a large time window is required to build up a consistent pattern of tweets for one user. However, using a year-long period means that we are capturing certain types of bias in our data: for example, we may pick up occasional long-distance movements, such as students moving between their homes and places of study (as found by [22]), which shorter time spans might avoid.
We logged rate limiting messages from the Twitter Streaming API and found that few tweets were omitted per day due to rate limiting. We experienced no rate limiting at all for 176 days and only slight rate limiting on other days (median 4.5 tweets lost on days with rate limiting messages). Power interruptions and network connectivity problems resulted in additional data loss, but there is no indication of any systematic bias from these interruptions. During our time window, 1,980,600 individual users sent a geolocated tweet which fell within our bounding box.
The distribution of user activity is, as might be expected, heavy tailed, with a majority of users relatively inactive. We expect that users who tweet more frequently will give a more accurate signal about their home and work locations, hence we decided to impose a number of filters on the dataset to only include users of the platform who had a relatively high level of engagement. We make use of three filters in particular. First, we discarded users who had fewer than 5 tweets, as a minimum threshold for extracting any kind of signal from the pattern of user engagement. Second, we discarded users who had neither two tweets in two separate local authorities nor more than the λ threshold of tweets within one local authority (meaning it could be assigned as both a home and work location): again, this was done to set a minimum threshold for extracting signal from a pattern of user engagement. Finally, we discarded users whose first and last tweets in their detected locations were less than 30 days apart (to try to eliminate, for example, people who only sent a short burst of tweets from a holiday destination). Application of these filters resulted in a large number of low-intensity users being discarded. After these steps the exact size of the dataset varied with the λ threshold, from just over 560,000 for λ = 0.70 to just over 380,000 for λ = 0.95. We discuss the potential impact of this filtering more in our results section below.
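The three filters can be sketched as a single predicate per user. This is an illustration under assumptions, not the study's code: the function name `keep_user` and the `(timestamp, area)` input format are hypothetical, the second filter is read here as requiring at least two tweets in each of two separate local authorities, and the user's ‘detected locations’ are taken to be the one or two most frequent areas.

```python
from collections import Counter
from datetime import datetime, timedelta


def keep_user(tweets, lam=0.9, min_tweets=5, min_span_days=30):
    """Return True if a user passes all three filters.

    tweets: list of (timestamp, area) pairs for one user (hypothetical format).
    lam:    threshold as a proportion of the user's tweets (illustrative default).
    """
    # Filter 1: minimum number of tweets.
    if len(tweets) < min_tweets:
        return False

    ranked = Counter(area for _, area in tweets).most_common(2)
    top_area, top_count = ranked[0]

    # Filter 2: either one dominant area, or two areas with two tweets each
    # (the second condition is one reading of the text, see the lead-in).
    if top_count / len(tweets) > lam:
        detected = {top_area}
    elif len(ranked) == 2 and ranked[1][1] >= 2:
        detected = {top_area, ranked[1][0]}
    else:
        return False

    # Filter 3: first and last tweets in the detected areas at least 30 days apart.
    times = [t for t, area in tweets if area in detected]
    return max(times) - min(times) >= timedelta(days=min_span_days)
```

Ties between the two most frequent areas are not handled here, since the text treats them at the home/work assignment stage rather than as one of the three filters.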
In order to assign living and working locations to users, we first assigned each geolocated tweet to a local authority area. This assignment was achieved in one of two ways. When exact co-ordinates were included with the tweet, assignment was simple, as any pair of co-ordinates will fall within only one local authority. When co-ordinates were not included, Twitter instead supplies a bounding box around a given place or region of geographic interest (for example, a city, a county or even a country), together with information about the type of bounding box (Footnote 5). If the type of place is defined as a ‘city’ we use the centroid of the bounding box as our point for geolocation, on the assumption that the majority of cities do not cross local authority boundaries (it is worth noting that a ‘city’ in this context also refers to an area of London). In total, we were able to assign 87% of tweets to a local authority, or 122 million tweets in total. Geolocated tweets which could not be assigned are those where the area of geolocation was too large to meaningfully assign to a local authority (for example, tweets can be geolocated to ‘United Kingdom’ or ‘East England’).
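The two-step assignment can be sketched as follows. All names are hypothetical and do not correspond to fields in the Twitter API: a tweet is reduced to optional exact co-ordinates, a place type and an optional bounding box. Local authority boundaries are assumed, for simplicity, to be single simple polygons, whereas real boundaries would need proper geographic tooling.

```python
def tweet_point(coords, place_type, bbox):
    """Pick the point used to geolocate a tweet, or None if it is too coarse.

    coords:     (lon, lat) if exact co-ordinates are attached, else None
    place_type: type of the attached place, e.g. 'city' (hypothetical labels)
    bbox:       (min_lon, min_lat, max_lon, max_lat), or None
    """
    if coords is not None:
        return coords
    if place_type == "city" and bbox is not None:
        min_lon, min_lat, max_lon, max_lat = bbox
        return ((min_lon + max_lon) / 2, (min_lat + max_lat) / 2)
    return None  # e.g. a country-level or region-level place


def point_in_polygon(point, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    for k in range(len(polygon)):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_area(point, areas):
    """Return the name of the first area containing the point, else None."""
    if point is None:
        return None
    for name, polygon in areas.items():
        if point_in_polygon(point, polygon):
            return name
    return None
```

A tweet tagged only with a city-level bounding box would thus be assigned according to where the centre of that box falls, which is where the error discussed in the text can arise if the box straddles a boundary.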
Of course, we do expect this geographical assignment process to contain some error. Users may assign any place name to a tweet: they are not required to assign the ‘correct’ name. Furthermore, some bounding boxes may cross local authority boundaries, making the centroid an unreliable means of distinguishing location. Nevertheless, we expect the process to be broadly accurate. This is supported by an observed strong correlation (\(r=0.78\)) between the number of geolocated tweets and the population of each local authority (a finding which also offers further confirmation for the results from [11]).