Finding disease outbreak locations from human mobility data

Finding the origin location of an infectious disease outbreak quickly is crucial in mitigating its further dissemination. Current methods to identify outbreak locations early on rely on interviewing affected individuals and correlating their movements, which is a manual, time-consuming, and error-prone process. Other methods such as contact tracing, genomic sequencing or theoretical models of epidemic spread offer help, but they are not applicable at the onset of an outbreak as they require highly processed information or established transmission chains. Digital data sources such as mobile phones offer new ways to find outbreak sources in an automated way. Here, we propose a novel method to determine outbreak origins from geolocated movement data of individuals affected by the outbreak. Our algorithm scans movement trajectories for shared locations and identifies the outbreak origin as the most dominant among them. We test the method using various empirical and synthetic datasets, and demonstrate that it is able to single out the true outbreak location with high accuracy, requiring only data of N = 4 individuals. The method can be applied to scenarios with multiple outbreak locations, and is even able to estimate the number of outbreak sources if unknown, while being robust to noise. Our method is the first to offer a reliable, accurate out-of-the-box approach to identify outbreak locations in the initial phase of an outbreak. It can be easily and quickly applied in a crisis situation, improving on previous manual approaches. The method is not only applicable in the context of disease outbreaks, but can be used to find shared locations in movement data in other contexts as well.
Supplementary Information The online version contains supplementary material available at 10.1140/epjds/s13688-021-00306-6.


Outbreak detection method
Here we provide a detailed description of the method used to detect outbreak locations from human mobility data. The method is given as a pseudo-algorithm (see Algorithm 1) and is available as an implementation in Python at https://github.com/franksh/outbreak-detection, which was used to conduct the measurements in the manuscript.
As input, the method uses a set X = {X_i} of mobility trajectories X_i of i = 1, ..., N individuals. Each mobility trajectory X_i is a set of observations of locations x_j recorded at times t_j = t_0, ..., T_max at intervals of ∆t = t_{j+1} − t_j. In practice, the method can also be applied if the intervals ∆t are irregular, or if the recording times t_j vary between individuals, as long as the core inference (i.e. determining the maximum of the score function) is run for all times at which the position of any individual in the dataset changes. For ease of notation, we assume regular intervals ∆t and the same recording times t_j for all individuals here. Besides the mobility trajectories X, the method uses the spatial variance σ as an input parameter (see section 2 for more details on the choice of σ).
The core loop of the outbreak detection method is the following (see Algorithm 1): for each time t = t_0, ..., T_max, we determine the spatial maximum x* of the score function S(x, σ, X), where X is the set of the locations of all individuals at time t. The resulting candidate (x*, t, S*), with S* = S(x*, σ, X), is then added to the list of candidate locations L.
Determining the spatial maximum of S(x, σ, X ) can be done by performing a grid search within the spatial bounds of the dataset. In practice, we find that the global spatial maximum can reliably be found by starting a gradient descent at each of the locations X of the individuals at time t, although we provide no proof that this guarantees finding the global maximum.
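The multi-start heuristic described above can be sketched as follows. For illustration we assume a score function that is a sum of isotropic Gaussian kernels centred at the individuals' positions; the function names and the choice of the Nelder-Mead optimizer are ours, not taken from the reference implementation:

```python
import numpy as np
from scipy.optimize import minimize

def score(x, sigma, X):
    """Illustrative stand-in for S(x, sigma, X): a sum of isotropic
    Gaussian kernels centred at the individuals' positions X."""
    d2 = np.sum((X - x) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

def find_spatial_maximum(X, sigma):
    """Start a local optimisation at every individual's position and keep
    the best result -- the multi-start heuristic described in the text."""
    best_x, best_s = None, -np.inf
    for x0 in X:
        res = minimize(lambda x: -score(x, sigma, X), x0, method="Nelder-Mead")
        if -res.fun > best_s:
            best_s, best_x = -res.fun, res.x
    return best_x, best_s
```

Starting from every individual's position makes it likely, though not guaranteed, that the global maximum is among the local optima found.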
The output of the algorithm is a list of outbreak location candidates L, which can be sorted by their score S. The proposed outbreak location is then the location L* with the highest score S*. A ranked list of the highest-scoring locations can also be reviewed manually, with the actual outbreak location likely among them.
Before determining the estimated outbreak origin L * we perform a final intermediary data-processing step, namely a spatial clustering of all locations in the set L.
This clustering is useful because, in general, many locations are found multiple times, for example at adjacent time points. We use the DBSCAN algorithm with its maximum-distance parameter set to 100 m, i.e. the maximum distance between two observations for them to be considered part of the same location.
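A minimal sketch of this post-processing step, using scikit-learn's DBSCAN on planar coordinates in metres and keeping the highest-scoring member of each cluster (the tuple layout and helper name are our own choices):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_candidates(candidates, eps_m=100.0):
    """Collapse near-duplicate candidate locations found at adjacent time
    points. `candidates` is a list of (x, y, t, score) tuples with planar
    coordinates in metres; the highest-scoring member of each spatial
    cluster is kept, and results are returned sorted by score."""
    xy = np.array([(c[0], c[1]) for c in candidates])
    labels = DBSCAN(eps=eps_m, min_samples=1).fit_predict(xy)
    best = {}
    for lab, cand in zip(labels, candidates):
        if lab not in best or cand[3] > best[lab][3]:
            best[lab] = cand
    return sorted(best.values(), key=lambda c: -c[3])
```

With min_samples=1, every candidate belongs to some cluster, so no candidate is discarded as noise.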

Algorithm 1: Outbreak detection algorithm
Input: The mobility trajectories X = {X_i} and the spatial variance σ, where each X_i = {(x_j, t_j)} is a set of observations of locations x_j recorded at the times t_j = t_0, ..., T_max at intervals ∆t = t_{j+1} − t_j.
Output: A set of likely outbreak locations L = {(x_j, t_j, S_j)}, including the spatial location x_j, time t_j, and score S_j of each candidate.
Procedure Find-outbreak-locations(X, σ):
    while t < T_max do
        // Get current locations of individuals
        X ← {x_j ∈ X | t_j = t}
        // Determine location with maximum score
        x* ← argmax_x S(x, σ, X)
        // Add the candidate to the list
        L ← L ∪ {(x*, t, S(x*, σ, X))}
        t ← t + ∆t
    return L

Figure 1: Influence of the standard deviation σ on the detection accuracy. We use the scenario of a single outbreak origin with a sample size of N = 5 individuals, for which we expect close to 100% accuracy (see results in the main manuscript). We find little dependence of the accuracy on the value of σ. Results are shown for the dEPR dataset with 1,000 measurements at each value.

Influence of σ on the detection accuracy
The standard deviation σ is the only free parameter in the score function. It is the standard deviation of the spatial probability kernel that is placed at each individual's position (see also Eq. 2 in the main text).
We find that the value of σ has little impact on the accuracy of the outbreak detection method, see Figure 1. We test the method for a range of values of σ, from 100 m to 100 km, which we convert from cartesian to radial coordinates as σ_radians = σ_cartesian / r_earth, with the earth radius r_earth = 6371.0 km.
The limited influence of σ on the method is not surprising, as σ does not change the location of the maxima of the score function S (or the objective function F), but mainly affects the efficiency of the numerical optimization.

CNS dataset
The CNS dataset is a subset of the data collected during the Copenhagen Network Study in 2014. In the study, a group of 1,000 students were provided with smartphones that acted as social sensors. The data they generated was collected, including calls and text messages, social media usage, and face-to-face contacts. The location was measured using the GPS sensor of the smartphone or the cell tower location, as available. The accuracy of the collected positions can vary from a few meters for GPS locations to hundreds of meters for cell tower locations; 90% of the samples have a reported accuracy better than 40 meters. A detailed description of the data and study methodology is given in [1].
The CNS dataset contains the GPS trajectories of 689 individuals, recorded during the month of February 2014. In total, 1,472,296 data points are included, with a maximum temporal resolution of 15 minutes. The data density varies per individual and over time. On average, 2,136 data points are recorded per user, corresponding to 76.3 data points per day. This means that for 79% of all 15-minute segments, location data was recorded for the average user. If a time segment is missing data, we use the value from the previous segment (thus assuming the user has not moved).
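Snapping irregular samples to a regular 15-minute grid and forward-filling gaps can be done with pandas, for example (the timestamps and coordinates below are made up for illustration):

```python
import pandas as pd

# Hypothetical raw location samples for one user: irregular timestamps.
raw = pd.DataFrame(
    {"lat": [55.68, 55.69, 55.70], "lon": [12.57, 12.58, 12.59]},
    index=pd.to_datetime(
        ["2014-02-01 00:05", "2014-02-01 00:50", "2014-02-01 02:10"]
    ),
)

# Snap to a regular 15-minute grid, keeping the last fix in each bin, then
# forward-fill empty bins -- i.e. assume the user has not moved.
grid = raw.resample("15min").last().ffill()
```

Each empty segment inherits the user's last known position, matching the assumption stated above.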
We perform no additional processing to the data.

GEOLIFE dataset
The GEOLIFE dataset contains GPS trajectories collected in the Geolife project conducted by Microsoft Research Asia. The GPS signal was recorded using different GPS loggers and GPS-enabled phones. Participants were mostly students and faculty attending Beijing University. The study and its methodology is described in detail in [2,3,4].
The original dataset contains data of 178 users, collected over a period of four years. The supersampling ensures that the GEOLIFE dataset has a high enough density to reliably find common locations among individuals and to generate outbreak scenarios (the scenario generation is described in the main manuscript). However, in our measurements we limit the maximum sample size to N = 5 individuals to avoid systematic biases. Increasing the sample size beyond N = 5 leaves very few locations where this many individuals have gathered at any point in the dataset. We find that this limited set of locations, which are mostly living areas or lecture halls at Beijing University, introduces systematic biases in the data. For example, the error in determining the outbreak time ∆t increases starting from N = 4, in contrast to all other datasets, where the error decreases monotonically with sample size N (see Fig. 2 in the main text). The increase in ∆t is most likely due to the increasingly limited set of locations, which emphasizes locations that are visited regularly by a similar group of people, such as lecture halls.

Synthetic datasets

Exploration and preferential return model (dEPR dataset)
The dEPR dataset is created from simulations of the density-EPR (d-EPR) model.
The d-EPR model is a variation of the EPR model (exploration and preferential return) which was first proposed in [5] and subsequently expanded to include spatial density in [6]. We use the default parameters as described in the original literature whenever possible. We will briefly recapitulate the model here and then describe the simulation algorithm used to generate the dataset.

Model description
In the d-EPR model, the mobility of an individual follows two processes: one can either return to a previously visited location, or explore a new location. The probability for either process depends on the number S(t) of distinct locations visited up to time t. The probability to explore a new location is P_new = ρ S^(−γ), and the complementary probability to return to a previously visited location is P_ret = 1 − ρ S^(−γ). The parameter 0 < ρ < 1 signifies the likelihood to explore at each step, and the parameter γ ≥ 0 determines the rate at which new locations are explored: say an individual has performed n steps up to time t. As S increases by 1 each time a new location is explored, and thus dS/dn = P_new, it follows that the number of visited locations grows as S ∼ n^(1/(1+γ)).
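The sublinear growth of S can be checked with a small simulation of the explore/return decision (the function name is ours; only the decision rule P_new = ρ S^(−γ) is taken from the model):

```python
import random

def simulate_visited_count(n_steps, rho=0.6, gamma=0.21, seed=1):
    """Count distinct visited locations S after n_steps decisions under the
    EPR exploration rule: a new location is explored with probability
    P_new = rho * S**(-gamma), otherwise the individual returns."""
    random.seed(seed)
    S = 1
    for _ in range(n_steps):
        if random.random() < rho * S ** (-gamma):
            S += 1
    return S
```

For γ = 0.21, multiplying the number of steps by 10 should multiply S by roughly 10^(1/1.21) ≈ 6.7 rather than 10, reflecting S ∼ n^(1/(1+γ)).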
When exploring a new location, the new location j ≠ i is chosen among a set of locations G with associated densities or frequencies g_j (this is the "density" extension of the original EPR model). Here, we use the locations obtained from Twitter data in the Berlin area (see section 5.1), where the frequency g_j is the number of Tweets counted at the location. The probability to choose location j is then given as p_ij = (1/G) g_j / r_ij^2, where r_ij is the geographic distance between i and j and G = Σ_{j≠i} g_j / r_ij^2 is a normalization constant.
When returning to a previously visited location, a location s is chosen at random with probability proportional to the number f_s of previous visits to that location. This preferential return mechanism results in a distribution of visitation frequencies that follows a Zipf's law f_k ∼ k^(−ζ) when locations are sorted by their rank k, which is often observed in empirical data [7].

Simulation algorithm
We generate the dataset dEPR by generating mobility trajectories {x_i(t)} for a total of N = 10,000 individuals over a period of T_max = 30 days in the following manner (the description follows SI note 14 in [6]): Let G be the set of locations obtained from Twitter data with associated frequencies g_i (see above). Starting at time t = 0, we place the individual at a location i chosen with probability p_i = g_i / P, with normalization P = Σ_i g_i. The number of locations visited is S = 1. Then, we follow these steps:
1 We draw the waiting time ∆t, that is, the time until the next mobility event takes place, from a power-law distribution with an exponential cutoff, P(∆t) ∼ ∆t^(−1−β) e^(−∆t/τ). We use the parameters β = 0.8 and τ = 17 hours as measured in the original EPR study [5]. The random numbers for the waiting time are generated using the process described in algorithm 5 in [7]. We update the time to t → t + ∆t.
2 We calculate the probability P_new = ρ S^(−γ) that the individual explores a new location, where we use ρ = 0.6 and γ = 0.21 as in [5]. If exploration is chosen, we continue at step 3, otherwise at step 4.
3 When exploring, a new location j is chosen according to the probabilities p_ij described above, and S is increased by 1.
4 When returning, a previously visited location is chosen with probability proportional to its number of previous visits f_s.
5 We add the chosen location at the time t to the movement trajectory {x_i(t)} and return to step 1, unless t ≥ T_max.
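The waiting-time draw in step 1 can be sketched with a simple rejection sampler; this is an illustrative stand-in for algorithm 5 of [7], not the exact method used there, and the lower cutoff dt_min is our assumption:

```python
import math
import random

def draw_waiting_time(beta=0.8, tau=17.0, dt_min=0.25):
    """Draw dt (in hours) from P(dt) ~ dt**(-1-beta) * exp(-dt/tau) for
    dt >= dt_min, by sampling the pure power law via inverse transform and
    thinning with the exponential cutoff (acceptance prob. <= 1)."""
    while True:
        u = 1.0 - random.random()  # uniform in (0, 1], avoids u == 0
        dt = dt_min * u ** (-1.0 / beta)  # Pareto(beta) sample >= dt_min
        if random.random() < math.exp(-dt / tau):
            return dt
```

The envelope is the pure power law, and the target density equals the envelope times exp(−dt/τ) ≤ 1, so the thinning step yields exact samples from the truncated distribution.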

Model description
The datasets sOD and dOD are generated with an agent-based simulation based on origin-destination (OD) mobility data. The mobility data we use is extracted from the dataset ODdata (see below), from which we obtain the transition probabilities P_ij(t), where P_ij(t) is the probability that a trip that started in spatial cell i at time t ended in cell j.
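Deriving transition probabilities from raw trip counts amounts to a row normalization of the OD data; a sketch with a hypothetical helper, representing the counts as a tensor F[t, i, j]:

```python
import numpy as np

def transition_probabilities(F):
    """Row-normalise an OD tensor F[t, i, j] of trip counts into transition
    probabilities P[t, i, j] = F[t, i, j] / sum_j F[t, i, j]. Rows with no
    outgoing trips are set to zero."""
    F = np.asarray(F, dtype=float)
    row_sums = F.sum(axis=-1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        P = np.where(row_sums > 0, F / row_sums, 0.0)
    return P
```

Every row of P with outgoing flux sums to one, so each row can be used directly as a categorical distribution over destination cells.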
In our model, we assume that individuals transition between spatial cells according to the probabilities P_ij. The datasets sOD and dOD differ in how the exact latitude-longitude location within the spatial cells is chosen: For the dataset sOD (spatial-OD), the location is chosen randomly within the bounds of the spatial cell. For the dataset dOD (density-OD), the location is chosen from the locations extracted from Twitter that fall within the spatial cell, proportional to the frequency of Tweets at the location.

Simulation algorithm
The datasets sOD and dOD each consist of mobility trajectories of N = 10,000 individuals over a period of T_max = 30 days, generated in the following way: Starting at time t = 0, we place the individual in a starting cell i, which is chosen randomly proportional to the total flux F_i that originates from each cell over all observation times t_k; that is, the starting cell i is chosen with probability p_i = F_i / Σ_j F_j. Within the starting cell i, the individual is placed at an exact location as described above (uniformly at random for sOD, proportional to Tweet frequency for dOD). Then, we follow these steps:
1 We draw the waiting time ∆t, that is, the time until the next mobility event takes place, from a power-law distribution with an exponential cutoff, same as detailed for the dEPR model described above. We update the time to t → t + ∆t.
2 We draw the target spatial cell j from the transition probabilities P ij (t).
3 Within the spatial cell j, we choose the exact spatial coordinates x_m as detailed for the starting cell above.
4 We add the location x_m at the time t to the movement trajectory {x_i(t)} and return to step 1, unless t ≥ T_max.
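The loop above can be condensed into a toy sketch; for brevity we drop the time dependence of the OD data and the power-law waiting times, and use the sOD placement rule (uniform jitter around hypothetical cell centres). All names are our own:

```python
import numpy as np

def od_trajectory(F, cell_centres, n_steps=50, seed=0):
    """Toy sOD-style walk. F[i, j] are trip counts (time dependence dropped
    for brevity). The start cell is drawn proportional to outgoing flux;
    subsequent cells follow the row-normalised transition probabilities;
    within a cell the position is uniform around the cell centre."""
    rng = np.random.default_rng(seed)
    F = np.asarray(F, dtype=float)
    P = F / F.sum(axis=1, keepdims=True)
    flux = F.sum(axis=1)
    i = rng.choice(len(F), p=flux / flux.sum())
    traj = []
    for _ in range(n_steps):
        i = rng.choice(len(F), p=P[i])
        pos = np.asarray(cell_centres[i]) + rng.uniform(-0.25, 0.25, size=2)
        traj.append(pos)
    return traj
```

In the full simulation the transition matrix would additionally depend on the hourly bin of the current time t.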

Dataset statistics
The statistics of the sOD and dOD datasets are very similar to those of the dEPR dataset, as large parts of the methodology are the same. Both datasets contain roughly 2.25 million data points divided among 10,000 individuals, corresponding to 7.49 data points per individual per day.

Berlin twitter dataset
The dataset tweets-berlin comprises locations that were extracted from geolocated posts on Twitter in the area of Berlin. The dataset is used in the dEPR and dOD models, which require a spatial density distribution of locations, that is, a set of locations G with associated visitation frequencies g_i. We obtain these locations G from geolocated posts on Twitter, as described in the following. The resulting dataset tweets-berlin is made available in the OSF repository.

We collected posts published on Twitter during January to December 2018, using the Twitter API v1 [8] and the Python package tweepy [1]. Among all posts, we only collected those that contained a geo-tag, that is, a geo-location as measured by the device (mostly via GPS). Around 0.85% of all posts include such a geo-tag [9]. Furthermore, we only collect posts in the area of Berlin, which we define by a geographic bounding box.

From the collected posts, we extract a set of locations G by clustering the spatial coordinates of the posts. Clustering is a common method used to identify unique locations from geolocated data [10]. Posts that come from the same location are likely to show small variations in the longitude-latitude coordinates, despite representing the same location. By clustering posts in close spatial vicinity, one can identify unique locations and correctly count the number of posts there. We use the density-based DBSCAN algorithm for the spatial clustering, which is a common choice for this task [11]. The algorithm has one main parameter, the maximum distance between two posts at which they can still be considered part of the same location, which we set to 100 m; this is reasonable in our context and the default choice in the scikit-mobility Python package [10]. Finally, we define the spatial coordinates of each location identified by the clustering as the average longitude and latitude coordinates of all posts associated with the location.
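The clustering step can be sketched with scikit-learn's DBSCAN using the haversine metric, which expects coordinates in radians and a distance threshold converted from metres to radians by dividing by the earth radius. The helper below is illustrative, not the exact pipeline:

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000

def cluster_posts(latlon_deg, eps_m=100.0):
    """Group geo-tagged posts into unique locations: DBSCAN with the
    haversine metric on (lat, lon) in radians, eps converted from metres.
    Each location is the mean coordinate of its posts, with the post count
    serving as the frequency g_i."""
    coords = np.radians(latlon_deg)
    labels = DBSCAN(
        eps=eps_m / EARTH_RADIUS_M, min_samples=1, metric="haversine"
    ).fit_predict(coords)
    locations = {}
    for lab in np.unique(labels):
        members = np.asarray(latlon_deg)[labels == lab]
        locations[lab] = (members.mean(axis=0), len(members))
    return locations
```

With min_samples=1 every post is assigned to a location, so the frequencies g_i sum to the total number of posts.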
The resulting dataset tweets-berlin contains 63,607 posts by 7,493 unique users. Among these posts, we identify 7,436 unique locations by clustering, which we use as the set of locations G in the mobility models. The location frequency g_i is the number of associated posts at that location. Figure 3 shows the spatial distribution and some characteristic statistics of the dataset.

OD dataset
In the mobility models we employ an origin-destination matrix F, which we obtained from call-data records of a mobile phone provider for the area of Berlin (see an illustration in Fig. 4). The dataset ODdata is collected from the 46 million German customers of the provider. Included are all trips that either started or ended within the district of Berlin. A trip is recorded when a device leaves its current cell tower area A and passes through one or multiple other towers until it becomes stationary again in cell tower area B; this is counted as one trip from cell A to B. A device is considered "stationary" if no movement is recorded for approximately 15 minutes. The start and end areas A and B can be the same, meaning that self-loops are recorded as well.
The trips recorded on the level of cell towers are subsequently aggregated into spatial cells i, which range in size from 500m × 500m in the city center to 5km × 5km in more rural areas, depending on cell tower density (see Fig. 4). Our dataset encompasses 780 spatial cells, which are either cells within the district of Berlin, or cells where trips from or to such Berlin-cells started or ended, respectively.
Trips are also temporally aggregated, into hourly bins. Specifically, the data we use was collected during Q4 of 2018 and averaged into one "average week": for each hourly bin, the trip counts are averaged over all corresponding hours of the week within the quarter. As a result, we obtain the origin-destination matrix F, where each entry F_ij(t) denotes the number of trips from spatial cell i to cell j at time t, where t ranges in hourly increments from Monday 0 am to Sunday 12 pm. In the simulations of the mobility models, we employ periodic boundary conditions to simulate trajectories longer than one week, i.e. F(t + 7 days) := F(t). The matrix F includes a total of 6,945,418 trips, or an average of 41,341 trips starting in each hourly bin.
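The periodic boundary condition is a simple modular lookup over the 168 hourly bins of the average week, sketched here with a hypothetical helper:

```python
HOURS_PER_WEEK = 7 * 24  # 168 hourly bins in the "average week"

def od_matrix_at(F_week, t_hours):
    """Look up the hourly OD matrix with periodic boundary conditions,
    F(t + 7 days) := F(t). `F_week` is a sequence of 168 matrices, and
    `t_hours` is the simulation time in hours (fractions are floored to
    the containing hourly bin)."""
    return F_week[int(t_hours) % HOURS_PER_WEEK]
```

Any simulation time beyond the first week simply wraps around to the corresponding bin of the average week.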
The average number of trips between (or within) the 780 spatial cells in each hourly bin is 234.

Data availability
The synthetic datasets dEPR, sOD and dOD are available in the OpenScienceFramework (OSF) repository https://osf.io/3rzh8/. These each include the simulated mobility trajectories of 10,000 individuals over 30 days at a resolution of 15 minutes. The two auxiliary datasets tweets-berlin and ODdata, which are used for the mobility simulations, are also made available in the OSF repository.
The GEOLIFE dataset collected by Microsoft Research is available online and linked in the original study [2,3,4].
The CNS dataset was provided to the authors for the purpose of this study by the original authors of the Copenhagen Network Study; please refer to them for requests for access to the data [1].