An alternative approach to the limits of predictability in human mobility
 Edin Lind Ikanovic^{1} and
 Anders Mollgaard^{1}
Received: 19 September 2016
Accepted: 4 June 2017
Published: 19 June 2017
Abstract
Next place prediction algorithms are invaluable tools, capable of increasing the efficiency of a wide variety of tasks, ranging from reducing the spreading of diseases to better resource management in areas such as urban planning. In this work we estimate upper and lower limits on the predictability of human mobility to help assess the performance of competing algorithms. We do this using GPS traces from 604 individuals participating in a multi-year experiment, the Copenhagen Networks Study. Earlier works, focusing on the prediction of a participant’s whereabouts in the next time bin, have found very high upper limits (\({>}90\%\)). We show that these upper limits are highly dependent on the choice of spatiotemporal scales and mostly reflect stationarity, i.e. the fact that people tend not to move during small changes in time. This leads us to propose an alternative approach, which aims to predict the next location, rather than the location in the next bin. Our approach is independent of the temporal scale and introduces a natural length scale. By removing the effects of stationarity we show that the predictability of the next location is significantly lower (71%) than the predictability of the location in the next bin.
1 Introduction
The understanding of human mobility patterns has changed greatly in the last couple of decades. This has mainly been due to new technologies enabling human displacements to be studied with higher accuracy over a longer period of time. Starting with the tracking of bank notes [1] as a proxy for human movement, studies quickly evolved towards the current use of handheld devices for tracking, using either GSM data [2, 3], connections to Wi-Fi hotspots [4] or GPS receivers [5] to determine location. The main results from these studies have been the discoveries of power laws governing step size and wait time distributions [1], a universal probability density governing human mobility [6], and simple models capturing many statistical features of human mobility [5–8]. It has furthermore been explored how mobility is affected by recency [9], exploration [10], and return to previously visited places [6] and friends [11]. Such discoveries and models can help predict the spread of diseases [12] and cellphone viruses [13], and also enhance socioeconomic forecasting [14–16], city planning [17] and many other fields [5, 18, 19]. Further progress in these areas can be made if geolocation data can be used to accurately predict an individual’s future whereabouts. A crucial part of this work is the construction of viable evaluation mechanisms, thereby raising the question: what are the upper and lower limits, \(\Pi ^{\mathrm{max}}\) and \(\Pi^{\mathrm{min}}\), on the predictability of human mobility?
This question was initially investigated using call detail records from 45,000 cellphones [3]. Each call corresponded to a known location represented by a Voronoi cell, around the closest cell tower, with an average area of 3 km^{2}. The known locations were grouped into 1 hour bins, giving a history of locations \(T_{i}\), for each user i. The work focused on determining how well the best possible algorithm can predict the location of an individual in the next time bin, given \(T_{i}\). They reported an upper limit narrowly peaked at \(\Pi ^{\mathrm{max}} = 93 \%\) and a lower limit of \(\Pi^{\mathrm{min}} = 70 \%\).
This work led to questions being raised about possible biases introduced when using call detail records [20] and about the influence of spatiotemporal scales [21]. The temporal resolution [22, 23] and spatial resolution [4, 23, 24] were investigated with GSM and GPS data for smaller populations. Overall, it was found that the predictability increases with temporal resolution and decreases with spatial resolution. The limits of predictability, as defined in [3], therefore depend on the choice of temporal resolution Δt and spatial resolution Δs.
- How long will an individual stay in his/her current location?
- Where will he/she go next?
2 Data and methods
- \(T_{i}^{\mathrm{bins}}\): Series of time bins.
- \(T_{i}^{\mathrm{loc}}\): Series of locations.
Next we introduce the new mobility encoding \(T_{i}^{\mathrm{loc}}\), which aims to describe trajectories by a sequence of unique locations. Details can be found in the Methods section. We start by filtering all the GPS information such that travel between locations is removed. This leaves us with a set of stationary GPS points that are distributed around the preferred places of the individual. We then use a clustering algorithm (DBSCAN [26]) on the stationary data points to determine the different locations automatically. This approach results in locations that better represent the places where individuals spend their time than the more commonly used Voronoi or square grid cells.
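The clustering step can be illustrated with a small pure-Python DBSCAN. This is only a sketch and not the implementation used in the study: it assumes the stationary points have already been projected to planar coordinates, and the function name `dbscan`, the sample points and the parameter values are illustrative.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each (x, y) point with a cluster id, or NOISE (-1)."""
    def neighbours(i):
        # All points within eps of point i (the point itself included).
        return [j for j, p in enumerate(points)
                if math.dist(points[i], p) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # point i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:     # j is also a core point: keep expanding
                queue.extend(nb)
    return labels

# Hypothetical stationary points: two dense places and one stray point.
home = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
stray = [(50, 50)]
work = [(100, 100), (101, 100), (100, 101), (101, 101)]
labels = dbscan(home + stray + work, eps=5, min_pts=4)
```

Each resulting cluster id plays the role of one location in \(T_{i}^{\mathrm{loc}}\); points labelled as noise would simply be dropped in this sketch.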
We also examine the lower limit of predictability. For the location sequence \(T_{i}^{\mathrm{loc}}\), we use a first order Markov chain to predict the next location [28], i.e. we predict the location that has most often followed the current location. If the current location has not been visited before, then we predict the most visited location as the next one. For the time bin sequence \(T_{i}^{\mathrm {bins}}\) we use a simple predictor, which expects the current location to continue into the next time bin. This predictor will be referred to as “the trivial predictor” and it measures the amount of stationarity in the mobility sequence.
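The two baseline predictors can be sketched as follows. The code makes several assumptions that the text does not specify: predictions are scored online (each guess uses only the history seen so far), ties are broken by whichever location was seen first, and bin pairs that involve a missing value are skipped. The function names are illustrative.

```python
from collections import Counter, defaultdict

def markov_accuracy(locations):
    """Success rate of a first order Markov predictor, scored online."""
    transitions = defaultdict(Counter)   # current location -> counts of next location
    visits = Counter()                   # how often each location has been visited
    hits = total = 0
    for current, nxt in zip(locations, locations[1:]):
        visits[current] += 1
        if transitions[current]:
            # Location that most often followed the current one so far.
            guess = transitions[current].most_common(1)[0][0]
        else:
            # Current location not seen before: fall back to the most visited one.
            guess = visits.most_common(1)[0][0]
        hits += guess == nxt
        total += 1
        transitions[current][nxt] += 1
    return hits / total if total else 0.0

def trivial_accuracy(bins, missing="?"):
    """Success rate of predicting that the next bin repeats the current bin."""
    pairs = [(a, b) for a, b in zip(bins, bins[1:])
             if a != missing and b != missing]
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 0.0
```

For a binned sequence such as `["h", "h", "h", "w", "?", "w", "w"]`, the trivial predictor is scored on the four pairs without a missing value and is correct on three of them.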
3 Results
Next we fix the temporal scale \(\Delta t = 15\) min and vary the spatial scale Δs (Figure 4, right panel). Both the upper limit (squares) and lower limit (discs) increase when Δs is increased, again in agreement with (1). We note that the upper limit is not very sensitive to the spatial scales investigated here (\(\Delta s > 100\) m). We furthermore note the impressive performance of the trivial predictor at large spatial scales. For comparison we also compute the limits of predictability at the spatiotemporal scales considered in [3] (\(\Delta t = 60\) min and \(\Delta s = 1.7\) km). We find that the trivial predictor is successful in \(88.3 \pm3.8 \%\) of the cases, while the upper bound is \(95.5 \pm1.8 \%\), i.e. almost all of the predictability reflects the fact that people do not change location.
We note that another group has simultaneously been working on the same data set with the same methods and they have found \(\Pi ^{\mathrm{max}}=0.68\) [29]. Despite the close match in results, they used very different DBSCAN parameters, namely \(\epsilon_{\mathrm{vicinity}} = 50\) (we use \(\epsilon _{\mathrm{vicinity}} = 5\)) and \(\mathrm{min}\_\mathrm{pts}=2\) (we use \(\mathrm{min}\_\mathrm{pts}=4\)), thereby further underlining the robustness of the results. Our main contribution relative to their work is to derive the length scale from the data, to directly state and investigate conjecture (1), and to relate the predictability of the next location to psychological factors.
Examining which factors impact the predictability of human mobility patterns. \(\pmb{r_{g}}\) is the radius of gyration, \(\pmb{\mathrm{eff}_{\mathrm{places}}}\) is the effective number of places an individual chooses from when changing to a new location and is defined as \(\pmb{2^{H_{\mathrm{unc}}}}\). We also examine the impact of basic personality traits using the Big Five psychological profile [30]. Error bars are determined using the bootstrap method.
Measure  Correlation with \(\boldsymbol {\Pi_{\mathrm{max}}}\) 

\(r_{g}\)  −0.05 ± 0.05 
\(\mathrm{eff}_{\mathrm{places}}\)  −0.26 ± 0.05 
\(\Pi_{\mathrm{min}}\)  0.49 ± 0.04 
Agreeableness  −0.05 ± 0.06 
Conscientiousness  0.04 ± 0.06 
Extroversion  −0.13 ± 0.05 
Neuroticism  0.06 ± 0.06 
Openness  −0.004 ± 0.059 
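As an illustration of the quantity \(\mathrm{eff}_{\mathrm{places}} = 2^{H_{\mathrm{unc}}}\) used in the table, the sketch below assumes that \(H_{\mathrm{unc}}\) is the Shannon entropy, in bits, of an individual's visit frequencies. That definition is not stated in this section, so it should be read as an assumption; the function name is illustrative.

```python
import math
from collections import Counter

def effective_places(locations):
    """Return 2**H_unc, with H_unc the entropy (bits) of visit frequencies."""
    counts = Counter(locations)
    n = len(locations)
    # Assumed definition: H_unc = -sum_k p_k * log2(p_k), p_k = share of visits to k.
    h_unc = -sum(c / n * math.log2(c / n) for c in counts.values())
    return 2 ** h_unc
```

Under this assumption, an individual who splits visits evenly over four places has four effective places, while one who visits ten places but spends nearly all visits at two of them has an effective number close to two.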
Finally, utilizing the psychological profiles of the participants, we are able to examine the impact of their psychological traits on their predictability. The only significant correlation we find here is with extroversion, meaning that the next location of an extroverted individual is statistically harder to predict.
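A correlation with a bootstrap error bar, as reported in the table, could be computed along the following lines. This is a generic sketch: it assumes a Pearson correlation, resampling of individuals with replacement, and the standard deviation of the resampled correlations as the error bar, none of which is specified in the text. The function names and the number of resamples are illustrative.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation; returns None if either sample has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def bootstrap_correlation(xs, ys, n_boot=1000, seed=0):
    """Return (correlation, bootstrap standard error)."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(n_boot):
        # Resample individuals (paired values) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([xs[i] for i in idx], [ys[i] for i in idx])
        if r is not None:              # skip degenerate resamples
            estimates.append(r)
    mean = sum(estimates) / len(estimates)
    var = sum((r - mean) ** 2 for r in estimates) / (len(estimates) - 1)
    return pearson(xs, ys), math.sqrt(var)
```

Here `xs` would hold one trait score per participant and `ys` the corresponding \(\Pi^{\mathrm{max}}\) values.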
4 Conclusion
Our results show that it is possible to extract a wide range of upper and lower limits of predictability of human mobility depending on the filtering and discretization scheme chosen. We have shown the strong dependency of “next bin” predictability on spatiotemporal scales. Furthermore, we have shown that the predictability at large spatial scales and small temporal scales mostly reflects stationarity, namely that people stay in the same spatial bin. This raises the need for an alternative approach to estimate the predictability of human mobility patterns.
The task of predicting human mobility is twofold: how long a person will stay in a certain location and where they will go next. Here we determined an upper limit on the predictability of the latter. We found that the upper limit of this task is much lower than the previously stated ones of \({\sim}93\%\). In particular, by using the natural length scale of human locations we found an upper limit on predictability of \(71.1 \pm4.7 \%\). A lower limit was likewise found using a first order Markov chain model with a success rate of \(39.8 \pm5.9 \%\). Overall, our results indicate that it might not be so trivial to predict human mobility after all.
5 Methods
Converting the raw data into \(T_{i}^{\mathrm{bins}}\). We start by employing an accuracy filter, which removes all the data points with an accuracy below 50 meters. The grid map used is characterized by two parameters: a length scale Δs and the origin of the map. The Technical University of Denmark, where most of the participants were enrolled, was chosen as the origin. This ensured that the grid cells had sides of approximately equal length Δs at the locations where most of the data was collected. The length scales used are \(\Delta s \in [ 100, 200, 400, 800, 1600 ]\) meters.
Small changes in the origin of the grid map can affect the number of locations detected [24]. To mitigate the possible bias introduced by having a fixed origin of the grid map, we add a random offset for each participant, drawn from a uniform distribution on \([0, \Delta s ]\).
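The grid construction with a per-participant offset can be sketched as below. The sketch assumes that positions are already expressed as planar coordinates in meters relative to the chosen origin, and that the offset is drawn independently for the two axes; the text only states that a random offset on \([0, \Delta s]\) is added. The name `make_grid` is illustrative.

```python
import math
import random

def make_grid(delta_s, seed=None):
    """Return a function mapping planar (x, y) in meters to a grid cell index."""
    rng = random.Random(seed)
    # One random offset per participant and axis, uniform on [0, delta_s].
    off_x = rng.uniform(0, delta_s)
    off_y = rng.uniform(0, delta_s)

    def cell(x, y):
        return (math.floor((x + off_x) / delta_s),
                math.floor((y + off_y) / delta_s))

    return cell

# One grid per participant; the seed only makes the example reproducible.
cell = make_grid(delta_s=100, seed=1)
same_place = cell(0.0, 0.0) == cell(0.0, 0.0)
far_apart = cell(0.0, 0.0) != cell(1000.0, 0.0)
```

Creating a separate `cell` function for each participant gives every participant an independently shifted grid.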
Our data was not sampled at a fixed rate. A time binning with a fixed temporal resolution Δt allowed us to convert the raw data into a time series. The binning is done such that for each time bin we choose the most visited location. If two or more locations are tied as the most visited, then we choose one of them at random. The time scales used are \(\Delta t \in [15, 30, 60, 120, 240 ]\) minutes. Time bins with no recorded locations are denoted using a special ? marker. Thus we end up with a time series \(T_{i}^{\mathrm{bins}}\) which depends primarily on Δs and Δt.
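A minimal sketch of this binning step follows. It assumes that "most visited" means the location with the most recorded samples inside the bin, and that timestamps are given in seconds; both are assumptions, and the function and argument names are illustrative.

```python
import random
from collections import Counter

def bin_locations(samples, delta_t, t_start, t_end, seed=None, missing="?"):
    """Convert (timestamp, location) samples into one location per time bin."""
    rng = random.Random(seed)
    n_bins = int((t_end - t_start) // delta_t)
    counts = [Counter() for _ in range(n_bins)]
    for t, loc in samples:
        k = int((t - t_start) // delta_t)
        if 0 <= k < n_bins:            # ignore samples outside the window
            counts[k][loc] += 1
    series = []
    for c in counts:
        if not c:
            series.append(missing)     # no recorded location in this bin
            continue
        top = max(c.values())
        # Ties between most visited locations are broken at random.
        series.append(rng.choice([loc for loc, n in c.items() if n == top]))
    return series

# Four 15-minute bins; the third bin has no samples.
samples = [(0, "A"), (60, "A"), (120, "B"), (1000, "B"), (2700, "C")]
series = bin_locations(samples, delta_t=900, t_start=0, t_end=3600)
```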
Converting the raw data into \(T_{i}^{\mathrm{loc}}\). Again we start by employing the accuracy filter. To reduce the number of data points associated with travel, we employ a second filter inspired by the pause-based model used in [5]. It detects all pairs of data points which are \(15 \pm1.5\) min apart and for which the distance between the two measurements is less than 100 m. These two measurements are then averaged into a single data point representing a place where a participant stood still for roughly a quarter of an hour. This filters out most of the travel information in the dataset, except interruptions such as traffic jams and waiting for public transport.
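The pause filter can be sketched as follows. The code assumes a time-sorted track with planar coordinates in meters and keeps every pair of points that satisfies the two conditions; how overlapping pairs are handled is not specified in the text, so that choice is an assumption. The function name is illustrative.

```python
import math

def stationary_points(track, gap=900.0, tol=90.0, max_dist=100.0):
    """Average pairs of points about 15 min apart and less than 100 m apart.

    track: list of (t_seconds, x_m, y_m) tuples sorted by time.
    """
    out = []
    for i in range(len(track)):
        t1, x1, y1 = track[i]
        for j in range(i + 1, len(track)):
            t2, x2, y2 = track[j]
            dt = t2 - t1
            if dt > gap + tol:
                break                  # later points are even further away in time
            if dt >= gap - tol and math.hypot(x2 - x1, y2 - y1) < max_dist:
                # Replace the pair by its average in time and space.
                out.append(((t1 + t2) / 2, (x1 + x2) / 2, (y1 + y2) / 2))
    return out

# Two pauses separated by a trip of about 2 km.
track = [(0, 0, 0), (900, 10, 0), (1800, 2000, 0), (2700, 2010, 5)]
pauses = stationary_points(track)
```

In this example the pair spanning the trip is rejected by the distance condition, so only the two pauses survive.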
The fraction of missing data, q, changes the entropy rate estimate. By artificially removing data in complete records we can study possible extrapolation methods. We have used a subset of 47 individuals with a complete location record spanning at least 2 weeks. For each of these complete records we determined \(H_{\mathrm{true}}\) using the estimator (6). Removing data from these complete records and comparing the entropy rate determined by our method, \(H_{\mathrm{est}}\), with \(H_{\mathrm{true}}\), we found that we could estimate \(H_{\mathrm{true}}\) within \({\pm}10 \% \) as long as \(q \leq0.5\). Our method is thus able to determine the entropy rate even when we only know half of the locations visited. Earlier this method has been used up to \(q \leq0.7\) [3], but our tests show reliable results only when \(q \leq0.5\).
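The masking experiment can be outlined as below. The sketch only covers hiding a fraction q of a complete record and checking the \({\pm}10\%\) criterion; the entropy rate estimator itself (estimator (6)) is not reproduced here and would be supplied separately. Removing entries uniformly at random is an assumption, and the function names are illustrative.

```python
import random

def mask_fraction(sequence, q, seed=None, missing="?"):
    """Return a copy of the sequence with a fraction q of entries hidden."""
    rng = random.Random(seed)
    n_missing = round(q * len(sequence))
    # Assumption: hidden entries are chosen uniformly at random.
    hidden = set(rng.sample(range(len(sequence)), n_missing))
    return [missing if i in hidden else loc
            for i, loc in enumerate(sequence)]

def within_tolerance(h_est, h_true, tol=0.10):
    """True if the estimate lies within a relative tolerance of the true value."""
    return abs(h_est - h_true) <= tol * abs(h_true)

# Hide half of a short, complete record.
record = list("abcabcabca")
masked = mask_fraction(record, q=0.5, seed=0)
```

Repeating the masking for increasing q and comparing \(H_{\mathrm{est}}\) on the masked record with \(H_{\mathrm{true}}\) on the complete record gives the kind of test described above.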
Declarations
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
[1] Brockmann D, Hufnagel L, Geisel T (2006) Nature 439(7075):462
[2] Gonzalez MC, Hidalgo CA, Barabasi AL (2008) Nature 453(7196):779
[3] Song C, Qu Z, Blumm N, Barabási AL (2010) Science 327(5968):1018. http://science.sciencemag.org/content/327/5968/1018. doi:10.1126/science.1177170
[4] Qian W, Stanley KG, Osgood ND (2013) In: Web and wireless geographical information systems. Springer, Berlin, pp 25-40
[5] Rhee I, Shin M, Hong S, Lee K, Kim SJ, Chong S (2011) IEEE/ACM Trans Netw 19(3):630
[6] Song C, Koren T, Wang P, Barabási AL (2010) Nat Phys 6(10):818
[7] Jiang S, Yang Y, Gupta S, Veneziano D, Athavale S, González MC (2016) Proceedings of the National Academy of Sciences p 201524261
[8] Pappalardo L, Simini F (2016) arXiv preprint arXiv:1607.05952
[9] Barbosa H, de Lima-Neto FB, Evsukoff A, Menezes R (2015) EPJ Data Sci 4(1):21
[10] Pappalardo L, Simini F, Rinzivillo S, Pedreschi D, Giannotti F, Barabási AL (2015) Nat Commun 6
[11] Toole JL, Herrera-Yaqüe C, Schneider CM, González MC (2015) J R Soc Interface 12(105):20141128
[12] Colizza V, Barrat A, Barthelemy M, Valleron AJ, Vespignani A (2007) PLoS Med 4(1):e13
[13] Kleinberg J (2007) Nature 449(7160):287
[14] Gabaix X, Gopikrishnan P, Plerou V, Stanley HE (2003) Nature 423(6937):267
[15] Pappalardo L, Vanhoof M, Gabrielli L, Smoreda Z, Pedreschi D, Giannotti F (2016) Int J Data Sci Anal 2(1-2):75
[16] Frias-Martinez V, Virseda J (2012) In: Proceedings of the fifth international conference on information and communication technologies and development. ACM, New York, pp 76-84
[17] Makse HA, Andrade JS, Batty M, Havlin S, Stanley HE et al. (1998) Phys Rev E 58(6):7054
[18] Kitamura R, Chen C, Pendyala RM, Narayanan R (2000) Transportation 27(1):25
[19] Krings G, Calabrese F, Ratti C, Blondel VD (2009) J Stat Mech Theory Exp 2009(7):L07003
[20] Ranjan G, Zang H, Zhang ZL, Bolot J (2012) Mob Comput Commun Rev 16(3):33
[21] Lin M, Hsu WJ (2014) Pervasive Mob Comput 12:1
[22] Jensen BS, Larsen JE, Jensen K, Larsen J, Hansen LK (2010) In: Machine learning for signal processing (MLSP), 2010 IEEE international workshop on. IEEE, New York, pp 196-201
[23] Smith G, Wieser R, Goulding J, Barrack D (2014) In: Pervasive computing and communications (PerCom), 2014 IEEE international conference on. IEEE, New York, pp 88-94
[24] Lin M, Hsu WJ, Lee ZQ (2012) In: Proceedings of the 2012 ACM conference on ubiquitous computing. ACM, New York, pp 381-390
[25] Stopczynski A, Sekara V, Sapiezynski P, Cuttone A, Madsen MM, Larsen JE, Lehmann S (2014) PLoS ONE 9(4):e95978
[26] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) J Mach Learn Res 12:2825
[27] Tan PN, Steinbach M, Kumar V (2005) Introduction to data mining. Pearson, Upper Saddle River
[28] Lu X, Wetter E, Bharti N, Tatem AJ, Bengtsson L (2013) Sci Rep 3
[29] Cuttone A, Lehmann S, González MC (2016) arXiv preprint arXiv:1608.01939
[30] Digman JM (1990) Annu Rev Psychol 41(1):417
[31] Kontoyiannis I, Algoet PH, Suhov YM, Wyner AJ (1998) IEEE Trans Inf Theory 44(3):1319