We focused our study on Santiago, the most populated city in Chile, with almost 8 million inhabitants. Comprising an area of 867.75 km², urban Santiago is composed of 35 independent administrative units called municipalities. The city has experienced accelerated growth in the last few decades, a trend that has been predicted to continue at least until 2045 [8]. Chile, and Santiago in particular, is one of the developing regions of South America with the highest mobile phone penetration index: there are about 132 mobile subscriptions per 100 people (footnote 6). Santiago’s growth and the general availability of mobile phones make it an excellent city in which to perform research based on mobile communication data.
2.1 Datasets
We use the following complementary datasets:
Santiago Travel Survey and traffic analysis zones
The Santiago 2012 Travel Survey (footnote 7), also known as the Origin-Destination survey or ODS, contains 96,013 trips from 40,889 users. The results of this survey are used in the design of public policies related to transportation and land use. The survey includes traffic analysis zones of the entire Metropolitan Region, encompassing Santiago and nearby cities. We use this zoning resource for two reasons. On the one hand, the extent and boundaries of each zone take residential and floating population density, administrative boundaries, and city infrastructure into account. This enables the comparison of several phenomena between zones. On the other hand, it allows us to integrate other sources of information, providing results that can be compared to other datasets such as land use properties [9].
The complete survey includes 866 zones; however, we were interested in the urban areas of a single city. Since these are densely populated, we restricted our analysis to zones with a surface area under 20 km². As a result, the maximum zone area is 18.37 km², with a mean of 1.34 km² and a median of 0.72 km². Finally, we kept only zones that have both cell phone towers and Pokémon points of interest (see Figure 1), resulting in 499 zones covering 667 km², about 77% of Santiago.
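The zone selection described above can be sketched as a simple filter. This is only an illustration: the record layout and the field names (`area_km2`, `n_towers`, `n_pokepoints`) are hypothetical, and the toy zones are made up; only the 20 km² threshold and the coverage figures come from the text.

```python
def select_zones(zones, max_area_km2=20.0):
    """Keep zones below the area threshold that contain at least one
    cell tower and at least one PokéPoint (field names are hypothetical)."""
    return [
        z for z in zones
        if z["area_km2"] < max_area_km2
        and z["n_towers"] > 0
        and z["n_pokepoints"] > 0
    ]

# Toy records, invented for illustration only.
zones = [
    {"id": "A", "area_km2": 0.72, "n_towers": 3, "n_pokepoints": 12},
    {"id": "B", "area_km2": 35.0, "n_towers": 5, "n_pokepoints": 4},   # too large
    {"id": "C", "area_km2": 1.34, "n_towers": 0, "n_pokepoints": 7},   # no towers
    {"id": "D", "area_km2": 18.37, "n_towers": 1, "n_pokepoints": 0},  # no PokéPoints
]
selected = select_zones(zones)

# Share of urban Santiago covered, using the two areas quoted in the text.
coverage = 667 / 867.75
```

Only zone "A" survives the filter in this toy input, and `coverage` rounds to 0.77, matching the share reported above.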
Ingress portals/Pokémon points of interest
Before Pokémon Go, Niantic Labs launched Ingress [1] in 2012. Ingress is a location-based game where players choose one of two available teams and try to hack (take control of) portals placed in real locations world-wide. Portal locations are crowd-sourced and include ‘a location with a cool story, a place of historical or educational value,’ ‘a cool piece of art or unique architecture,’ ‘a hidden-gem or a hyper-local spot,’ among others (footnote 8). The definition of a portal, thus, includes points of interest that go beyond the definition used in, for instance, check-in based social networks [2]. Once a portal is ‘hacked,’ it belongs to the corresponding faction. A set of portals belonging to the same faction defines the limit of an area controlled by it. Since players need to be close to portals to hack them, the game makes players explore the city to find portals to hack and conquer for their own teams.
Pokémon Go shares many game mechanics with Ingress, including the team concept. The main difference between the games is that while in Ingress players capture portals, in Pokémon Go they capture wild pocket monsters. A subset of Ingress portals is defined to be a PokéStop (a place to check in and get items) or a PokéGym (a place to battle against the Pokémon of other factions). In this paper, both are referred to as PokéPoints. Additionally, there are hidden Pokémon respawn points, where different creatures tend to appear. The main mechanic of the game is that players must walk around and explore to find creatures to capture. Note that all players see the same creatures, and one creature may be captured by many players. The game motivates walking in two ways: first, players walk to points of interest that are scattered around the city (see Figure 1); and second, by walking 2, 5 or 10 km, players can also hatch eggs containing random pocket monsters that have better biological properties than those caught in the wild.
Mobile communication records
Telefónica has 1,464 cell phone towers in the municipalities under consideration. We studied an anonymized Call Detail Records (CDR) dataset from Telefónica Chile. This dataset contains records from seven days prior to the launch of Pokémon Go in 2016 (from July 27th to August 2nd) and seven days after (from August 4th to August 10th). We did not take into account the day of the official launch of Pokémon Go, as there was no specific hour in which the game was officially and generally available. Also note that the dataset contains pre-paid and contract subscriptions from Telefónica.
The dataset contains data-type events rather than voice CDRs [6]. Unlike typical Call Detail Records for voice, each data event has only one assigned tower, as there is no need for a destination tower. Each event has a size attribute that indicates the number of KiB downloaded since the last registered event.
We did not analyze the records from the entire customer population in Santiago. Instead, we applied the following filtering procedure: (i) we filtered out those records that do not fall within the limits of the zones from the travel survey, and also those with a timestamp outside the range between 6:00 AM and 11:59 PM; (ii) to be considered, mobile devices must have been active every day under study, because a device that does not show regular events may belong to a tourist or to someone who is not from the city, or may not exhibit human-like mobility patterns, as is the case for points-of-sale (which are mostly static); (iii) only devices that downloaded more than 2.5 MiB and less than 500 MiB per day were included, as volumes outside this range indicate either inactivity or activity that is unusual for a human (e.g., the device could be running an automated process); (iv) we used a Telefónica categorization scheme that associates an anonymized device ID with a certain category of service: for example pay-as-you-go, contractual, enterprise, etc. This gives us a good idea of the general kind of account holders. Thus, every step of this procedure was taken to ensure that events were triggered by humans. After applying these filters, the dataset comprised records from 142,988 devices.
Our filtering procedure is designed so that a positive difference in the number of connections between two different days represents more people within a given zone. Depending on conditions such as time and location, we may interpret that some of those people are on the street. For instance, we may look at typical times when people commute, or at places where people are either inevitably outdoors (e.g., in a park) or inevitably in residential areas, which tend to have WiFi networks (footnote 9).
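Steps (ii) and (iii) of the filtering procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: the input layout is hypothetical, the records are assumed to have already passed step (i), the volume bounds are assumed to apply to each day separately, and step (iv) is omitted because the text does not specify how the categories were used.

```python
from collections import defaultdict

MIN_MIB, MAX_MIB = 2.5, 500.0  # daily download bounds quoted in the text

def select_devices(daily_usage, study_days):
    """Return the ids of devices active on every study day with a plausible
    daily volume. `daily_usage` is an iterable of hypothetical
    (device_id, day, mib_downloaded) tuples, pre-filtered by zone and hour."""
    per_device = defaultdict(lambda: defaultdict(float))
    for device, day, mib in daily_usage:
        per_device[device][day] += mib  # total MiB per device and day

    kept = set()
    for device, by_day in per_device.items():
        # A missing day counts as 0 MiB, so it fails the lower bound;
        # this enforces "active every day" and the volume range together.
        if all(MIN_MIB < by_day.get(day, 0.0) < MAX_MIB for day in study_days):
            kept.add(device)
    return kept

# Toy usage records, invented for illustration.
study_days = [1, 2, 3]
usage = [
    ("a", 1, 10.0), ("a", 2, 6.0), ("a", 2, 4.0), ("a", 3, 10.0),  # regular user
    ("b", 1, 10.0), ("b", 3, 10.0),                                # inactive on day 2
    ("c", 1, 10.0), ("c", 2, 10.0), ("c", 3, 900.0),               # unusually heavy
    ("d", 1, 1.0), ("d", 2, 10.0), ("d", 3, 10.0),                 # nearly inactive
]
kept = select_devices(usage, study_days)
```

In this toy input only device "a" is retained; the other three each violate one of the two rules.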
Land use clusters
We consider each traffic analysis zone to belong to one of three land use categories: residential, business, and mixed activities (e.g., business plus recreation or shopping activities). These categories are the result of our previous work on land use and CDR data, which is based on hierarchical clustering of the time-series of connections at each zone of the city [9].
2.2 Approach
This study uses a natural experiment approach to measure the Pokémon Go effect at the city scale. To do this, we analyzed the change in population patterns before and after the launch of Pokémon Go as evidenced by CDR data. First, we described a method of smoothing the number of connected devices at each cell tower according to several snapshots of the tower network. A snapshot is the status of the cell phone network in a given time-window [10]. Then, we aggregated these device counts at the zone level to define a set of observations that we evaluated in a regression model. We took into account covariates that enabled us to isolate and quantify the Pokémon Go effect.
Device counts at each tower and zone level aggregation
Let \(e\in E\) be a network event, and let \(|E|\) be the total number of such events. A network event e is a tuple \((d,u,b,z)\), where d is a timestamp with a granularity of one minute, u is an anonymized user id, b is a tower id, and z is one of the previously defined traffic analysis zones of Santiago. For each tower b and time d we built a time-series \(B_{d,b}\) representing the number of unique users from E connected to b at d. Due to the sparsity of CDR data, B may have gaps: the time-series can be null (\(B = 0\)) at a point in time when there were in fact active devices at the corresponding tower. To account for this sparsity and obtain a continuous time-series, we smoothed each time-series B using Locally Weighted Scatterplot Smoothing (LOWESS) [11], obtaining \(B'_{d,b}\). To obtain a LOWESS curve, several non-parametric polynomial regressions are performed in a moving window. The size of this window is the bandwidth parameter of the model. In our implementation, this value is 30, which is interpreted as follows: each connection influences (i.e., is counted into) its corresponding location for 30 minutes. Then, for each zone z, we aggregated the smoothed series into \(S_{d,z}\) by summing all time-series \(B'_{d,b_{i}}\) whose tower \(b_{i}\) lies in z (determined using a point-in-polygon test, as in [12]). Each \(S_{d,z}\) time-series thus represents the floating population profile of a zone under study.
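The three steps above (unique-user counts per tower, smoothing, and zone-level sums) can be sketched as follows. This is a simplified illustration rather than the implementation used in the study: the smoother is a single-pass local linear fit with tricube weights, without the robustness iterations of full LOWESS; treating the bandwidth as a 30-minute radius around each minute is an assumption; and the tower-to-zone assignment is read from the event tuple instead of being computed with a point-in-polygon test. All function names are hypothetical.

```python
from collections import defaultdict

def unique_users_per_minute(events, n_minutes):
    """Build B[b][d], the number of distinct users at tower b in minute d,
    from (d, u, b, z) event tuples. Also returns a tower -> zone mapping."""
    seen = defaultdict(set)  # (tower, minute) -> set of user ids
    tower_zone = {}
    for d, u, b, z in events:
        seen[(b, d)].add(u)
        tower_zone[b] = z
    counts = {b: [0] * n_minutes for b in tower_zone}
    for (b, d), users in seen.items():
        counts[b][d] = len(users)
    return counts, tower_zone

def local_linear_smooth(series, bandwidth=30):
    """Simplified LOWESS: at each minute, fit a tricube-weighted line to the
    points closer than `bandwidth` minutes and evaluate it at that minute."""
    n = len(series)
    smoothed = []
    for x0 in range(n):
        sw = swx = swy = swxx = swxy = 0.0
        for x in range(max(0, x0 - bandwidth + 1), min(n, x0 + bandwidth)):
            dx = x - x0  # centered coordinate, so the fit at x0 is the intercept
            w = (1.0 - (abs(dx) / bandwidth) ** 3) ** 3
            sw += w
            swx += w * dx
            swy += w * series[x]
            swxx += w * dx * dx
            swxy += w * dx * series[x]
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:  # a single point in the window: weighted mean
            smoothed.append(swy / sw)
        else:
            slope = (sw * swxy - swx * swy) / denom
            smoothed.append((swy - slope * swx) / sw)
    return smoothed

def aggregate_by_zone(series_by_tower, tower_zone, n_minutes):
    """Build S[z][d] by summing the series of every tower located in zone z."""
    profiles = defaultdict(lambda: [0.0] * n_minutes)
    for b, series in series_by_tower.items():
        for d, value in enumerate(series):
            profiles[tower_zone[b]][d] += value
    return dict(profiles)

# Toy events (minute, user, tower, zone), invented for illustration.
events = [(0, "u1", "t1", "z1"), (0, "u1", "t1", "z1"), (0, "u2", "t1", "z1"),
          (1, "u1", "t2", "z1"), (2, "u3", "t3", "z2")]
counts, tower_zone = unique_users_per_minute(events, n_minutes=3)
smoothed = {b: local_linear_smooth(s, bandwidth=30) for b, s in counts.items()}
profiles = aggregate_by_zone(smoothed, tower_zone, n_minutes=3)
```

Note that the repeated event of user "u1" at tower "t1" is counted once, as required by the definition of \(B_{d,b}\). In practice a library implementation of LOWESS would replace `local_linear_smooth`.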
Measuring the Pokémon Go effect
To measure the city-wide Pokémon Go effect we considered the availability of the game as a city intervention, which started on the day of the launch of the game. Our hypothesis is that the following days would show an effect of the game if people went out, whether or not they were players, and that this effect would show in the number of people connected at each zone of the city. To test this hypothesis, we used Negative Binomial (NB) regression [7, 13] applied to our dataset at 1-minute intervals during a day. The NB regression model is frequently used to analyze over-dispersed count data, i.e., data whose variance is much larger than its mean, which violates the Poisson model’s assumption that the variance equals the mean [14].
For every minute under study (note that we restricted ourselves to the period from 6 AM until 12 AM), we performed an NB regression using the observed device counts at each zone of the city on all available days in the dataset. The model is specified as follows:
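The motivation for choosing NB over Poisson regression can be illustrated with a quick over-dispersion check. The sketch below is an illustration only: the counts are invented, and the NB2 variance function \(\mu + \alpha\mu^{2}\) is one common parameterization of the NB model, assumed here because the text does not state which one was used.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean. A Poisson variable has a
    ratio close to 1; values far above 1 suggest over-dispersion."""
    return variance(counts) / mean(counts)

def nb2_variance(mu, alpha):
    """Variance of an NB2-distributed count with mean mu and dispersion
    alpha; alpha = 0 recovers the Poisson case (variance equals mean)."""
    return mu + alpha * mu ** 2

# Invented device counts for a handful of zones at one minute of the day.
counts = [12, 15, 9, 140, 11, 230, 14, 8]
ratio = dispersion_ratio(counts)
```

For these toy counts the ratio is far above 1, the situation in which an NB model is usually preferred over a Poisson model.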
$$\log E\bigl[X(t)\bigr] = \log a + \beta_{0} + \beta_{1} \mathit{PoGo} + \beta_{2} \mathit{DayOfWeek} + \beta_{3} \mathit{LandUse} + \beta_{4} \mathit{PokePoints}, $$
where \(E[X(t)]\) is the expected value of the number of active devices within a zone at time t. The PoGo factor is a binary variable that has a value of 0 when the game was not available, and 1 when it was available. The covariates DayOfWeek (with values business_day, Saturday, and Sunday) and LandUse (with values residential, business_only, and mixed_activities) account for the fluctuations in population on different days according to land use. Both factors use dummy coding because they are categorical. The covariate PokéPoints represents the number of PokéStops and PokéGyms, which are proxies for points of interest within an area, accounting for the number of potentially attractive places in each zone. The exposure value a represents the surface area of each zone. Because urbanists designed each zone taking into account population density, transportation infrastructure, and administrative boundaries, the exposure parameter also allowed us to control indirectly for these potential covariates.
The model output allows the following interpretation: the β coefficient assigned to each factor represents the difference of the logarithms of the expected counts in a zone at time t, if all other factors are held equal. For the PoGo factor, \(\beta= \log{\mu_{1}} - \log{\mu_{0}} = \log{\frac{\mu_{1}}{\mu_{0}}}\), so the difference of logarithms equals the logarithm of the ratio between the expected population counts after (\(\mu_{1}\)) and before (\(\mu_{0}\)) the availability of the game. The exponential of this coefficient is defined as the Incidence Rate Ratio, \(\operatorname{IRR}_{\beta}(t) = e^{\beta(t)}\). We built a time-series of \(\operatorname{IRR}_{\beta}(t)\) values for each factor. By analyzing these time-series we determined when, in terms of time-windows within a day, there were significant effects for each factor.
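To make the role of the exposure term and of dummy coding concrete, the following sketch evaluates the model equation for a single zone and minute. All coefficient values are invented, and the choice of business_day and residential as reference levels is an assumption; with dummy coding, each non-reference level of DayOfWeek and LandUse receives its own coefficient.

```python
import math

def expected_devices(area_km2, coef, pogo, day_of_week, land_use, poke_points):
    """Evaluate E[X(t)] = a * exp(linear predictor) for one zone.
    Reference levels (business_day, residential) contribute 0."""
    eta = (coef["intercept"]
           + coef["pogo"] * pogo
           + coef["day_of_week"].get(day_of_week, 0.0)
           + coef["land_use"].get(land_use, 0.0)
           + coef["poke_points"] * poke_points)
    return area_km2 * math.exp(eta)  # log a enters as an offset

# Hypothetical coefficients for one minute of the day.
coef = {
    "intercept": 3.0,
    "pogo": 0.05,
    "day_of_week": {"Saturday": -0.20, "Sunday": -0.35},
    "land_use": {"business_only": 0.40, "mixed_activities": 0.25},
    "poke_points": 0.01,
}
before = expected_devices(1.34, coef, 0, "business_day", "residential", 10)
after = expected_devices(1.34, coef, 1, "business_day", "residential", 10)
```

Because \(\log a\) enters with a fixed coefficient of 1, doubling the zone area doubles the expected count, and the ratio `after / before` equals the exponential of the PoGo coefficient regardless of the other covariates.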
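As a worked illustration of this interpretation, the sketch below converts a series of coefficients into IRR values and flags the minutes where the effect differs from 1. The coefficients and standard errors are invented, and the significance rule (a 95% Wald interval that excludes 1) is an assumption made for the example, since the criterion is not specified here.

```python
import math

def irr_series(betas, std_errors, z=1.96):
    """For each minute, compute IRR = exp(beta) and a Wald interval
    exp(beta -/+ z * se); flag it when the interval excludes 1."""
    rows = []
    for beta, se in zip(betas, std_errors):
        low, high = math.exp(beta - z * se), math.exp(beta + z * se)
        rows.append({
            "irr": math.exp(beta),
            "low": low,
            "high": high,
            "significant": low > 1.0 or high < 1.0,
        })
    return rows

# Hypothetical PoGo coefficients for three consecutive minutes.
rows = irr_series(betas=[0.0, 0.05, -0.02], std_errors=[0.01, 0.01, 0.03])
```

With these toy values, a coefficient of 0.05 gives an IRR of about 1.051, i.e., roughly 5.1% more connected devices than before the launch with all other factors held equal, and it is the only one of the three minutes that is flagged.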