Unveiling patterns of international communities in a global city using mobile phone data

We analyse a large mobile phone activity dataset provided by Telecom Italia for the Telecom Big Data Challenge contest. The dataset reports the international country codes of every call/SMS made and received by mobile phone users in Milan, Italy, between November and December 2013, with a spatial resolution of about 200 meters. We first show that the observed spatial distribution of international codes closely matches the distribution of international communities reported by official statistics, confirming the value of mobile phone data for demographic research. Next, we define an entropy function to measure the heterogeneity of the international phone activity in space and time. By comparing the entropy function to empirical data, we show that it can be used to identify the city’s hotspots, defined by the presence of points of interest. Finally, we use the entropy function to characterize the spatial distribution of international communities in the city. Adopting a topological data analysis approach, we find that international mobile phone users exhibit some robust clustering patterns that correlate with basic socio-economic variables. Our results suggest that mobile phone records can be used in conjunction with topological data analysis tools to study the geography of migrant communities in a global city.


Kendall's Tau Rank Correlation
Given a set of observations $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ of two joint random variables $X$ and $Y$, Kendall's tau rank correlation coefficient $\tau$ quantifies the statistical association between their ranked values. A pair of observations $(x_i, y_i)$, $(x_j, y_j)$ is said to be concordant if $x_i > x_j$ and $y_i > y_j$, or if $x_i < x_j$ and $y_i < y_j$, and discordant otherwise, except for ties ($x_i = x_j$ or $y_i = y_j$), which are neither concordant nor discordant. Kendall's tau is then defined as
$$\tau = \frac{P - Q}{N(N-1)/2},$$
where $P$ is the number of concordant pairs, $Q$ the number of discordant pairs and $N$ the total number of observations. This definition does not take possible ties into account.
In the main text we used a particular variant of Kendall's tau, $\tau_b$, which accounts for ties and is defined as [4]
$$\tau_b = \frac{P - Q}{\sqrt{(P + Q + T)(P + Q + U)}},$$
where $P$ is the number of concordant pairs, $Q$ the number of discordant pairs, $T$ the number of ties only in $x$, and $U$ the number of ties only in $y$. If a tie occurs for the same pair in both $x$ and $y$, it is not added to either $T$ or $U$. The values of $\tau_b$ range from $-1$ (perfect rank inversion) to $1$ (perfect rank agreement).
To produce the $\tau_b$ values reported in the main text, we paired the aggregate call volume $v_c$ and the number of foreign residents $r_c$ for each country $c$ found in the dataset and computed $\tau_b$ for the set of paired observations $(v_c, r_c)$. The same procedure was carried out at different aggregation levels: for the entire city of Milan and for each NIL (Figure 1B in the main text).
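The pair-counting definition of $\tau_b$ above can be illustrated with a minimal Python sketch. It is not the code used to produce the reported values; the function name `kendall_tau_b` and the sample numbers for $v_c$ and $r_c$ are hypothetical and serve only as an illustration.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b by direct pair counting, O(N^2) in the number of observations."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    P = Q = T = U = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue          # tie in both x and y: added to neither T nor U
        elif dx == 0:
            T += 1            # tie only in x
        elif dy == 0:
            U += 1            # tie only in y
        elif dx * dy > 0:
            P += 1            # concordant pair
        else:
            Q += 1            # discordant pair
    denom = sqrt((P + Q + T) * (P + Q + U))
    return (P - Q) / denom if denom else float("nan")

# Hypothetical example: call volume v_c and foreign residents r_c for five countries.
v = [1200, 950, 400, 400, 80]
r = [30000, 21000, 15000, 9000, 9000]
tau_b = kendall_tau_b(v, r)
```

The quadratic loop is adequate for a few hundred countries; for larger inputs a library routine based on sorting would be preferable.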

Persistent homology
The most prominent topological feature amenable to computation is homology [1] and, for real datasets, its robust version, persistent homology [1]. Homology studies the holes of different dimensions in a given manifold. The classical example is the torus. There are two non-equivalent cycles that one can draw on it: one loops around the hole in the middle, the second goes around the hole inside the tube. Finally, there is also a void inside the torus. All of these features (the 1d cycles, the void bounded by the 2d surface, etc.) are captured by homology in terms of the dimension of the homology groups $H_k$ of a simplicial complex that approximates the torus.
In order to proceed, we need to give a few definitions. A $k$-simplex is a finite set of $k+1$ vertices $[p_0, p_1, \ldots, p_k]$ and can intuitively be thought of as the corresponding polyhedron (a 1-simplex is an edge, a 2-simplex a full triangle, a 3-simplex a tetrahedron, and so on). A simplicial complex $K$ is formally defined as a set of simplices $\{\sigma\}$ satisfying two properties: (i) if $\sigma \in K$, all the faces $f$ (the subsets of vertices) of $\sigma$ are in $K$; (ii) for any two simplices $\sigma_1, \sigma_2 \in K$, their intersection is either empty or a face in $K$. Intuitively, this means that a simplicial complex is a family of polygons that glue well together, namely along faces and edges.
One can then consider the chain complex on $K$, a sequence of abelian groups $C_k$ generated by combinations of simplices of the same dimension, $\sum_i a_i \sigma_i$. These groups are called chain groups and come equipped with a boundary operator $\partial_k : C_k \to C_{k-1}$, which sends a chain to a combination of the faces of its constituent simplices according to
$$\partial_k [p_0, \ldots, p_k] = \sum_{i=0}^{k} (-1)^i \, [p_0, \ldots, \hat{p}_i, \ldots, p_k],$$
where $\hat{p}_i$ means that vertex $p_i$ is removed. It is easy to see that $\partial_k \circ \partial_{k+1} = 0$.
It is easy to imagine now why this formulation of homology is problematic when approaching real datasets. Let us assume we have a dataset obtained by sampling a thick circle with uniform probability. The homology of the underlying topological space, the circle, is simple to guess: one connected component ($\beta_0 = 1$) and one 1d cycle ($\beta_1 = 1$), with all the higher homology groups being trivial. However, the problem here is how to go from a set of disconnected points in a metric space to the simplicial complex encoding the properties of the underlying topological space. The most natural approach is through the Rips-Vietoris complex [2]: for a given distance $\epsilon$, form an $(n-1)$-simplex $\sigma = [p_0, \ldots, p_{n-1}]$ for every subset of $n$ points with diameter at most $\epsilon$; it is easy to convince oneself that this construction yields a valid simplicial complex $K_\epsilon$ for all values of $\epsilon$.
Varying $\epsilon$ produces a sequence of different simplicial complexes approximating the underlying circle. In particular, as $\epsilon$ increases between 0 and the diameter of the point cloud $\delta$, we will see many connected components gradually grow and collapse into a single one. Higher order holes appear at some value $\epsilon_{birth}$ and disappear at $\epsilon_{death}$ along the sequence. The central idea of persistent homology is that relevant topological features will persist for large intervals of $\epsilon$ ($\pi = \epsilon_{death} - \epsilon_{birth} \sim \delta$), while irrelevant information (topological noise) will have shorter persistences ($\pi = \epsilon_{death} - \epsilon_{birth} \ll \delta$). In the following these quantities will suffice for our purposes; further technical details on persistent homology and its applications can be found in [3] and references therein.
In this work, we need to characterize the shape of a two-dimensional distribution of points on the plane, a subset of the Milan grid, in terms of its topological invariants. We will therefore focus on two particular derived quantities, $r_0$ and $r_1$:
1. Let $g_0, g_1, g_2, \ldots$ be the generators of the 0-th persistent homology group $H_0$ (corresponding to connected components), ordered such that $\delta = \pi_0 > \pi_1 > \ldots$; then $r_0 = \delta/\pi_1$ describes the spatial coherency of the point cloud.
2. Similarly, for the first persistent homology group $H_1$, with the persistences of its generators ordered in the same way, consider the ratio $r_1 = \pi_0/\pi_1$, which gives a rough measure of the circularity of the point cloud.
These two quantities are those appearing in Figures 7 and 8 of the main text and provide the basis for the classification of the spatial properties of migrant communities defined in terms of their mobile phone activity and entropy.
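As an illustration of how $r_0$ can be obtained in practice, the sketch below uses the fact that, in a Rips-Vietoris filtration, every connected component is born at $\epsilon = 0$ and two components merge when $\epsilon$ reaches the length of the shortest edge joining them, so the finite $H_0$ persistences are the edge lengths of a minimum spanning tree. The function names are hypothetical, the convention that an edge enters the complex at $\epsilon$ equal to its length is an assumption, and this is not necessarily the implementation used for the main text. Computing $r_1$ requires the $H_1$ persistences, for which a dedicated persistent homology library is normally used; that step is not sketched here.

```python
from itertools import combinations
from math import dist

def h0_persistences(points):
    """Finite H_0 persistences of the Rips filtration (MST edge lengths), in descending order."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Kruskal's algorithm: scan edges by increasing length, record each merge.
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)   # one component dies at scale d
    return sorted(deaths, reverse=True)

def spatial_coherency(points):
    """r_0 = delta / pi_1, with delta the diameter of the point cloud."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    delta = max(dist(p, q) for p, q in combinations(points, 2))
    pi_1 = h0_persistences(points)[0]   # longest-lived finite component
    return delta / pi_1

# Hypothetical point clouds (grid-cell coordinates in arbitrary units).
compact = [(0, 0), (1, 0), (2, 0), (3, 0)]
split = [(0, 0), (1, 0), (2, 0), (10, 0)]
r0_compact = spatial_coherency(compact)
r0_split = spatial_coherency(split)
```

Under this convention, a cloud split into two distant groups has a largest gap comparable to its diameter and hence $r_0$ close to 1, while a cloud without large gaps yields a larger $r_0$.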

Sensitivity analysis on the clustering of international communities
The results reported in Section 4 of the main text correspond to the choice $z = -2$. It is however important to check that our conclusions do not significantly change when perturbing the threshold value for the entropy. Hence, we tested the sensitivity under changes of $z$ between $-3$ and $-0.5$. The correspondence between spatial features and socio-economic indicators appears to be rather robust. In particular, we find that the two groups identified show significantly different (at the 5% level) average GDP per capita distributions for all studied values of $z$ (Fig. S3). In the case of the average remittances, we find similar results, although the interval of $z$ values for which the two groups are significantly different is somewhat reduced ($[-2.3, -0.9]$, see Fig. S4).
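The comparison between the two groups relies on a two-sample Kolmogorov-Smirnov (KS) test. The sketch below shows the statistic $D = \sup_x |F_1(x) - F_2(x)|$ together with the standard asymptotic approximation of its p-value. It is an illustration under stated assumptions: the function name and the sample values are hypothetical, and the asymptotic formula may differ from the exact small-sample computation performed by statistical packages.

```python
from math import exp, sqrt

def ks_two_sample(a, b):
    """Two-sample KS statistic D and an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk through the pooled sample and track the gap between the two empirical CDFs.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = d * sqrt(n * m / (n + m))
    if lam < 0.2:
        p = 1.0   # the Kolmogorov tail probability is numerically 1 in this range
    else:
        # Kolmogorov series: Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)
        p = 2.0 * sum((-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam)
                      for k in range(1, 101))
        p = min(max(p, 0.0), 1.0)
    return d, p

# Hypothetical GDP per capita values (arbitrary units) for the two clusters at one value of z.
group_1 = [1.0, 2.0, 3.0, 4.0]
group_2 = [3.0, 4.0, 5.0, 6.0]
d_stat, p_value = ks_two_sample(group_1, group_2)
significant = p_value < 0.05   # 5% significance level, as in the text
```

Repeating this comparison for each threshold $z$ in the scanned range gives one test result per subplot of the sensitivity figures.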

Figure S1: (A) Average daily entropy of mobile phone calls versus number of points of interest in a cell. Red dots show the scatter plot for each cell where at least one point of interest is located. (B) Cumulative distribution of the average daily entropy function measured for the cells without points of interest (blue) and for those with at least one point of interest (red).

Figure S2: The map highlights the locations with an unusual entropy pattern on December 11, 2013. Dots are close to the airport, the stadium and the Niguarda hospital. These locations correspond to the locations most visited by Dutch football supporters, who came to attend the Milan-Ajax match.

Figure S3: Sensitivity for GDP data. The average GDP distributions for the two groups identified by k-means clustering ($k = 2$) on the pairs $(r_0, r_1)$ obtained by varying $z$. The title of each subplot reports the corresponding $z$ value and the results of a two-sample KS test between the two distributions.
The null space of $\partial_k$ is usually referred to as $Z_k$, while the image of $\partial_{k+1}$, that is, the boundaries of chains contained in $C_{k+1}$, is called $B_k$. Homology is interested in the $k$-cycles that are not boundaries of a $(k+1)$-chain, that is, actual holes, and this is mathematically encoded in the homology group as the factor group of these two, $H_k(K) = \ker(\partial_k)/\mathrm{im}(\partial_{k+1}) = Z_k/B_k$. The homology group $H_k$ thus describes the holes bounded by $k$-dimensional chains, e.g. $H_0$ represents connected components, $H_1$ holes bounded by 1d loops, $H_2$ voids bounded by 2d surfaces, and so on. The dimensions of these groups are called Betti numbers, $\beta_k = \dim(H_k)$, and encode the number of non-equivalent holes in a given simplicial complex $K$.
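The relation $\beta_k = \dim Z_k - \dim B_k$ can be checked on small complexes with a few lines of linear algebra. The sketch below is a minimal illustration and assumes rational coefficients, so that dimensions reduce to matrix ranks: $\beta_k = n_k - \mathrm{rank}(\partial_k) - \mathrm{rank}(\partial_{k+1})$, with $n_k$ the number of $k$-simplices. All function names are hypothetical, and simplices are represented as sorted vertex tuples.

```python
from fractions import Fraction

def rank(matrix):
    """Rank over the rationals by Gaussian elimination (exact arithmetic)."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    ncols = len(rows[0]) if rows else 0
    r = 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][c] != 0:
                f = rows[i][c] / rows[r][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def boundary_matrix(k_simplices, faces):
    """Matrix of the boundary operator: columns are k-simplices, rows are (k-1)-simplices."""
    index = {f: i for i, f in enumerate(faces)}
    mat = [[0] * len(k_simplices) for _ in faces]
    for col, s in enumerate(k_simplices):
        for i in range(len(s)):
            face = s[:i] + s[i + 1:]          # remove vertex p_i
            mat[index[face]][col] = (-1) ** i  # alternating sign of the i-th face
    return mat

def betti_numbers(simplices_by_dim):
    """simplices_by_dim[k] lists the k-simplices as sorted tuples of k+1 vertices."""
    top = len(simplices_by_dim)
    ranks = [0] * (top + 1)   # ranks[k] = rank of the k-th boundary operator; ranks[0] = ranks[top] = 0
    for k in range(1, top):
        if simplices_by_dim[k] and simplices_by_dim[k - 1]:
            ranks[k] = rank(boundary_matrix(simplices_by_dim[k], simplices_by_dim[k - 1]))
    return [len(simplices_by_dim[k]) - ranks[k] - ranks[k + 1] for k in range(top)]

# A hollow triangle is a discretised circle: one component and one 1d cycle.
vertices = [(0,), (1,), (2,)]
edges = [(0, 1), (0, 2), (1, 2)]
hollow = betti_numbers([vertices, edges])
# Filling the triangle turns the cycle into a boundary, so the hole disappears.
filled = betti_numbers([vertices, edges, [(0, 1, 2)]])
```

The hollow triangle reproduces the Betti numbers of the circle, $\beta_0 = 1$ and $\beta_1 = 1$, while the filled triangle has $\beta_1 = 0$ because its only 1-cycle is the boundary of the 2-simplex.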