A survey of results on mobile phone datasets analysis

In this paper, we review some recent advances in the study of mobile phone datasets. This area of research emerged a decade ago, with the increasing availability of large-scale anonymized datasets, and has grown into a stand-alone topic. We survey the contributions made so far on the social networks that can be constructed with such data, the study of personal mobility, geographical partitioning, urban planning, and help towards development, as well as security and privacy issues.


Introduction
Just as the Internet was the technological breakthrough of the 's, mobile phones have changed our communication habits in the first decade of the twenty-first century. In a few years, the world coverage of mobile phone subscriptions has risen from % of the world population in  up to % in  -. billion subscribers -corresponding to a penetration of % in the developed world and % in developing countries []. Mobile communication has initiated the decline of landline use -decreasing in both the developing and the developed world since  -and allows people to be connected even in the most remote places of the world.
In short, mobile phones are ubiquitous. In most countries of the developed world, the coverage reaches % of the population, and even in remote villages of developing countries, it is not unusual to cross paths with someone in the street talking on a mobile phone. Due to their ubiquity, mobile phones have stimulated the creativity of scientists to use them as millions of potential sensors of their environment. Mobile phones have been used as distributed seismographs, as motorway traffic sensors, as transmitters of medical imagery or as communication hubs for high-level data such as the reporting of invading species [], to cite only a few of their many side uses.
Besides these applications of voluntary reporting, where users install applications on their mobile phones so that the phones can serve as sensors, the very nature of mobile phones makes them a source of even richer data. The call data records (CDRs), needed by the mobile phone operators for billing purposes, contain an enormous amount of information on how, when, and with whom we communicate.
In the past, research on social interactions between individuals was mostly done through surveys, for which the number of participants typically ranges around , people, and for which the results were biased by the subjectivity of the participants' answers. Mobile phone CDRs, instead, contain information on communications between millions of people at a time, and contain real observations of communications between them rather than self-reported information.
In addition, CDRs contain location data and may be coupled to external data on customers such as age or gender. Such a combination of personal data makes mobile phone CDRs an extremely rich and informative source of data for scientists. The past few years have seen the rise of research based on the analysis of CDRs. First presented as a side topic in network theory, it has now become a field of research in its own right, and has been for a few years the leading topic of NetMob, an international conference on the analysis of mobile phone datasets, of which the fourth edition took place in April . Closely related to this conference, a side topic has now arisen, namely the analysis of mobile phone datasets for the purpose of development. To this end, the telecom company Orange has proposed a challenge named D4D, whose concept is to give a large number of research teams throughout the world access to the same dataset from an African country. Their purpose is to make suggestions for development on the basis of the observations extracted from the mobile phone dataset. The first challenge, conducted in , was such a success that other similar initiatives have followed [, ], and the results of a second D4D challenge were presented at the NetMob conference in April .
Of course, there are restrictions on the availability of some types of data and on the projected applications. First, the content of communications (SMS or phone discussions) is not recorded by the operator, and is thus inaccessible to any third party, except in cases of phone tapping, which are outside the scope of this survey. Secondly, while mobile phone operators have access to all the information filed by their customers and to the CDRs, they may not give the same access to a third party (such as researchers), depending on their own privacy policies and on the privacy protection laws that apply in the country concerned. For example, names and phone numbers are never transmitted to external parties. In some countries, location data, i.e., the base stations at which each call is made, have to remain confidential -some operators are not even allowed to use their own data for private research.
Finally, when a company transmits data to a third party, the transfer comes with non-disclosure agreements (NDAs) and contracts that strongly regulate the authorised research directions, in order to protect the users' privacy.
Recently, with the rise of smartphones, other methods of collecting data that overcome those drawbacks have been designed: projects such as Reality Mining [], OtaSizzle [], or Sensible DTU [] consist in distributing smartphones to individuals who volunteered for the study. Preinstalled software then records data, and these data are further used for research by the team that distributed the smartphones. This new approach overcomes the privacy problems, as participants are clearly informed and consent to their data being used. On the one hand, these projects gather very rich data, as they usually collect more than just call logs: they also record Bluetooth proximity data, application usage, etc. On the other hand, the sample of participants is always much more limited than in the case of CDRs shared by a provider, and the dataset contains information on , participants at most.
Yet, even the smallest bit of information is enough to trigger bursts of new applications, and day after day researchers discover new uses for mobile phone data. The first application of a study of phone logs (not mobile, though) appeared in , with the seminal paper by George Zipf modeling the influence of distance on communication []. Since then, phone logs have been studied in order to infer relationships between the volume of communication and other parameters (see e.g. []), but the appearance of mobile phone data in massive quantities, and of computers and methods that are able to handle those data efficiently, has definitely produced a breakthrough in that domain. Being personal objects, mobile phones make it possible to infer real social networks from their CDRs, while fixed phones are shared by users of the same geographical space (a house, an office). The communications recorded on a mobile phone are thus representative of a part of the social network of one single person, whereas the records of a fixed phone show a superposition of several social actors. By being mobile, a mobile phone has two additional advantages: first, its owner almost always has the possibility to pick up a call, so the communications reflect the temporal patterns of communication in great detail, and second, the positioning data of a mobile phone makes it possible to track the movements of its owner.
Given the large amount of research related to mobile phones, we will focus in this paper on contributions related to the analysis of massive CDR datasets. A chapter of the (unpublished) PhD thesis of Gautier Krings [] gives an overview of the literature on mobile phone datasets analysis. This research area is growing fast and this survey is a significantly expanded version of that chapter, with additional sections and figures and an updated list of references. The paper is organized following the different types of data that may be used in related research. In Section  we will survey the contributions studying the topological properties of the static social network constructed from the calls between users. When information on the position of each node is available, such a network becomes a geographical network, and the relationship between distance and the structure of the network can be analyzed. This will be addressed in Section . Phone calls are always localized in time, and some of them might represent transient relationships while others rather long-lasting interactions. This has led researchers to study these networks as temporal networks, which will be presented in Section . In Section , we will focus on the abundant literature that has been produced on human mobility, made possible by the spatio-temporal information contained in CDR data. As mobile phone networks represent in their essence the transmission of information or more recently data between users, we will cover this topic in Section , with contributions on information diffusion and the spread of mobile phone viruses. Some contributions combine many of these different approaches to use mobile phone data towards many different applications, which will be the object of Section . Finally, in Section  we will consider privacy issues raised by the availability and use of personal data.

Social networks
In its simplest representation, a dataset of people making phone calls to each other is represented by a network where nodes are people and links are drawn between two nodes who call each other. In the first publications related to telecommunications datasets, the datasets were used as an example to demonstrate the potential applications of an algorithm [] or model [] rather than for the purpose of analysis. However, it quickly appeared that the so-called mobile call graphs (MCG) were structurally different from other complex networks, such as the web and internet, and deserved particular attention; see Figure 1 for an example of snowball sampling of a mobile phone network. We will review here the different contributions on network analysis. We will address the construction of a social network from CDR data, which is not a trivial exercise; simple statistical properties of such networks and models that manage to reproduce them; more complex organizing principles; and community structure. Finally, we will discuss the relevance of the analysis of mobile phone networks.

Figure 1 Sample of a mobile phone network, obtained with a snowball sampling. The source node is represented by a square, bulk nodes by a + sign and surface nodes by an empty circle. Figure reproduced from [25].

Construction
While the network construction scheme mentioned above seems relatively simple, there exist many possible interpretations on how to define a link of the network, given a dataset.
The primary aim of social network analysis is to observe social interactions, but not every phone call is made with the same social purpose. Some calls might be for business purposes, some might be accidental calls, some nodes may be call centers that call a large number of people, and all such interactions are present in CDRs. In short, CDRs are noisy datasets. 'Cleaning' operations are usually needed to eliminate some of the accidental edges. For example, Lambiotte et al. [] required, as a condition for a link, that at least one call is made in both directions (reciprocity) and that at least  calls are made in total over  months of the dataset. This filtering operation removed a large fraction of the links of the network, but at the same time, the total weight (the total number of calls made by all users) was reduced by only a small fraction. The threshold of  calls in  months may be questionable, but a stability analysis around this value can confirm that the exact choice of the threshold is not crucial. Similarly, Onnela et al. [] analyzed the differences between the degree distributions of two versions of the same dataset, one containing all calls of the dataset, and the other containing only calls that are reciprocated. Some nodes in the complete network have up to , different neighbors, while in the reciprocated network, the maximal degree is close to . Clearly, in the first case it is hard to imagine that such a node represents a single person, while the latter is a much more realistic bound. However, even if calls have been reciprocated, the question of setting a meaningful weight on each link is far from easy. Li et al. suggest another, more statistical approach in [], and use multiple hypothesis testing to filter out the links that appeared randomly in the network and that therefore do not reflect a true social relationship.
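A reciprocity-and-threshold filter of the kind described above can be sketched as follows. This is only an illustration, not the procedure of any cited study: the record layout (one `(caller, callee)` pair per call) and the names `build_filtered_network` and `min_calls` are assumptions made for the example, and the threshold value is left to the analyst.

```python
from collections import Counter


def build_filtered_network(calls, min_calls):
    """Return undirected weighted links that pass a reciprocity and volume filter.

    calls: iterable of (caller, callee) pairs, one per call (hypothetical layout).
    min_calls: minimum total number of calls, both directions combined.
    """
    directed = Counter()
    for caller, callee in calls:
        if caller != callee:  # ignore self-calls
            directed[(caller, callee)] += 1

    links = {}
    for (a, b), n_ab in directed.items():
        n_ba = directed.get((b, a), 0)
        if n_ba == 0:
            continue  # never reciprocated: drop the link
        total = n_ab + n_ba
        if total >= min_calls:
            # undirected link weighted by the calls made in both directions
            links[frozenset((a, b))] = total
    return links


# Toy example: only the pair (a, b) is both reciprocated and frequent enough.
calls = [("a", "b"), ("b", "a"), ("a", "b"), ("a", "c"), ("c", "d"), ("d", "c")]
links = build_filtered_network(calls, min_calls=3)
```

Rerunning the filter for several values of `min_calls` and comparing the fraction of links and of total weight retained is one simple way to carry out the kind of stability analysis mentioned above.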
Beyond these considerations of which calls or texts are representative of a true relationship, additional corruption of the data can arise from multiple calls, multiple text messages or calls reaching an answering machine. Depending on the context, it may be preferable to filter out these communications, but they remain, most of the time, difficult to identify in the datasets. Moreover, additional biases can arise from the pricing plans of the operator, which may offer a preferential price for SMSs or for voice calls, thus influencing the behavior of the users who opted for such a pricing plan [].
It is sometimes convenient to represent a mobile call network by an undirected network, arguing that communication during a single phone call goes both ways, and to set the weight of the link as the sum of the weights from both directions. However, who initiates the call might be important in contexts other than the passing of information, depending on the aim of the research, and Kovanen et al. have shown that reciprocal calls are often strongly imbalanced []. In the interacting pair, one user often initiates most of the calls, so how can this be represented in an undirected network by a representative link weight? In a closely related question, most CDRs contain information on both voice calls and text messages, but so far it is not clear how to incorporate both pieces of information into one simple measure. Moreover, there seems to be a generational difference in the use of text messages, or in the preference between texts and voice calls, which may introduce a bias in measures that only take one type of communication into account [].
Besides these considerations on the treatment of noise, the way to represent social ties may vary as well: they may be binary, weighted, symmetric or directed. Different answers to such decisions lead to different network characteristics, and result in diverse possible interpretations of the same dataset. For example, Nanavati et al. [] keep their network as a directed network, in order to obtain information on the strongly connected component of the network, while Onnela et al. [] rather focus on an undirected network, weighted by the sum of calls going in both directions. A few different options for definitions of link weights and measures on nodes are given in Table .
It is close to impossible to define unified construction rules that work for any dataset, given the many sources of variance between two sets of CDRs. Besides the cases mentioned above, let us cite differences in social behaviors between inhabitants of different countries or differences in the use of uncaptured technologies such as email, landline or messaging. The construction of a social network from CDRs should always be of primary importance for the researcher, bearing in mind that there is no 'one size fits all' technique available.

Topological properties
The simplest information one can get out of CDRs is statistical information on the number of acquaintances of a node, on the local density of the network or on its connectivity. Like social networks, mobile call graphs differ from random networks and lattices by their broad degree distribution [], their small diameter and their high clustering []. The level of clustering of a graph G is measured by its clustering coefficient, defined as the proportion of closed triangles among the connected triplets of nodes in G:

$$C = \frac{3 \times \text{number of triangles}}{\text{number of connected triplets}}.$$

This coefficient takes values in the interval [0, 1], and is found to be typically high in social networks and mobile call graphs. An alternative is to take the average of a local measure of the clustering around a given node i, defined as:

$$C_i = \frac{2 t_i}{k_i (k_i - 1)},$$

where $t_i$ is the number of triangles around node i and $k_i$ is node i's degree, and hence $k_i (k_i - 1)/2$ is the maximum number of possible triangles around node i. The diameter of a graph G measures the greatest distance (in terms of number of edges) between any two vertices, and is typically small in social networks.
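To make the two clustering measures concrete, the sketch below computes them on a small undirected graph stored as a dictionary of neighbor sets. The function names and the graph representation are illustrative assumptions; the local coefficient is taken as 2 t_i / (k_i (k_i - 1)), with t_i the number of triangles around node i, and is set to 0 by convention for nodes of degree below 2.

```python
from itertools import combinations


def local_clustering(adj, i):
    """Local clustering 2 t_i / (k_i (k_i - 1)); 0.0 by convention if k_i < 2."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # t_i: pairs of neighbors of i that are themselves linked
    t = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * t / (k * (k - 1))


def global_clustering(adj):
    """3 x (number of triangles) / (number of connected triplets)."""
    closed = 0    # every triangle is counted once per corner, i.e. 3 times
    triplets = 0  # connected triplets, counted at their central node
    for i, neighbors in adj.items():
        k = len(neighbors)
        triplets += k * (k - 1) // 2
        closed += sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return closed / triplets if triplets else 0.0


# Toy graph: a triangle (a, b, c) with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
c_a = local_clustering(adj, "a")
c_c = local_clustering(adj, "c")
c_global = global_clustering(adj)
```

On this toy graph there is one triangle and five connected triplets, so the global coefficient is 3/5, while the average of the local values is different; the two definitions need not agree on heterogeneous networks.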
As for the degree distribution, while all analyzed datasets present similar general shapes, their range and their fine shape differ due to differences between the datasets, the construction scheme, the size, or the time span of the collection period.
In one of the first studies involving CDR data, Aiello et al. [] observed a power law degree distribution, which was well explained by a massive random graph model P(α, β) described by its power-law degree distribution $p(d = x) = e^{\alpha} x^{-\beta}$. Power-law distributions have often been observed in empirical datasets, but characterizing their parameters and determining whether the data really corresponds to a power-law distribution is not an easy question, as presented by Clauset et al. [].
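As an illustration of why parameter estimation deserves care, the sketch below estimates the exponent β of a power-law tail by maximum likelihood instead of by fitting a line on a log-log plot. It uses the standard continuous-data estimator, β̂ = 1 + n / Σ ln(x_i / x_min), which is only an approximation for integer degrees; the lower cutoff `x_min` is assumed to be known, no goodness-of-fit test is performed, and the function names are made up for the example.

```python
import math
import random


def powerlaw_exponent_mle(values, x_min):
    """Continuous maximum-likelihood estimate of beta for p(x) ~ x^(-beta), x >= x_min."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if not tail or log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    n = len(tail)
    beta_hat = 1.0 + n / log_sum
    std_err = (beta_hat - 1.0) / math.sqrt(n)  # leading-order standard error
    return beta_hat, std_err


def sample_powerlaw(beta, x_min, n, rng):
    """Inverse-transform sampling from a continuous power law with exponent beta > 1."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (beta - 1.0)) for _ in range(n)]


# Synthetic check: recover a known exponent from simulated data.
rng = random.Random(0)
data = sample_powerlaw(beta=2.5, x_min=1.0, n=20000, rng=rng)
beta_hat, std_err = powerlaw_exponent_mle(data, x_min=1.0)
```

In practice the cutoff itself has to be estimated, and alternative heavy-tailed distributions should be compared before concluding that a degree distribution follows a power law.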
Random graph models have often been used in order to model networks, and manage to reproduce some observations from real-world networks, such as the small diameter and the presence of a giant component, as observed in mobile datasets. However, they fail to capture more complex features, such as degree-degree correlations. Nanavati et al. [] observed in the study of  mobile datasets that besides the power-law tail of the degree distribution, the degree of a node is strongly correlated with the degree of its neighbors.
Characterizing the exact shape of the degree distribution is not an easy task, which has been the focus of a study by Seshadri et al. []. They observed that the degree distribution of their data can be fitted with a Double Pareto Log Normal (DPLN) distribution, two power-laws joined by a hyperbolic segment -which can be related to a model of social wealth acquisition ruled by a lognormal multiplicative process. Those different degree distributions are depicted on Figure     the network. The authors also observed that the degree and weight distributions become stationary after a few days and a few weeks respectively. The placement of the time window has the most influence for short time windows, and the effect depends mostly on whether the window contains holiday periods or weekends, during which the behavioral patterns have been shown to be significantly different from those of normal weekdays.
What information do we get from these distributions? They mostly reflect the heterogeneity of communication behaviors, a common feature of complex networks []. The fat tail of the degree distribution is responsible for large statistical fluctuations around the average, indicating that there is no particular scale representative of the system. The majority of users have a small number of contacts, while a tiny fraction of nodes are hubs, or super-connectors. However, it is not clear whether these hubs represent truly popular users or are artefacts of noise in the data, as was observed by Onnela et al. [] in their comparison of the reciprocated and non-reciprocated networks.
The heterogeneity of degrees is also observed on node strengths and link weight, which is also to be expected for social networks. All studies also mention high clustering coefficient, which indicates that the nodes arrange themselves locally in well-organized structures. We will address this topic in more detail further.
However large the datasets studied may be, one may still question the significance of the measures presented above. The data studied always cover only a limited (yet significant) sample of the population, and it is very difficult to examine whether this sample is biased or not without additional information on the population of the country and the users of the mobile provider considered. We will discuss this topic further in Section , but the topological properties presented in this section, such as the degree distributions, should be analyzed with care, as one may only expect qualitatively close results for similar datasets on similar populations. If the sample of the population studied is biased, and one only observes people that share specific characteristics (such as age, gender, or a specific profile), then topological properties might be very different from those characterizing the mobile phone networks presented in the above paragraph. We have discussed the impact of the size of the time window of observation on the results. Most studies try to infer general trends characterizing the network of acquaintances of a population, based on observation during only a finite time window. The ideal model capturing the network of people would be based on a very long period of observation of interactions between people of the whole population around the globe. This is of course impossible to achieve, and both the limited time window of observation and the limited population sample introduce inherent biases in the results, even though these biases are almost impossible to characterize and quantify.

Advanced network characteristics
Beyond statistical distributions, more complex analyses provide a better understanding of the structure of our communication networks. The heterogeneity of link weights deserves particular attention. Strong links represent intense relationships, hence the correlation between weight and topology is of primary interest. Recalling that mobile call graphs show high clustering coefficient, and thus are locally dense, one can differentiate links based on their position in the network.
The overlap of a link, introduced in [] (and illustrated in Figure ), is an appropriate measure which characterizes the position of a link as the ratio of observed common neighbors $n_{ij}$ over the maximal possible number, depending on the degrees $k_i$ and $k_j$ of the nodes, and defined as:

$$O_{ij} = \frac{n_{ij}}{(k_i - 1) + (k_j - 1) - n_{ij}}.$$

The authors show that link weight and topology are strongly correlated, the strongest links lying inside dense structures of the network, while weaker links act as connectors between these densely organized groups. This finding has an important consequence for processes such as link percolation or the spread of information on networks, since the weak ties act as bridges between disconnected dense parts of the network, illustrating Granovetter's hypothesis on the strength of weak ties []. The structure of the dense subparts of the network provides essential information on the self-organizing principles lying behind communication behaviors. Before moving to the analysis of communities, we will focus on properties of cliques.

Figure  (Left) The overlap of a link is defined as the ratio between the common neighbors of both nodes and the maximum possible common neighbors. Here, the overlap is given for the green link. (Right) The average overlap increases with the cumulative weight in the real network (blue circles) and is constant in the random reference where link weights are shuffled (red squares). The overlap also decreases with the cumulative betweenness centrality $P_{cum}(b)$ (black diamonds). Figure reproduced from [14].

The structure of cliques is reflected by how weights are distributed among their links. In a group where everyone talks to everyone, is communication balanced? Or are small subgroups observable? A simple measure to analyze the balance of weights is the measure of coherence q(g). This measure was introduced in [] before its application to mobile phone data in [], and is calculated as the ratio between the geometric mean of the link weights and the arithmetic mean,

$$q(g) = \frac{\left( \prod_{ij \in l_g} w_{ij} \right)^{1/|l_g|}}{\frac{1}{|l_g|} \sum_{ij \in l_g} w_{ij}},$$

where g is a subgraph of the network, $l_g$ is its set of links and $w_{ij}$ is the weight of link ij. This measure takes values in the range ]0, 1], 1 corresponding to equilibrium. On average, cliques appear to be more coherent than what would be expected in the random case, in particular for triangles, which show high coherence values. On a related topic, Du et al. [] focused instead on the propensity of nodes to participate in cliques, and in particular on the balance of link weights inside triangles. Their observations differ slightly from those of Onnela et al.: on average, the weights of links in triangles can be expressed as powers of one another. The authors managed to reproduce this singular situation with a utility-driven model, where users try to maximize their return from contacts.
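Both measures are easy to compute once the network is stored as neighbor sets and link weights. The sketch below assumes the usual forms of the two quantities, namely an overlap of n_ij / ((k_i - 1) + (k_j - 1) - n_ij) and a coherence equal to the geometric mean of the link weights divided by their arithmetic mean; the function names and the data layout are illustrative assumptions.

```python
import math


def overlap(adj, i, j):
    """Overlap of an existing link (i, j): shared neighbors over the maximum possible."""
    n_ij = len(adj[i] & adj[j])
    denom = (len(adj[i]) - 1) + (len(adj[j]) - 1) - n_ij
    # denom is 0 when i and j have no neighbor besides each other
    return n_ij / denom if denom > 0 else 0.0


def coherence(weights):
    """Geometric mean of the link weights divided by their arithmetic mean."""
    weights = list(weights)
    if not weights or min(weights) <= 0:
        raise ValueError("weights must be non-empty and strictly positive")
    geometric = math.exp(sum(math.log(w) for w in weights) / len(weights))
    arithmetic = sum(weights) / len(weights)
    return geometric / arithmetic


# Toy graph: a triangle (a, b, c) with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
o_ab = overlap(adj, "a", "b")      # link inside the triangle
o_cd = overlap(adj, "c", "d")      # bridge-like link to the pendant node
q_balanced = coherence([2, 2, 2])  # triangle with equal weights
q_skewed = coherence([1, 1, 8])    # triangle dominated by one link
```

The link inside the triangle gets the highest possible overlap and the link to the pendant node gets zero, which mirrors the distinction between links inside dense groups and links acting as connectors.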

Communities
The previous analysis of cliques and triangles opens the way for an analysis of more complex structures, such as communities in mobile phone networks. The analysis of communities provides information on how communication networks are organized at large scale. In conjunction with external data, such as age, gender or cultural differences, it provides sociological information on how acquaintances are distributed over the population. From a corporate point of view, the knowledge of well-connected structures is of primary importance for marketing purposes. In this section, we will only address simple results on community analysis, but this topic will be addressed again later in the document, when it relates to geographic dispersal of networks or dynamical networks.
At small scale, traditional clustering techniques may be applied, see [] and [] for examples of applications on small datasets. However, on large mobile call graphs involving millions of users, such clustering techniques are outperformed by community detection algorithms.
Uncovering the community structure in a mobile phone network is highly dependent on the definition of communities and the detection method used. One could argue that there exist as many plausible analyses as there are community detection methods. Moreover, the particular structure of mobile call graphs induces some issues for traditional community detection methods. Tibely et al. [] show that even though some community detection methods perform well on benchmark networks, they do not produce clear community structures on mobile call graphs. Mobile call graphs contain many small tree-like structures, which are badly handled by most community detection methods. A comparison of three well-known methods, the Louvain method [], Infomap [] and the Clique Percolation method [], shows that they produce different results on mobile call graphs. The Louvain method and Infomap both build a partition of the nodes of the network, so that every node belongs to exactly one community. In contrast, Clique Percolation only keeps dense subparts of the network as communities (see Figure ).
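Part of the reason why methods disagree is that each one optimizes its own quality criterion. The Louvain method, for instance, searches greedily for a partition of high modularity. The sketch below does not implement Louvain itself; it only evaluates the modularity Q of a given node partition of an unweighted, undirected graph, using an assumed dictionary-of-sets representation and made-up function and variable names.

```python
def modularity(adj, community):
    """Modularity Q of a node partition of an unweighted, undirected graph.

    adj: dict mapping node -> set of neighbors (no self-loops).
    community: dict mapping node -> community label.
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())  # twice the number of links
    if two_m == 0:
        return 0.0
    internal = {}  # twice the number of links inside each community
    degree = {}    # sum of node degrees in each community
    for node, nbrs in adj.items():
        c = community[node]
        degree[c] = degree.get(c, 0) + len(nbrs)
        internal[c] = internal.get(c, 0) + sum(1 for v in nbrs if community[v] == c)
    # Q = sum over communities of (fraction of internal links - expected fraction)
    return sum(internal[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)


# Toy graph: two triangles (1, 2, 3) and (4, 5, 6) joined by the link 3-4.
adj = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
q_split = modularity(adj, {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"})
q_single = modularity(adj, {n: "A" for n in adj})
```

Splitting the toy graph into its two triangles scores higher than keeping all nodes together, which is the kind of comparison a modularity-based method performs repeatedly; methods built on other criteria can rank the same partitions differently.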
As observed in Tibely et al., the small tree-like structures are often considered as communities, although their structure is sparse. Such a result is counter-intuitive given the intrinsic meaning of communities and raises the question: is community detection hence unusable on mobile call graphs? The results should probably be considered with caution, but as this is always the case for community detection methods, whatever network is used, this special character of communities in mobile call graphs appears to be a particularity rather than a problem. Although they might have singular shapes, communities can provide significant information when usefully combined with external information. This is demonstrated by the study of the linguistic distribution of communities in a Belgian mobile call graph [], where the communities returned by the Louvain method strikingly show a well-known linguistic split, as illustrated in Figure .
The notion of communities in social networks, such as rendered by mobile phone networks, has raised a debate on the exact vision one has of what a community is and what it is not. In particular, several authors have favored the idea of overlapping communities, such that one node may belong to several communities, in opposition with the classical vision that communities are a partition of the nodes of a network. An argument in favor of this vision is that one is most often part of several groups of acquaintances who do not share common interests, such as family, work and sports activities. In [], Ahn et al. show how overlapping communities can be detected by partitioning edges rather than nodes, and illustrated their methods with a mobile phone dataset. For each node, they had additional information about its center of activities, with which they showed that communities were geographically consistent. This discussion of the exact definition of communities and of the best method to detect them can further be influenced by a series of factors. Indeed, one could want to introduce additional information before searching for communities such as, for example, age, gender or specific profiles of people. Moreover, when spatio-temporal information is available, one could want to detect strong communities whose links remain active through time, or detect which geographic areas belong to the same community thus partitioning space, as we will explain further in this paper. Finally, as mentioned in the previous paragraph, if the sample of the population that is available is biased, the structure of the network, and hence also the detection of communities may be influenced. All these additional considerations might significantly change the result of the detection of communities, including their internal topology.

Social analysis
The use of mobile call data for the purpose of analyzing social relationships raises two questions. First, how faithfully does such a dataset reflect real interactions? Second, can we extract information on the users themselves from their calling behavior?
It has often been claimed that mobile phone data analysis is a significant advance for social sciences, since it allows scientists to use massive datasets containing the activity of entire populations. The study of mobile phone datasets is part of an emerging field known as computational social science []. These massive datasets, it is said, are free from the bias of self-reporting, which is that the answers to a survey are usually biased by the subject's own perception, which is not objective. Still, the question remains: how much does self-reporting differ from our real behavior, and what is the exact added value of having location data? This has been studied by Eagle et al. [] in the well-known Reality Mining project. By studying the behavior of about  persons both by recording their movements and encounters using GSM and Bluetooth technology and with the use of surveys, they managed to quantify the difference between self-reported behavior and what could be observed. It appears that observed behavior strongly differs from what has been self-reported, confirming that the subjectivity of the subjects' own perception produces a significant bias in surveys. In contrast, collected data makes it possible to reduce this bias significantly. However, mobile phone data introduce a different bias, namely, that they only contain social contacts that were expressed through phone calls, thus missing out all other types of social interactions [].
While most studies use external data as a validation tool to confirm the validity of results, Blumenstock et al. briefly addressed a different question, namely whether it is possible to infer information on people's social class based on their communication behavior. Apparently, this task is hard to perform, even if significant differences appear in calling behavior between different classes of the population []. While inferring information about users from their calling activity still seems difficult, many studies show strong correlations between calling behavior and other information included in some datasets, such as gender or age. In a study on landline use, Smoreda et al. highlight the differences in the use of the domestic telephone based on the genders of both the caller and the callee [], and show not only that women call more often than men but also that the gender of the callee has more influence than the gender of the caller on the duration of the call. Those same trends have also been observed in later studies of mobile phone datasets []. Going beyond the observation of gender differences in mobile phone use, Frias-Martinez et al. propose a method to infer the gender of a user based on several variables extracted from mobile phone activity [], and achieve a success rate of prediction between % and % on a dataset of a developing economy. In a later study on data from Rwanda, Blumenstock et al. show that differences of social class induce more striking differences in mobile phone use than differences of gender [].
Beyond analyzing the nodes of a network, Chawla et al. take a closer look at its links, and introduce a measure of reciprocity to quantify how balanced the relationship between two users is []:
$$R_{ij} = \left| \ln(p_{ij}) - \ln(p_{ji}) \right|,$$
where $p_{ij}$ is the probability that if $i$ makes a communication, it will be directed towards $j$.
They also test this measure on a mobile communications dataset, and show that there are very large degrees of non-reciprocity, far above what could be expected if only balanced relationships were kept. Going one step further, instead of inferring information on the nodes of the mobile calling graph, Motahari et al. study the difference in calling behavior depending on the relationship between two subscribers, characterizing different types of links. They show that the links within a family generate the highest number of calls, and that the network topology around those links looks significantly different from the topology of a network of utility communications [].
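As a rough illustration of how such a reciprocity measure could be computed from a call log, the sketch below estimates $p_{ij}$ as the fraction of $i$'s outgoing communications directed to $j$, and assumes that reciprocity is measured as the absolute difference of log-probabilities, $|\ln p_{ij} - \ln p_{ji}|$ (0 for a perfectly balanced link). The function names and the toy call list are hypothetical.

```python
import math
from collections import Counter, defaultdict


def call_probabilities(calls):
    """Estimate p_ij as the share of i's outgoing calls that go to j.

    calls: iterable of (caller, callee) pairs, one entry per communication.
    """
    out_counts = defaultdict(Counter)
    for caller, callee in calls:
        out_counts[caller][callee] += 1
    probs = {}
    for caller, counter in out_counts.items():
        total = sum(counter.values())
        for callee, n in counter.items():
            probs[(caller, callee)] = n / total
    return probs


def reciprocity(probs, i, j):
    """Assumed form |ln p_ij - ln p_ji|; returns None if a direction is unobserved."""
    p_ij = probs.get((i, j))
    p_ji = probs.get((j, i))
    if p_ij is None or p_ji is None:
        return None
    return abs(math.log(p_ij) - math.log(p_ji))


# Toy call log (hypothetical data).
calls = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "a"),
         ("c", "a"), ("c", "b"), ("c", "b"), ("c", "b")]
probs = call_probabilities(calls)
r_ab = reciprocity(probs, "a", "b")  # a sends 2/3 of its calls to b, b sends all to a
```

Links observed in only one direction are returned as `None` here rather than as an infinite value; how to treat them is a modelling choice.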
If we can infer so much information from looking at mobile phone communications, would it be possible to predict existing acquaintances that are unobserved in the dataset available? This question is known as the link prediction problem, and has been addressed by several research teams. As the approach usually takes into account the time component of the dataset, we will address this topic further in Section .

Adding space - geographical networks
Besides basic CDR data, geographic information about the nodes is sometimes available, such as the home location (available for billing purposes) or the most often used antenna. This makes it possible to assign each node to one geographic point, and to study the interplay between geography and mobile phone usage. Studies on geographical networks have already been performed on a range of different types of networks []. One very basic application is to use mobile phone data to estimate the population density in the different regions covered by the dataset. Deville et al. explored this idea []: using the number of people calling from each antenna, they are able to produce timely estimates of the population density in France and Portugal. In the developing world, census data is often very costly or even impossible to obtain, and existing data is often outdated. Using CDRs can then provide very useful and up-to-date information on the actual population density in remote parts of the world. Another example is given by

Relationship space-communication
Lambiotte et al. [], investigated the interplay between geography and communications, and assigned each of the . million users from a Belgian mobile phone operator to the ZIP code location where they were billed. By approximating the position of the users to the center of each ZIP code area, they showed that the probability of two users to be connected decreases with the distance r separating them, following a power law of exponent -. The probability of a link to be part of a triangle decreases with distance, until a threshold distance of  km, after which the probability is constant. Interestingly, this threshold of  km is also a saturation point for the average duration of a call (see Figure ). A different study on the same dataset also showed that total communication duration between communes in Belgium was well fitted by a gravity law, showing positive linear contribution of the number of users in each commune and negative quadratic influence of distance [, ]: where l ab represents the total communication between communes a and b, c a and c b the number of customers in each commune and r ab the distance that separates them. While it seems sure that distance has a negative impact on communication, its exact influence is not unique. Onnela et al. [] observed in a different dataset a probability of connection decreasing as r -. rather than the gravity model observed by Lambiotte et al., and a later study on Ivory Coast by Bucicovschi et al. [] observe that the total duration of communication between two cities decays with r -/ . However, these differences might be explained by the differences that exist between the studied countries, such as the dis-tribution of the population density. A different study on mobility data from the locationbased service Foursquare [] levelled those variations using a rank-based distance [], which could also be helpful in this case. Another comparison is presented by Carolan et al.
[] who compare two different types of distance, namely the spatial travel distance and the travel time taken to link two cities. Interestingly, it appears that the use of the spatial distance rather than the time taken gives a better fit of the number of communications between two cities with the gravity model. Their observations also show that the gravity model fits the data better when data is collected during the daytime on weekdays than during evenings and weekends.
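To make the comparison between distance exponents concrete, the following sketch fits a gravity-type relation of the form $l_{ab} \approx G\, c_a c_b / r_{ab}^{\beta}$ by ordinary least squares in log-log space. This is only one simple way of estimating the exponent, not necessarily the procedure used in the cited studies; the function name, the constant $G$ and the synthetic data are assumptions made for illustration.

```python
import math


def fit_distance_exponent(flows):
    """Fit l_ab ~ G * c_a * c_b / r_ab**beta by least squares on logarithms.

    flows: iterable of (l_ab, c_a, c_b, r_ab) tuples with strictly positive values.
    Returns the pair (G, beta).
    """
    xs, ys = [], []
    for l_ab, c_a, c_b, r_ab in flows:
        xs.append(math.log(r_ab))
        # Dividing by c_a * c_b isolates the dependence on distance.
        ys.append(math.log(l_ab / (c_a * c_b)))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic flows generated from an exact inverse-square law (hypothetical data).
pairs = [(100, 200, 5.0), (300, 50, 12.0), (80, 80, 30.0), (500, 20, 70.0)]
synthetic = [(0.5 * ca * cb / r ** 2, ca, cb, r) for ca, cb, r in pairs]
G, beta = fit_distance_exponent(synthetic)  # recovers G close to 0.5 and beta close to 2
```

On real data, the fitted exponent depends on how distances are binned and on which pairs with zero communication are excluded, which may contribute to the differences reported between datasets.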
Instead of studying the communication between cities, Schläpfer et al. looked at the relationship between city size and the structure of local networks of people living in those cities []. They show that the number of contacts and communication activity both grow with city size, but that the probability of being friends with a friend's friend remains the same independently of the city size. Jo et al. propose another approach and study how the distance between a person and the person with whom they have the most contacts evolves with age []. They thus show that young couples tend to live farther apart than old couples.
Instead of only taking into account the distance between two places to predict the number of links between them, Herrera-Yagüe et al. make another hypothesis, namely that the probability that someone living in a location $i$ has contacts with a person living in another location $j$ is inversely proportional to the total population within an ellipse []. The ellipse is defined as the one whose foci are $i$ and $j$, and whose surface is the smallest such that both circles of radius $r_{ij}$ centred on $i$ and $j$ are contained in the ellipse. If we name $e_{ij}$ the total population within the ellipse, the number of contacts between locations $i$ and $j$ is thus described by
$$K \frac{n_i n_j}{e_{ij}},$$
where $K$ is a normalisation parameter depending on the total number of relationships to predict, and $n_i$ and $n_j$ are the populations of locations $i$ and $j$ respectively. Further, Onnela et al. also studied the geographic structure of communities, and showed on the one hand that nodes that are topologically central inside a community may not be central from a geographical point of view, and on the other hand that the geographical shape of communities varies with their size. Communities smaller than  individuals show a smooth increase of geographical span with size, but the span jumps suddenly at the size of , which could not be clearly explained by the authors (see Figure ).

Geographic partitioning
The ability to place customers in higher-level entities, such as communes or counties, gave researchers the idea of drawing the 'social borders' inside a country based on the interactions between those entities []. Individual call patterns of users are aggregated at a higher level into a network of entities, which can in turn be partitioned into a set of communities based on the intensities of calls between the nodes of this macroscopic network. It is important to notice that, in contrast with the microscopic network (the network of users), the macroscopic network is not sparse at all. Since the nodes represent the aggregated behavior of many users, there is a high chance of having a link between most pairs of communes or counties. Hence, the weights on the links of the macroscopic network are of crucial importance, since they define the complete structure of the network. Such a partitioning exercise using CDR datasets has been applied, among others, to Belgium and Ivory Coast [, ]. An initial study of the communities in Belgium [] used the Louvain method, optimizing modularity for weighted directed networks, to partition the Belgian communes based on two link weights: the frequency of calls between two communes and the average duration of a call. The obtained partitions were geographically connected, and the influence of distance, of influential cities, and of the cultural barrier of language was observable in the optimal partitions.
Given that the intensity of communication between two cities can be well modeled by a gravity law, Expert et al. [] proposed to replace the null model of Newman's modularity by a more appropriate one that makes use of the available geographic information. The spatial modularity (SPA) compares the intensity of communication between communes to a null model influenced both by the sizes $c_a$ and $c_b$ of the communes and by the distance that separates them:
$$p_{ab} = c_a c_b f(r_{ab}).$$
The influence of distance is estimated from the data by a function $f$, which is calculated for distance bins $[r - \epsilon, r + \epsilon]$ as
$$f(r) = \frac{\sum_{a,b \,:\, r_{ab} \in [r-\epsilon,\, r+\epsilon]} l_{ab}}{\sum_{a,b \,:\, r_{ab} \in [r-\epsilon,\, r+\epsilon]} c_a c_b}.$$
Using their null model, the authors obtained an almost perfect bipartition of the Belgian communes which renders the Belgian linguistic border. Moreover, they showed with a simple example that such a null model makes it possible to remove the influence of geography and to obtain communities showing geography-independent features. On a related topic, Ratti et al. used a spectral modularity optimization algorithm to partition the map of Great Britain [] based on phone calls between geographic locations. Similarly to the results obtained on Belgium, they obtained, after fine tuning of their algorithm, spatially connected communities which correspond to meaningful areas, such as Scotland or Greater London (see Figure ). A stability analysis of the obtained partition showed that while some variation appears on the boundaries of communities, the obtained communities remain geographically centered at the same place. The intersections between several results of the same algorithm showed  spatially well-defined 'cores' corresponding to densely populated areas of Great Britain. Interestingly, the map of the cores loosely corresponds to the historical British regions.
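A distance-dependent null model of this kind can be sketched as follows, under the assumption that the expected intensity between two communes is the product of their sizes multiplied by a distance function estimated per bin as the ratio of observed intensity to the sum of size products. For simplicity the sketch uses fixed-width bins starting at zero rather than bins centered on each distance; all names and data are hypothetical.

```python
from collections import defaultdict


def estimate_distance_function(flows, bin_width):
    """Estimate f per distance bin as (sum of observed l_ab) / (sum of c_a * c_b).

    flows: iterable of (l_ab, c_a, c_b, r_ab) tuples.
    Returns a dict mapping bin index -> estimated value of f.
    """
    observed = defaultdict(float)
    size_products = defaultdict(float)
    for l_ab, c_a, c_b, r_ab in flows:
        b = int(r_ab // bin_width)
        observed[b] += l_ab
        size_products[b] += c_a * c_b
    return {b: observed[b] / size_products[b] for b in observed}


def null_model_intensity(c_a, c_b, r_ab, f, bin_width):
    """Expected intensity c_a * c_b * f(r_ab); 0 for bins never seen in the data."""
    b = int(r_ab // bin_width)
    return c_a * c_b * f.get(b, 0.0)


# Toy data (hypothetical): two short-range pairs and one long-range pair.
flows = [(10.0, 2, 5, 3.0), (10.0, 3, 10, 4.0), (1.5, 2, 3, 12.0)]
f = estimate_distance_function(flows, bin_width=10.0)
expected = null_model_intensity(4, 5, 15.0, f, bin_width=10.0)
```

By construction, the expected intensities summed over all pairs of a bin equal the observed total in that bin, so the null model preserves the overall decay of communication with distance.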
A later study using antenna-to-antenna communication volumes in Ivory Coast confirmed the very strong influence of language on the formation of communities in a large country. Using the same method as was used by Blondel et al. for the Belgian dataset, the authors show that the borders of the communities formed in Ivory Coast strongly correlate with the language borders, even in the presence of many more than two language groups [].
Going a bit further, Blumenstock et al. introduce measures of the social and spatial segregation that can be observed through mobile phone communication records []. They define the spatial segregation as the proportion of people of ethnicity $t$ in a region $r$,
$$\frac{N_r^t}{N_r},$$
where $N_r^t$ is the number of people of ethnicity $t$ in region $r$ and $N_r$ is the total population of region $r$. They also define the social segregation of ethnicity $t$ as the fraction of contacts that individuals of ethnicity $t$ form with people of the same ethnicity,
$$\frac{s_t}{s_t + d_t},$$
where $s_t$ is the number of contacts that people of type $t$ have with people of the same ethnicity, and $d_t$ is the number of contacts that people of type $t$ have with people of other ethnicities. With these measures, it is then possible to map the more or less segregated parts of a city, see which ethnicities occupy which regions, and show how strong or weak the links between these ethnicities are.
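Both segregation measures reduce to simple ratios once every user has been assigned a group. The sketch below assumes that contacts are listed from the point of view of each user (one entry per user and contact), so that a contact between two members of the same group is counted once for each of them; this counting convention, the function names and the toy data are assumptions.

```python
def spatial_segregation(region_counts, group):
    """Share of a region's population that belongs to `group`.

    region_counts: dict mapping group label -> number of residents in the region.
    """
    total = sum(region_counts.values())
    return region_counts.get(group, 0) / total


def social_segregation(ego_contacts, group_of, group):
    """Fraction s / (s + d) of contacts that members of `group` have within their group.

    ego_contacts: iterable of (user, contact) pairs, seen from the user's side.
    group_of: dict mapping user -> group label.
    Returns None if members of `group` have no contacts at all.
    """
    same = different = 0
    for user, contact in ego_contacts:
        if group_of[user] != group:
            continue
        if group_of[contact] == group:
            same += 1
        else:
            different += 1
    if same + different == 0:
        return None
    return same / (same + different)


# Toy example (hypothetical data).
group_of = {"u1": "A", "u2": "A", "u3": "B", "u4": "B"}
ego_contacts = [("u1", "u2"), ("u2", "u1"), ("u1", "u3"), ("u3", "u1"),
                ("u3", "u4"), ("u4", "u3"), ("u2", "u4"), ("u4", "u2")]
share_a = spatial_segregation({"A": 30, "B": 70}, "A")
homophily_a = social_segregation(ego_contacts, group_of, "A")
```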

Communications reveal regional economy
Lately, with the growth of mobile phone coverage even in the most remote regions of the developing world, a new question has arisen, namely: is it possible to use CDR data to evaluate the socio-economic state of the different regions of a country? Being able to estimate and update poverty rates in different regions of a country could help governments make informed political decisions, knowing how their country is developing economically. A first step in that direction was explored by Eagle et al. in a study using data from the UK []. The authors investigated whether some relationship could be found between the structure of a user's social network and the type of environment in which they live. Using both CDRs of fixed landline (% coverage) and mobile phones (% coverage), they showed that the social and geographical diversity of nodes' contacts, measured using the entropy of contact frequencies, correlates positively with a socio-economic factor of the neighborhood. Given a node $i$, calling each of their $d_i$ neighbors $j$ at frequency $p_{ij}$, and calling each of the $A$ locations $a$ at frequency $p_{ia}$, their social and spatial diversity are given by
$$D_{\mathrm{social}}(i) = -\frac{\sum_{j=1}^{d_i} p_{ij} \log p_{ij}}{\log d_i}, \qquad D_{\mathrm{spatial}}(i) = -\frac{\sum_{a=1}^{A} p_{ia} \log p_{ia}}{\log A},$$
which is  if the node has diversified contacts. On Figure   of calls between each pair of antenna. They observe that a high CallRank index seems to correspond well to a region that is important for the national economy. However, lacking accurate data to validate the results, they only conclude that this measure is probably a good indicator, without being able to evaluate its accuracy quantitatively. Another analysis of the same dataset was proposed by Smith-Clarke et al., who extracted a series of features to see which ones showed the best correlation with poverty levels [].
The authors show that besides the total volume of calls, poverty levels are also linked to deviations from the expected flow of communications: if the amount of communication from and to a certain area is significantly lower than expected, then higher poverty levels are to be expected in that area. Another indicator of poverty was explored by Frias-Martinez et al., who analyzed the link between the mobility of people and the socio-economic levels of a city in Latin America []. The authors propose several measures to quantify the mobility of users, and show that socio-economic levels present a linear correspondence with three indicators of mobility, namely the number of different antennas used, the radius of gyration and the diameter of the area of frequently visited locations, indicating that the more mobile people are, the less poor the area in which they live seems to be. In a further study by the same research group, Frias-Martinez et al. go one step further, and propose a method not only to estimate, but also to forecast future socio-economic levels, based on time series of different variables gathered from mobile phone data []. They show preliminary evidence that the socio-economic levels could follow a pattern, allowing for prediction with mobile phone data.
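The entropy-based diversity of contacts mentioned earlier in this section can be illustrated with a normalized Shannon entropy of the volumes a user exchanges with each contact (or sends from each location). The normalization by the logarithm of the number of categories is assumed here so that the value lies between 0 and 1; the function name and the optional `n_categories` parameter are hypothetical.

```python
import math


def diversity(volumes, n_categories=None):
    """Normalized Shannon entropy of a list of non-negative volumes.

    volumes: call volumes per contact (social) or per location (spatial).
    n_categories: number of possible categories used for normalization;
        defaults to the number of entries with a positive volume.
    Returns a value in [0, 1]; 0.0 when there are fewer than two categories.
    """
    positive = [v for v in volumes if v > 0]
    k = n_categories if n_categories is not None else len(positive)
    if k < 2 or not positive:
        return 0.0
    total = sum(positive)
    entropy = -sum((v / total) * math.log(v / total) for v in positive)
    return entropy / math.log(k)


even = diversity([5, 5, 5, 5])       # calls spread evenly over four contacts
skewed = diversity([97, 1, 1, 1])    # almost all calls go to a single contact
partial = diversity([5, 5], n_categories=4)  # two of four locations used evenly
```

A user who splits communication evenly obtains the maximal value, while a user who concentrates almost all calls on one contact obtains a value close to zero.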
Another valuable, and rather new, source of data extracted from mobile phone activity is the history of airtime purchases of each user. Using this data on the network of Ivory Coast, Gutierrez et al. propose another approach to infer the socio-economic state of the different regions of a developing country []. The authors make the hypothesis that people who make many small purchases are probably less wealthy than those who make fewer, larger purchases, supposing that poorer people do not have enough cash flow to buy large amounts at once. Figure  shows the map of average purchases throughout the country. Here again, lacking reliable external socio-economic data against which to validate those results, the authors provide an interpretation of the differences observed between the different regions, and show that their hypothesis seems plausible.

Adding time - dynamical networks
A particularity of a mobile call graph is that the links are very precisely located in time. Although each call has a precise time stamp and duration, the previously presented studies consider mobile call graphs as static networks, where edges are aggregated over time. This aggregation leads to a loss of information, on the one hand about the dynamics of the links (some may appear or disappear during the collection period), and on the other hand about the dynamics on the links. Recently, some authors have attempted to avoid this issue by taking the dynamical component of links into account in the definition of such networks. The topic of dynamical (or temporal) networks has been studied broadly for several types of networks [], but the study of mobile phone graphs as evolving networks is rather recent; given their inherently dynamical nature, mobile call graphs are excellent sources of information for such studies.

Dynamics of structural properties
A first question regards the so-called link prediction problem, that is, predicting, for a future time window, whether a given link will appear in or disappear from the network. This problem has already been studied in machine learning and applied to different empirical networks such as e-mails and co-authorship [] or movie preferences []. Some work has already been carried out in this framework using mobile phone data for social network analysis, of which we give a short overview here.
How long does a link last in a network? By analyzing slices of  weeks of a mobile phone network, Hidalgo and Rodriguez-Sickert observed that the frequency of presence of links in the different slices, the persistence, followed a bimodal distribution [], as illustrated in Figure . The persistence of link $(i, j)$ is defined as
$$P_{ij} = \frac{1}{M} \sum_{T} A_{ij}(T),$$
where $A_{ij}(T)$ is 1 if the link $(i, j)$ is in slice $T$ and 0 otherwise, $M$ being the number of slices. Most links in the network are only present in one window, and the probability of a link being observed in several windows decreases with the number of windows, but there is an unexpectedly large number of links that are present in all windows. These highly recurrent links thus represent strong, temporally consistent relationships, in contrast with the large number of volatile connections appearing in only one of the slices. A deeper analysis of correlations between the persistence and static measures further shows that clustering, reciprocity and high topological overlap are usually associated with a strong persistence.
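Since the persistence of a link is the fraction of time slices in which it is present, it can be computed directly from the sets of links observed in each slice. The sketch below is a minimal illustration; the function name and the toy slices are hypothetical, and links are assumed to be stored as sorted pairs so that each undirected link has a single key.

```python
def link_persistence(slices):
    """Return, for each link, the fraction of slices in which it appears.

    slices: list of sets of links, one set per time slice; each link is a
        tuple (i, j) with i < j so that undirected links have a unique key.
    """
    n_slices = len(slices)
    counts = {}
    for links in slices:
        for link in links:
            counts[link] = counts.get(link, 0) + 1
    return {link: count / n_slices for link, count in counts.items()}


# Four toy slices (hypothetical data): one recurrent link, two volatile ones.
slices = [
    {("a", "b"), ("a", "c")},
    {("a", "b")},
    {("a", "b"), ("b", "c")},
    {("a", "b")},
]
persistence = link_persistence(slices)
```

In this toy example the link between `a` and `b` has persistence 1, while the two other links appear in a single slice; a histogram of such values over a real dataset would show the two modes described above.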
Raeder et al. [] dig a bit deeper into that last topic, by attempting to predict which links will decay and which will persist, based on several local indicators. They quantify the information provided by each indicator by the decrease in entropy of the probability of an edge persisting, and find that the most informative indicators are the number of calls made between both nodes as well as its scaled version. By trying both a decision-tree classifier and a logistic regression classifier, they manage to predict correctly about % of the persistent edges and decays.
While most approaches to the link prediction problem use the network structure and similarity between nodes to extract information on future interactions [-], Miritello goes a step further and introduces temporal components to build a new link prediction model []. The author defines the temporal stability of a link as
$$\Delta_{ij} = t^{\max}_{ij} - t^{\min}_{ij},$$
where $t^{\min}_{ij}$ and $t^{\max}_{ij}$ are, respectively, the time instants at which the link has first and last been active in the observation period $T$. High values of temporal stability, close to $T$, indicate that the link may be active beyond the time window of observation, whereas small values of $\Delta_{ij}$ may indicate that the link was only active for a short time. Miritello further shows that introducing temporal components in a link prediction model significantly improves the performance, and presents a threshold model achieving % accuracy in predicting whether a link will decay or persist in a future time window.
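Taking the temporal stability of a link to be the time elapsed between its first and last observed activity, the quantity can be computed in a single pass over the event list. The thresholding function that follows is only a naive illustration of how such a feature could feed a decay-or-persist prediction; it is not the model of the cited work, and the threshold value, function names and toy events are hypothetical.

```python
def temporal_stability(events):
    """Return, for each undirected link, the time between first and last activity.

    events: iterable of (i, j, t) tuples, where t is a numeric time stamp.
    """
    first, last = {}, {}
    for i, j, t in events:
        link = (i, j) if i <= j else (j, i)
        first[link] = min(t, first.get(link, t))
        last[link] = max(t, last.get(link, t))
    return {link: last[link] - first[link] for link in first}


def naive_persistence_guess(stability, threshold):
    """Guess that a link persists if its temporal stability reaches `threshold`."""
    return {link: delta >= threshold for link, delta in stability.items()}


# Toy events (hypothetical data): time stamps in days.
events = [("a", "b", 1), ("b", "a", 90), ("a", "c", 40), ("a", "c", 45)]
stability = temporal_stability(events)
guess = naive_persistence_guess(stability, threshold=30)
```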
Addressing a related question, Miritello also observed that it is difficult to distinguish between a link that has decayed and a link that simply has not been observed for a long time, especially when the time window of observation is only a few months long [].
On a closely related topic, Karsai et al. studied how the weights of the links in a network vary with time, how strong ties form, and how this process is related to the formation of new ties []. They start by measuring the probability $p_k(n)$ that the next communication of an individual who currently has degree $n$ will result in the formation of a new, $(n+1)$th tie. This probability depends on the parameter $k$, which corresponds to the final degree of the individual at the end of the observation period. They find that the formation of new ties follows a very consistent pattern, namely
$$p_k(n) = \frac{c(k)}{n + c(k)},$$
where $c(k)$ is an offset constant that depends on the degree $k$ considered. Using the measured $c$ for each degree class, the authors then show that rescaling the distributions $p_k(n)$ collapses all curves into one (see Figure ), suggesting that the evolution of the ego-network of each individual is governed by roughly the same mechanism. The reasons for the decay and persistence of links remain varied and largely unknown. However, Miritello et al. addressed a related question, namely how many links a person can keep active over time []. By looking at a large time window (around  months of data), they evaluate how many contacts are new acquaintances, and how many ties are deactivated during a smaller time window. It appears that individuals show a finite communication capacity, limiting the number of ties that they are able to keep active over time: in the network of a single user, the number of active ties remains approximately constant in the long term. From a social point of view, apart from the balanced social strategy between a user's communication capacity and activity, the authors distinguish two rather extreme kinds of behavior that they name social explorer and social keeper.
While the social explorer shows a very high turnover in their social contacts and a very high activity compared to their capacity, keeping only a very small stable network, the social keeper has a very stable social circle and activates and deactivates ties only at a very slow pace. The authors further show that the social strategy of an individual can be linked to the topology of their local network. In a related paper, Miritello et al. [] further show that even though people who have a large network tend to spend more time on the phone than those who have few contacts, the total communication time seems to reach a maximum, and the strength of ties starts decaying for people who have more than  contacts.
Despite this turnover in links and the fact that links appear and disappear, there seems to be some consistency in a person's network of contacts. In a related study, Saramäki et al. showed how a turnover in contacts did not imply a change in the structure of the local network around a person []. They study a network of students who, during the time window covered by the dataset, move from high school to college. Despite the very high turnover in a user's contacts, the distribution of the weights on the links around the user, that the authors call the social signature of this user, stays very similar through time.
From an evolving network perspective, the question of stability and survival of communities is closely linked to the previous questions. Palla et al. studied the temporal stability of communities in a mobile phone network [], analyzing communities detected on slices of two weeks. They observed that the conditions for a community to survive depend on its size: small communities need to be stable, while large groups need to be highly dynamic and to change their composition often. Recently, several community detection techniques developed for static graphs have been extended to take into account the dynamics of human interactions []. One approach to this question is to detect communities in a multislice network, where each slice represents the network at a given point in time, and nodes of a slice are linked to their counterparts in adjacent slices []. However, to our knowledge this approach has not yet been applied to mobile phone networks. Using another approach on the Reality Mining dataset, Xu et al. detect communities using evolving adjacency matrices [], and show that this approach gives more consistent results than detecting communities independently in subsequent snapshots of the network [].
On a shorter time scale, Kovanen et al. identified temporal motifs, that is, sequences of adjacent events involving a small number of nodes (typically  or ) []. Two events are said to be $\Delta t$-adjacent if they have at least one node in common and the time between them is less than $\Delta t$ (typically of the order of minutes). The authors analyze the most common motifs present in a mobile phone database and find that the most common temporal motifs of three events involve only two nodes, and that motifs that allow a causal hypothesis are more frequent than those that do not.
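The adjacency criterion between events is the building block of such temporal motifs. The sketch below lists all pairs of events that share at least one node and whose start times are separated by less than a given interval; measuring the gap between start times (rather than, say, between the end of one call and the start of the next) is a simplifying assumption, and the function name and toy events are hypothetical.

```python
def adjacent_event_pairs(events, delta_t):
    """Return all pairs of events sharing a node and separated by less than delta_t.

    events: list of (t, caller, callee) tuples, t being the start time.
    Pairs are returned as (earlier_event, later_event).
    """
    ordered = sorted(events)
    pairs = []
    for a in range(len(ordered)):
        t_a, u_a, v_a = ordered[a]
        for b in range(a + 1, len(ordered)):
            t_b, u_b, v_b = ordered[b]
            if t_b - t_a >= delta_t:
                break  # events are sorted by time, so later ones are even further away
            if {u_a, v_a} & {u_b, v_b}:
                pairs.append((ordered[a], ordered[b]))
    return pairs


# Toy events (hypothetical data): times in seconds.
events = [(0, "a", "b"), (100, "b", "c"), (150, "d", "e"), (1000, "a", "c")]
pairs = adjacent_event_pairs(events, delta_t=300)
```

Connected sets of such adjacent events can then be grouped and classified by their shape to count motifs.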
The availability of timestamps in datasets makes it possible to separate calls made during office hours from those made during home hours. By supposing that calls made during office hours serve a business purpose, while private calls are made in the early morning, in the evening or over the weekend, Kirkpatrick et al. managed to build two separate networks based on a mobile and landline dataset from the UK []. The degree and clustering coefficient distributions of both networks are mostly similar, but a deeper analysis of the network structure shows that some important differences exist between them. By decomposing the network into k-cores and monitoring the speed of information diffusion, they observe that the work network is much more connected than the leisure network, and that information diffuses almost twice as fast. Addressing a related question, Pielot et al. present a method to predict the attentiveness to an instant message, that is, the time it will take the receiver to attend to the message []. They use data on the user's interaction with the smartphone to predict with % accuracy how fast the receiver will pay attention to the communication.

Burstiness
The dynamics of many random systems are modeled by a Poisson process, where the interval between two consecutive events follows an exponential distribution, well characterized by its mean. However, it has appeared that human interactions show a different temporal pattern, with many interactions happening within very short times, separated by less frequent long waiting times [].
The same holds for mobile phone calls. Karsai et al. studied the implications of the bursty patterns on the links of a mobile call graph []. They observed that the inter-event time indeed ranges over multiple orders of magnitude and that, in particular, the burstiness of human communication induces long waiting times, which slows down the spreading of information over the network (see Section  for more results on spreading processes). In a further paper [], Karsai et al. also analyzed the distribution of the number of events in bursty cascades, thus better explaining the correlations and heterogeneities in temporal sequences that arise from the effects of memory in the timing of events. In another study, Wu et al. find that the distribution of times between two consecutive events is neither a power law nor an exponential, but rather a bimodal distribution represented by a power law with an exponential tail [].
It is interesting to note that in the previous papers, the authors observed the inter-event time on links, sorting links by weight. In [], Candia et al. perform a similar task for nodes, measuring the inter-event time for nodes grouped by the number of calls they made. Similarly to Karsai et al.'s observations, the inter-event times range over several orders of magnitude, and the distribution is shifted to higher inter-event times for nodes of lower activity. After rescaling by the average of each distribution, the inter-event time distributions collapse onto a single curve fitted by a power law with exponent . followed by an exponential cutoff at  days:

$$p(\Delta T) = (\Delta T)^{-\alpha} \exp(-\Delta T/\tau_c).$$
In a further paper, Karsai et al. study bursty trains, and show that the burstiness observed in communication networks is mainly a link property, rather than a node property []. They show that bursty trains usually involve the same pair of individuals, rather than one node and several of its neighbors. They further observe that within those bursty trains, there is a strong imbalance within a link with respect to who initiates the communication when voice calls are observed, while trains of SMSs are much more balanced.
The origin of this burstiness in human behavior has been discussed in several papers in the last few years. It is expected, for example, that people will have more activity during the daytime than at night, and that some times of the day will represent peaks of activity []. Could the burstiness of phone calls therefore be due only to the daily patterns present in our lives? Jo et al. studied this question and looked at how much of the burstiness of events remained after removing the circadian and weekly patterns that appear in a mobile phone dataset []. To do so, they dilated (contracted) the time of their dataset at times of high (low) activity. They observed that much of the burstiness remained after removing the circadian and weekly patterns, indicating that there is probably another cause of burstiness coming from the mechanisms of correlated patterns of human behavior. Another hypothesis was put forward by Barabási [], who suggests that the burstiness comes from task prioritizing in human behavior: if an individual always executes the highest-priority task on their list first, then high-priority tasks will be executed soon after their arrival on the list, while lower-priority tasks will stay on the list for a much longer time, waiting until all higher-priority tasks are executed. This process leads to a fat-tailed distribution of waiting times, as was shown in [].
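The effect of task prioritizing on waiting times can be explored with a small simulation. The setup below is one simple queueing formulation chosen for illustration, and its details are assumptions rather than a description of the cited model: the list has a fixed length, priorities are drawn uniformly at random, and at each step the highest-priority task is executed with probability `p_highest` (a random task otherwise) and replaced by a new task.

```python
import random


def simulate_priority_queue(steps, list_length=2, p_highest=0.999, seed=0):
    """Simulate a fixed-length task list and return the waiting time of each executed task.

    At every step one task is executed: the highest-priority one with
    probability p_highest, otherwise a task chosen uniformly at random.
    The executed task is replaced by a new task with a random priority.
    """
    rng = random.Random(seed)
    # Each task is stored as (priority, arrival_step).
    tasks = [(rng.random(), 0) for _ in range(list_length)]
    waiting_times = []
    for step in range(1, steps + 1):
        if rng.random() < p_highest:
            idx = max(range(len(tasks)), key=lambda k: tasks[k][0])
        else:
            idx = rng.randrange(len(tasks))
        _, arrival = tasks[idx]
        waiting_times.append(step - arrival)
        tasks[idx] = (rng.random(), step)
    return waiting_times


waits = simulate_priority_queue(20000)
median_wait = sorted(waits)[len(waits) // 2]
longest_wait = max(waits)
```

Most tasks are executed in the step following their arrival, while a few low-priority tasks wait far longer, which is the qualitative signature of a heavy-tailed waiting time distribution.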
Mobile phone networks are composed of complex patterns and interactions, but little work has been done so far to characterize these interactions. The temporal arrival and disappearance of structures more complex than simple edges, and the timescales of human communication, are only two examples of the wide range of research that remains to be explored in this area.

Combining space and time - mobility
Given their portability, mobile phones are reliable devices for recording the mobility traces of users. The availability of spatio-temporal information on mobile phone users has already led to a tremendous number of research projects and potential applications (see Section ), too many to review exhaustively here. The increasing number of smartphone applications that offer services based on the geolocation of the user shows that this information still has many potential uses that are yet to be discovered. In this section, we concentrate on the contributions that present new observations or methods for analyzing and modeling human mobility, while the contributions that propose new applications or uses of these methods are presented in Section .

Individual mobility is far from random
A mobility trace is represented as a sequence of cell phone towers at which a specific user has been recorded while making a phone call. By studying the traces of , mobile , who identify significant differences between observational data and two typical models of human displacement: the continuous time random walk and the Lévy flight. Instead, the authors show that a model mixing the propensity of users to return to previously visited locations and a tendency to explore manages to reproduce characteristics present in their data but absent from traditional models. In their model, each time a user decides to change location, they can either choose a new location with a probability that decreases with the number of already visited locations ($p_{\mathrm{new}} \propto S^{-\gamma}$, where $S$ is the number of visited locations and $\gamma$ a constant), or return to a previously visited location. Despite the simplicity of this model, they manage to explain the temporal growth of the number of distinct locations, the shape of the probability distribution of presence in each location, and the slowness of diffusion.
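The exploration and return mechanism can be simulated in a few lines. In the sketch below, the probability of exploring a new location is taken as `rho * S**(-gamma)`, and returns are drawn with probabilities proportional to the number of past visits to each location; the proportionality constant `rho`, the parameter values, and the return rule are assumptions made for illustration, and the waiting times between moves are ignored.

```python
import random


def simulate_exploration_return(steps, rho=0.6, gamma=0.2, seed=0):
    """Simulate a sequence of visited locations with exploration and return.

    At each move the walker visits a new location with probability
    rho * S**(-gamma), S being the number of distinct locations seen so far;
    otherwise it returns to a known location chosen proportionally to the
    number of previous visits (an assumed return rule).
    """
    rng = random.Random(seed)
    visits = {0: 1}  # location id -> number of visits
    trajectory = [0]
    for _ in range(steps):
        s = len(visits)
        if rng.random() < rho * s ** (-gamma):
            location = s  # new locations receive consecutive integer ids
            visits[location] = 1
        else:
            locations = list(visits)
            weights = [visits[loc] for loc in locations]
            location = rng.choices(locations, weights=weights, k=1)[0]
            visits[location] += 1
        trajectory.append(location)
    return trajectory, visits


trajectory, visits = simulate_exploration_return(2000)
n_distinct = len(visits)
```

Plotting the number of distinct locations against the number of moves, or the ranked visit counts, gives a quick check of the sublinear growth and the uneven presence across locations described above.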
In another approach, Csáji et al. show how small the number of frequently visited locations is []. They define a frequently visited location of a user as a place where more than % of phone calls were initiated. Using a sample of , users randomly chosen in a dataset of communications in Portugal, the authors find that the average number of frequently visited locations is only ., and that % of the users frequently visit fewer than  locations. Instead of listing frequently visited locations, Bagrow et al. propose another method that groups frequently visited locations representing recurrent mobility into one 'habitat' []. The primary 'habitats' therefore capture typical daily mobility, and subsidiary 'habitats' represent occasional travel. Interestingly, they show that the mobility within each habitat presents universal scaling patterns and that the radius of gyration of motion within a habitat is usually an order of magnitude smaller than that of the total mobility.
However synchronized and predictable the mobility patterns presented here seem to be, most of these studies are based on data from developed countries, where cultural and linguistic diversity does not play as big a role as in the developing world. Amini et al. analyze and quantify the differences between mobility patterns in Portugal and Ivory Coast, and show that models that perform well for developed countries can be challenged by the cultural and linguistic diversity of Ivory Coast, which counts  distinct tribes []. They show, for example, that commuters in Ivory Coast tend to travel much longer distances than their counterparts in Portugal, and that mobility patterns vary much more across the country in Ivory Coast than in Portugal.
If mobility traces are not random, and if users often return to previously visited locations, can human mobility be predicted? Song et al. [] addressed this question and investigated to what extent one could predict the next location of a user based on the sequence of their previously visited locations. This predictability is given by the entropy rate of the sequence of locations at which the user is observed. Importantly, not only the frequency of visits at each location is taken into account, but also the temporal correlations between those visits. Their results show that the temporal correlations of the users' displacements drastically reduce the uncertainty on the presence of a mobile phone user, see Figure . Using Fano's inequality, they deduce that an appropriate algorithm could predict up to % of a user's location on average. The most surprising finding is that users are not only highly predictable on average, but that this predictability remains constant across the whole population, whatever distance users usually travel. While one would expect people who travel often and far to be less predictable than those who stay in their neighborhood, Song's results indicate that there is no such variation in predictability in the population.
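To make the role of Fano's inequality concrete, the sketch below computes the upper bound on predictability from a given entropy rate. It assumes the commonly used form of the bound, $S = H(\Pi) + (1 - \Pi)\log_2(N - 1)$, where $S$ is the entropy rate in bits, $N$ the number of distinct locations of the user and $H$ the binary entropy; the estimation of $S$ itself from a trace is not covered, and the function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def max_predictability(entropy_bits, n_locations, tol=1e-10):
    """Solve S = H(p) + (1 - p) * log2(N - 1) for p by bisection."""
    if n_locations < 2:
        return 1.0

    def fano(p):
        return binary_entropy(p) + (1 - p) * math.log2(n_locations - 1)

    # fano() decreases from log2(N) to 0 on the interval [1/N, 1].
    lo, hi = 1.0 / n_locations, 1.0
    if entropy_bits >= fano(lo):
        return lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fano(mid) > entropy_bits:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

A zero entropy rate yields a bound of 1 (a fully predictable user), while the maximal entropy $\log_2 N$ yields $1/N$, the success rate of a uniform random guess.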
While the aim of the previous work was to show how predictable human motion could be, the authors did not provide any prediction algorithm, keeping their contribution on the theoretical side. Calabrese et al. went a step further and proposed in [] a predictive model for the location of people. Their algorithm is based both on the past trajectory of the targeted user and on a general drift of the collectivity, imposed by geographical features and points of interest. The prediction is then a weighted average of an individual behavior and a collective behavior. The individual behavior is modeled as a first-order approximation of the concept proposed by Song [], building a Markov chain where states are locations visited by the user and the probability of moving from state i to state j is proportional to the number of times this transition has been observed in the data. The collective behavior is modeled as a weighted average of the influence of distance, points of interest and land use. Applied to a sample of a dataset containing the records of  million people over  months, their model correctly predicts the next location of a user in % of cases.
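The individual component of such a predictor can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy trace and the form of the collective distribution passed to `blended_prediction` are assumptions.

```python
from collections import Counter, defaultdict

def fit_transitions(trace):
    """Estimate first-order transition probabilities from a location sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(trace, trace[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }

def blended_prediction(individual, collective, weight):
    """Weighted average of two distributions over the next location.

    `weight` is the share given to the individual behavior; the
    collective distribution is supplied by the caller.
    """
    places = set(individual) | set(collective)
    scores = {
        p: weight * individual.get(p, 0.0)
        + (1 - weight) * collective.get(p, 0.0)
        for p in places
    }
    return max(scores, key=scores.get), scores
```

For a trace such as `["home", "work", "home", "work", "gym", "home"]`, the row of the transition table for `"work"` splits evenly between `"home"` and `"gym"`, and a collective distribution favoring `"gym"` tips the blended prediction toward it.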
The Markov chain approach used by Calabrese et al. for modeling the individual behavior is also the basis of a study proposed by Park et al. []. They showed how the temporal evolution of the radius of gyration of a user can be explained by the eigenmode analysis of the transition matrix of the Markov chain. More precisely, the eigenvectors of the transition matrix provide fine-grained information on the traces of individuals.
Instead of looking at the general mobility of people, Simini et al. focused on modeling the commuting fluxes between cities, and introduced the radiation model [], overcoming some of the limitations of the gravity model (recall Section ). The radiation model is a stochastic model, assigning a person from a county i to a job in another county j with a probability depending on the estimated number of job opportunities close to the county of origin i. The estimated number of job opportunities in a given county is also a stochastic variable proportional to the total population of the county. If we denote by $d_{ij}$ the distance between counties i and j, the average number of commuters between the two counties depends on the population of both counties ($m_i$ and $n_j$, respectively), and on $s_{ij}$, representing the total population in a circle of radius $d_{ij}$:
$$\langle T_{ij} \rangle = T_i \, \frac{m_i n_j}{(m_i + s_{ij})(m_i + n_j + s_{ij})},$$
where $T_i$ is the total number of commuters from county i. The radiation model, however efficient, still relies on the knowledge of the distribution of the population, which may be difficult to obtain in some areas such as the developing world. Overcoming this limitation, Palchykov et al. suggest a new model using only communication patterns []. The communication model supposes that the mobility between two places i and j is a function of the distance $d_{ij}$ separating the two locations, and of the intensity of communication between these two locations, $c_{ij}$:
$$T_{ij} = k \, \frac{c_{ij}}{d_{ij}^{\beta}},$$
where k is a normalization constant. The authors find fitting values for the parameter $\beta$ around . or . depending on whether they consider the mobility at intra- or inter-city level, respectively. As it appears, the massive amount of mobility data, which at first sight might be taken for random motion, follows a strict routine.
Mathematical models, prediction algorithms and visualization tools (see for example Martino's work []) have recently shed light on this routine, making it possible to construct better models of human displacement, which can be used to predict epidemic outbreaks. At the individual level, this routine appears to strictly rule our daily behavior, as Eagle and Pentland [] show that six eigenvectors of the mobility patterns of users are sufficient to reconstruct % of the variance observed. They also observed that individuals tend to have synchronized behaviors, which will be described in the next paragraph.
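As a worked illustration of the radiation model discussed above, the sketch below evaluates its commonly cited closed form, $\langle T_{ij} \rangle = T_i \, m_i n_j / [(m_i + s_{ij})(m_i + n_j + s_{ij})]$. Two choices are assumptions of this sketch: $s_{ij}$ is computed as the population within distance $d_{ij}$ of the origin excluding the origin and destination counties themselves, and counties are reduced to points in the plane. The function name and the toy data are hypothetical.

```python
import math

def radiation_flows(populations, coords, commuters, origin):
    """Expected number of commuters from `origin` to every other county."""
    m_i = populations[origin]
    xi, yi = coords[origin]
    dist = {j: math.hypot(x - xi, y - yi) for j, (x, y) in coords.items()}
    flows = {}
    for j, n_j in populations.items():
        if j == origin:
            continue
        # s_ij: population no farther from the origin than county j,
        # excluding the origin and j themselves (assumption of this sketch).
        s_ij = sum(
            populations[k]
            for k in populations
            if k not in (origin, j) and dist[k] <= dist[j]
        )
        flows[j] = commuters * m_i * n_j / ((m_i + s_ij) * (m_i + n_j + s_ij))
    return flows
```

Note that no fitted parameter appears: once the population distribution and the total number of commuters are known, the flows follow directly, which is why the model depends so strongly on good population data.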

Aggregate mobility reveals synchronized behavior of populations
At a higher level, those datasets make it possible to consider whole populations from a God's-eye point of view. More practically, the availability of such massive data allows us first to observe and quantify the interaction of people with their environment, and second to quantify the synchronicity of those interactions.
Initial projects, such as the Mobile Landscapes [] project and Real Time Rome [], have shed light on the potential of such an approach, their contributions being essentially visual. The next step was made by Reades et al. [], who used tower signals as a digital signature of the neighborhood. They showed that similar locations present similar signatures, which implies that a clustering of the urban space is possible, based on the phone usage recorded by its antennas. In particular, the obtained clusters reveal known segmentations of the town, such as residential areas, commercial areas, bars or parks. In short, such a technique may be used as a cheap census method on area usage, which could be of great interest to local authorities. Going a bit further, the same team showed how, using an eigendecomposition [] of the signatures of different locations in town, it is possible to extract significant information on differences and similarities in space usage (see Figure ), and thus to identify which places typically correspond to work or home calling patterns (see Figure ).
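The idea of decomposing location signatures can be illustrated with a small, dependency-free sketch. It assumes that each location is summarized by a vector of activity levels per time slot, and it extracts only the leading principal component by power iteration; the data layout and the function name are hypothetical and do not reproduce the procedure of the cited work.

```python
import math

def leading_component(signatures, n_iter=200):
    """Leading eigenvector of the covariance of the rows of `signatures`.

    Returns the eigenvector and the projection (score) of each row on it.
    """
    n, d = len(signatures), len(signatures[0])
    mean = [sum(row[k] for row in signatures) / n for k in range(d)]
    centered = [[row[k] - mean[k] for k in range(d)] for row in signatures]
    cov = [
        [sum(r[a] * r[b] for r in centered) / n for b in range(d)]
        for a in range(d)
    ]
    v = [1.0 + 0.1 * k for k in range(d)]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    scores = [sum(r[k] * v[k] for k in range(d)) for r in centered]
    return v, scores
```

With toy signatures in which two locations are busy during the day and two in the evening, the sign of the score separates the two groups, which is the kind of work versus home contrast described above. The overall sign of the eigenvector is arbitrary.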
Addressing a closely related question, Karikoski and Soikkeli studied data collected from smartphones in the context of the OtaSizzle project at Aalto University, where users agreed to share their data []. The authors study whether different contexts trigger different usage patterns of smartphones. From the mobility traces of users, they classify the places where a user is observed into four categories: home, work, other meaningful, and elsewhere, the latter representing only pass-by places. They show that usage patterns depend on the context. For example, voice calls are longer and more intensively used when people are at home, whereas SMSs are more popular in the office context, where voice calls are the shortest. In a paper studying the same dataset, Jo et al. study the contextual and temporal correlations between service usage and thus characterize typical usage patterns of smartphone services []. The authors further use k-means clustering to extract typical weekly behavior, and thus classify users into morning-type and evening-type usage patterns. Addressing a very closely related question, Trestian et al. show that the mobility and locations of people also influence the choice of smartphone applications they use []. Using a similar approach, Naboulsi et al. classify call profiles of snapshots of the network, each corresponding to an aggregation of the traffic going through the network during a given time window []. They measure the similarity between two snapshots by comparing the volumes and distribution of the traffic through the network. They further extract typical usage patterns, and propose a method to detect outlying behavior in the network. Interestingly, even though the methods are very similar, this last approach is based only on antenna-to-antenna traffic, and not on individual behavior and mobility patterns as the previous studies were.
Beyond the analysis of a single city, Isaacman et al. explored behavioral differences between inhabitants of different cities []. By analyzing the mobility of hundreds of thousands of inhabitants of Los Angeles and New York City, they showed that Angelenos travel on average twice as far as New Yorkers. An explanation for such a significant difference seems possible if the inhomogeneities of population density and city surfaces are taken into account. See, for example, the work of Noulas et al. [], who show using Foursquare location data that the differences between cities are leveled when a rank-based distance is used. A rank-based distance measures the distance between two places i and j as the number of potential opportunities (people, places of interest) that are closer to i than j is. Given the geographic distance $r_{ij}$ and the density of opportunities expressed in radial coordinates centered on i, $p_i(r, \theta)$, such a distance reads
$$\mathrm{rank}_i(j) = \int_0^{2\pi} \int_0^{r_{ij}} p_i(r, \theta) \, r \, dr \, d\theta.$$
In a city of large population density, there will be more opportunities at short geographical distance than in a city with low population density. Hence, users are likely to travel over shorter distances in a city of large population density. These distortions of the use of geographical distance are leveled by the rank-based distance. In a recent study, Louail et al. suggest another way to formalize these differences and analyze the spatial structure of cities by detecting hot-spots or points of interest in  Spanish metropolitan areas []. The authors show that the average distance between individuals evolves during the day, highlighting the spatial structure of the hot spots and the differences and similarities between different types of cities. They distinguish between monocentric cities, where the spatial distribution is dependent on land use, and polycentric cities, where spatial mixing between land uses is more important. In a similar approach, Trasarti et al. also analyze the correlations that arise in terms of co-variations of the local density of people, and uncover highly correlated temporal variations of population, at the city level but also at the country level [].
If the detection of the hot-spots and places of interest in a city is possible, is it then possible to go one step further and infer the type of activity that people engage in from their mobility patterns? Jiang et al. present a first approach to achieve this in [], by first extracting and characterizing areas where people stay or only pass by, and then inferring the type of activity that they engage in from the timing of their visits to specific locations. In many cases, modeling the mobility of users starts by creating an Origin-Destination matrix that represents how many people travel between a specific pair of (origin, destination) locations within a given time frame [-]. After extracting which places and times of the day correspond to which activities, Alexander et al. propose a method to estimate OD-matrices depending on the time of the day and on the purpose of the trip. The authors' results, extracted from data in the area of Boston, are surprisingly consistent with several travel survey sources.
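The construction of a time-dependent Origin-Destination matrix can be sketched as follows. The input format, a list of `(user, hour, location)` stays, and the use of the hour of the stay at the origin as a proxy for departure time are assumptions of this sketch, not the procedure of the cited works.

```python
from collections import Counter

def od_matrix(stays, window):
    """Count trips between consecutive stay locations of each user.

    stays: iterable of (user, hour, location) tuples, one per detected stay.
    window: (start_hour, end_hour), half-open, applied to the hour of the
            stay at the origin (used here as a proxy for departure time).
    """
    by_user = {}
    for user, hour, loc in sorted(stays):
        by_user.setdefault(user, []).append((hour, loc))
    matrix = Counter()
    start, end = window
    for visits in by_user.values():
        for (h0, origin), (_, dest) in zip(visits, visits[1:]):
            if origin != dest and start <= h0 < end:
                matrix[(origin, dest)] += 1
    return matrix
```

Calling the function with different windows, for example a morning window and an evening window, yields one matrix per time of day; splitting by trip purpose would additionally require labeling each stay with an inferred activity.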

Extreme situation monitoring
If the availability of data containing the time-stamped activity of a large population makes it possible to monitor routine population activity, it also makes it possible to observe the population's collective response to emergencies. Many recent papers addressed this interesting question. A variance is computed for each place a, over the time interval [t, t + T], between the different individual behaviors $n_i(a, t, T)$ and the average expected behavior. Comparing this variance with the normally expected variance identifies locations where users are acting abnormally, and shows that such locations are, in case of emergencies, spatially clustered. In cases of extreme emergencies, the response of populations can even be monitored as geographically and temporally located spikes of activity.
In a related paper, Bagrow et al. [] analyzed the reaction of populations to different emergency situations, such as a bombing, a plane crash or an earthquake (Figure ). They observed such spikes of information when eyewitnesses and their neighbors reacted almost immediately after the event. The reaction was mostly driven by calls made by nodes that do not usually call at that time, rather than by an increased call rate of usually active nodes. A detailed study of the paths followed by the information during its propagation shows the efficiency of the collective response, with  to  degrees from eyewitnesses being contacted within minutes after the situation. Gao et al. further analyzed these dynamics in [], and observed that the reciprocity of calls, i.e., 'call-back' actions, showed a sharp increase in emergency cases, such as a bombing or plane crash. The same kind of spikes of behavior, though with different characteristics, are also known to appear at recently also introduced another method they call the social amplifier to detect anomalous behavior and thus detect emergencies []. Hubs of the network are nodes that have a very high degree, and are thus very well connected to the rest of the network, enabling them to amplify the diffusion of information through the social graph. Using those particular nodes as social amplifiers, the authors show that analyzing only the local behavior of nodes that are close to the hubs of the network can be sufficient to detect anomalies of the whole network, and thus detect emergencies. This approach has the advantage that keeping an eye on a limited fraction of the network is computationally much easier than monitoring and updating the activity of the whole network.
Beyond detecting emergencies, Lu et al. studied whether the mobility of populations after a disaster could be predicted, analyzing as a case study the mobility of populations before and after the  Haiti earthquake []. Interestingly, the predictability of people's trajectories remained high and even increased in the three months following the earthquake. The authors also show that the destinations of people who left the capital were highly correlated with their previous mobility patterns, and thus that, with further research, mobile phone data could be used to monitor extreme situations and predict the movements of populations after natural disasters. These results are very encouraging for the many humanitarian organizations that are now trying to use Big Data to save lives. After the earthquake and the following tsunami that struck Japan in , several research teams started a joint project combining several big data sources, such as GPS devices, mobile phones, Twitter or Facebook, to analyze how this data could help save lives in the future if natural disasters were to strike these regions again. Similar research has been conducted by Kryvasheyeu et al., who analyzed Twitter data during and after hurricane Sandy in  and measured the performance of friendship links in raising awareness []. This area of research still needs to be explored. As many data sources are now becoming available, combining datasets could prove very useful, and even life-saving.

Mobility and social ties
The joint availability of mobility traces and social interactions in the same dataset makes it possible to address causality questions on the creation of social links. From the work of Calabrese et al. it appears that users who call each other have almost always physically met at least once over a one-year interval []. Users call each other mostly right before or after physical co-location, and interestingly, the frequency of meetings between users is highly correlated with their frequency of calls as well as with the distance separating them.
Going a step further, one may wonder if social ties could be predicted using mobility data. Wang et al. [] showed that, indeed, nodes that are not connected in the network but are topologically close, and that show similar mobility patterns, are likely to create a link. By combining mobility similarity and topological distance in a decision-tree classifier, they improve significantly on classical link prediction algorithms, yielding an average precision of % and a recall of %. Closely related, Eagle et al. showed on  years of data how the social network of people changes drastically when moving from one geographical environment to another [].
On a related topic, Toole et al. measure the similarity between the mobility of users to classify social relationships and show how to contextualize social contacts using their mobility patterns []. The authors further present a mobility model, based on stochastic decisions to return to a previously visited place or to explore, and to base the choice on social influence or on individual preference. They show that this model achieves good accuracy in reproducing the similarity of mobility traces between social contacts.

Dynamics on mobile phone networks
Many networks represent a transport between nodes via their links. In mobile phone networks, the links transport either information (exchanged during phone calls or contained in messages) or non-voice exchanges (SMS, MMS). Information diffusion has opened questions on the speed of the diffusion or on the presence of super-spreaders, with applications in viral marketing or crowd management. The transmission of data has been at the centre of attention only recently, with the rise of new types of computer viruses running on smartphones.

Information diffusion
A phone call is associated with the transfer of information between caller and callee. However, as paradoxical as it may sound, mobile phone datasets are not appropriate to observe real propagations of information. The content of phone calls or text messages is, for evident privacy reasons, unknown. Yet, without access to the content, it is impossible to decide for sure whether an observed pattern of calls reflects the transmission of information or happens by chance. One can imagine a network with a number of indistinguishable balls circulating between the nodes. Each time a node receives a ball from one of its neighbors, it keeps it for a random time interval and then transmits it to one of its neighbors. Suppose now that one decides to track the movement of one specific ball. If the number of balls is small compared to the number of nodes, this is still doable, as long as each node has at most one ball in its possession. However, if the number of balls becomes comparable to the number of nodes, there is a high probability of confusing the paths of several balls. Add to this that balls might be added, removed or duplicated during the process, and one gets a situation similar to tracking a piece of information in a mobile phone network.
This artificial example reflects well the issue of tracking information. Peruani and Tabourier addressed this issue and showed that cascades of information, such as those observed in mobile call graphs, are statistically irrelevant, and thus probably do not correspond to real propagations []. Tabourier et al. show in a further paper [] that even though large cascades of information spreading do not seem to happen in mobile call graphs, local short chain-like patterns and closed loops seem to be the effects of some causality and could very well be related to information spreading.
In a small number of cases, however, the actual observation of large diffusion of information might be possible. Studying the case of emergencies, such as a plane crash or a bombing, Bagrow et al. [] observed an unusual activity in the geographical neighborhood of the catastrophe. In this case, the knowledge of both the temporal and spatial localization of an unexpected event that is likely to generate a cascade of information makes it reasonable to assume that the observed sequences of calls are correlated for a specific reason.
If, in most cases, the observation of real propagations seems an unreachable objective, more extensive research has been conducted on the simulation of information propagation on complex networks, whose results have been extended to questions related to mobile phone networks. There are several ways of modeling information diffusion on networks. A simple way is used in [] with an SI or SIR model where, at each time step, infectious nodes try to infect their neighbors with a probability proportional to the link weight, which corresponds to a sequence of percolation processes on the network. However, mobile phone networks are known to have very particular dynamics (recall Section ), which are not taken into account here. Miritello et al. [] used a formalism similar to the one presented by Newman [] for epidemics to characterize the dynamical strength of a link, which can be used as link weight to map the dynamical process onto a static percolation problem. The dynamical strength, given an SIR model of recovery time T and probability of transmission $\lambda$, is given by
$$\mathcal{T}_{ij} = \sum_{n=0}^{\infty} P(w_{ij} = n; T) \left[ 1 - (1 - \lambda)^n \right],$$
which is the probability $P(w_{ij} = n; T)$ of having n calls between i and j in a time range of T multiplied by the probability of propagation given these n calls, summed over all possible values of n. Using an approximation of this expression, they manage to link the observed outbreaks to classical percolation theory tools. However, such a formalism still neglects the impact of temporal correlations between calls, which significantly slows down the transmission of information over a network. Social networks often exhibit small-world topologies, characterized by average shortest paths between pairs of nodes being very short compared to the size of the network []. However, Karsai et al. [] used different randomization schemes to show that even though social networks have a typical small-world topology, the temporal sequence of events significantly slows down the spreading of information, as illustrated in Figure . Kivelä et al. [] analyze this topic further, and introduce a measure they call the relay time, specific to each link, that represents the time it takes for a newly infected node to spread the information through that link. By analyzing several computations of this relay time, in randomized and empirical networks, they show that the bursty behavior of links and the broad distribution of link weights are the components that slow down the spreading dynamics in mobile phone networks the most. In another study, Karsai et al. [] confirm this influence and show that neglecting the time-varying dynamics by aggregating temporal networks into their static counterparts introduces serious biases of several orders of magnitude in the time-scale and size of a spreading process unfolding on the network.
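The dynamical strength of a link, as described in words above, can be computed directly once the distribution of the number of calls in a window of length T is known. The sketch below assumes that the probability of propagation given n calls is $1 - (1 - \lambda)^n$, i.e., independent transmission attempts of probability $\lambda$ per call, and estimates the call-count distribution from sliding windows over a list of call timestamps; both function names and the windowing scheme are hypothetical.

```python
def empirical_call_pmf(timestamps, t_start, t_end, window, step):
    """Empirical distribution of the number of calls on a link
    in sliding windows of length `window` (the recovery time T)."""
    counts = {}
    n_windows = 0
    t = t_start
    while t + window <= t_end:
        n = sum(1 for s in timestamps if t <= s < t + window)
        counts[n] = counts.get(n, 0) + 1
        n_windows += 1
        t += step
    return {n: c / n_windows for n, c in counts.items()}

def dynamical_strength(pmf, lam):
    """Sum over n of P(n calls in T) * (1 - (1 - lam)**n)."""
    return sum(p * (1 - (1 - lam) ** n) for n, p in pmf.items())
```

For example, a link with no call in half of the windows, one call in a quarter and two calls in the remaining quarter has a dynamical strength of 0.3125 for $\lambda = 0.5$. Bursty links concentrate their calls in few windows, which lowers this value compared to a link with the same total number of calls spread evenly.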
From a more theoretical point of view, diffusion processes can be seen as particular cases of dynamical systems. Liu et al. [] questioned in this framework the controllability of complex networks. The problem was stated as follows: given a linear dynamical system with time-invariant dynamics
$$\frac{dx(t)}{dt} = A x(t) + B u(t),$$
where $x(t) = (x_1(t), \ldots, x_N(t))^T$ defines the state of the nodes of the network at time t, A is the (possibly weighted) adjacency matrix of the network, and B an input matrix acting on the input signal $u(t)$, what is the minimal number of nodes needed for the input such that the state of each node is controllable, i.e., the system is entirely controllable? From control theory, one knows that a necessary and sufficient condition is that the reachability matrix $C = (B, AB, A^2 B, \ldots, A^{N-1} B)$ is of full rank. From previous work, it is known that the minimal number of nodes required is related to the maximal matching in the network, which can be computed with a reasonable complexity. For example, the authors show that in a mobile phone network, one needs to control about % of the nodes in order to achieve full controllability of the system. Surprisingly, most nodes needed for controlling the network are low-degree nodes, while hubs, which are commonly used as efficient spreaders, are under-represented in the set of input nodes. While the practical interest of this research still needs to be defined, this first result on controllability of networks might open new ideas in the field of information spreading.
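The rank condition can be checked explicitly on small examples. The sketch below assumes the standard linear form $dx/dt = Ax + Bu$ with one independent input per driver node, and uses the convention that `a[i][j]` is the weight of the link from node j to node i; it builds the reachability matrix and computes its rank exactly with rational arithmetic. It only tests a given set of driver nodes and is not the matching-based method used to find a minimal set.

```python
from fractions import Fraction

def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def rank(matrix):
    """Rank by Gaussian elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def is_controllable(a, driver_nodes):
    """Kalman rank test with one input signal per driver node."""
    n = len(a)
    b = [[1 if i == d else 0 for d in driver_nodes] for i in range(n)]
    blocks, current = [b], b
    for _ in range(n - 1):
        current = mat_mul(a, current)
        blocks.append(current)
    # Concatenate B, AB, ..., A^(N-1)B column-wise.
    c = [sum((blk[i] for blk in blocks), []) for i in range(n)]
    return rank(c) == n
```

On a directed chain 0 → 1 → 2, driving node 0 alone suffices. On a directed star whose hub 0 points to leaves 1 and 2, driving the hub alone fails the test while driving the hub and one leaf passes, a small-scale illustration of why hubs alone are not sufficient input nodes.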
Finally, one may wonder if the patterns of phone usage are efficient in a collaborative scheme. Cebrian et al. [] studied this with a small model, where each node of a mobile phone graph is represented as an agent endowed with a state represented by a binary string. The agents are all given the same function f, which takes their binary string as input, is hard to optimize, and computes their personal score. After each communication, the two communicating agents can modify their state in order to increase their personal score. This modification is done with a simple genetic algorithm, which simulates a cross-over of the states of both agents.
Practically, suppose that two agents i and j are respectively in state $x_i^{(t)}$ and $x_j^{(t)}$ at time t. These states are both binary strings of length T. The agents choose a random integer c in the interval [, T] and both update their state as
$$x_i^{(t+1)} = x_j^{(t+1)} = \arg\max_{y \in \{x_i^{(t)},\, x_j^{(t)},\, y_1,\, y_2\}} f(y),$$
where $y_1$ is the vector with the c first entries of $x_i^{(t)}$ and the T − c last entries of $x_j^{(t)}$, and $y_2$ is the vector with the c first entries of $x_j^{(t)}$ and the T − c last entries of $x_i^{(t)}$. The authors observe with this model that the average score over all agents obtained in the real dataset is smaller than for a random topology, which is in line with similar known results from population genetics. Also, perturbing the time sequence of calls slightly enhances the global fitness.
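A single communication event of this model can be sketched as below. Several points are assumptions of the sketch rather than details taken from the original study: the cut point is drawn uniformly between 1 and the string length, both agents adopt the best-scoring string among their two current states and the two cross-over children, and states are represented as Python strings of '0' and '1'.

```python
import random

def crossover_update(x_i, x_j, f, rng=random):
    """One communication event between two agents holding binary strings.

    f is the shared score function; rng only needs a randint(a, b) method.
    """
    length = len(x_i)
    c = rng.randint(1, length)   # cut point; the range is an assumption
    y1 = x_i[:c] + x_j[c:]       # first c entries of x_i, rest from x_j
    y2 = x_j[:c] + x_i[c:]       # first c entries of x_j, rest from x_i
    best = max([x_i, x_j, y1, y2], key=f)
    return best, best
```

With a toy score function that counts ones, which is easy to optimize and used here only to make the behavior visible, the states "1100" and "0011" cut at c = 2 produce the child "1111", which both agents then adopt. By construction, the score of either agent never decreases after an update.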

Mobile viruses
The study of virus propagation has a long history, be it for biological viruses or, more recently, computer viruses. Wang et al. [] studied a new kind of virus, which spreads over mobile phone networks. Their work is motivated by the increasing number of smartphones, which run high-level operating systems like computers and are therefore exposed to a higher risk of an outbreak. So far, despite the large number of known mobile viruses, no real outbreak has been observed. The reason is that a mobile virus functions only on the operating system for which it was designed. An infected phone can hence only transfer the virus to its contacts running the same operating system. As explained by Wang et al., this situation corresponds to a site percolation procedure on the network of possible contacts. Given the actual market shares of the main operating systems, the authors showed that those were below the percolation transition of the contact network. The study concerns two types of spread available to viruses: diffusion via Bluetooth and via Multimedia Messaging System (MMS). The two show major differences in spreading patterns. Bluetooth viruses spread relatively slowly and depend on user mobility. In contrast, MMS epidemics spread extremely fast and can potentially reach the whole network in a short time, see Figure . However, they are currently contained in small parts of the network, due to the different operating systems. The authors thus conclude that if no outbreak has taken place so far, it is not due to the lack of efficient viruses, but is rooted in the fragmentation of the call graph. However, the current evolution of the market is leading to a situation where some operating systems are gaining a large market share, which could increase the risk.
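The site percolation argument can be illustrated with a short simulation. In the sketch below, each phone independently runs the targeted operating system with probability equal to its market share, and the size of the largest connected cluster of such phones bounds the reach of a virus spreading along contacts. The random contact graph is only a stand-in for a real call graph, and all names and parameter values are hypothetical.

```python
import random
from collections import deque

def largest_susceptible_cluster(adjacency, market_share, seed=0):
    """Fraction of all phones in the largest connected cluster of
    phones that run the targeted operating system."""
    rng = random.Random(seed)
    susceptible = {v for v in adjacency if rng.random() < market_share}
    seen, best = set(), 0
    for start in susceptible:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:                      # breadth-first search
            v = queue.popleft()
            size += 1
            for w in adjacency[v]:
                if w in susceptible and w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best / len(adjacency)

def random_graph(n, mean_degree, seed=0):
    """Random contact graph, used here only as a stand-in."""
    rng = random.Random(seed)
    p = mean_degree / (n - 1)
    adjacency = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adjacency[u].add(v)
                adjacency[v].add(u)
    return adjacency
```

Sweeping `market_share` from 0 to 1 on a graph built with `random_graph` shows the largest cluster staying small below a threshold and growing rapidly above it, which is the percolation transition referred to above.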
In a subsequent study, Wang et al. [] show how the scanning technique, where MMS malware generate random phone numbers to which they try to propagate instead of using the address book of their host, increases the probability of a major outbreak, even They study the effectiveness of topological viruses versus viruses that also use a scanning technique. The authors show that topological viruses, i.e., those that spread through the contact network of infected phones, are the most effective for an operating system that has a large market share, whereas the scanning technique will generate a bigger outbreak in the case of a low market share operating system.

Applications in urban sensing, epidemics, development
The last few years have seen the rise of Big Data and of its uses, and in many regards, this is rapidly changing our lives and way of thinking. Beyond observing networks of mobile phone calls or modeling social behavior, many researchers now engage in finding new ways of using mobile phone data in everyday life.

Urban sensing
As shown in the previous sections, mobile phone data makes it possible to observe and quantify human behavior as never before. Besides purely sociological questions, this data also opens a number of potential applications, such as geo-localized advertising [], which give it an intrinsic economic value. Recalling that an increasing fraction of the available smartphone applications record the user's geolocation, whether it is necessary for the app to work or not, it is easy to understand that this information is valuable to target the right users when running advertising campaigns, or simply to understand the profile of the application's users. Mobile phones are increasingly becoming a way of taking the pulse of a population, or the pulse of a city, and we expect that in the future, more and more cities will make development plans based on information gathered from mobile phone data. In this framework, recent research has shown that mobile phone data could detect where people are [] and where people travel to [], including the purpose of their trips []. If these findings are applied to a whole city and points of interest are uncovered via mobile phone data (recall Section ), then the whole organization of urban places can be influenced by the knowledge gained from this data. Urban sensing is only briefly addressed here, but has been a popular topic in the last few years, and we refer the interested reader to a recent survey of contributions in this specific field [].
We have previously addressed the possibility of using mobile phone signatures as a cheap census technique. Isaacman et al. take this analysis a step further and show how one can derive carbon footprint emissions [] based on the mobility observed from mobile phone activity.
Many applications of mobility modeling aim at transport planning and traffic monitoring, with evident applications in accident management and traffic jam prevention. Over the last (almost)  years, a large number of attempts have been made to enhance prediction using mobile phone data. This topic is only briefly addressed here with a few recent contributions; for more information on the research in this field, we refer the interested reader to a review published in  []. One example of such an application was proposed by Nanni et al., who create the origin-destination (OD) matrix of Ivory Coast and then assign this matrix to the road network [] to produce a map (see Figure  that a road's usage depends on its topological properties in the road network, and that roads are usually used only by people living in a small number of different locations []. The authors further show that taking advantage of this observation helps create better strategies for reducing travel time and congestion in the road network of a city.
Going one step further, Berlingerio et al. designed an algorithm to detect which means of transport people choose, whether public or private, in order to infer how many people use which public transportation routes [] throughout the day. The authors then proposed a model of the local transportation network of Abidjan, highlighting the routes that are taken most often. They are then able to show how specific small changes to the network could improve the average travel time of commuters by %. Among other possible uses of information on commuting flows, McInerney et al. suggested using the regular mobility of people to deliver physical packages to the most rural areas []. They show that this method is feasible and that it reduces the total delivery time for rural areas by %. Other applications of algorithms that predict the next journey of users include, for example, a recommender system for bush taxis, as suggested by Gambs et al. [], which uses the predicted next location of users to recommend suitable means of transport available in the neighborhood of pedestrians.
By monitoring the movements of people towards special planned events, Calabrese et al. [] show that the type of event correlates strongly with the neighborhood of origin of the users. Such a cartography of taste can be used by authorities when planning for the congestion effects of large events, or for targeted advertising of events (see Quercia et al. []). In a closely related approach, Cloquet and Blondel use the analysis of anomalous behavior in mobile phone activity to predict the attendance at large-scale events such as demonstrations or concerts. As a first step in that direction, the authors propose two methods to determine the time when no more people will arrive at a given event []. The first method uses the mobility of people who are traveling towards the event to model the flux of the arriving or leaving crowd. The second method is based on the recorded interactions between people who are already at the event and other users who are within  km. The authors show that, using these methods, they are able to predict the time when no more people will join the event up to  minutes in advance. Another related application was explored by Xavier et al., who analyzed the workload dynamics of a telecommunication operator before and after an event such as a soccer match [] in order to help the management of mobile phone networks during such events.
Finally, mobility traces can also be used to monitor temporal populations [], such as tourists. Kuusik et al. [] studied the mobility of roaming numbers in Estonia for  consecutive years, showing the potential for authorities to understand and efficiently target visiting tourists.

Infectious diseases
In recent years, a lot of research has been done in order to use Big Data to help monitor and prevent epidemics of infectious diseases. If one can model information spreading in mobile phone networks (recall Section ), then the same theory could also be used to model the spreading of real infectious diseases. As mobile phone data can help follow the movements of people (recall Section ), these movements can also provide information about how a disease could travel and spread across a country. The dynamics at hand usually depend on the type of disease and how it can be transmitted, hence many articles, of which we will review a few here, propose different models based on the mobility of people to predict the spread of an epidemic.
Using mobile phone traces, Wesolowski et al. measure the impact of human mobility on malaria, comparing the mobility of mobile phone users to the prevalence of malaria in different regions of Kenya, and identify the main importation routes that contribute to the spreading of malaria []. In another study, Tizzoni et al. [] validate the use of mobile phone data as a proxy for modeling epidemics. The authors extract a network of commuters in three European countries by detecting home and work locations for each mobile phone user, and compare this network with the numbers of commuters obtained from census data. On these networks of commuters, they run agent-based simulations of epidemics spreading across the country. They show that the invasion trees and the spatio-temporal evolution of epidemics are similar in the commuter networks extracted from census and from mobile phone data (see Figure ). Lacking additional information, most models assume homogeneous mixing between people who are physically within the same region or area. Frias-Martinez et al. propose another agent-based model of epidemic spreading, using the individual mobility and social networks of individuals to build a more realistic model []. Instead of assuming homogeneous mixing within a given area, an individual is more likely to meet an infected agent in the same area if the two have communicated with each other before. The authors further split the social network of contacts and the mobility model of an individual between weekdays and weekends to achieve better accuracy.
Going a step further, a few contributions to the D4D challenge [] investigated the best ways to monitor and influence an epidemic rather than just predicting its spread. In this framework, Kafsi et al. [] propose a series of measures applicable at the individual level that could help limit the epidemic. They investigate the effect of three different recommendations, namely (1) do not cross community boundaries; (2) stay with your social circle; and (3) go/stay home. Considering that any of these three recommendations could be sent to different users in the network via their mobile phones, and that probably only a fraction of the contacted users would comply, the authors evaluate the impact that implementing this system could have on the spreading process. They show that these measures can weaken the epidemic's intensity, delay its peak, and in some regions, even seriously limit the number of infected individuals. Using the same dataset, Lima et al. proposed a different approach [], namely using the connections between people to launch an information campaign about the epidemic, in the hope of reducing the probability of infection when an individual is better informed about the risks. The authors use an SIR model and the observed mobility of mobile phone users to simulate epidemics unfolding on a population, and evaluate the impact of geographic quarantine on the spreading of the disease, as well as the impact of an information campaign reducing the risk of infection for 'aware' individuals. They show that the quarantine measures do not seem to delay the endemic state, even when almost half the population is confined to their own subprefecture, whereas the less invasive information campaign seems to significantly limit the final fraction of infected individuals, opening this topic for further research.
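For reference, the compartmental dynamics underlying such simulations can be illustrated with a minimal sketch of a deterministic, discrete-time SIR model in a single well-mixed population. This is not the model of Lima et al., which couples regions through observed mobility; the function name, the parameter values and the idea of representing an information campaign as a lower transmission rate `beta` are assumptions made purely for illustration.

```python
def simulate_sir(population, initial_infected, beta, gamma, steps):
    """Discrete-time SIR model in one well-mixed population.

    beta  : transmission rate per time step (illustrative value, not from the survey)
    gamma : recovery rate per time step (illustrative value, not from the survey)
    Returns a list of (S, I, R) tuples, one per time step, including step 0.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(steps):
        # New infections are proportional to contacts between S and I.
        new_infections = min(s, beta * s * i / population)
        # A fixed fraction of infected individuals recovers at each step.
        new_recoveries = min(i, gamma * i)
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


if __name__ == "__main__":
    baseline = simulate_sir(10_000, 10, beta=0.30, gamma=0.10, steps=300)
    # Hypothetical "aware" population: same disease, lower transmission rate.
    aware = simulate_sir(10_000, 10, beta=0.15, gamma=0.10, steps=300)
    print("final recovered, baseline:", round(baseline[-1][2]))
    print("final recovered, aware   :", round(aware[-1][2]))
```

In this toy setting, lowering `beta` reduces the final number of recovered (that is, previously infected) individuals, which is the qualitative effect attributed to the information campaign above.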
This field of research has shown again how valuable mobile phone data could be to save lives, and potentially to monitor and limit epidemics of infectious diseases. However, most models and studies are limited by the lack of ground-truth data with which to compare their results. Indeed, how would one know from whom an individual caught the disease, and what exact route the disease took towards each infected person? Another shortcoming of this area of research comes from the current difficulty of gaining access to mobile phone datasets, especially those covering cross-border mobility. If modeling mobility in Africa could be useful for containing the current Ebola outbreak, cross-border mobility would be very valuable data, as discussed in []. However, gaining access to these data is more difficult, as it involves getting approval from more than one country for a single dataset. In [], the authors suggest guidelines for sharing data for humanitarian use while preserving the privacy of users.

Health and stress detection
While infectious diseases are still a major cause of death in developing countries, attention has slowly shifted, in more developed parts of the world, towards chronic diseases such as cardiovascular diseases or cancer, and their causes. Among the studied topics, daily stress in the work environment has become a major problem in recent years. In this framework, Bogomolov et al. conducted an experiment to find out whether daily stress levels could be predicted from non-invasive sensors, including mobile phone data [, ]. Using only one source of data resulted in poor predictive capacity. However, by combining mobile phone data with personality traits and weather conditions, they produced a predictive model using -dimensional feature vectors to classify users as 'stressed' or 'not-stressed', achieving % accuracy. Interestingly, among the features extracted from mobile phone data that were selected as useful for the model, many were Bluetooth proximity features.

Viral marketing
In , Katz and Lazarsfeld introduced the breakthrough idea that, more than mass media, the neighborhood of an individual is influencing their decisions []. This idea has induced the concept of opinion leaders, that is, persons who have a high influence on their neighborhood, although some debate exists on the exact role played by opinion leaders [], and introduced the concept of viral marketing. In opposition to direct marketing, the principle of viral marketing is that consumers respond better to information accessed from a friend than to information provided through direct means of communication. Viral marketing searches thus for means of making people communicate about a brand, in order to push friends of an early adopter to adopt the product in their turn. In particular, mobile viral marketing has proved to be an effective means of propagation of such marketing campaigns. The influence of one's neighbors can be observed using CDR data coupled to data on product adoption. In a study of the adoption of  mobile services, Szabó and Barabási [] showed that the adoption of a product by a user was highly correlated to the adoption of their neighbors for some services only, while other services were not showing any viral attribute. A similar study by Hill et al. [] on the adoption of an undisclosed technological service showed again that neighbors of nodes that had adopted the service were  to  times more likely to adopt the service than the best-practice selection of the company's marketing service. A related result was also obtained in the FunF project by Aharony et al. [], who showed that the number of common installed applications was significantly larger for pairs of users having often physical encounters. Risselada et al. [] further showed that the influence of one's neighbors on the adoption of a product evolved with time, depending on the elapsed time since the introduction of the product on the market.
Even though one could use a simple SI or SIR model to characterize viral marketing, it is more likely in this case, that a user will adopt a product if several of its neighbors have already adopted it and the information comes from several different sources. One of the possible ways to model these dynamics is to use a threshold model: each user is assigned a threshold. A node will adopt a product if the proportion of its neighbors that have adopted the product is above the node's threshold. The model can be either deterministic, and decide a priori a same threshold for all nodes, or stochastic and draw thresholds from a probability distribution. To take into account the timing of contacts between people, one can then add to this model the condition that a node will adopt a product if it has enough contacts with different neighbors that have adopted the product within a given time frame. Backlund et al. have studied the effect of timings of call sequences on those models []. Here again, they observe that the burstiness of events tends to hinder propagation of adoption of a product, increasing the waiting times between contacts compared to a randomized sequence of contacts.
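The static version of the threshold model described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies: the graph is given as a symmetric adjacency dictionary, updates are synchronous, and a node adopts when the adopting fraction of its neighbors is at or above its threshold. The function name and the toy network are hypothetical.

```python
def run_threshold_model(neighbors, thresholds, seeds, max_rounds=100):
    """Synchronous threshold model of product adoption.

    neighbors  : dict mapping each node to a list of its neighbors
    thresholds : dict mapping each node to a threshold in [0, 1]
    seeds      : iterable of initial adopters
    Returns the set of adopters once no further node adopts.
    """
    adopted = set(seeds)
    for _ in range(max_rounds):
        newly_adopted = set()
        for node, nbrs in neighbors.items():
            if node in adopted or not nbrs:
                continue
            # Fraction of this node's neighbors that have already adopted.
            fraction = sum(1 for n in nbrs if n in adopted) / len(nbrs)
            # Assumption: "at or above" the threshold triggers adoption.
            if fraction >= thresholds[node]:
                newly_adopted.add(node)
        if not newly_adopted:
            break
        adopted |= newly_adopted
    return adopted


if __name__ == "__main__":
    # Hypothetical toy contact network.
    graph = {
        "a": ["b", "c"],
        "b": ["a", "c"],
        "c": ["a", "b", "d"],
        "d": ["c", "e"],
        "e": ["d"],
    }
    uniform = {node: 0.5 for node in graph}  # deterministic variant
    print(sorted(run_threshold_model(graph, uniform, seeds={"a"})))
```

The stochastic variant mentioned in the text would only change how `thresholds` is built, for example by drawing each value from a chosen probability distribution; the temporal variant would additionally need time-stamped contacts.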
The identification of 'good' spreaders for a viral marketing campaign is difficult, especially given the usually very large size of the datasets, which makes it hard to extract useful information in a short time. With this in mind, the authors of [] proposed a local definition of social leaders, nodes that are expected to play an influential role in their neighborhood. They defined the social degree of a node as the number of triangles in which the node participates, and social leaders as nodes that have a higher social degree than their neighbors. This definition is useful in marketing campaigns for identifying the customers who should be contacted to start the campaign, an approach that proved to be efficient []. Moreover, social leaders can also be used to reduce the complexity of a network, by analyzing only the network of social leaders instead of the whole network, with possible uses in visualization and community detection.
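Because this definition is purely local, it can be computed with simple set operations. The sketch below assumes an undirected graph stored as a symmetric adjacency dictionary and interprets "higher social degree than their neighbors" as strictly higher than every neighbor; both the interpretation and the function names are assumptions, not details taken from the cited work.

```python
def social_degrees(neighbors):
    """Number of triangles in which each node participates."""
    nbr_sets = {node: set(nbrs) for node, nbrs in neighbors.items()}
    degrees = {}
    for node, nbrs in nbr_sets.items():
        # For each neighbor v, count the neighbors shared by node and v.
        # Every triangle (node, v, w) is found twice: via v and via w.
        shared = sum(len(nbrs & nbr_sets[v]) for v in nbrs)
        degrees[node] = shared // 2
    return degrees


def social_leaders(neighbors):
    """Nodes whose social degree is strictly higher than all their neighbors'."""
    degrees = social_degrees(neighbors)
    return {
        node
        for node, nbrs in neighbors.items()
        if nbrs and all(degrees[node] > degrees[v] for v in nbrs)
    }


if __name__ == "__main__":
    # Hypothetical toy network: "h" closes two triangles, "e" hangs off "d".
    graph = {
        "h": ["a", "b", "c", "d"],
        "a": ["h", "b"],
        "b": ["h", "a"],
        "c": ["h", "d"],
        "d": ["h", "c", "e"],
        "e": ["d"],
    }
    print(social_degrees(graph))
    print(social_leaders(graph))
```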

Crime detection
In criminal investigations, the police often request the mobile phone records of suspected individuals for inspection, looking for evidence. The analysis of such data can not only reveal behavioral patterns of a single suspect, but also uncover potential criminal organizations through mobile relationships. Social network analysis therefore makes it possible to uncover the structure of criminal networks, and also to quantify the flow of information between their members. In this framework, a research group from the University of Messina proposes a toolbox called 'LogAnalysis' to analyze CDRs and the associated social networks, with the aim of detecting criminal organizations [, ]. This toolbox measures a series of metrics of the network and of its nodes, such as node centrality or clustering coefficient, and further presents an analysis of the dynamics of the graph. The authors add visualization tools to the analysis, enabling forensic analysts to easily spot nodes that are more central, or to visualize clusters and sub-clusters of tightly related individuals.
The approach of this type of research is somewhat different from most studies presented in the above paragraphs on CDR datasets, in that it is not based on studying anonymized datasets and extracting information on the behavior of a population, but rather on studying the network around a specific individual or a specific group of suspects whose identity is clearly known by the forensic analyst carrying out the investigation.
Using a different approach, Bogomolov et al. use indicators derived from mobile phone traces to predict whether a certain area will be a crime hot spot in the next month []. Using dynamically updated features such as the estimated number of people in each area, or the age, gender and work/home/visitor group splits derived from mobile phone data, their model achieves almost % accuracy in predicting whether a given area is at risk of being the scene of a crime in the next month. This type of research can therefore be used by the police to achieve a better response time, or to direct their attention towards the places that are most likely to require an intervention.

Data for development
The last couple of years have seen a spectacular rise of interest in applications of mobile data for development purposes. Many contributions to the 'Data for Development' (or D4D) challenge launched by Orange [] used different pieces of information from the data of mobile phone users to help the development of Ivory Coast. Several of these contributions have already been reviewed in the previous paragraphs; for the full set of research projects, see [].
While in the developed world, much of the information that can be inferred from mobile phone data is already known (population density, some of the mobility traces, ...), this information can be very valuable in the developing world, where census data is often unavailable or several years old. Modeling the mobility of people in developing countries can provide very useful information for local governments when making decisions regarding changes in local transportation networks, or urban planning. Indeed, in rural areas of low-income countries where the most recent technologies are not always available, up-to-date information on how many people commute from one place to another can be very useful and help policy makers decide on the next steps towards development. Sometimes, even very basic tasks such as drawing the road network can be difficult in remote places. Salnikov et al. used the D4D challenge dataset to detect high-traffic roads by selecting only displacements within a certain range of velocities []. They were able to redraw the main road structure of the country and even identified unknown roads, which they validated a posteriori. Between techniques for cheap census, mobility planning and fighting infectious diseases, we expect that in the next few years, the developing world will profit from the availability of such rich databases, and that research will provide useful insights into how to better help towards development.
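The velocity-based selection of displacements mentioned above can be illustrated with a short sketch. It assumes that each user's trace is a time-ordered list of (timestamp in seconds, latitude, longitude) tuples, for instance antenna positions; the speed bounds in the example are hypothetical and are not the values used by Salnikov et al.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def displacements_in_speed_range(trace, v_min_kmh, v_max_kmh):
    """Keep consecutive displacements whose average speed lies in [v_min, v_max].

    trace : list of (timestamp_seconds, latitude, longitude), sorted by time
    Returns a list of ((lat0, lon0), (lat1, lon1), speed_kmh) tuples.
    """
    kept = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(trace, trace[1:]):
        hours = (t1 - t0) / 3600.0
        if hours <= 0:
            continue  # skip simultaneous or out-of-order records
        speed = haversine_km(lat0, lon0, lat1, lon1) / hours
        if v_min_kmh <= speed <= v_max_kmh:
            kept.append(((lat0, lon0), (lat1, lon1), speed))
    return kept


if __name__ == "__main__":
    # Hypothetical trace: one plausible road trip, one stop, one implausible jump.
    trace = [(0, 0.0, 0.0), (3600, 0.0, 0.5), (7200, 0.0, 0.5), (7260, 0.0, 1.5)]
    for start, end, speed in displacements_in_speed_range(trace, 20.0, 150.0):
        print(start, "->", end, f"{speed:.1f} km/h")
```

Aggregating the retained origin and destination pairs over many users would then highlight the segments with the heaviest vehicle-like traffic.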

Data representativity
Finally, one may raise the question of the significance of the data: given that only a fraction of a country's population is reached by one operator, to what extent may the results obtained on a dataset be generalized to larger populations? Clearly, quantitative results obtained in these studies, such as the degrees of nodes, cannot be taken for granted, but one may expect that as long as the population sample is not biased, qualitative observations such as the broadness of the degree distribution or the organization of nodes in communities provide significant information on the structure of communication networks. However, the question of whether the sample is biased or not is almost impossible to answer, especially given the lack of information about the users in CDR databases.
Frias-Martinez et al. raised this question in [], regarding e.g. the socio-economic level, which could be biased among mobile phone users compared to the whole population. They validate their results by performing a series of statistical tests to compare the population in their sample to the overall population using census data, and show that no significant difference was observed. However, in the general case, data about users in CDR databases is often missing, and census data may not always be available for comparison. Regarding mobility models, one could argue that active mobile phone users are more likely to be on the move than the rest of the population. A mobility model based on mobile phone users is therefore likely to overestimate the number of people within a population who are traveling. Buckee et al. raised this question regarding those models, further arguing that bias in models of mobility could, in turn, influence the spreading of modeled epidemics []. Onnela et al. also address this problem by studying how paths differ depending on how much of the network is observed []. They show that, counterintuitively, paths in partially observed networks may appear shorter than they actually are in the underlying full network.
Ranjan et al. studied a related question regarding the mobility of users []: given that one only sees data points where and when a user has made a phone call, to what extent are these points representative of a user's mobility? They found that sampling only the voice calls of an individual will most of the time do well at uncovering locations such as home and work, but will also, in some cases, introduce biases in the observed spatio-temporal behavior of the user. In a recent study, Stopczynski et al. widen their coverage by coupling databases from many sources on the same set of users in the context of the Sensible DTU project []. While this approach clearly captures more than mobile phone records alone, its coverage is limited (, subjects), as the users had to give their explicit consent to share their data: Facebook interactions, face-to-face encounters, and answers to a survey. The authors are therefore able to analyze a bigger picture than other studies based only on mobile phone data, and show that studying mobile phone data alone may not be enough to capture a user's comprehensive profile. Learning from these studies, one should therefore be cautious when drawing conclusions from such analyses, and keep in mind that observing the traces left by mobile phones is only observing selected parts of the whole picture.

Privacy issues
The collection and availability of personal behavioral data such as phone calls or mobility patterns raises evident questions about the security of users' privacy. The content of phone calls or text messages is not recorded, but even the simple knowledge of communication patterns between individuals or of their mobility traces contains highly personal information that one typically does not want to be disclosed. During the past decade, a fairly large amount of personal data was made available to researchers via, among others, CDR datasets. The companies sharing their data do not always know how much personal information can be inferred from the analysis of such large datasets, and this has led, so far in cases other than mobile phone data, to a few scandals in recent years [, ]. In turn, these incidents led, in , to a procedure of adaptation of legal measures in Europe []: the previous European law on the protection of privacy and data sharing dated back to  [], long before the era of what is now called 'Big Data'.
Moreover, the use of data has become global, and an organization based in a specific country uses data generated by its users from all over the world, hence the need for similar regulations in different countries. So far, this has not been achieved, as legislation differs widely from one part of the world to another. In the USA, for example, there is no single law regulating data protection and privacy; instead, laws are defined on a sector-by-sector basis. Data protection in the finance and health sectors is therefore regulated by separate authorities []. In Europe, on the contrary, the new directive is designed to apply everywhere in Europe, to the people and organizations who collect and manage personal data [].
The procedure often used when a company shares private data with a third party such as a research group is the following. The company keeps on secured machines the exact private information about their customers, such as names, addresses or phone numbers, as well as the CDRs, which contain the phone number of the caller, the phone number of the callee, the time stamp of the call, the tower to which the caller was connected, likewise for the callee, and additional information such as special service usage. The anonymization procedure then consists of replacing each phone number by a randomly generated number, such that each user has a unique random ID from which it is impossible to retrieve the original phone number by reverse engineering. The CDRs are then modified such that phone numbers are replaced by the corresponding IDs. After this procedure, the CDRs are anonymized and can be transferred to a third party. The standard procedure then requires the third party to sign a non-disclosure agreement stipulating that they cannot make the CDR data available; the agreement usually also restricts the range of potential research questions to be explored with the data. The safety of users' privacy is then guaranteed both by the removal of information that would allow users to be identified and by the assumption that the third party does not use the data with any malicious intent.
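The identifier-replacement step of this procedure can be sketched as follows. This is a minimal illustration, not an operator's actual pipeline: the record layout (dictionaries with hypothetical `caller` and `callee` fields) is assumed, and the identifiers are drawn at random with Python's `secrets` module rather than derived from the phone number, so that they cannot be inverted without the mapping table, which stays on the operator's secured machines.

```python
import secrets


def anonymize_cdrs(records):
    """Replace phone numbers in CDRs by random, unique identifiers.

    records : iterable of dicts with at least 'caller' and 'callee' keys
              (hypothetical field names); other fields are copied unchanged.
    Returns (anonymized_records, mapping); the mapping must never be released.
    """
    mapping = {}
    used_ids = set()

    def pseudonym(number):
        if number not in mapping:
            new_id = secrets.token_hex(8)  # 64 random bits, unrelated to the number
            while new_id in used_ids:      # enforce uniqueness explicitly
                new_id = secrets.token_hex(8)
            used_ids.add(new_id)
            mapping[number] = new_id
        return mapping[number]

    anonymized = []
    for record in records:
        out = dict(record)
        out["caller"] = pseudonym(record["caller"])
        out["callee"] = pseudonym(record["callee"])
        anonymized.append(out)
    return anonymized, mapping


if __name__ == "__main__":
    # Fictitious records for illustration only.
    cdrs = [
        {"caller": "555-0001", "callee": "555-0002", "time": 1000, "tower": "T17"},
        {"caller": "555-0002", "callee": "555-0003", "time": 1060, "tower": "T04"},
    ]
    released, _secret_mapping = anonymize_cdrs(cdrs)
    for row in released:
        print(row)
```

Note that the time stamps and towers are left untouched, which is precisely what the de-anonymization attacks discussed next exploit.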

De-anonymization attacks
Relatively little research has been produced on mobile phone datasets to challenge this apparent feeling of security; recent results, however, are opening new ways of considering the privacy problem. Using CDR data containing mobility traces, Zang and Bolot [] show how it is possible to uniquely identify a large fraction of users from a small number of preferred locations. Their methodology goes as follows: for each user, one lists the top N locations at which calls have been recorded. The authors then show that, depending on the granularity of the locations, a non-negligible fraction of users may be uniquely identified by only  locations. For example, if locations are taken at cell level, up to % of the users of a  million communication network can be uniquely identified with  locations, which are likely to correspond to home and work. Thus, while the anonymization procedure is intended to prevent any linkage between the dataset and individuals, this approach makes it possible to retrieve the mobility and calling patterns of targeted users given access to as little information as home and work addresses. If additional data such as the year of birth or gender of users were available, which is common in most datasets, it would be possible to identify very large fractions of the network. However, in this attack scheme, one has to know the profile of a user quite well in order to find them in the database. Using a different approach, de Montjoye et al. [] show that knowing only four points in space and time where a user was present makes it possible to uniquely re-identify the user with % probability. Using only very little information that could easily be available to an attacker, the authors thus show how unique each user's trajectory is. They further show that blurring the resolution of space or time does not much reduce the information needed to re-identify a user in the database, which leaves the database very vulnerable to this type of de-anonymization attack.
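The top-N location analysis described above can be reproduced on any pseudonymized trace with a few lines of code. The sketch below assumes the input is a list of (user ID, location) pairs, one per recorded call, and treats the ranked list of a user's N most frequent locations as their signature; the ordering convention, the tie-breaking rule and the function names are assumptions made for illustration.

```python
from collections import Counter


def top_n_locations(calls, n):
    """Map each user to the ordered tuple of their n most frequent locations.

    calls : iterable of (user_id, location) pairs, one per recorded call
    Ties are broken alphabetically so that the result is deterministic.
    """
    per_user = {}
    for user, location in calls:
        per_user.setdefault(user, Counter())[location] += 1
    signatures = {}
    for user, counts in per_user.items():
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
        signatures[user] = tuple(loc for loc, _ in ranked[:n])
    return signatures


def fraction_unique(signatures):
    """Fraction of users whose signature is shared with no other user."""
    if not signatures:
        return 0.0
    counts = Counter(signatures.values())
    unique = sum(1 for sig in signatures.values() if counts[sig] == 1)
    return unique / len(signatures)


if __name__ == "__main__":
    # Fictitious calls: user IDs are pseudonyms, locations are cell names.
    calls = (
        [("u1", "A")] * 3 + [("u1", "B")] * 2
        + [("u2", "A")] * 3 + [("u2", "B")]
        + [("u3", "A")] * 2 + [("u3", "C")] * 2
        + [("u4", "B")] * 3 + [("u4", "A")]
    )
    for n in (1, 2):
        print(n, fraction_unique(top_n_locations(calls, n)))
```

Even in this toy example, the fraction of uniquely identifiable users grows as soon as a second location is added to the signature.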
Other possible attacks have also been considered on anonymized online social networks. Although those attacks are not likely to be applied in the case of mobile phone data, we quickly mention some of them, as it is likely that breaches found in different applications might be similar to potential breaches in mobile phone datasets.
For example, Backstrom et al. [] describe a family of local attacks that make it possible to retrieve the position of some targets in the network, and hence to uncover the connections between those targets. The authors showed that on a network of . million nodes, by controlling the links of  dummy nodes, they manage to uncover the presence or absence of , links between  target nodes, without being detected by the database manager. On a wider scale, Narayanan and Shmatikov [] show that it is possible to retrieve the identity of a large part of a social network by combining it with an auxiliary network. Such a situation arises when users are present in two separate datasets. The authors then show that even if this overlap is available for only a fraction of the users, it is still possible to retrieve the information for a large part of the network. Add to this that other types of databases may be available (for example Twitter or Facebook in addition to CDRs), and the possibility of de-anonymization is even greater. Such overlaps have already led to problematic situations, where specific people were re-identified in supposedly anonymized medical records or movie preference databases [, , ]. Indeed, two separate databases coming from different sources may be anonymized and safe to release separately, but can still present a great danger for privacy if an attacker combines and cross-references the information contained in both.
Given these possible threats of privacy breach, one may wonder whether solutions have been proposed to counter such attacks. If research on mobile datasets only considers average behaviors rather than exact patterns, a simple countermeasure is to perform small modifications of the dataset that would not alter its general aspect but would have dramatic consequences for the algorithms used by attackers, which search for exact matches between statistics on the network and a priori known properties of the targets.
Another protection against such attacks, particularly when mobility data is involved, is to produce new random identifiers for each user at regular time intervals. Regenerating random identifiers makes it impossible to use longitudinal information to assess the preferred locations of a user. As shown by Zang and Bolot [], by changing the ID of each user every day, only % of the nodes can still be identified using their top  locations. While this method seems efficient at protecting the privacy of users, it substantially reduces the information that can be retrieved from such a dataset for research purposes. A similar approach also proved useful against the attack scheme considered by de Montjoye et al., as Song et al. show in [] that changing the ID of each user every six hours substantially reduces the fraction of unique trajectories in the dataset. A compromise between preserving anonymity and keeping enough information in the dataset is difficult to achieve. In collaboration with the Université catholique de Louvain, the provider Orange tried to achieve this for their first D4D challenge before releasing a dataset to a wide community of researchers (more than  research teams participated). By releasing four different datasets anonymized differently [] and containing information at different spatio-temporal resolutions, they could guarantee the preservation of the anonymity of users. Yet, the loss of information was not too dramatic, as many studies showed very good results using the provided aggregated information. The challenge was a success and a second one followed in -, using a wider dataset from Senegal []. Other similar initiatives include the Telecom Italia Big Data Challenges  and  [], whose goal is to show the variety of applications that can be derived from the use of Big Data, including mobile phone data, but also weather, Twitter, public transport and energy data. These data are aggregated and anonymized and are therefore made openly available online to all who wish to analyze them [].
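The periodic regeneration of identifiers discussed at the start of this paragraph can be sketched as a small variation on the usual pseudonymization step: the random ID is tied to a (phone number, time window) pair rather than to the phone number alone. The record layout, the function name and the choice of period below are assumptions for illustration; the sketch also relies on 64-bit random tokens being collision-free in practice rather than checking uniqueness explicitly.

```python
import secrets


def rotate_pseudonyms(records, period_seconds):
    """Assign each user a fresh random identifier in every time window.

    records        : iterable of (phone_number, timestamp_seconds, tower) tuples
    period_seconds : length of the window after which identifiers are regenerated
    Returns a list of (pseudonym, timestamp_seconds, tower) tuples.
    """
    mapping = {}  # (phone_number, window index) -> random identifier
    released = []
    for number, timestamp, tower in records:
        window = int(timestamp // period_seconds)
        key = (number, window)
        if key not in mapping:
            mapping[key] = secrets.token_hex(8)
        released.append((mapping[key], timestamp, tower))
    return released


if __name__ == "__main__":
    day = 24 * 3600
    # Fictitious records: the same user calls twice on day 0 and once on day 1.
    records = [("555-0001", 100, "T17"), ("555-0001", 5000, "T04"), ("555-0001", day + 100, "T17")]
    for row in rotate_pseudonyms(records, period_seconds=day):
        print(row)
```

Within a window the user's records remain linkable, but records from different windows can no longer be joined, which is what limits both the top-location attack and longitudinal research uses.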
Another question closely linked to this research is how to quantify the anonymity of a database. Latanya Sweeney proposed the measure of k-anonymity []: a database achieves k-anonymity if, for any tuple of values of a previously defined set of entries of the database, at least k users correspond to it, so that a single user cannot be re-identified using only these entries. Of course, the larger k is, the more difficult it becomes to achieve this, especially in a CDR database containing spatio-temporal information about each call. Moreover, when an attacker is looking for a particular person in the database, narrowing the set of potential matching users down to a small number is sometimes already a lot of information, and too big a risk for the database to be released publicly.
Another potential solution for preserving privacy was suggested by Isaacman et al. [], who propose using synthetic data to model the mobility of people. They used mobile phone data from two American cities to validate their model, showing that the model, based only on aggregated data and probability distributions, could reproduce many features of user mobility without any synthetic user corresponding to a real person. Mir et al. further proposed an evolved version of this model, called DP-WHERE [], which adds controlled noise to the set of empirical probability distributions. This noise guarantees that the model achieves differential privacy, that is, that the analyses will not be significantly different whether or not a single individual is in the database from which the model is derived, even if this individual has unusual behavior. However, one may wonder whether these synthetic data could be used to carry out analyses that were not previously tested on the real database, as no guarantee exists on the outcome of analyses that were not foreseen by the researchers who tested the model for compatibility with empirical data.
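As an illustration of the k-anonymity criterion recalled above, the following sketch computes the largest k for which a toy set of records is k-anonymous with respect to a chosen set of entries. The record layout (fields `user`, `antenna` and `hour`) and the function name are hypothetical and chosen only for the example.

```python
from collections import defaultdict


def k_anonymity(records, quasi_identifiers):
    """Smallest number of distinct users sharing one tuple of quasi-identifier values."""
    users_per_tuple = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        users_per_tuple[key].add(r["user"])
    if not users_per_tuple:
        return 0
    return min(len(users) for users in users_per_tuple.values())


# Toy records: which user was seen at which antenna during which hour.
RECORDS = [
    {"user": "a", "antenna": 1, "hour": 8},
    {"user": "b", "antenna": 1, "hour": 8},
    {"user": "c", "antenna": 2, "hour": 8},
    {"user": "a", "antenna": 2, "hour": 9},
    {"user": "c", "antenna": 2, "hour": 9},
]

print(k_anonymity(RECORDS, ("antenna", "hour")))  # 1: user c is alone at (2, 8)
print(k_anonymity(RECORDS, ("hour",)))            # 2: dropping the antenna raises k
```

The second call shows the trade-off mentioned in the text: coarsening the released information (here, dropping the spatial entry) increases k but removes detail that researchers may need.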

Personal data: ownership, usage, privacy
Phone companies collect data about their users: their habits, their mobility and their acquaintances. Still, the legislation up to  was fuzzy [], discouraging companies from sharing such data for research and making customers feel that George Orwell's predictions were coming true, especially after the scandal in  revealing how much personal information the NSA was collecting from many sources [].
Such data represent an enormous added value, both to companies, for marketing purposes and client screening, and to authorities, for traffic management or epidemic outbreak prevention. It is often forgotten, but the use of mobile phone datasets also has a huge positive potential in the developing world, as many of the projects submitted to the Data for Development challenge showed [], be it for monitoring the health status of populations, generating census data or optimizing public transport. Ironically, even though the research community has shown that such data have the potential to save lives and that using them is technically possible, it is still often difficult to access the data because of privacy concerns, even when the data are aggregated and non-disclosure agreements are signed.
To seize such opportunities, both companies and authorities need to develop standardized procedures for the acquisition, conservation and usage of personal data, which is not yet the case. The communication of these procedures to customers has not been clear, nor have the possibilities for users to 'opt out' if they do not want their personal data released.
With this intent, several voices have recently been raised to urge authorities to develop a 'New Deal' [] on data ownership, in which users would own their personal data and decide whether to provide them, in exchange for payment, to companies interested in their usage. A transparent system, equipped with the necessary protocols and regulation for the use of personal data, would also facilitate access to data for researchers [], and could thus benefit society as a whole.

Conclusion and research questions
The first analyses of mobile phone datasets appeared in the late 's, and the result of this decade of research contains a large number of surprises and several promising directions for the future. In this paper, we have reviewed the most prominent results obtained so far, in particular in the analysis of the structure of our social networks, and human mobility. We decided not to cover some closely related questions, such as churn prediction (see [-]) or dynamic pricing [, ], which are rather business-related topics, and for which a vast literature is available.
The recent availability of mobile phone datasets has led to many discoveries on human behavior. We are not all similar in our ways of communicating, and differences between users can span several orders of magnitude. Our networks are clustered in well-structured groups, which are spatially well-located. With the rise of communication technology, some have predicted that the barrier of distance would fall, shrinking the world into a small village. However, mobile phone data suggest instead that distance still plays a role, but that its impact is nuanced by the varying population density. Regarding our mobility behavior, individuals appear to have highly predictable movements [], while as populations we act and react in a remarkably synchronized way. In this context, the availability of mobile phone data has for the first time made it possible to observe populations from a God's-eye point of view, monitoring the pace of daily life or the response to catastrophes.
The ubiquity of mobile phones (there are nowadays more mobile phones than personal computers in use), which allows us to obtain such precise results, also raises the threat of viral outbreaks, from which mobile phones have been safe until now. Mobile viruses could pose a risk to users' privacy, just as the anonymized datasets provided by operators to third parties for research could potentially be de-anonymized.
The availability of such enormous datasets creates a huge potential that could benefit society, up to the point of saving lives. The research that has been conducted so far only represents the tip of the iceberg of what could be done if these data are adequately exploited. However, it is the responsibility of authorities to ensure that such datasets cannot be misused.

Further research
The number of possible research questions on mobile phone datasets is gigantic. In this last part, we will present one research direction that we believe to be highly important and still not addressed in its most general form.
A large body of research has been conducted on the analysis of social networks based on CDRs. As appears from the different publications on this topic, there exist some common features but also many differences in the structure of the constructed networks. Recall, as the simplest example, the degree distributions, which show different functional forms for most datasets.
These differences may, of course, be linked to cultural differences between the countries of interest, but there are probably other, quantifiable, reasons. The datasets differ greatly in the market shares of the operators, in the time span of the data collection period, in the size of the network and in the geographical span of the considered country. The method of network construction also differs from one study to the next and has a tangible impact on the network structure. The use of directed or undirected links, of weights, and of thresholds for removing low-intensity or non-mutual links all greatly affect the structure and hence the statistical features of the obtained network.
Hence, we believe that a serious analysis, both theoretical and empirical, of the influence of these factors on the general structure of mobile phone networks may lead to a general framework that makes it possible to interpret differences between results obtained on several datasets with knowledge of the potential side-effects.
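To make the impact of the construction choices listed above concrete, the following sketch builds an undirected weighted network from a toy list of calls under different filtering rules and compares the resulting degrees. The function names, the parameters `min_calls` and `reciprocal_only`, and the toy data are assumptions made for the example; they do not correspond to the procedure of any particular study.

```python
from collections import Counter


def build_network(calls, min_calls=1, reciprocal_only=False):
    """Undirected weighted edges from (caller, callee) pairs.

    The weight of an edge is the number of calls in both directions.
    """
    directed = Counter((u, v) for u, v in calls if u != v)
    edges = {}
    for (u, v), w in directed.items():
        back = directed.get((v, u), 0)
        if reciprocal_only and back == 0:
            continue  # drop links that were never reciprocated
        total = w + back
        if total >= min_calls:  # drop low-intensity links
            edges[tuple(sorted((u, v)))] = total
    return edges


def degrees(edges):
    """Number of neighbours of each node that has at least one edge."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


CALLS = [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c"),
         ("d", "a"), ("b", "c"), ("c", "b")]

print(degrees(build_network(CALLS)))
print(degrees(build_network(CALLS, reciprocal_only=True)))
print(degrees(build_network(CALLS, min_calls=3)))
```

On this toy input, node a has the highest degree when all links are kept, but not once non-reciprocated links are removed, which illustrates how a single construction rule can change which nodes appear most connected.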
This question is closely related to the even more general question of the significance of the information provided by CDR data. Recalling what was said in Section , CDR datasets are noisy: some links appear there by chance, while others have not been captured in the dataset. It would thus be interesting to question the stability of the obtained results, given that the real network is different from what has been observed in the data. This links with the work of Gourab [], who analyzed the stability of PageRank under random noise on the network structure. Again, in this framework, no real theoretical result has yet been achieved that would make it possible to characterize which results are significant and which are not.
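An empirical version of this stability question can be sketched as follows: perturb the observed network by dropping some links and adding spurious ones, then check how much of a result survives. The sketch uses the set of highest-degree nodes as the result under test, which is a deliberately simpler statistic than the PageRank studied in the cited work; the noise model and all names are assumptions made for the example.

```python
import random
from collections import Counter


def perturb(edges, nodes, p_drop, n_spurious, rng):
    """Drop each edge with probability p_drop, then add n_spurious random edges."""
    noisy = {e for e in edges if rng.random() >= p_drop}
    node_list = sorted(nodes)
    for _ in range(n_spurious):
        u, v = rng.sample(node_list, 2)  # two distinct endpoints
        noisy.add(tuple(sorted((u, v))))
    return noisy


def top_k(edges, k):
    """The k nodes of highest degree, ties broken by node label."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    ranked = sorted(deg, key=lambda n: (-deg[n], n))
    return set(ranked[:k])


def top_k_overlap(edges, noisy, k):
    """Fraction of the original top-k nodes that remain in the noisy top-k."""
    return len(top_k(edges, k) & top_k(noisy, k)) / k


EDGES = {(0, 1), (0, 2), (0, 3), (1, 2)}
NODES = {0, 1, 2, 3}
rng = random.Random(0)  # fixed seed so that runs are repeatable

noisy = perturb(EDGES, NODES, p_drop=0.25, n_spurious=1, rng=rng)
print(top_k_overlap(EDGES, noisy, k=2))
```

Repeating the perturbation many times and for increasing noise levels would give an empirical stability curve for the chosen statistic; it would not replace the theoretical characterization that the text identifies as missing.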