Internal migration and mobile communication patterns among pairs with strong ties

Using large-scale call detail records of anonymised mobile phone service subscribers with demographic and location information, we investigate how a long-distance residential move within the country affects the mobile communication patterns between an ego who moved and a frequently called alter who did not move. By using clustering methods in analysing the call frequency time series, we find that such ego-alter pairs are grouped into two clusters, those with the call frequency increasing and those with the call frequency decreasing after the move of the ego. This indicates that such residential moves are correlated with a change in the communication pattern soon after moving. We find that the pre-move calling behaviour is a relevant predictor for the post-move calling behaviour. While demographic and location information can help in predicting whether the call frequency will rise or decay, they are not relevant in predicting the actual call frequency volume. We also note that at four months after the move, most of these close pairs maintain contact, even if the call frequency is decreased.


Introduction
In recent years, the availability of large-scale mobile phone call detail records with location information has allowed researchers to study human mobility and communication with fine spatiotemporal resolution, large sample sizes, and high accuracy [1,2]. This has generated a lot of interest in both the individual mobility patterns in relation to activity spaces [3,4,5] as well as in general patterns of human mobility [6,7,8].
Most current research on human mobility that uses the mobile phone traces of individuals focuses on short-term daily mobility. The effects of moving to another location for a longer term have not been studied as much, perhaps due to the lack of mobile phone datasets that allow individuals' locations to be tracked over long periods of time. These residential moves, referred to as migration or residential mobility in the sociological literature [9,10], have conventionally been studied using surveys and censuses. In contrast to these traditional approaches, mobile phone call detail records do not suffer from recall bias, show more subtle migration patterns [11], and give migration estimates as accurate as those obtained from census data [12].

arXiv:2009.00252v2 [cs.SI] 3 Sep 2020
Although analyses of migration flows usually focus on individuals, migration also carries a relational aspect [13], and has been associated with changes in the individual's social network [14,15,16]. These changes have traditionally been studied using surveys [9,17], but the availability of mobile phone call detail records has broadened the research scope to such issues as the persistence of ties in a mover's social network [18,19]. These studies found that the geographic distance between an ego and an alter has a diminishing effect on their mobile phone communication. Moreover, it has been shown that, on the basis of communication volumes between ego-alter pairs, stronger ego-alter ties are more likely to persist after migration [20].
In this study we are interested in the evolution of strong ties between movers and their non-mover contacts in the case of long-distance residential moves. In particular, we focus on how the mobile communication frequency changes after the move and how this relates to the demographics of both the mover ego and the non-mover alter. We use a dataset containing the call detail records of over a million mobile phone service subscribers, leveraging the fine spatiotemporal resolution, completeness of mobile communication network information, and large sample sizes that are typically lacking in surveys. We begin by inferring the home locations of the users and identifying the strong ego-alter ties in the mobile communication network. We then investigate both the number of calls and the fraction of calls dedicated by the mover ego to the non-mover alter, and use clustering methods to get insight into changes in the calling patterns between the ego and alter. Finally we study whether the change in the calling patterns can be predicted by pre-move calling behaviour, demographics, and location information.

Methodology
The dataset
The dataset we use contains anonymised call detail records (CDRs) from a European country collected over a two-year period from January 2008 to December 2009. The CDRs contain the following information: origin (ego) user ID, destination (alter) user ID, type of interaction (either call or SMS), date and time of interaction, an incoming/outgoing marker, and the origin user's cell tower ID. In a separate file, we also have a list of over 1.5 million user IDs and their associated ages and genders in 2008 which were obtained from the users' contract information. In addition, we have yet another file containing cell tower IDs and their geographical latitudes and longitudes. Note that not all the user IDs and cell tower IDs appear both in the CDRs and in the latter two files.
Identifying the movers
Let us first consider the choice of spatial and temporal resolution for the analysis. Here we will focus on individual users' long-distance residential moves at the province level, which is the administrative level above the city or municipality level. This is based on the notion that although high-volume and affordable commuting transportation modes like the metro often extend from major cities to neighbouring towns, they rarely extend from one province to another. We also choose the temporal resolution to be one month, as this resulted in reduced noise but still contained enough information to infer the home location of the user. Further, the monthly resolution allows us to see long-distance moves that were not abrupt (e.g., a person may repeatedly travel to and from the old and new home locations before moving to look for apartments, or after moving to transport furniture), which may be missed if we look for moves at finer temporal resolutions.
In order to find the home location of the user, we use Nominatim [21] and the Python module geopy to find the address corresponding to the cell tower ID of every outgoing call and SMS of the user. If a cell tower location is unknown, the address is marked NaN. For each day, we take the most frequently recorded province among all entries including those marked as NaN. We call this the daily most common location at the province level. Although most studies use the night locations to infer the home locations [3,20,22], we do not impose such a time limit: since inter-province transportation is not as cheap and accessible as intra-province transportation, we do not expect regular differences in day or night locations at the province level. Relaxing the time constraint allows us to take advantage of some location data (albeit at a coarser spatial resolution), thus increasing the viable sample size. We then took the most common location among the daily most common locations during weekdays, when people are expected to be in their primary residence, and used this as the home location. In case of ties, one location is chosen at random. Each user will then have a home location vector with 24 entries, each corresponding to a month in the observation period.
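The two-stage aggregation described above (a daily most common province, then the most common of those over the weekdays of a month) can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the `"NaN"` string marker, and the toy records are assumptions, and the reverse geocoding step with Nominatim is taken as already done.

```python
import datetime as dt
import random
from collections import Counter

UNKNOWN = "NaN"  # assumed marker for a cell tower whose address is unknown


def most_common(values, rng):
    """Return the most frequent value; ties are broken at random."""
    counts = Counter(values)
    top = max(counts.values())
    winners = sorted(v for v, c in counts.items() if c == top)
    return rng.choice(winners)


def monthly_home_locations(records, seed=0):
    """records: iterable of (date, province) for one user's outgoing calls and SMS.

    Returns {(year, month): inferred home province}.
    """
    rng = random.Random(seed)

    # Stage 1: daily most common province (unknown entries are counted too).
    by_day = {}
    for day, province in records:
        by_day.setdefault(day, []).append(province)
    daily = {day: most_common(provs, rng) for day, provs in by_day.items()}

    # Stage 2: most common daily value over the weekdays of each month.
    by_month = {}
    for day, province in daily.items():
        if day.weekday() < 5:  # Monday to Friday
            by_month.setdefault((day.year, day.month), []).append(province)
    return {month: most_common(provs, rng) for month, provs in by_month.items()}


# Toy records: 7-9 January 2008 are weekdays, 12 January 2008 is a Saturday.
records = [
    (dt.date(2008, 1, 7), "A"), (dt.date(2008, 1, 7), "A"), (dt.date(2008, 1, 7), "B"),
    (dt.date(2008, 1, 8), "A"),
    (dt.date(2008, 1, 9), "B"), (dt.date(2008, 1, 9), "B"),
    (dt.date(2008, 1, 12), "B"), (dt.date(2008, 1, 12), "B"),
]
home = monthly_home_locations(records)
```

In the toy data the Saturday entries are ignored in the second stage, so province A (two weekdays) wins over province B (one weekday).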
Here we define "movers" as users who move from one home location, province 1, to another home location, province 2, only once within the 24-month period such that itineraries like (province 1 → province 2 → province 1) or (province 1 → province 2 → province 3) are not included in the analysis. Users with the run-length encoded home location vectors [(province 1, m), (province 2, 24 − m)] were considered movers, where the provinces are known and m is the number of months spent in province 1. Thus, m can be considered as the estimated moving month, where m ∈ [1,24] with m = 1 representing January 2008. Users with home locations given by [(province 1, 24)] are considered "non-movers", while those with other run-length encodings are considered to have unknown trajectories and are classified as neither movers nor non-movers. We also require that movers be in each home location for a certain number of months as described in the next subsection.
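As an illustration of the run-length encoding rule, the sketch below classifies a 24-entry home location vector. The function names and the `"NaN"` marker for an unknown province are hypothetical, and the additional minimum-stay requirement mentioned above is not applied here.

```python
from itertools import groupby


def run_length_encode(home_vector):
    """Collapse consecutive repeats into (location, number of months) pairs."""
    return [(loc, len(list(run))) for loc, run in groupby(home_vector)]


def classify_user(home_vector, unknown="NaN"):
    """Return (label, m), where m is the estimated moving month for movers."""
    rle = run_length_encode(home_vector)
    if any(loc == unknown for loc, _ in rle):
        return ("unknown", None)   # provinces must be known
    if len(rle) == 1:
        return ("non-mover", None)
    if len(rle) == 2:
        return ("mover", rle[0][1])  # m = months spent in province 1
    return ("unknown", None)       # e.g. A -> B -> A or A -> B -> C


mover = classify_user(["A"] * 10 + ["B"] * 14)
stayer = classify_user(["A"] * 24)
returner = classify_user(["A"] * 8 + ["B"] * 8 + ["A"] * 8)
```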

Identifying strong ties
We identify a mover ego's strong ties prior to the move by looking for both the frequency and regularity of mobile communication between the ego and its alters, similar to what we have done in our previous research on call detail records [23,24]. For each ego and each month, the alters are ranked based on the fraction of the calls made or received by the ego that are associated with the alter, which we consider as a proxy for the relative volume of mobile communication devoted by the ego to the alter. If $c_i$ is the number of calls between the ego and one of its $n$ alters $i$, the fraction of calls to the alter $i$ is given by $c_i / \sum_{j=1}^{n} c_j$. The alter $i$ with the highest fraction is given rank 1; alters with tied fractions are given the same rank. Note that in the ranking procedure, we do not yet take into account the demographic and location information of the alters.
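A minimal sketch of the monthly ranking follows. The text states only that tied fractions share a rank; the dense ranking used here (the next distinct fraction gets the next integer) is an assumption, as are the function and variable names.

```python
def rank_alters(call_counts):
    """call_counts: {alter_id: calls exchanged with the ego in one month}.

    Returns (fractions, ranks). Ranking is done on the counts, which is
    equivalent to ranking on the fractions because they share a denominator.
    """
    total = sum(call_counts.values())
    if total == 0:
        return {}, {}
    fractions = {alter: c / total for alter, c in call_counts.items()}
    distinct = sorted(set(call_counts.values()), reverse=True)
    rank_of = {c: r for r, c in enumerate(distinct, start=1)}  # dense ranks (assumed)
    ranks = {alter: rank_of[c] for alter, c in call_counts.items()}
    return fractions, ranks


fractions, ranks = rank_alters({"x": 6, "y": 3, "z": 3})
```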
We limit our interest to pairs in which the ego moved and the alters did not move. The non-mover alters who were consistently ranked in the top five of the mover ego at least for months m−2, m−1, and m, where m is the estimated moving month, are considered to have strong ties with the ego before moving. Among these ego-alter pairs, we examine those in which the ego and alter (1) have known age and gender, (2) had made or received at least one call or SMS from four months before the moving month to four months after (i.e. m − 4 to m + 4), and (3) have no unknown home locations. These filtering methods yield 4487 ego-alter pairs, with 3661 unique mover egos who lived in each of their home locations for at least four months and 4453 unique non-mover alters. As there are a number of pairs with shared egos, the pairs are not strictly independent. To see if these correlations matter, we trained our predictive models on pairs without duplicate egos and tested them on the same test set as used for the original, complete training set. The results are very similar quantitatively and qualitatively to those obtained using the complete training set. As the complete training set should contain more information, the results presented here correspond to the full set of 4487 pairs.
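The top-five condition over months m−2, m−1, and m can be sketched as a set intersection. The data layout and names below are hypothetical, and the remaining filters (non-mover alter, known demographics, activity, known home locations) are left out.

```python
def strong_tie_alters(monthly_ranks, m, top=5, offsets=(-2, -1, 0)):
    """monthly_ranks: {month: {alter_id: rank}} for one ego.

    Returns the alters ranked within `top` in every month m + offset.
    """
    in_top = []
    for offset in offsets:
        ranks = monthly_ranks.get(m + offset, {})
        in_top.append({alter for alter, rank in ranks.items() if rank <= top})
    return set.intersection(*in_top)


monthly_ranks = {
    8: {"a": 1, "b": 2, "c": 6},
    9: {"a": 2, "b": 7, "c": 3},
    10: {"a": 1, "b": 1, "c": 4},
}
strong = strong_tie_alters(monthly_ranks, m=10)
```

Only alter `a` stays within the top five in all three months, so it is the only strong tie in this toy example.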
Examining changes in the call frequency patterns
Although we have data on the SMS of the users, we focus on the call frequency for two main reasons. Firstly, as calls require immediate feedback from both the caller and the callee, we can treat each call as a unit of exchange; the more calls there are, the more communication there is between the two individuals. Secondly, since a single message can be sent as a single SMS or broken down into several SMS, and no immediate feedback is required on the part of the receiver, it is more difficult to infer communication volume using SMS. Previous studies on the same dataset show that including SMS messages does not significantly increase efficacy in ranking the alters [25].
For each ego-alter pair, we construct the time series of two quantities measured at a monthly resolution: (1) the number of calls exchanged by the mover ego and the non-mover alter, and (2) the fraction of the calls made and received by the mover ego that are associated with the non-mover alter. The number of calls gives an estimate of the mobile communication frequency between the ego and the alter, while the fraction of calls is a measure of the relative importance of an alter compared to the other alters. We measure the time in terms of how far away a month is from the moving month: the moving month is taken to be t = 0, while months before moving have values t < 0 and months after moving have values t > 0. To examine whether there is a change in the call frequency associated with a residential move, we use clustering methods on the time series for the different pairs. While it is possible to analyse this by separating time into pre-move and post-move periods and comparing them, exploratory clustering does not have restrictions on when changes are expected and thus gives a better qualitative picture on how calling behaviour changes in relation to a residential move.
Since we are interested more in the call frequency change rather than the actual call frequency value, we standardise each time series to mean 0 and standard deviation 1 over the time period of interest, i.e., t ∈ [−4, 4]. This period was selected to include a sufficiently long time after moving without compromising sample size; further, since we only required that the alter be in the ego's top five for t ∈ [−2, 0], there are some pairs where the alter is a recent introduction to the ego's close circle, although such cases comprised a very small minority of all the pairs considered (0.9%). Since we are interested not just in the shape of the time series but also in the time at which changes occur, we do not use shape-preserving clustering methods [26]. We use k-means as our clustering method and use the mean silhouette score, the Davies-Bouldin index, and the Jaccard bootstrap similarity index [27] to find the optimal number of clusters and quantify the quality of the clustering. Similar results were also obtained using hierarchical clustering with Ward's method.
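The standardisation and clustering step can be sketched with scikit-learn as below. This is an illustration on synthetic step-like series, not the study's data or code: the helper names, the use of the population standard deviation, the dropping of constant series, and the candidate values of k are all assumptions, and the Jaccard bootstrap index is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score


def standardise_rows(series):
    """Standardise each time series (row) to mean 0 and standard deviation 1."""
    series = np.asarray(series, dtype=float)
    mean = series.mean(axis=1, keepdims=True)
    std = series.std(axis=1, keepdims=True)  # population std (ddof=0), an assumption
    keep = std[:, 0] > 0                     # constant series cannot be standardised
    return (series[keep] - mean[keep]) / std[keep], keep


def score_cluster_counts(X, ks=(2, 3, 4, 5), seed=0):
    """Return {k: (mean silhouette score, Davies-Bouldin index)}."""
    scores = {}
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))
    return scores


# Synthetic monthly call counts for t = -4..4 with a step after t = 0.
rng = np.random.default_rng(0)
t = np.arange(-4, 5)
step = (t > 0).astype(float)
rise = 10 + 5 * step + rng.normal(0, 0.5, size=(20, t.size))
decay = 10 - 5 * step + rng.normal(0, 0.5, size=(20, t.size))

X, keep = standardise_rows(np.vstack([rise, decay]))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scores = score_cluster_counts(X)
```

A higher silhouette score and a lower Davies-Bouldin index both indicate better-separated clusters, so the two columns of `scores` are read in opposite directions.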
In order to verify whether the resulting clusters are genuine or simply artifacts of the algorithms used, the clustering results are compared with those found in a control set. The control set contains pairs which satisfy the strong ties condition, but where the egos and alters are both non-movers. These non-mover egos are assigned dummy moving months which follow a distribution similar to that in the original dataset. In addition, we analyse the Spearman correlation coefficient for the unstandardised counts and fractions of the calls across different months for both the actual mover and dummy mover pairs.
Predicting post-move behaviour using pre-move behaviour as well as demographic and location information
Next we predict the post-move behaviour using the users' demographic and location information and their pre-move behaviour. We specifically look at the following features: age and gender of the ego, age and gender difference between the ego and the alter, distance between the ego's home location and the alter's home location prior to and after the move, distance moved by the ego, and pre-move calling patterns. These pre-move calling patterns are the number and fraction of calls mentioned above, as well as a reciprocity measure for the calls (Equation 1), in which c is the non-standardised number of calls and the subscript indicates whether these are calls made by the ego to the alter, by the alter to the ego, or between the ego and the alter irrespective of direction. The reciprocity measure is calculated per month and has a range of [−1, 1].
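For concreteness, one form consistent with this verbal description (a monthly quantity built from the two directed call counts and the undirected total, bounded by −1 and 1) is the normalised difference below. The subscript notation and the sign convention are assumptions made for illustration; the measure used in the analysis may be defined with the opposite sign.

```latex
r = \frac{c_{\text{ego}\to\text{alter}} - c_{\text{alter}\to\text{ego}}}
         {c_{\text{ego}\leftrightarrow\text{alter}}},
\qquad
c_{\text{ego}\leftrightarrow\text{alter}}
  = c_{\text{ego}\to\text{alter}} + c_{\text{alter}\to\text{ego}}
```

Under this assumed convention, r = 1 when every call in the month is made by the ego, r = −1 when every call is made by the alter, and r = 0 when the calls are balanced.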
To compute the distances related to the move, we have to find the locations of the ego and the alter before and after the move at a fine spatial resolution. We look for the home location of the egos and the alters one month before and two months after the estimated moving month (i.e., t = −1 and t = 2) using the same procedure as outlined above, but at the city level rather than the province level. Since the actual move may have happened at t = 0 or early t = 1, the home locations at t = −1 and t = 2 are more likely to reflect the true pre-move and post-move locations. We also note that since no time limit to night-time is considered in estimating the home location, this procedure may have picked up a city that is not the true home location, but perhaps a work or school location. However, we expect that the distance between the true home location and the inferred home location is small compared to the actual moving distance between the old home province and the new home province [28].
In some cases (101 out of the 4487 pairs), the home city location of the ego turns out not to be in its home province location. This can be explained by the aggregation method used: for example, suppose that the user is in the same province (e.g., province A) but in different cities (e.g., cities 1, 2, and 3, all in province A) for most of the time, but stays in one city in another province (e.g., city 4 in province B) for some time. If the number of days in city 4 is greater than the number of days in cities 1, 2, and 3 individually, the inferred city location can be city 4 even if the user spent the most time in province A. In such cases, we take the province-level inferred home location to be the more accurate estimate, as city-level estimates are more prone to noise due to the relative ease of intra-province commutes compared to inter-province commutes. The distances are then computed using the capital of the province as the location of the user.
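The mismatch between city-level and province-level estimates can be reproduced with a small worked example. The day counts are invented for illustration and follow the scenario above: three cities in province A and one city in province B.

```python
from collections import Counter

# Hypothetical number of days on which each city was the user's daily location.
days_in_city = {
    ("A", "city 1"): 8,
    ("A", "city 2"): 7,
    ("A", "city 3"): 6,
    ("B", "city 4"): 9,
}

city_counts = Counter()
province_counts = Counter()
for (province, city), days in days_in_city.items():
    city_counts[city] += days
    province_counts[province] += days

home_city = city_counts.most_common(1)[0][0]          # city with the most days
home_province = province_counts.most_common(1)[0][0]  # province with the most days
```

City 4 wins at the city level with 9 days, although province A accounts for 21 of the 30 days, which is why the province-level estimate is preferred in such cases.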
From the clustering results, we find that we can aggregate the post-move behaviour as the mean in the post-move months. We focus on predicting (a) the post-move mean number and fraction of calls and (b) whether the number and fraction of calls decay after the move or not. We create a train-test set with an 80-20 split ($n_{\text{train}} = 3589$, $n_{\text{test}} = 898$) and use 5-fold cross-validation when tuning the model parameters. Categorical features with k levels were dummy-encoded into k − 1 binary variables, and all features were normalised to mean 0 and standard deviation 1 using the train set. For the regression task (a), we use linear regression (ordinary least squares (OLS), ridge (Ridge), elastic net penalty (ElasticNet)), random forest regression (RF), k-nearest neighbour regression (KNN), and support vector regression (SVR) with the linear (SVR-lin), polynomial (SVR-poly), and radial basis function (SVR-RBF) kernels, while for the classification task (b), we use logistic regression with L2 penalty (LogReg), random forests (RF), AdaBoost, and support vector machines (SVM) with the linear (SVM-lin), polynomial (SVM-poly), and RBF (SVM-RBF) kernels. The models were chosen to give a good representation of the most commonly used linear and nonlinear models for regression and classification. We also note that neural networks were not considered due to the relatively limited size of the dataset. Models were implemented in Python using the scikit-learn module.
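The preprocessing and tuning workflow can be sketched with a scikit-learn pipeline. The example below uses synthetic data and ridge regression only; the feature names, the hyperparameter grid, and the random seeds are assumptions, not the settings used in the study.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-ins for three predictors (names are hypothetical).
rng = np.random.default_rng(0)
n = 200
count_pre = rng.gamma(2.0, 10.0, size=n)              # pre-move mean number of calls
age_ego = rng.integers(18, 70, size=n).astype(float)
direction = rng.integers(0, 2, size=n).astype(float)  # categorical, coded 0/1
X = np.column_stack([count_pre, age_ego, direction])
y = 0.8 * count_pre + rng.normal(0, 2.0, size=n)

# 80-20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Dummy-encode the categorical column into k - 1 binary variables,
# then scale every feature using statistics from the training data only.
encode = ColumnTransformer(
    [("dummy", OneHotEncoder(drop="first"), [2])],
    remainder="passthrough",
    sparse_threshold=0,  # force a dense array for the scaler
)
model = Pipeline([
    ("encode", encode),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])

# 5-fold cross-validation on the training set to tune the penalty.
search = GridSearchCV(model, {"ridge__alpha": [0.1, 1.0, 10.0]}, cv=5, scoring="r2")
search.fit(X_train, y_train)
test_r2 = search.score(X_test, y_test)
```

Placing the scaler inside the pipeline means it is refitted on the training folds during cross-validation, so no information from the held-out fold or the test set leaks into the normalisation.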

General statistics
The demographics of the users are shown in Figure 1. Most of the mover egos are in the age group of 20-40 years old, while the alters are in two age groups, namely those in the same age group as the mover ego, with an age difference within 10 years, and the others in the age group of 20-40 years older. These likely correspond to the peers and parents, respectively, of the mover egos due to the age difference and the frequency and regularity of calls, as we had hypothesised in our previous work [23,24,29]. We also note that the demographic profile of the full dataset is comparable with that of the census data, although users above 60 years of age are underrepresented due to a lack of mobile phone use among older individuals [30]. Figure 1 also includes the distribution of the moving months as an inset; more details are included in the Supplementary Information (Figure S1).
As expected, our method selects mostly egos who move relatively long distances: 94% move 50 km or more, with the median moving distance being 168 km (Figure 2(a)), although there are a number of cases where the ego moves a short distance crossing the provincial borders. We find that 47% of the pairs were within 50 km of each other before the move and ended up moving away from each other, while in 43% of pairs, egos moved closer to within 50 km of the alters' location. Around 5.6% of the pairs continued to be within 50 km of each other before and after the move. 68% of the pairs had the ego and the alter in the same city either before (36%) or after (32%) the move. Among the pairs which were initially in different locations before moving, the median pre-move ego-to-alter distance was 118 km; among pairs in different locations after moving, the median post-move ego-to-alter distance was 153 km (Figure 2(b)).
Figure 1 Mobile phone user statistics. The main figure shows the population pyramids for the egos (filled bars) and alters (unfilled bars) showing the relative frequencies for each gender and age group. The upper inset shows the relative frequency histogram of the age difference among the pairs considered, while the lower inset shows the relative frequency histograms of the egos' moving months, with the composition by age group indicated by the stacked bars. Due to the requirement that the ego and alter must be active for four months before and after the moving month, none of the egos in our analysis moved before May 2008 or after August 2009.

Clustering results
We used k-means clustering on the time series for each pair of both (1) the number of calls exchanged by the ego and the alter and (2) the fraction of the ego's calls that were associated with the alter. Each time series was standardised to have mean 0 and standard deviation 1, over the range −4 ≤ t ≤ 4, where t = 0 corresponds to the moving month.
We find that the mean silhouette score, the Davies-Bouldin index, and the Jaccard bootstrap similarity index [27] all gave k = 2 as the optimum number of clusters for the fraction as well as for the number of calls (Supplementary Information Figure S2). The resulting cluster prototypes obtained by averaging over all cluster members show two typical behaviours of pairs: those where the quantity of interest increased and those where it decreased. In both cases, the change occurs at around t = 0 as shown in Figure 3(a, b). Similar results were also obtained using Ward's method.
[Figure caption, beginning missing] ... shows the distributions of the absolute difference of the distance between the ego and the alter before and after moving, with separate curves for egos moving away from the alter and egos moving towards the alter. All curves were obtained by using the Savitzky-Golay filter on the raw histograms.
To ensure that this result is not an artifact of the data preprocessing and the clustering algorithm, we compare our results to a control dataset of the same size where both the egos and alters are non-movers, but the ego is assigned a dummy month with a distribution similar to that in the original dataset. The results are shown in Figure 3(c, d). We note that both the original and the control dataset over t ∈ [−4, 4] showed two cluster prototypes, one where the quantity of interest increases and the other where it decreases, with the change happening at around t = 0. As there is no real meaning to t = 0 for the control dataset, the month of change (t = 0) for this dataset could be a trivial result, in contrast to that for the original dataset. Due to the standardisation of the time series to mean 0, they will cross the zero axis, but at arbitrary times as the moving months have dummy values. Thus, averaging those time series results in crossing the zero axis in the middle of the interval [−4, 4], i.e., at around t = 0. For the same reason, we expect for the control dataset that if we truncate the standardised time series, the averaged time series will cross the zero axis in the middle of the truncated interval. This explains why in the control dataset, a k = 2 clustering yields rising and decaying prototypes with the rise or decay occurring in the middle of the interval, shifting in location depending on how the interval is truncated. In contrast, for the original dataset, we find that even when performing k-means clustering on the truncated time series, the rise or decay happens at the same point in time around t = 0. Thus, the rising and decaying prototypes in the original dataset can be attributed to the significance of t = 0.
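The argument about the control set can be checked on an idealised case. The sketch below assumes that each control series is a noiseless rising step whose position is equally likely anywhere in the window (a simplification made only for illustration), standardises every such step, and averages them; the function names are hypothetical.

```python
import math


def standardise(series):
    """Standardise a series to mean 0 and (population) standard deviation 1."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    return [(x - mean) / std for x in series]


def mean_rising_profile(n_months):
    """Average of standardised rising steps over every possible step position."""
    profiles = []
    for n_low in range(1, n_months):
        step = [0.0] * n_low + [1.0] * (n_months - n_low)
        profiles.append(standardise(step))
    return [sum(column) / len(profiles) for column in zip(*profiles)]


full = mean_rising_profile(9)       # window t = -4..4, midpoint at index 4 (t = 0)
truncated = mean_rising_profile(7)  # window t = -4..2, midpoint at index 3 (t = -1)
```

In this idealised setting the averaged profile is zero at the midpoint of whichever window is used, so truncating the window to t ∈ [−4, 2] moves the crossing from t = 0 to t = −1, which mirrors the shifting behaviour described for the control dataset.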
The significance of the estimated moving month t = 0 can also be seen by measuring the correlation of the number or fraction of calls between different months. Precisely, for each value of t, we take the Spearman correlation coefficient between the unstandardised quantity (count or fraction) in that month and the same quantity in each of the other months. The correlations between pre-move months are similar to those between post-move months. However, the correlation between a pre-move and a post-move month is markedly lower. Although it is expected that the correlation is higher for consecutive months compared to non-consecutive months, the sharp drop in the correlation coefficients at t = 0 is observed among pairs with mover egos but not among pairs with non-mover egos (Figure 4). When we examine the correlation coefficients for each cluster separately, we find that the correlations across all months remain high (Supplementary Information Figure S3), with Spearman's ρ ≥ 0.7 for both the number and the fraction of calls. Thus, the drop in the correlation coefficient observed in the aggregate population (no separation by cluster) is likely due to the difference in the post-move behaviour in the two clusters found.
Figure 4 Spearman correlation coefficients for calling frequency quantities between different months. These plots show the Spearman correlation coefficients between the unstandardised number or fraction of calls in one month and another month for pairs with mover egos ((a) and (b)) and pairs with non-mover egos ((c) and (d)). Each line corresponds to a particular month, and shows the correlation coefficients between that month and all values of t. The correlation coefficient between the particular month and itself simply has a value of 1.
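A month-by-month Spearman correlation table of the kind plotted in Figure 4 can be computed as sketched below. The implementation is a plain-Python illustration with average ranks for ties; the function names and the four-pair toy data are assumptions.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def month_by_month(values_by_month):
    """values_by_month: {t: [unstandardised value for each pair]}."""
    months = sorted(values_by_month)
    return {(a, b): spearman(values_by_month[a], values_by_month[b])
            for a in months for b in months}


# Toy call counts for four pairs at t = -1, 0, 1.
calls = {-1: [12, 30, 5, 18], 0: [10, 28, 7, 20], 1: [3, 35, 9, 2]}
rho = month_by_month(calls)
```

In the toy data the ordering of the pairs is preserved between t = −1 and t = 0 but reshuffled at t = 1, so the coefficient drops from 1 to 0.2.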
We also note that the vast majority of pairs behave consistently with their corresponding cluster prototype, with a 97% agreement between the cluster labels (rise or decay) and the actual changes (increase or decrease) in the number and fraction of calls. For pairs where the cluster label does not agree with an actual rise or decay in the number or fraction of calls, the difference between the pre-move and post-move values is much smaller than those observed in properly labelled pairs. Further, there is a considerable change between pre-move and post-move values. The median change in the number of calls after moving is 42%, while the median change in the fraction of calls after moving is 34%. Separating the sample into the rise and decay clusters gives similar results for each cluster; detailed results are summarised in Supplementary Information Table 1. Taken together, these bolster our finding that pairs with a mover ego and a non-mover alter are best clustered into two major groups, one where communication rises and another where it decays.
Predicting post-move calling frequency volume from pre-move calling behaviour as well as demographic and location information
The moderate correlation between the pre-move and post-move values shown in Figure 4 suggests that we may be able to predict post-move behaviour if the pre-move behaviour is known. Therefore, we will discuss a number of predictive models and explore the extent to which the pre-move behaviour coupled with demographic and location information can predict the post-move call frequency.
We first examine whether we can predict the number of calls and the fraction of calls made and received by the ego that are associated with the alter after the move. Since the number and fraction of calls are highly correlated between the pre-move months as well as between the post-move months, we take the means of the values across the pre-move months and across the post-move months and use these means to characterize the pre-move and post-move behaviour, respectively. We predict the post-move means of the number and fraction of calls by using the following as predictors:
(i) the mean of the number of calls between the ego and the alter across the pre-move months, also denoted for brevity as the "pre-move count" (count pre)
(ii) the mean of the fraction of calls between the ego and the alter across the pre-move months, also denoted for brevity as the "pre-move fraction" (frac pre)
(iii) age of the mover ego (age ego)
(iv) age difference between the ego and the alter, i.e., ego's age minus alter's age (age diff)
(v) gender of the mover ego (categorical, gender ego)
(vi) gender difference between the ego and the alter (same or opposite gender, categorical, gender diff)
(vii) distance moved by the ego in kilometers (distance move)
(viii) pre-move distance between the ego and the alter in kilometers (distance ea pre)
(ix) direction of move (towards or away from the alter, categorical, direction move)
(x) pre-move mean reciprocity measure in Equation 1 (recip pre)
The absolute value of the difference between the pre-move distance and the post-move distance between the ego and the alter was also considered, but was discarded due to its very strong correlation with the distance moved by the ego (Pearson's ρ = 0.96). Most of the predictors used were not correlated with each other (|ρ| < 0.15), except for age diff and age ego (ρ = 0.51), distance ea pre and direction move (point bi-serial correlation coefficient $|r_{pb}|$ = 0.50), and frac pre and count pre (ρ = 0.72).
The log transformation was considered for particular variables with considerable skew (the mean number of calls and distances), and the logit transform was also considered for the mean fraction of calls. For the log and logit transformation of zero values, we substitute the transform of a sufficiently small value (e.g., for pre-move distance in kilometers and for the number of calls, we use 0.1 instead of 0; for the fraction of calls, we use 0.001 instead of 0). Various combinations of transformed and untransformed variables were compared for performance using the $R^2$ score and the mean squared error (MSE) computed on the test set. In cases where the response variable is transformed, these scores are computed by first back-transforming the predicted variable into its original scale, allowing us to compare models with transformed and untransformed response variables. Categorical predictors (gender and direction of move) were dummy encoded, and all predictors were standardised prior to inclusion in the model.
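The zero substitution and the back-transformation before scoring can be sketched as follows. The function names are hypothetical, and the handling of a fraction equal to 1 in the logit is an added assumption, since the text only describes the zero case.

```python
import math


def log_floor(x, floor=0.1):
    """Log transform; a zero is replaced by a small value (0.1 for counts and km)."""
    return math.log(x if x > 0 else floor)


def logit_floor(p, floor=0.001):
    """Logit transform; a zero fraction is replaced by 0.001."""
    if p <= 0:
        p = floor
    elif p >= 1:
        p = 1 - floor  # assumption: the upper boundary is treated symmetrically
    return math.log(p / (1 - p))


def inverse_logit(z):
    return 1 / (1 + math.exp(-z))


def mse_original_scale(y_true, z_pred, back_transform):
    """MSE after mapping transformed predictions back to the original scale."""
    y_pred = [back_transform(z) for z in z_pred]
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Predictions made on the log scale are scored against counts on the original scale.
error = mse_original_scale([1.0, 10.0], [math.log(1.0), math.log(10.0)], math.exp)
```

Scoring on the original scale is what makes the MSE of a model with a log-transformed response comparable to that of a model with an untransformed response.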

Number of ego's calls associated with the alter
In predicting the number of calls between the ego and the alter, the best result, both in terms of the test $R^2$ value (0.57) and the test MSE (494.8), was given by the ordinary least squares regression with the response and predictors untransformed, with the linear models (ridge, elastic net, linear SVR) and random forest regression following very closely behind. SVR-RBF and KNN regression perform more poorly, but the worst performance is from the SVR with a polynomial kernel as depicted in Figure 5(a, b).
To obtain the most relevant predictors, we calculate the permutation feature importance as used in Ref. [31], which is the difference from the baseline metric when the values of the predictor are shuffled across all pairs. The more positive the permutation feature importance, the better the baseline metric is over the metric with the shuffled predictor (for metrics that are meant to be minimised, the sign is reversed so that a more positive permutation feature importance still corresponds to a decline in metric quality). For our implementation, we take the mean of the difference obtained from five different permuted sets of the test set. For the ordinary least squares regression, as well as in the case of all other algorithms used, the pre-move count has the highest permutation feature importance for both the $R^2$ (0.96) and the MSE (1096) values, followed by predictors with much lower permutation feature importance values (0.01 for $R^2$ and 12 for the MSE). We also note that although no interaction terms were included in the input design matrix, we do not find any evidence for interaction, as can be seen in the similar results between the random forest regression (which is expected to pick up nonlinear and interaction effects) and the linear models. To further illustrate, for each model, we redo the prediction removing all predictors except the pre-move count and find that the performance is similar, as shown in Figure 5(a, b). The linear trend between the pre-move and post-move counts can also be observed in Figure 6(a).
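The permutation feature importance described above can be sketched in a model-agnostic way. The code is an illustration only: the function names, the list-of-rows data layout, and the toy model are assumptions, and any fitted model can be plugged in through the `predict` callable.

```python
import random


def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot


def permutation_importance(predict, X, y, column, metric,
                           n_repeats=5, seed=0, greater_is_better=True):
    """Mean decline in the metric when one column of the test set is shuffled."""
    rng = random.Random(seed)
    baseline = metric(y, predict(X))
    declines = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:column] + [value] + row[column + 1:]
                  for row, value in zip(X, shuffled)]
        score = metric(y, predict(X_perm))
        # For metrics to be minimised (e.g. MSE) the sign is reversed.
        declines.append(baseline - score if greater_is_better else score - baseline)
    return sum(declines) / n_repeats


# Toy test set: the response depends on column 0 only.
X = [[float(i), float((7 * i) % 5)] for i in range(20)]
y = [2 * row[0] for row in X]


def predict(rows):
    return [2 * row[0] for row in rows]


importance_used = permutation_importance(predict, X, y, 0, r2)
importance_unused = permutation_importance(predict, X, y, 1, r2)
```

Because $R^2$ can become negative when a decisive predictor is shuffled, an importance measured in $R^2$ units can exceed the baseline $R^2$ itself, and even exceed 1.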
These results indicate that although the pre-move count explains much of the variance in the post-move count, the remaining variance cannot be explained sufficiently by the demographic information of the pair, the distance moved by the ego, the distance between the ego and the alter, and the reciprocity of calls between the ego and the alter. On the other hand, the pre-move fraction, being moderately positively correlated with the pre-move count (Spearman's ρ = 0.72), seems to be a redundant predictor.

Fraction of ego's calls associated with the alter
The results for predicting the post-move mean of the fraction of calls of the ego associated with the alter are similar to those found for predicting the post-move mean of the number of calls. The best model is obtained using ridge regression with the response untransformed and the predictors distance_ea^pre and distance_move log-transformed (R² = 0.65, MSE = 0.015). However, the performance of all the algorithms used (except for KNN regression) is almost identical, with R² ranging from 0.64 to 0.65 and MSE at 0.015. KNN regression performs more poorly than the rest, with an R² of 0.58 and an MSE of 0.018 (Figure 5(c, d)). We also note that the best models with log or logit transformations used on the response and/or some predictors result in very similar performance (R² ranging from 0.64 to 0.65, MSE from 0.015 to 0.016).
Figure 5 [...] The dark bars show the MSE and R² for the model using all predictors, while the white bars show the MSE and R² for the model using only one predictor, which is the pre-move mean of the number of calls for (a) and (b) and the pre-move mean of the fraction of calls for (c) and (d). Note that for the best-performing models, which happen to be linear models, using only a single predictor performs almost as well as using the full set of predictors.
The pre-move fraction is also the most relevant predictor found, with its permutation feature importance much higher than that of the other variables (for R², 1.28 vs. 0.003 for the next highest score; for MSE, 0.06 vs. 0.0001). Using only the pre-move fraction in the least squares model also yields almost identical performance to the one with all the predictors, see Figure 5(c, d). Again, these results indicate that although pre-move behaviour may be used to predict post-move behaviour, the demographic and location information does little to improve the prediction performance. The linear trend between the pre-move and post-move fraction can also be observed in Figure 6(b).

Predicting rise or decay in calling frequency from pre-move calling behaviour as well as demographic and location information
We have observed that the post-move calling frequency volume (in terms of both the number and fraction of calls) can be predicted, to some extent, using the pre-move behaviour, whereas demographic and location information does not significantly improve performance. We now attempt to see if we can at least predict the direction of change in the calling frequency by predicting whether the post-move calling frequency (mean number or fraction of calls) is lower than the pre-move value. These outcomes roughly correspond to the rise and decay clusters from the clustering analysis, but instead of using the cluster labels as ground truth, we use the actual pre- and post-move means to determine the rise or decay. We also note that there are a few cases where the pre- and post-move means are the same, though they comprise a very small proportion of the pairs (0.96% for the number of calls and 0.04% for the fraction of calls). These pairs are lumped together with those whose calling frequency increased post-move. Though this group is more aptly named the "non-decay" cluster, we keep the term "rise" for simplicity, as the majority of this group behave in this way.
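The labelling rule just described can be written down in a few lines. This is only an illustrative sketch: the function name and the example values are hypothetical, and the rule encoded is the one stated in the text (a pair is labelled "decay" only if the post-move mean is strictly lower than the pre-move mean, so ties fall into the "rise" group).

```python
def label_change(pre_mean, post_mean):
    """Return 'decay' if the post-move mean is strictly lower, else 'rise'."""
    # Pairs with identical pre- and post-move means are grouped with 'rise'.
    return "decay" if post_mean < pre_mean else "rise"

# Hypothetical (pre-move mean, post-move mean) values for three pairs.
example_pairs = [(12.0, 7.5), (4.0, 9.0), (6.0, 6.0)]
example_labels = [label_change(pre, post) for pre, post in example_pairs]
```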
The resulting training set is mildly imbalanced, with a decay-rise composition of 57%-43% for the number of calls and 62%-38% for the fraction of calls. We remedy the imbalance by using balanced class weights in the classification algorithms, as implemented in the Python module scikit-learn.
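For reference, the "balanced" heuristic in scikit-learn weights each class by n_samples / (n_classes × class count), so that the total weight of each class is the same. The sketch below recomputes these weights with the standard library for a hypothetical training set with the 57%-43% composition quoted above; the function name and the sample size of 100 are assumptions made for the example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Class weights n_samples / (n_classes * count), as in the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical training labels with a 57%-43% decay-rise composition.
train_labels = ["decay"] * 57 + ["rise"] * 43
class_weights = balanced_class_weights(train_labels)
```

With these weights the minority class (rise) receives the larger weight, and the weighted class totals are equal.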
The basic procedure is the same as in the regression task, but we predict a binary output using classification models. We consider two metrics: the total accuracy and the average of the recall for each class (rise and decay), with the latter equivalent to the average of the accuracy scores for each class considered separately.
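The difference between the two metrics is easiest to see on a small example. The sketch below uses hypothetical labels and a degenerate classifier that always predicts decay; the function names are assumptions made for the illustration. The average recall corresponds to what scikit-learn calls balanced accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of all pairs that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_per_class(y_true, y_pred):
    """Fraction of correctly classified pairs within each true class."""
    recalls = {}
    for cls in sorted(set(y_true)):
        predictions = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls[cls] = sum(p == cls for p in predictions) / len(predictions)
    return recalls

def average_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = recall_per_class(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Hypothetical imbalanced sample and a classifier that always predicts decay.
true_labels = ["decay"] * 6 + ["rise"] * 4
predicted_labels = ["decay"] * 10

total_accuracy = accuracy(true_labels, predicted_labels)
class_recalls = recall_per_class(true_labels, predicted_labels)
mean_recall = average_recall(true_labels, predicted_labels)
```

Here the total accuracy simply mirrors the share of the majority class, while the average recall stays at 0.5, which is why the latter is informative when one class dominates the predictions.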

Number of ego's calls associated with the alter
The SVM-RBF classifier gives the best performance in terms of both the total accuracy and the average recall, with the best results obtained by logit-transforming the fraction of calls and log-transforming the number of calls and the distances (accuracy 0.67, average recall 0.66). The recalls for the rise and decay classes are also comparable (0.64 and 0.68, respectively), indicating that the model performs similarly on both classes (Figure 7(a, b)).
In contrast to the regression task, we find that the demographic and location information are also relevant predictors. Unlike in the regression task, the permutation feature importances are not heavily skewed towards a single predictor but are within the same order of magnitude across most predictors. For both the average recall and the total accuracy, the most relevant predictors in the best performing model are the pre-move fraction, the direction of the move, the pre-move count, the age difference, and the pre-move distance between the ego and the alter, as shown in Figure 8(a, b).
To further illustrate the relevance of the demographic and location information to the prediction performance, we look at the performance of the models with only the (logit-transformed) pre-move fraction and the (log-transformed) pre-move count as predictors. Unlike in the regression task, removing the demographic and location information results in noticeably poorer performance, as shown in Figure 7(a, b), except in the case of the linear models. This indicates that the relationship of demographics and location to post-move calling behaviour is likely to be nonlinear.

Fraction of ego's calls associated with the alter
All sets of predictors give similar best accuracy scores (ranging from 0.67 to 0.68, depending on the set of predictors used), all attained using random forests. However, the random forest models attain this by assigning most pairs to the decay cluster, resulting in a high recall for the decay class (0.86-0.88) but a low recall for the rise class (0.32-0.33). In contrast, the SVM-RBF classifier gives a slightly lower total accuracy (0.62-0.64) but has comparable recall scores for each class (0.56-0.63 for the rise class, 0.64-0.66 for the decay class). Overall, the best performing model is the SVM-RBF classifier with the distances log-transformed, with a total accuracy score and average recall score of 0.64, see Figure 7(c, d).
The most relevant predictors for this model are the pre-move fraction, the age difference, the direction of the move (away or towards the alter), and the pre-move distance between the ego and the alter (Figure 8(c, d)). Using only the pre-move fraction and the pre-move count gives poorer predictive performance than using all predictors, except in the linear models, as shown in Figure 7(c, d).

Discussion and concluding remarks
In contrast to the surveys most commonly used in migration studies, mobile phone call detail records give us a very detailed quantitative view of how the mobile communication patterns of movers change, while being unaffected by recall bias and characterised by high spatiotemporal resolution. We were primarily interested in how a long-distance residential move would affect the mobile communication patterns between a mover ego and a non-mover alter that frequently and regularly communicated with the ego prior to the move. In particular, we wanted to investigate the role of the demographic and location information of the ego-alter pair in these changes. We focused on two quantities that characterise the mobile communication patterns: (a) the number of calls between the mover ego and the non-mover alter, and (b) the fraction of calls made and received by the mover ego that are associated with the non-mover alter. The number of calls serves as a measure of communication volume independent of the other alters of the ego, while the fraction of calls is a proxy for the relative importance of the alter to the ego.
Using clustering analysis, we have found that a change in communication patterns happens shortly after the move, with the call frequency generally either rising or decaying. As there is a high correlation both within the pre-move months and within the post-move months, but a lower correlation between the pre-move and post-move months, this change is confined mostly to the moving month, with the communication stabilising shortly after. Interestingly, very few pairs cut communication entirely. At four months after moving, only 3.5% of the close pairs we examined have no calls or SMS. A preliminary investigation reveals that the pairs that stopped communicating are disproportionately composed of young peers, and we aim to study this phenomenon in more detail in the future.
We also find that the post-move means of the number and fraction of calls can be predicted to some extent by the corresponding pre-move values alone, although a large proportion of the variance remains unexplained even when demographics and the distances moved are taken into account. The relationship between the pre- and post-move calling frequency volumes shows a linear trend (Figure 6), with pre-move values likely resulting in similar post-move values. In predicting the rise or decay in the number and fraction of calls, demographic and location information provide more information to nonlinear models, yielding better performance than if only the pre-move means of the number and fraction of calls had been used as predictors. Thus, whereas a higher frequency of pre-move calls generally leads to a higher frequency of post-move calls, demographics and migrating distances appear to have a more complicated nonlinear effect on the post-move calling behaviour. We also note that in predicting the rise or decay of calling frequency, the most relevant predictors other than the pre-move calling behaviour involve the migrating distances and the age difference of the pair, while the genders of the ego and the alter have little effect on the prediction performance. Since, in our dataset, the age difference mostly falls into one of two groups, one corresponding to the age differences among peers (0-10 years) and the other to parent-child or similar relationships (20-40 years), our results suggest that the type of relationship between pairs affects post-move communication, which is consistent with sociological findings [32].
Figure 8 Permutation feature importance for rise/decay classification. These show the permutation feature importance of each predictor used in the best model for predicting rise/decay in the mean number of calls ((a) and (b)) and the mean fraction of calls ((c) and (d)). The more positive the permutation feature importance for a particular predictor is, the more shuffling the predictor values negatively affects the model performance.
Although the predictive performance of our classifiers is better than random, both the average recall and the accuracy are less than 70%. We can compare this to the Bayes accuracy, the theoretical upper bound of the accuracy obtainable by any classifier for a given sample. Although the Bayes accuracy itself cannot be calculated without knowledge of the underlying distributions, its bounds can be estimated. The estimate for the upper bound of the Bayes accuracy obtained from the 1-nearest neighbour classifier [33] is 69% for both classification tasks. As the training size is not very large, we expect some bias in this estimate, but as the dimensionality of the prediction problem is not very high, we do not expect the actual upper bound to be considerably higher [34]. Thus, even though we did not exhaust all possible models and algorithms, the lack of predictive power is more likely attributable to the inherent overlap in the feature space of the two classes (decay and rise) than to the insufficiency of the models used. In other words, the post-move behaviour of pairs where the ego moved but the alter did not may also be influenced by other variables that we did not consider in our models.
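One common way to obtain such a bound, sketched here under the assumption that it is the asymptotic two-class nearest-neighbour inequality that is meant, is the following. If R_NN is the 1-nearest neighbour error rate and R* the Bayes error rate, then R_NN ≤ 2R*(1 − R*), which can be inverted to give R* ≥ (1 − √(1 − 2R_NN))/2 and hence an upper bound on the Bayes accuracy of (1 + √(1 − 2R_NN))/2. The function name and the error value of 0.42 below are purely illustrative and are not taken from the study.

```python
import math

def bayes_accuracy_upper_bound(nn_error):
    """Upper bound on the two-class Bayes accuracy from a 1-NN error rate.

    Inverts the asymptotic inequality R_NN <= 2 R* (1 - R*); only valid for
    two classes and for error rates between 0 and 0.5.
    """
    if not 0.0 <= nn_error <= 0.5:
        raise ValueError("nn_error must lie in [0, 0.5] for the two-class bound")
    return 0.5 * (1.0 + math.sqrt(1.0 - 2.0 * nn_error))

# Illustrative value only: a 1-NN error rate of 0.42.
illustrative_bound = bayes_accuracy_upper_bound(0.42)
```

Because the inequality holds in the limit of infinitely many samples, a bound computed from a finite test error is itself only an estimate, in line with the caveat about bias in the text.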
A limitation of using our dataset to study mobility is that a user's location is recorded only when the user makes a call or sends an SMS and when the cell tower used has known geographical coordinates. We have worked around this limitation through a combination of approaches. First, we increased the number of viable samples by not introducing a time limit, in exchange for reducing the spatial resolution to the province level. Second, we obtained the home location by taking the majority province location at two levels: we first took the most common location per day, and then, among these locations, the most common location per month. This prevents days with an unusually high number of calls from significantly influencing the inferred home location. Lastly, we require that the user moves only once and that the home location stays stable over a certain period (i.e., the egos we considered were in the same home location for at least four months), making it more probable that the user indeed moved rather than commuted between the inferred home locations. By imposing these criteria to identify movers, we find that most of the mover egos that we included in our analysis moved long distances (Figure 2), which would entail substantial time or financial costs for daily commuting by European standards [35]. Our clustering results also bolster the relevance of the moving month in the call frequency time series, indicating that the movers and their moving month were likely correctly inferred.
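The two-level majority rule for the home location can be sketched as follows. This is a hedged illustration rather than the procedure used in the study: the record layout (month, day, province), the function names, and the province labels are hypothetical, and ties are broken arbitrarily by first occurrence, which the text does not specify.

```python
from collections import Counter, defaultdict

def most_common(items):
    """Most frequent item; ties are broken by first occurrence (an assumption)."""
    return Counter(items).most_common(1)[0][0]

def monthly_home_province(records):
    """Infer one home province per month from (month, day, province) records."""
    # Level 1: most common province for each day.
    provinces_by_day = defaultdict(list)
    for month, day, province in records:
        provinces_by_day[(month, day)].append(province)
    daily_provinces_by_month = defaultdict(list)
    for (month, _day), provinces in provinces_by_day.items():
        daily_provinces_by_month[month].append(most_common(provinces))
    # Level 2: most common daily province for each month.
    return {month: most_common(daily)
            for month, daily in daily_provinces_by_month.items()}

# Hypothetical month: two quiet days in province A, one busy day in province B.
example_records = [("m1", 1, "A"), ("m1", 2, "A")] + [("m1", 3, "B")] * 10
inferred_home = monthly_home_province(example_records)
raw_majority = most_common(province for _, _, province in example_records)
```

In this example a plain majority over all records would pick province B because of the single busy day, whereas the two-level rule picks province A, where the user was located on most days.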
In summary, our study demonstrates the predictability of post-move mobile communication based on pre-move mobile communication, migrating distances, and demographics of communicating individuals. To our knowledge, our work is the first of its kind in investigating the interplay between these factors from a quantitative