Spatiotemporal correlations of handset-based service usages
http://www.epjdatascience.com/content/1/1/10

We study spatiotemporal correlations and temporal diversities of handset-based service usages by analyzing a dataset that includes detailed information about locations and service usages of 124 users over 16 months. By constructing the spatiotemporal trajectories of the users we detect several meaningful places or contexts for each one of them and show how the context affects the service usage patterns. We find that temporal patterns of service usages are bound to the typical weekly cycles of humans, yet they show maximal activities at different times. We first discuss their temporal correlations and then investigate the time-ordering behavior of communication services like calls being followed by the non-communication services like applications. We also find that the behavioral overlap network based on the clustering of temporal patterns is comparable to the communication network of users. Our approach provides a useful framework for handset-based data analysis and helps us to understand the complexities of information and communications technology enabled human behavior.


Introduction
Macroscopic socio-economic phenomena involving a large number of individuals have been extensively studied by means of social, physical, and computational sciences [-]. Recent access to large-scale digital datasets on human dynamics and social interaction has enabled us to quantitatively investigate the structure and dynamics of human communication networks. Indeed, researchers have studied various datasets, ranging from email and mobile phone communications to social network services, e.g. Twitter and Facebook [-]. Mobile phones or handsets are now actively utilized to accurately measure or sense human behavior because handsets equipped with a variety of sensors, including GPS and WiFi, are carried around by their users all day, every day. Highly resolved location data collected from handsets have recently been used to uncover human mobility patterns [-]. The reliability of data collected from handsets, i.e. 'behavioral' data, was tested in a series of studies conducted within the framework of MIT's Reality Mining project [, , ]. It was shown that the behavioral data are at least comparable to self-report survey data in terms of the friendship network, and that they even capture information that self-reports miss [].
The handset usage patterns are known to be diverse among users when measured, for example, by the number or duration of phone sessions and by the amount of data received [, ]. Within the individual handset usage patterns, temporal inhomogeneities due to circadian and weekly cycles were also reported [], which are closely related to spatial inhomogeneities, such as nighttime at home and daytime in the office. Therefore, for a comprehensive study it is important to identify the context characterizing the situation of the handset user, and then to understand how the context affects service usage patterns [-]. However, the effect of context on handset-based service usages has been investigated only very recently, and so far the analysis has been conducted mostly at the aggregate level, while the temporal diversities of service usage among users have been ignored [].
In this paper, we study spatiotemporal correlations of the service usage patterns of individual users by analyzing a handset-based dataset. This dataset was collected from  users' handsets for over  months as a part of the OtaSizzle project at Aalto University, Finland []. Software installed on the handsets collected information about the handset's locations and usages of various services, including web domain visits, applications, emails, voice calls, and short message services, with a temporal resolution of seconds and a spatial resolution of mobile network base stations. After constructing spatiotemporal trajectories of the users we identify several contexts that are meaningful to them by using the context detection method []. Other methods include, for example, places of interest or meaningful locations [, ] and eigenmode analysis [-]. Then, we find correlations between the spatiotemporal trajectories and the service usage patterns. We observe the similarity and diversity in temporal patterns of the service usages and discuss their temporal correlations, time-ordering behavior between services, and behavioral overlap network based on the clustering results. Our approach provides a useful framework for handset-based data analysis, and hence it would be important for better design of information and communications technology (ICT) enabled social environments and services. This paper is organized as follows. We first describe the data collection and preparation methods. Several contexts for each user are then identified by means of the context detection method applied to the user's spatiotemporal trajectory. Next, we uncover the spatiotemporal correlations and the similarity and diversity in temporal patterns of the service usages. Finally, we summarize the results with concluding remarks.

Data collection method
The handset-based dataset in this study was collected by the MobiTrack software installed on Nokia Symbian smartphones of  participants or users from September  to December , i.e. for a period spanning about 16 months. All users were students and staff members of Aalto University, Finland and were identified as early adopters of mobile phones and services []. The dataset was anonymized so that no personal information of the users could be obtained. We consider only 124 users with the overall duration of handset usage longer than  days; see the section on context detection for details.
The dataset consists of two kinds of information: locations and service usages. The resolution of locations is limited to the physical area covered by each mobile network base station, i.e. cell, denoted by $c$. Whenever the handset is connected to a new cell, or otherwise every half an hour, the identifier of the cell to which the handset is connected was recorded with a timestamp $t$ with one second resolution. Each cell can be located in the geographical space with a unique pair of latitude and longitude. The geographic information for cells [...]. For all users we have ,, records at , different cells. Although only .% of cells could be located in the geographical space, they correspond to .% of records. Figure  shows all located cells over the world, in Finland, and in the Helsinki municipal area. In this way, the detailed spatiotemporal trajectory of each user could be constructed in terms of a sequence of cell records $\{(c_k, t_k)\}$, where $k$ denotes the ordered index of the record.
For service usage data we consider five services: web domain visit (web), application (app), email, voice call (call), and short message service (SMS). Each service usage or event was recorded with a timestamp with one second resolution together with service-specific relevant information. In the case of web domain visits, the URL (Uniform Resource Locator) was extracted and recorded, along with whether it was visited via a browser or a widget. Only the applications visible in the foreground of the handset were recorded, so that no process or application running in the background was considered. The records of communication services, such as email, call, and SMS, include the information on whether the user was the initiator or the receiver of the communication event, and on the communication partner if available. For more information regarding the data collection method, see [].

Data preparation method
The service usage dataset contains events mostly generated by users but it also contains automatic events generated by the operating system of the handsets. In order to observe the pure human behavior, we systematically filtered out these automatic events. However, some spurious regularities still remain in the web dataset. In the cases of google.com, facebook.com and so on, once a website is connected, the browser might visit the same website automatically for periodic updates and synchronization of accounts until the website is disconnected. To resolve this issue, we obtain the distribution of the inter-event time $\tau$, defined as the time interval between consecutive web domain visits by the same user. Several sharp peaks at specific inter-event times are found, where each peak is mostly related to a single webpage. We remove all the events leading to those inter-event times, except for the event trains consisting of only two events with τ =  seconds, because some trains with only two events separated by  seconds can also be generated by users. As new regularities become visible after filtering, we apply this method recursively until the peaks are suppressed considerably, leading to approximately % of all events being removed. Figure  shows that this filtering method for the web dataset does not change the overall characteristics of the inter-event time distribution.
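As an illustration of the first step of this filtering, the following minimal Python sketch pools the inter-event times of all users and flags the values that occur far more often than their neighbours. The function names and the thresholds `min_count` and `ratio` are hypothetical choices made for this sketch, not values taken from the dataset, and the recursive removal of events is not reproduced here.

```python
from collections import Counter


def inter_event_times(timestamps):
    """Intervals (in seconds) between consecutive events of one user."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]


def peak_intervals(timestamps_by_user, min_count=50, ratio=5.0):
    """Inter-event times that form sharp peaks in the pooled distribution.

    A value tau is flagged when it occurs at least `min_count` times and at
    least `ratio` times more often than tau - 1 and tau + 1 seconds.
    Both thresholds are illustrative assumptions.
    """
    counts = Counter()
    for ts in timestamps_by_user.values():
        counts.update(inter_event_times(ts))
    peaks = set()
    for tau, n in counts.items():
        neighbours = max(counts.get(tau - 1, 0), counts.get(tau + 1, 0), 1)
        if n >= min_count and n >= ratio * neighbours:
            peaks.add(tau)
    return peaks
```

For example, a user whose browser refreshes a page every 300 seconds produces a single flagged value at 300 seconds, while irregular human-generated intervals stay below the thresholds.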
We also ignore some user-generated application events associated with other service usages, corresponding to % of entire events. For example, the user opens the messaging application when sending or receiving SMSs. These application events might lead to artificial correlations between different service usages. In addition, corrupted events, less than .% of the whole dataset, have been ignored or manually corrected. Finally, we have , web domain visits, , application events, , emails, , calls, and , SMSs in the service usage dataset.

Context detection from spatiotemporal pattern
In order to detect the contexts for each user, we construct the user's spatiotemporal trajectory from a sequence of cell records $\{(c_k, t_k)\}$. It is necessary to infer the user's location between consecutive timestamps of cell records. From a sequence of cell records, we derive the temporal boundaries $\{(c_k, t_k^{(s)}, t_k^{(e)})\}$ for the user's trajectory, implying that the user stays within the area covered by cell $c_k$ from the moment $t_k^{(s)}$ to $t_k^{(e)}$, see Figure . It is assumed that the user stays in the cell $c_k$ until $t_k^{(e)} = \frac{1}{2}(t_k + t_{k+1})$ and then in the cell $c_{k+1}$ from $t_{k+1}^{(s)} = \frac{1}{2}(t_k + t_{k+1})$, provided that $t_{k+1} - t_k \le t_c$. Here we set $t_c$ as half an hour, i.e. the time interval for regular cell recording. A time interval between consecutive timestamps longer than $t_c$ implies that the handset may be turned off, used in offline or airplane mode, or not able to detect any cell nearby. If $t_{k+1} - t_k > t_c$, the user is considered to stay in the cell $c_k$ until $t_k^{(e)} = t_k + \frac{t_c}{2}$ and in the cell $c_{k+1}$ from $t_{k+1}^{(s)} = t_{k+1} - \frac{t_c}{2}$. Hence, the location is unknown between $t_k^{(e)}$ and $t_{k+1}^{(s)}$. Then, the total time spent, i.e. duration, in each cell $c$ is obtained as follows:
$$d_c = \sum_k \left(t_k^{(e)} - t_k^{(s)}\right) \delta(c, c_k),$$
where $\delta(c, c_k)$ is 1 if $c = c_k$ and 0 otherwise. If the sum of durations in all the recorded cells, $D \equiv \sum_c d_c$, is less than  days, that user is not considered for the further analysis, leading to 124 available users. The average and standard deviation of $D$ for available users are  ±  days.
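The construction of temporal boundaries and durations can be sketched in a few lines of Python. This is only an illustrative reading of the rules above: the half-gap on each side of a record is capped at $t_c/2$, and the treatment of the very first and last records (which simply start and end at their own timestamps) is an assumption of this sketch. The function names are hypothetical.

```python
def temporal_boundaries(records, t_c=1800):
    """Turn sorted cell records [(cell, t), ...] into [(cell, t_start, t_end), ...].

    Between two consecutive records the boundary is the midpoint if the gap
    is at most t_c seconds; otherwise each record only claims t_c / 2 and
    the time in between is left unknown.
    """
    out = []
    last = len(records) - 1
    for k, (cell, t) in enumerate(records):
        start, end = t, t  # assumption: no extension beyond first/last record
        if k > 0:
            start = t - min(t - records[k - 1][1], t_c) / 2
        if k < last:
            end = t + min(records[k + 1][1] - t, t_c) / 2
        out.append((cell, start, end))
    return out


def durations(boundaries):
    """Total time spent in each cell, d_c, summed over all its intervals."""
    d = {}
    for cell, start, end in boundaries:
        d[cell] = d.get(cell, 0.0) + (end - start)
    return d
```

With records at 0 s (cell A), 600 s (cell B) and 6000 s (cell A), the interval between 1500 s and 5100 s is left unassigned because the second gap exceeds half an hour.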
In addition, we observe back and forth changes in a short time span between two cells covering the neighboring areas. It can occur even without any real movement of the handset if the handset is located at the boundary of two neighboring cells. To filter out this noisy behavior, the involved cells can be clustered by a sandwich clustering method [].
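Here we consider only one type of sandwich with four records involving two cells, i.e. four consecutive records $(c_k, c_{k+1}, c_{k+2}, c_{k+3})$ alternating between two cells such that $c_k = c_{k+2} \neq c_{k+1} = c_{k+3}$.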
Whenever this type of sandwich is detected, every $c_k$ in the temporal boundaries is replaced by or merged into $c_{k+1}$ if $d_{c_{k+1}} > d_{c_k}$, and vice versa. Consequently, some geographically neighboring cells can be clustered into one representative cell, which from now on will be treated in the same way as normal cells. For example, the first row in Figure  shows the user 's temporal boundaries during a typical Friday and Saturday. Note that clustering cells for one user is independent of other users' records.
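A possible implementation of this merging step is sketched below in Python. It assumes that a sandwich is an alternating pattern of two cells over four consecutive records and, for brevity, ignores the condition that the changes happen within a short time span. The function name and the bookkeeping of merged durations are choices of this sketch, not a description of the method in [].

```python
def merge_sandwiches(cells, duration):
    """Map every cell to a representative cell after sandwich merging.

    cells:    sequence of cell identifiers, one per record, in time order
    duration: dict cell -> total duration d_c
    Whenever records k..k+3 alternate between two cells, the cell with the
    shorter duration is merged into the one with the longer duration.
    """
    rep = {c: c for c in set(cells)}
    dur = dict(duration)  # work on a copy; merged durations are accumulated

    def find(c):
        while rep[c] != c:
            c = rep[c]
        return c

    for k in range(len(cells) - 3):
        a, b, a2, b2 = cells[k:k + 4]
        if a == a2 and b == b2 and a != b:
            ra, rb = find(a), find(b)
            if ra == rb:
                continue
            keep, drop = (rb, ra) if dur[rb] > dur[ra] else (ra, rb)
            rep[drop] = keep
            dur[keep] += dur[drop]
    return {c: find(c) for c in rep}
```

The mapping is computed per user, in line with the remark that clustering cells for one user is independent of other users' records.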
We find spatiotemporal inhomogeneities of the trajectories of handsets on the individual basis as well as at the aggregate level. As an illustrative example, we obtain the rank curve $d(r)$, defined as the duration in the $r$th cell when the cells are sorted in descending order of $d_c$. The rank curve for all users is highly skewed, such that the first few cells, including one in the Otaniemi campus of Aalto University, were visited for more than a few months while .% of cells were visited for less than one hour, as shown in Figure . The same inhomogeneities are also observed for individual users. For example, the rank curves for users  and , who were selected to show the representative behavior, are shown in Figure .
The heavily visited cells are supposed to cover places that are meaningful to the handset user, such as home and office. Since the service usage patterns might be affected by the different characteristics of meaningful places, it is important to identify the context characterizing the situation of the user. Here the context is preferred to the meaningful place because the time and place of handset usage are not independent but correlated, e.g. nighttime at home and daytime in the office []. Each cell is assigned one of five contexts: Home, Office, Other meaningful place (Other), Elsewhere (Else), and Abroad. One context can be assigned to several cells. The identifier of a cell contains the mobile country code (MCC), by which the Abroad context is assigned to the cells outside Finland. For the cells within Finland, we obtain more detailed durations for each cell $c$: (1) duration on weekdays ($d_{c,wd}$), (2) duration on weekdays between  AM and  AM (d c,- ), and (3) duration on weekdays between  AM and  PM (d c,- ). Now we describe the criteria for assigning contexts other than Abroad. A cell is detected as Elsewhere (Else) if the duration in that cell is negligible compared to the total duration. [...] With the above threshold values, at least one Office has been detected for more than half of the users. Note that most users were students so that they might not have any regular [...] With the above threshold values, at least one Home has been detected for all users except for two of them. Many users turn out to have more than one Home, such as the user's own home and his/her parents' home. Finally, the remaining cells are detected as Other meaningful place (Other). Figure  shows the locations of detected contexts for sample users in the Helsinki municipal area. We put two sample users' contexts together to avoid privacy issues.
Our context detection method is validated by weekly patterns of duration for different contexts obtained for sample users and at the aggregate level, as depicted in Figure . For example, the user  without Other detected shows a very regular pattern, especially on weekdays, i.e. at Home in nighttime, in Office during the working time, and at Else when moving between Home and Office. Weekly patterns of user  are comparable to the temporal boundaries in terms of detected contexts, as depicted in the second row in Figure . Weekly patterns of duration aggregated over all users show the overall behavior. Durations at Home, Office, Other, and Else account for .%, .%, .%, and .% of the total duration of all users, respectively.

Spatiotemporal correlations of service usages
We investigate correlations between users' spatiotemporal trajectories and their service usage patterns. Five services are considered, namely web domain visit (web), application (app), email, voice call (call), and short message service (SMS), and each service is denoted by $s$. The spatiotemporal correlation of service usages for user $i$ is fully characterized by the number of events of service $s$ in the cell $c$ at time $t$, denoted by $n_{is}(c, t)$. To gain a contextual understanding of the correlations we consider the contexts instead of cells, i.e. $n_{is}(C, t) = \sum_{c \in C} n_{is}(c, t)$, where the summation is over the cells $c$ detected as context $C$.

Contextual correlations of service usages
We first focus on the contextual correlations of service usages with $n_{is}(C) = \sum_t n_{is}(C, t)$. Since services have qualitatively different characteristics, the numbers of events of different services cannot be directly compared to each other but only in terms of fractions and intensities of usages. The fraction of service usage is defined as
$$\frac{n_{is}(C)}{\sum_{C'} n_{is}(C')}.$$
Figure  (left) shows the fractions for sample users  and  as well as their means over all users with standard errors, measured by the bootstrap method. The handset of user  has never been abroad and no Other context is detected. For this user all service usages are more active at Home and Office than at Else, which is very different from the service usage patterns of user . Due to the diversity of the service usage patterns among users, no general conclusion can be drawn on the individual basis. However, by looking at the means with standard errors, it is found that all service usages are the most active at Home, while they are relatively inactive in other contexts. Given the aggregate durations for different contexts obtained in the previous section, this finding can be explained such that a longer duration in some context means a higher chance for service usage.

[Figure caption, partially recovered: ... (7) and (8), respectively. Standard errors are also provided for the user-averaged statistics.]
Accordingly, instead of the fractions of service usages we consider the intensities, i.e. the fractions divided by the corresponding fractions of duration, as follows:
$$\frac{n_{is}(C) / \sum_{C'} n_{is}(C')}{d_{iC} / \sum_{C'} d_{iC'}},$$
where $d_{iC}$ denotes the duration of user $i$ for context $C$. The results are shown in Figure  (right). Despite the diversity among users, the means of intensities of different services for the same context have to some extent similar values. The large mean of intensity of email usage in Office might be due to the fact that users prefer emails to calls or SMSs in classes or laboratories during working time. The large mean of intensity of web usage at Else could be the result of users killing time by surfing webpages while on the move. One could also say that users while abroad tend to use SMSs more than other communication services. Finally, for all services, only the means of intensity at Home turn out to be less than 1 and the most inactive, which could be partly because users have many other activities to do at Home.
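The difference between fractions and intensities can be made concrete with a short Python sketch. It assumes that the intensity is the fraction of events in a context divided by the fraction of time spent in that context, so that a value of one means usage proportional to duration; the function names and the example numbers are purely illustrative.

```python
def fractions(counts):
    """Fraction of one user's events of one service falling in each context."""
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def intensities(counts, durations):
    """Usage fraction divided by duration fraction, per context.

    counts:    dict context -> number of events n_is(C)
    durations: dict context -> duration d_iC
    Contexts with zero duration are skipped (assumption of this sketch).
    """
    f = fractions(counts)
    d_total = sum(durations.values())
    return {
        c: f[c] / (durations[c] / d_total)
        for c in f
        if durations.get(c, 0) > 0
    }
```

For a user with 50, 30 and 20 events at Home, Office and Else and with 70%, 20% and 10% of the time spent there, Home has the largest fraction (0.5) but the smallest intensity (about 0.71), which is the kind of reversal discussed above.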

Temporal correlations and time-ordering of service usages
We turn to analyze the temporal correlations of service usages in terms of $n_{is}(t) = \sum_C n_{is}(C, t)$, where the summation is over all contexts with one exception, Abroad, because the service usage abroad cannot be considered as normal, as shown in Figure . We first obtain weekday and weekend patterns of service usages as
$$n^{wd}_{is}(t) = \sum_{k} n_{is}(t + k T_d), \qquad n^{we}_{is}(t) = \sum_{k'} n_{is}(t + k' T_d)$$
for $0 \le t < T_d$ with $T_d = 1$ day. Here $k$ and $k'$ denote the indexes of weekdays and weekends, respectively. The weekday and weekend event rates of service $s$ for user $i$, $\rho^{wd}_{is}(t)$ and $\rho^{we}_{is}(t)$, are defined as [...] where a = / and a = / are weights for normalization. In addition we obtain the weekday and weekend event rates averaged over all users.
In Figure  we show the individual event rates for sample users  and  as well as the event rates averaged over all users. The overall behavior of the individual and user-averaged event rates reflects typical weekly cycles of humans by being more active in the daytime and on weekdays and less active in the nighttime and on weekends. From the user-averaged event rates, we find that email (call) is used more around noon (late afternoon) on weekdays, while email (call) is used less (more) than other services in the weekend daytime. Since most users in our dataset were students and staff members of the university, they might not be making or receiving calls in classes or laboratories in the weekday daytime. Instead they might be using other communication services, such as email and SMS. On the other hand, users might be using calls more than email outside classes or laboratories on weekends.
To investigate the temporal correlations between service usages for each user, we calculate the Pearson correlation coefficient (PCC) between the event rates of services $s$ and $s'$ for user $i$. For the PCC on weekdays and on weekends, $\rho^{wd}_{is}(t)$ and $\rho^{we}_{is}(t)$ are used, respectively. The values of PCC turn out in most cases to be positive (not shown here). This is mainly due to the typical weekly cycles of humans as mentioned before. To correct for such cycles, for each case of weekdays and weekends we consider de-seasoned event rates defined as
$$\tilde{\rho}_{is}(t) = \rho_{is}(t) - \frac{1}{S_i} \sum_{s'} \rho_{is'}(t),$$
where $S_i$ denotes the number of services that user $i$ has used. As shown in Figure 10, the values of PCC obtained for the de-seasoned event rates show similar and distinct behavior among users as well as between weekdays and weekends. For example, in the case of user , the strongly positive correlation between call and SMS usages on weekdays turns slightly negative on weekends. This result is consistent with the temporal patterns depicted in Figure . The positive (negative) correlation between services that are used at the same time (at different times) of the week can be interpreted such that those services are complementary (substitutive) with each other []. Then, we obtain and compare distributions of PCC over all users for each pair of services. The mean values for web-app and call-SMS pairs (app-email pair) are slightly positive (negative) on weekdays and become slightly negative (positive) on weekends. All other pairs have negative mean values.

Figure 10 Pearson correlation coefficients among service usages. Pearson correlation coefficients among service usages for users 5 and 81 and for all users (from top to bottom) are obtained from weekday (left) and weekend (right) event rates. Positive and negative correlations are represented by orange and gray lines, respectively.
The result for positive correlations is inconclusive due to the large standard errors of PCC up to .. However, for the pairs of services with large negative correlations, such as web-call and web-SMS pairs, we can argue that those services might be used in a substitutive way. In order to compare the correlations for weekdays and for weekends, we have conducted the Kolmogorov-Smirnov test. It is found that the distributions of PCC for weekdays and for weekends are significantly different for the pairs of web-app (p-value less than .), app-email (.), email-call (.), email-SMS (.), and call-SMS (.). This list of pairs contains all the pairs whose sign of the mean has changed from weekdays to weekends.
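The effect of de-seasoning on the correlation coefficients can be illustrated with the following Python sketch. It assumes that the de-seasoned rate of a service is its event rate minus the average over all services used by the same user in the same time bin; the helper names are hypothetical and the toy rates are not taken from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined for a constant series
    return cov / sqrt(vx * vy)


def deseason(rates):
    """Subtract the service-averaged rate of one user from each service.

    rates: dict service -> list of event rates per time bin (equal lengths)
    """
    services = list(rates)
    n_bins = len(rates[services[0]])
    mean = [sum(rates[s][t] for s in services) / len(services)
            for t in range(n_bins)]
    return {s: [rates[s][t] - mean[t] for t in range(n_bins)]
            for s in services}
```

Two services that merely share the same daily rhythm are positively correlated in the raw rates, but the correlation can change sign once the common cycle is subtracted.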
For a more detailed, i.e. event-based, analysis of correlations among service usages, we obtain the distribution of the time interval between two consecutive or simultaneous events of different services by the same user. Precisely, the time interval for a pair of services $s$ and $s'$ is defined by $t_{ss'} = t_s - t_{s'}$ with event timings $t_s$ and $t_{s'}$. As shown in the upper panels of Figure (a), distributions for some service pairs have a peak at a negative value of $t_{ss'}$ both for weekdays and for weekends. This indicates that the event of service $s'$ follows that of service $s$. On the other hand, distributions for other pairs of services do not show any distinct peaks, implying no temporal correlation. This time-ordering behavior could mean that one service usage might effectively induce another service usage. However, we cannot investigate such a process with our dataset. We summarize the results such that communication services, i.e. email, call and SMS, are followed by non-communication services, i.e. web and app, as depicted in Figure (b). We also obtain the distributions of time interval for different contexts. We find overall similar time-ordering behavior (not shown here), except that email is followed by web at Home and that app does not follow communication services abroad. Note that the event-based analysis cannot be directly compared to the analysis of aggregated weekly patterns.
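The event-based intervals can be collected as in the Python sketch below. It assumes that only directly consecutive events in a user's merged event sequence are compared and that each unordered pair of services is stored once, with the alphabetically first service playing the role of $s$; both conventions, like the function name, are choices made for this sketch.

```python
from collections import defaultdict


def cross_service_intervals(events):
    """Signed intervals t_s - t_s' between consecutive events of different services.

    events: list of (timestamp, service) for one user
    Returns {(s, s2): [t_s - t_s2, ...]} with s < s2 alphabetically.
    A negative value means that service s occurred first and s2 followed.
    """
    ordered = sorted(events)
    out = defaultdict(list)
    for (t1, s1), (t2, s2) in zip(ordered, ordered[1:]):
        if s1 == s2:
            continue  # same service: not a cross-service pair
        if s1 < s2:
            out[(s1, s2)].append(t1 - t2)
        else:
            out[(s2, s1)].append(t2 - t1)
    return dict(out)
```

A histogram of the values collected for one pair then shows on which side of zero the peak lies, i.e. which of the two services tends to come first.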

Clustering and overlaps in temporal patterns of service usage
As it turns out, the temporal patterns of service usage are diverse from one user to another, while some of them still show similar behavior. To investigate the similarity and diversity of weekly patterns for each service we apply the k-means clustering method [] to the weekly event rates $\rho_{is}(t) \equiv \{\rho^{wd}_{is}(t), \rho^{we}_{is}(t)\}$. To correct for the typical weekly cycles of each service (not of each user), we use the de-seasoned event rates as follows:
$$\hat{\rho}_{is}(t) = \rho_{is}(t) - \frac{1}{N_s} \sum_{j} \rho_{js}(t),$$
where $N_s$ denotes the number of users showing any activity in service $s$. We similarly define the service-averaged event rates for each user for the clustering, to be denoted by avg. In each case we set the number of clusters as $k = 10$ and the cluster index is denoted by $q = 1, \ldots, 10$. Clustering has been conducted , times with different initial conditions and here we present the result maximizing the quality of clustering or validity index, defined as the minimum inter-cluster distance divided by the sum of intra-cluster distances [].

Table 1 We summarize k-means clustering results for weekly patterns of service usages with k = 10. q and N_s denote the cluster index and the number of available users for service s, respectively.

Service  q=1  q=2  q=3  q=4  q=5  q=6  q=7  q=8  q=9  q=10  N_s
web       74    9    7    6    5    3    3    2    1     1  111
app       50   32   10    7    6    6    5    4    3     1  124
email     55    3    3    2    1    1    1    1    1     1   69
call      54   40   14    5    4    1    1    1    1     1  122
SMS       74   14   11    9    5    4    3    1    1     1  123
avg       64   21   16    6    5    5    4    1    1     1  124
The clustering results are summarized in Table 1 and only a few weekly patterns of dominant clusters are shown in Figure 12. Only one dominant cluster is found in each case of web and email usages, implying similar patterns among users. Weekly patterns of app, call, and SMS usages are clustered into more than one dominant cluster. Compared to the largest cluster ($q = 1$) of call usage, the second largest cluster ($q = 2$) can be characterized by larger activities in the weekday daytime and in the weekend morning. The behavioral difference between dominant clusters in SMS usage is also obvious. The largest cluster ($q = 1$) represents the evening-type users, while the second largest cluster ($q = 2$) represents the morning-type users on weekdays. In the case of service-averaged usage patterns, the second largest cluster ($q = 2$) shows larger (smaller) activity in the daytime on weekdays (on weekends) than the largest cluster ($q = 1$). To check the validity of the clustering results, we obtain the Pearson correlation matrices using the de-seasoned event rates. All the matrices support the k-means clustering results, see Figure 13. We also tested the effect of the number of means, $k$, on the clustering and found that the results are qualitatively similar apart from the number of small or outlying clusters.

Figure 12 k-means clustering results of users' weekly patterns. We have used k = 10 and plotted only a few dominant clusters with cluster size in the parenthesis, see Table 1 for details. The bin size was set to one hour.
Finally, in order to get insight into the overall structure of temporal correlations among users and services, we construct an overlap network based on the clustering results. This leads to the network of overlapping communities [], where nodes and link weights of the network represent users and their overlaps, respectively. Precisely, the behavioral overlap is defined as the number of services in which two users, say $i$ and $j$, belong to the same cluster as
$$O^B_{ij} = \sum_s \delta(q_{is}, q_{js}).$$
Here $q_{is}$ denotes a cluster index for user $i$'s service $s$, and the Kronecker delta function $\delta(q, q')$ gives 1 if $q = q'$ and 0 otherwise. Figure 14 shows the overlap network with  links of $O^B = 5$ and 4. The behavioral overlap $O^B = 5$ of a link, denoted by a thick black line, implies that the neighboring users belong to the same clusters for all services, i.e. they are fully synchronized. We find cliques consisting of only the fully synchronized users, which we call synchronized cores. The largest synchronized core with  users is closely related to the second largest synchronized core except for belonging to different clusters of call usage. These cores are also connected to many other users but not as a synchronized core. This agglomerate structure can be induced by the relatively homogeneous demographics of users in our dataset. However, we note that the clustering was applied to the de-seasoned event rates, from which the user-averaged temporal behavior has been subtracted. We compare the behavioral overlap network based on the clustering results to the communication network of users. The communication network can be constructed from the call and SMS datasets containing the information on communication partners. Only  out of  users and  links between users are identified. The topological overlap of a link $ij$ is defined as []
$$O^T_{ij} = \frac{|\Lambda_i \cap \Lambda_j|}{|\Lambda_i \cup \Lambda_j| - 2},$$
where $\Lambda_i$ denotes the set of neighbors of node $i$. $O^T_{ij}$ has a value of 1 if $i$ and $j$ have exactly the same neighbors except for themselves and it has a value of 0 if they do not have any neighbors in common. Figure  shows the overall positive correlation between behavioral and topological overlaps. It implies that connected users sharing more common neighbors show more similar weekly patterns of service usages. Thus, the behavioral overlap network based on the service usages can be used to reveal the communication network structure of users.

Figure 13 Pearson correlation matrices of users' de-seasoned event rates. These matrices support the validity of the k-means clustering results in Table 1. The user index has been sorted according to the corresponding cluster index and blank spaces are due to totally inactive users.
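Both overlaps are simple to compute once the cluster indices and the communication network are available, as the following Python sketch shows. How services unused by one of the two users enter the behavioral overlap is not specified here, so the sketch simply skips them; this choice and the function names are assumptions.

```python
def behavioral_overlap(q_i, q_j):
    """Number of services in which two users fall into the same cluster.

    q_i, q_j: dict service -> cluster index; a service a user never used
    is assumed to be absent from that user's dict and is not counted.
    """
    return sum(1 for s in q_i if s in q_j and q_i[s] == q_j[s])


def topological_overlap(neighbours, i, j):
    """Shared neighbours of linked nodes i and j over all their other neighbours.

    neighbours: dict node -> set of neighbouring nodes
    Returns 1.0 for identical neighbourhoods (apart from i and j themselves)
    and 0.0 when no neighbour is shared.
    """
    common = neighbours[i] & neighbours[j]
    others = (neighbours[i] | neighbours[j]) - {i, j}
    return len(common) / len(others) if others else 0.0
```

Plotting the behavioral overlap against the topological overlap for every link of the communication network gives the kind of comparison described above.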

Summary
We have investigated spatiotemporal correlations and temporal diversities of service usages by analyzing a handset-based dataset collected from 124 users for over 16 months. The dataset consists of locations and service usages. After constructing the precise spatiotemporal trajectory for each user based on the location dataset, we identify several meaningful places or contexts by means of the context detection method. As contexts, Home, Office, Other meaningful place, Elsewhere, and Abroad are considered. We showed how the context affects the service usage patterns of users, including their web domain visit (web), application (app), email, voice call (call), and short message service (SMS).

Figure 14 Overlap network constructed based on the clustering results for all services. Circle, square, and hexagonal nodes represent female, male, and unknown gender of users, respectively. Each black solid thick line denotes a link between users who belong to the same clusters for all services. Other colored lines denote the links between users who belong to the same clusters for all but one service: web (dashed thick blue), app (dotted thin red), email (dotted thick green), call (solid thick cyan), SMS (dashed thin violet), or due to the unused service by either user (solid thin gray). This figure was generated using Cytoscape v2.8.1 [41].
In this study we have found the similarity and diversity of weekly patterns among users and services, in terms of temporal correlations, time-ordering behavior between services, and the overlap network based on clustering. The services used at the same time (at different times) of the week lead to positive (negative) correlations between them, which can be interpreted as being complementary (substitutive) to each other. By conducting the event-based analysis instead of analyzing weekly patterns we observe the time-ordering behavior between services, such that communication services, i.e. email, call, and SMS, are followed by the non-communication services, i.e. web and app. Finally, the similarity and diversity of weekly patterns of service usages enable us to classify users into several different clusters, e.g. as characterized by the morning-type or evening-type usage patterns, except for the web and email usages. The behavioral overlap network constructed based on the clustering results can be used to reveal the communication or real social network structure of users. Our findings on the spatiotemporal correlations of service usage patterns for different contexts enable us to better understand the behavior of humans and what that implies. This is also important for better design of information and communications technology (ICT) enabled social environments and services. However, more detailed analysis with higher resolution is required to reveal the underlying mechanism or the origin of spatiotemporal correlations.