Estimation and monitoring of city-to-city travel times using call detail records

Whenever someone makes or receives a call on a mobile telephone, a Call Detail Record (CDR) is automatically generated by the operator for billing purposes. CDRs have a wide range of applications beyond billing, from social science to data-driven development. Recently, CDRs have been increasingly used to study human mobility, whose understanding is crucial e.g. for planning efficient transportation infrastructure. A major difficulty in analyzing human mobility using CDR data is that the location of a cell phone user is not recorded continuously but typically only when a call is initiated or a text message is sent. In this paper we address this problem, and develop a method for estimating travel times between cities based on CDRs that relies not on individual trajectories of people, but on their collective statistical properties. We apply our method to data from Senegal, released by Sonatel and Orange for the 2014 Data for Development Challenge. We turn CDR mobility traces into estimates of travel times between Senegalese cities, filling an existing gap in knowledge. Moreover, the proposed method is shown to be highly valuable for monitoring travel conditions and their changes in near real-time, as demonstrated by measuring the decrease in travel times due to the opening of the Dakar-Diamniadio highway. Overall, our results indicate that it is possible to extract reliable de facto information on typical travel times that is useful for a variety of audiences ranging from casual travelers to transport infrastructure planners.


Introduction
Mobile phones are ubiquitous, widely available and used all over the world. They have also proven to be an invaluable source of high-quality data for studying different aspects of human societies [-], especially for development purposes [, ]. Such studies typically use Call Detail Record (CDR) data that are collected by telecommunication operators for billing purposes and therefore come with no extra cost or overhead. CDRs contain information on communication events such as calls or text messages, including the initiator and recipient, time of contact, and which cell tower is involved in the contact.
Studying CDRs has been especially helpful for developing and underdeveloped countries, where there is often a lack of systematic population-level data collection, or in the aftermath of natural disasters, where individuals are hard to reach or their location is unknown []. In recent years, global entities like UN Global Pulse have published reports on the use of these data for such purposes [], and telecommunication companies such as Orange and Telecom Italia have set up data challenges for scientists to study CDRs for development purposes [, ].
One important line of research applies CDR analysis to study human mobility and transportation and to develop methods that can be used e.g. for urban real-time monitoring [] and planning [], or for optimization of transportation infrastructure [-].
The time it takes to travel between different locations is a key constraint (and a key descriptor) of human mobility. Thus, up-to-date information on de facto travel times is not only of importance to travelers, but also for planning and governance of transport infrastructure. For instance, such information could be useful for monitoring road conditions or assessing access times to hospitals. The importance of de facto travel times is therefore evident, but their availability is still limited, in particular in developing countries, due to a lack of available resources required for such monitoring.
In practice, there are many ways to estimate and monitor travel times. Travel time information can be estimated with different techniques typically used by transportation engineers, ranging from magnetic loop detectors, automatic register plate recognition systems, and recording of GPS traces to traditional surveying methods [-]. Although some of these methods provide highly accurate real-time estimates on travel speeds and times, they typically require installation of physical equipment (e.g. magnetic loop detectors) which makes them resource-intensive, or they are labor-intensive (surveys). GPS-based methods require fewer resources. In particular, Google or other vendors of smartphone operating systems can easily leverage an existing population of suitable devices to collect raw data for computing travel time estimates. However, even though mobile phones are common in developing countries, smartphone penetration is typically low [], making the collection of data from GPS-enabled smartphones difficult in practice. Also, there may be no commercial interest in providing detailed, high-quality information on travel times in developing countries. Furthermore, algorithms used for extracting travel time estimates from raw data are not typically available.
One alternative approach to estimating travel times is to use data generated by communication between a mobile phone and the cellular network base stations. In developed countries, the most important use case is to provide accurate real-time information on traffic conditions and therefore most studies and commercial projects have been focused on this topic [-]. There are two main approaches for estimating travel time information using mobile phones and the cellular network. The first uses information generated when mobile phones move across the coverage areas of cell towers, which results in handovers and location area change events [-]. The second is based on signaling strengths and delays between a mobile phone and nearby cell towers [, ]. When done periodically, this results in GPS-like coordinate trajectories which can then be further refined into travel time distributions between origin-destination (OD) pairs [].
While many of these systems have also been commercially implemented [, ], such travel time estimation systems are not yet adopted worldwide and even in some developed countries they are still at the pilot phase [].
To summarize, in the context of travel time estimation, most studies focus on providing accurate, real-time estimates on specific road segments while less attention has been given to the travel times actually experienced by the users on longer trips. Additionally, most methods are either costly, labor-intensive or rely on infrastructures which are non-existent in developing countries. Therefore, there is a need for methods which (i) are inexpensive and are not resource- or labor-intensive, (ii) do not depend on complicated infrastructure or hardware, and (iii) provide accurate estimates of travel times experienced by users.
In this paper, we show that this can be achieved with the help of CDR data already stored for billing purposes, without the need for implementing more detailed hand-over or triangulation data analysis pipelines. The benefit of using billing data is that mobile operators always collect this data in a standardized format; there is even a dedicated software package for analyzing such data (http://bandicoot.mit.edu/) []. However, extracting accurate travel time information from CDRs is not a straightforward task because the location of a user is recorded only when the user initiates a call or sends a text message. Therefore, a single CDR-based mobility trajectory is typically very sparse in time, and cannot directly be used for estimating travel times between locations. However, when multiple mobility trajectories are pooled and analyzed as a whole, it turns out it is possible to produce reliable travel time estimates.
To this end, we have developed a method for automated extraction of typical travel times between cities from CDR data. Due to the simplicity and low computational cost of our method, we are immediately able to scale it up to the country level instead of the more local scales typical for other methods. The method aims at providing an overall view on travel times between cities and it enables monitoring of travel times and conditions in the long term. It has been especially designed for developing countries where reliable information on travel times and transport infrastructure is limited or not available at all. Unlike some of the above-mentioned methods, it is not designed for producing real-time traffic speed estimates for specific road segments; however, as we show with an example piece of Senegalese highway, it does allow detecting sudden changes in travel times.
The data we analyze originates from Senegal, for which Orange and Sonatel have provided anonymized CDR-based mobility data sets in conjunction with the 'Data for Development Challenge' (D4D Challenge) []. To show our method's performance in practice, we compare our results to existing travel time information available from alternative sources, such as the travel times provided by Google. Furthermore, to demonstrate that the method is capable of monitoring changes in travel conditions in near-real-time, we estimate how much the opening of the Dakar-Diamniadio highway dropped the typical travel times between the capital Dakar and the nearby city of Pout.

Data
In this study, we have used  anonymized mobility data sets provided by Orange and Sonatel for the  D4D Challenge []. Each set contains ∼, mobility traces for a two-week time span; a mobility trace contains the cell tower IDs and time stamps of calls and text messages made by one anonymized customer. In the provided data set, users whose traces span less than % of the days in a given two-week period have been filtered out, together with users who have more than , weekly events and likely correspond to non-human users such as machines sending text messages. Both filtering processes have been performed by the D4D Challenge organizers, i.e. at data source. As shown in Figure (A), after the filtering most users have been observed between  and , times during a two-week time span. Because of privacy and commercial reasons, only approximate coordinates are given for the locations of the cell towers, and the time resolution of data is restricted to  minutes. For a more detailed description of the data set, see Ref. [].
Figure 1 Mobility trace lengths n_t and inter-observation time counts between origin-destination city pairs. In Panel A, we show the logarithmically binned distribution of the number of data points in the users' CDR-based mobility traces. The distribution shows that most two-week mobility trajectories have between 10^1 and 10^3 data points. In Panel B, we show the complementary cumulative distribution (1-CDF) of the number of inter-observation counts for all origin-destination city pairs. Note that a large number of city pairs (∼10%) have zero inter-observation times, which is partially due to some cities not being allocated to any (constantly) active cell tower.
No data are perfect, and this data set is no exception. The biggest problem arises from the fact that the data only contains the locations of the cell towers at the end of the data collection period. Thus, changes in the cell tower or, to be precise, in the locations of the base transceiver stations during the year can go unnoticed, and cause errors in the data. To reduce errors caused by cell towers (or their IDs) whose location has changed and by cell towers that were introduced during the time span of the data set, we have only included in our analysis those cell towers that were associated with at least one CDR entry on each day of the time span covered by the data. This white list of cell towers was created using another data set that contained the hourly numbers of calls and text messages sent and received at each cell tower. In total the white list contained , cell towers out of the total number of , provided in the data set. Although this tower-level filtering of the data has helped to reduce errors, it is still apparent in some of the results that the source data comes with erroneous tower locations.
Possible problems with tower locations are also mitigated by focusing on cities instead of individual cell towers. To this end, we obtained a list of  major Senegalese cities and their geo-coordinates from www.tageo.com []. With the help of this information and the provided cell tower locations, we assigned a set of cell towers to each city, such that each cell tower is assigned to its closest city whenever their distance is at most  km. Note that two cities (Wassadou and Ourossogui) out of the total of  cities were not assigned any cell towers. The locations of the cities and their associated cell towers are displayed in Figure (A).
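The tower-to-city assignment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of the haversine great-circle distance, the coordinates, and the `max_km` value are all assumptions made for the example (the actual distance threshold is the one stated in the text).

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def assign_towers_to_cities(towers, cities, max_km):
    """Assign each tower to its closest city, unless that city is farther than max_km.

    towers: dict tower_id -> (lat, lon); cities: dict city_name -> (lat, lon).
    Returns dict city_name -> sorted list of tower IDs (a city may get no towers).
    """
    assignment = {name: [] for name in cities}
    for tower_id, (tlat, tlon) in towers.items():
        closest, dist = min(
            ((name, haversine_km(tlat, tlon, clat, clon))
             for name, (clat, clon) in cities.items()),
            key=lambda pair: pair[1],
        )
        if dist <= max_km:
            assignment[closest].append(tower_id)
    return {name: sorted(ids) for name, ids in assignment.items()}

# Made-up coordinates and threshold, for illustration only.
cities = {"A": (14.70, -17.45), "B": (14.79, -16.93)}
towers = {1: (14.71, -17.44), 2: (14.80, -16.95), 3: (15.50, -16.00)}
tower_sets = assign_towers_to_cities(towers, cities, max_km=10.0)
print(tower_sets)
```

Tower 3 lies far from both hypothetical cities and is therefore left unassigned, which mirrors how a city can end up with no towers at all.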

Determination of typical travel times
Given two cities i and j and their corresponding sets of cell towers I and J, we say that user u has made a trip from city i to city j whenever the mobility trajectory of u first contains one of the cell towers in I, and at a later point one of the cell towers in J. Between the start and end of a trip, a user can visit any other cities and cell towers, but cannot visit any cell towers corresponding to the origin or the destination city. Thus a trip from i to j consists of a series of time-ordered observations (time, user, tower ID), where the first tower ID belongs to the set I and the last tower ID to the set J. The inter-observation time corresponding to each trip is then defined simply as the time between the first and last observation. Note that a user can be simultaneously on multiple trips. A schematic example of a city-level trajectory, and the resulting inter-observation times are presented in Figure . In Figure (B) we show the pooled distribution of the number of extracted inter-observation times for the  ×  OD-pairs. To estimate the typical travel time from city i to city j, we pool all inter-observation times from the mobility trajectories of different users, and investigate their distribution. In theory, the shortest observed inter-observation time would be indicative of how fast one can travel between the two cities. However, the shortest inter-observation time may not represent the typical travel time between these two locations and it is also particularly sensitive to any errors in the data. Because of this, we focus on the peak of the inter-observation time distribution instead.
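One way to read the trip definition above is sketched below for a single user. The function name and the input layout (a time-ordered list of `(time, city)` pairs, with `None` for towers that belong to no city) are assumptions made for this illustration; the authors' own code may organize the computation differently.

```python
from collections import defaultdict

def inter_observation_times(trajectory):
    """Extract inter-observation times from one user's city-level trajectory.

    trajectory: time-ordered list of (time_in_minutes, city) tuples, where city
    is None when the observed tower is not assigned to any city.
    Returns dict (origin, destination) -> list of inter-observation times.
    """
    last_seen = {}                 # city -> time of the latest observation there
    times = defaultdict(list)
    for t, city in trajectory:
        if city is None:
            continue
        prev_at_dest = last_seen.get(city, float("-inf"))
        for origin, t_origin in last_seen.items():
            # A trip origin -> city counts only if the user has not been seen in
            # the destination after the latest observation in the origin. Using
            # the latest origin observation rules out revisits of the origin.
            if origin != city and t_origin > prev_at_dest:
                times[(origin, city)].append(t - t_origin)
        last_seen[city] = t
    return dict(times)

# Toy trajectory: seen in A at 0 min, B at 60 min, C at 100 min, A again at 200 min.
example = [(0, "A"), (60, "B"), (100, "C"), (200, "A")]
result = inter_observation_times(example)
print(result)
```

In the toy trajectory the user is on the trips A to C and B to C at the same time, which illustrates the remark that one user can be on multiple trips simultaneously.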
For accurately estimating the location of the peak, some smoothing of the inter-observation distribution is necessary as the original distribution of data can fluctuate a lot, even though our data is already binned to  minute intervals due to the restricted temporal resolution of  minutes. Smoothing is of most importance when an OD-pair has a low number of inter-observation times (see Figure  for an example).
It is also worth noting that travel can take place using different modes of transportation which can cause multiple peaks in the inter-observation time distribution. This can clearly be seen in Figure (C), where there is first a peak corresponding to travel by air followed by a peak corresponding to travel by sea and land. In this study we focus on the most typical travel modes. As travel by air is not very common within Senegal [], we focus on travel times corresponding to straight-line travel speeds of less than  km/h which should allow for all different travel modes by sea and land but filter out air traffic. Note that given the typical road and travel conditions and that the limit is on straight-line speeds, this manually chosen limit is rather generous and will almost certainly not exclude any actual land travel trips. This thresholding also helps to further filter out some erroneous results due to irregularities in the source data.
Figure 3 A schematic representation of one user's mobility trajectory and the resulting inter-observation times. On the left, the mobility trajectory of one person is visualized both as a spatial representation (top) and as a timeline (bottom). In the spatial representation each circle corresponds to a city, and an arrow corresponds to a movement from one city to another. The ordering of the movements is indicated by ordinal numbers and the times when the user has been observed in each city are shown below the name of each city. The timeline presentation below shows the same information in a more compact form. On the right, we have also listed all inter-observation times that can be computed from the trajectory.
Figure 4 Importance of smoothing. In the figure we show an example OD pair with a low number of extracted inter-observation times. The black dots show the original distribution of data consisting of in total 1,609 inter-observation times. The green curve shows the smoothed estimate of the probability density function, while the red vertical line denotes the peak estimate and the blue vertical line denotes the lower bound estimate. As is evident from the figure, the original data fluctuates a lot and the general trend of the data is better visible in the smoothed distribution. Furthermore, our decision rule for the peak now selects the first peak of sufficient magnitude, which is a more reasonable estimate than the even higher peak located around 750 minutes.
Because of various biases in the data (e.g. different mobile phone usage frequencies leading to different waiting times before first call is made at destination, or tower location offsets) there is no guarantee that the location of a peak in an inter-observation time distribution would precisely correspond to a typical travel time between two cities. Thus, information on the peak's width is also important. The right-hand side of the peak typically decays slowly as there is no natural limit to a trip's duration. Therefore we focus on the left-hand side of the peak and its lower bound. If the position of the peak is considered as an estimate of the typical travel time, the lower bound measures the best case, travel under optimal conditions. The lower bound is computed as follows: given a peak's location t_p, the location of its lower bound t_l is defined as the largest inter-observation time such that (i) t_l < t_p, and (ii) the value of the smoothed inter-observation time distribution is lower than or equal to half the peak height.
In detail, our analysis pipeline for estimating typical travel times between cities is as follows:

1. Compute inter-observation time distributions
Loop through the CDR data, compute inter-observation times for each origin-destination city pair, and pool the results into inter-observation distributions.

2. Smooth the distributions
To smooth the inter-observation time distributions, use a standard Gaussian kernel with a standard deviation σ corresponding to  minutes. The bandwidth of  minutes was chosen as it was found to allow reasonable travel time estimations for city pairs with fewer trips, while not oversmoothing the original data. The smoothed density estimates P_s(t) can then be obtained from the original inter-observation time distribution P_o(t):
$$P_s(t) = C \sum_{t'=0}^{t_{\max}} P_o(t') \exp\left(-\frac{(t - t')^2}{2\sigma^2}\right),$$
where t_max is the largest inter-observation time permitted by the data ( weeks) and C is a normalization coefficient guaranteeing that the final smoothed distribution P_s(t) is a valid probability density function. The smoothed density estimates are evaluated at  min intervals.

3. Find all maxima
Find all local maxima of the smoothed probability density functions whose corresponding straight-line travel speed does not exceed  km/h (to filter out air traffic and errors in the original data). This can be done simply by going through the elements of the vector of smoothed density estimates: an element is a local maximum when its value is higher than those of its neighbors.

4. Detect the peak corresponding to typical travel time
From each smoothed probability density function, select the peak with smallest travel time such that the height of the peak is at least . times the height of the largest peak fulfilling the travel speed restriction. Typically this condition results in simply choosing the highest peak of the distribution, but with origin-destination city pairs with a low number of observations this condition was found to provide more robust results (see Figure ).

5. Compute the lower bound estimate
Select the closest point in time smaller than the peak time such that the height of the distribution at this point is smaller than or equal to half the height of the detected peak. In case the inter-observation time distribution does not fall to half peak height on the peak's left-hand side, set the value for the lower bound to  minutes. Typically, such cases are due to irregularities in the source data.
Note that because there is no ground-truth calibration data, we are forced to set some parameters of the method on the basis of reasonable assumptions, instead of adjusting their values based on calibration. In the following sections, we will nevertheless discuss possible causes of estimation biases. In particular, we will investigate the biases caused by varying the smoothing bandwidth.
The code implementing the above analysis pipeline for extracting typical travel times from CDR data is freely available at https://github.com/rmkujala/ddttimes.

Estimation biases
Our peak and lower-bound estimates are, of course, prone to different kinds of biases due to our definition of inter-observation times. First, each inter-observation time typically includes not only the actual time of travel but also periods before and after, as calls are not made exactly on departure or arrival. This bias is however difficult to correct, as it is not known how mobile phone usage and travel behavior are coupled (and inter-call times are typically very broadly distributed, see e.g. []). Moreover, we do not filter out detours taken by travelers which cause the long tails in the inter-observation times as seen in Figure . On the other hand, the range of cell towers can cover areas that are far from the location of a city, which can shorten inter-observation times between pairs of cities. These individual biases can thus sum up to a bias that can be either negative or positive, and that is difficult to estimate using CDR data alone. However, if calibration data e.g. based on GPS recordings of individuals were available, it should be possible to correct for these different biases. Nevertheless, our example cases will show that our estimates tend to be close to quoted travel times found in the literature. In any case, it seems reasonable to assume that the bias remains relatively constant for any pair of cities, and thus when the method is used for monitoring changes in travel times, possible biases no longer matter.

Effect of the number of samples on the estimates
As our method relies on the distribution of inter-observation times between two cities, it is important to know how much data is required for reliable estimates. To get some idea of the amount of data required for robust travel time estimation, we investigated how the estimation error decreases with the number of data points. This was done by bootstrap resampling the original inter-observation time distributions so that bootstrap sample sizes ranged from  up to the total number of data points in the sample. For each sample size, we calculated , bootstrap estimates and computed the median as well as the th and th percentiles of the bootstrap estimate distributions. Here, we report the results obtained for the two city pairs presented in Figure  ('Kaolack to Tambacounda' and 'Dakar to Ziguinchor'). In addition, we also investigated two origin-destination city pairs ('Dakar to Thies' and 'Dakar to Kaolack') for which a very large number of inter-observation times (>  ) were available when data were aggregated over the entire data collection period. The results are shown in Figure , and they illustrate two main points: First, our estimates on the location of the peak are relatively unbiased when at least , inter-observation data points are available, as the median of the bootstrap estimates remains close to the final value of the full distribution after this limit. Second, based on the th and th percentiles of the bootstrap distributions, to reach an acceptable  min estimation accuracy we need of the order of , data points. In our data set, we find  origin-destination city pairs (.% of all origin-destination pairs) that fulfill this criterion when analyzed over the whole year. For the full distribution on the number of inter-observation data points, see Figure (B).
Naturally, as the shape of inter-observation time distributions differs across city pairs, the amount of data required for accurate travel-time estimates may vary. As a rough rule of thumb, we nevertheless conclude that approximately , inter-observation times are required for each origin-destination pair for obtaining reliable estimates. It is also worth stressing that this rule of thumb is specific to this study only, as the mobility data used here came in two-week chunks limiting the longest possible observable inter-observation times accordingly.
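The bootstrap procedure described in this section can be sketched as follows. This is an illustrative outline under stated assumptions rather than the authors' implementation: `binned_mode` is a simple stand-in for the smoothed-peak estimator, the synthetic samples are drawn from an arbitrary Gaussian, and the sample sizes, number of resamples and percentile levels are placeholder values.

```python
import math
import random
import statistics

def binned_mode(samples, step=10):
    """Stand-in estimator: centre of the most populated bin of width `step`."""
    return statistics.mode([round(s / step) * step for s in samples])

def bootstrap_summary(samples, estimator, sample_size, n_boot,
                      lower_pct, upper_pct, seed=0):
    """Return (lower percentile, median, upper percentile) of bootstrap estimates."""
    rng = random.Random(seed)
    estimates = sorted(
        estimator(rng.choices(samples, k=sample_size))   # resample with replacement
        for _ in range(n_boot)
    )

    def percentile(p):
        # Nearest-rank percentile on the sorted bootstrap estimates.
        rank = max(1, math.ceil(p / 100 * len(estimates)))
        return estimates[rank - 1]

    return percentile(lower_pct), statistics.median(estimates), percentile(upper_pct)

# Synthetic inter-observation times in minutes (made up for illustration).
data_rng = random.Random(1)
samples = [max(0.0, data_rng.gauss(200, 40)) for _ in range(5000)]

for size in (100, 1000, 5000):
    low, med, high = bootstrap_summary(
        samples, binned_mode, size, n_boot=200, lower_pct=5, upper_pct=95
    )
    print(size, low, med, high)
```

Plotting the three returned values against the bootstrap sample size gives the kind of convergence picture discussed above: the spread between the percentiles indicates how many inter-observation times are needed for a given accuracy.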

Effect of the width of the smoothing kernel
Next, we discuss how the width of the Gaussian kernel used for smoothing the inter-observation time distributions affects the results. In this work we report results with a Gaussian kernel of width that corresponds to  min in standard deviation, which we found to yield reasonable results. In general, it would be good to select kernel width adaptively e.g. using cross validation. However, given that the time resolution ( min) of the data was artificially heavily limited, this would not have been very straightforward.
We have nevertheless investigated how the smoothing bandwidth used affects our results. To this end, we computed the peak and lower bound estimates with a range of different smoothing bandwidths for origin-destination city pairs that had at least , inter-observation times within the time span of the data. We arrived at two main conclusions (see Figure ): First, our estimates seem to stabilize when the width of the smoothing kernel reaches  min (this becomes more emphasized when the threshold is set to , inter-observation times). Second, our lower bound estimates tend to decrease and peak estimates increase when the smoothing bandwidth increases. This is due to the skewness of the inter-observation time distributions: as the right tail of the distribution is typically fatter than the left tail, smoothing systematically shifts the peak to the right and the lower bound estimate to the left.
Figure 6 The effect of the smoothing bandwidth on the travel time estimates. In Panel A we show the distribution of the lower bound estimates normalized by subtracting the lower bound estimate obtained with smoothing bandwidth of 30 minutes. In Panel B we show the distribution of similarly normalized peak estimates. In both panels the outer shaded area denotes the 5th and 95th percentile, the inner shaded area denotes the 25th and 75th percentiles of all estimates, and the solid lines correspond to the median of all the normalized estimates. In these plots, we only show data for OD pairs with at least 10,000 data points. Furthermore, the results for the lower bound estimates are only based on those results for which we are able to identify the lower bound with all different bandwidth values. (With some OD pairs the inter-observation time distribution never falls to half of the identified peak's height on the peak's left-hand side due to data irregularities.) Panel B shows that most peak estimates tend to stabilize when the width of the smoothing kernel reaches 20 min, as is shown by the 5th percentile of the normalized estimates. Additionally, we note that smoothing causes a systematic bias to the results: the larger the kernel width, the larger are the peak estimates and the smaller are the lower bound estimates.
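The direction of this smoothing bias can be reproduced on a toy example. The sketch below is purely illustrative: the right-skewed distribution (nothing before 100 minutes, then a slow exponential decay), the 10-minute grid, and the two bandwidths are made-up choices, and the helper names are not taken from the authors' code.

```python
import math

def smoothed(p_o, step, sigma):
    """Unnormalized Gaussian smoothing of a distribution sampled every `step` minutes."""
    return [
        sum(p * math.exp(-((i - k) * step) ** 2 / (2 * sigma ** 2))
            for k, p in enumerate(p_o))
        for i in range(len(p_o))
    ]

def peak_and_lower(density, step):
    """Global peak time and the closest earlier time at or below half the peak height."""
    peak = max(range(len(density)), key=density.__getitem__)
    lower = next(i for i in range(peak - 1, -1, -1) if density[i] <= density[peak] / 2)
    return peak * step, lower * step

step = 10
# Right-skewed toy distribution: sharp onset at 100 min, fat tail to the right.
p_o = [0.0 if k * step < 100 else math.exp(-(k * step - 100) / 100) for k in range(100)]

narrow = peak_and_lower(smoothed(p_o, step, sigma=10), step)
wide = peak_and_lower(smoothed(p_o, step, sigma=60), step)
print("sigma=10:", narrow)
print("sigma=60:", wide)
```

With the wider kernel, the peak moves to a later time and the half-height point moves to an earlier time, which is the same qualitative behavior as reported for the real inter-observation time distributions.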

City-to-city travel times within Senegal
We begin by reporting results on travel time estimates extracted for all origin-destination pairs for which at least , inter-observation times were discovered.In the supplementary web-page (see Additional file ), we also report our estimates and bootstrap error bounds for all origin-destination city pairs for which at least , inter-observation times were available.
As any official information on times of travel between Senegalese cities is scarce and hard to find, there is no obvious ground truth available for validating our results. Nevertheless, to give our estimates general credibility, we performed two different sanity checks. First, the travel time estimates should be symmetric: the estimated travel time from city i to j should be approximately equal to the travel time from city j to i. As shown in Figure (A), our estimates do generally fulfill this condition. Second, if we take up a simplistic assumption of constant average travel speed throughout the country, we would expect an approximately linear relationship between the estimates and the straight-line (geodesic) distances between cities. The results shown in Figure (B) agree with this hypothesis for most of our estimates apart from some data points, including a few clearly erroneous ones. By manual inspection of the source data, we found out that the erroneous estimates are due to data irregularities: even after our data filtering pipeline some cell towers seem to suddenly change their location. Thus whenever the data itself is of reasonable quality, these two sanity checks demonstrate that our method yields sensible results.
Figure caption: In Panel A, we present a scatter plot of estimated travel times from i to j versus times from j to i. It is seen that the estimates are rather symmetric, as they should be. Note that each pair of cities corresponds to a single point in the plot, and the selection of which of the cities corresponds to i and which to j is made arbitrarily. In Panel B, we show how the estimates scale with distance. Overall, we can observe an approximately linear trend. There are a few points that correspond to clearly infeasible estimates due to erroneous data that seem to have very long travel times (500-800 minutes) even though the distance between city pairs is low (0-150 km).
To illustrate how well our estimates align with the real world, we consider two examples where travel time estimates are available from elsewhere. According to Lonely Planet [], the travel time from St.-Louis to Dakar is roughly five hours with frequent sept-place taxis. Our peak estimate of  h  min matches this extremely well. Furthermore, our estimate of the travel time from Dakar to Ziguinchor (also discussed in Figure (C)) equals  h  min, which matches the approximate  h travel time from Dakar to Ziguinchor by ferry [].

Comparison with Google's estimates
To compare our estimates with existing routing engines, we also obtained travel time estimates between all city coordinates from Google's Distance Matrix API [] using the default parameters of the service. The comparison of our and Google's estimates is shown in Figure . Overall, our results and those obtained through Google's API are roughly linearly dependent. Compared to our estimates, Google's estimates tend to be lower, especially when longer travel times are considered, suggesting that Google may effectively overestimate the typical travel speed in Senegal. However, this is difficult to verify, as Google has not made public how they produce their travel time estimates. It is nevertheless clear that the baseline for Google's estimates originates from the road network data that includes information on road network types and speed limits. In many developed countries, Google is known to also track and store location data from the users of its mobile operating system []. This however requires that the people in the monitored area have access to smartphones, so that e.g. GPS location data can be transferred from the phone to Google. In Senegal, smartphone penetration is still low (% of adult users,  []), which can make it challenging for Google to calibrate their travel time estimates. Naturally, the differences between our and Google's travel time estimates can also originate from biases in our source data due to artifacts such as erroneous cell tower coordinates, varying coverage ranges of cell towers, and the non-continuous tracking of individuals. Also, the goal for Google's estimates may be different from ours: Google may focus on providing estimates of the travel time between places without including any additional delays e.g. for breaks. Thus, due to the lack of established ground-truth data, we cannot claim either of the methods to be superior; all that is certain is that there is a systematic difference.

Travel speed maps help to pinpoint anomalous travel times
To demonstrate how our results could be used for monitoring travel conditions in a country, we compute the speed of travel between Dakar and other Senegalese cities, assuming that travel follows straight lines and that the travel times equal the peak estimates. The result is visualized in Figure 9(A). In general, the greater the distance from Dakar, the higher the speed of travel. As Dakar is known for its congestion, this result is in line with expectations, although possible systematic biases due to e.g. smoothing of the distribution can affect the results.
Nevertheless, we can also find an exception to the rule: the travel speed from Dakar to Ziguinchor is lower than to many other cities that are at a similar distance from Dakar. This is most likely due to the ferry travel between these two cities, as land travel requires one to cross Gambia and travel on bad roads.
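The straight-line speed computation can be sketched as follows. This is an illustrative sketch under stated assumptions: the great-circle (haversine) distance stands in for the straight-line distance, the city coordinates are approximate values supplied here for illustration (they are not taken from the data set), and the function names are hypothetical. The 870-minute travel time is the second peak reported in the caption of Figure 2.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def straight_line_speed_kmh(distance_km, travel_time_min):
    """Speed in km/h implied by a distance and a travel time in minutes."""
    if travel_time_min <= 0:
        raise ValueError("travel time must be positive")
    return distance_km / (travel_time_min / 60.0)


# Approximate city-centre coordinates (assumed for illustration only).
dakar = (14.69, -17.45)
ziguinchor = (12.58, -16.27)

distance = haversine_km(*dakar, *ziguinchor)
speed = straight_line_speed_kmh(distance, travel_time_min=870)
print(f"{distance:.0f} km at {speed:.1f} km/h")
```

Because the true route is always at least as long as the straight line, speeds computed this way are lower bounds on the actual speed along the route.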

Monitoring travel times: case study on opening of the Dakar-Diamniadio toll highway
Our method also allows near-real-time monitoring of travel times: all that needs to be done is to feed CDRs to the peak detection algorithm periodically, say, daily or weekly. This makes it possible to maintain up-to-date travel time estimates and, in particular, to detect expected or unexpected changes in travel times. In Figure 10 we show monthly and daily travel time estimates from Dakar to Pout (both peaks and lower bounds). From the monthly and daily estimates it becomes clear that there is a drop of  to  minutes in the typical travel time around the time when the new highway was opened. Not surprisingly, the daily estimates are noisier than the monthly ones, as they are based on fewer inter-observation times. Nevertheless, the drop in travel times can be pinpointed with high accuracy to match the opening day of the highway.
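The monitoring loop can be sketched in a few lines. The sketch below rests on assumptions that are not taken from the paper: the peak is taken as the global maximum of a fixed-bandwidth Gaussian kernel density estimate evaluated on a regular grid, and the change is located by the split of the time series that maximizes the difference between the mean estimates before and after it. The function names, the bandwidth, and the sample values are hypothetical.

```python
import math


def kde_peak(samples, bandwidth, grid_step=1.0):
    """Grid point (in the units of `samples`) where a Gaussian KDE is highest."""
    if not samples or bandwidth <= 0:
        raise ValueError("need samples and a positive bandwidth")
    lo, hi = min(samples), max(samples)
    n_steps = int((hi - lo) / grid_step) + 1
    best_t, best_density = lo, -1.0
    for k in range(n_steps):
        t = lo + k * grid_step
        # Unnormalized density; the normalization does not move the maximum.
        density = sum(math.exp(-0.5 * ((t - s) / bandwidth) ** 2)
                      for s in samples)
        if density > best_density:
            best_t, best_density = t, density
    return best_t


def largest_shift(estimates):
    """Return (index, drop): the split with the largest change in mean.

    `drop` is mean(estimates[:index]) - mean(estimates[index:]), so a
    positive value means that travel times became shorter.
    """
    if len(estimates) < 2:
        raise ValueError("need at least two estimates")
    best_index, best_drop = 1, 0.0
    for i in range(1, len(estimates)):
        before = sum(estimates[:i]) / i
        after = sum(estimates[i:]) / (len(estimates) - i)
        drop = before - after
        if abs(drop) > abs(best_drop):
            best_index, best_drop = i, drop
    return best_index, best_drop


# Hypothetical inter-observation times (minutes) for three consecutive days.
daily_samples = [
    [58, 60, 60, 62, 90],
    [59, 61, 61, 63, 120],
    [43, 45, 45, 47, 80],
]
daily_peaks = [kde_peak(day, bandwidth=3.0) for day in daily_samples]
index, drop = largest_shift(daily_peaks)
print(f"Largest change before day {index}: {drop:.1f} min")
```

In practice the bandwidth trades noise against resolution, which is one reason why daily estimates based on few inter-observation times fluctuate more than monthly ones.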

Discussion
In this paper we have introduced a method for extracting typical travel times between cities from CDR-based mobility data. To demonstrate the usefulness of the method, we have applied it to data from Senegal, released for the 2014 Data for Development (D4D) Challenge, and shown that it produces feasible estimates even though the spatial and temporal resolution of the data has been artificially reduced. Compared to Google's Distance Matrix API estimates, our approach yields estimates that are on average longer, suggesting that Google may be overestimating travel speed in Senegal. We have also discussed how the method can be used for monitoring changes in travel speeds in near real-time, as demonstrated by measuring the impact of opening a new highway on travel times.
For our method to work properly, a sufficient amount of data is required, especially when accurate travel time estimates are called for (for monitoring changes only, a higher level of noise is tolerable). However, even when only using a sample of , individuals out of Senegal's total population of over  million inhabitants, we were able to obtain reasonable travel time estimates between many Senegalese cities. Had the operator's CDR data been used in full, we would also have been able to provide estimates for pairs of cities between which little traffic takes place.
Our method would also benefit from better spatial and temporal accuracy of data. An increased spatial resolution would allow better allocation of cell towers to cities, and if the temporal resolution of the data were improved, within-city analyses based on individual cell towers could also become feasible. Note that for CDRs without artificial restrictions this is typically the case: positions of towers are accurately known, and data is recorded at a time resolution of one second. Further, were the data augmented with data on handovers, location area changes, and Internet usage, the estimates would become even more accurate.
As some of our results point out, errors in data can give rise to corrupted results. While simple filtering of the data removed some of the errors, others persisted. Most likely these errors could have been avoided at the source: they have to do with changing base station locations, base station IDs that have been switched between stations, or other technical issues at the operator's end.
In addition to improving the quality, amount, and accuracy of the source data, the method itself could also be tuned for more accurate estimates. For instance, one could make use of information on road network characteristics or distances between cities to regularize the estimation problem, e.g. using Bayesian methods. Moreover, the estimation process could be approached in a more holistic manner using ideas originating from the triangle inequality: if we know the typical travel times t_AB and t_BC, it is likely that the typical travel time t_AC is smaller than or of the same order of magnitude as t_AB + t_BC. Thus, if there are few trips observed between cities A and C, the travel time t_AB + t_BC could be used as a soft constraint for estimating the travel time t_AC. This approach could also help in the automatic detection of erroneous results that are due to irregularities in the source data.
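The triangle-inequality idea can be illustrated with a simple consistency check. This is a sketch of one possible realization, not a procedure evaluated in the paper: for each directed pair (A, C) it computes the smallest two-leg time t_AB + t_BC over all intermediate cities B, and flags the pair when the direct estimate exceeds that bound by more than a tolerance factor. The function names, the `slack` parameter and its value, and the example travel times are hypothetical.

```python
def two_leg_bound(times, a, c):
    """Smallest t_AB + t_BC over intermediate cities B, or None if unavailable.

    `times` maps directed city pairs (origin, destination) to minutes.
    """
    cities = {city for pair in times for city in pair}
    candidates = [
        times[(a, b)] + times[(b, c)]
        for b in cities
        if b not in (a, c) and (a, b) in times and (b, c) in times
    ]
    return min(candidates) if candidates else None


def flag_suspicious(times, slack=1.5):
    """List (a, c, t_ac, bound) where t_ac exceeds slack * two-leg bound."""
    flagged = []
    for (a, c), t_ac in times.items():
        bound = two_leg_bound(times, a, c)
        if bound is not None and t_ac > slack * bound:
            flagged.append((a, c, t_ac, bound))
    return flagged


# Hypothetical travel time estimates in minutes (NOT values from the paper).
estimates = {
    ("A", "B"): 60,
    ("B", "C"): 90,
    ("A", "C"): 700,  # implausibly long compared with going via B
}

for a, c, t_ac, bound in flag_suspicious(estimates):
    print(f"{a} -> {c}: {t_ac} min exceeds two-leg bound of {bound} min")
```

The same bound could serve as the soft constraint mentioned above for sparsely observed pairs, for example as the location of a prior on t_AC; the tolerance factor leaves room for the fact that the bound is only expected to hold approximately.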
It is also worth pointing out that the estimates produced by our approach are bound to suffer from different biases, for reasons ranging from varying cell tower ranges to offset times between cell phone usage and traveling. The effect of these biases could be diminished with more accurate calibration data on human mobility, such as GPS location data recorded for a sample of users.
Finally, let us discuss certain benefits of the method. The method is easy to implement, and it requires neither massive deployments of sensors, nor GPS traces from smartphones, nor large-scale computational resources, as the analyses can be run even on a standard desktop computer. The method also avoids privacy concerns that are often associated with CDR data, as it can operate on chunks of anonymized data (all that is required is a series of locations and times) and produces only aggregate data that does not violate the privacy of individual users.
The real value of any method nevertheless comes from its use in practice. For travelers, both locals and tourists, the information on typical travel times is valuable as it helps in planning trips. For transport infrastructure planners, the method provides a means for spotting possible bottlenecks in a transportation network. Also, as the method lends itself to near real-time monitoring of travel conditions, it can be used for assessing changes in travel times, either locally or country-wide. Such changes can result locally from special events, deteriorated (or improved) road conditions, or disturbances such as illegal check point harassment. On a larger scale, one could envision detecting disruptions to travel patterns caused by disasters, violent conflicts, or outbreaks.
To summarize, in this work we have demonstrated that it is possible to extract and monitor travel times using CDR data when high-quality data are available in sufficient amounts. Given that the cost of applying the method in practice is low and the potential gains are significant, we hope to see our method implemented in practice, especially in developing countries where accurate travel time information is often not readily available.

Figure 2
Figure 2 Extraction of typical travel times between Senegalese cities. Panel A: The cell towers assigned to each city are shown in the same color. Black dots indicate cell towers that were not assigned to any city. Light blue background indicates sea, and white background land. Note also that the southern and northern parts of Senegal are separated by Gambia, for which road data is not shown. Panels B and C: Two examples of inter-observation time distributions (B: from Kaolack to Tambacounda; C: from Dakar to Ziguinchor). The empirical inter-observation time distributions are shown with black dots, and the green curve represents our kernel density estimate of the inter-observation time probability density. The red vertical lines indicate the estimated typical travel time corresponding to the peak, and the blue vertical lines indicate the lower bound estimates. In Panel B, we see a typical pattern with a single clear peak located at 275 minutes (4 h 35 min). This gives us an estimate of the typical travel time from Kaolack to Tambacounda. In Panel C, however, there are two peaks. The first peak is located at ∼100 minutes and presumably results from air traffic between Dakar and Ziguinchor. The second peak is located at 870 minutes (14 h 30 min), which matches well the time it takes to reach Ziguinchor from Dakar by ferry (15 h).

Figure 5
Figure 5 Dependence of estimation accuracy and bias on the number of data points. Solid lines denote median estimates, and the shaded area denotes the 5th and 95th percentiles of the bootstrap estimate distributions. The circle at the end of each curve denotes the travel time estimate obtained using the full inter-observation time distribution.

Figure 7
Figure 7 Symmetry and scaling of travel time estimates with distance. In Panel A, we present a scatter plot of estimated travel times from i to j versus times from j to i. The estimates are rather symmetric, as they should be. Note that each pair of cities corresponds to a single point in the plot, and the choice of which city corresponds to i and which to j is made arbitrarily. In Panel B, we show how the estimates scale with distance. Overall, we can observe an approximately linear trend. A few points correspond to clearly infeasible estimates caused by erroneous data: they show very long travel times (500-800 minutes) even though the distance between the cities is short (0-150 km).

Figure 8
Figure 8 Comparison of our estimates with Google's Distance Matrix API estimates. Only estimates based on at least 10,000 inter-observation times are shown. In Panel A, we show our results for the lower bound estimate, and in Panel B, we show our results for the peak estimate. Google's estimates tend to be smaller than ours overall. This is most evident at longer distances, and suggests that Google may systematically overestimate travel speeds in Senegal.

Figure 9
Figure 9 Getting out of Dakar takes time. On the left (Panel A), we display straight-line travel speeds from Dakar to selected other cities in Senegal. The figure shows that the longer the distance from Dakar, the faster the speed of travel, except for Ziguinchor, to which travel often takes place by ferry, resulting in a slow travel speed. The grey network in the background represents Senegal's road network obtained from Ref. [37]. The gap in the road network between the southern and northern parts of Senegal corresponds to Gambia, for which road network data is not included. On the right (Panel B), we show a close-up map of the main roads close to the capital Dakar, which is located on a peninsula. The stretch of the Dakar-Diamniadio highway that was opened on August 1st 2013 is highlighted with a thick dark blue line. The map image in Panel B is a modified excerpt from OpenStreetMap (© OpenStreetMap contributors).

Figure 10
Figure 10 Opening of the Dakar-Diamniadio toll highway shortens travel times between Dakar and Pout. Daily (Panel A) and monthly (Panel B) travel time estimates between Dakar and Pout. The solid lines denote the estimated values, and the shaded areas denote the 5th and 95th percentiles of 1,000 bootstrap estimates. The green vertical line marks the opening date of the highway in both plots.