Traveling heterogeneity in public transportation


It is well reported that long commutes have a large detrimental effect on people’s health and on the economy of cities. Interestingly, despite the strong impact on our daily lives, a simple way to measure the quality of urban transportation is still unknown. We performed data analysis on the transportation network of two large cities (Fortaleza and Dublin). By dividing each bus trajectory into equal pieces of space, we determine the distribution of time intervals for each trip, and we propose that the heterogeneity of the time distribution can be used to characterize the quality of that trip. Inspired by the use of the Gini coefficient to quantify the inequality level of income distribution, we used the Gini in order to characterize the heterogeneity level of the time distribution. We demonstrated that Gini coefficients are strongly correlated with peak usage of the mobility system, as well as the schedule delays in the system. Finally, our method can be used to find highly heterogeneous trips which have a large negative effect on the urban mobility and can help find new directions for new public planning strategies.

1 Introduction

A long commute has become one of the major problems of modern urbanization. In the largest cities in Brazil the average commuting time is 41 minutes [1]. In these areas, the poorest travelers spend almost 20% more time commuting than the richest and 19% of the poorest commute for more than one hour, while for the richest it is only 11% [1]. This increased time has a large health and economic impact. Studies in the U.K. show that a 20 minute increase in a commute is as bad as a 19% pay cut in regard to job satisfaction [2]. While car users are the majority of the population, bus commuters feel the negative impacts of a longer commuting time more strongly than users of other modes of transport [2].

The study of mobility is an active area of research in the statistical physics community. Models first applied to transport problems in fluid dynamics were later used to understand animal and human mobility [3]. Random walks and reaction–diffusion models were first studied in fluid dynamics and dynamical systems [4]. However, certain special classes of random walks, like Lévy flights, where the lengths of steps are drawn from a distribution with a power–law tail, have been proposed as the mechanism behind animal foraging and human mobility [5–7]. An initial study on human mobility used bank note dispersal as a proxy for human movement and verified that the distribution of distances between consecutive sightings of bank notes is fat tailed [7]. Recent studies using mobile phone calls show similar results [8].

This pervasive heterogeneity in mobility distributions is not exclusive to human and animal mobility. Fat tailed distributions are a well known characteristic of natural and economic systems. The Pareto Principle states that roughly 80% of the wealth is owned by 20% of the population [9]. This principle came about directly from the power–law tail of the distribution of incomes [10]. The level of income inequality or heterogeneity is measured by the Gini coefficient, a measure of statistical dispersion that ranges from 0 (complete equality) to 1.0 (complete inequality) [11].

Here we used a dataset of bus trajectories in order to study how travel time is distributed along them. We investigated whether the Pareto Principle still holds for urban transportation mobility, namely, does a bus, in order to complete its journey, spend 80% of the total time on only 20% of the length of its trajectory? We found that this proportion varies between lines, going from more homogeneous bus routes with 35% of the time for 20% of the space, to more heterogeneous ones with 65% of the time. We then calculate the Gini coefficient as a heterogeneity measure and show that it is correlated with schedule delays. We also see that the high heterogeneity of time is a consequence of a power–law distribution, with a different exponent for each trip. Since there is a mathematical connection between the Gini and the power–law exponent, we show that when the exponent of a bus trip is close to −2.0 we also have a high probability of schedule delays. These highly heterogeneous trips have a large detrimental effect on urban mobility. However, this also gives us the opportunity for a micro–intervention approach, where we can change a small fraction of the bus routes to have a high impact on the final quality of transportation.

2 Methods and data

This research is based on data analysis from Fortaleza and Dublin. Fortaleza is a Brazilian metropolis in which the bus is the main form of public transportation, with about 350 bus routes spread throughout the city, which covers an area of 314 km\(^{2}\). In order to adapt the Gini coefficient for a transport application, two datasets regarding the Fortaleza bus system were used, namely, the GPS positions of the buses and the passenger validation records (VAL). The data were obtained from Fortaleza’s city hall and refer to the period between the 12th and the 17th of April 2016, from Tuesday to Sunday. These data have notably been used in several studies in the last few years [12–16]. The first is the larger of the two datasets and consists of about 21M GPS points. Because the location of each bus is recorded approximately every thirty seconds, it is possible to recover the route of the vehicles throughout the day. In addition to the georeferenced location, each record contains the time at which the GPS point was recorded and the identifier of the vehicle located at that point.

The second dataset is the VAL. The bus travelers in Fortaleza can use a smart card as a ticket to pay for their trips. Thus, every time a passenger uses their smart card, a validation is recorded. Fortaleza city hall stores, together with this information of the user’s validation, the time of departure and arrival of vehicles for each bus route. That is, every time a journey starts and ends, an employee of the bus company takes note of the time the journey began and finished. However, as this work is concerned with analyzing only the journey performed by the bus, from this data, only the information when the journeys begin and end will be utilized.

Because the vehicle id is present in both datasets, they can be cross referenced to generate another dataset (TRIP) containing the route made by each vehicle on the several bus routes. This is done by selecting the GPS points of a particular vehicle that fall within the time range of a journey identified in the chronologically ordered VAL data. As a consequence, TRIP contains the variation of distance (Δs) and time (Δt) between pairs of consecutive GPS points of each journey. About 238,000 bus journeys were generated.
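The computation of Δs and Δt between consecutive GPS points can be sketched as follows. This is a minimal illustration, not the authors’ code: the text does not say how distances were computed, so the haversine great-circle formula, the spherical Earth radius, and the function names `haversine_m` and `trip_segments` are all assumptions made here.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def trip_segments(points):
    """points: list of (timestamp_s, lat, lon) tuples for one journey.

    Returns a list of (delta_s_m, delta_t_s) pairs, one per pair of
    consecutive GPS points after sorting by timestamp.
    """
    ordered = sorted(points)
    segments = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(ordered, ordered[1:]):
        segments.append((haversine_m(la0, lo0, la1, lo1), t1 - t0))
    return segments
```

Each returned pair corresponds to one (Δs, Δt) entry of the TRIP dataset for that journey.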

In order to remove noise, a filter was applied to the TRIP data. Some trips had no GPS points at all, while in others the vehicle appeared to travel at a very high speed at certain moments, covered a very short distance, or took too much or too little time. Therefore, the following criteria were adopted to exclude trips from the analysis: (1) trips in which, at some point, the vehicle traveled at speeds greater than 120 km/h; (2) trips with a route shorter than 5 km; and (3) trips that took less than 30 minutes or more than 3 hours. After the filtration process, approximately 91 thousand trips remained to be analyzed.
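The three exclusion criteria can be expressed as a single predicate. The sketch below is an assumption-laden illustration: the function name `keep_trip` and its parameters are hypothetical, speed is estimated per pair of consecutive GPS points, and non-positive time steps are treated as noise, which the text does not specify.

```python
def keep_trip(segments, max_speed_kmh=120.0, min_length_km=5.0,
              min_duration_min=30.0, max_duration_min=180.0):
    """segments: list of (delta_s_m, delta_t_s) for one trip.

    Returns True if the trip passes all three filters described in the text.
    """
    if not segments:
        return False  # no GPS points were found for this trip
    for ds, dt in segments:
        if dt <= 0:
            return False  # assumption: a non-positive time step is noise
        if (ds / dt) * 3.6 > max_speed_kmh:  # m/s converted to km/h
            return False  # criterion (1): implausibly high speed
    total_km = sum(ds for ds, _ in segments) / 1000.0
    if total_km < min_length_km:
        return False  # criterion (2): route shorter than 5 km
    total_min = sum(dt for _, dt in segments) / 60.0
    return min_duration_min <= total_min <= max_duration_min  # criterion (3)
```

For example, a trip of 100 segments of 100 m covered in 30 s each (10 km in 50 minutes at 12 km/h) is kept, while any trip containing a 2 km jump within 30 s is discarded.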

Data from Dublin were obtained from Ireland’s open data page [17]. The file contains GPS positions of the city buses from January 1st to January 31st, 2013. Altogether, there are about 37 million GPS points. The data contain the geographical positions (latitude and longitude), the date and time at which such positions were recorded, the vehicle identifier, the bus route identifier and the trip identifier. Therefore, to reassemble the trips it was necessary to separate the GPS points by trip id and to order them by date and time, generating the dataset TRIP for Dublin. This TRIP dataset also contains the distance (Δs) and time (Δt) variations between each pair of consecutive GPS points, similar to what was done with the Fortaleza TRIP dataset. In order to remove noise from the data, the same three filters used for the Fortaleza data were applied. In the end, the trip dataset was left with 65,000 records.
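Reassembling the Dublin trips amounts to a group-by on the trip identifier followed by a chronological sort. A minimal sketch, assuming each record is a dictionary with the hypothetical keys `trip_id`, `timestamp`, `lat` and `lon` (the actual column names of the Dublin file are not given in the text):

```python
from collections import defaultdict


def group_by_trip(records):
    """Group GPS records by trip id and sort each group by timestamp.

    records: iterable of dicts with the (hypothetical) keys
    'trip_id', 'timestamp', 'lat', 'lon'.
    Returns {trip_id: [(timestamp, lat, lon), ...]} in chronological order.
    """
    trips = defaultdict(list)
    for rec in records:
        trips[rec["trip_id"]].append((rec["timestamp"], rec["lat"], rec["lon"]))
    return {trip_id: sorted(points) for trip_id, points in trips.items()}
```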

The Gini coefficient, used for income distribution, measures the inequality in the distribution of wealth in a community and takes values between 0 and 1. Its calculation considers the fraction of the accumulated wealth held by a given fraction of the population. From these values one can construct a characteristic curve, called the Lorenz curve. This curve represents the relative distribution of one variable in relation to another, which in this case is the distribution of wealth in the population. It is common to visualize, together with the Lorenz curve, the function \({y=x}\) (identity function or line of equality), which represents a situation of perfect equality in the distribution of income of the population, that is, everybody earns an equal amount. The Gini coefficient is then given by twice the area between the Lorenz curve and the line of equality.

Inspired by the concept of the Gini index in economics [11], we applied it to the TRIP datasets in order to measure the level of heterogeneity in the traveling times of the vehicles. For this, the original path, shown in Fig. 1(a), is recovered and divided into pieces, each having an associated Δs and Δt. However, it is necessary to analyze the distribution of times demanded over equal distance variations, so we divide every path into pieces of constant length \(\Delta s^{*}\). The time value \(\Delta t^{*}\) for each new piece is obtained by linear interpolation between consecutive values of the original data, as shown in Fig. 1(b). Once the values of \(\Delta t^{*}\) for each \(\Delta s^{*}\) of a trip are calculated, we perform a cumulative sum over the values of \(\Delta t^{*}\) sorted in descending order. Each value of the cumulative sum is normalized by the total sum of \(\Delta t^{*}\) (total duration of the bus trip). The same is done for the values of \(\Delta s^{*}\). In Fig. 1(c), we show an example of the relation between the cumulative \(\Delta t^{*}\) and the cumulative \(\Delta s^{*}\).
As already mentioned, this is the so-called Lorenz curve \(L(s)\), which indicates, in its original application in economics, what percentage of people hold a given percentage of a country’s wealth. In the context of transportation, it indicates the percentage of the trip time consumed as a function of the percentage of the trajectory traveled. This is the reason for using constant distance variations (\(\Delta s^{*}\)): each piece \(\Delta s^{*}\) plays the role of one person, and the \(\Delta t^{*}\) associated with it plays the role of that person’s wealth.
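The resampling into constant pieces \(\Delta s^{*}\) and the construction of the Lorenz curve can be sketched as below. This is an illustrative reading of the procedure, not the authors’ implementation: the names `resample_times` and `lorenz_curve` are hypothetical, cumulative time is interpolated linearly against cumulative distance, and zero-length segments (bus standing still) are folded into the preceding segment, which is an assumption.

```python
from bisect import bisect_right


def resample_times(segments, step_m=10.0):
    """Return the list of time values (seconds), one per piece of length step_m.

    segments: list of (delta_s_m, delta_t_s) between consecutive GPS points.
    A trailing remainder shorter than step_m is dropped.
    """
    cum_s, cum_t = [0.0], [0.0]
    for ds, dt in segments:
        if ds > 0:
            cum_s.append(cum_s[-1] + ds)
            cum_t.append(cum_t[-1] + dt)
        else:
            # Assumption: time spent without moving is added to the previous
            # position, so a stop before the first movement is not counted.
            cum_t[-1] += dt

    def time_at(s):
        # Linear interpolation of cumulative time at cumulative distance s.
        i = bisect_right(cum_s, s) - 1
        if i >= len(cum_s) - 1:
            return cum_t[-1]
        frac = (s - cum_s[i]) / (cum_s[i + 1] - cum_s[i])
        return cum_t[i] + frac * (cum_t[i + 1] - cum_t[i])

    n_pieces = int(cum_s[-1] // step_m)
    bounds = [time_at(k * step_m) for k in range(n_pieces + 1)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def lorenz_curve(dt_star):
    """Return (s, L): cumulative fractions of space and of time.

    The time values are sorted in descending order, as in the text,
    so the curve lies on or above the equality line.
    """
    ordered = sorted(dt_star, reverse=True)
    total = sum(ordered)
    n = len(ordered)
    s, L, acc = [0.0], [0.0], 0.0
    for k, value in enumerate(ordered, start=1):
        acc += value
        s.append(k / n)
        L.append(acc / total)
    return s, L
```

For a toy trip with segments (20 m, 10 s) and (10 m, 20 s) and a step of 10 m, the resampled times are 5 s, 5 s and 20 s, and the slowest third of the route accounts for two thirds of the trip time.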

Figure 1

Model and processing of bus travel data. (a) Conceptual representation of the trajectory of a trip performed by a bus. On the route, in relation to the original data, the variation of time (Δt) and distance (Δs) between each pair of consecutive GPS points is illustrated. (b) The same path as in (a) is shown, now divided into constant lengths (\(\Delta s^{*}\)); the new time values (\(\Delta t^{*}\)) for each constant length are calculated by linear interpolation. (c) A Lorenz curve \(L(s)\), generated artificially to follow the Pareto Principle, is illustrated. The x-axis is the cumulative sum of the constant distances (\(\Delta s^{*}\)) divided by the total length. The y-axis is the cumulative sum of the time values (\(\Delta t^{*}\)), sorted in descending order, normalized by the total time. The function \(y=x\) (dashed line) is the equality line \(E(s)\), and twice the area between the Lorenz curve and the equality line is numerically equal to the Gini coefficient. The Gini value accounts for the heterogeneity of the travel time distribution. The continuous vertical line marks the 80/20 ratio of the Pareto Principle.

Also in Fig. 1(c), we can see a dashed line which represents the line of equality \(E(s)\). If a vehicle travels the whole trajectory at a constant speed, for example, the Lorenz curve of that trip will coincide with the equality line, indicating the maximum equality of the values of \(\Delta t^{*}\). The other extreme is the case where all the time demanded in the whole course was spent on only one piece of \(\Delta s^{*}\). In the analogy with the original study of income distribution in a country, it is as if the whole wealth of the country belonged to only one person. The value of the Gini coefficient G is then calculated as twice the area between the Lorenz curve and the equality line, that is

$$ G = 2 \int_{0}^{1} \bigl(L(s)-E(s)\bigr) \,ds, $$

with \(0 \le G \le 1\).
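For a trip with a finite number of pieces, the integral above can be approximated with the trapezoidal rule over the piecewise-linear Lorenz curve. The sketch below is one possible discretization, not necessarily the one used by the authors; the name `gini_from_times` is hypothetical.

```python
def gini_from_times(dt_star):
    """Gini coefficient G = 2 * integral of (L(s) - E(s)) ds over [0, 1].

    dt_star: time spent on each piece of constant length.
    L(s) is built from the times sorted in descending order and integrated
    with the trapezoidal rule; the integral of E(s) = s is exactly 1/2.
    """
    ordered = sorted(dt_star, reverse=True)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total <= 0:
        raise ValueError("need at least one piece and a positive total time")
    area_under_lorenz = 0.0
    prev_L = 0.0
    acc = 0.0
    for value in ordered:
        acc += value
        cur_L = acc / total
        area_under_lorenz += 0.5 * (prev_L + cur_L) / n  # trapezoid of width 1/n
        prev_L = cur_L
    return 2.0 * (area_under_lorenz - 0.5)
```

With this discretization a constant-speed trip gives G = 0, and a trip in which all the time is spent on one of n pieces gives G = 1 − 1/n, which approaches 1 as the pieces become finer.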

The Gini coefficient therefore measures whether a vehicle spends a disproportionate share of the total trip time on a small share of the route. We checked whether the Pareto 80/20 proportion holds, that is, whether 20% of the total trip length consumes 80% of the trip time. Even an approximately similar proportion points to an imbalance in the execution of the trip. Thus, the Gini coefficient can be a good indicator of the quality of the transport journey.
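The 80/20 check reduces to reading one point of the Lorenz curve: the share of the trip time spent on the slowest 20% of the pieces. A minimal sketch, with the hypothetical name `time_share_of_slowest`:

```python
def time_share_of_slowest(dt_star, space_fraction=0.2):
    """Fraction of total trip time spent on the slowest pieces.

    dt_star: time spent on each piece of constant length.
    space_fraction: share of the route considered (0.2 for the 80/20 check).
    At least one piece is always included.
    """
    ordered = sorted(dt_star, reverse=True)
    k = max(1, round(space_fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)
```

A constant-speed trip returns 0.2, whereas a trip obeying the Pareto proportion returns 0.8; the values of 35% to 65% quoted in the Introduction lie between these two cases.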

Figure 2(a) shows a heat map of several trips made from 7:00 AM to 9:00 AM in Fortaleza. The variable shown in this map is \(\Delta t^{*}\), and the logarithmic scale is indicated inside the map. In Fig. 2(b) one trip belonging to the map in Fig. 2(a) is selected, and its time series is shown in Fig. 2(c), representing the values of \(\Delta t^{*}\) for this particular trip.

Figure 2

Map of time in logarithmic color scale. (a) Several morning bus trips in Fortaleza are shown. The trips were divided into constant spaces of \(\Delta s^{*} = 10\mbox{ m}\) and the colors plotted on the map are the values of \(\log_{10} \Delta t^{*}\), with their values indicated in the scale. (b) One of the trips in (a) is selected and highlighted on the map. The start and end locations of this trip are also illustrated. This trip belongs to bus route 75, with \(\Delta s^{*}= 10\mbox{ m}\), and the colors shown in (b) correspond to the values of \(\Delta t^{*}\) in a logarithmic scale. (c) Time series of \(\Delta t^{*}\) for the interpolated path of the trip in (b). Each value of the series represents the amount of time required to traverse the respective piece of 10 m.

3 Results

The aim here is to use the Gini coefficient as an indicator to characterize the level of heterogeneity in urban transportation. For this, it is necessary to analyze the \(\Delta t^{*}\) values of each trip. We first investigate the presence of Zipf’s law in the distribution of \(\Delta t^{*}\). This law was originally applied in the field of linguistics [18] and states that the frequency of a word in a given text is inversely proportional to its rank, where the most frequent word is ranked 1, the second most frequent is ranked 2, and so on. Generally, the frequency f of a word is given by \(f \propto 1/r^{s}\), where r is the rank of the word and s is an exponent that characterizes the distribution. In Fig. 3(a) we show the Zipf curves of various trips, each built from the ranking of the \(\Delta t^{*}\) values of a trip. The values of \(\Delta t^{*}\) are ordered from highest to lowest and each is assigned a rank, with the largest time having rank 1. When both axes are placed in logarithmic scale, a straight line is formed, indicating that \(\Delta t^{*}\) follows a Zipf distribution. In Fig. 3(b) the Lorenz curves of each trip indicated in Fig. 3(a) are shown. We can compare the 80/20 Pareto Principle ratio to the proportion indicated on our Lorenz curves. For example, the most heterogeneous trip has a Lorenz curve that uses about 70% of the time to travel 20% of the space. Such disproportion supports the idea that the Gini indicates an inefficiency in the trip execution.
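A Zipf plot and its slope in log–log space can be obtained by ranking the \(\Delta t^{*}\) values and fitting a straight line. The sketch below uses ordinary least squares on the logarithms, which is a simple choice made here for illustration and not a method stated in the text; `zipf_slope` is a hypothetical name.

```python
import math


def zipf_slope(dt_star):
    """Least-squares slope of log10(value) against log10(rank).

    Values are ranked in descending order (largest value has rank 1).
    Non-positive values are skipped because their logarithm is undefined.
    At least two positive values are required.
    """
    ordered = sorted((v for v in dt_star if v > 0), reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log10(v) for v in ordered]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A slope near −0.5 would correspond to the dashed guide line drawn in Fig. 3(a).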

Figure 3

Zipf and Lorenz curves of some trips. (a) Zipf plots of several trips of different bus routes. The Zipf plot is constructed by sorting the values of \(\Delta t^{*}\) in decreasing order. The values of \(\Delta t^{*}\) are obtained by dividing the course of each trip into constant spaces of \(\Delta s^{*}= 10\mbox{ m}\). Thus, in (a), the y-axis corresponds to the values of \(\Delta t^{*}\) and the x-axis is the rank of each \(\Delta t^{*}\), both in logarithmic scale. The highest value of \(\Delta t^{*}\) has rank 1, the second highest has rank 2, and so on. The Zipf plot shows a pervasive fat-tail characteristic for the interpolated time values \(\Delta t^{*}\), where the dashed line corresponds to a line with slope equal to −0.5 in log–log space. (b) The Lorenz curves \(L(s)\) of the trips shown in (a), where each color represents the same trip. The equality line \(E(s)\) is shown as a dashed line and is used to calculate the Gini coefficient of each trip, which is twice the area between \(E(s)\) and \(L(s)\). Each curve gives one Gini coefficient per trip, representing the heterogeneity of the interpolated times. The continuous vertical line indicates where the 80/20 ratio of the Pareto Principle occurs.

Another way to understand the usefulness of this methodology in transport applications is to calculate the Gini coefficient of each bus trip in Fortaleza and Dublin. Figure 4(a) shows the behavior of the average Gini by time of day. The mean (\(\mu_{\mathrm {gini}}\)) and the standard deviation (\(\sigma_{\mathrm {gini}}\)) of the Ginis were computed for bins of equal size. The shaded areas around the average represent \(\mu_{\mathrm {gini}} \pm \sigma_{\mathrm {gini}}\). The two peaks present in Fig. 4(a) correspond to the rush-hour periods, one in the morning and the other in the evening, when a large number of people and vehicles are commuting. Thus Fig. 4(a) confirms the expectation that higher Gini values are concentrated in the rush hours, showing that the Gini captures this information.
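The per-bin statistics behind Fig. 4(a) can be sketched as a simple aggregation. The bin width of 60 minutes, the use of the trip start time as the binning variable, and the population standard deviation are all assumptions, since the text only states that bins of equal size were used; `gini_by_time_of_day` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean, pstdev


def gini_by_time_of_day(trips, bin_minutes=60):
    """Mean and standard deviation of the Gini per time-of-day bin.

    trips: iterable of (start_minute_of_day, gini) pairs.
    Returns {bin_start_minute: (mu_gini, sigma_gini)} ordered by bin.
    """
    bins = defaultdict(list)
    for start_minute, gini in trips:
        bin_start = (int(start_minute) // bin_minutes) * bin_minutes
        bins[bin_start].append(gini)
    return {b: (mean(g), pstdev(g)) for b, g in sorted(bins.items())}
```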

Figure 4

Gini coefficient by time of day and delay. (a) The variation of the Gini coefficient with time during the day. The continuous lines (blue for Fortaleza and red for Dublin) represent the average Gini (\(\mu_{\mathrm {gini}}\)) in equally spaced bins for all trips on the workdays of a week in Fortaleza and Dublin. The shaded area around the average represents \(\mu_{\mathrm {gini}} \pm \sigma_{\mathrm {gini}}\), in which \(\sigma_{\mathrm {gini}}\) is the standard deviation of the Gini in each bin. The plot corroborates the expectation that the largest values of the Gini occur in the rush hours. (b) The correlation between the Gini coefficient and the fraction of time delay for Fortaleza. Each point represents a trip, for which a delay value D is calculated. D measures the deviation from the expected value of the total time of that bus trip. The continuous line represents the Nadaraya–Watson non-parametric regression and the dashed line is the linear regression. The linear regression gives the relation \(G = 0.38 + 0.24 D\), and the inset is the distribution of travel delay values. (c) The same correlation as in (b) for Dublin. The linear regression gives the relation \(G = 0.44 + 0.13 D\), and the inset is the distribution of travel delay values.

Here we compare the Gini coefficient with the schedules programmed for each trip, in order to analyze whether it correlates with the travel delay. For a vehicle \(V_{i}\) performing a specific route, we define the travel delay D as the fractional deviation from the average time to finish that bus route,

$$ D(V_{i})= \frac{T_{i}- \langle T \rangle }{ \langle T \rangle }, $$

where \(T_{i}\) is the total time of the trajectory of \(V_{i}\) and \(\langle T \rangle \) is the average travel time for all vehicles on the same bus route. In Fig. 4(b) the plot of the delay versus the Gini for Fortaleza shows a clear correlation between these two variables, suggesting the possibility that the Gini can encode information about schedule delays. Figure 4(c) shows the same information for Dublin.
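The delay definition and the non-parametric regression used in Fig. 4 can be sketched as follows. The delay follows the formula above directly. For the Nadaraya–Watson estimator, the Gaussian kernel and the bandwidth argument are assumptions, because the text does not state which kernel or bandwidth was used; both function names are hypothetical.

```python
import math
from collections import defaultdict


def travel_delays(trips):
    """Compute D = (T_i - <T>) / <T> for every trip.

    trips: list of (route_id, total_time) pairs.
    <T> is the average total time over all trips of the same route.
    Returns the delays in the same order as the input.
    """
    times_by_route = defaultdict(list)
    for route, total_time in trips:
        times_by_route[route].append(total_time)
    average = {r: sum(ts) / len(ts) for r, ts in times_by_route.items()}
    return [(t - average[r]) / average[r] for r, t in trips]


def nadaraya_watson(x_train, y_train, x_query, bandwidth):
    """Kernel-weighted average of y_train around x_query (Gaussian kernel)."""
    weights = [math.exp(-0.5 * ((x_query - x) / bandwidth) ** 2) for x in x_train]
    return sum(w * y for w, y in zip(weights, y_train)) / sum(weights)
```

Evaluating `nadaraya_watson` on a grid of delay values, with the per-trip delays as `x_train` and the per-trip Ginis as `y_train`, gives a smooth curve of the kind drawn as the continuous line in Fig. 4(b).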

As an illustrative case, we analyzed a specific bus route from Fortaleza (650I). In Fig. 5(a) we show the time series \(\Delta t^{*}\) in a logarithmic color scale for five different trips (see supplementary material for more details of other routes). We can observe that the largest values of \(\Delta t^{*}\) (yellow on the color scale) usually occur in the same places. As shown in Fig. 5(b), the distributions of \(\Delta t^{*}\) obtained from the time series in Fig. 5(a) for each trip can be well fitted by a power–law \(p(\Delta t^{*}) = a (\Delta t^{*})^{-\alpha }\). These distributions show that there are several different travel profiles even for the same bus route, each one with a different value of the exponent α. It is known that the power–law exponent α is mathematically related to the Gini index [19], and \(\lim_{\alpha \to 2.0^{+}} G = 1.0\). This mathematical property motivates plotting, in Fig. 5(c), the exponent α against the total travel time for the bus line 650I. We see that as \(\alpha \to 2.0\) the total travel time increases, which corroborates the result of Fig. 4(b), where trips with large Gini values also have large time delays. This is a direct consequence of the divergence of the mean of power–law distributions as \(\alpha \to 2.0\) [20]. In fact, this behavior is observed for all the trips considered in our data set, as can be seen in the compilation in Fig. 6(a) for Fortaleza and Fig. 6(b) for Dublin, where we plotted the time delay fraction against the α exponent. The Nadaraya–Watson non-parametric regression [21, 22] shows that as \(\alpha \to 2.0\) the time delay fraction increases. Therefore, from a public planning perspective, a good bus line needs to have a value of α much larger than 2.0.
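The link between α and the Gini can be made explicit for an idealized case. Assuming a pure power–law density \(p(x) \propto x^{-\alpha}\) on \(x \ge x_{\min}\) with \(\alpha > 2\) (a Pareto distribution), the standard results for the mean and the Gini coefficient are

$$ \langle x \rangle = x_{\min} \frac{\alpha - 1}{\alpha - 2}, \qquad G = \frac{1}{2\alpha - 3}, $$

so that \(G \to 1\) and \(\langle x \rangle \to \infty \) as \(\alpha \to 2^{+}\). The sketch below evaluates this relation and estimates α with the continuous maximum-likelihood estimator described in [19]. The text does not say which fitting procedure was used, so this estimator, the choice of \(x_{\min}\), and the names `fit_alpha_mle` and `gini_from_alpha` are assumptions.

```python
import math


def fit_alpha_mle(values, x_min):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / x_min)).

    Only values with x >= x_min enter the estimate.
    """
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum <= 0:
        raise ValueError("need values strictly above x_min")
    return 1.0 + len(tail) / log_sum


def gini_from_alpha(alpha):
    """Gini of a pure power law p(x) ~ x**(-alpha), valid for alpha > 2."""
    if alpha <= 2.0:
        raise ValueError("the mean diverges for alpha <= 2")
    return 1.0 / (2.0 * alpha - 3.0)
```

Under this idealization, α = 3.0 gives G = 1/3 and α = 2.5 gives G = 0.5, consistent with the observation that trips with α closer to 2.0 are the most heterogeneous.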

Figure 5

Characterization of the distribution of times of bus route 650I. (a) The time series (\(\Delta t^{*}\)) for five different trips of bus route 650I, shown in a logarithmic color scale. The largest values of \(\Delta t^{*}\) are approximately concentrated in the same regions. (b) The distributions of the values of \(\Delta t^{*}\) for the same five trips, shown in logarithmic scale. We performed a power–law fit for each distribution. Each trip has a different value of the power–law exponent α. The dashed lines are a visual guide with slopes −3.0 and −2.0. (c) The relation between the α exponent and the total time that the bus took to finish the scheduled trajectory. Each point represents a trip of line 650I and the dashed line represents the average behavior obtained from the Nadaraya–Watson regression. The large trip times for low values of α corroborate the correlation between the Gini and the time delay D (Fig. 4(b)), since distributions with \(\alpha \to 2.0\) must have a Gini close to 1.0.

Figure 6

Characterization of all bus trips. (a) The fraction delay as a function of the α exponent for Fortaleza, where each small circle is a trip. When \(\alpha \to 2.0\) the average fraction delay, calculated by the Nadaraya–Watson regression (red dashed line), grows fast. (b) The same for Dublin. (c) The probability distribution of the Gini for Fortaleza (in blue) and Dublin (in red). The PDFs appear Gaussian, despite a slight asymmetry when compared to the statistical fit (dashed lines). To test that hypothesis we applied the Kolmogorov–Smirnov test, which gave a p-value of approximately 0 for both Fortaleza and Dublin. Therefore the hypothesis that the distributions are normal is rejected. The kurtosis and skewness [31] were also computed. The kurtosis for Fortaleza is 1.9493 and the skewness is 0.5485. For Dublin, the kurtosis is 1.450 and the skewness is −0.0282. These results indicate that, although both PDFs visually resemble the normal fit, the PDF of Fortaleza is more asymmetric than that of Dublin.

In Fig. 6(c) we show the distribution of the Gini values of all the bus trips calculated from our data sets. The Kolmogorov–Smirnov test [23] was used to check whether the distributions of the Gini index for Fortaleza and Dublin follow a normal distribution. Since the p-values are approximately zero, the hypothesis of normality was rejected.
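The shape statistics reported for Fig. 6(c) can be sketched with the standard library alone. The code below computes moment-based skewness and excess kurtosis and the Kolmogorov–Smirnov distance to a normal distribution whose mean and standard deviation are estimated from the sample. Whether the reported kurtosis is the excess kurtosis is not stated in the text and is assumed here. The p-value is deliberately not computed, since with estimated parameters it requires the Lilliefors correction [23], for example via tables or simulation. Both function names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (normal distribution gives 0, 0)."""
    mu, sigma = fmean(values), pstdev(values)
    n = len(values)
    skew = sum(((v - mu) / sigma) ** 3 for v in values) / n
    kurt = sum(((v - mu) / sigma) ** 4 for v in values) / n - 3.0
    return skew, kurt


def ks_distance_to_normal(values):
    """Largest gap between the empirical CDF and the fitted normal CDF."""
    mu, sigma = fmean(values), pstdev(values)
    ordered = sorted(values)
    n = len(ordered)
    distance = 0.0
    for i, v in enumerate(ordered):
        cdf = 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))
        # Compare against the empirical CDF just after and just before v.
        distance = max(distance, (i + 1) / n - cdf, cdf - i / n)
    return distance
```

A positive skewness, as reported for Fortaleza, indicates a longer tail towards high Gini values.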

4 Discussion

Urban mobility and public transport systems have been extensively studied over the years [14–16, 24–28]. However, even considering such vast literature, there is still no consensus on how best to describe the efficiency of vehicle routes in large metropolises [29]. Here we propose that the level of heterogeneity of the distribution of time along the trajectory of a bus can be a measure of the overall quality of the trip.

In order to quantify heterogeneity, we used the Gini coefficient, an index that has its origin in economics and has been extensively used to quantify the inequality of the income distribution of a social system. We show that the Gini coefficients are strongly correlated with peak usage of the mobility system, as well as with the schedule delays in the system. More precisely, a large value of the Gini indicates a bus line with a more unpredictable time schedule. We also see that Dublin has a slightly higher Gini than Fortaleza, which may indicate that Fortaleza has better urban traffic flow than Dublin and opens up the possibility of using the Gini to compare different cities.

The findings described in this article introduce alternatives for the implementation of innovative practices for decision makers within cities. Since we have shown that the time series follows a power–law distribution, this allows for the opportunity for a microintervention approach, in which we could change a small fraction of the bus trajectory in order to achieve a large improvement. Finally, the Gini can also be used to classify fuel consumption and pollution, since an increase of these factors is well known to be closely related to velocity variations [30].



Abbreviations

GPS: Global Positioning System

VAL: validation dataset

TRIP: trip dataset

Δs: variation of distance

Δt: variation of time

\(\Delta s^{*}\): constant distance

\(\Delta t^{*}\): variation of time for a \(\Delta s^{*}\)

\(L(s)\): Lorenz curve

\(E(s)\): line of equality

G: Gini coefficient

\(\mu_{\mathrm {gini}}\): mean of Ginis

\(\sigma_{\mathrm {gini}}\): standard deviation of Ginis

\(V_{i}\): vehicle id

D: travel delay

\(T_{i}\): total time of the trip of \(V_{i}\)

\(\langle T \rangle \): average travel time for all vehicles on the same bus route

α: power–law exponent

PDF: probability density function


  1. Schwanen T (2013) Commute time in Brazil (1992–2009): differences between metropolitan areas, by income levels and gender. Texto para Discussão, vol 1813a. Instituto de Pesquisa Econômica Aplicada (IPEA)

  2. Chatterjee K (2000) A human perspective on the daily commute: costs, benefits and tradeoffs. Phys A, Stat Mech Appl 281(1):69–77

  3. Viswanathan GM, Da Luz MG, Raposo EP, Stanley HE (2011) The physics of foraging: an introduction to random searches and biological encounters. Cambridge University Press, Cambridge

  4. Metzler R, Klafter J (2000) The random walk’s guide to anomalous diffusion: a fractional dynamics approach. Phys Rep 339(1):1–77

  5. Viswanathan GM, Buldyrev SV, Havlin S, Da Luz MGE, Raposo EP, Stanley HE (1999) Optimizing the success of random searches. Nature 401(6756):911

  6. Viswanathan GM, Afanasyev V, Buldyrev SV, Murphy EJ, Prince PA, Stanley HE (1996) Lévy flight search patterns of wandering albatrosses. Nature 381(6581):413

  7. Brockmann D, Hufnagel L, Geisel T (2006) The scaling laws of human travel. Nature 439(7075):462

  8. Gonzalez MC, Hidalgo CA, Barabasi A-L (2008) Understanding individual human mobility patterns. Nature 453(7196):779–782

  9. Pareto V (1964) Cours d’économie politique, vol 1. Librairie Droz

  10. Yakovenko VM (2001) Exponential and power–law probability distributions of wealth and income in the United Kingdom and the United States. Phys A, Stat Mech Appl 299(1):213–221

  11. Gini C (1912) Variabilità e mutabilità. Reprinted in: Pizetti E, Salvemini T (eds) Memorie di metodologica statistica. Libreria Eredi Virgilio Veschi, Rome

  12. Sullivan D, Caminha C, Melo H, Furtado V (2017) Towards understanding the impact of crime in a choice of a route by a bus passenger. Preprint. arXiv:1705.03506

  13. Furtado V, Furtado E, Caminha C, Lopes A, Dantas V, Ponte C, Cavalcante S (2017) A data-driven approach to help understanding the preferences of public transport users. In: Big Data (Big Data), 2017 IEEE international conference on, pp 1926–1935. IEEE

  14. Caminha C, Furtado V (2017) Impact of human mobility on police allocation. In: Intelligence and security informatics (ISI), 2017 IEEE international conference on, pp 125–127. IEEE

  15. Caminha C, Furtado V, Pequeno TH, Ponte C, Melo HP, Oliveira EA, Andrade Jr JS (2017) Human mobility in large cities as a proxy for crime. PLoS ONE 12(2):e0171609

  16. Caminha C, Furtado V, Pinheiro V, Silva C (2016) Micro-interventions in urban transportation from pattern discovery on the flow of passengers and on the bus network. In: Smart cities conference (ISC2), 2016 IEEE international, pp 1–6. IEEE

  17. Dublin bus GPS sample data from Dublin City Council. Accessed 2016-10-06

  18. Zipf GK (2016) Human behavior and the principle of least effort: an introduction to human ecology. Ravenio Books

  19. Clauset A, Shalizi CR, Newman ME (2009) Power–law distributions in empirical data. SIAM Rev 51(4):661–703

  20. Sornette D (2006) Critical phenomena in natural sciences: chaos, fractals, selforganization and disorder: concepts and tools. Springer, Berlin

  21. Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9(1):141–142

  22. Watson GS (1964) Smooth regression analysis. Sankhya, Ser A 26:359–372

  23. Lilliefors HW (1967) On the Kolmogorov–Smirnov test for normality with mean and variance unknown. J Am Stat Assoc 62(318):399–402

  24. Chang SK, Schonfeld PM (1991) Multiple period optimization of bus transit systems. Transp Res, Part B, Methodol 25(6):453–478

  25. Gordillo F (2006) The value of automated fare collection data for transit planning: an example of rail transit od matrix estimation. PhD thesis, Massachusetts Institute of Technology

  26. Gao Z, Wu J, Mao B, Huang H (2005) Study on the complexity of traffic networks and related problems. J Commun Transp Syst Eng Inf 2:014

  27. Wang J, Mo H, Wang F, Jin F (2011) Exploring the network structure and nodal centrality of China’s air transport network: a complex network approach. J Transp Geogr 19(4):712–721

  28. Munizaga MA, Palma C (2012) Estimation of a disaggregate multimodal public transport origin–destination matrix from passive smartcard data from Santiago, Chile. Transp Res, Part C, Emerg Technol 24:9–18

  29. Bast H, Delling D, Goldberg A, Müller-Hannemann M, Pajor T, Sanders P, Wagner D, Werneck RF (2016) Route planning in transportation networks. In: Algorithm engineering. Springer, Berlin, pp 19–80

  30. Cappiello A (2002) Modeling traffic flow emissions. PhD thesis, Massachusetts Institute of Technology

  31. Mardia KV (1970) Measures of multivariate skewness and kurtosis with applications. Biometrika 57(3):519–530


We thank the Brazilian agencies CNPq, CAPES, FUNCAP, and the National Institute of Science and Technology for Complex Systems (INCT-SC) in Brazil for financial support and Heitor Credidio for fruitful discussions.

Availability of data and materials

The data that support the findings of this study are available from the corresponding author upon reasonable request.


CP, HPMM, CC, JSA and VF have been funded by the Brazilian Agencies: Conselho Nacional de Desenvolvimento Científico e Tecnológico, Coordenação de Aperfeiçoamento de Pessoal de Nível Superior and Fundação Cearense de Apoio ao Desenvolvimento Científico e Tecnológico. JSA has also been funded by the National Institute of Science and Technology for Complex Systems. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information


All authors contributed equally. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Caio Ponte.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Ponte, C., Melo, H.P.M., Caminha, C. et al. Traveling heterogeneity in public transportation. EPJ Data Sci. 7, 42 (2018).
