 Regular article
 Open Access
Traveling heterogeneity in public transportation
EPJ Data Science volume 7, Article number: 42 (2018)
Abstract
It is well reported that long commutes have a large detrimental effect on people’s health and on the economy of cities. Interestingly, despite the strong impact on our daily lives, a simple way to measure the quality of urban transportation is still unknown. We performed data analysis on the transportation network of two large cities (Fortaleza and Dublin). By dividing each bus trajectory into equal pieces of space, we determine the distribution of time intervals for each trip, and we propose that the heterogeneity of the time distribution can be used to characterize the quality of that trip. Inspired by the use of the Gini coefficient to quantify the inequality level of income distribution, we used the Gini in order to characterize the heterogeneity level of the time distribution. We demonstrated that Gini coefficients are strongly correlated with peak usage of the mobility system, as well as the schedule delays in the system. Finally, our method can be used to find highly heterogeneous trips which have a large negative effect on the urban mobility and can help find new directions for new public planning strategies.
Introduction
Long commutes have become one of the major problems of modern urbanization. In the largest cities in Brazil the average commuting time is 41 minutes [1]. In these areas, the poorest travelers spend almost 20% more time commuting than the richest, and 19% of the poorest commute for more than one hour, compared with only 11% of the richest [1]. This increased time has a large health and economic impact. Studies in the U.K. show that a 20-minute increase in commuting time is as bad as a 19% pay cut in terms of job satisfaction [2]. While car users are the majority of the population, bus commuters feel the negative impacts of a longer commute more strongly than users of other modes of transport [2].
The study of mobility is an active area of research in the statistical physics community. Models first applied to transport problems in fluid dynamics were later used to understand animal and human mobility [3]. Random walks and reaction–diffusion models were first studied in fluid dynamics and dynamical systems [4]. However, certain special classes of random walks, like Lévy flights, where the step lengths are drawn from a distribution with a power–law tail, have been proposed as the mechanism behind animal foraging and human mobility [5–7]. An initial study on human mobility used bank note dispersal as a proxy for human movement and verified that the distribution of distances between consecutive sightings of bank notes is fat tailed [7]. Recent studies using mobile phone calls show similar results [8].
This pervasive heterogeneity is not exclusive to human and animal mobility. Fat tailed distributions are a well known characteristic of natural and economic systems. The Pareto Principle states that roughly 80% of the wealth is owned by 20% of the population [9]. This principle follows directly from the power–law tail of the income distribution [10]. The level of income inequality, or heterogeneity, is measured by the Gini coefficient, a measure of statistical dispersion that ranges from 0 (complete equality) to 1.0 (complete inequality) [11].
Here we used a dataset of bus trajectories in order to study how travel time is distributed along them. We investigated whether the Pareto Principle still holds for urban transportation mobility, namely, does a bus spend 80% of its total travel time on only 20% of the length of its trajectory? We found that this proportion varies for different lines, going from more homogeneous bus routes with 35% of time for 20% of space, to more heterogeneous with 65% of time. We then calculate the Gini coefficient as a heterogeneity measure and show that it is correlated with schedule delays. We also see that the high heterogeneity of time is a consequence of a power–law distribution, with different exponents for each trip. Since there is a mathematical connection between the Gini and the power–law exponent, we show that when the exponent of a bus trip is close to −2.0 we also have a high probability of schedule delays. These highly heterogeneous trips have a large detrimental effect on urban mobility. However, this also gives us the opportunity for a micro–intervention approach, where we can change a small fraction of the bus routes to have a high impact on the final quality of transportation.
Methods and data
This research is based on data analysis from Fortaleza and Dublin. Fortaleza is a Brazilian metropolis in which the main form of public transportation is the bus, with about 350 bus routes spread throughout the city, which covers an area of 314 km^{2}. In order to adapt the Gini coefficient for a transport application, two datasets from the Fortaleza bus system were used, namely, the GPS positions of the buses and the passenger validation records (VAL). The data were obtained from Fortaleza’s city hall and refer to the period between the 12th and the 17th of April 2016, from Tuesday to Sunday. These data have been used in several studies in the last few years [12–16]. The GPS dataset is the larger of the two and consists of about 21M GPS points. Because the location of each bus is recorded approximately every thirty seconds, it is possible to recover the route of the vehicles throughout the day. In addition to the georeferenced location, each record contains the time at which the GPS point was recorded and the identifier of the vehicle located at that point.
The second dataset is the VAL. The bus travelers in Fortaleza can use a smart card as a ticket to pay for their trips. Thus, every time a passenger uses their smart card, a validation is recorded. Together with the validation records, Fortaleza city hall stores the departure and arrival times of the vehicles for each bus route. That is, every time a journey starts and ends, an employee of the bus company takes note of the time the journey began and finished. Since this work is concerned only with the journey performed by the bus, only the start and end times of the journeys are used from this dataset.
Because the vehicle id is present in both datasets, they can be cross referenced to generate another dataset (TRIP) containing the route made by each vehicle on the several bus routes. This is done by selecting the GPS points of a particular vehicle that fall within the time range of a journey identified in the chronologically ordered VAL data. As a consequence, TRIP contains the variation of distance (Δs) and time (Δt) between pairs of consecutive GPS points of each journey. About 238,000 bus journeys were generated.
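The computation of Δs and Δt between consecutive GPS points can be sketched as follows. This is a minimal illustration, not the authors' code: the haversine formula, the Earth radius constant, the record layout `(timestamp_s, lat, lon)` and the function names are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def consecutive_deltas(points):
    """Return [(delta_s_km, delta_t_s), ...] for consecutive GPS points.

    `points` is a hypothetical list of (timestamp_s, lat, lon) tuples for
    one journey, assumed to be already sorted by timestamp.
    """
    deltas = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(points, points[1:]):
        deltas.append((haversine_km(la0, lo0, la1, lo1), t1 - t0))
    return deltas
```

A journey with N GPS points yields N − 1 pairs (Δs, Δt), which is the per-journey content attributed to TRIP above.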
In order to remove noise, a filter was applied to the TRIP data. Some trips had no GPS points at all, while in others the vehicle appeared to travel at a very high speed, to cover a very short distance, or to take too much or too little time. Therefore, the following criteria were adopted to exclude trips from the analysis: (1) trips in which, at some point, the vehicle traveled at speeds greater than 120 km/h; (2) trips with a route shorter than 5 km; and (3) trips that took less than 30 minutes or more than 3 hours. After the filtering process, approximately 91 thousand trips remained to be analyzed.
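The three exclusion criteria can be written as a single predicate. The sketch below assumes a trip is represented as a list of (Δs in km, Δt in seconds) pairs; the function name and the handling of non-positive Δt are assumptions for the example, not details given in the text.

```python
def keep_trip(deltas, max_speed_kmh=120.0, min_length_km=5.0,
              min_duration_s=30 * 60, max_duration_s=3 * 60 * 60):
    """Return True if a trip passes the three noise filters.

    `deltas` is a hypothetical list of (delta_s_km, delta_t_s) pairs
    between consecutive GPS points of one trip.
    """
    if not deltas:
        return False  # trips without GPS points are discarded

    total_km = sum(ds for ds, _ in deltas)
    total_s = sum(dt for _, dt in deltas)

    if total_km < min_length_km:                               # criterion (2)
        return False
    if total_s < min_duration_s or total_s > max_duration_s:   # criterion (3)
        return False

    for ds, dt in deltas:
        if dt <= 0:
            # Assumption: repeated or out-of-order timestamps count as noise.
            return False
        if ds / (dt / 3600.0) > max_speed_kmh:                 # criterion (1)
            return False
    return True
```

Checking the cheap trip-level totals first avoids scanning every segment of trips that are going to be rejected anyway.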
Data from Dublin were obtained from Ireland’s open data page [17]. The file contains GPS positions of the city buses from January 1st to January 31st, 2013. Altogether, there are about 37 million GPS points. The data contain the geographical positions (latitude and longitude), the date and time at which such positions were recorded, the vehicle identifier, the bus route identifier and the trip identifier. Therefore, to reassemble the trips it was necessary to separate the GPS points by trip id and to order them by date and time, generating the dataset TRIP for Dublin. This TRIP dataset also contains the distance (Δs) and time (Δt) variations between each pair of consecutive GPS points, as in the Fortaleza TRIP dataset. In order to remove noise from the data, the same three filters used for the Fortaleza data were applied. In the end, 65,000 trips remained.
The Gini coefficient, used for income distribution, measures how wealth is distributed in a community and takes values between 0 and 1. In the calculation of the Gini, the fraction of the accumulated wealth held by a given fraction of the population is considered. From these values one can construct a graph with a characteristic curve, called the Lorenz curve. This curve represents the relative distribution of one variable in relation to another, which in this case is the distribution of wealth in the population. It is common to visualize, together with the Lorenz curve, the function \({y=x}\) (identity function or line of equality), which represents a situation of perfect equality in the distribution of income of the population, that is, everybody earns an equal amount. The Gini coefficient is then obtained from the area between the Lorenz curve and the line of equality.
Inspired by the concept of the Gini index in economics [11], we applied it to the TRIP datasets in order to measure the level of heterogeneity in the traveling times of the vehicles. For this, the original path, shown in Fig. 1(a), is recovered and divided into pieces, each with an associated Δs and Δt. However, we need the distribution of the times spent over equal distances, so we divide every path into pieces of constant length \(\Delta s^{*}\). The time value \(\Delta t^{*}\) for each new piece is obtained by linear interpolation between consecutive values of the original data, as shown in Fig. 1(b). Once the values of \(\Delta t^{*}\) for each \(\Delta s^{*}\) of a trip are calculated, we compute the cumulative sum of the \(\Delta t^{*}\) values sorted in descending order. Each value of the cumulative sum is normalized by the total sum of \(\Delta t^{*}\) (the total duration of the bus trip). The same is done for the values of \(\Delta s^{*}\). In Fig. 1(c), we show an example of the relation between the cumulative \(\Delta t^{*}\) and the cumulative \(\Delta s^{*}\).
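The resampling of a path into pieces of constant length can be sketched as below. This is one possible reading of the interpolation step, under stated assumptions: the trip is a list of (Δs, Δt) pairs, the elapsed time at each distance mark k·Δs* is interpolated linearly between the two surrounding GPS points, and any leftover distance shorter than one piece is dropped. The function name and these conventions are illustrative.

```python
def resample_equal_distance(deltas, step_km):
    """Time spent in each consecutive piece of length `step_km`.

    `deltas` is a hypothetical list of (delta_s_km, delta_t_s) pairs.
    Returns a list with one delta_t* value per complete piece.
    """
    if not deltas:
        return []

    # Cumulative distance and elapsed time at every GPS point.
    cum_s, cum_t = [0.0], [0.0]
    for ds, dt in deltas:
        cum_s.append(cum_s[-1] + ds)
        cum_t.append(cum_t[-1] + dt)

    n_pieces = int(cum_s[-1] / step_km + 1e-9)
    times_at_marks = []
    j = 0  # index of the GPS segment containing the current mark
    for k in range(n_pieces + 1):
        target = k * step_km
        while j < len(cum_s) - 2 and cum_s[j + 1] < target:
            j += 1
        seg = cum_s[j + 1] - cum_s[j]
        frac = 0.0 if seg == 0 else (target - cum_s[j]) / seg
        frac = min(max(frac, 0.0), 1.0)  # guard against rounding overshoot
        times_at_marks.append(cum_t[j] + frac * (cum_t[j + 1] - cum_t[j]))

    # delta_t* is the elapsed time between consecutive distance marks.
    return [b - a for a, b in zip(times_at_marks, times_at_marks[1:])]
```

For instance, a trip made of a 1 km segment taking 60 s followed by a 1 km segment taking 120 s, resampled with Δs* = 0.5 km, gives four pieces with times 30, 30, 60 and 60 seconds.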
As already mentioned, this is the so-called Lorenz curve \(L(s)\), which indicates, in its original application in economics, what percentage of a country’s wealth is held by a given percentage of its people. In the context of transportation, it indicates the percentage of the trip time consumed as a function of the percentage of the trajectory traveled. This is the reason for using constant distance variations (\(\Delta s^{*}\)): each piece \(\Delta s^{*}\) plays the role of one person, and the \(\Delta t^{*}\) associated with it plays the role of that person’s wealth.
Also in Fig. 1(c), the dashed line represents the line of equality \(E(s)\). If a vehicle travels the whole trajectory at a constant speed, for example, the Lorenz curve of that trip coincides with the equality line, indicating maximum equality of the values of \(\Delta t^{*}\). The other extreme is the case where all the time of the whole course is spent on a single piece \(\Delta s^{*}\). In the analogy with the income distribution of people in a country, it is as if the whole wealth of the country belonged to only one person. The value of the Gini coefficient G is then calculated by measuring the area between the Lorenz curve and the equality line, that is
\[G = 2 \int_{0}^{1} \bigl[L(s) - E(s)\bigr] \,ds,\]
where the factor 2 rescales the maximum possible area, 1/2, so that \(0 \le G \le 1\).
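Numerically, the Gini coefficient of a trip can be estimated from its list of \(\Delta t^{*}\) values by building the descending-order Lorenz curve and integrating with the trapezoidal rule. The sketch below assumes the normalization in which G is twice the area between the Lorenz curve and the equality line, so that it lies between 0 and 1; function names are illustrative.

```python
def lorenz_curve(piece_times):
    """Descending-order Lorenz curve as a list of (s, L(s)) points.

    `piece_times` holds one delta_t* per equal-length piece. The curve
    starts at (0, 0) and ends at (1, 1); because the largest times come
    first, it lies on or above the equality line.
    """
    ordered = sorted(piece_times, reverse=True)
    total = sum(ordered)
    n = len(ordered)
    curve = [(0.0, 0.0)]
    running = 0.0
    for i, dt in enumerate(ordered, start=1):
        running += dt
        curve.append((i / n, running / total))
    return curve


def gini(piece_times):
    """Twice the area between the Lorenz curve and the equality line."""
    curve = lorenz_curve(piece_times)
    area_under_lorenz = 0.0
    for (s0, l0), (s1, l1) in zip(curve, curve[1:]):
        area_under_lorenz += 0.5 * (l0 + l1) * (s1 - s0)  # trapezoid
    return 2.0 * (area_under_lorenz - 0.5)  # area under E(s) is 1/2
```

With a finite number n of pieces the maximum attainable value is 1 − 1/n rather than exactly 1, so very short trips have a slightly compressed range.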
The coefficient measures whether a vehicle spends a disproportionate share of the total trip time on a small part of the distance it has to cover. We checked whether the Pareto 80/20 proportion holds, that is, whether 80% of the trip time is spent on 20% of the total trip length. If a proportion at least similar to this is found, it points to an imbalance in the execution of the trip. Thus, the Gini coefficient can be a good indicator of the quality of the transport journey.
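The 80/20 check amounts to reading the Lorenz curve at s = 0.2. A minimal sketch, assuming the trip is given as a list of \(\Delta t^{*}\) values for equal-length pieces; the function name and the rounding of the number of pieces are choices made for the example.

```python
def time_share_of_slowest(piece_times, space_fraction=0.2):
    """Fraction of total trip time spent on the slowest pieces.

    `space_fraction` is the fraction of the trajectory considered
    (0.2 for the Pareto-style check). At least one piece is always used.
    """
    ordered = sorted(piece_times, reverse=True)
    k = max(1, round(space_fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)
```

A trip at constant speed returns a value close to `space_fraction` itself, while a trip obeying the 80/20 proportion returns about 0.8.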
Figure 2(a) shows a heat map of several trips made from 7:00 AM to 9:00 AM in Fortaleza. The variable shown in this map is \(\Delta t^{*}\), and the logarithmic scale is indicated inside the map. In Fig. 2(b) one trip belonging to the map in Fig. 2(a) is selected, and its time series, the values of \(\Delta t^{*}\) for this particular trip, is shown in Fig. 2(c).
Results
The aim here is to use the Gini coefficient as an indicator to characterize the level of heterogeneity in urban transportation. For this, it is necessary to analyze the \(\Delta t^{*}\) values of each trip. Here we initially investigate the presence of Zipf’s law in the distribution of \(\Delta t^{*}\). This law was originally applied in the field of linguistics [18] and states that the frequency of a word in a text is inversely proportional to its rank, where the most frequent word is ranked 1, the second most frequent is ranked 2, and so on. Generally, the frequency f of a word is given by \(f \propto 1/r^{s}\), where r is the rank of the word and s is an exponent that characterizes the distribution. In Fig. 3(a) we show Zipf curves of various trips, each built from the ranking of the \(\Delta t^{*}\) values. The values of \(\Delta t^{*}\) are ordered from highest to lowest and each is assigned a rank, with the highest time having rank 1. When both axes are placed on a logarithmic scale a straight line is formed, showing that \(\Delta t^{*}\) follows a Zipf distribution. In Fig. 3(b) the Lorenz curves of each trip indicated in Fig. 3(a) are shown. We can compare the 80/20 Pareto Principle ratio to the proportion indicated by our Lorenz curves. For example, the most heterogeneous trip has a Lorenz curve that uses about 70% of the time to travel 20% of the space. Such a disproportion supports the idea that the Gini indicates an inefficiency in the execution of the trip.
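The rank ordering and the straight-line check on logarithmic axes can be illustrated with an ordinary least-squares fit of log(Δt*) against log(rank). This is only a rough sketch under that assumption; a log-log regression is a quick diagnostic, and the function name is hypothetical.

```python
import math


def zipf_exponent(piece_times):
    """Estimate s in delta_t* ~ 1 / rank**s by least squares in log-log space.

    Non-positive times are ignored because their logarithm is undefined.
    """
    ordered = sorted((t for t in piece_times if t > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(t) for t in ordered]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # slope is -s, so flip the sign
```

For synthetic times generated as 100 / r^0.8 for ranks r = 1, ..., 50, the function recovers an exponent of 0.8 up to rounding.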
Another way to understand the usefulness of these methodologies in transport applications is to calculate the Gini coefficient of each bus trip in Fortaleza and Dublin. Figure 4(a) shows the behavior of the average Gini as a function of the time of day. The mean (\(\mu_{\mathrm {gini}}\)) and the standard deviation (\(\sigma_{\mathrm {gini}}\)) of the Ginis were computed for time bins of equal size. The dashed areas around the average represent \(\mu_{\mathrm {gini}} \pm \sigma_{\mathrm {gini}}\). The two peaks in Fig. 4(a) correspond to the rush-hour periods, one in the morning and the other in the evening, when a large number of people and vehicles are commuting. Thus Fig. 4(a) confirms the expectation that higher Gini values are concentrated in the rush hours, showing that the Gini captures this information.
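The binned statistics behind a plot such as Fig. 4(a) can be sketched as follows. The bin width of one hour, the input layout `(departure_hour, gini)` and the use of the population standard deviation are assumptions for the example; the text only states that bins of equal size were used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def gini_by_hour(trips):
    """Mean and standard deviation of Gini values per hour of the day.

    `trips` is a hypothetical iterable of (departure_hour, gini) pairs,
    with departure_hour a float in [0, 24). Returns {hour: (mu, sigma)}.
    """
    bins = defaultdict(list)
    for hour, g in trips:
        bins[int(hour) % 24].append(g)
    return {h: (mean(v), pstdev(v)) for h, v in sorted(bins.items())}
```

The band \(\mu_{\mathrm {gini}} \pm \sigma_{\mathrm {gini}}\) for each hour follows directly from the returned pairs.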
Here we compare the Gini coefficient with the schedules programmed for each trip, in order to analyze whether it correlates with the travel delay. For a vehicle \(V_{i}\) performing a specific route, we define the travel delay D as the fractional deviation from the average time to finish that bus route,
\[D = \frac{T_{i} - \langle T \rangle }{\langle T \rangle },\]
where \(T_{i}\) is the total time of the trajectory of \(V_{i}\) and \(\langle T \rangle \) is the average travel time for all vehicles on the same bus route. In Fig. 4(b) the plot of the delay versus the Gini for Fortaleza shows a clear correlation between these two variables, suggesting that the Gini can encode information about schedule delays. Figure 4(c) shows the same information for Dublin.
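Reading the travel delay as the relative deviation of a trip's total time from the route average, \(D = (T_{i} - \langle T \rangle )/\langle T \rangle \), it can be computed per trip as sketched below. The input layout and function name are assumptions for the example.

```python
from collections import defaultdict


def delay_fractions(trips):
    """Relative deviation of each trip's duration from its route average.

    `trips` is a hypothetical list of (route_id, total_time) pairs.
    Returns one D value per trip, in the same order as the input.
    """
    by_route = defaultdict(list)
    for route, total_time in trips:
        by_route[route].append(total_time)
    avg = {route: sum(v) / len(v) for route, v in by_route.items()}
    return [(t - avg[route]) / avg[route] for route, t in trips]
```

Positive values mark trips slower than the route average and negative values mark faster ones; by construction the values of D on each route average to zero.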
As an illustrative case, we analyzed a specific bus route from Fortaleza (650I). In Fig. 5(a) we show the time series \(\Delta t^{*}\) in a logarithmic color scale for five different trips (see supplementary material for more details of other routes). We can observe that the largest values of \(\Delta t^{*}\) (yellow on the color scale) usually occur in the same places. As shown in Fig. 5(b), the distributions of \(\Delta t^{*}\) obtained from the time series in Fig. 5(a) for each trip can be well fitted by a power–law \(p(\Delta t^{*}) = a (\Delta t^{*})^{-\alpha }\). These distributions show that there are several different travel profiles even for the same bus route, each one with a different value of the exponent α. It is known that the power–law exponent α is mathematically related to the Gini index [19], and \(\lim_{\alpha \to 2.0^{+}} G = 1.0\). This mathematical property lets us plot in Fig. 5(c) the exponent α against the total travel time for the bus line 650I. We see that as \(\alpha \to 2.0\) the total travel time increases, which corroborates the result of Fig. 4(b), where trips with large Gini values also have large time delays. This is a direct consequence of the divergence of the mean of power–law distributions when \(\alpha =2.0\) [20]. In fact, this behavior is observed for all the trips considered in our data set, as can be seen in the compilation in Fig. 6(a) for Fortaleza and Fig. 6(b) for Dublin, where we plotted the time delay fraction against the α exponent. The Nadaraya–Watson nonparametric regression [21, 22] shows that as \(\alpha \to 2.0\) the time delay fraction increases. Therefore, from a public planning perspective, a good bus line needs to have a value of α much larger than 2.0.
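Two ingredients of this paragraph can be made concrete. For an idealized pure power–law density \(p(x) \propto x^{-\alpha }\) on \(x \ge x_{\min }\) with \(\alpha > 2\), the Gini coefficient has the closed form \(G = 1/(2\alpha - 3)\), which tends to 1 as \(\alpha \to 2^{+}\). The Nadaraya–Watson estimator is a kernel-weighted local average. The sketch below assumes that positive-α convention and a Gaussian kernel with a user-chosen bandwidth; the kernel and bandwidth actually used for Fig. 6 are not stated in the text.

```python
import math


def pareto_gini(alpha):
    """Gini of a pure power law p(x) ~ x**(-alpha), x >= x_min, alpha > 2."""
    if alpha <= 2:
        raise ValueError("the mean diverges for alpha <= 2")
    return 1.0 / (2.0 * alpha - 3.0)


def nadaraya_watson(x0, xs, ys, bandwidth):
    """Kernel regression estimate of y at x0 (Gaussian kernel, assumed).

    `xs`, `ys` are the observed pairs, e.g. (alpha, delay fraction).
    If x0 lies many bandwidths away from every observation, all weights
    underflow to zero and the division fails; choose the bandwidth so
    that this does not happen.
    """
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)
```

Evaluating `nadaraya_watson` on a grid of α values gives a smooth trend curve through a scatter of (α, delay) points, while `pareto_gini` shows how quickly heterogeneity grows as α approaches 2 (for example, G = 0.5 at α = 2.5).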
In Fig. 6(c) we show the distribution of the Gini values of all the bus trips calculated from our data sets. The Kolmogorov–Smirnov test [23] was used to check whether the distributions of the Gini index for Fortaleza and Dublin follow a normal distribution. Since the p-values are approximately zero, the hypothesis of normality was rejected for both distributions.
Discussion
Urban mobility and public transport systems have been extensively studied over the years [14–16, 24–28]. However, even considering such vast literature, there is still no consensus on how best to describe the efficiency of vehicle routes in large metropolises [29]. Here we propose that the level of heterogeneity of the distribution of time along the trajectory of a bus can be a measure of the overall quality of the trip.
In order to quantify heterogeneity, we used the Gini coefficient, an index that has its origin in economics and has been extensively used to quantify the inequality level of the income distribution of a social system. We show that the Gini coefficients are strongly correlated with peak usage of the mobility system, as well as with the schedule delays in the system. More precisely, a large value of Gini is an indication of a bus line that has a more unpredictable time schedule. We also see that Dublin has a slightly higher Gini than Fortaleza, which may indicate that traffic flows more evenly in Fortaleza than in Dublin and opens up the possibility of using the Gini to compare different cities.
The findings described in this article offer decision makers within cities alternatives for implementing innovative practices. Since we have shown that the time series follows a power–law distribution, there is an opportunity for a micro–intervention approach, in which changing a small fraction of the bus trajectory could achieve a large improvement. Finally, the Gini could also be used to classify fuel consumption and pollution, since an increase of these factors is well known to be closely related to velocity variations [30].
Abbreviations
 GPS:

Global Positioning System
 VAL:

validation dataset
 TRIP:

trip dataset
 Δs :

variation of distance
 Δt :

variation of time
 \(\Delta s^{*}\) :

constant distance
 \(\Delta t^{*}\) :

variation of time for a \(\Delta s^{*}\)
 \(L(s)\) :

Lorenz curve
 \(E(s)\) :

line of equality
 G:

Gini coefficient
 \(\mu_{\mathrm {gini}}\) :

mean of Ginis
 \(\sigma_{\mathrm {gini}}\) :

standard deviation of Ginis
 \(V_{i}\) :

vehicle id
 D:

travel delay
 \(T_{i}\) :

total time of the trip of \(V_{i}\)
 \(\langle T \rangle \) :

average travel time for all vehicles on the same bus route
 α :

power–law exponent
 PDF:

probability density function
References
 1.
Schwanen T (2013) Commute time in Brazil (1992–2009): differences between metropolitan areas, by income levels and gender. Texto para Discussão, vol 1813a. Instituto de Pesquisa Econômica Aplicada (IPEA)
 2.
Chatterjee K (2000) A human perspective on the daily commute: costs, benefits and tradeoffs. Phys A, Stat Mech Appl 281(1):69–77
 3.
Viswanathan GM, Da Luz MG, Raposo EP, Stanley HE (2011) The physics of foraging: an introduction to random searches and biological encounters. Cambridge University Press, Cambridge
 4.
Metzler R, Klafter J (2000) The random walk’s guide to anomalous diffusion: a fractional dynamics approach. Phys Rep 339(1):1–77
 5.
Viswanathan GM, Buldyrev SV, Havlin S, Da Luz MGE, Raposo EP, Stanley HE (1999) Optimizing the success of random searches. Nature 401(6756):911
 6.
Viswanathan GM, Afanasyev V, Buldyrev SV, Murphy EJ, Prince PA, Stanley HE (1996) Lévy flight search patterns of wandering albatrosses. Nature 381(6581):413
 7.
Brockmann D, Hufnagel L, Geisel T (2006) The scaling laws of human travel. Nature 439(7075):462
 8.
Gonzalez MC, Hidalgo CA, Barabasi AL (2008) Understanding individual human mobility patterns. Nature 453(7196):779–782
 9.
Pareto V (1964) Cours d’économie politique, vol 1. Librairie Droz
 10.
Yakovenko VM (2001) Exponential and power–law probability distributions of wealth and income in the United Kingdom and the United States. Phys A, Stat Mech Appl 299(1):213–221
 11.
Gini C (1912) Variabilità e mutabilità. Reprinted in: Pizetti E, Salvemini T (eds) Memorie di metodologica statistica. Libreria Eredi Virgilio Veschi, Rome
 12.
Sullivan D, Caminha C, Melo H, Furtado V (2017) Towards understanding the impact of crime in a choice of a route by a bus passenger. Preprint. arXiv:1705.03506
 13.
Furtado V, Furtado E, Caminha C, Lopes A, Dantas V, Ponte C, Cavalcante S (2017) A data-driven approach to help understanding the preferences of public transport users. In: Big Data (Big Data), 2017 IEEE international conference on, pp 1926–1935. IEEE
 14.
Caminha C, Furtado V (2017) Impact of human mobility on police allocation. In: Intelligence and security informatics (ISI), 2017 IEEE international conference on, pp 125–127. IEEE
 15.
Caminha C, Furtado V, Pequeno TH, Ponte C, Melo HP, Oliveira EA, Andrade Jr JS (2017) Human mobility in large cities as a proxy for crime. PLoS ONE 12(2):e0171609
 16.
Caminha C, Furtado V, Pinheiro V, Silva C (2016) Microinterventions in urban transportation from pattern discovery on the flow of passengers and on the bus network. In: Smart cities conference (ISC2), 2016 IEEE international, pp 1–6. IEEE
 17.
Dublin bus GPS sample data from Dublin City Council. https://data.gov.ie/dataset/dublinbusgpssampledatafromdublincitycouncilinsightproject. Accessed 2016-10-06
 18.
Zipf GK (2016) Human behavior and the principle of least effort: an introduction to human ecology. Ravenio Books
 19.
Clauset A, Shalizi CR, Newman ME (2009) Power–law distributions in empirical data. SIAM Rev 51(4):661–703
 20.
Sornette D (2006) Critical phenomena in natural sciences: chaos, fractals, self-organization and disorder: concepts and tools. Springer, Berlin
 21.
Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9(1):141–142
 22.
Watson GS (1964) Smooth regression analysis. Sankhya, Ser A 26:359–372
 23.
Lilliefors HW (1967) On the Kolmogorov–Smirnov test for normality with mean and variance unknown. J Am Stat Assoc 62(318):399–402
 24.
Chang SK, Schonfeld PM (1991) Multiple period optimization of bus transit systems. Transp Res, Part B, Methodol 25(6):453–478
 25.
Gordillo F (2006) The value of automated fare collection data for transit planning: an example of rail transit od matrix estimation. PhD thesis, Massachusetts Institute of Technology
 26.
Gao Z, Wu J, Mao B, Huang H (2005) Study on the complexity of traffic networks and related problems. J Commun Transp Syst Eng Inf 2:014
 27.
Wang J, Mo H, Wang F, Jin F (2011) Exploring the network structure and nodal centrality of China’s air transport network: a complex network approach. J Transp Geogr 19(4):712–721
 28.
Munizaga MA, Palma C (2012) Estimation of a disaggregate multimodal public transport origin–destination matrix from passive smartcard data from Santiago, Chile. Transp Res, Part C, Emerg Technol 24:9–18
 29.
Bast H, Delling D, Goldberg A, Müller-Hannemann M, Pajor T, Sanders P, Wagner D, Werneck RF (2016) Route planning in transportation networks. In: Algorithm engineering. Springer, Berlin, pp 19–80
 30.
Cappiello A (2002) Modeling traffic flow emissions. PhD thesis, Massachusetts Institute of Technology
 31.
Mardia KV (1970) Measures of multivariate skewness and kurtosis with applications. Biometrika 57(3):519–530
Acknowledgements
We thank the Brazilian agencies CNPq, CAPES, FUNCAP, and the National Institute of Science and Technology for Complex Systems (INCT-SC) in Brazil for financial support and Heitor Credidio for fruitful discussions.
Availability of data and materials
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Funding
CP, HPMM, CC, JSA and VF have been funded by the Brazilian Agencies: Conselho Nacional de Desenvolvimento Científico e Tecnológico (www.cnpq.br), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (www.capes.gov.br) and Fundação Cearense de Apoio ao Desenvolvimento Científico e Tecnológico (www.funcap.ce.gov.br). JSA has also been funded by the National Institute of Science and Technology for Complex Systems (www.cbpf.br/inctsc). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author information
Affiliations
Contributions
All authors contributed equally. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Caio Ponte.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Ponte, C., Melo, H.P.M., Caminha, C. et al. Traveling heterogeneity in public transportation. EPJ Data Sci. 7, 42 (2018) doi:10.1140/epjds/s13688-018-0172-6
Keywords
 Pareto principle
 Urban mobility
 Heterogeneity