- Regular article
- Open Access

# Traveling heterogeneity in public transportation

- Caio Ponte^{1} (Email author)
- Hygor Piaget M. Melo^{2}
- Carlos Caminha^{1}
- José S. Andrade Jr.^{3}
- Vasco Furtado^{1}

**Received:** 18 July 2018 **Accepted:** 8 October 2018 **Published:** 19 October 2018

## Abstract

It is well reported that long commutes have a large detrimental effect on people’s health and on the economy of cities. Interestingly, despite the strong impact on our daily lives, a simple way to measure the quality of urban transportation is still unknown. We performed data analysis on the transportation network of two large cities (Fortaleza and Dublin). By dividing each bus trajectory into equal pieces of space, we determine the distribution of time intervals for each trip, and we propose that the heterogeneity of the time distribution can be used to characterize the quality of that trip. Inspired by the use of the Gini coefficient to quantify the inequality level of income distribution, we used the Gini in order to characterize the heterogeneity level of the time distribution. We demonstrated that Gini coefficients are strongly correlated with peak usage of the mobility system, as well as the schedule delays in the system. Finally, our method can be used to find highly heterogeneous trips which have a large negative effect on the urban mobility and can help find new directions for new public planning strategies.

## Keywords

- Pareto principle
- Urban mobility
- Heterogeneity

## 1 Introduction

A long commute has become one of the major problems of modern urbanization. In the largest cities in Brazil the average commuting time is 41 minutes [1]. In these areas, the poorest travelers spend almost 20% more time commuting than the richest and 19% of the poorest commute for more than one hour, while for the richest it is only 11% [1]. This increased time has a large health and economic impact. Studies in the U.K. show that a 20 minute increase in a commute is as bad as a 19% pay cut in regard to job satisfaction [2]. While car users are the majority of the population, bus commuters feel the negative impacts of a longer commuting time more strongly than users of other modes of transport [2].

The study of mobility is an active area of research in the statistical physics community. Models first applied on transport problems in fluid dynamics were later used to understand animal and human mobility [3]. Random walks and reaction–diffusion models were first studied in fluid dynamics and dynamical systems [4]. However, certain special classes of random walks, like Lévy flights, where the lengths of steps are chosen from a power–law tail distribution, are proposed as the mechanism behind animal foraging and human mobility [5–7]. An initial study on human mobility used bank note dispersal as a proxy for human movement and verified that the distribution of distances between consecutive sightings of bank notes is fat tailed [7]. Recent studies using mobile phone calls show similar results [8].

This pervasive heterogeneity is not exclusive to human and animal mobility. Fat tailed distributions are a well known characteristic of natural and economic systems. The Pareto Principle states that roughly 80% of the wealth is owned by 20% of the population [9]. This principle follows directly from the power–law tail of the income distribution [10]. The level of income inequality, or heterogeneity, is measured by the Gini coefficient, a measure of statistical dispersion that ranges from 0 (complete equality) to 1.0 (complete inequality) [11].

Here we used a dataset of bus trajectories to study how travel time is distributed along them. We investigated whether the Pareto Principle holds for urban transportation mobility, namely, does a bus spend 80% of its total journey time on only 20% of the length of its trajectory? We found that this proportion varies between lines, from more homogeneous bus routes with 35% of time for 20% of space, to more heterogeneous ones with 65% of time. We then calculate the Gini coefficient as a measure of heterogeneity and show that it is correlated with schedule delays. We also see that the high heterogeneity of time is a consequence of a power–law distribution, with different exponents for each trip. Since there is a mathematical connection between the Gini and the power–law exponent, we show that when the exponent of a bus trip is close to −2.0 we also have a high probability of schedule delays. These highly heterogeneous trips have a large detrimental effect on urban mobility. However, this also gives us the opportunity for a micro–intervention approach, where we can change a small fraction of the bus routes to have a high impact on the final quality of transportation.

## 2 Methods and data

This research is based on data analysis from Fortaleza and Dublin. Fortaleza is a Brazilian metropolis in which the main form of public transportation is the bus, with about 350 bus routes spread throughout the city, which covers an area of 314 km^{2}. In order to adapt the Gini coefficient for a transport application, two datasets from the Fortaleza bus system were used, namely, the GPS positions of the buses and the passenger validation records (VAL). The data were obtained through Fortaleza’s city hall and refer to the period between the 12th and the 17th of April 2016, from Tuesday to Sunday. These data have been used in several studies in the last few years [12–16]. The GPS dataset is the larger of the two and consists of about 21M GPS points. Because the location of each bus is recorded approximately every thirty seconds, it is possible to recover the route of the vehicles throughout the day. In addition to the georeferenced location, each record has the time at which the GPS point was recorded and the vehicle located at that point.

The second dataset is the VAL. The bus travelers in Fortaleza can use a smart card as a ticket to pay for their trips. Thus, every time a passenger uses their smart card, a validation is recorded. Fortaleza city hall stores, together with this information of the user’s validation, the time of departure and arrival of vehicles for each bus route. That is, every time a journey starts and ends, an employee of the bus company takes note of the time the journey began and finished. However, as this work is concerned with analyzing only the journey performed by the bus, from this data, only the information when the journeys begin and end will be utilized.

Because the vehicle *id* is present in both datasets, they can be cross referenced to generate another dataset (TRIP) containing the route made by each vehicle on the several bus routes. This is done by selecting the GPS points of a particular vehicle that fall within the time range of a journey identified in the chronologically ordered VAL data. As a consequence, TRIP contains the variation of distance (Δ*s*) and time (Δ*t*) between pairs of consecutive GPS points of each journey. About 238,000 bus journeys were generated.
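The cross-referencing step can be sketched as follows. This is a minimal illustration, not the authors' code: the tuple layout `(timestamp_s, lat, lon)`, the function names, and the use of the haversine formula for Δ*s* are all assumptions introduced here.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def trip_increments(gps_points, t_start, t_end):
    """Return [(ds_m, dt_s), ...] between consecutive GPS points of one journey.

    gps_points: iterable of (timestamp_s, lat, lon) for a single vehicle
                (hypothetical layout).
    t_start, t_end: journey start and end times, as recorded in VAL.
    """
    # Keep only the points inside the journey window, in chronological order.
    window = sorted(p for p in gps_points if t_start <= p[0] <= t_end)
    return [
        (haversine_m(a[1], a[2], b[1], b[2]), b[0] - a[0])
        for a, b in zip(window, window[1:])
    ]
```

Each returned pair corresponds to one (Δ*s*, Δ*t*) entry of TRIP for that journey.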

In order to remove noise, a filter was applied to the TRIP data. Some trips had no GPS points at all, while in others the vehicle appeared to travel at a very high speed, over a very short distance, or for too long or too short a time. Therefore, trips meeting any of the following criteria were excluded from the analysis: (1) trips in which, at some point, the vehicle traveled at speeds greater than 120 km/h; (2) trips with a route shorter than 5 km; and (3) trips that took less than 30 minutes or more than 3 hours. After filtering, approximately 91 thousand trips remained to be analyzed.
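The three exclusion criteria can be expressed as a single predicate. The sketch below assumes a trip is given as a list of `(ds_m, dt_s)` increments in metres and seconds; that representation, the function name, and the handling of non-positive time steps are assumptions made here for illustration.

```python
MAX_SPEED_KMH = 120.0          # criterion (1)
MIN_LENGTH_M = 5_000.0         # criterion (2): 5 km
MIN_DURATION_S = 30 * 60       # criterion (3): 30 minutes
MAX_DURATION_S = 3 * 60 * 60   # criterion (3): 3 hours


def keep_trip(increments):
    """Return True if a trip, given as [(ds_m, dt_s), ...], passes all filters."""
    if not increments:
        return False  # no GPS points for this trip
    for ds, dt in increments:
        if dt <= 0:
            # Assumption: repeated or out-of-order timestamps are treated as noise.
            return False
        if (ds / dt) * 3.6 > MAX_SPEED_KMH:  # m/s converted to km/h
            return False
    total_m = sum(ds for ds, _ in increments)
    total_s = sum(dt for _, dt in increments)
    if total_m < MIN_LENGTH_M:
        return False
    return MIN_DURATION_S <= total_s <= MAX_DURATION_S
```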

Data from Dublin were obtained from Ireland’s open data page [17]. The file contains GPS positions of the city buses from January 1st to January 31st, 2013. Altogether, there are about 37 million GPS points. The data contain the geographical positions (latitude and longitude), the date and time at which such positions were recorded, the vehicle identifier, the bus route identifier and the trip identifier. Therefore, to reassemble the trips it was necessary to separate the GPS points by the *id* of the trip and to order them by date and time, generating the dataset TRIP for Dublin. This TRIP dataset also contains the distance (Δ*s*) and time (Δ*t*) variations between each pair of consecutive GPS points, as in the Fortaleza TRIP dataset. To remove noise from the data, the same three filters used for the Fortaleza data were applied. In the end, the TRIP dataset was left with 65,000 records.
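Reassembling the Dublin trips amounts to a group-by on the trip identifier followed by a chronological sort. A minimal sketch, assuming each GPS record is a dictionary with the hypothetical keys `trip_id` and `timestamp` (the actual column names of the open data file are not given in the text):

```python
from collections import defaultdict


def group_trips(records):
    """Group GPS records by trip id and order each group by timestamp.

    records: iterable of dicts with at least 'trip_id' and 'timestamp'
             (hypothetical field names).
    Returns {trip_id: [records in chronological order]}.
    """
    trips = defaultdict(list)
    for rec in records:
        trips[rec["trip_id"]].append(rec)
    for points in trips.values():
        points.sort(key=lambda r: r["timestamp"])
    return dict(trips)
```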

The TRIP data give Δ*s* and Δ*t* between consecutive GPS points. However, it is necessary to analyze the distribution of the times required to cover equal distances, so we divide every path into pieces of constant length \(\Delta s^{*}\). The time value \(\Delta t^{*}\) for each new piece is obtained by linear interpolation between each two consecutive values of the original data, as shown in Fig. 1(b). Once the values of \(\Delta t^{*}\) for each \(\Delta s^{*}\) of a trip are calculated, we perform a cumulative sum over the values of \(\Delta t^{*}\) sorted in descending order. Each value of the cumulative sum is normalized by the total sum of \(\Delta t^{*}\) (the total duration of the bus trip). The same is done for the values of \(\Delta s^{*}\). In Fig. 1(c), we show an example of the relation between the cumulative \(\Delta t^{*}\) and the cumulative \(\Delta s^{*}\). This is the so-called *Lorenz curve* \(L(s)\), which indicates, in its original application in economics, what percentage of people hold a given percentage of a country’s wealth. In the context of transportation, it would indicate the percentage of the trajectory traveled as a function of the percentage of the time for the trip. This is the reason for using constant distance variations (\(\Delta s^{*}\)), since each value \(\Delta s^{*}\) would represent one person and each \(\Delta t^{*}\) associated with a \(\Delta s^{*}\) would represent the wealth of the person in question.
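The resampling and the construction of the Lorenz curve can be sketched as below. This is an illustrative reading of the procedure, not the authors' implementation: the function names are introduced here, a trip is assumed to be a list of `(ds, dt)` increments, and any leftover distance shorter than one full piece at the end of the trip is simply dropped.

```python
def resample_equal_distance(increments, ds_star):
    """Return the times dt* needed to cover consecutive pieces of length ds_star.

    increments: [(ds, dt), ...] between consecutive GPS points.
    Times at the piece boundaries are found by linear interpolation of
    cumulative time against cumulative distance.
    """
    cum_s, cum_t = [0.0], [0.0]
    for ds, dt in increments:
        cum_s.append(cum_s[-1] + ds)
        cum_t.append(cum_t[-1] + dt)

    n_pieces = int(cum_s[-1] // ds_star)
    dt_star, prev_t, j = [], 0.0, 0
    for k in range(1, n_pieces + 1):
        target = k * ds_star
        # Advance to the original segment that contains the boundary.
        while j + 1 < len(cum_s) - 1 and cum_s[j + 1] < target:
            j += 1
        span = cum_s[j + 1] - cum_s[j]
        frac = (target - cum_s[j]) / span if span > 0 else 1.0
        t_here = cum_t[j] + frac * (cum_t[j + 1] - cum_t[j])
        dt_star.append(t_here - prev_t)
        prev_t = t_here
    return dt_star


def lorenz_curve(dt_star):
    """Return [(space fraction, time fraction), ...] with dt* in descending order."""
    ordered = sorted(dt_star, reverse=True)
    total, n = sum(ordered), len(ordered)
    curve, running = [(0.0, 0.0)], 0.0
    for i, dt in enumerate(ordered, start=1):
        running += dt
        curve.append((i / n, running / total))
    return curve
```

Because the pieces all have the same length, the cumulative fraction of space after *i* pieces is simply *i*/*n*, which is why only the times need to be sorted.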

The Gini coefficient *G* is then calculated by measuring the area between the Lorenz curve and the equality line.
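One expression consistent with this description is written out below. It assumes the conventional normalization by the maximum possible area (1/2), so that *G* lies between 0 and 1, and it uses the fact that \(L(s)\) lies on or above the equality line because the \(\Delta t^{*}\) values are accumulated in descending order. It should be read as a sketch under these assumptions rather than as the exact form used in the original analysis.

```latex
G = 2\int_{0}^{1}\bigl[L(s) - s\bigr]\,ds = 2\int_{0}^{1} L(s)\,ds - 1
```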

This measures whether a vehicle spends a disproportionate share of the total trip time on a small part of the distance needed to complete the trip. We verified whether the Pareto 80/20 proportion exists, which would indicate that 80% of the trip time is spent traveling 20% of the total trip length. If behavior at least similar to this is found, it points to an imbalance in the execution of the trip. Thus, the Gini coefficient can be a good indicator of the quality of the transport journey.
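Both quantities can be computed directly from the equal-distance times \(\Delta t^{*}\) of one trip. The sketch below uses the trapezoidal rule for the area under the Lorenz curve and assumes the normalization in which *G* = 0 for a perfectly uniform trip; the function names are hypothetical.

```python
def gini_from_times(dt_star):
    """Gini coefficient of the equal-distance times of one trip.

    The Lorenz curve is built with dt* in descending order, so it lies on or
    above the equality line; G is twice the area between the two.
    """
    ordered = sorted(dt_star, reverse=True)
    n, total = len(ordered), sum(ordered)
    area, prev, running = 0.0, 0.0, 0.0
    for dt in ordered:
        running += dt
        cur = running / total
        area += (prev + cur) / (2 * n)  # trapezoid of width 1/n
        prev = cur
    return 2 * area - 1


def time_share_of_slowest_space(dt_star, space_fraction=0.2):
    """Fraction of trip time spent on the slowest `space_fraction` of the pieces."""
    ordered = sorted(dt_star, reverse=True)
    k = max(1, round(space_fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)
```

With a finite number of pieces *n*, this discrete estimate cannot exceed 1 − 1/*n*, so the choice of \(\Delta s^{*}\) affects how close to 1.0 the coefficient can get. A value of `time_share_of_slowest_space` near 0.8 would correspond to the Pareto 80/20 proportion.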

## 3 Results

We looked for *Zipf’s law* in the distribution of \(\Delta t^{*}\). This law was originally applied to the field of linguistics [18] and states that the frequency of a word in a given text is inversely proportional to its rank in the frequency table, such that the most frequent word is ranked 1, the second ranked 2, and so on. Generally, the frequency *f* of a word is given by \(f \propto 1/r^{s}\), where *r* is the rank of the word and *s* is an exponent that characterizes the distribution. In Fig. 3(a) we illustrate Zipf curves of various trips, each built from the ranking of the \(\Delta t^{*}\) values. The values of \(\Delta t^{*}\) are ordered from highest to lowest and each is assigned a rank, with the highest time having rank 1. When the axes are placed on a logarithmic scale, a straight line is formed, demonstrating that \(\Delta t^{*}\) follows a Zipf distribution. In Fig. 3(b) the Lorenz curves of each trip indicated in Fig. 3(a) are shown. We can compare the 80/20 Pareto Principle ratio to the proportion indicated on our Lorenz curve. For example, the most heterogeneous trip has a Lorenz curve that uses about 70% of the time to travel 20% of the space. Such disproportion may reinforce the idea that a high Gini indicates an inefficiency in the trip execution.
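The rank construction and the straight-line check on logarithmic axes can be sketched as follows. The least-squares fit on the log-log values is an assumption made here for simplicity; the text does not say how the exponents were estimated, and for careful work on power laws maximum-likelihood estimators such as those discussed in [19] are generally preferred.

```python
import math


def zipf_exponent(dt_star):
    """Estimate s in value ~ rank**(-s) by least squares on log-log axes.

    Non-positive values are discarded because their logarithm is undefined.
    """
    ordered = sorted((v for v in dt_star if v > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(v) for v in ordered]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the slope is negative, so s is its opposite
```

A trip whose ranked \(\Delta t^{*}\) values fall on a straight line in log-log scale will return the (positive) magnitude of that line's slope.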

We define *D* as the deviation fraction of the average time to finish that bus route.

The distribution of \(\Delta t^{*}\) of each trip follows a power law with exponent *α*. It is known that the power–law exponent *α* is mathematically related to the Gini index [19], and \(\lim_{\alpha \to 2.0^{+}} G = 1.0\). This mathematical property lets us plot in Fig. 5(c) the exponent *α* against the total travel time for the bus line *650I*. We see that as \(\alpha \to 2.0\) the total travel time increases, which corroborates the result of Fig. 4(b), where trips with large Gini values also have large time delays. This is a direct consequence of the divergence of the mean of power–law distributions when \(\alpha = 2.0\) [20]. In fact, this behavior is observed for all the trips considered in our data set, as can be seen in the compilation in Fig. 6(a) for Fortaleza and Fig. 6(b) for Dublin, where we plot the time delay fraction against the *α* exponent. The Nadaraya–Watson non-parametric regression [21, 22] shows that as \(\alpha \to 2.0\) the time delay fraction increases. Therefore, from a public planning perspective, a good bus line needs to have a value of *α* much larger than 2.0.
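The quantities in this paragraph can be illustrated with three small functions. All of them are sketches under stated assumptions. `gini_from_alpha` uses the standard result for a pure Pareto density \(p(x) \propto x^{-\alpha}\) above a lower cutoff, which gives \(G = 1/(2\alpha - 3)\) for \(\alpha > 2\) and reproduces the limit quoted above. `deviation_fraction` assumes that the delay fraction is the relative deviation of a trip's duration from the mean duration of its route, since the defining equation is not reproduced in the text. `nadaraya_watson` uses a Gaussian kernel with a user-chosen bandwidth; the kernel and bandwidth used for Fig. 6 are not stated.

```python
import math


def gini_from_alpha(alpha):
    """Gini of a pure power-law density p(x) ~ x**(-alpha), valid for alpha > 2."""
    if alpha <= 2:
        raise ValueError("the mean diverges for alpha <= 2, so G is undefined")
    return 1.0 / (2.0 * alpha - 3.0)


def deviation_fraction(trip_time, mean_route_time):
    """Assumed form of D: relative deviation from the route's mean duration."""
    return (trip_time - mean_route_time) / mean_route_time


def nadaraya_watson(x_obs, y_obs, x_query, bandwidth):
    """Gaussian-kernel Nadaraya-Watson estimate of E[y | x = x_query].

    The estimate is a weighted mean of y_obs, with weights that decay with
    the distance between x_query and each observed x.
    """
    weights = [math.exp(-0.5 * ((x_query - x) / bandwidth) ** 2) for x in x_obs]
    return sum(w * y for w, y in zip(weights, y_obs)) / sum(weights)
```

Evaluating `nadaraya_watson` on a grid of *α* values, with the per-trip exponents as `x_obs` and the per-trip delay fractions as `y_obs`, yields a smooth trend curve of the kind described for Fig. 6. If `x_query` lies very far from every observation relative to the bandwidth, the weights can underflow to zero, so the grid should stay within the observed range.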

In Fig. 6(c) we show the distribution of Gini values of all the bus trips calculated from our data sets. The Kolmogorov–Smirnov test [23] was used to check whether the distributions of the Gini index for Fortaleza and Dublin follow a normal distribution. Since the *p*-values are approximately zero, the hypothesis of normality was rejected.
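A normality check of this kind can be sketched with the standard library alone. The function below computes the Kolmogorov–Smirnov statistic against a normal distribution whose mean and standard deviation are estimated from the sample, and converts it to a *p*-value with the asymptotic Kolmogorov series. This is an assumption-laden illustration: with estimated parameters the asymptotic *p*-value is too large (the test is conservative), which is the situation addressed by the Lilliefors correction of [23]; the function name and the cutoff used for very small statistics are choices made here.

```python
import math
from statistics import NormalDist, fmean, stdev


def ks_normality(sample):
    """Return (D, p) for a KS test of `sample` against a fitted normal.

    D is the largest gap between the empirical CDF and the fitted normal CDF.
    p comes from the asymptotic Kolmogorov distribution and is only
    approximate when the mean and standard deviation are estimated.
    """
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Compare with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    lam = math.sqrt(n) * d
    if lam < 0.3:
        return d, 1.0  # the series converges slowly here; p is ~1 anyway
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))
```

A *p*-value close to zero, as reported for both cities, leads to rejecting normality even under this conservative approximation.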

## 4 Discussion

Urban mobility and public transport systems have been extensively studied over the years [14–16, 24–28]. However, even considering such vast literature, there is still no consensus on how best to describe the efficiency of vehicle routes in large metropolises [29]. Here we propose that the level of heterogeneity of the distribution of time during the trajectory of a bus can be a measurement of the overall quality of the trip.

In order to quantify heterogeneity, we used the Gini coefficient, an index that has its origin in economics and has been extensively used to quantify the inequality level of the income distribution of a social system. We show that the Gini coefficients are strongly correlated with peak usage of the mobility system, as well as the schedule delays in the system. More precisely, a large value of Gini is an indication of a bus line that has a more unpredictable time schedule. We also see that Dublin has a slightly higher Gini than Fortaleza, which may indicate that Fortaleza has better urban traffic flow than Dublin and opens up the possibility of using the Gini to compare different cities.

The findings described in this article introduce alternatives for the implementation of innovative practices for decision makers within cities. Since we have shown that the time series follows a power–law distribution, this opens the opportunity for a micro–intervention approach, in which we could change a small fraction of the bus trajectory in order to achieve a large improvement. Finally, the Gini can also be used to classify fuel consumption and pollution, since an increase of these factors is well known to be closely related to velocity variations [30].

## Declarations

### Acknowledgements

We thank the Brazilian agencies CNPq, CAPES, FUNCAP, and the National Institute of Science and Technology for Complex Systems (INCT-SC) in Brazil for financial support and Heitor Credidio for fruitful discussions.

### Availability of data and materials

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### Funding

CP, HPMM, CC, JSA and VF have been funded by the Brazilian Agencies: Conselho Nacional de Desenvolvimento Científico e Tecnológico (www.cnpq.br), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (www.capes.gov.br) and Fundação Cearense de Apoio ao Desenvolvimento Científico e Tecnológico (www.funcap.ce.gov.br). JSA has also been funded by the National Institute of Science and Technology for Complex Systems (www.cbpf.br/inct-sc). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

### Authors’ contributions

All authors contributed equally. All authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## Authors’ Affiliations

## References

1. Schwanen T (2013) Commute time in Brazil (1992–2009): differences between metropolitan areas, by income levels and gender. Texto para Discussão, vol 1813a. Instituto de Pesquisa Econômica Aplicada (IPEA)
2. Chatterjee K (2000) A human perspective on the daily commute: costs, benefits and tradeoffs. Phys A, Stat Mech Appl 281(1):69–77
3. Viswanathan GM, Da Luz MG, Raposo EP, Stanley HE (2011) The physics of foraging: an introduction to random searches and biological encounters. Cambridge University Press, Cambridge
4. Metzler R, Klafter J (2000) The random walk’s guide to anomalous diffusion: a fractional dynamics approach. Phys Rep 339(1):1–77
5. Viswanathan GM, Buldyrev SV, Havlin S, Da Luz MGE, Raposo EP, Stanley HE (1999) Optimizing the success of random searches. Nature 401(6756):911
6. Viswanathan GM, Afanasyev V, Buldyrev SV, Murphy EJ, Prince PA, Stanley HE (1996) Lévy flight search patterns of wandering albatrosses. Nature 381(6581):413
7. Brockmann D, Hufnagel L, Geisel T (2006) The scaling laws of human travel. Nature 439(7075):462
8. Gonzalez MC, Hidalgo CA, Barabasi A-L (2008) Understanding individual human mobility patterns. Nature 453(7196):779–782
9. Pareto V (1964) Cours d’économie politique, vol 1. Librairie Droz
10. Yakovenko VM (2001) Exponential and power–law probability distributions of wealth and income in the United Kingdom and the United States. Phys A, Stat Mech Appl 299(1):213–221
11. Gini C (1912) Variabilità e mutabilità. Reprinted in: Pizetti E, Salvemini T (eds) Memorie di metodologica statistica. Libreria Eredi Virgilio Veschi, Rome
12. Sullivan D, Caminha C, Melo H, Furtado V (2017) Towards understanding the impact of crime in a choice of a route by a bus passenger. Preprint. arXiv:1705.03506
13. Furtado V, Furtado E, Caminha C, Lopes A, Dantas V, Ponte C, Cavalcante S (2017) A data-driven approach to help understanding the preferences of public transport users. In: Big Data (Big Data), 2017 IEEE international conference on, pp 1926–1935. IEEE
14. Caminha C, Furtado V (2017) Impact of human mobility on police allocation. In: Intelligence and security informatics (ISI), 2017 IEEE international conference on, pp 125–127. IEEE
15. Caminha C, Furtado V, Pequeno TH, Ponte C, Melo HP, Oliveira EA, Andrade Jr JS (2017) Human mobility in large cities as a proxy for crime. PLoS ONE 12(2):e0171609
16. Caminha C, Furtado V, Pinheiro V, Silva C (2016) Micro-interventions in urban transportation from pattern discovery on the flow of passengers and on the bus network. In: Smart cities conference (ISC2), 2016 IEEE international, pp 1–6. IEEE
17. Dublin bus GPS sample data from Dublin City Council. https://data.gov.ie/dataset/dublin-bus-gps-sample-data-from-dublin-city-council-insight-project. Accessed 2016-10-06
18. Zipf GK (2016) Human behavior and the principle of least effort: an introduction to human ecology. Ravenio Books
19. Clauset A, Shalizi CR, Newman ME (2009) Power–law distributions in empirical data. SIAM Rev 51(4):661–703
20. Sornette D (2006) Critical phenomena in natural sciences: chaos, fractals, selforganization and disorder: concepts and tools. Springer, Berlin
21. Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9(1):141–142
22. Watson GS (1964) Smooth regression analysis. Sankhya, Ser A 26:359–372
23. Lilliefors HW (1967) On the Kolmogorov–Smirnov test for normality with mean and variance unknown. J Am Stat Assoc 62(318):399–402
24. Chang SK, Schonfeld PM (1991) Multiple period optimization of bus transit systems. Transp Res, Part B, Methodol 25(6):453–478
25. Gordillo F (2006) The value of automated fare collection data for transit planning: an example of rail transit od matrix estimation. PhD thesis, Massachusetts Institute of Technology
26. Gao Z, Wu J, Mao B, Huang H (2005) Study on the complexity of traffic networks and related problems. J Commun Transp Syst Eng Inf 2:014
27. Wang J, Mo H, Wang F, Jin F (2011) Exploring the network structure and nodal centrality of China’s air transport network: a complex network approach. J Transp Geogr 19(4):712–721
28. Munizaga MA, Palma C (2012) Estimation of a disaggregate multimodal public transport origin–destination matrix from passive smartcard data from Santiago, Chile. Transp Res, Part C, Emerg Technol 24:9–18
29. Bast H, Delling D, Goldberg A, Müller-Hannemann M, Pajor T, Sanders P, Wagner D, Werneck RF (2016) Route planning in transportation networks. In: Algorithm engineering. Springer, Berlin, pp 19–80
30. Cappiello A (2002) Modeling traffic flow emissions. PhD thesis, Massachusetts Institute of Technology
31. Mardia KV (1970) Measures of multivariate skewness and kurtosis with applications. Biometrika 57(3):519–530