2.1 Data
To extract relevant patterns related to the emergence of train delays, we collected and analysed data about the daily operations of the Italian and German railway systems during the year 2015. The Italian data were collected through the ViaggiaTreno website [25] and the German data through the OpenDataCity website [26]. The information we acquired allowed for a complete reconstruction of the schedules of the trains, the structure of the railway network and the delays that affected the trains during their movements. For more information about the data collection please refer to Additional files 1–3. We chose the Italian and German systems as subjects for our study because of their similarities in structure and management. In fact, their networks have remarkable and comparable sizes (41,315 and 16,723 km, respectively) and densities (8.22 and 12.46 \(\mathrm{km}^{2}\) per km of tracks, respectively [27]). Moreover, these two countries also share a crucial characteristic: in both railway systems traffic is handled mainly by a single national company. In other countries with similar networks, such as the United Kingdom, the railway network is managed by different companies, resulting in a more complex system where trains are additionally subject to the commercial policies of different operators. Figure 1 shows a set of topological analyses of the two railway networks, together with a preliminary analysis of the traffic load and delays in the systems. Following existing literature, we define a railway network as the network whose nodes represent stations, two of which are connected by a link whenever some train has them as two consecutive stops in its schedule. The network is directed since a connection between two stations may be travelled in one direction only. In this paper we refer to the nodes in a railway network either as “nodes” or as “stations”.
Moreover, the action performed by a train travelling from station A to station B will be referred to as “travelling over the link” connecting A and B. In both networks the distribution of the degree k is peaked at \(k=4\) and for larger degrees shows an exponential-like decrease [19, 28] proportional to \(e^{-k/k_{0}}\) with \(k_{0} \simeq 4.5 \pm 0.1\). This finding is in agreement with other geographical networks [29]. The network assortativity (Fig. 1A) [30] is 0.18 for Italy and 0.24 for Germany. This indicates that, while there is a slight preference for stations with the same degree to be connected to each other, the various degrees are mostly mixed, i.e., smaller and larger stations are typically connected. The local clustering coefficient (Fig. 1D), defined as the fraction of pairs of neighbours of a given station that are themselves connected [31], can be used to infer the redundancy of the network: when a disruption occurs, a station with a high clustering coefficient can be bypassed easily. Complementarily, betweenness centrality (Fig. 1E) is a measure of the centrality of a node in a network based on the number of shortest paths that pass through it [31]. This measure highlights how strategic a given station is at a global level.
The clustering and the betweenness outline the typical small-world topology of transportation networks [19]. German stations show a traffic distribution concentrated between 10 and 100 trains per day, narrower than that of the Italian stations, whose broader distribution suggests a more heterogeneous handling of train traffic load (Fig. 1F). Finally, Fig. 1G shows the histogram of the average delays of trains aggregated by station for the two countries. Both distributions show a peak and a heavy tail, but the Italian distribution is clearly shifted towards higher delays compared to the German one. The topological similarities between the two networks suggest that the differences in delays and traffic load might be the result of differences in the train dynamics.
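The network definition and the local clustering coefficient used in this section can be illustrated with a minimal Python sketch. The function names and the toy schedules are hypothetical, and treating the network as undirected when computing the clustering is an assumption made here for simplicity, not necessarily the choice made for Fig. 1.

```python
from collections import defaultdict
from itertools import combinations


def build_network(schedules):
    """Map each station to the set of stations reached by a consecutive stop.

    `schedules` is a list of stop sequences, one per train (hypothetical input format).
    """
    successors = defaultdict(set)
    for stops in schedules:
        for a, b in zip(stops, stops[1:]):
            if a != b:
                successors[a].add(b)          # directed link a -> b
                successors.setdefault(b, set())  # make sure b exists as a node
    return dict(successors)


def local_clustering(successors, station):
    """Fraction of pairs of neighbours of `station` that are themselves connected.

    Assumption: link directions are ignored when deciding who is a neighbour.
    """
    undirected = defaultdict(set)
    for a, targets in successors.items():
        for b in targets:
            undirected[a].add(b)
            undirected[b].add(a)
    neighbours = undirected[station]
    if len(neighbours) < 2:
        return 0.0
    pairs = list(combinations(sorted(neighbours), 2))
    linked = sum(1 for u, v in pairs if v in undirected[u])
    return linked / len(pairs)


# Toy example: three trains over four stations.
toy_schedules = [["A", "B", "C"], ["C", "A"], ["A", "D"]]
toy_network = build_network(toy_schedules)
toy_out_degree = {s: len(t) for s, t in toy_network.items()}
```

In the toy example station A has out-degree 2, and only one of the three pairs of its neighbours (B and C) is connected.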
2.2 Train interaction on railway networks
To focus on the analysis of delays, their outbreak and evolution, we choose trains instead of stations as our reference systems. The intermediate delay of a train i travelling from station A to station B on the link e is denoted \(\Delta_{i} t(e)\). The departure delay at the initial station has been subtracted so as to analyse only the delays generated during the travel. Hence \(\Delta_{i} t(e)\) can be negative whenever the train is ahead of schedule, in which case the train waits at the station for the scheduled departure time. Figure 2A shows the delay distributions for both national systems, considering either the delays at intermediate stations along the path of a train or only those at the final station. We observe similar shapes of these distributions, with the Italian one exhibiting broader tails than the German one. More than 10% of the train stops are on time. The distributions of both countries exhibit an asymmetric pattern: the right tail (labelled Delay) shows a power-law-like behaviour compatible with a q-exponential distribution [32], while the left tail (labelled Advance) has a steeper, exponential slope. We expect this distribution to result from the interplay between the occurrence of adverse conditions and the interactions between trains, which influence their dynamics. Although the microscopic details in our data do not allow for a precise investigation of possible interactions between trains, we can highlight how the possibility of interaction might affect the delays in railway networks. To this end, we study the relation between the average delay of a link and its first order co-activity, i.e., the probability that at a certain time t (with time steps of 30 minutes) a link which is active has at least one active neighbouring link. In Fig. 2B we show this quantity as a function of the average delay for both the Italian and German cases.
We notice a slight increase in average delay as the co-activity increases, confirmed also by the Spearman’s coefficients, 0.43 for Italy and 0.56 for Germany. Hence, the possibility of interaction between trains in a certain part of the railway network seems to increase the delay localized in that area. Note that we have defined the first order co-activity between neighbouring links, i.e. links that have at least one node in common. We can define a k-order co-activity by considering links connecting at least two nodes that are less than \((k-1)\) links apart along the shortest path connecting them. We report the same measurements for \(k=2\) and \(k=3\), showing that in general the same relation with the average delay is confirmed, even though the curves are shifted towards lower values. This indicates that interactions between non-neighbouring links are relevant but may contribute less to the delay. Hence, in the following we will limit our analysis to first order neighbouring links and assume that interactions between trains are possible only when they are on nearby links. Figure 2B supports the thesis of a propagation effect, but the direction of the propagation still has to be determined.
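The first order co-activity defined in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (a dictionary mapping each link to the set of 30-minute bins in which it hosts at least one train) and the function names are hypothetical.

```python
def time_bin(minutes_since_midnight, width=30):
    """Index of the time step containing a given instant (30-minute steps by default)."""
    return int(minutes_since_midnight // width)


def neighbouring(link, other):
    """Two distinct links are neighbours if they have at least one node in common."""
    return link != other and bool(set(link) & set(other))


def first_order_coactivity(active_bins):
    """Fraction of the active time bins of each link with at least one active neighbour.

    `active_bins` maps a link (A, B) to the set of time-bin indices in which
    the link is active (hypothetical input format).
    """
    coactivity = {}
    for link, bins in active_bins.items():
        if not bins:
            continue  # co-activity is conditional on the link being active
        shared = set()
        for other, other_bins in active_bins.items():
            if neighbouring(link, other):
                shared |= bins & other_bins
        coactivity[link] = len(shared) / len(bins)
    return coactivity


# Toy example: A-B and B-C share station B, D-E is isolated.
toy_activity = {
    ("A", "B"): {0, 1, 2, 3},
    ("B", "C"): {1, 2},
    ("D", "E"): {0, 1},
}
toy_coactivity = first_order_coactivity(toy_activity)
```

In the toy example the link A-B is co-active in two of its four active bins, B-C in all of its active bins, and D-E never.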
2.3 Defining possible interactions
Let us consider a train i travelling between two stations A and B (i.e., on the link \(e=AB\), see Fig. 3) with some delay. We can argue that the propagation of this delay to other trains can occur only if they travel on the neighbouring links of e. Because the railway networks are directed, there are four different configurations of the neighbouring links with respect to e:
- (i) links entering A, i.e. trains moving towards the last station crossed by i;
- (ii) links entering B, i.e. trains travelling towards the same station i is currently travelling to;
- (iii) links exiting from A, i.e. trains departed from the last station crossed by i;
- (iv) links exiting from B, i.e. trains leaving the station i is currently travelling to.
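The four configurations can be made concrete with a small classification function. This is a sketch with a hypothetical name and interface; returning all matching labels is a choice made here, since a link such as the reverse link BA both enters A and exits from B.

```python
def classify_neighbour(e, other):
    """Return the configurations (i)-(iv) that the directed link `other` has with respect to `e`.

    Links are (origin, destination) pairs. An empty list means that `other`
    is not a neighbour of `e` (or is `e` itself).
    """
    a, b = e
    x, y = other
    labels = []
    if other == e:
        return labels
    if y == a:
        labels.append("i")    # enters A
    if y == b:
        labels.append("ii")   # enters B
    if x == a:
        labels.append("iii")  # exits from A
    if x == b:
        labels.append("iv")   # exits from B
    return labels
```

For example, with \(e=AB\), the link CA is classified as (i), CB as (ii), AD as (iii) and BD as (iv).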
We can exclude the last two cases: (iii) because trains in this configuration have no interaction with i, and (iv) because it is less important, as it describes scheduled connections, for which schedules foresee extra time between the two trains precisely to avoid delay propagation. We therefore checked whether the propagation occurs in case (i), backward propagation, or in case (ii), forward propagation; both are depicted in Fig. 3. To discriminate which mechanism is at play, we measured the average delay time series \(\Delta t_{e}(t)\) of each link e of the network, defined as the average delay of all the trains travelling on e at time t. Subsequently, we measured the cross-correlation functions of the average delay time series of all pairs of neighbouring links, i.e., \(\mathrm{CC}_{e,e'}(dt)=\sum_{t} \Delta t_{e}(t) \Delta t_{e'}(t+dt)/\sigma_{e}\sigma_{e'}\), where e and \(e'\) are generic neighbouring links of the network, and \(\sigma_{e}\) and \(\sigma_{e'}\) are the standard deviations of the whole time series \(\Delta t_{e}(t)\) and \(\Delta t_{e'}(t)\). Then we averaged these functions, aggregating the pairs of neighbouring links according to their configuration (forward, backward, etc). In this way, for each of the four configurations, we obtained an average cross-correlation function. For the backward propagation configuration it is defined as:
$$ \mathrm{C}_{\mathcal{B}}(dt)= \bigl\langle \mathrm{CC}_{e,e'} (dt)\bigr\rangle _{(e,e') \in \mathcal{B}} $$
(1)
where \(\mathcal{B}\) is the ensemble of link pairs in the backward configuration. For both networks the backward mechanism dominates while the forward one can be neglected (Fig. 2D). The high-speed layer of the railway network shows a similar backward mechanism, while there are no cross-correlations between the delays of high-speed and low-speed trains (see Additional file 1 Fig. S1), so that the two act as independent layers.
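The lagged cross-correlation and its average over a set of link pairs, as in equation (1), can be sketched as follows. Two conventions in this sketch are assumptions rather than statements about the original analysis: the series are mean-subtracted, and the sum is divided by the number of overlapping terms. The function names and the input format (equal-length lists of average delays per time step) are hypothetical.

```python
import math


def cross_correlation(x, y, lag):
    """Normalised correlation between x(t) and y(t + lag) for two equal-length series."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sigma_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sigma_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    if sigma_x == 0 or sigma_y == 0:
        return 0.0  # a constant series carries no correlation information
    terms = [
        (x[t] - mean_x) * (y[t + lag] - mean_y)
        for t in range(n)
        if 0 <= t + lag < n
    ]
    if not terms:
        return 0.0
    # Assumption: normalise by the number of overlapping terms.
    return sum(terms) / (len(terms) * sigma_x * sigma_y)


def average_cross_correlation(series, pairs, lag):
    """Average of CC_{e,e'}(lag) over the link pairs of one configuration, as in equation (1)."""
    values = [cross_correlation(series[e], series[e2], lag) for e, e2 in pairs]
    return sum(values) / len(values)


# Toy example: the second series repeats the first one a time step later.
delay_e = [0, 1, 0, 0, 2, 0, 0, 1, 0, 0]
delay_e2 = [0, 0, 1, 0, 0, 2, 0, 0, 1, 0]
toy_series = {("A", "B"): delay_e, ("C", "A"): delay_e2}
toy_average = average_cross_correlation(toy_series, [(("A", "B"), ("C", "A"))], lag=1)
```

In the toy example the correlation is larger at lag 1 than at lag 0, reflecting the fact that the second series follows the first with a delay of one time step.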
2.4 Exogenous generation of delay
We define two kinds of delays: endogenous and exogenous. By “endogenous” we mean a delay whose origin is inside the railway system dynamics, i.e. it has been caused by another train. Conversely, by “exogenous” we mean a delay whose cause is of a different nature: strikes, malfunctions, bad weather or anything else which is not the result of the interaction with another train. We measure these types of delay directly in our datasets. Let us consider a train i travelling from a station A to a station B on the link e and then on to a station C on the link \(e'\). Its delays on the two links are denoted \(\Delta t_{i}(e)\) and \(\Delta t_{i}(e')\), respectively. If the delay increases, i.e. \(\Delta t_{i}(e') > \Delta t_{i}(e)\), the increase might be “endogenous” or “exogenous”. The exogenous delay is defined as \(\delta t =\Delta t_{i}(e') - \Delta t_{i}(e)\), the variation of the train delay when travelling on links whose neighbouring links were empty or hosted only trains perfectly on time. It is worth noticing that δt might also be negative, for example if the train managed to make up for lateness. Results are reported in Fig. 4, showing the distribution of positive exogenous delays as well as negative ones for the Italian and German cases. In order to model these distributions, we adopted the same approach used in [32] for departure delays. We fitted both the positive and negative parts of the distributions with q-exponential functions, where the parameter q interpolates between an exponential distribution for \(q\rightarrow 1\) and a fat-tailed distribution for \(q\in (1,2]\) [33]:
$$ e_{q,b}(\delta t) \propto \bigl(1 + b(q-1) \delta t\bigr)^{1/(1-q)} \quad \mbox{with } q\in[1,2], b>0. $$
(2)
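Equation (2) can be evaluated numerically with a short function. The following is a minimal sketch of the unnormalised form only; the function name is hypothetical, and the explicit handling of \(q=1\) as the exponential limit is a choice made here for numerical convenience.

```python
import math


def q_exponential(dt, q, b):
    """Unnormalised q-exponential of equation (2) for dt >= 0, q in [1, 2] and b > 0."""
    if dt < 0 or b <= 0 or not 1.0 <= q <= 2.0:
        raise ValueError("expected dt >= 0, b > 0 and 1 <= q <= 2")
    if q == 1.0:
        return math.exp(-b * dt)  # limit q -> 1: ordinary exponential
    return (1.0 + b * (q - 1.0) * dt) ** (1.0 / (1.0 - q))


# For q > 1 the tail decays as a power law, much more slowly than an exponential.
tail_fat = q_exponential(50.0, 1.3, 0.5)
tail_exp = q_exponential(50.0, 1.0, 0.5)
```

At fixed b the two curves coincide at \(\delta t = 0\) and separate in the tail, where the \(q>1\) case is orders of magnitude larger.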
It has been shown that such a distribution can be obtained starting from a Poissonian process \(p(\delta t| \alpha) = \alpha e^{-\alpha \delta t}\), where α is a random variable built from n independent Gaussian random variables \(X_{i}\) with \(\langle X_{i} \rangle=0\) and \(\langle X_{i}^{2} \rangle\neq0\), so that \(\alpha=\sum_{i=1}^{n} X_{i}^{2}\) [32]. In this way it can be proven that \(n = 2/(q-1) - 2\), i.e. the parameter q is related to the number of random variables composing α. The parameter b is proportional to the average value of α, so that large values of b at fixed q result in a distribution biased toward shorter delays. The parameter q quantifies how much equation (2) deviates from an exponential, which is the case \(q=1\). This model has already been applied to the departure delays in the UK railway system, showing that the values of q were such that \(4\leq n \leq 11\) and thus estimating the number of independent occurrences contributing to the delay. For the positive exogenous delays in the Italian and German cases we found \(q=1.23\) and \(q=1.32\), respectively, corresponding to \(n\simeq 7\) and \(n\simeq 4\). The negative part of the distribution is exponential-like for the Italian railway network and broader for the German one, which points to the delay recovery strategies adopted in the latter case. To characterize the effect of the spatial distance on the delay distribution, we subdivided the links e of the railway networks into classes according to the geodesic distance \(d(e)\). Figure 5 shows the behaviour of the q and b parameters of the q-exponential fit as functions of \(d(e)\). The parameter q remains constant in every case, while the parameter b decreases as \(b \sim d^{-a}\). Figures S6–S9 of Additional file 1 show the different best fits for equation (2) as \(d(e)\) varies.
This result suggests that, while the causes of the delay remain the same, the distribution of disturbances gets closer to a power law as the length of the links increases, which points to a relation between link length and delay. Finally, we can assume that the probabilities of positive or negative exogenous delays on links, \(P(\delta t>0|d(e))\) and \(P(\delta t<0|d(e))\), are roughly constant with \(d(e)\) and hence do not depend on the length of the links (Fig. S10 of Additional file 1).
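The relation between n and q quoted in this section can be recovered with a short calculation. The sketch below assumes, beyond what is stated in the text, that all the \(X_{i}\) share the same variance \(\langle X_{i}^{2} \rangle = s^{2}\) (the symbol s is introduced here only for illustration), so that α follows a Gamma distribution with shape \(n/2\) and scale \(2s^{2}\):

$$ f(\alpha) = \frac{\alpha^{n/2-1} e^{-\alpha/(2s^{2})}}{\Gamma(n/2) (2s^{2})^{n/2}}. $$

Averaging the exponential law over α then gives

$$ p(\delta t) = \int_{0}^{\infty} \alpha e^{-\alpha \delta t} f(\alpha) \,d\alpha = n s^{2} \bigl(1 + 2 s^{2} \delta t\bigr)^{-(n/2+1)}. $$

Comparing with equation (2) term by term yields \(1/(q-1) = n/2 + 1\) and \(b(q-1) = 2s^{2}\), that is

$$ n = \frac{2}{q-1} - 2, \qquad b = (n+2) s^{2} = \frac{n+2}{n} \langle\alpha\rangle, $$

where \(\langle\alpha\rangle = n s^{2}\). Under this assumption b is proportional to \(\langle\alpha\rangle\) at fixed q, and the fitted values \(q=1.23\) and \(q=1.32\) give \(n \simeq 6.7\) and \(n \simeq 4.25\), consistent with the rounded values \(n\simeq 7\) and \(n\simeq 4\).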
2.5 Generation of delay at departure
Departure delays, i.e. the delays trains acquire right before leaving the first station on their route, cannot in principle be considered completely exogenous. Since different trains in our datasets can actually be the same physical train (e.g., the same convoy travelling back and forth along the same path on the railway network, a practice denoted as “rotation”), the delay at departure might be influenced by the traffic. However, railway administrators envisage suitable time buffers at the endpoints of the path of each train, so that it is reasonable to assume, at least as a crude approximation, that departure delay is exogenous in character. It has already been shown in [32] that this kind of delay can be described by a q-exponential distribution. However, the dependence of the parameters of the obtained distributions on the topological properties of the network has not been investigated yet. In the same spirit as the previous section for the exogenous delay on links, we divide the nodes of the network (the train stations) according to their out-degree. The out-degree \(k_{\mathrm{out}}\) roughly represents the number of different railway lines starting from a certain station and hence can be considered as a proxy for the complexity of the station itself. Once the nodes of the networks have been divided according to \(k_{\mathrm{out}}\), we fitted the departure delay distributions of each class using a q-exponential, following the procedure defined in [32] (see Fig. S12, Fig. S13 and Fig. S14 of Additional file 1).
Figure 6 shows the behaviour of the parameters q and b of the q-exponential distribution as functions of \(k_{\mathrm{out}}\) for the positive and negative departure delays in the two considered railway networks. Negative departure delays were never reported in the German dataset and hence we assume they are not present. Although better proxies for station complexity than \(k_{\mathrm{out}}\) might exist (for example, weighting each link with the actual number of railway lines on it), we again find a constant parameter q, indicating that the sources of delays can be assumed to be the same independently of the station, while the parameter b decreases exponentially with \(k_{\mathrm{out}}\). The small value of \(R^{2}\) in the case of Italy might reflect the above-mentioned possibility of a non-negligible endogenous contribution to the departure delays because of train rotations.
In Germany the departure delay can be considered constant and independent of the size of the station. In Italy, \(P(\delta t>0 | k_{\mathrm{out}})\) grows linearly with \(k_{\mathrm{out}}\), meaning that stations with high degree generate larger disruptions in the network.
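The grouping of stations by out-degree and the estimate of \(P(\delta t>0 | k_{\mathrm{out}})\) can be sketched as follows. The input formats (a list of directed links and a list of station and departure delay pairs) and the function names are hypothetical, and the sketch only counts occurrences; it does not reproduce the q-exponential fitting procedure of [32].

```python
from collections import defaultdict


def out_degree(links):
    """Number of distinct stations directly reachable from each origin station."""
    targets = defaultdict(set)
    for origin, destination in links:
        targets[origin].add(destination)
    return {station: len(reached) for station, reached in targets.items()}


def positive_delay_probability(departures, k_out):
    """Empirical P(delay > 0 | k_out) from (station, departure delay) records.

    Stations that do not appear as the origin of any link are assigned k_out = 0.
    """
    counts = defaultdict(lambda: [0, 0])  # k_out -> [positive delays, all departures]
    for station, delay in departures:
        k = k_out.get(station, 0)
        counts[k][1] += 1
        if delay > 0:
            counts[k][0] += 1
    return {k: positive / total for k, (positive, total) in sorted(counts.items())}


# Toy example: station A has two outgoing links, station B has one.
toy_links = [("A", "B"), ("A", "C"), ("B", "A"), ("A", "B")]
toy_departures = [("A", 3.0), ("A", 0.0), ("A", 5.0), ("A", -1.0),
                  ("B", 0.0), ("B", 0.0), ("B", 2.0), ("B", 0.0)]
toy_probability = positive_delay_probability(toy_departures, out_degree(toy_links))
```

In the toy example half of the departures from the station with \(k_{\mathrm{out}}=2\) are delayed, against a quarter for the station with \(k_{\mathrm{out}}=1\).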