Corporate payments networks and credit risk rating

Aggregate and systemic risk in complex systems are emergent phenomena depending on two properties: the idiosyncratic risks of the elements and the topology of the network of interactions among them. While a significant attention has been given to aggregate risk assessment and risk propagation once the above two properties are given, less is known about how the risk is distributed in the network and its relations with the topology. We study this problem by investigating a large proprietary dataset of payments among 2.4M Italian firms, whose credit risk rating is known. We document significant correlations between local topological properties of a node (firm) and its risk. Moreover we show the existence of an homophily of risk, i.e. the tendency of firms with similar risk profile to be statistically more connected among themselves. This effect is observed when considering both pairs of firms and communities or hierarchies identified in the network. We leverage this knowledge to show the predictability of the missing rating of a firm using only the network properties of the associated node.

Assessing the aggregate risk emerging in complex systems is of paramount importance in disparate fields, such as economics, finance, epidemiology, infrastructure engineering, etc.. A large body of recent literature has explored, both theoretically and empirically, how risk propagates [Pozzi et al., 2013] and how to assess aggregate risk when the risk of each individual entity is known [Nier et al., 2007], as well as the topology of the network of interaction among them. Both aspects have been shown to be important, however their mutual relation is relatively less explored. In theoretical studies, one typically assumes independence between idiosyncratic risk and topology, while in empirical studies the correlation is the one present in the investigated dataset.
But what is the relation (if any) between the idiosyncratic risk of a node and its local topological properties (e.g. degree, centrality, community, etc)? In this paper we answer this for rating prediction, it can be embedded in the now broad literature of classification problems [Friedman et al., 2001]. The idea of employing machine learning techniques in credit rating scoring has been explored before [Altman et al., 1994], but in that case the predictors for the rating are all derived from balance sheets, so the results are not comparable with ours. Other works use more heterogeneous information to predict the rating [Wilson and Sharda, 1994, Grunert et al., 2005, Lee, 2007, Parnes, 2012.
This paper contributes to these streams of literature in several aspects. First, we investigate the topological properties of payment networks by considering standard network metrics, such as degree and strength distribution and components decomposition. We find that the large payment networks investigated in this paper share the properties observed in other complex networks, namely they are sparse but almost entirely made of a single component, they are scale free and small world. Then, we look into the distribution of risk of firms in the network of payments in order to quantify the dependence between the network property of a node or a group of nodes and the risk of the firm represented by the node(s). The main and most innovative contribution of this paper is to document the existence of such correlations. We find an homophily of risk, i.e. the tendency of a firm to interact with firms with similar risk. This is a two nodes properties, but a similar behaviour is observed, even more clearly, also at larger aggregation scales. Communities of firms, detected by using different methods, often display a statistically significant abundance of firms of a specific risk class, indicating the tendency of firms with similar rating to be linked together through payments. Risk is therefore not spread uniformly on the network, but rather it is concentrated in specific areas. This implies that an idiosyncratic shock on a single firm can propagate more or less quickly depending on the local network structure and the community the node belongs to. The last contribution, is to exploit this correlation between risk of a firm and network characteristics of the corresponding node to predict the risk rating of the firm using network properties alone. To this end, we employ machine learning techniques to build classifiers for risk rating whose inputs are only network properties (e.g. degree, community, etc.). We show that our classification method has a good performance both in terms of accuracy and of recall and that outperforms significantly random assignments.
1 The network of payments

The dataset
The investigated dataset contains information on payments between more than two million Italian firms and is built from transactional data of the payment platform of a major European bank. Transactions are registered with daily granularity for the year 2014, for a total of 47M records, each of which includes the two counterparts involved, date, type, amount and number of transactions in the same day. Transactions are originally identified by account, but in case of customers and former customers, multiple accounts associated to the same firm are aggregated into a single entity 1 . This results in a total of 2.4M entities (which will be referred to as firms, for brevity) operating through the platform during the whole investigated period. The counterparts can be of different types. In principle, any firm or public body can make use of the platform, but in practice in most cases at least one is a customer of the bank. Similar considerations hold for the total amount exchanged: in each month more than 50% of the volume is transferred between customers, and it rises to above 95% when considering transaction with at least one customer involved. More details on the dataset and some descriptive statistics is presented in A. For customers, the dataset contains information on the economic sector and on the internal rating of the firm on a three value scale: Low (L), Medium (M), and High (H) risk.

Networks definition and basic metrics
A network, or graph, is identified by two sets: V , the sets of nodes with cardinality |V | = n, and E, the sets of links or edges, with cardinality |E| = m. The latter is the collection of ordered pairs of connected nodes. In our case, we also take into account the strength of interactions so a weight w ij is associated with each link. Starting from transaction data, payment networks are constructed as follows: given a time window, each node represents a firm active in that period, if there is payment between two firms a link from the source to the recipient is added, with weight equal to the payment amount. If multiple transactions occur between the same (ordered) pair of nodes, the weight of the link is the sum of the amounts of the payments. Therefore for each time period we construct a directed and weighted network. The time window of analysis may vary depending on the type of information one wants to extract from the dataset. In the following the focus will be on monthly networks, for which results are quite stable, at the cost of dealing with fewer and larger graphs. For the period covered by the dataset, each monthly network consists on average of n = 1M nodes and m = 3.2M links with the lowest activity in August and the highest in July (see A.1). The density ρ = m n(n−1) is thus small, resulting in a so called sparse network. Nevertheless this low density does not imply a disaggregated system. Indeed for all the networks the diameter is very small compared to the size: on average starting from a node, one has to pass at most 19 links to reach any other node in the weakly connected component (see Table 1). Thus the networks have the so called small-world property.

Networks topology
When considering a small number of firms, one would expect simple topologies: one firms is the supplier of intermediate products for another firm, resulting in a line (the simplest supply chain), or one firm is a supplier or a buyer for many others firms, resulting in a star network. Instead what is observed is a much more complex organisation, with a non negligible presence of cycles.
At a very coarse level, it is possible to identify two large classes of firms. The first constitute the core of the network, which includes approximately 20% of the nodes and more than half of the links. This core has a density an order of magnitude larger than that of the whole network and it is characterised by the fact that any pair of firms is connected, directly or via intermediaries. Around 60% of the total volume circulates among the nodes of the core. The other class is made of payers-only, i.e nodes that have no incoming links. These represent each month about one half of the active firms and their activity is sporadic. To better understand the role of this significant subset of firms we check their customer status and we find that the majority of them are unclassified, and that their number is larger than one expects from the unconditional distribution among all the firms. This means that likely they are not customers and, more importantly, almost no information, for example about risk, is available on them. For further details on this refer to A.3. We now turn our attention to the distribution of degree and strength. In our case the in-(out-) degree is the number of payers (payees) of a given firm and the corresponding amount of Euro. For the monthly aggregation case the average in-and out-degree of a firm is 6 and 4, respectively (see Table 1). These low values are a direct consequence of the low density of the network. However the degrees and the strengths are extremely heterogeneous as testified by the degree and strength distribution. Figure 1 shows the empirical cumulative distribution for these two quantities in a double logarithmic scale. The approximately straight line indicates the presence of a fat tail with a power law behaviour. The fit of the exponent supports the observation that in-and outdegree distribution data are consistent with a power-law tail and the estimated exponents are around 2.6 and 2.8, respectively. Similarly, in-strength and out-strength are well fitted by power-law distributions of exponents around 2.1 and 2, respectively. Despite the fact that a large fraction of nodes is different in each month, the tail exponents are remarkably stable (see Table 6 of the A.3).
This scale free behaviour is quite ubiquitous in complex networks has been found in many other real economic and financial networks [Serrano and Boguná, 2003, Garlaschelli and Loffredo, 2005, Boginski et al., 2005, Boss et al., 2004, Kim et al., 2002, Huang et al., 2009, Ohnishi et al., 2009. The fat-tailed distribution for the degree has two interesting consequences: first, there is no characteristic scale for the average degree or strength; second, there are a few nodes that act as hubs for the system, in the sense that, having a large amount of connections, many pairs of nodes are connected through them. This partially explains the low values for the diameter. Finally, we measure the tendency of firms to be connected to firms which are similar with respect to some attribute, namely the number and the total volume of connections (i.e. degree and strength). Following [Newman, 2002], we compute the assortativity coefficient for a categorical variable, where e ij is the fraction of edges connecting vertices of type i and j, a i = j e ij and b j = i e ij . It is r max = 1 for perfect mixing, while when the network is perfectly disassortative (each node connects to a node of a different type) it is Using the number of connections as categorical variable, an high value for the assortativity coefficient indicates that highly connected firms tend to interact significantly more than average with other highly connected firms. Similar reasoning holds using the volume exchanged as categorical variable.
Beside the entire graph, we also consider the subgraph of firms with rating and the subgraph of customers. The assortativity coefficient is consistently slightly negative for both attributes, for all months and graphs, namely around −0.03 for the entire graph and the subgraph of firms with rating, and −0.04 for the subgraph of customers, with no strong differences among months and attributes. Table 7 of A reports the summary of values of the assortativity coefficient for each month. A possible explanation can be that large, very interconnected firms are connected to many subsidiaries which in turn do not engage with many other firms, being their business almost exclusively focussed on the relationship with the large and central firms.
To summarise, each month the payment network of firms is very sparse but almost entirely connected. Half of the firms appear in the network as payers only (no incoming links) and they are mainly unclassified with respect to customer status, so no much information is available on them. Of the remaining nodes, almost half constitutes the denser core of the network where more than a half of the transactions occur and above 60% of the volume circulates. Finally the network is small world, scale free, and slightly disassortative both for degree and for strength.

Risk distribution and network topology
In this Section we investigate the distribution of risk of firms in the network of payments. We are interested in measuring the dependence between the network property of a node or a group of nodes and the risk of the firm represented by the node(s). We proceed in a bottom-up fashion, zooming out from single nodes to subsets. At first we consider a firm's local property (the number of connections) and we check if it correlates with the risk. Then we consider pairs of linked firms and measure the homophily in risk, i.e. whether firms with similar risk profile tend to do business together and thus to be linked. Finally, we divide firms into subsets induced by the network structure and we check whether the inferred subsets are informative with respect to the riskiness of the composing firms. Specifically, we partition the network in groups (or communities) of firms by using only network information, and we test if the distribution of risk within each group is statistically different from the global one. Thus the goal is to understand if the inferred communities are homogeneous with respect to the risk profile of the composing firms: a community with many firms with high risk rating is a clear indication of financial fragility and a possible source of instability, since the distress of one or few firms of the community is likely to propagate to the other firms.
For the sake of brevity, in the following the analysis is carried on for one month, but results are consistent for all the months.

Degree and risk
The first investigation is on the relation between the degree of a firm and its risk. The probability for each risk level r ∈ L, M, H conditional to the out-degree is computed 2 and plotted against the degree. The results are shown in Figure 2. We notice an interesting correlation between degree and risk: small degree nodes are more likely medium risk firms, whereas large degree nodes are more likely low risk firms. The high risk firms are more evenly spread across degrees, even if a larger fraction is observed for low degree nodes. To assess if the three curves are statistically different we perform a multinomial logistic regression on data [Greene, 2003] (the solid lines in the plot). This choice is justified by the fact the quantities just described are the probabilities of outcomes in a multi-class problem given an independent variable (the degree). The estimated probabilities follow quite closely the trend of the empirical distribution and the coefficients are all significant. More detailed results of the fit are given in table 8 of B.1 (first two columns).
The correlation just highlighted can be, at least in part, influenced by the effect of the size of the firm (in term of assets value from the balance sheet): a large firm is usually considered less risky than a small one, at the same time, a larger size generally implies a higher number of connections, as seen for example in the interbank network [Bargigli et al., 2015]. As the size of firms is not available to us, we use the sum of the incoming and outgoing amounts as proxy. Defined in this way, the size has a Pearson correlation of around 0.19 , 0.15 with inand out-degree, but a Spearman rank correlation of 0.67 , 0.57 respectively. To control for the effect of the size, we repeat the same procedure on subsets of firms, grouping according to their size into tertiles. The results, not shown here but available upon request, hold true also 2: Probability of rating of a firm conditional to its out-degree. The solid lines show the fitted multinomial logistic distribution, with its confidence intervals (dashed lines) in matching colours.
on the sub-samples, but slightly less sharply. We repeat the multinomial logistic regression adding the size tertiles among the predictors, and we still obtain statistically significant coefficients (last four column in 8 of B.1).
Similarly, the three conditional degree distributions given the rating result statistically different, as for every month all pairs reject the null hypothesis in the 2-sample Kolmogorov-Smirnov test [Smirnov, 1939]. Therefore topological characteristics (the degree) of the node can be used to obtain information on the riskiness of the corresponding firm. From a risk management perspective this is an important results, since on average highly connected nodes are also less risky.

Assortative mixing of risk
The next step is to check whether risk is correlated with direct connection preferences. To clarify this point, we consider two features: the assortative mixing of the risk and the conditional distribution of rating given the distance.
In the first case we compute a weighted variant of the assortativity coefficient in Eq. (1) using as categorical variable the risk rating rather than the degree or the strength. In practice, the quantities e ij are substituted byẽ ij , the fraction of volume from nodes of type i to nodes of type j. The reason for this choice is to mitigate the impact of the aforementioned large number of uncategorised payers. In most cases their links are associated with low volume and few transactions. Also, customer firms, even if they represent only around 1/3 of the firms, exhibit a generally more intense activity, both in terms of number of transactions and of volume, hence accounting for the stronger ties between the firms.
The metric is positive for all the three graphs, 0.070, 0.157, 0.163 for the whole set, the nodes with rating, and the customers, respectively, with significant variability across the months but always positive sign 3 . In Table 9 of B.2, the summary of values of the assortativity coefficients for each month is presented.
With the same quantitiesẽ ij we define metrics to assess different preferences in connection between incoming and outgoing payments. We test if firms are more concerned with the risk of payers than of the payees by testing for different risk distribution between incoming and outgoing connection. To discriminate between these two cases, for each node i we compute the percentage excess of volume with respect to the average toward nodes in certain risk class and we group according the rating of the node. The distributions are compared using MannWhitney U test [Mann and Whitney, 1947]. This non-parametric test allow to assess if one distribution is stochastically greater than the other. Details on the metrics and the test performed are given in B.2. We find that it is likely that firms are, at least in part, aware of the riskiness of their counterparts and results suggest they use this information in choosing their business partners. However the hypothesis that incoming payments show a more marked preference for low risk is not supported by data. Moreover the overall positive assortativity is mainly due to low risk nodes. This suggests that low risk firms are more careful in the choice of their business counterparts, possibly also because their relative larger creditworthiness allow them to find available partners more easily.
The quantities considered so far in this section are pairwise comparisons between the rating of nearest neighbours, and give an aggregate measure. A possible 4 way to enrich this information is to consider the conditional distribution of rating for nodes at a given distance 5 and to compare it to the unconditional distribution. In the case of no influence of the rating on the connection pattern, the conditional distribution of risk given the distance should be statistically undistinguishable from the null unconditional distribution. To test if this is the case, we first compute the distance between all the nodes for which the rating is available. Then for any fixed k, the occurrences of ratings are computed by looking at the set of pairs at distance k. Finally, the estimated distributions are tested against the null one with an hyper-geometric test, as explained in details in B.3.
Results for April are summarised in Figure 3, which considers the case when the source node is in class L. Results are similar when considering a medium or high risk source. For each k a marker indicates the percentage of nodes with low (green circles), medium (yellow squares) or high (red diamonds) risk at distance k. A marker is full when the percentage is statistically different from the null distribution (the dashed lines, with matching colours).
We note that up to distance 5 the class of low risk firms is significantly over-represented in the distributions. At greater distances, medium and high risk groups are over-represented. This means that more steps in the networks are necessary to reach riskier firms. This fact is particularly interesting when considering that each firm is in theory unaware of others firms' ratings and in some cases even its own. appears to be crucial. When considering the entire network, the assortativity coefficient is negative, around −0.07, hence indicating a slightly disassortative behaviour with respect to risk. The subgraphs, instead, show an assortative tendency, with coefficients around 0.025 and 0.038 for the nodes with rating and for customers, respectively. This shift can be explained again by the impact of the large number of uncategorised nodes.
4 An alternative strategy to go beyond first order neighbours in the computation of assortativity has been recently proposed by [Arcagni et al., 2017]. 5 The distance between nodes in a network is defined as the length of the shortest directed path connecting two nodes, where a path is a sequence of links. Clearly, in a directed network in general d(u, v) = d(u, v) and moreover d(u, v) can be not defined (or ∞) if there is no path from u to v.
When considering the same quantities for incoming paths, results (not shown) are very similar, namely at short distances the low risk class is over-represented, while medium and high risk nodes are over-represented for longer distances. Figure 3: Distribution of ratings for nodes at distance k from a node with rating L. The dashed lines are the unconditional (null) distribution of ratings among nodes in the entire sample. A full marker indicates that the over or under representation with respect to the null distribution is statistically significant in the hyper-geometric test at 1% significance level with Bonferroni correction.
A possible explanation for these observations is that among the hubs of the systems (i.e the most connected nodes) firms with rating L (i.e the most creditworthy) constitute the vast majority. This holds true when considering both in-coming and out-going links, and including also the nodes with no rating. Moreover, they are in the denser core previously described, while many high risk firms have a few or no out-going links and they are peripheral in network. This asymmetry in the position in the network is observed also when considering the distribution of the closeness centrality [Newman, 2010] of the nodes, i.e the harmonic mean of the distances to all other nodes, conditioning on the risk class (not shown 6 ).

Network organization and risk
In this Section we study the relation between the organization of the network at a more aggregate level and the distribution of risk. We are interested in two types of organization of networks into groups. The first is the modular organization: each module is composed by nodes, which are much more connected among themselves than with the rest of the network. In economic terms modules could represent, for example, firms operating in the same region or area, and the high density of the module reflects the fact that payments are more frequent with geographically close firms. We saw before that the network shows an assortative tendency with respect to risk, so we want to test if the homophily on risk can be observed beyond pairwise relationship.
6 Results available upon requests The second is a hierarchical organization. Since the payment network is directed, we look for a ranked partition (i.e. each group of nodes is labelled with an integer from 1 to the number of groups M ) such that most links are from nodes of low rank classes to nodes in high rank classes. This type of organization could represent, for example, a supply chain and the flow of payments between the firms of a group and those in the group in the next rank class reflects the (opposite) flow of goods or services. This classification is important because a high risk concentration in low class nodes of a strongly hierarchical network can trigger a cascade of distress in the higher rank classes.
Modularity and hierarchy are conceptually opposite as the first penalises connections towards other groups, which instead are encouraged in the latter (provided that they go from low rank to high rank nodes).
For each metric, we proceed in the following way: i. find the optimal partition according to the criterion; ii. compute the distribution of ratings within each subset of the partition; iii. test whether such local distribution is statistically different from the overall distribution of ratings by employing the hypergeometric test used in the previous Section and described in B.3. In order to have a sample large enough to perform the test, we only consider subsets with at least 500 known ratings.
We showed so far that the structure of the payments network is very complex. Since our goal is to obtain information on the risk of the firms, it can be helpful to filter the network before performing communities detection, in order to keep the most relevant connections. Thus we focus on the subgraph of customers. The reasons for this choice are many. First, the percentage of nodes with rating active every month is quite low, around 20%, but it raises to 70% when considering only the customers (see Table 4 in A.1 for a summary). This will help having a more informative local distribution of risk when considering subsets of nodes. Secondly, more than a half of the volume is transferred between customers (see Table  3 in A.1), so even if a large fraction of transactions is dropped, we are mostly pruning weak connections, while keeping the strongest ones. Finally, as it has been shown in the previous Subsection about assortativity, considering the entire network can be misleading, especially when looking at the connections without considering the weights, as it will be necessary for some metrics.

Modular structure
One of the standard methods for inferring a modular structure in a network is via modularity maximization. This method divides nodes into subsets, called modules, such that nodes are well connected with other nodes in the same module and there is a smaller number of links with nodes in other modules. Given a partition P in modules C the modularity is where A ij is the (i, j) element of the adjacency matrix and k in i (k out i ) is the in-(out-) degree of node i. The optimal partition is the one which maximizes modularity. Despite the associated optimisation problem is NP-Hard, fast and reliable heuristics for an approximate solution exist, and here the well known Louvain method [Blondel et al., 2008] is employed.
In each month we find that the optimal partition has around 2, 000 modules. These are really heterogeneous in size: for example, the 13 largest ones cover more than 95% of the nodes of the network. We perform the hypergeometric test of the null hypothesis of an homogeneous distribution of risk in each module with at least 500 known ratings. These are less than 1%, around 19 every month. (see Table 10 in B.3 for more details). These are clearly very large modules but a significant number of them shows an over or under-expression of one or two risk classes.
For some specific module it is possible to draw statistical robust conclusions on its risk profile. The top panel of Fig. 4 shows the over-or under-representation for the largest modules for January. The seventh module, for example, has an over-representation of firms with low risk and an under-representation of the other two risk profiles, thus it represents a group of firms with small risk. On the contrary the eighth module has an over-representation of highly risky firms and under-representation of low risk firms, representing a possible warning for the bank.

Hierarchical organization
We now consider explicitly the directed nature of the payment graph and the hierarchical organization of the network. An ordered partition is such that each subset is associated with an integer number (rank) r ∈ {1, ..., M }. A graph has a hierarchical organization if nodes are more likely linked to other nodes with a higher rank [Simon, 1991], such as in military organizations or in administrative staff. Finding the optimal ordered partition and revealing the hierarchy of a graph is in general complex and requires the minimization of a suitable cost function, similarly to what is done with modularity.
In this paper we use a recently proposed cost function [Gupte et al., 2011]. Given a rank function r : V → {1, ..., M }, the cost function penalizes links from a high rank node to a low rank node. The penalization is a linear function of the difference between the ranks. Thus the optimal hierarchical partition is obtained by solving the optimisation problem where R denotes the set of all ordered partitions and the cost function is The hierarchy of the graph is defined by By definition, h ∈ [0, 1], and 0 is the value for the trivial partition with only one set, while h = 1 is obtained when the network is a Directed Acyclical Graph and it signals a perfect hierarchy. The linear choice of the penalization function is convenient because the associated optimisation is solvable in polynomial time and few exact algorithms exist [Gupte et al., 2011, Tatti, 2017, while non-linear forms can lead to NP-hard problem. We apply the hierarchy detection to the monthly networks of payments and the results are summarized in Table 11 of B.3. First of all we notice that the number of inferred classes, roughly 18, is much lower than in the modular case. Moreover the size of the classes is much more homogeneous. The value of h is also quite stable, around 0.75, indicating a strong hierarchical structure, a remarkable result considering that we are studying only the customers network.
We now consider the distribution of risk in each class and we study the over or underexpression of certain levels of risk as a function of the rank of the class in the inferred hierarchy. The test rejects the null hypothesis of uniform risk distribution a considerable number of times (also compared with the total number of subsets in the partitions). As displayed in the bottom panel of Figure 4, low rank classes have an over-expression of high and medium risk firms, while middle and low rank classes (i.e. r ∈ [8, 12]) have an over expression of low risk firms and an under-expression of medium and high risk firms. More details on the test results are given in This empirical evidence may signal the presence of paths of risk propagation, since low rank firms, typically more risky, are payers of high rank firms, which are instead less risky. We now consider the distribution of risk in each class and we study the over or under-expression of certain levels of risk as a function of the rank of the class in the inferred hierarchy. The test rejects the null hypothesis of uniform risk distribution a considerable number of times (also compared with the total number of subsets in the partitions). As displayed in the bottom panel of Figure 4, low rank classes have an over-expression of high and medium risk firms, while middle and low rank classes (i.e. r ∈ [8, 12]) have an over expression of low risk firms and an under-expression of medium and high risk firms. More details on the test results are given in This empirical evidence may signal the presence of paths of risk propagation, since low rank firms, typically more risky, are payers of high rank firms, which are instead less risky. We now consider the distribution of risk in each class and we study the over or under-expression of certain levels of risk as a function of the rank of the class in the inferred hierarchy. The test rejects the null hypothesis of uniform risk distribution a considerable number of times (also compared with the total number of subsets in the partitions). As displayed in the bottom panel of Figure 4, low rank classes have an over-expression of high and medium risk firms, while middle and low rank classes (i.e. r ∈ [8, 12]) have an over expression of low risk firms and an under-expression of medium and high risk firms. More details on the test results are given in 11 in B.3. This empirical evidence may signal the presence of paths of risk propagation, since low rank firms, typically more risky, are payers of high rank firms, which are instead less risky.

Discussion
Both investigated partitions give interesting insights on the relationship between risk and network structure. On one side, the percentage of rejected tests in the case of modularity partition is consistent with the observed assortativity of risk. It may be noticed that the preference for low risk business partners is not always a realistic option, because in some sectors business partners are not replaceable for geographical reasons. To better assess this point, one possibility could be to include the comparison between modules and geographical location of firms, which is not available to us. On the other side, the hierarchical partition appears to follow the risk distribution slightly better and this is probably related to the peculiar conditional distribution of risk with respect to the distance described in Subsection 2.2. Indeed, given the fact the high risk nodes are over represented for longer distances, they should be located in extreme positions in the ranking, either at the top or at the bottom, and this is what is observed. It must be stressed that in the case of the two methods chosen here, one does not exclude the other, as they give different and complementary standpoints for interpretation. In this sense a multi-dimensional perspective is needed, where the dimensions are the mechanisms that either favour or discourage the creation of business relationships.

Missing rating prediction using payments network data
In the previous Sections we showed that network metrics can be informative of the risk of a firm. It is therefore natural to ask whether it is possible to predict the missing risk rating of a firm by using only information on network characteristics of the corresponding node, as well as risk rating of the neighbour firms. This problem is particularly relevant since we noticed that around 30% of the customers in the dataset do not have a rating and this percentage is even higher when the entire dataset is considered (see table 4 in A.1).
Here we use network characteristics as predictors for the missing ratings into well known methods of machine learning for classification problem. The predictors we employ are the following: i. in-and out-degree; ii. weighted fraction of (in-and out-) neighbours with a given rating (H,M,L or NA) iii. rank of the class in the hierarchy inferred by agony minimisation; iv. membership in community inferred by modularity maximisation; v. sum of in-and out-strength.
The fractions in (ii.) are computed considering the amount (weight) of each payment and are together a measure for rating assortativity, while (v.) is a proxy for the size. Data are preprocessed following [Friedman et al., 2001] so that variables are comparable in order of magnitude, as detailed in Appendix C.1. These transformations result into a total of 25 predictors. The dataset is the one which includes only the customers, and we consider the monthly network for January. In order to assess the performance of the prediction, we train each model using 75% of the data, and the remaining 25% is used for testing.
We consider three methods for classification: i. multinomial logistic; ii. classification trees; iii. neural networks.
See [Friedman et al., 2001] for a review of these methods.
The class H is under-represented in the sample, as it includes only around 10% of the firms with rating. This affects the ability of any classifiers to recover this class. This is undesirable, since the class H the most critical for the riskiness.
To address this issue we proceed with a 2-step classification strategy for all the three methods. The intuition behind this strategy is to train a classifier more specialised in the recovery of one specific class at the first step, and then separate the remaining classes in the second step. In the first step we fix a risk class, say L, and we merge the other two classes into a fictitious class X. We fit a first instance of the chosen model on the modified database. In the second step, we train another instance of the model only on the two previously merged classes. This is repeated for all the three risk classes. In the case of class H being the one selected for step one, we apply SMOTE [Chawla et al., 2002] before training, a well-known algorithm for data rebalancing. 7 .
Once the models are trained, the prediction are obtained by iterating the following two steps for each risk class (see the schematic representation in Figure 5) i. apply the first step classifier; ii. if the entry is classified as X, apply the second step classifier. The final prediction is the median of the predictions. In case of draw, more weight is given when the class is obtained from the first instance (as the classifier is more specialised). For the 2-steps method, the random classifier can be defined in the following way: the null distribution for the first step is obtained for each classifier, by taking into account the fictitious class, and at the second step by considering only the two classes previously merged. Table 2 shows the results for each classifier, together with the value for the same metrics computed for the random classification. In the case of classification trees and neural networks, different combinations for the hyper-parameters have been tested (such as depth for the trees, and number and size of hidden layers for neural networks), here we present the results for the best choice for each model, and in Supplementary Information we explain the selecting procedure.
The three models behave quite similarly, with slightly better overall performance of neural networks, and the training times are comparable. It must be noted that, among the predictors only network deduced metrics have been included, while any data from the balance sheet, which is likely to represent the main source for the risk rating model, as well as the sector or geographic location, are excluded. When adding the economic sector, which is the only metadata available to us, as further predictor the prediction power only slightly improves to from 49%− 50% to around 52% of accuracy for both classification trees and neural networks. The natural benchmark models are the random classifiers, both 1-step and 2-steps, due to the total lack of data employed in the proprietary rating model. We are able to outperform the first by 30% to 38%, and the latter by 15% to 22% in term of accuracy, and especially in the case of neural network, we are able to find a good compromise with recall for H.

Conclusions
In this paper we empirically study the interactions and the risk distribution of 2 million Italian firms, via the investigation of payments networks built from transactional data. Our contribution is threefold. On one side, the empirical study of the relationship between the high number of firms to our knowledge has not been done before, especially with this granularity. The study of the structure of the network highlights a complex interdependence between firms; indeed particularly interesting is the presence of a relatively small core of firms, which are involved in most transactions. This feature, paired with the power-law tail distribution of the number of connections and the total volume exchanged by the firms, can be a symptom of an architecture which favours the spread of distress, or positive feedbacks. Also relevant is the observed tendency of large, well-connected firms to be connected to small (in terms of exchanged volume), poorly connected firms. This can be the result of almost exclusive relationships between a big producer and its subsidiaries.
The second and main contribution is the assessment of the correlation between the network structure and the distribution of risk. From our analysis, we can conclude that the risk level of a firm is related to its features and role in the network at different levels. For single firms, we observed that low risk firms are more likely to have a high number of connections, and some of them acts as hubs for the entire network, being connected to thousands of other firms. When pairs of linked firms are considered, we observed the tendency to favour connections towards firms with the same risk level. This tendency can be observed also on a more aggregate level. Indeed, we found that also groups of firms which are more connected among them that with the rest of the network, have a local distribution of risk which is statistically different from the global one, meaning that some risk classes are over-or under-represented. Finally, we divided firms into a hierarchical organisation, in such a way to highlight the main direction along which money circulates. This simplified structure showed once more that many levels of the hierarchy have a local distribution of risk statistically different from the global one. As high risk firms are over-represented at the beginning of the flow of money, this can be a source of distress for the entire system.
Finally, we showed that network metrics and communities can be successfully used to predict the missing ratings with machine learning models. We propose a simple 2-steps strategy to compromise between overall accuracy and recall on the smallest but most risky class. We test our strategy with three methods, namely multinomial logistic, classification trees and neural networks. Since predictors are all network-derived quantities, and no information from balance sheets or other meta-data are used, the random rating assignment is the natural benchmark. We find that all the three methods are able to outperform sizeably the benchmark, with slightly better results for neural networks.  In order to give an intuition of different behaviours, two quantities can be considered. The first is the persistence of links and nodes, which is measured by counting the number of times a node or an edge appears in the networks for different time aggregations. From Figure 6 one can see that most of nodes are active only for few days, while a small core of firms is intensely active through the whole year. Secondly, the size of the networks, both in terms of number of nodes and links, for different time aggregations is shown in Figure 7. Interestingly, for daily aggregation, see Left panel, both quantities show a high periodicity, with a very high peak (a factor ∼ 5 with respect to the other days) at the end of each month. This effect is evident also with weekly aggregation, see (

A.3 Network metrics
A network's component is a subsets of nodes such that there is path between any pair of nodes, either undirected (weakly connected components), or directed (strongly connected components). From the definition of the networks it is clear that there are no isolated nodes, since the smallest weak components include at least two nodes, namely a payer and a payee. As it is common for many other real networks, it is possible to identify a weak component, which is of the order of magnitude of the entire network. In our case this giant component (GC) includes on average 98% of the nodes. Considering instead the largest strongly connected component (SCC), it includes approximately 20% of the nodes but more than half of the links. As a consequence the density of the strongly connected component is an order of magnitude larger than the density of the whole network or of the weakly connected component. See Table 5 in for more details of these quantities.
In the standard definition of the bow-tie structure of a network, the nodes in the GC but outside the strongly connected component are divided between the in-component, the nodes from which links arrive in the strongly connected component, and the out-component, the nodes reachable from the SCC. Nodes in the in-component that have no incoming links, represent each month about one half of the active firms and their activity is sporadic. Table 5: Percentage size (%n) and density (ρ) of the largest weakly (GC) and strongly (SCC) connected components. The last column (%w) contains the relative volume transferred among nodes in the SCC with respect to the total volume. GC SCC month %n ρ %n ρ = m n(n−1) %w Jan 0.989 3.4 · 10 −6 0.232 3.29 · 10 −5 0.75 Feb 0.989 3.21 · 10 −6 0.237 2.99 · 10 −5 0.69 Mar 0.980 3.15 · 10 −6 0.235 2.98 · 10 −5 0.70 Apr 0.980 3.16 · 10 −6 0.231 3.09 · 10 −5 0.67 May 0.981 3.16 · 10 −6 0.232 3.06 · 10 −5 0.69 Jun 0.980 3.11 · 10 −6 0.230 3.03 · 10 −5 0.69 Jul 0.982 3.05 · 10 −6 0.237 2.88 · 10 −5 0.70 Aug 0.970 3.08 · 10 −6 0.204 3.23 · 10 −5 0.69 Sep 0.981 3.31 · 10 −6 0.233 3.23 · 10 −5 0.73 Oct 0.981 3.00 · 10 −6 0.237 2.81 · 10 −5 0.68 Nov 0.979 3.08 · 10 −6 0.227 3.00 · 10 −5 0.65 Dec 0.979 2.81 · 10 −6 0.228 2.69 · 10 −5 0.67 Table 6: Results of power law fit of the degree and strength distribution for all the months obtained by using the algorithm described in [Clauset et al., 2009]. The α parameter is the fitted exponent and the k min and w min parameter is the estimated minimum value after which the behaviour of the distribution is consistent with a power law tail. Since the volume of payments are scaled, the values of w min s are not much informative, so for the strength  The multinomial logistic regression aims to model the probabilities for a classification problem with more than two outcomes. Here we treat the responses (L, M, H) as categorical and ordered. In practice this means to find parameters that best fit the model . X i are the predictors, a and b i · are the coefficients. We consider the cases p = 1, where the predictor is the degree X 1 = k, and the case p = 2 where also the size is used as predictor X 2 = s. In the following table 8, the b coefficients are shown, together with an indication for the statistical significance. Table 8: Coefficients for multinomial logistic regression. The first two columns refer to the regression with the degree as only predictor. The last four columns refer to the regression with also the size as predictor. The superscript indicates the predictors: k for the degree,  Table 9: Assortativity coefficient for risk rating. The columns with rating refers to the subgraph of nodes with known rating. The columns customers refers to the subgraph of nodes with customer status yes. In the last two columns, the metric for assortativity is modified in order to take into account weights, specifically e ij is computed as the fraction of volume, not the number of edges (see main text for more details The notation is consistent with the definition in (1): r(i) is the risk of node i;ã X ,b X are the percentage volume from or to nodes with rating X for the whole network, w i (X)) is the percentage of the volume from (to) node i to (from) nodes of rating X. Samples are obtained by grouping nodes by rating, for a total of 18(= (3 ratings) 2 · 2 directions) distributions. For example, the distribution of excess percentage volume from L towards M is given by Similarly, the excess percentage volume entering M from H is given by Note that in general, F We perform two set of test. In the first case we fix one rating and we compare outand in-excess percentage volume with respect to a certain rating. In all the cases the null hypothesis is rejected with very low p-values, however it is not straightforward to give an economic interpretation of the overall results: for all the rating, the excess percentage toward L is greater that the analogous for incoming volume, while the opposite holds for payments to and from H. In the second set of test we fix a rating and a direction (in or out), and we compare the excess percentage volume from (or to) all the ratings. Also in this case all the tests reject the null with very low p-values, so we are able to order the distributions and evaluate the preference in connection. For the outgoing volume, rating L is preferred to the more risky ones in all the case. Payments to nodes rated M follows in preference from nodes having risk M and H, but are last in order for nodes having rating L. For incoming payments, the situation is slightly different. Rating M is preferred by nodes rated M and H, and it is followed by L. While the preference is reversed for payments from nodes rated L.

B.3 Test for risk distribution within a community
The statistical test employed in the main text has the purpose to assess whether a given rating is under-or over-represented in a certain subset, obtained by one of the partitioning methods described in the paper. In general, this means to test if the distribution of ratings in a single subset is statistically different from the unconditional distribution obtained considering the entire sample. To do so, one computes the p-value representing the probability to observe a given number of ratings in each community under the null hypothesis of that ratings are distributed in the community as in the whole sample. As shown in [Tumminello et al., 2011] the probability under the null is the hyper-geometric distribution. Moreover, since for each community multiple tests (one for each rating and community) are performed, a correction for the p-value for multiple hypothesis testing is used. In particular, the Bonferroni correction is chosen, i.e. fixed a threshold p s for the p-value, the corrected threshold is given by ps Nr , where N r is the number of tests. The threshold of is fixes at p s = 1% before correction.
Specifically, given a partition {C i } i the following quantities are computed k x,i = #{nodes in C i with rating x} n i = #{nodes in C i } K x = #{nodes with rating x} N = #{nodes} and the p-value is given by Note that {K x } and N are computed in the specific monthly network under consideration.
In the case of the distribution conditioned on the distance, the subsets are obtained by considering pairs of nodes. For example, the fraction of nodes with rating L at distance k from H is computed as p (k) HL = |{(i, j) : d(i, j) = k, i ∈ H, j ∈ L}| |{(i, j) : d(i, j) = k, i ∈ H}| .
The partitions resulting from the other methods are very different in terms of number and size of subsets, so to make tests comparable, only communities including at least 500 nodes with known rating. In the cases of modularity, subsets are ordered by descending size. Note that, since each month the set of active nodes and the labelling of subsets changes, one cannot easily compare the behaviour of a subsets across months.
Tables 10, 11, present a summary of the tests, recording for each month and risk class the number of times the null hypothesis has been rejected, separated in over-(+) and under-(−) representation. The last two columns contain the number classes respectively tested, and in total (nC).   For classification trees, the hyper-parameter of interest is the depth, i.e the maximum number of condition to be satisfied for classification (or the length of the longest path from root to leaves). A higher value for depth results in lower training error but may lead to over-fitting. We considered value of depth from 3 to 10. For the 1-step model, the tree with depth 6 resulted the best choice, while for the 2-steps, the best results have been attained with a depth of 9 for the first step tree and 5 for the second. For neural networks, the hyper-parameters of interest are the number and size of hidden layers. As before, increasing too much these values may lead to over-fitting. In order to avoid extremely high number of parameters when adding layers, we consistently reduce their size as their number increases (intuitively, the number of parameter grows as i |l i |, where |l i | is the size of the ith layer). For example, in the case of 1 (hidden) layer the number of nodes is between 10 and 100, while for two layers, it goes from 5 each to 10 each. For the 1-step model the best results are obtained with 1 layer of 50 nodes, while for the 2-steps the best choice is 2 layers of 5 nodes each for the first step and 1 layer of 10 nodes for the second.