Comparative analysis of layered structures in empirical investor networks and cellphone communication networks

Empirical investor networks (EIN) proposed by \cite{Ozsoylev-Walden-Yavuz-Bildik-2014-RFS} are assumed to capture the paths along which information spreads among investors. Here, we perform a comparative analysis between the EIN and cellphone communication networks (CN) to test whether the EIN is an information-exchanging network from the perspective of the layer structures of ego networks. We employ two clustering algorithms (the $k$-means algorithm and the $H/T$ break algorithm) to detect the layer structures for each node in both networks. We find that the nodes in both networks can be clustered into two groups. One group has a layer structure similar to the theoretical Dunbar Circle, in which the alters in ego networks exhibit a four-layer hierarchical structure with cumulative numbers of 5, 15, 50 and 150 from the inner layer to the outer layer. The other group has an additional inner layer with about 2 alters compared with the Dunbar Circle. We also find that the scale ratios, which are estimated based on the unique parameter in the theoretical model of layer structures \citep{Tamarit-Cuesta-Dunbar-Sanchez-2018-PNAS}, conform to a log-normal distribution for both networks. Our results not only deepen our understanding of the topological structures of the EIN, but also provide empirical evidence of the channels of information diffusion among investors.


Introduction
Ozsoylev et al. (2014) proposed the empirical investor network (EIN) as a novel representation of the interactions between investors, based on their order placements: two investors are said to be connected if they placed the same type (ask or bid) of orders within a short time window (usually 30 seconds). The underlying hypothesis behind the EIN is that, when new information arrives, it spreads from the source nodes to the peripheral nodes in the investor social network, and the time lags with which the information reaches different investors determine the lags between their order placements. Therefore, the EIN can be regarded as a proxy of the investor social network. We propose to check the validity of the EIN construction by studying some of its properties, such as its layer or hierarchical structures. As a reference and for comparison, we also test the hierarchical structures present in cellphone communication networks (CN), which are usually considered to be information-spreading networks. Our finding of similar layer structures in the EIN and CN gives credence to the hypothesis that the EIN uncovers a significant part of the information-spreading paths between investors.
The present work is related to the research on Dunbar's number and its generalised discrete hierarchical structure in social networks. Recall that Dunbar's number of about 150 represents the average size of the personal ego network, i.e., the group of people with whom one can typically maintain stable social relationships given cognitive limits (Dunbar, 1992, 1993). Furthermore, the social relations in human and animal networks have been found to form layer structures, with each layer representing a different degree of emotional closeness (Dunbar, 1998; Dunbar and Shultz, 2007). These layer structures have approximately the configuration of 3-5, 10-15, 30-50, and 100-200 alters from the inner layer to the outer layer (Zhou et al., 2005). Many empirical ego networks are found to exhibit such layer structures, including the network abstracted from the exchange of Christmas cards (Hill and Dunbar, 2003), hunter-gatherer social networks (Hamilton et al., 2007; Zhou et al., 2005), and online societies in virtual worlds (Fuchs et al., 2014).
Another strand of literature relevant to our work uses cellphone and internet communication data, which make it possible to test classical social theories empirically on large populations of individuals. For example, the weak tie theory (Granovetter, 1973) has been validated for cellphone communication networks (Onnela et al., 2007; Kovanen et al., 2013). Such data have also been used to verify the hierarchical layer structures in social networks (Saramäki et al., 2014). Arnaboldi et al. (2016) found that the co-author networks in academic fields also have discrete hierarchical structures. By scanning the online social networks of Facebook and Twitter, Dunbar et al. (2015) found that the ego networks exhibit a limited size and hierarchical structures. More importantly, such a layer structure can be considered as a "social fingerprint" of a specific individual, because it is stable and not affected by the change of friends (Tamarit et al., 2018).
This paper is organized as follows. Data and methods are given in Sec. 2. Sec. 3 presents the results on the degree distribution, clustering, and theoretical model fits. Sec. 4 concludes.

Empirical investor networks
Our empirical investor networks (EIN) are constructed from the order flows of 100 stocks included in the Shenzhen 100 index (399004). The order flow data span the whole year of 2013. Following Ozsoylev et al. (2014), on each trading day, the EIN is obtained by connecting investors if they submit at least 3 buy (or sell) orders for the same stocks within 30 seconds. By aggregating the daily EINs, we obtain the annual EIN, which contains 381,345 nodes and 8,143,541 links. Ozsoylev et al. (2014) argued that the links in the EIN may reflect the potential channels of information diffusion among investors, which could reveal the existence of localized structures in the social networks formed by investors. Thus, the more often a link occurs between two investors, the higher the probability that a social connection exists between them. We further employ a statistically validated method (Tumminello et al., 2011a, 2012; Li et al., 2014; Hatzopoulos et al., 2015; Curme et al., 2015; Gualdi et al., 2016) to check whether two investors are connected merely by chance, which provides us with the statistically validated empirical investor networks, abbreviated as SVEIN.
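The daily construction rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used for the data: the record layout `(investor, stock, side, time_in_seconds)`, the function name `build_daily_ein`, and the reading of the rule as "at least 3 same-side order pairs on the same stock placed within 30 seconds of each other" are all hypothetical choices made here.

```python
from collections import defaultdict


def build_daily_ein(orders, window=30.0, min_cooccurrences=3):
    """Sketch of a one-day investor network.

    orders: iterable of (investor, stock, side, t) with t in seconds.
    Returns {(a, b): count} for investor pairs (a < b) whose same-side
    orders on the same stock fell within `window` seconds at least
    `min_cooccurrences` times (an assumed reading of the rule).
    """
    # Group orders by (stock, side) so that only like orders are compared.
    by_book = defaultdict(list)
    for investor, stock, side, t in orders:
        by_book[(stock, side)].append((t, investor))

    counts = defaultdict(int)
    for events in by_book.values():
        events.sort()
        start = 0
        for i, (t_i, inv_i) in enumerate(events):
            # Slide the window start forward until it is within `window` of t_i.
            while events[start][0] < t_i - window:
                start += 1
            for _, inv_j in events[start:i]:
                if inv_j != inv_i:
                    pair = (inv_i, inv_j) if inv_i < inv_j else (inv_j, inv_i)
                    counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_cooccurrences}


# Toy day: A and B place buy orders close together three times.
toy_orders = [
    ("A", "s1", "buy", 0), ("B", "s1", "buy", 10),
    ("A", "s1", "buy", 100), ("B", "s1", "buy", 120),
    ("A", "s1", "buy", 200), ("B", "s1", "buy", 229),
    ("C", "s1", "buy", 500),
    ("A", "s1", "sell", 0), ("C", "s1", "sell", 5),
]
daily_links = build_daily_ein(toy_orders)
```

In the toy data only the pair (A, B) reaches the threshold; the single A-C co-occurrence on the sell side is dropped. Summing such daily dictionaries over all trading days would give the link weights of an annual network.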

Cellphone communication network
The cellphone call records, obtained from one Chinese cellphone operator, cover the periods from June 28th to July 24th and from October 1st to December 31st in 2010. After excluding October 12th, November 5th, 6th, 13th, 21st and 27th, and December 6th, 8th, 21st and 22nd, on which the data were missing, we have a total of 109 days. The data contain 91,911,735 cellphone users and 4,599,472,652 calls. As we cannot access the call records of the other cellphone operators, only the call records in which both subscribers belong to the data provider are included in our analysis, which leaves 1,173,501,607 records. Since the frequency of calls is known to reflect the intimacy between friends, we assume that the higher the communication frequency between two cellphone users, the stronger their intimacy. We exclude the users who are identified as robots, telecom frauds and telephone sales. Finally, we build cellphone communication networks based on the reciprocal calls between normal users. The statistically validated method mentioned above is also employed to remove the random calls, thus providing us with the statistically validated cellphone communication networks, abbreviated as SVCN.

Statistically validated method
As is well known, the EIN and CN contain a great deal of noise: for instance, two investors may submit orders at the same time by pure coincidence, and callers may dial wrong numbers. This suggests removing such irrelevant signals by testing whether two nodes are randomly connected. For this, we employ a statistically validated method, proposed by Tumminello et al. (2011a) and used in different systems (Tumminello et al., 2012; Li et al., 2014; Hatzopoulos et al., 2015; Curme et al., 2015; Gualdi et al., 2016), to extract the links that are not randomly generated.
For two given nodes $i$ and $j$, the purpose of the statistical validation is to check whether $i$ preferentially connects to $j$. We take the EIN as an example to illustrate the method. Let us denote by $N$ the total number of transactions between investors in the EIN, by $N_{ic}$ the number of transactions initiated by investor $i$, by $N_{jr}$ the number of transactions matched by investor $j$, and by $X = N_{icjr}$ the number of transactions initiated by investor $i$ and matched by investor $j$. We can then calculate the probability of observing $X$ co-occurrences via the following hypergeometric equation (Tumminello et al., 2011a,b),
\begin{equation}
H(X \mid N, N_{ic}, N_{jr}) = \frac{C_{N_{ic}}^{X} \, C_{N-N_{ic}}^{N_{jr}-X}}{C_{N}^{N_{jr}}},
\end{equation}
where $C_{N_{ic}}^{X}$ is a binomial coefficient. We can also estimate the $p$-value associated with the observed $N_{icjr}$ as follows:
\begin{equation}
p(N_{icjr}) = 1 - \sum_{X=0}^{N_{icjr}-1} H(X \mid N, N_{ic}, N_{jr}).
\end{equation}
For the EIN, we need to perform 2 × 8,143,541 = 16,287,082 tests. The corresponding Bonferroni correction for our multiple hypothesis testing is $p_b = 0.01/N_E$, where $N_E = N(N-1)/2$ is the maximal possible number of edges. If the estimated $p(N_{icjr})$ is less than $p_b$, we infer that investor $i$ preferentially connects to investor $j$. Otherwise, we conclude that the edge pointing from $i$ to $j$ is randomly generated. For a given edge between node $i$ and node $j$ in the CN, we estimate in a similar way the $p$-value for the number of calls $N_{jcir}$ initiated by $j$ and received by $i$. We need to conduct 2 × 296,928,030 = 593,856,060 tests, and the Bonferroni correction is again set as $p_b = 0.01/N_E$. When $p(N_{icjr})$ is less than $p_b$, this suggests that individual $i$ preferentially calls individual $j$. Only when the two conditions that (1) $i$ preferentially calls $j$ and (2) $j$ preferentially calls $i$ are simultaneously satisfied do we conclude that the edge between $i$ and $j$ is significant.
Fig. 1 illustrates the layer structure of a typical ego network. The ego in the center is surrounded by the alters, who have direct connections with the ego. The alters usually form a layer structure, in which their emotional closeness decreases from the inner layer to the outer layer. The theoretical Dunbar Circle corresponds to a four-layer hierarchical structure with cumulative numbers of 5, 15, 50, and 150 from inside to outside.
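The link-validation test of this section can be illustrated with a short, standard-library-only Python sketch. It assumes the hypergeometric null model commonly used with the method of Tumminello et al. (2011a); the function names and the `n_tests` argument (the denominator of the Bonferroni threshold) are hypothetical and chosen here for illustration. The upper tail is summed directly, which is equivalent to one minus the lower tail but numerically safer for very small p-values.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k); -inf if k is out of range."""
    if k < 0 or k > n:
        return float("-inf")
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_pmf(x, n_total, n_i, n_j):
    """Probability of exactly x co-occurrences under the hypergeometric null."""
    log_p = (log_binom(n_i, x)
             + log_binom(n_total - n_i, n_j - x)
             - log_binom(n_total, n_j))
    return math.exp(log_p)


def link_p_value(x_obs, n_total, n_i, n_j):
    """P(X >= x_obs): chance of seeing at least x_obs co-occurrences at random."""
    upper = min(n_i, n_j)
    return sum(hypergeom_pmf(x, n_total, n_i, n_j)
               for x in range(x_obs, upper + 1))


def is_validated(x_obs, n_total, n_i, n_j, n_tests, alpha=0.01):
    """Bonferroni-corrected decision: keep the link if p < alpha / n_tests."""
    return link_p_value(x_obs, n_total, n_i, n_j) < alpha / n_tests


# Toy numbers (not from the data): 15 co-occurrences out of 20 and 20 events
# in a pool of 1000 is far beyond chance, even after a strict correction.
strong_link = is_validated(15, 1000, 20, 20, n_tests=10**6)
```

For an undirected network such as the SVCN, the same test would be applied in both directions and the edge kept only if both directed tests pass.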

Clustering method
We employ two clustering algorithms, namely the k-means algorithm and the head-to-tail (H/T) break algorithm (Jiang, 2013), to detect the layer structures of the ego networks in the SVEIN and SVCN based on the activity frequencies on links. The k-means algorithm is implemented with the R package CKmeans.1d.dp (Wang and Song, 2011). The optimal number of clusters is determined according to the BIC. In the H/T break algorithm, the data are split into two parts at the data mean $m_1$, and the head part, in which all values are larger than $m_1$, is further separated into two parts at the head mean $m_2$. This process iterates until the head is no longer heavy-tailed. The H/T break algorithm was proposed to cluster data with a heavy-tailed distribution, which is the case for the link weights in the SVEIN and SVCN.
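The H/T break iteration described above can be sketched as follows. The stopping rule is an assumption: the text only says that the iteration ends when the head is no longer heavy-tailed, and this sketch uses the common convention that the head must remain a minority (at most 40% of the current data). The function names and the `head_limit` parameter are hypothetical.

```python
def head_tail_breaks(values, head_limit=0.4):
    """Return the successive means m1 < m2 < ... used as class boundaries.

    Assumed stopping rule: stop when the head (values above the current mean)
    has at most one element or exceeds `head_limit` of the current data.
    """
    breaks = []
    data = list(values)
    while data:
        mean = sum(data) / len(data)
        breaks.append(mean)
        head = [v for v in data if v > mean]
        if len(head) <= 1 or len(head) / len(data) > head_limit:
            break
        data = head  # recurse into the head only
    return breaks


def assign_classes(values, breaks):
    """Class index = number of break values strictly exceeded (0 = tail)."""
    return [sum(v > b for b in breaks) for v in values]


# Toy link weights with a heavy tail (not from the data).
weights = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 5, 5, 20, 100]
ht_breaks = head_tail_breaks(weights)
ht_classes = assign_classes(weights, ht_breaks)
```

With link activity as the weight, the highest class index would correspond to the most active links, which are interpreted here as the innermost layer.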

Figure 1: Illustration of the theoretical Dunbar Circle in ego networks. The square in the center represents the ego and the circles around it are the alters, who have direct connections with the ego. The circle size is proportional to the emotional closeness between the alters and the ego. According to the emotional closeness, the alters form a hierarchical structure with different layers, in which their closeness to the ego decreases from the inner layer to the outer layer. The theoretical Dunbar Circle corresponds to a four-layer hierarchical structure with cumulative numbers of 5, 15, 50, and 150 from inside to outside.

Degree distribution
We first report the descriptive statistics of both filtered networks. As reported in Panel A of Table 1, in the SVEIN, 2.23%, 6.39%, and 91.37% of the total number of users (about 21,806 users) have degrees in the ranges k > 100, 50 < k ≤ 100, and k < 50, respectively. Their average degrees and standard deviations are 142.9 and 38.5, 68.8 and 13.9, and 10.0 and 11.8, leading to coefficients of variation (standard deviation/mean) of 26.95%, 20.22%, and 117.95%. Their average weighted degrees and standard deviations are 18487.1 and 10984.6, 5504.3 and 2935.4, and 477.0 and 1134.
In Panel B of Table 1, we find that the numbers of users in the SVCN with degree k > 100, 50 < k ≤ 100, and k < 50 are 60,748, 177,076, and 3,930,604, accounting for 1.46%, 4.25%, and 94.29% of the users, respectively. The corresponding average degrees and standard deviations are 142.2 and 45.8, 69.4 and 13.7, and 8.1 and 10, resulting in coefficients of variation of 32.23%, 19.79%, and 124.08%. Their average weighted degrees and standard deviations are 1544.7 and 775, 780.3 and 410.9, and 92.1 and 161.7. The absolute number of nodes with k > 100 in the SVEIN is much smaller than that in the SVCN, but the relative numbers are very close to each other. According to the descriptive statistics, both filtered networks exhibit great similarities in their degree distributions.
We further fit the empirical degree and weighted degree distributions of the SVEIN and SVCN with four distributions: the power-law, the normal, the exponential, and the log-normal distribution. The parameters of these distributions are obtained by Maximum Likelihood Estimation (MLE). The results are listed in Table 2. Kolmogorov-Smirnov (KS) tests are also conducted to check whether the (weighted) degrees are drawn from the four distributions. The null hypothesis is that the data set follows one of the four distributions. One finds that, for both networks, the samples of the degree with k > 0 and of the weighted degree with k > 0 and k > 100 conform precisely to none of the four distributions. This is not surprising, given the large sizes of our data sets, for which even slight deviations lead to a rejection of the null hypothesis. However, we can still compare the goodness of fit of the four distributions using the Akaike information criterion (AIC) listed in Table 2. Except for the sample with k > 100 in the SVEIN, the log-normal distribution has the smallest AIC value. Thus, among the four distributions, the log-normal distribution fits the empirical degree distributions best.
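The MLE-plus-AIC comparison described above can be illustrated with a small Python sketch that uses closed-form maximum likelihood estimates. Several choices are assumptions made for illustration and may differ from the parameterizations used for Table 2: all four models are treated as continuous densities, the exponential is supported on x > 0, and the power law has its lower cutoff fixed at the sample minimum. The function name `compare_fits` is hypothetical.

```python
import math
import random


def aic(log_likelihood, n_params):
    """Akaike information criterion: smaller values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


def _gaussian_loglik(values):
    """Maximized Gaussian log-likelihood (variance estimated with divisor n)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def compare_fits(x):
    """Return {model name: AIC} for positive data x (assumed parameterizations)."""
    n = len(x)
    logs = [math.log(v) for v in x]

    lam = n / sum(x)                                  # exponential rate MLE
    ll_exp = n * math.log(lam) - lam * sum(x)

    ll_norm = _gaussian_loglik(x)                     # normal MLE

    # Log-normal: Gaussian likelihood of ln(x) plus the Jacobian term -sum(ln x).
    ll_logn = _gaussian_loglik(logs) - sum(logs)

    # Continuous power law p(x) = (alpha-1)/x_min * (x/x_min)^(-alpha), x >= x_min.
    x_min = min(x)
    s = sum(lv - math.log(x_min) for lv in logs)
    alpha = 1 + n / s
    ll_pl = n * math.log((alpha - 1) / x_min) - alpha * s

    return {
        "power-law": aic(ll_pl, 1),
        "normal": aic(ll_norm, 2),
        "exponential": aic(ll_exp, 1),
        "log-normal": aic(ll_logn, 2),
    }


# Synthetic check on log-normal data (not the paper's data).
rng = random.Random(0)
sample = [rng.lognormvariate(2.0, 1.0) for _ in range(5000)]
aics = compare_fits(sample)
best_model = min(aics, key=aics.get)
```

On the synthetic sample the log-normal model attains the smallest AIC, which mirrors the model-selection logic applied to the empirical degrees.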
The results of Table 2 strongly suggest that the correct distribution of degrees is a mixture of at least two log-normal distributions, one for small k and one for large k. Roughly, we can find a threshold $k_H$ such that the degrees less than $k_H$ are fitted by the left truncated log-normal distribution and the degrees greater than $k_H$ are fitted by the right truncated log-normal distribution. Following Wu et al. (2010) and Jiang et al. (2016), the threshold $k_H$ can be estimated by minimizing the residual between the fitted and empirical distributions, where $K_{\rm fit}$ and $K_{\rm emp}$ represent the fitted distribution and the empirical distribution, the superscripts $s$ and $l$ stand for the samples less than and greater than the threshold $k_H$, and $n$ is the sample size. The parameters of both truncated distributions are determined through Maximum Likelihood Estimation (MLE). We find that the optimal thresholds are 152 and 48 for the SVEIN and SVCN, respectively. The corresponding right-truncated and left-truncated degree distributions are plotted in Fig. 2 (c-f) for the SVEIN and SVCN. The solid lines in each panel represent the best fits to the truncated log-normal distributions. For the weighted degrees of both networks, we perform the same analysis and illustrate the results in Fig. 3. The optimal thresholds are 374 and 653 for the weighted degrees of the SVEIN and SVCN, respectively. One can see that the empirical distributions agree well with the fitted distributions in Figs. 2 and 3, which supports the conclusion that the (weighted) degrees of both networks conform to a mixed log-normal distribution.
As is well known, the log-normal distribution plays an important role in describing natural phenomena in which growth processes are driven by the accumulation of many small percentage changes (growth rates), which are additive on the logarithmic scale. If each percentage change is small enough, their sum on the logarithmic scale tends to be normally distributed according to the central limit theorem, which means that the growing quantity follows a log-normal distribution on the linear scale. One intriguing feature of the log-normal distribution is that the growth rate is independent of the size. According to the log-normal degree distributions, one can infer that the growth rate of one's "friends" should be independent of one's current number of "friends" in the SVEIN and SVCN.
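The central-limit argument above can be written out in a few lines. The notation ($X_t$ for the size after step $t$, $\varepsilon_t$ for the growth rate at step $t$) is introduced here for illustration only, and the sketch assumes independent growth rates with finite variance that do not depend on the current size.

```latex
\begin{align}
X_T &= X_0 \prod_{t=1}^{T} \left(1 + \varepsilon_t\right), \\
\ln X_T &= \ln X_0 + \sum_{t=1}^{T} \ln\left(1 + \varepsilon_t\right)
        \approx \ln X_0 + \sum_{t=1}^{T} \varepsilon_t ,
        \qquad |\varepsilon_t| \ll 1 .
\end{align}
```

Under these assumptions the sum on the right-hand side is approximately normal for large $T$, so $\ln X_T$ is approximately normal and $X_T$ is approximately log-normal. The independence of $\varepsilon_t$ from $X_{t-1}$ is exactly the size-independence of the growth rate mentioned in the text.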

Clusters
The layer structures in ego networks are usually determined based on the emotional closeness on links. Here, we cannot measure the emotional closeness directly. As an alternative, we employ the number of order placements in the EIN and the number of calls in the CN as a proxy for the emotional closeness on links. For a given node with $n$ links, we first normalize the number of order placements (resp. the number of calls) $W_i$ ($i = 1, 2, 3, \cdots, n$) on each link to obtain the normalized weight $\hat{W}_i$. The presence of natural breaks (associated with network layers) should then be reflected in the existence of sharp peaks in the distributions of $\hat{W}_i$. We thus plot the distribution of the normalized weights $\hat{W}_i$ in Fig. 4 for both networks. As shown in Fig. 4 (a), no break can be observed for the SVEIN. A possible explanation is that the data sample of the SVEIN is too small. In contrast, there is a significant peak at around 0.1 for the SVCN, as illustrated in Fig. 4 (b), which corresponds to the natural break $w_i \approx 0.1 = 15/150$, i.e. the second layer at 15 of Dunbar's discrete hierarchy. In the following, we use the clustering algorithms (k-means and H/T break) to uncover the discrete hierarchical structure of the nodes with k > 100 based on the normalized weights $\hat{W}_i$.
Fig. 5 shows the percentage of users who have the same number of layers according to the k-means and H/T break clustering algorithms. As shown in Fig. 5 (a) and (b), the alters belonging to investors with degree k > 100 in the SVEIN are mainly divided into 2-4 classes and 4-6 classes according to the k-means and H/T break algorithms, respectively. We also find that the alters of 56.9% of the investors can be grouped into 5 layers. In order to measure the similarity and robustness of the clustering results, we further estimate the Jaccard coefficient between the clustering results of the two algorithms for the same user. The average Jaccard coefficient over all users is 0.11. As illustrated in Fig. 5 (c) and (d), we find that in the SVCN the alters of the users with degree k > 100 are mainly divided into 3-6 classes and 4-5 classes based on the k-means algorithm and the H/T break algorithm, respectively. The average Jaccard coefficient of the clustering results is 0.23. Our results thus indicate that the overlap between the clusters obtained from the two algorithms is low.
Table 3 shows the comparison of the clustering results for the users with degree k > 100 in both networks based on the k-means and H/T break algorithms. The results of the two clustering algorithms for the SVEIN are reported in Panel A of Table 3. We find that 43% of the users with degree k > 100 in the SVEIN are grouped into 3 layers and the average cumulative numbers of alters in the layers are 10.9, 45.8 and 141.7. The last two layers correspond to the middle two layers of the empirical discrete hierarchical structure, and the first layer seems to correspond to the coalescence of the first two layers of the empirical structure reported previously (Zhou et al., 2005; Hill and Dunbar, 2003). The H/T break algorithm reveals that the alters of about 90% of the investors exhibit a configuration with 5 or 6 layers. One can observe that the numbers of alters in the outer four layers are very close to the theoretical Dunbar Circle of 5, 15, 50, and 150. The number of alters in the inner one or two layers is only 1-3.
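As an illustration of how two clusterings of the same ego's alters can be compared, the sketch below computes a pair-counting Jaccard coefficient. The text does not state which variant of the Jaccard coefficient is used, so this particular definition (pairs of alters grouped together by both algorithms, divided by pairs grouped together by at least one) is an assumption, and the function name is hypothetical.

```python
from itertools import combinations


def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard similarity between two partitions of the same items.

    labels_a[i] and labels_b[i] are the cluster labels of item i under the
    two algorithms. Returns a value in [0, 1]; 1 means identical groupings.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
    denom = both + only_a + only_b
    # If no pair is ever grouped together, the partitions agree trivially.
    return both / denom if denom else 1.0


# Toy example: four alters, two different groupings.
toy_jaccard = pair_jaccard([0, 0, 1, 1], [0, 0, 0, 1])
```

In the toy example one pair is grouped together by both labelings and three more pairs by only one of them, giving 1/4. The measure depends only on the grouping, not on the label values themselves.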
Panel B of Table 3 lists the cumulative number of friends in each layer for the SVCN. For the k-means algorithm, we find that 16,918 users (a fraction of 41.1%) have a four-layer structure. The average cumulative numbers of alters from inside to outside are 3.0, 12.8, 42.8 and 132.0, in agreement with the discrete hierarchical structure 3-5, 10-15, 30-50, and 100-200 previously reported (Zhou et al., 2005; Hill and Dunbar, 2003). The corresponding scale ratio is 3.22, which is close to the scaling ratio of about 3 associated with Dunbar's hierarchy. We also find that 15,209 users have a five-layer structure with average cumulative numbers of 2.1, 7.3, 20.4, 54.2, and 141.4. Apart from the innermost layer $n_1 = 2.1$, the numbers of alters in the outer four layers are very close to the hierarchical structure reported by Zhou et al. (2005) and Hill and Dunbar (2003). For the H/T break algorithm, 29,125 users (about 50.2%) exhibit a four-layer structure, with average cumulative numbers of alters of 2.1, 8.7, 33.4 and 133.9. There are 25,539 users (about 44.1%) whose alters can be classified into 5 layers, with average cumulative numbers of alters in successive layers of 1.2, 3.8, 11.7, 39.5 and 147.6.
Both clustering algorithms reveal a similar discrete hierarchical structure in cellphone networks. We find that there is an extra innermost layer (1.2-2.1), with about 1-2 alters, for the users with four layers in their ego networks. We further fix the number of clusters to 4 for the k-means algorithm and estimate the cumulative numbers of alters in each layer, obtaining 2.5, 10.3, 36.8, and 142.2. In addition, we perform the clustering analysis on the link activities of each ego network whose ego investor has a degree 50 < k < 100, by means of the k-means algorithm. We find that 621 investors (about 44.9%) have a two-layer structure, with cumulative layer sizes of 19.8 and 67.2, which are close to the middle two layers of the previously reported hierarchical structure (Zhou et al., 2005; Hill and Dunbar, 2003).
The empirical hierarchical structures of the personal ego networks in the SVEIN and SVCN are compatible with the structure of 3-5, 10-15, 30-50, 100-200 from the inner to the outer layer, which is close to the theoretical Dunbar Circle. The average empirical scaling ratio is close to the previously found value of 3, which can also be accounted for theoretically (Lera and Sornette, 2019).
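For intuition about the scaling ratio, the sketch below computes the direct ratios between successive cumulative layer sizes and their mean. This naive estimate is an assumption made for illustration: the scale ratios reported in Table 3 are obtained from the model of Tamarit et al. (2018), so they need not coincide with the values produced here. The function names are hypothetical.

```python
def scale_ratios(cumulative_sizes):
    """Ratios between successive cumulative layer sizes, inner to outer."""
    return [outer / inner
            for inner, outer in zip(cumulative_sizes, cumulative_sizes[1:])]


def mean_scale_ratio(cumulative_sizes):
    """Arithmetic mean of the successive ratios (a naive estimate)."""
    ratios = scale_ratios(cumulative_sizes)
    return sum(ratios) / len(ratios)


# Theoretical Dunbar Circle: ratios 3.0, 3.33..., 3.0.
dunbar_ratio = mean_scale_ratio([5, 15, 50, 150])

# Four-layer k-means averages quoted in the text for the SVCN.
svcn_ratios = scale_ratios([3.0, 12.8, 42.8, 132.0])
```

For the Dunbar Circle the mean ratio is about 3.1; for the quoted SVCN averages the individual ratios decrease from the inner to the outer layers while remaining of order 3 to 4.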
Figs. 6 and 7 show the distributions of the numbers of alters in each layer for the egos having degree k > 100 in the SVEIN and SVCN. We only show the nodes whose personal ego networks have three-layer and four-layer (respectively, five-layer and six-layer) structures in the SVEIN (SVCN). For both networks, the clustering results of the two algorithms are not in agreement with each other, as reflected by the low values of their Jaccard coefficients. An intriguing phenomenon is that the empirical distributions of the number of alters in each layer can be well fitted by log-normal distributions, as evidenced by the solid curves. Such log-normal distributions are robust to the choice of clustering algorithm, which is in agreement with the results for the online social networks of Facebook and Twitter (Dunbar et al., 2015).
Table 3: Comparison of the clustering results for the users with degree k > 100 based on the k-means and H/T break algorithms for the SVEIN and SVCN. N and f represent the total number and the percentage of users. $n_k$ stands for the cumulative number of users in the k-th layer. r is the average scale ratio.

Fits to the theoretical model
We further fit the clustering results to the theoretical model of layer structures in personal social networks (Tamarit et al., 2018). This model gives the probability that the alters of an individual are divided into $\boldsymbol{\ell} = (\ell_1, \ell_2, \ldots, \ell_r)$, where $\boldsymbol{\ell}$ represents the number of alters in each layer (see Tamarit et al., 2018, for the full expression). $L$ represents the sum of the expected numbers of alters in each layer and is equal to the total number of alters. $N$ is the total number of individuals in the network. $B(L, p, N) = \binom{N}{L} p^{L} (1-p)^{N-L}$ represents a binomial distribution. There is a unique parameter $\mu$ in the model, which is an indicator of the discrete hierarchy of the ego network. The parameter $\mu$ is approximately equal to the logarithm of the scale ratio, $\log(r)$, between the cumulative numbers of individuals in successive layers, if the personal investment (time and energy) decreases linearly with the layers (Tamarit et al., 2018). Once the empirical hierarchical structure of the egos is obtained, we calculate the average scale ratio $r$ between adjacent cumulative layers based on the model of Tamarit et al. (2018). The estimated theoretical scale ratios of both algorithms are listed in the last column of Table 3. For the SVEIN, the k-means algorithm indicates that the users are preferentially divided into the group having a three-layer structure, while the H/T break algorithm uncovers that the ego networks exhibit a configuration of five layers. Their scale ratios are very close to the scaling ratio of 3 discovered by Zhou et al. (2005). However, we find significant differences in the average scale ratio between the two clustering algorithms for the SVCN. On average, the scale ratio of the H/T break algorithm is larger than 3.5 and the scale ratio obtained with the k-means algorithm is smaller than 3.5. Both clustering algorithms reveal that most of the users exhibit a four-layer structure in their ego networks, for which the scale ratios are respectively 3.2 and 4.0, roughly compatible with the scale ratio reported previously (Zhou et al., 2005).
Figs. 8 and 9 show the distributions of the estimated average scale ratios for the egos having the same layer structure in both networks. We find that the scale ratio distributions given by the model of Tamarit et al. (2018) conform to log-normal distributions for both clustering algorithms. The $\chi^2$ test, the KS test and the AD test cannot reject the null hypothesis that the scale ratios are log-normally distributed at the significance level of 5%. The solid curves in Figs. 8 and 9 are the best fits to the log-normal distributions. The estimated $\hat{\mu}$ of the scaling ratios are located in the range of 2.5-3.3, which is compatible with the scaling ratio of 3 discovered by Zhou et al. (2005). Our results reveal that the ego networks in the SVEIN exhibit layer structures very similar to those in the SVCN, supporting the view that the SVEIN captures the information-spreading channels between investors.

Conclusion
We have performed a comparative analysis to detect the layer structures in Empirical Investor Networks and Cellphone Communication Networks. The layer structures have been quantified by two clustering algorithms, namely the k-means and H/T break algorithms. Both clustering algorithms reveal that there are two types of inner structure in both networks: one exhibits a layer structure similar to that of the theoretical Dunbar Circle, while the other has an additional inner layer, which is also found in Facebook and Twitter datasets (Dunbar et al., 2015). Furthermore, we find that both networks have a similar scale ratio (close to 3). More interestingly, these scale ratios remain stable even when old alters are replaced by new alters. By fitting our empirical clustering results to the theoretical model of layer structures (Tamarit et al., 2018), we confirm that the scale ratios of different egos follow a log-normal distribution for both networks. Our results provide strong evidence that the structures of ego networks in the EIN and CN exhibit great similarities, which suggests that the EIN captures the information-spreading routes between investors and supports its underlying assumption.