Prediction of new scientific collaborations through multiplex networks

The establishment of new collaborations among scientists fertilizes the scientific environment, fostering novel discoveries. Understanding the dynamics driving the development of scientific collaborations is thus crucial to characterize the structure and evolution of science. In this work, we leverage the information included in publication records and reconstruct a categorical multiplex network to improve the prediction of new scientific collaborations. Specifically, we merge different bibliographic sources to quantify the prediction potential of scientific credit, represented by citations, and common interests, measured by the usage of common keywords. We compare several link prediction algorithms based on different dyadic and triadic interactions among scientists, including a recently proposed metric that fully exploits the multiplex representation of scientific networks. Our work paves the way for a deeper understanding of the dynamics driving scientific collaborations, and validates a new algorithm that can be readily applied to link prediction in systems represented as multiplex networks.


Introduction
One of the main drivers of scientific discoveries is the establishment of new collaborations among researchers. The collision of different scientific trajectories, even if they belong to the same research area, brings together different methods, concepts, and ideas, fostering the ideal environment for scientific creativity. Understanding the dynamics that drive the development of scientific collaborations is thus pivotal to characterize the structure and evolution of science [1]. In this endeavour, two factors play a crucial role. On the one hand, the digitalization of large-scale bibliographic databases has provided comprehensive data sets of publication records including research from all disciplines, without geographical limits. By leveraging these databases [2], researchers have pictured the structure of different research fields [3,4], measured the emergence of new interdisciplinary areas [5,6], mapped the evolution of scientific interests [7,8], and characterized scientific productivity at the individual and geographical level [9-12]. On the other hand, network science [13] has been established as the main tool to analyze and model cooperation in science. Since the seminal work by Newman [14], scientific collaborations have been represented in the form of a network, where nodes stand for scientists and a link between two nodes is drawn if the two scientists have co-authored a paper.
Forecasting a new collaboration translates, within the network science domain, into a link prediction problem [15], a prolific area of network research with applications ranging from detecting hidden links in economic networks [16] to enhancing user experience in online social platforms [17]. Many link prediction algorithms are based on similarity measures computed on the node attributes, i.e. two nodes are likely to be linked if they are similar with respect to certain features [18]. In social networks, one of the most successful similarity metrics for link prediction is the presence of common neighbors between two nodes, e.g. a new friendship on Facebook can be recommended on the basis of the number of common friends shared [19,20]. Despite its simplicity, this concept has proven to be quite successful for link prediction in scientific collaboration networks [21,22]. Moving from this simple approach, several attempts have been made to improve the prediction of collaborations by incorporating additional layers of data, for instance, by adding information about the organization the authors work at [23], topical interest [24], time at which collaborations are established [25], offline relationships among employees of the same university [26], weights of the collaboration links [27] or journal information [28]. However, in most of these approaches, the scores are computed individually for each set of data and then aggregated into a unique score, possibly after associating a specific weight to each set.
In this paper, we merge different bibliographic sources to leverage the whole information included in publication records to improve the prediction of new collaborations. To this aim, we reconstruct a multiplex network [29,30] in which nodes represent scientists and different kinds of relations among them are encoded in different layers, i.e., a given relational category corresponds to a layer, see Fig. 1(A). In particular, we focus on scientific credit, represented by citations, and common interests, measured by the usage of common keywords, to predict new collaborations. We compare several link prediction algorithms based on different dyadic and triadic interactions between scientists. We also consider a recently proposed metric for link prediction in multiplex networks, based on a generalization of the Adamic-Adar method for single-layered networks [31], able to fully exploit the multiplex representation of scientific networks. We show that scientific credit and common scientific interests can be predictive of new collaborations between scientists.

Data
Our dataset is built by merging two different bibliographic sources. The first is the American Physical Society (APS) database, which includes authors' names, publication dates and references of over 400,000 papers published from 1893 to 2009 [32]; here we consider the disambiguated dataset published in Ref. [12]. The second is the ArnetMiner database [33], containing title, author list, publication year and keywords for almost 155 million papers belonging to multiple research fields. From this dataset, we select only those papers present in the APS dataset, by matching the DOI. Our final dataset contains, for each paper, the list of authors with their affiliations, the list of keywords associated with the paper, and the papers cited as references. Before analyzing the data, we apply a cleaning procedure to the keyword information, see Additional file 1 for details.
We then reconstruct a scientific weighted multiplex network [34], where nodes represent scientists and different layers account for different interactions among them: collaborations, common interests, and scientific credit, see Fig. 1(A). A first layer (c) represents collaborations and corresponds to a classical co-authorship network: two authors are linked if they have published at least one paper together. A second layer (r) represents scientific credit, measured by references or citations: a link from author u to author v indicates that u cited at least one paper by v. Lastly, the third layer (k) represents common scientific interests, which can be measured by the usage of common keywords: two authors are connected if, out of all the keywords they have ever used, they have at least one in common. The collaboration and keyword layers are formed by undirected links, while the reference layer includes directed interactions. Finally, the weight $w^{\alpha}_{uv}$ of a link represents the number of co-authored papers ($\alpha = c$), citations ($\alpha = r$), or common keywords ($\alpha = k$) between two authors u and v.
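The construction of the three layers can be illustrated with a minimal sketch. The record layout (a dictionary with "authors", "keywords" and "refs" fields), the function name `build_layers`, and the choice to skip self-citations are assumptions made for illustration only; they are not taken from the original pipeline.

```python
from collections import defaultdict
from itertools import combinations


def build_layers(papers):
    """Build the three weighted layers from paper records.

    papers: dict mapping a paper id to a dict with keys
    "authors", "keywords" and "refs" (ids of cited papers).
    This record layout is hypothetical.
    """
    collab = defaultdict(int)    # frozenset({u, v}) -> co-authored papers
    cites = defaultdict(int)     # (u, v) -> citations from u to papers by v
    used = defaultdict(set)      # author -> keywords ever used

    for paper in papers.values():
        authors = sorted(set(paper["authors"]))
        for u, v in combinations(authors, 2):
            collab[frozenset((u, v))] += 1
        for u in authors:
            used[u].update(paper["keywords"])
        for ref in paper["refs"]:
            if ref not in papers:      # cited paper outside the dataset
                continue
            for u in authors:
                for v in set(papers[ref]["authors"]):
                    if u != v:         # assumption: self-citations are skipped
                        cites[(u, v)] += 1

    # Keyword layer: weight = number of keywords two authors share.
    keywords = {}
    for u, v in combinations(sorted(used), 2):
        shared = len(used[u] & used[v])
        if shared > 0:
            keywords[frozenset((u, v))] = shared
    return dict(collab), dict(cites), keywords


# Toy records: p2 (by C) cites p1 (by A and B).
papers = {
    "p1": {"authors": ["A", "B"], "keywords": ["x", "y"], "refs": []},
    "p2": {"authors": ["C"], "keywords": ["y", "z"], "refs": ["p1"]},
}
collab, cites, keywords = build_layers(papers)
```

Undirected layers are keyed by unordered pairs and the citation layer by ordered pairs, mirroring the undirected and directed nature of the layers described above.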
We consider two consecutive time intervals: a training interval, over which link prediction algorithms are trained, corresponding to a training network with all authors who published a paper between $t_0$ and $t_1$, and a test interval for testing the predictions of new collaborations, including all authors active between $t_1$ and $t_2$. We then consider the prediction of new links in a subset of nodes of these networks, which we name the Core, corresponding to the authors that have at least $k_{\min}$ edges in the collaboration layer, i.e., a minimum number of co-authors equal to $k_{\min}$, in both the training and test intervals. This choice ensures that authors are active in both intervals, as is common practice in link prediction problems on social networks [15]. In order to reduce the computational complexity of the prediction algorithms, we restrict our analysis to papers published in Physical Review Letters (PRL) between $t_0 = 1994$ and $t_2 = 2005$, split at $t_1 = 2000$, see Additional file 1 for details. Table 1 reports several properties of the different layers of the scientific multiplex network and the Core over which link prediction is computed. In particular, note that the keyword layer is denser than the others.

Table 1 Properties of the different layers of the scientific networks and the Core over which link prediction is computed. We show the number of nodes $N$, the total weight $W = \sum_{ij\alpha} w^{\alpha}_{ij}$, the average degree $\langle k \rangle$, the overlap between the collaboration layer and the other layers, and the global clustering coefficient $C$. The overlap is defined as the fraction of links in the collaboration layer that are also present in the citation or keyword layers.
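The selection of the Core and of the new collaborations to be predicted can be sketched as follows. The record layout, the function names, and the half-open intervals $[t_0, t_1)$ and $[t_1, t_2)$ are assumptions for illustration; the text does not specify how the boundary years are assigned.

```python
from collections import defaultdict


def coauthors(papers, start, end):
    """Map each author to the set of co-authors on papers with start <= year < end."""
    neigh = defaultdict(set)
    for paper in papers:
        if start <= paper["year"] < end:
            for u in paper["authors"]:
                neigh[u].update(a for a in paper["authors"] if a != u)
    return neigh


def core_and_new_links(papers, t0, t1, t2, k_min):
    """Return the Core and the new Core collaborations of the test interval."""
    train = coauthors(papers, t0, t1)
    test = coauthors(papers, t1, t2)
    # Core: at least k_min co-authors in both the training and test intervals.
    core = {u for u in train
            if len(train[u]) >= k_min and len(test.get(u, ())) >= k_min}
    # New links: Core pairs linked in the test interval but not in training.
    new = {frozenset((u, v))
           for u in core for v in test[u]
           if v in core and v not in train[u]}
    return core, new


# Toy records (hypothetical layout: publication year and author list).
records = [
    {"year": 1995, "authors": ["A", "B"]},
    {"year": 1997, "authors": ["A", "C"]},
    {"year": 2001, "authors": ["B", "C"]},
    {"year": 2003, "authors": ["A", "D"]},
]
core, new = core_and_new_links(records, 1994, 2000, 2005, 1)
```

In this toy example, author D is excluded from the Core because D has no co-authors in the training interval, so the A-D collaboration is not a prediction target.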

Link prediction algorithms
To determine whether the information provided by the citation and keyword layers is actually useful to predict the appearance of new links in the collaboration layer (see Fig. 1(B)), we propose several novel metrics based on the similarity between nodes in these layers. First, we consider metrics based on dyadic interactions between scientists: to predict a new collaboration between nodes u and v (i.e., a new uv link in the collaboration layer), we consider links between nodes u and v in different layers. For instance, we consider Mutual Citations (MC): if two authors mutually cite each other, it might be more likely for them to collaborate. The MC score between nodes u and v is defined simply as the weight of the link between u and v in the citation layer. Similarly, we consider Common Keywords (CK): if two authors show common scientific interests, using the same set of keywords, the chances that they collaborate in the future should be higher than if they did not have common interests. Thus, the CK score between nodes u and v can be expressed as the weight of the link between u and v in the keyword layer, $w^{k}_{uv}$. For each case, we also define a normalized variant. The Normalized Mutual Citations (NMC) score normalizes the number of citations between two authors by the total citations received by each of them. The idea is that mutual citations between very popular scientists (who attract many citations in general) should count less than mutual citations between scientists receiving fewer citations. The NMC is thus defined in terms of $s^{r}_{u} = \sum_{v} w^{r}_{vu}$, the total number of citations received by u, corresponding to its total incoming strength. Note that this metric considers the directed citation network, explicitly differentiating between incoming and outgoing citations. The last dyadic metric considered is the Normalized Common Keywords (NCK), computed from $K_u$, the keyword list used by node u. Here, the idea is that authors using more keywords than others are more likely to share keywords with someone else.

Next, we consider metrics based on triadic closure: to predict a new uv link in the collaboration layer, we consider triangles involving nodes u and v in different layers. The most common and successful method of this class was developed by Adamic and Adar (AA) [19]. The AA score between nodes u and v is given by counting their common neighbors w, each weighted by the inverse of the logarithm of its degree. In this way, more active authors, who are more likely to be common neighbors of a given pair of nodes, weigh less in the AA score. In a multiplex network, the AA score can be applied to different layers, i.e., considering neighbors also in layers different from the one where new links are predicted. Therefore, the AA score computed by counting neighbors in layer $\alpha$ can be defined as

$$AA^{\alpha}_{uv} = \sum_{w \in \Gamma^{\alpha}(u) \cap \Gamma^{\alpha}(v)} \frac{1}{\log k^{\alpha}_{w}}, \qquad (5)$$

where $\Gamma^{\alpha}(u)$ represents the set of neighbors of node u in layer $\alpha$ and $k^{\alpha}_{w} = |\Gamma^{\alpha}(w)|$ is the degree of node w in layer $\alpha$. By applying Eq. (5) to the collaboration layer ($\alpha = c$), one has the classical AA score for collaboration networks, $AA^{c}$: two scientists are more likely to collaborate if they share many common collaborators. Equation (5) can also be applied to the citation ($\alpha = r$) or keyword ($\alpha = k$) layer, the rationale being that two scientists are more likely to collaborate if they cite the same set of authors ($AA^{r}$ score for the citation layer), or have similar scientific interests ($AA^{k}$ score for the keyword layer). Note that in all cases, the common neighbors of u and v can be both in the Core and outside it.
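As an illustration of the single-layer AA score, the following minimal sketch computes it from the neighbor sets of one layer. The function name, the dictionary-of-sets representation, and the guard that skips neighbors of degree below two (relevant only if the neighbor sets come from a directed layer) are assumptions, not part of the original method.

```python
import math


def adamic_adar(neighbors, u, v):
    """Adamic-Adar score of the pair (u, v) in a single layer.

    neighbors: dict mapping each node to the set of its neighbors
    in the chosen layer (hypothetical representation).
    """
    common = neighbors.get(u, set()) & neighbors.get(v, set())
    score = 0.0
    for w in common:
        k_w = len(neighbors.get(w, ()))
        if k_w > 1:  # assumption: skip w when log(k_w) would be zero or undefined
            score += 1.0 / math.log(k_w)
    return score


# Toy undirected layer: u and v share the neighbors a (degree 2) and b (degree 3).
layer = {
    "u": {"a", "b"},
    "v": {"a", "b"},
    "a": {"u", "v"},
    "b": {"u", "v", "c"},
    "c": {"b"},
}
score_uv = adamic_adar(layer, "u", "v")
```

Calling the same function with the neighbor sets of the collaboration, citation, or keyword layer gives the corresponding per-layer score; in an undirected layer every common neighbor has degree at least two, so the guard never triggers.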
Finally, we consider a recently proposed generalization of the AA score [19] to multiplex networks, the Multiplex Adamic-Adar (MAA) score, which takes into account all possible triadic closures in multiplex networks [31]. The MAA score for the prediction of a link between nodes u and v in the collaboration layer c, Eq. (6), is defined in terms of $T_{\alpha\beta}$, the different kinds of triadic relations among three nodes u, v and w [31]. While the link uv to be predicted is in the collaboration layer, the other two links uw and vw may lie in any layer. For instance, the link uw may be in the collaboration layer ($\alpha = c$) and the link vw in the citation ($\beta = r$) or keyword ($\beta = k$) layer, or the link uw may be in the citation layer ($\alpha = r$) and the link vw in the keyword layer ($\beta = k$). The coefficients $\eta_{c\alpha}$ and $\eta_{c\beta}$ before each term control the relative weight of each type of triadic closure in the total score of the link; thus $\eta_{c\alpha}$ corresponds to the weight of layer $\alpha$, with $\sum_{\alpha} \eta_{c\alpha} = \eta_{cc} + \eta_{cr} + \eta_{ck} = 1$. The case $\eta_{cc} = 1$, $\eta_{cr} = \eta_{ck} = 0$ corresponds to the classical $AA^{c}$ score on collaboration networks, while $\eta_{cr} = 1$ ($\eta_{ck} = 1$) corresponds to the $AA^{r}$ ($AA^{k}$) score applied to the citation (keyword) layer, see Fig. 1 for a schematic illustration of this process.

Results
The quality of link prediction algorithms is usually evaluated by the Receiver Operating Characteristic (ROC) curve, with the corresponding Area Under the Curve (AUC) value. However, due to the limited number of links present in a network, the AUC of any similarity-based link prediction algorithm is bounded [31,35]. For this reason, we also consider the Precision of the different scores, computed as $n^{*}/n$, where $n$ is the number of new links that we want to predict and $n^{*}$ is the number of correct predictions among the top $n$ links.
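Both evaluation measures can be sketched in a few lines. The function names are hypothetical, $n$ is taken here to be the number of new links actually observed in the test interval, and the AUC is computed through its pairwise-comparison form with ties counted as one half; these are assumptions for illustration, and the quadratic-time AUC below is only suitable for small examples.

```python
def precision_at_n(scores, positives):
    """Fraction of correct predictions among the top-n ranked pairs.

    scores: dict mapping each candidate pair to its score.
    positives: set of pairs that became links in the test interval.
    Assumption: n is the number of positives; ties are broken arbitrarily.
    """
    n = len(positives)
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = sum(1 for pair in ranked[:n] if pair in positives)
    return hits / n


def auc(scores, positives):
    """Probability that a random positive pair outranks a random negative one."""
    pos = [s for pair, s in scores.items() if pair in positives]
    neg = [s for pair, s in scores.items() if pair not in positives]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


# Toy ranking: two of the five candidate pairs become new links.
scores = {("A", "B"): 3.0, ("A", "C"): 2.0, ("A", "D"): 1.0,
          ("B", "C"): 0.0, ("B", "D"): 0.0}
positives = {("A", "B"), ("A", "D")}
p = precision_at_n(scores, positives)
a = auc(scores, positives)
```

Pairs with zero score cannot be ordered among themselves, which is the origin of the bound on the AUC mentioned above: a positive pair that receives no score is tied with all scoreless negative pairs.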
As a first step, we explore the coefficients $\eta_{c\alpha}$ of the MAA metric to find the combination that maximizes the prediction of new collaborations. Figure 2 shows the AUC and Precision of the MAA metric, given by Eq. (6), as a function of the coefficients $\eta_{c\alpha}$. Figure 2(a) shows that the AUC value has an important contribution from triads involving the citation and keyword layers, as shown by the discontinuity for $\eta_{cc} < 1$. This result is consistent with the fact that citation and keyword relationships add to the information carried by the collaboration layer, see Table 2. The Precision is maximum for $\eta_{ck} = 0.05$ and $\eta_{cr} = 0.1$ (see Fig. 2(b)), that is, $\eta_{cc} = 0.85$ given the normalization of the coefficients, showing that the contribution of the collaboration layer is important to keep a high Precision. Next, we compare the other scores with the MAA metric with this combination of coefficients. Table 2 shows the Precision and AUC values obtained for all proposed metrics, together with the theoretical bounds of the AUC. Interestingly, the $AA^{c}$ score (the classical AA metric for collaboration networks) has an AUC value quite close to the random one, but the second highest Precision after the MAA score. This reflects the fact that, even though the heuristic behind the $AA^{c}$ metric seems to be a good proxy of the real dynamics, the limited amount of information hinders the prediction process. On the other hand, the keyword layer is the densest one and thus carries much more information than the others, yielding a larger theoretical maximum for the AUC of the metrics based on this layer, such as the CK, NCK, and $AA^{k}$ scores. However, the Precision of these scores is not as good as that of other metrics, indicating that sharing keywords is not such a good descriptor of the dynamics behind establishing new collaborations. Note that the AA score applied to the aggregated, single-layered network given by the projection of all layers onto a single layer ($AA^{a}$) is indistinguishable from the $AA^{k}$ score, given that the projected network is dominated by the keyword layer. Metrics based on the citation layer show a behavior between the other two: the citation layer carries less information than the keyword layer but more than the collaboration one. Therefore, the AUC value of the $AA^{r}$ method is larger than the AUC of the $AA^{c}$ method. Note, however, that dyadic metrics such as the MC and NMC scores have a much lower AUC (also lower than the $AA^{c}$ score), even if they show a slightly larger Precision. This indicates that simple dyadic metrics cannot outperform scores based on triadic closures with respect to citations. Finally, the MAA metric given by Eq. (6) with coefficients $\eta_{ck} = 0.05$ and $\eta_{cr} = 0.1$, which maximize both AUC and Precision, has a much larger AUC and Precision than all single-layer metrics.

Table 2 Precision and AUC values obtained for the different metrics proposed, with the theoretical bounds of the AUC. We consider dyadic metrics given by Eqs. (1)-(4), triadic closure given by the AA metric, Eq. (5), applied to each layer ($AA^{c}$, $AA^{r}$, and $AA^{k}$) and to the aggregated network ($AA^{a}$), and the MAA score given by Eq. (6), with coefficients $\eta_{ck} = 0.05$ and $\eta_{cr} = 0.1$ which maximize both AUC and Precision (see Fig. 2). Note that dyadic (MC, NMC, CK, and NCK) and triadic (based on AA) methods use different amounts of information, so the theoretical bounds for the AUC are different.
The detailed ROC curves of the different dyadic and triadic scores are shown in Fig. 3. Note that the curves obtained by normalized scores (NMC and NCK) are not shown, since they exactly overlap the corresponding ROC curves for non-normalized scores (MC and CK, respectively), indicating that the normalization factor has no effect. Also, the ROC curve obtained by applying the AA score to the aggregated network, $AA^{a}$, is equivalent to the ROC curve of the $AA^{k}$ score, and thus it is not shown, for clarity. This behavior is also confirmed by the AUC values reported in Table 2. Figure 3 shows that the MAA metric given by Eq. (6), with coefficients $\eta_{c\alpha}$ that maximize the AUC, clearly outperforms all other metrics. Finally, Fig. 3 clearly shows the point of the ranking beyond which only scoreless links remain, and thus the curves start to follow a linear trend. This point is different for different metrics and it is responsible for the theoretical bounds of the AUC indicated in

Conclusions
To sum up, we have shown that scientific credit and common scientific interests can be predictive of new collaborations between scientists. For this purpose, we reconstructed a dataset of publication records by merging different bibliographic sources, including keywords that indicate the topics of papers. We represented this dataset as a multiplex network, in which each layer encodes a different kind of interaction, directed or undirected. Next, we compared several link prediction algorithms, based on different dyadic and triadic interactions between scientists. Our findings show that metrics based on triadic closure generally outperform simpler dyadic scores, and that the contributions of different layers are bounded by the amount of information available in each layer. For this reason, the best results, both in terms of Precision and AUC, are obtained by combining the information present in different layers by means of the Multiplex Adamic-Adar score [31], which fully exploits the multiplex nature of the scientific networks reconstructed here. The coefficients that maximize the Multiplex Adamic-Adar metric indicate how the information structured in the multiplex network can be optimized for the prediction of new scientific collaborations. In this regard, one can notice that the major contribution is given, as expected, by the collaboration layer, while contributions from the citation and keyword layers are smaller. For the keyword layer, this is due to its large density, which improves the AUC but may reduce the Precision. A possible improvement to the prediction of new collaborations could thus be given by a smaller, more precise set of keywords able to better map the different fields of Physics [8]. In future work, it would be interesting to incorporate additional information from publication records into the scientific multiplex network, such as institutional affiliations and geographical locations, to see if these features are predictive of new collaborations.
While in this work we predict new collaborations only within the field of Physics due to the computational complexity of link prediction algorithms, the dataset presented and the prediction metrics proposed here can be applied beyond the