Predicting scientific success based on coauthorship networks

We address the question to what extent the success of scientific articles is due to social influence. Analyzing a data set of over 100,000 publications from the field of Computer Science, we study how centrality in the coauthorship network differs between authors who have highly cited papers and those who do not. We further show that a Machine Learning classifier, based only on coauthorship network centrality metrics measured at the time of publication, is able to predict with high precision whether an article will be highly cited five years after publication. By this we provide quantitative insight into the social dimension of scientific publishing – challenging the perception of citations as an objective, socially unbiased measure of scientific success.


Introduction
Quantitative measures are increasingly used to evaluate the performance of research institutions, departments, and individual scientists. Measures like the absolute or relative number of published research articles are frequently applied to quantify the productivity of scientists. To measure the impact of research, citation-based measures like the total number of citations, the number of citations per published article, or the h-index [9] have been proposed. Proponents of such citation-based measures or rankings argue that they make it possible to assess the quality of research quantitatively and objectively, thus encouraging their use as simple proxies for the success of scientists, institutions, or even whole research fields. The intriguing idea that, by means of citation metrics, the task of assessing research quality can be "outsourced" to the collective intelligence of the scientific community has resulted in citation-based measures becoming increasingly popular among research administrations and governmental decision makers. As a result, such measures are used as one criterion in the evaluation of grant proposals and research institutes or in hiring committees for faculty positions. Considering the potential impact on the careers of scientists, especially young ones, it is reasonable to take a step back and ask a simple question: To what extent do social factors influence the number of citations of their articles? Arguably, this question challenges the perception of science as a systematic pursuit of objective truth, which ideally should [...]
In this paper we address this issue by studying the influence of social structures on scholarly citation behavior. Using a data set comprising more than 100 000 scholarly publications by more than 160 000 authors, we extract time-evolving coauthorship networks and utilize them as a proxy for the evolving social network of the scientific discipline computer science. Based on the assumption that the centrality of scientists in the resulting social network is indicative of the visibility of their work, we then study to what extent the "success" of research articles in terms of citations can be predicted using only knowledge about the embedding of authors in the social network at the time of publication. Our prediction method is based on a random forest classifier and utilizes a set of complementary network centrality measures. We find strong evidence for our hypothesis that authors whose papers are highly cited in the future have, on average, a significantly higher centrality in the social network at the time of publication. Remarkably, we are able to predict whether an article will belong to the 10% most cited articles with a precision of 60%. We argue that this result quantifies the existence of a social bias, manifesting itself in terms of visibility and attention, and influencing the measurable citation "success" of researchers. The presence of such a social bias not only highlights problems with current publication and citation practices. It also threatens the interpretation of citations as objectively awarded esteem, which is the justification for using citation-based measures as universal proxies of quality and success.
The remainder of this article is structured as follows: In section 2 we review a number of works that have studied scientific collaboration structures as well as their relation to citation behavior. In section 3 we describe our data set and provide details of how we construct time-evolving coauthorship networks. We further introduce a set of network-theoretical measures which we utilize to quantitatively assess the centrality and embedding of authors in the evolving coauthorship network. In section 4 we introduce a number of hypotheses about the relations between the position of authors in the coauthorship network and the future success of their publications. We test these hypotheses and obtain a set of candidate measures which are the basis for our prediction method described in section 5. We summarize and interpret our findings in section 6 and discuss their implications for the application of citation-based measures in the quantitative assessment [...]

The Complex Character of Citations
It is remarkable that, even though citation-based measures have been used to quantify research impact for almost sixty years [6], a complete theory of citations is still missing. In particular, researchers studying the social processes of science have long been arguing that citations have different, complex functions that go well beyond a mere attribution of credit [14]. At the level of scientific articles, a citation can be interpreted as a "discursive relation", while at the level of authors citations have an additional meaning as an expression of "professional relations" [14]. Additional interpretations have been identified at aggregate levels, e.g. social groups, institutions, scientific communities, or even countries citing each other. These findings suggest that citations are indeed a complex phenomenon that has both a cognitive and a social dimension [14,20]. This calls into question an oversimplified interpretation of citations as an objective quality indicator. The complex character of scholarly citations was further emphasized recently [13]. Here, the authors argue that, apart from an attribution of scientific merit, references in scientific literature often serve as a tool to guide and orient the reader, to simplify scientific writing, and to associate the work with a particular scientific community. Furthermore, they highlight that citation numbers of articles are crucially influenced not only by the popularity of a research topic and the size of the scientific community, but also by the number of authors as well as their prominence and visibility.
Facilitated by the widespread availability of scholarly citation databases, some advances in the understanding of the dynamics of citations have been made in recent years. Generally, citation practices seem to differ significantly across scientific disciplines, thus complicating the definition of universal citation-based impact measures. However, the remarkable finding that, independent of discipline, citations follow a log-normal distribution and can be rescaled in such a way that citation numbers become comparable [21,22] suggests that the mechanisms behind citation practices are universal across disciplines, and that differences are mainly due to differing community sizes.
In addition to investigations of the differences across scientific communities, the relations between citations and coauthorships were studied in recent works. Using data from a number of scientific journals, it was shown that the citation count of an article is correlated both with the number of authors and the number of institutions involved in its production [5,12]. Studying data from eight highly ranked scientific journals, it was shown [11] that a) single-author publications consistently received the lowest number of citations and b) publications with fewer than five coauthors received fewer citations than the average article. Studying citations between individuals rather than articles, in [16] it was observed that coauthors tend to cite each other sooner after the [...]
Going beyond a mere study of direct coauthorship relations, first attempts to study both citation and coauthorship structures from a network perspective have been made recently. Aiming at a measure that captures both the amount and the reach of citations in a scientific community, a citation index that incorporates the distance of citing authors in the collaboration network was proposed [2]. Another recent study [23] used the topological distance between citing authors in the coauthorship network to extend the notion of self-citations. Interestingly, apart from direct self-citations, this study could not find a strong tendency to cite authors that are close in the coauthorship network.
Different from previous works, in this article we study correlations between the centrality of authors in collaboration networks and the citation success of their research articles. By this we extend, in particular, previous works that use a network perspective on coauthorship structures and citation patterns. Stressing the fact that the social relations of authors play an important role in how much attention and recognition their research receives, we further contribute a quantitative view on previously hypothesized relations between the visibility of authors and citation patterns.

Time-Evolving Collaboration and Citation Networks
In this work we analyze a data set of scholarly citations and collaborations obtained from the Microsoft Academic Search (MSAS) service. The MSAS is a scholarly database containing more than 35 million publication records from 15 scientific disciplines. Using the Application Programming Interface (API) of this service, we extracted a subset of more than 100 000 computer science articles, published between 1996 and 2008, in the following way: First, we retrieved unique numerical identifiers (IDs) of the 20 000 highest ranked authors in the field of computer science. This ranking is the result of an MSAS internal "field rating", taking into account several scholarly metrics of an author (number of publications, citations, h-index) and comparing them to the typical values of these metrics within a certain research field. As the goal was to build coauthorship and citation networks of reasonable size, in a second step we chose 1000 authors i.i.d. uniformly from the set of these 20 000 authors. In the third step, we obtained information on coauthors, publication date, as well as the list and publication date of citing works for all the publications authored by these 1000 authors between 1996 and 2008. This results in a data set consisting of a total of 108 758 publications from the field of computer science, coauthored by a total of 160 891 researchers. Each publication record contains a list of author IDs, which, by means of disambiguation heuristics internally applied by the MSAS service, uniquely identify authors independent of name spelling variations. The absence of name ambiguities is one feature that sets this data set apart from other frequently used data sets on scholarly publications. Based on this data set we extracted a coauthorship network, where nodes represent authors and links represent coauthorship relations between authors. In addition, using the information about citing papers, we extracted citation dynamics, i.e. the time evolution of the number of citations of all publications in our data set. Similar to earlier works, we argue that the coauthorship network can be considered a first-order approximation of the complete scientific collaboration network [16].
Emre Sarigöl, René Pfitzner*, Ingo Scholtes, Antonios Garas, Frank Schweitzer: Predicting Scientific Success Based on Coauthorship Networks
Based on the publication date of an article, we additionally assign time stamps to the extracted coauthor links, thus obtaining time-evolving coauthorship networks.
We analyze the evolution of the coauthorship network using a sliding window of two years in which we aggregate all coauthorships occurring within that time. Starting with 1996, we slide this window in one-year increments and obtain a total of 11 time slices representing the evolution of collaboration structures between 1996 and 2008. We use an extended time window of two years to account for the continuing effect of a coauthorship in terms of awareness about the coauthors' works. Although larger time windows are certainly possible (and their effects interesting to investigate), in this work we are less concerned with the optimal time-window size and consistently use the approach described above. However, consistency checks that we performed with varying time-window sizes suggest that our results are robust.
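The sliding-window aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `coauthorship_slices`, the record layout `(year, author_ids)`, and the toy records are hypothetical. It also assumes that a slice covers the years from `start` to `start + 2` inclusive, which is the reading that reproduces the 11 slices reported for 1996 to 2008.

```python
from itertools import combinations

def coauthorship_slices(publications, first_year=1996, last_year=2008, window=2):
    """Return {(start, end): set of undirected coauthor links} per time slice.

    `publications` is an iterable of (year, author_ids) pairs. A slice covers
    the years start..start+window inclusive (an assumption, see the text).
    """
    slices = {}
    for start in range(first_year, last_year - window + 1):
        end = start + window
        links = set()
        for year, authors in publications:
            if start <= year <= end:
                # Every pair of coauthors of a paper is connected by a link.
                for a, b in combinations(sorted(set(authors)), 2):
                    links.add((a, b))
        slices[(start, end)] = links
    return slices

# Toy records with hypothetical author IDs (not taken from the data set).
toy = [(1996, [1, 2, 3]), (1998, [2, 4]), (2001, [1, 4])]
slices = coauthorship_slices(toy)
```

With the default arguments the window starts in 1996, 1997, ..., 2006, so a publication can contribute links to up to three consecutive slices.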
Table 1 summarizes the number of nodes and links in the coauthorship network, the number of publications in each time slice, as well as the fractional size of the largest connected component (LCC). Note that the time-aggregated network (overall) forms one giant component with only a minor fraction of isolated nodes, whereas some of the time slices fall apart into many separate components. Note also that the size of the largest connected component increases with time, which may indicate either a bias in the coverage of the MSAS database in favour of newer articles, or an increase of "collaborativeness" in science. As we are going to perform a social network analysis of the collaboration time slices, and some measures (like eigenvector centrality) are not well-defined for disconnected graphs, we always perform the following analyses on the largest connected component.
For each network corresponding to one two-year time slice, we compute a number of node-level metrics that allow us to quantitatively monitor the evolution of network positions for all authors. In particular, we compute degree centrality, eigenvector centrality, betweenness centrality, and k-core centrality of authors. For details on the centrality measures used, please refer to the Supplementary Material or the textbook by Newman [18]. Here we utilize implementations of these measures provided by the igraph package [4].
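As an illustration of the restriction to the largest connected component, the following sketch extracts the LCC from a list of coauthor links by breadth-first search and reports its fractional size. It uses only the Python standard library rather than igraph; the helper names are hypothetical, and nodes without any link are not represented in the link list and are therefore ignored, which is a simplification.

```python
from collections import defaultdict, deque

def largest_connected_component(links):
    """Return the node set of the largest connected component."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, best = set(), set()
    for source in neighbors:
        if source in seen:
            continue
        # Breadth-first search collects the component containing `source`.
        component, queue = {source}, deque([source])
        while queue:
            node = queue.popleft()
            for other in neighbors[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

def lcc_fraction(links):
    """Fraction of linked nodes that belong to the largest component."""
    nodes = {n for link in links for n in link}
    return len(largest_connected_component(links)) / len(nodes) if nodes else 0.0

toy_links = [(1, 2), (2, 3), (4, 5)]
lcc = largest_connected_component(toy_links)
```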
A major focus of our work is to assess the predictive power of an author's position in the coauthorship network for the citation success of her future articles. To do so we adopt a so [...]
In particular, we are interested in those publications that are among the most successful ones. Defining success is generally an ambiguous endeavor. As justified in the introduction, here we take the (controversial) viewpoint that success is directly measurable in the number of citations. We specifically focus on a very simple notion of success in terms of highly cited papers and, similar to [17], assume that a paper is successful if five years after publication it has more citations than 90% of all papers published in the same year. We refer to the set of successful papers in year t as P↑(t). The set of remaining papers, i.e. those published at time t that are cited less frequently than the top 10%, is denoted as P↓(t).
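The split into P↑(t) and P↓(t) can be sketched as follows. The function name `split_by_success` and the record layout are hypothetical. The treatment of ties is an assumption: a paper counts as successful only if at least 90% of the papers of the same year have strictly fewer citations.

```python
from bisect import bisect_left
from collections import defaultdict

def split_by_success(papers, top_percent=10):
    """Split papers per publication year into successful and remaining sets.

    `papers` is an iterable of (paper_id, year, citations_after_five_years).
    """
    by_year = defaultdict(list)
    for paper_id, year, citations in papers:
        by_year[year].append((paper_id, citations))
    successful, remaining = {}, {}
    for year, items in by_year.items():
        ranked = sorted(c for _, c in items)
        successful[year], remaining[year] = set(), set()
        for paper_id, citations in items:
            # Number of same-year papers with strictly fewer citations.
            fewer = bisect_left(ranked, citations)
            # Integer arithmetic avoids floating-point issues at the boundary.
            if 100 * fewer >= (100 - top_percent) * len(ranked):
                successful[year].add(paper_id)
            else:
                remaining[year].add(paper_id)
    return successful, remaining

# Ten toy papers from one year with 0, 1, ..., 9 citations.
toy_papers = [("p%d" % i, 2000, i) for i in range(10)]
top, rest = split_by_success(toy_papers)
```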

Statistical Dependence of Coauthorship Structures and Citations
Having a large social network and "knowing the right people" is often a prerequisite for career success. However, science is often thought to be one of the few fields of human endeavor where success depends on the quality of an author's work, rather than on her social connectedness.
Given the time-evolving coauthorship network, as well as the observed success (or lack thereof) of a publication, we investigate two research questions, aiming to quantify the aspect of social influence on citation success. First, we examine whether authors who are central in the coauthorship network tend to publish papers that are more successful than those of non-central authors. Second, we investigate whether the inverse effect is present, i.e. whether the success of a paper influences the future coauthorship centrality of its authors.

Effects of Author Centrality on Citation Success
To quantify the first research question we test the following hypothesis.
H1: At the time of publication, authors of papers in P↑(t) are more central in the coauthorship network than authors of articles in P↓(t).
[...] to their betweenness centrality. A very strong community structure is clearly visible. Furthermore, we highlighted in red one particular author who belonged to group A(t), i.e. authors who did not have a paper in P↑ in 2002, but did so in 2007. Thus, in the considered five-year span the highlighted author moved from a position in the periphery of the coauthorship network to a position in the center. Not only did the author's degree centrality increase (see the size of the node as well as the joined red-colored links), but betweenness centrality also increased strongly.
Note that already in 2002 the author had comparatively high betweenness and degree centrality, which, according to our previous discussion, provided an ideal starting point for citation success in 2007.

Predicting Successful Publications
In the previous sections we presented evidence for the existence of statistical dependencies between authors' coauthorship centrality and the success of their publications. The results suggested that several coauthorship centrality metrics are indicative of citation success. However, we did not identify a single such centrality metric; in particular, we did not find that the mere number of coauthors is sufficient for a paper to become highly cited. Instead, citation success seems to depend on more than one network measure. In this section we present a machine learning classifier to predict whether a publication will be highly cited, based on several features of the authors' position in the coauthorship network.
Previous works have already attempted to predict citation success. For example, in [10] the predictive power of the past h-index for the future h-index of a scientist was presented. Furthermore, in [1] additional indicators, e.g. the length of the career or the number of articles in certain journals, have been integrated into a model to predict the future h-index of scientists. The authors of [17] compare the number of citations an article has received at a given point in time with the expected value in a preferential attachment model for the citation network. Deriving a z-score, the authors present a prediction of which papers will be highly cited in the future.
Recently the authors reevaluated their earlier predictions and confirmed the predictive power of their approach [19]. Whereas these three approaches attempt to predict success based on past citation dynamics, they do not investigate the underlying mechanisms that lead to citation success. Here we address this fundamental question and try to predict citation success based merely on the coauthorship network centrality of authors. Clearly, many different factors will contribute to scientific success. In this work, however, we focus on the social component (based on the coauthorship network) in order to highlight the influence of social, not necessarily merit-based, mechanisms on publication success.
[Figure: Color intensity of the nodes is scaled according to their degree centrality and size of nodes is scaled according to their betweenness centrality.]
Based on the observed relations between author centralities in the coauthorship network and the success of their publications presented in section 4, in this section we investigate whether we can predict a paper's future success.In particular, we try to predict whether a paper will be highly cited five years after its publication based on measures of author centrality in the coauthorship network.
In section 4.1 we presented insights about the statistical dependency of citation success and several social network centrality measures (see Table 3). These results suggest that a naive Bayes predictor for citation success can already yield quite useful results, predicting whether or not a paper will be a toppaper, given ex ante knowledge about the topmetric of the authors. Using k-core centrality as a basis, we apply the following classification rule: If a paper is authored by a top 10% k-core centrality author, then the paper will be among the top 10% most cited papers five years after publication.
To evaluate the quality of this prediction, we consider the error measures precision and recall (see the Supplementary Material for a general definition). Observing that for k-core centrality in a 10% success scenario P(topmetric|toppaper) = 0.21 and P(toppaper|topmetric) = 0.22, and that for a naive Bayes classifier recall = P(topmetric|toppaper) and precision = P(toppaper|topmetric) hold, one sees that a classifier with the above rule yields recall = 21% and precision = 22%. Similarly, instead of k-core centrality, other network measures presented in Table 3 can be used as a basis for the above classification rule. As earlier works have tried to predict the success of papers based on the number of coauthors [11], using degree centrality as a basis for the above classification rule directly extends these attempts, yielding recall = 20% and precision = 20%. Note, however, that degree centrality accumulates all coauthorships that have been established within the two-year sliding window of our analysis, not just the coauthorships of the paper under consideration.
We now ask whether a multi-dimensional naive Bayes classifier can improve this single-metric classification result. Taking into account the intersection of all considered centrality metrics, we consider the following classification rule: If a paper is authored by an author with a top 10% betweenness centrality, degree centrality, k-core centrality, and eigenvector centrality, then the paper will be among the top 10% most cited papers five years after publication.
Using this classifier, we achieve an even better precision of 36%, however at the cost of recall, which diminishes to 15%. Whereas these results already show that a naive Bayes classifier can yield interesting insights, in the following we present a more sophisticated Machine Learning approach that takes multiple network centrality features into account and reduces classification errors.
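The two classification rules can be illustrated with a small sketch. All names and the toy data are hypothetical. Two details are assumptions that the text does not fix: a paper is predicted to be highly cited if at least one of its authors is a top author, and ties at the 10% boundary are broken arbitrarily by the sort order.

```python
def top_authors(centrality, top_percent=10):
    """Authors whose centrality is in the top `top_percent` of all authors."""
    ranked = sorted(centrality, key=centrality.get, reverse=True)
    k = max(1, len(ranked) * top_percent // 100)
    return set(ranked[:k])

def predict_top_papers(paper_authors, metrics, top_percent=10):
    """Predict the papers with an author who is a top author in ALL metrics.

    `metrics` is a non-empty list of dicts mapping author -> centrality.
    """
    elite = set.intersection(*(top_authors(m, top_percent) for m in metrics))
    return {p for p, authors in paper_authors.items() if elite & set(authors)}

def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against the true set."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall

# Toy example: ten authors, three papers, two of them highly cited.
kcore = {"a%d" % i: i for i in range(10)}        # a9 is the top 10% author
degree = {"a%d" % i: 9 - i for i in range(10)}   # a0 is the top 10% author
papers = {"p1": ["a9", "a0"], "p2": ["a1", "a2"], "p3": ["a3"]}
highly_cited = {"p1", "p3"}

single = predict_top_papers(papers, [kcore])          # single-metric rule
both = predict_top_papers(papers, [kcore, degree])    # intersection rule
```

In the toy data no author is in the top 10% of both metrics, so the intersection rule predicts nothing. This mirrors the trade-off described above: requiring more metrics makes the rule more selective and lowers recall.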
We first construct a feature vector for every publication as follows. For each publication appearing in year t, we extract all coauthors and compute the maximum and minimum of their centralities in the coauthorship network constructed based on the time window [t − 2, t]. Then, for each publication we build a feature vector with 10 features containing the maximum and minimum of the centrality metrics considered earlier (degree, eigenvector, betweenness, and k-core), as well as the number of coauthors and the cumulative number of authors a paper has referenced. We then classify all publications regarding whether they fall in P↑ or P↓ according to the aforementioned publication classes, with P↑ defined as the set of the top 10% cited publications and P↓ as the remaining 90%.
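A minimal sketch of the feature construction is given below. The function name, the dictionary layout, and the feature order are hypothetical. Two points are assumptions: authors without a centrality value (for instance because they are outside the largest connected component) are assigned 0.0, and the "number of coauthors" is taken to be the number of authors of the paper.

```python
METRICS = ("degree", "eigenvector", "betweenness", "kcore")

def feature_vector(authors, centralities, n_referenced_authors):
    """Return the 10 features of one publication.

    `centralities[metric]` maps author -> value in the window [t - 2, t].
    """
    features = []
    for metric in METRICS:
        # Missing authors default to 0.0 (assumption, see the text).
        values = [centralities[metric].get(a, 0.0) for a in authors]
        features.extend([max(values), min(values)])
    features.append(len(authors))            # number of coauthors
    features.append(n_referenced_authors)    # cumulative number of referenced authors
    return features

toy_centralities = {
    "degree": {"a": 3, "b": 1},
    "eigenvector": {"a": 0.8, "b": 0.2},
    "betweenness": {"a": 5.0},               # "b" has no value here
    "kcore": {"a": 2, "b": 1},
}
x = feature_vector(["a", "b"], toy_centralities, 40)
```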
The classification is done using a Random Forest classifier [3], extending the concept of classification trees. In general, the Random Forest is known to yield accurate classifications for data with a large number of features [3]. Furthermore, it is a highly scalable classification algorithm, eliminating the need for separate cross validation and error estimation, as these procedures are part of the internal classification routine.
Table 5 summarizes precision, recall, and F-score of the resulting classification. Comparing this result with the expectation from a random guess, which will correctly pick one of the top 10% publications only in 10% of the cases, the achieved precision of 60% is striking. In particular, by only considering positional features of authors in the coauthorship network, we are able to achieve an increase of a factor of six in predictive power compared to a random guess. Also, we obtain a recall value of 18%, meaning that our classifier correctly identified about one fifth of all of the top 10% papers in a given research field. As a random guess would yield a recall of 10%, the Random Forest classifier improves recall by 80%.
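The comparison with a random guess follows directly from the reported values. The last expression is an addition for orientation only: it assumes that the F-score is the balanced F1 measure and uses the rounded precision and recall quoted in the text, so the value in Table 5 may differ.

```latex
\[
\frac{\mathrm{precision}_{\mathrm{RF}}}{\mathrm{precision}_{\mathrm{random}}}
  = \frac{0.60}{0.10} = 6,
\qquad
\frac{\mathrm{recall}_{\mathrm{RF}} - \mathrm{recall}_{\mathrm{random}}}{\mathrm{recall}_{\mathrm{random}}}
  = \frac{0.18 - 0.10}{0.10} = 0.8 = 80\%,
\qquad
F_1 = \frac{2 \cdot 0.60 \cdot 0.18}{0.60 + 0.18} \approx 0.28
\]
```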
This result allows for two conclusions: First, the fact that a high-dimensional random forest classifier performs better than a naive Bayes classifier makes clear that social influence on scientific success cannot be measured by a single value. Second, and most importantly, by solely considering metrics of social influence, such a classifier is able to predict scientific success with high precision.

Discussion and Conclusions
Using a data set on more than 100 000 scholarly publications authored by more than 160 000 authors in the field of computer science, in this article we studied the relation between the centrality of authors in the coauthorship network and the future success of their publications.
Clearly, there are certain limitations to our approach, which we discuss in the following.
First of all, any data-driven study of social behavior in general, and citation behavior in particular, is limited by the completeness and correctness of the data set used. The fact that name [...]
In order to rule out effects that are due to different citation patterns in different disciplines, we limited our study to computer science, for which we expect the coverage of MSAS to be particularly good. While this limits the generalization of our results to other fields, our work nevertheless represents, to the best of our knowledge, the first large-scale case study of social factors in citation practices. As publication practices seem to vary widely across disciplines, it will be interesting to investigate whether our results hold for other research communities as well.
Clearly, any study that tries to evaluate the importance or centrality of actors in a social network needs to be concerned about the choice of suitable centrality measures. In order not to overemphasize one particular dimension of centrality in networks out of the many, we chose to use complementary centrality measures that capture different aspects of importance at the same time. The results of our prediction highlight that the combination of different measures is crucial, making clear that visibility and social influence are too complex to be captured by a single centrality measure.
Finally, one may argue that our observation that authors with high centrality are cited more often is not a statement of a direct causal relation between centrality and citation numbers. After all, both centrality and citations could be secondary effects of, for instance, the scientific excellence of a particular researcher, which then translates into becoming central and highly cited at the same time. Clearly, we neither can nor want to rule out such possible explanations for our statistical findings. However, considering our finding of a strong statistical dependence between social centrality and citation success, one could provocatively state the following: if citation-based measures were good proxies for scientific success, then so should be measures of centrality in the social network. We assume that not many researchers would approve of having their work evaluated by means of such measures. We hence think that our findings are an important contribution to the ongoing debate about the meaningfulness and use of citation-based measures, as well as to a better understanding of citation dynamics in general.
In summary, the contributions of our work are threefold:
1. We provide, to the best of our knowledge, the first large-scale study that analyses relations between the position of researchers in scientific collaboration networks and citation dynamics, using a set of complementary network-based centrality measures. A specific feature of our method is that we study time-evolving collaboration networks and citation numbers, thus allowing us to investigate possible mechanisms of social influence at a microscopic scale.
2. We show that, at least for the measures of centrality investigated in this paper, there is no single notion of centrality in social networks that could accurately predict the future citation success of an author. We expect this finding to be of interest for any general attempt to predict the success of actors based on their centrality in social networks.
3. Using modern machine learning techniques, we present a supervised classification method based on a Random Forest classifier, using a multidimensional feature vector of collaboration network centrality metrics. We show that this method allows for a remarkably precise prediction of the future citation success of a paper, solely based on the social embedding of its authors. With this, our method provides a clear indication of a strong statistical dependence between author centrality and citation success.
In conclusion, we provided evidence for a strong relation between the position of authors in scientific collaboration networks and their future success in terms of citations. We would like to emphasize that by this we do not want to join the ranks of the (sometimes remarkably uncritical) proponents of citation-based evaluation techniques. Instead, we hope to contribute to the discussion about the manifold factors influencing citation measures and their explanatory power concerning scientific success. In particular, we do not see our contribution in the development of automated success prediction techniques, whose widespread adoption could possibly have devastating effects on the general scientific culture and attitude. By highlighting social influence mechanisms, we rather hope that our work contributes to a better understanding of the multifaceted, complex nature of citations, which should be a prerequisite for any reasonable application of citation-based measures.

Centrality Metrics
There are many network metrics that can be used for social network analysis [18,24]. Here we briefly review the metrics used in this work and their interpretation in coauthorship networks.

Degree Centrality
The degree centrality of a node is its number of first-order neighbors, i.e. the number of nodes this node connects to via one link. In a directed network, this measure is divided into an in-degree centrality and an out-degree centrality. Since the coauthorship network considered here is undirected, the degree centrality of a node is simply the number of its direct neighbors. Degree centrality is a local measure, as it does not depend on any network properties other than the number of a node's neighbors. In the coauthorship network the degree centrality of a node is its number of coauthors.
Eigenvector Centrality
The eigenvector centrality of a node is a global centrality measure, as non-local changes in the network can alter the node's eigenvector centrality. In short, a node has high eigenvector centrality if it is connected to other nodes with high eigenvector centrality. As such, this centrality measure goes beyond degree centrality as a mere measure of quantity (the number of neighbors) in that it introduces a notion of inheritance of importance. Used often, especially in its variant PageRank, the eigenvector centrality of node v is the v-th component of the Perron-Frobenius eigenvector of the network's adjacency matrix. In the coauthorship network, eigenvector centrality has a meaning of importance, if one assumes that an author is more important if she coauthors papers with other authors of high importance.
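For illustration, the leading eigenvector can be approximated by power iteration. The sketch below is not the igraph implementation used in this work; the function name and the toy graph are hypothetical. It iterates with A + I instead of the adjacency matrix A, which has the same eigenvectors but avoids oscillation on bipartite graphs, and it assumes a connected, undirected graph.

```python
def eigenvector_centrality(neighbors, iterations=200, tol=1e-10):
    """Approximate eigenvector centrality by power iteration.

    `neighbors` maps each node to the set of its adjacent nodes. Scores are
    scaled so that the largest score equals 1.
    """
    score = {v: 1.0 for v in neighbors}
    for _ in range(iterations):
        # One multiplication with (A + I): own score plus neighbors' scores.
        new = {v: score[v] + sum(score[u] for u in neighbors[v])
               for v in neighbors}
        norm = max(new.values())
        new = {v: x / norm for v, x in new.items()}
        if max(abs(new[v] - score[v]) for v in neighbors) < tol:
            return new
        score = new
    return score

# Star graph: one center "c" connected to three leaves.
star = {"c": {"l1", "l2", "l3"}, "l1": {"c"}, "l2": {"c"}, "l3": {"c"}}
scores = eigenvector_centrality(star)
```

For the star graph the leading eigenvector of the adjacency matrix assigns the center a value √3 times that of a leaf, so each leaf ends up with a score of about 0.577 after scaling.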
Betweenness Centrality
The betweenness centrality is another often used global centrality measure. A node has high betweenness centrality if it lies on many shortest paths of the network. Hence, this centrality measure is a measure of importance in terms of network flows. If a node with high betweenness centrality were removed, many network flows would become less efficient, as the average length of shortest paths would increase. In the coauthorship network, a node with high betweenness centrality could be interpreted as a node with high importance for "fast knowledge transfer", as this person lies on many shortest paths connecting authors and their research.
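A compact way to compute this measure on an unweighted, undirected graph is Brandes' algorithm, sketched below. This is an illustrative standard-library version with hypothetical names, not the igraph routine used in this work. The returned values are unnormalized: they count, for each node, the shortest paths between other node pairs that pass through it, with paths split evenly when several shortest paths exist.

```python
from collections import deque

def betweenness_centrality(neighbors):
    """Brandes' algorithm for unweighted, undirected graphs."""
    bc = {v: 0.0 for v in neighbors}
    for s in neighbors:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in neighbors}
        sigma[s] = 1
        preds = {v: [] for v in neighbors}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in neighbors[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in neighbors}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was counted once from either endpoint.
    return {v: x / 2.0 for v, x in bc.items()}

# Path graph a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
bc = betweenness_centrality(path)
```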

K-Core Centrality
The k-core centrality is a global centrality measure intended to capture the "coreness" of a node, i.e. how deeply it is embedded in the network. A node has k-core centrality k if it survives the consecutive removal of nodes that have degree 1, 2, ..., k − 1 from the network, but is removed in the next step, when nodes with degree k are removed. k-core centrality is somewhat similar to eigenvector centrality, as a node must have neighbors with high k-core in order to have high k-core itself. Unlike eigenvector centrality, k-core centrality is not additive. Hence, compensating for a low number of high k-core neighbors with a high number of low k-core neighbors does not guarantee that the node has high k-core. In the coauthorship network, a node has a high k-core if it is connected to many nodes that have high k-core themselves.
Emre Sarigöl, René Pfitzner*, Ingo Scholtes, Antonios Garas, Frank Schweitzer: Predicting Scientific Success Based on Coauthorship Networks
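The pruning procedure described above can be sketched as follows. The snippet assumes the usual reading of the procedure, in which the removal at each stage is repeated until no node of that degree or lower remains before moving on to the next stage. The function name and the toy network are hypothetical.

```python
def k_core_numbers(adj):
    """Return {node: k-core index} by iteratively peeling low-degree nodes."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Nodes with remaining degree <= k cannot belong to the (k+1)-core.
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1
            continue
        while peel:
            v = peel.pop()
            if v not in remaining:
                continue
            remaining.discard(v)
            core[v] = k
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        peel.append(u)
    return core

# Hypothetical network: a 4-clique (a, b, c, d), an author e attached to a and b,
# an author f attached to e, and a hub g with five one-time coauthors plus a.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("e", "a"), ("e", "b"), ("f", "e"), ("g", "a")]
edges += [("g", "leaf%d" % i) for i in range(5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

core = k_core_numbers(adj)
print(core["a"], core["e"], core["g"])
```

The hub g illustrates the non-additivity: it has six neighbors, yet it is pruned at the first stage because almost all of them have a low k-core themselves.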

Correlations Between Citation Numbers and Centrality Metrics
In the section Effects of Author Centrality on Citation Success of the main manuscript, we argue that the citation numbers of an article, five years after its publication, are neither Pearson- nor Spearman-correlated with the social network centrality metrics of its authors.
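For reference, the two coefficients can be computed as in the following sketch. Pearson's coefficient measures linear association, while Spearman's coefficient is Pearson's coefficient applied to ranks and therefore captures any monotone association. The function names and the numbers are hypothetical and only illustrate the difference between the two measures.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical values: author degree vs. citations of the paper after five years.
degree = [1, 2, 3, 4, 5]
citations = [1, 2, 4, 8, 100]
r_pearson = pearson(degree, citations)
r_spearman = spearman(degree, citations)
print(r_pearson, r_spearman)
```

In this toy example the relation is perfectly monotone but not linear, so the Spearman coefficient equals 1 while the Pearson coefficient stays clearly below 1.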

Precision and Recall
In Machine Learning it is standard practice to assess the goodness of a classifier using the quantities precision and recall [7].
Precision is defined as

$$\mathrm{Precision} = \frac{\mathrm{TruePositives}}{\mathrm{TruePositives} + \mathrm{FalsePositives}}.$$

It is hence equal to the fraction of correctly predicted instances among all predicted instances. As such, precision quantifies how reliable the predicted results are. However, it does not make any statement about how many of the relevant results the predictor returns. For example, a simple predictor could always return a single element that is known ex ante to be a true prediction. In this scenario precision would be 100%, but the sensitivity might be poor, as there might be more than one relevant element. This last point is quantified using recall.
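The trade-off can be made concrete with a small sketch. It uses the standard definition of recall, TruePositives / (TruePositives + FalseNegatives), and hypothetical paper identifiers. The function name `precision_recall` is an assumption for illustration.

```python
def precision_recall(predicted, relevant):
    """Precision and recall of a set of predictions against the relevant set."""
    predicted, relevant = set(predicted), set(relevant)
    tp = len(predicted & relevant)   # correctly predicted
    fp = len(predicted - relevant)   # predicted, but not relevant
    fn = len(relevant - predicted)   # relevant, but missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical ground truth: five papers that turned out to be highly cited.
highly_cited = {"p1", "p2", "p3", "p4", "p5"}

# A cautious predictor that returns a single safe guess.
cautious = precision_recall({"p1"}, highly_cited)

# A broader predictor that also makes one wrong prediction (p9).
broad = precision_recall({"p1", "p2", "p3", "p9"}, highly_cited)

print(cautious, broad)
```

The cautious predictor reaches a precision of 1 but recovers only one of the five relevant papers, whereas the broader predictor trades some precision for a higher recall.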

Figure 1: Illustration of correlation between citation success and centrality in the coauthorship network. Color intensity of the nodes is scaled according to their degree centrality, and size of the nodes is scaled according to their betweenness centrality.

Table 1: Number of papers and size of the collaboration network 2-year subgraphs between 1995-2008 used in our study.

... called hindcasting approach: For each publication p published in a given year t, we extract the list of coauthors as well as the LCC of the coauthorship network in the time slice [t − 2, t], and calculate the centrality measures. Based on the citation data, we furthermore calculate the number of citations c_p that paper p gained within a time frame of five years after publication, i.e. in the time slice [t, t + 5].
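The following sketch illustrates the two time slices for a single publication. It assumes that LCC denotes the largest connected component, that both time slices include their end points, and a hypothetical record layout (dictionaries with `year` and `authors`, and citations as pairs of cited paper and citing year). None of these names come from the original data set.

```python
from collections import defaultdict, deque
from itertools import combinations

def coauthorship_graph(papers, start, end):
    """Neighbor sets of the coauthorship network for papers with start <= year <= end."""
    adj = defaultdict(set)
    for paper in papers:
        if start <= paper["year"] <= end:
            authors = set(paper["authors"])
            for a in authors:
                adj[a]  # make sure authors without coauthors are present
            for a, b in combinations(authors, 2):
                adj[a].add(b)
                adj[b].add(a)
    return dict(adj)

def largest_connected_component(adj):
    """Node set of the largest connected component, found by breadth-first search."""
    seen, best = set(), set()
    for source in adj:
        if source in seen:
            continue
        component, queue = {source}, deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in component:
                    component.add(w)
                    queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

def citations_within(citations, paper_id, t, horizon=5):
    """Number of citations paper_id received in the time slice [t, t + horizon]."""
    return sum(1 for cited, year in citations
               if cited == paper_id and t <= year <= t + horizon)

# Hypothetical records.
papers = [
    {"id": "p1", "year": 1998, "authors": ["A", "B"]},
    {"id": "p2", "year": 1999, "authors": ["B", "C"]},
    {"id": "p3", "year": 2000, "authors": ["C", "D"]},
    {"id": "p4", "year": 2000, "authors": ["E", "F"]},
    {"id": "p5", "year": 1996, "authors": ["A", "E"]},
]
citations = [("p3", 2001), ("p3", 2004), ("p3", 2005), ("p3", 2007), ("p4", 2002)]

t = 2000
graph = coauthorship_graph(papers, t - 2, t)   # time slice [t - 2, t]
lcc = largest_connected_component(graph)
c_p3 = citations_within(citations, "p3", t)    # time slice [t, t + 5]
print(sorted(lcc), c_p3)
```

Centrality measures for the authors of p3 would then be computed on the subgraph induced by `lcc`, so that only information available at the time of publication enters the prediction.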

Table 3: The first entry of each cell indicates what fraction of the papers whose authors are within the set of authors with Top x% centrality metrics are also among the Top x% of all papers in terms of citation success (P(toppaper|topmetric)). The second entry indicates what fraction of the papers that are among the Top x% of all papers in terms of citation success have authors who are within the set of authors with Top x% centrality metrics (P(topmetric|toppaper)). The row Intersection refers to the intersection of all the centrality metrics considered above.

... extent citation success and coauthorship network centrality are statistically dependent is summarized in Table 3. The left entry of each cell indicates what fraction of the papers that have authors with Top x% centrality metrics belong to the Top x% of all papers in terms of citation success (P(toppaper|topmetric)). The right entry of each cell indicates what fraction of the papers that are among the Top x% of all papers in terms of citation success have authors who are within the set of authors with Top x% centrality metrics (P(topmetric|toppaper)). From these results, we draw two observations. First, the probabilities in every cell are well below 1, indicating the absence of a simple linear (Pearson) correlation. Second, considering k-core centrality in particular, if a paper is known to be among the Top 10% most successful, the conditional probability that it was written by an author with Top 10% k-core centrality is P(topmetric|toppaper) = 0.21. Conversely, Table 3 indicates that a fraction P(toppaper|topmetric) = 0.22 of all papers published by authors with Top 10% k-core centrality are successful. Considering the intersection of all four centrality metrics, we even find that a fraction P(toppaper|topmetric) = 0.36 of all papers published by the Top 10% central ...
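The two conditional probabilities reported in Table 3 can be estimated along the following lines. The sketch assumes that a paper counts as written by a top author if at least one of its authors is in the Top x% set, and that ties at the cutoff are broken arbitrarily. Both choices, as well as all names and numbers, are assumptions made for illustration.

```python
def top_fraction(scores, fraction):
    """Keys of the top `fraction` of entries in `scores`, ranked by value."""
    k = max(1, int(round(fraction * len(scores))))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def conditional_overlap(citations, paper_authors, centrality, fraction):
    """Return (P(toppaper|topmetric), P(topmetric|toppaper))."""
    top_papers = top_fraction(citations, fraction)
    top_authors = top_fraction(centrality, fraction)
    # Assumption: one top author is enough for a paper to count as "top metric".
    metric_papers = {p for p, authors in paper_authors.items()
                     if top_authors & set(authors)}
    both = top_papers & metric_papers
    p_paper_given_metric = len(both) / len(metric_papers) if metric_papers else 0.0
    p_metric_given_paper = len(both) / len(top_papers) if top_papers else 0.0
    return p_paper_given_metric, p_metric_given_paper

# Hypothetical data: ten authors, ten papers, Top 20% cutoff.
centrality = {"a1": 0.9, "a2": 0.8, "a3": 0.3, "a4": 0.2, "a5": 0.1,
              "a6": 0.1, "a7": 0.05, "a8": 0.04, "a9": 0.03, "a10": 0.02}
paper_authors = {"p1": ["a1", "a3"], "p2": ["a1"], "p3": ["a2", "a4"],
                 "p4": ["a3"], "p5": ["a4", "a5"], "p6": ["a6"],
                 "p7": ["a7", "a8"], "p8": ["a8"], "p9": ["a9"],
                 "p10": ["a10", "a5"]}
citations = {"p1": 50, "p2": 5, "p3": 2, "p4": 40, "p5": 7,
             "p6": 1, "p7": 0, "p8": 3, "p9": 4, "p10": 6}

p_pm, p_mp = conditional_overlap(citations, paper_authors, centrality, 0.2)
print(p_pm, p_mp)
```

In the toy data, three papers have a top author and one of them is a top paper, while one of the two top papers has a top author, which shows that the two conditional probabilities need not coincide.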

Table 4: P-values of the Wilcoxon-Mann-Whitney test for different coauthorship centralities and alternative hypotheses. Column A presents p-values for authors in set A, column A presents p-values for authors in set A.
... ambiguities are automatically resolved by the Microsoft Academic Search (MSAS) database using sophisticated and validated disambiguation heuristics is a clear advantage over simpler heuristics that have been used in similar studies.

Table 6: Pearson and Spearman coefficients measuring correlations between citation numbers of a paper (five years after publication) and coauthorship network centrality of its authors.

Table 6 summarizes the Pearson and Spearman correlation coefficients for the considered metrics. None of these results allows us to conclude that there is any significant correlation.