Exploiting citation networks for large-scale author name disambiguation
© Schulz et al.; licensee Springer 2014
Received: 23 January 2014
Accepted: 6 August 2014
Published: 25 September 2014
We present a novel algorithm and validation method for disambiguating author names in very large bibliographic data sets and apply it to the full Web of Science (WoS) citation index. Our algorithm relies only upon the author and citation graphs available for the whole period covered by the WoS. A pair-wise publication similarity metric, which is based on common co-authors, self-citations, shared references and citations, is established to perform a two-step agglomerative clustering that first connects individual papers and then merges similar clusters. This parameterized model is optimized using an h-index based recall measure, favoring the correct assignment of well-cited publications, and a name-initials-based precision using WoS metadata and cross-referenced Google Scholar profiles. Despite the use of limited metadata, we reach a recall of 87% and a precision of 88% with a preference for researchers with high h-index values. 47 million articles of WoS can be disambiguated on a single machine in less than a day. We develop an h-index distribution model, confirming that the prediction is in excellent agreement with the empirical data, and yielding insight into the utility of the h-index in real academic ranking scenarios.
The ambiguity of author names is a major barrier to the analysis of large scientific publication databases on the level of individual researchers. Within such databases researchers generally appear only as they appear on any given publication, i.e. by their surname and first name initials. Frequently, however, hundreds or even thousands of individual researchers happen to share the same surname and first name initials. Author name disambiguation is therefore an important prerequisite for author-level analyses of publication data. While many important and interesting problems can be examined without individual-level data, a great many others require such data to get to the real heart of the matter. Good examples include the role of gender in academic career success, whether ideas diffuse through the popularity of individual publications or the reputation of the authors, how the specific competencies and experience of the individual authors recombine to search the space of potential innovations, and whether one can predict scientific careers. Indeed, the importance of getting individual-level data has been widely acknowledged, as can be seen in recent large-scale initiatives to create disambiguated researcher databases, such as ORCID and VIVO.
Algorithmic author name disambiguation is challenging for two reasons. First, existing disambiguation algorithms have to rely on metadata beyond author names to distinguish between authors with the same name, much like some administrative institutions do when they distinguish citizens with the same name based on attributes such as date and place of birth. However, in existing large-scale publication databases, such as Thomson Reuters’ Web of Science (WoS), metadata is often sparse, especially for older publications. Second, disambiguation algorithms may draw false conclusions when faced with incomplete metadata. For instance, when researchers change disciplines they transition to an entirely different part of the citation graph. Therefore, disambiguation algorithms that rely heavily on journal metadata to reconstruct researchers’ career trajectories can easily represent such researchers with two different researcher profiles. This issue can be present in any case where an individual’s metadata (disciplinary profile, collaborators, affiliation) is not consistent over time.
Existing disambiguation algorithms typically exploit metadata like first and middle names, co-authors, publication titles, topic keywords, journal names, and affiliations or email addresses (for an overview see the surveys by Smalheiser and Torvik and by Ferreira et al.). Torvik et al. (enhanced in Torvik and Smalheiser) present a comprehensive method that includes all metadata of the MEDLINE database. The use of citation graph data is less common, however, since only a few databases include this information. Previous attempts to exploit such data include Levin et al., which mainly relies on self-citations, and Tang and Walsh, who used shared references, but only for the disambiguation of two author names. Both retrieve data from the WoS, which is also used by D’Angelo et al. and Reijnhoudt et al., however without exploiting the citation graph. D’Angelo et al. had access to a manually maintained database of Italian researchers as a gold standard, while Reijnhoudt et al. found a ground truth in Dutch full professor publication lists.
Here, we develop and apply a novel author disambiguation algorithm with the explicit goal of measuring the h-index of researchers using the entire WoS citation index database. Introduced by Hirsch in 2005, the h-index is the most widely used measure of an individual’s scientific impact. An individual’s h-index is equal to the number h of publications that are cited at least h times. It is increasingly used in both informal and formal evaluation and career advancement programs. However, despite its rapidly increasing popularity and use, very little is known about the overall distribution of h-indices in science. While an h-index of 30 is certainly less frequent than an h-index of 20, it is unknown how much less frequent. Models have been developed to estimate the distribution based upon some simple assumptions, but at best they relied on incomplete data. Perhaps the most straightforward starting point for considering the distribution of the h-index would be Lotka’s law for scientific productivity; however, in the results section we will show that the empirical data deviate significantly from a Pareto power-law distribution.
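The definition of the h-index given above translates directly into a few lines of code. The following is a minimal Python sketch; the function name `h_index` and the sample citation counts are illustrative and are not taken from the paper.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break         # counts only decrease from here, so h cannot grow
    return h


# Made-up citation counts for a single researcher.
example = [10, 8, 5, 4, 3, 0]
print(h_index(example))
```

Because the counts are sorted in descending order, the loop can stop at the first paper whose citation count falls below its rank.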
The most complete data-centric work to date is that of Radicchi and Castellano, who calculated a probability distribution of h-indices using over 30,000 career profiles acquired via Google Scholar. Indeed this work represents a critical step forward in terms of understanding the overall distribution of h-indices and the high-level dynamics that shape it. However, Google Scholar profiles are biased towards currently active and highly active researchers. As a consequence, their approach may underestimate the number of individuals with a low h-index. A proper understanding of the entire h-index distribution is critical to shaping policies and best practices for using it to assess scientific performance. Furthermore, as research becomes more interdisciplinary, the variation of the h-index distribution across disciplines must be better understood to prevent biased evaluations. To tackle these and similar challenges, we present an algorithm that is optimized towards reproducing the correct h-index of researchers, makes use of the citation network, and is applicable to the entire WoS dataset.
This manuscript will be laid out in the following manner. First, we will describe our algorithm, novel validation & optimization approach, and implementation details. Then we will present the results of our optimization procedure and the empirical h-index distribution produced by our algorithm. We will compare the empirical distribution to the predictions of a simple theoretical h-index model, which together show excellent agreement.
2.1 The disambiguation algorithm
For each paper we use three sets: its reference list, its co-author list, and the set of papers that cite it. Hence in this instantiation of the algorithm, these are the only three pieces of information one must have available for each paper. The ∩-operator together with the enclosing |·|-operator counts the number of common attributes. The first term in Eq. (1) measures the number of co-authors shared by two papers. The second term detects potential self-citations, a well recognized indicator of an increased probability of authorship by the same individual. The third term is the count of common references between the two papers. The fourth term represents the number of papers that cite both publications. The first and last terms are normalized by a technique known as the overlap coefficient. This normalization accounts for the higher likelihood of finding similarities when both co-author lists are very long or both publications are well-cited.
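The four terms described above can be sketched in code. This is one plausible reading, not the authors’ implementation: the `Paper` container, the weight names `w_coauthor`, `w_selfcite`, `w_refs` and `w_cites`, and the choice to count the self-citation term as 0, 1 or 2 (one for each direction in which one paper cites the other) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    pid: int
    coauthors: set = field(default_factory=set)   # co-author identifiers
    references: set = field(default_factory=set)  # IDs of papers this paper cites
    cited_by: set = field(default_factory=set)    # IDs of papers citing this paper


def overlap(a, b):
    """Overlap coefficient |a ∩ b| / min(|a|, |b|); 0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def similarity(p, q, w_coauthor=1.0, w_selfcite=1.0, w_refs=1.0, w_cites=1.0):
    """Weighted sum of the four similarity terms (weights are hypothetical)."""
    shared_coauthors = overlap(p.coauthors, q.coauthors)          # normalized
    self_citations = int(p.pid in q.references) + int(q.pid in p.references)
    shared_refs = len(p.references & q.references)                # raw count
    co_citations = overlap(p.cited_by, q.cited_by)                # normalized
    return (w_coauthor * shared_coauthors + w_selfcite * self_citations
            + w_refs * shared_refs + w_cites * co_citations)


p = Paper(1, {"a", "b"}, {10, 11, 12}, {20, 21})
q = Paper(2, {"b", "c", "d"}, {1, 11, 12}, {21})
print(similarity(p, q))
```

Setting a weight to zero switches the corresponding feature off, which is how the contribution of individual features can be studied.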
Once all pairwise similarities have been calculated, our algorithm moves on to the first of two clustering processes (see Figure 1). In this first clustering we start by establishing a link between each pair of papers for which the similarity score is greater than a first threshold. Then, each connected component (a set of papers in which every paper can be reached from every other paper by traversing the previously created links) is labeled as a cluster. The goal is, of course, that all papers in any given cluster belong to one specific author.
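The first clustering step, thresholding pairwise scores and taking connected components, can be sketched with a union-find structure. The function name, the `scores` dictionary keyed by paper pairs, and the example values are assumptions made for illustration.

```python
def link_and_cluster(paper_ids, scores, threshold):
    """Link pairs whose score exceeds `threshold`; return the connected components."""
    parent = {p: p for p in paper_ids}

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), s in scores.items():
        if s > threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb  # union the two components

    clusters = {}
    for p in paper_ids:
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())


scores = {(1, 2): 3.0, (2, 3): 2.5, (3, 4): 0.5, (4, 5): 0.2}
print(link_and_cluster([1, 2, 3, 4, 5], scores, threshold=1.0))
```

Papers 1 and 3 end up in the same cluster even though they share no direct link, because both are linked to paper 2.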
Here $|\gamma|$ is the number of publications in cluster γ, and similarly for the second cluster. For this step we calculate the similarity between publications in separate clusters. The overall cluster-cluster similarity is the sum of the similarity weights that are above a certain threshold, normalized by the number of papers of the two clusters. A link is then established between the two clusters if the new cluster similarity score is greater than a further threshold. Each connected component (a set of clusters in which every cluster can be reached from every other cluster by traversing links) is then merged into a single cluster. Remaining individual papers are added to a cluster if they have a similarity score above a final threshold with any paper in that cluster. The set of clusters finally resulting from our algorithm is the output of the disambiguation. Each cluster is a set of papers and should ideally contain all papers published by one specific researcher.
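The cluster-merging step can be sketched in the same style. Two points are assumptions here rather than statements from the text: the normalization is taken to be the product of the two cluster sizes, and the final step that attaches remaining individual papers is left out for brevity. The function and parameter names are hypothetical.

```python
from itertools import combinations


def cluster_similarity(c1, c2, score, pair_threshold):
    """Sum of pairwise scores above `pair_threshold`, divided by |c1| * |c2|.

    The product normalization is an assumption of this sketch.
    """
    total = 0.0
    for a in c1:
        for b in c2:
            s = score(a, b)
            if s > pair_threshold:
                total += s
    return total / (len(c1) * len(c2))


def merge_clusters(clusters, score, pair_threshold, cluster_threshold):
    """Link clusters whose similarity exceeds `cluster_threshold`; merge linked groups."""
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All links are computed on the original clusters, then merged at once.
    for i, j in combinations(range(len(clusters)), 2):
        sim = cluster_similarity(clusters[i], clusters[j], score, pair_threshold)
        if sim > cluster_threshold:
            parent[find(i)] = find(j)

    merged = {}
    for i, c in enumerate(clusters):
        merged.setdefault(find(i), set()).update(c)
    return list(merged.values())


table = {frozenset((1, 3)): 0.8, frozenset((2, 3)): 0.9}  # made-up scores


def score(a, b):
    return table.get(frozenset((a, b)), 0.0)


print(merge_clusters([{1, 2}, {3}, {4, 5}], score, 0.5, 0.6))
```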
2.2 Optimization and validation
Take the surname “Smith”, for example. Applying the algorithm to all papers with that surname we get a set of clusters. We can assume that in each cluster the initial that appears on most papers is the “correct” initial, and all other initials are likely errors. For example in the cluster where “J” is the most frequent initial for “Smith” the precision can be estimated as the number of papers with the initial “J” divided by the overall number of papers in the cluster. Not all papers with “J” may correspond to the same person (“Jason” versus “John”), but in the absence of an absolute gold standard this serves as a proxy.
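The initials-based precision estimate described above amounts to a majority count. A minimal sketch, with a hypothetical function name and a made-up cluster:

```python
from collections import Counter


def initial_precision(initials):
    """Fraction of papers in a cluster that carry its most frequent first initial."""
    if not initials:
        raise ValueError("cluster must contain at least one paper")
    counts = Counter(initials)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(initials)


# Ten papers in one "Smith" cluster: eight with initial J, two with other initials.
cluster = ["J"] * 8 + ["R", "M"]
print(initial_precision(cluster))
```

As the text notes, this is only a proxy: two different researchers who share the initial “J” would not lower the estimate.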
This is the recall value for a specific GSP (researcher α). It corresponds to the percentage of papers in the given profile (that we managed to cross-reference to WoS) that are also in the algorithm-generated cluster which contains most papers of that profile.
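The recall for a single profile, as described above, can be sketched as follows. The function name and the example paper IDs are illustrative assumptions; `profile_papers` stands for the papers of one Google Scholar profile that were cross-referenced to WoS.

```python
def profile_recall(profile_papers, clusters):
    """Share of a profile's papers found in the one cluster that holds most of them."""
    profile = set(profile_papers)
    if not profile:
        raise ValueError("profile must contain at least one paper")
    best = max((len(profile & set(c)) for c in clusters), default=0)
    return best / len(profile)


profile = [1, 2, 3, 4, 5]
clusters = [{1, 2, 3, 9}, {4}, {5, 6}]
print(profile_recall(profile, clusters))
```

Paper 9 in the first cluster does not affect recall; papers that do not belong in a cluster are captured by the precision measure instead.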
With the objective of producing the highest quality h-index estimates, this measure seamlessly replaces the typical recall measure as a way to evaluate the completeness of clusters. Thus we use it for our optimization and validation procedure instead of Eq. (4). However, it is necessary we make clear that in using this h-index centric measure the resulting disambiguation is optimized with regards to reproducing h-index distribution, but may not be optimal with regards to other criteria. Indeed if a reader were to apply our algorithm, or one like it, with a different goal in mind we advise them to adapt the recall measure to their specific goal.
With about 47 million papers (for the analyzed period from 1900 to 2011), 141 million co-author entries, and 526 million citations referring to other articles within the database, the WoS is one of the largest available metadata collections of scientific articles and thus needs to be processed efficiently. While we concentrated on a few features (co-authors and citation graph), our framework can be extended to further metadata as well. We also do not make use of the full citation and co-author network when evaluating a single paper, in the sense that we do not traverse the graph to another paper node which is not directly connected to the paper in question. As a pre-processing step, we compute all publication similarity terms without applying concrete disambiguation parameters. For the complete WoS, we created 4.75 billion links between pairs of papers that have significant similarity and a common name (surname plus first initial). Computing the publication similarity has a computational complexity of $O(n^2)$, where n is the number of papers of the ambiguous name. To reduce the cost of a single paper pair comparison, all information related to a single name is loaded into memory, and all feature data (mainly integer IDs) are stored in sorted arrays. For papers that have a publication year difference greater than 5, the computation is skipped to decrease the number of comparisons. This process took 11 hours on standard laptop hardware. Disambiguating the 5.6 million author names, i.e. weighting the similarity links and performing the two-step clustering, took less than an hour. For the validation, we kept data for the 500 name networks in memory (consuming less than 4 GB) to test multiple parameter configurations in succession, so that each parameter test (disambiguation and validation of the 500 names) could be executed in about 5 seconds.
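Storing feature IDs in sorted arrays makes it possible to count common elements with a single linear pass, and the year filter prunes pairs before any comparison is made. The sketch below illustrates both ideas; the function names are hypothetical and the code is not the authors’ implementation.

```python
def count_common_sorted(a, b):
    """Count IDs shared by two ascending lists of unique integers (two-pointer merge)."""
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def candidate_pairs(years, max_gap=5):
    """Yield index pairs (i, j), i < j, whose publication years differ by at most `max_gap`."""
    for i in range(len(years)):
        for j in range(i + 1, len(years)):
            if abs(years[i] - years[j]) <= max_gap:
                yield i, j


print(count_common_sorted([1, 3, 5, 7], [3, 4, 5, 8]))
print(list(candidate_pairs([2000, 2004, 2010])))
```

The intersection count runs in time proportional to the combined length of the two arrays and needs no extra memory, which matters when billions of pairs are compared.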
3.1 Optimizing disambiguation parameters
This mean can be artificially small because it is averaged over (mostly) small clusters, which easily achieve high precision. Hence, in the definition of our optimization scheme we introduce a counterbalancing statistical weight that accounts for size, requiring the algorithm to preferentially optimize the large clusters because of the cost incurred if any large cluster’s precision error is high. Relying on basic statistical arguments, the natural weight that we should give the large clusters is the statistical fluctuation scale attributable to size, which is proportional to the square root of the size of the cluster. This weight also compensates for the fact that there are more small clusters than large clusters. In practice, this means that of two clusters of different sizes, the larger one needs a precision error that is smaller by the square root of the size ratio in order to contribute the same to the overall value, which the algorithm must minimize.
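The effect of the square-root weighting can be illustrated with a short sketch. Whether the overall value is a weighted mean or a weighted sum is not recoverable from the text; the weighted mean used here, like the function name and the example numbers, is an assumption.

```python
import math


def weighted_precision_error(clusters):
    """Mean precision error with each cluster weighted by sqrt(size).

    `clusters` is a list of (size, precision) pairs.
    """
    weights = [math.sqrt(size) for size, _ in clusters]
    errors = [1.0 - precision for _, precision in clusters]
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)


# A cluster 25 times larger needs an error 5 times smaller to contribute equally:
small = (4, 0.90)     # weight 2,  error 0.10 -> contribution 0.2
large = (100, 0.98)   # weight 10, error 0.02 -> contribution 0.2
print(weighted_precision_error([small, large]))
```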
Figure 3(b) shows how much the individual features (terms of Eq. (1)) contribute to the optimal solution. We fitted curves to the best results of a random sampling for a varying error trade-off, when only certain features are used (i.e. the parameters of the other features are set to 0). Individual features cannot reach low error rates on their own. Combining features of the co-author and citation graph works best. Including more features, like affiliations or topical features extracted from titles, summaries or keyword lists, could potentially further improve the solution.
3.2 Further validation
We further evaluated the performance of our disambiguation method with four additional tests using different data or techniques. While each measures recall or precision, these performance indicators have different definitions and deviate here from our previous validation, but fit better with measures typically reported in past disambiguation work.
We performed a manual disambiguation validation similar to the one used in earlier work. 100 publication pairs were randomly chosen from all pairs of publications that our algorithm co-clustered. Another 100 random pairs were selected from the set in which each pair belongs to the same name but was placed in different clusters. Students were asked to determine, for an author name and a given pair of publications, whether they were written by the same author or by different authors. When uncertain, the student could choose “Not sure”. Although all resources could be used, this is often a challenging task, and especially voting for “Different authors” frequently required evidence beyond what was easily available. From 138 answers, we obtained 111 “Same authors”, of which 94 were in the same clusters (a recall of about 84.7%), and 27 “Different authors”, all of which were correctly disambiguated to different clusters (a precision of 100%). We point out that a manual disambiguation may be biased towards easy cases that could receive a confident answer; however, it does provide further evidence of the suitability of our algorithm.
Another test for precision can be constructed from second initials metadata which we do not consider for our disambiguation algorithm (only first initials when clustering the whole WoS). Indeed, about 4.7 million clusters contain at least two second initial names. Here, for each cluster the most common second initial forms the set of correctly disambiguated publications (names that omit the second initial were ignored). We measure a mean precision of about 95.4%.
As a third way to evaluate precision, we “artificially” generated ground truth data by merging the sets of publications of two random names and then clustering them. The idea is that while we cannot say anything about the correctness of the resulting clusters for one name, we can definitely show that the clustering is wrong when a cluster is generated from publications of both names. About 3,000 name pairs led to 26,887 clusters, of which 18 contained both names.
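The merged-names test reduces to checking each cluster for papers from more than one source name. The sketch below uses a hypothetical `name_of` mapping from paper ID to original name, and then evaluates the reported counts, which correspond to roughly 0.07% of clusters mixing both names.

```python
def mixed_clusters(clusters, name_of):
    """Return the clusters whose papers come from more than one original name."""
    return [c for c in clusters if len({name_of[p] for p in c}) > 1]


# Toy example: papers 1-3 belong to one name, papers 4-5 to another.
name_of = {1: "Smith, J", 2: "Smith, J", 3: "Smith, J", 4: "Lee, K", 5: "Lee, K"}
clusters = [{1, 2}, {3, 4}, {5}]
print(mixed_clusters(clusters, name_of))

# Share of mixed clusters for the counts reported in the text.
mixed_share = 18 / 26887
print(f"{mixed_share:.4%}")
```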
Our final additional validation is an estimate of recall, again for the whole disambiguated WoS. We evaluated about 870,000 arXiv.org publications, their metadata and full texts. More than half of the PDFs contained one or more email addresses. When two publications share both an email address and an author name, we assume that they refer to the same unique researcher. Both arXiv and WoS provide DOIs for newer publications (starting around the year 2000), so cross-referencing was not an issue. We generated 110,011 “email” clusters, i.e. sets of publications that we expected our disambiguation to place in a single cluster. The mean recall was 98.1%.
3.3 Empirical h-index distribution and theoretical model
The empirical distribution is a mixture of h-indices of scientists with varying discipline citation rates and varying longevity within mixed age-cohort groups. Hence, it may at first be difficult to interpret the mean value as a representative measure for a typical scientist, since a typical scientist should be conditioned on career age and disciplinary factors. Nevertheless, in this section we develop a simple mixing model that predicts the expected frequencies of h, hence providing insight into several underlying features of the empirical “productivity” distribution.
The number of individuals of “career age” t in the aggregate data sample is given by an exponential distribution. We note that in this large-scale analysis we have not controlled for censoring bias, since a large number of the careers analyzed are not complete, and so the empirical data likely overrepresent the number of careers with relatively small t.
The h-index growth factor g is the characteristic annual change in h of a given scientist, and is distributed according to an exponential distribution. The quantity g captures unaccounted factors such as the author-specific citation rate (due to research quality, reputation, and various other career factors), as well as the variation in citation and publication rates across disciplines. For the sake of simplicity, we assume that g is uncorrelated with t.
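Under these assumptions, with $h = g\,t$, the distribution of h is $P(h) = \frac{2}{\lambda} K_0\!\left(2\sqrt{h/\lambda}\right)$, where $K_0$ is the modified Bessel function of the second kind and $\lambda$ is the product of the mean career age and the mean growth factor. The probability density function has mean $\langle h \rangle = \lambda$, standard deviation $\sigma_h = \sqrt{3}\,\lambda$, and asymptotic behavior $P(h) \sim (h/\lambda)^{-1/4} \exp\!\left(-2\sqrt{h/\lambda}\right)$ for $h \gg \lambda$.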
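The moments of such a mixing model can be checked numerically. The sketch below assumes that h is the product of an exponentially distributed career age t and an independent, exponentially distributed growth factor g; in that case the mean of h is the product of the two means, and the standard deviation is √3 times that product. The parameter values are made up for illustration.

```python
import math
import random


def simulate_h(mean_age, mean_growth, n, seed=0):
    """Draw n values of h = g * t with t and g independent exponential variables."""
    rng = random.Random(seed)
    # random.expovariate takes the rate, i.e. 1 / mean.
    return [rng.expovariate(1.0 / mean_age) * rng.expovariate(1.0 / mean_growth)
            for _ in range(n)]


def mean_and_std(xs):
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, math.sqrt(var)


samples = simulate_h(mean_age=10.0, mean_growth=0.5, n=200_000)
m, s = mean_and_std(samples)
lam = 10.0 * 0.5  # product of the two means
print(m / lam, s / (math.sqrt(3) * lam))  # both ratios should be close to 1
```

The factor √3 follows from the second moment of an exponential variable being twice its squared mean, so the second moment of the product is four times the squared product of the means.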
Figure 5(a) shows the empirical distribution of h for four datasets, restricted to clusters that have at least two cited papers which satisfy our similarity threshold with at least one other paper. Surprisingly, each is well fit by the theoretical model with a varying parameter. The parameter value was calculated for each binned distribution using a least-squares method, yielding (Rare), 1.90 (Rare-Clustered), 5.13 (All), and 3.49 (All-Clustered). The inset demonstrates data collapse for all four distributions, which follows from the universal scaling form of the model.
How do these findings compare with general intuition? Our empirical finding significantly deviates from the prediction that follows from combining two results: Lotka’s productivity law, which states that the number n of publications follows a Pareto power-law distribution, and the recent observation that the h-index scales as a power of n. Together these imply a power-law distribution of h.
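The step from a productivity law to a distribution of h is a change of variables. As a sketch, assume $P(n) \sim n^{-\alpha}$ and $h \sim n^{\beta}$, where the exponents $\alpha > 1$ and $\beta > 0$ are placeholders introduced here and not values taken from the text. Then $n \sim h^{1/\beta}$ and

```latex
P(h) = P\big(n(h)\big)\,\left|\frac{dn}{dh}\right|
     \sim h^{-\alpha/\beta}\; h^{1/\beta - 1}
     = h^{-\left(1 + \frac{\alpha - 1}{\beta}\right)},
\qquad
P(\geq h) \sim h^{-\frac{\alpha - 1}{\beta}} .
```

For instance, with the classical Lotka exponent $\alpha = 2$ and an assumed square-root scaling $\beta = 1/2$, this would give $P(h) \sim h^{-3}$ and a complementary cumulative distribution $P(\geq h) \sim h^{-2}$.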
Figure 5(b) compares the complementary cumulative distribution of the empirical data (representing the 6,498,286 clusters included in the analysis, identified by applying the disambiguation algorithm to the entire WoS dataset) with that of the theoretical Pareto distribution. There is a crossover between the two curves near the 99.9th percentile, which indicates that beyond the crossover we observe significantly fewer clusters with a given h value than predicted by Lotka’s productivity law. For example, the Lotka law predicts a 100-fold increase in the number of scientific profiles with h larger than the value observed at the 1 per million frequency. This discrepancy likely reflects the finite productivity lifecycle of scientific careers, which is not accounted for in models predicting scale-free Pareto distributions.
So how do these empirical results improve our understanding of how the h-index should be used? We show that the sampling bias encountered in small-scale studies , and even large-scale studies , significantly discounts the frequency of careers with relatively small h. We observe a monotonically decreasing with a heavy tail, e.g. only 10% of the clusters also have . This means that the h-index is a noisy comparative metric when h is small since a difference can cause an extremely large change in any ranking between scientists in a realistic academic ranking scenario. Furthermore, our model suggests that disentangling the net h-index from its time dependent and discipline dependent factors leads to a more fundamental question: controlling for age and disciplinary factors, what is the distribution of g? Does the distribution of g vary dramatically across age and disciplinary cohorts? This could provide better insight into the interplay between impact, career length  and the survival probability of academics , .
The goal of this work was to disambiguate all author names in the WoS database. First, we found that existing methods relied on metadata that are not available or not complete in WoS, or were not specifically developed for application to such a huge database. Second, we needed a test dataset that is not limited to certain research fields or geographical regions and is large enough to be representative of WoS. As previous work had shown that even under less demanding conditions perfect disambiguation is not achievable, we concentrated on the most influential work, aiming to correctly disambiguate the papers that are most cited.
We achieved our goal by disambiguating author names based on the citation graph, which is the main feature of WoS. This approach exploits the fact that, on average, there is much more similarity between two publications written by the same author than between two random publications from different authors who happen to have the same name. We maximized the separation between these two classes, which can be seen as wanted and unwanted links in a publication network that connects papers written by the same unique researcher. Counting shared outgoing references and incoming citations is a much more fine-grained disambiguation criterion than, for example, journal or affiliation entries. Our disambiguation method does not assume any specific feature distribution, but is parameterized and trainable according to a suitable “gold standard”. It turns out that Google Scholar author profiles, one of the emerging collections of user-editable publication lists, can reasonably serve as such a standard.
Our proposed method consists of three main components that could be altered or improved while still keeping the same validation framework: the error measure, the similarity measure and the clustering algorithm. The error measure we presented was specifically developed for reproducing h-indices; we believe other goals could be accomplished as well. The similarity measure could be easily extended by further metadata. Furthermore, our clustering algorithm, while intuitive and computationally efficient, could potentially be replaced by some more sophisticated community detection.
Comparing our results with previous work is difficult, as there is no common benchmark available. There are several studies that analyze small subsets of author names, which is certainly useful for understanding the mechanisms of the respective algorithms and is sometimes unavoidable for lack of a massive test dataset. We realized, however, that this does not allow for generalization across disciplines, time, career age, and varying metadata availability. We also point out that there are differences in the error reporting, mainly in the way the mean of errors is calculated. The vast majority of authors have only one or two publications, making it likely that the reported low error rates for precision and recall are underestimates. Some publications report error rates lower than 1-2%. We do not claim such an excellent result, since even our gold standard (cross-referenced publications from Google Scholar profiles, and name initials from WoS) cannot be assumed to have error rates significantly better than that. We have shown instead that, using author and citation graph information only, we can disambiguate huge databases in a computationally efficient way while remaining flexible regarding the objectives one would like to optimize for.
- Smalheiser NR, Torvik VI: Author name disambiguation. Annu Rev Inf Sci Technol 2009, 43(1):1–43. doi:10.1002/aris.2009.1440430113
- Ferreira AA, Gonçalves MA, Laender AH: A brief survey of automatic methods for author name disambiguation. SIGMOD Rec 2012, 41(2):15–26. doi:10.1145/2350036.2350040
- Mazloumian A, Helbing D, Lozano S, Light RP, Börner K: Global multi-level analysis of the ‘scientific food web’. Sci Rep 2013, 3. doi:10.1038/srep01167
- Radicchi F: In science “there is no bad publicity”: papers criticized in comments have high scientific impact. Sci Rep 2012, 2. doi:10.1038/srep00815
- Larivière V, Ni C, Gingras Y, Cronin B, Sugimoto CR: Global gender disparities in science. Nature 2013, 504(7479):211–213. doi:10.1038/504211a
- Mazloumian A, Eom Y-H, Helbing D, Lozano S, Fortunato S: How citation boosts promote scientific paradigm shifts and Nobel prizes. PLoS ONE 2011, 6(5). doi:10.1371/journal.pone.0018975
- Petersen AM, Fortunato S, Pan RK, Kaski K, Penner O, Rungi A, Riccaboni M, Stanley HE, Pammolli F: Reputation and impact in academic careers. Proc Natl Acad Sci USA 2014.
- Fleming L, Sorenson O: Science as a map in technological search. Strateg Manag J 2004, 25(8–9):909–928. doi:10.1002/smj.384
- Fleming L, Mingo S, Chen D: Collaborative brokerage, generative creativity, and creative success. Adm Sci Q 2007, 52(3):443–475.
- Acuna DE, Allesina S, Kording KP: Predicting scientific success. Nature 2012, 489(7415):201–202. doi:10.1038/489201a
- Mazloumian A: Predicting scholars’ scientific impact. PLoS ONE 2012, 7(11). doi:10.1371/journal.pone.0049246
- Penner O, Petersen AM, Pan RK, Fortunato S: Commentary: the case for caution in predicting scientists’ future impact. Phys Today 2013, 66(4):8–9. doi:10.1063/PT.3.1928
- Penner O, Pan RK, Petersen AM, Fortunato S: On the predictability of future impact in science. Sci Rep 2013, 3. doi:10.1038/srep03052
- Acuna DE, Penner O, Orton CG: Point/counterpoint: the future h-index is an excellent way to predict scientists’ future impact. Med Phys 2013, 40(11). doi:10.1118/1.4816659
- ORCID (2013) Open researcher and contributor ID. Accessed 12 Aug 2013, [www.orcid.org]
- VIVO (2013) VIVO. Accessed 12 Aug 2013, [www.vivoweb.org]
- Torvik VI, Weeber M, Swanson DR, Smalheiser NR: A probabilistic similarity metric for medline records: a model for author name disambiguation. J Am Soc Inf Sci Technol 2005, 56(2):140–158. doi:10.1002/asi.20105
- Torvik VI, Smalheiser NR: Author name disambiguation in medline. ACM Trans Knowl Discov Data 2009, 3(3). doi:10.1145/1552303.1552304
- Levin M, Krawczyk S, Bethard S, Jurafsky D: Citation-based bootstrapping for large-scale author disambiguation. J Am Soc Inf Sci Technol 2012, 63(5):1030–1047. doi:10.1002/asi.22621
- Tang L, Walsh JP: Bibliometric fingerprints: name disambiguation based on approximate structure equivalence of cognitive maps. Scientometrics 2010, 84:763–784. doi:10.1007/s11192-010-0196-6
- D’Angelo CA, Giuffrida C, Abramo G: A heuristic approach to author name disambiguation in bibliometrics databases for large-scale research assessments. J Am Soc Inf Sci Technol 2011, 62(2):257–269. doi:10.1002/asi.21460
- Reijnhoudt L, Costas R, Noyons E, Boerner K, Scharnhorst A: “seed + expand”: a validated methodology for creating high quality publication oeuvres of individual researchers. Proceedings of ISSI 2013 – 14th international society of scientometrics and informetrics conference 2013. e-print at arXiv.org [arXiv:1301.5177]
- ANVUR (2013) National Agency for the Evaluation of Universities and Research Institutes (Italy). Accessed 17 Sep 2014, [http://www.anvur.org/attachments/article/253/normalizzazione_indicatori_0.pdf]
- Lotka AJ: The frequency distribution of scientific productivity. J Wash Acad Sci 1926, 16(12):317–323.
- Radicchi F, Castellano C: Analysis of bibliometric indicators for individual scholars in a large data set. Scientometrics 2013, 97:627–637. doi:10.1007/s11192-013-1027-3
- Hellsten I, Lambiotte R, Scharnhorst A, Ausloos M: Self-citations, co-authorships and keywords: a new approach to scientists’ field mobility? Scientometrics 2007, 72(3):469–486. doi:10.1007/s11192-007-1680-5
- Salton G: Automatic information organization and retrieval. 1968.
- Hirsch J: An index to quantify an individual’s scientific research output. Proc Natl Acad Sci USA 2005, 102:16569–16572. doi:10.1073/pnas.0507655102
- Petersen AM, Jung W-S, Yang J-S, Stanley HE: Quantitative and empirical demonstration of the Matthew effect in a study of career longevity. Proc Natl Acad Sci USA 2011, 108(1):18–23. doi:10.1073/pnas.1016733108
- Kaminski D, Geisler C: Survival analysis of faculty retention in science and engineering by gender. Science 2012, 335:864–866. doi:10.1126/science.1214844
- Petersen AM, Riccaboni M, Stanley HE, Pammolli F: Persistence and uncertainty in the academic career. Proc Natl Acad Sci USA 2012, 109:5213–5218. doi:10.1073/pnas.1121429109
This article is published under license to BioMed Central Ltd. Open Access: This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.