3.1 Optimizing disambiguation parameters
For the seven model parameters (\alpha_A, \alpha_S, \alpha_R, \alpha_C, \beta_2, \beta_3, \beta_4; \beta_1 is fixed to 1), we want to find a configuration that simultaneously minimizes the mean h-index error and the mean precision error:
R^{h}_{\mathrm{error}} = \langle |1 - R^{h}| \rangle, \qquad P_{\mathrm{error}} = \langle (1 - P)\sqrt{K} \rangle.
(6)
This mean P_{\mathrm{error}} can be artificially small because it is averaged over (mostly) small clusters, which easily achieve high precision. Hence, in our optimization scheme we introduce a counterbalancing statistical weight that accounts for size, requiring the algorithm to preferentially optimize the large clusters: a high precision error 1 - P is costliest for a large cluster. Relying on basic statistical arguments, the natural weight to give the large clusters is the statistical fluctuation scale attributable to size, which is proportional to the square root of the cluster size. This weight also compensates for the fact that small clusters greatly outnumber large ones. In practice, this means that for two clusters of different sizes K_+ = f K_- (with f > 1), the larger cluster of size K_+ needs a precision error of (1 - P_-)/\sqrt{f} in order to contribute the same amount to the overall P_{\mathrm{error}} minimized by the algorithm.
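This weighting can be made concrete with a minimal numerical sketch; the per-cluster precision/size pairs below are hypothetical, and `weighted_precision_error` is an illustrative helper, not part of the paper's implementation.

```python
import math

def weighted_precision_error(clusters):
    """Mean precision error with a sqrt-of-size statistical weight.

    `clusters` is a list of (P, K) pairs: per-cluster precision P and
    cluster size K (hypothetical inputs, not the paper's data).
    """
    return sum((1.0 - P) * math.sqrt(K) for P, K in clusters) / len(clusters)

# A cluster that is f = 4 times larger contributes the same amount
# once its precision error is reduced by a factor sqrt(f):
small = (0.90, 100)                      # 1 - P = 0.10, sqrt(K) = 10
f = 4
large = (1 - 0.10 / math.sqrt(f), 400)   # 1 - P = 0.05, sqrt(K) = 20
```

Both example clusters contribute the same weighted error, illustrating the trade-off described above.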
Due to the simplicity of our algorithm, we can conduct an extensive sampling of the whole parameter space. The results in Figure 3(a) show a clear trade-off between the two types of errors, and a lower limit that can be reached by our implementation. Our test data consist of 3,000 surnames randomly selected from WoS for which at least one profile could be found on Google Scholar. To further improve the result, we performed an iterative local search on a 7-dimensional sphere around the best previous parameter configurations, starting from the best results of the random parameter sampling. For efficiency and for cross-validation, we drew four random subsets of 500 surnames each and optimized them individually. In Figure 3(a), we aim for an error that weights h-index and precision correctness equally. We find
\begin{array}{c}
\alpha_A = 0.54, \quad \alpha_S = 0.75, \quad \alpha_R = 0.19, \quad \alpha_C = 1.02, \\
\beta_2 = 0.19, \quad \beta_3 = 0.011, \quad \beta_4 = 0.49,
\end{array}
which leads to a precision error of 11.84% and an h-index error of 12.63%. Co-authorship (\alpha_A) emerges as a strong indicator for disambiguation, although co-author names are not disambiguated beforehand and hence represent a potential source of errors. Self-citations (\alpha_S) are also highly weighted, but a self-citation link alone is not sufficient to exceed the threshold \beta_1 = 1 required to form clusters.
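The two-stage optimization described above (random parameter sampling, then an iterative local search on a sphere around the best configuration found so far) can be sketched as follows. Here `error_fn`, the step `radius`, and the iteration count are illustrative placeholders, not the paper's actual procedure or values.

```python
import random
import math

def random_unit_vector(dim):
    """Uniform random direction on a (dim-1)-sphere via normalized Gaussians."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def local_search(error_fn, best_params, radius=0.05, iterations=200):
    """Iterative local search: repeatedly sample a point on a sphere of
    given radius around the best configuration and keep any improvement
    (a schematic sketch, not the paper's exact implementation)."""
    best_err = error_fn(best_params)
    for _ in range(iterations):
        step = random_unit_vector(len(best_params))
        candidate = [p + radius * s for p, s in zip(best_params, step)]
        err = error_fn(candidate)
        if err < best_err:
            best_params, best_err = candidate, err
    return best_params, best_err
```

In the paper's setting, `error_fn` would evaluate the combined h-index and precision error of a 7-parameter configuration on one of the 500-surname subsets.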
Figure 3(b) shows how much the individual features (terms of Eq. (1)) contribute to the optimal solution. We fitted curves to the best results of a random sampling for a varying error trade-off when only certain features are used (i.e., the parameters of the other features are set to 0). Individual features cannot reach low error rates on their own; combining features of the co-author and citation graphs works best. Including more features, such as affiliations or topical features extracted from titles, abstracts, or keyword lists, could potentially improve the solution further.
Size-dependent biases can skew aggregate measures of algorithm performance, especially when there is broad underlying heterogeneity in the data. Hence, stating mean error rates is not sufficient to fully understand the strengths and weaknesses of a disambiguation algorithm. Figure 4(a) shows that our algorithm works better for larger profiles, i.e., researchers with a higher h-index. This is not surprising, since much more co-author and citation graph information is available for them than for people with only a few papers. On the other hand, precision slowly decreases for more common names, see Figure 4(b), which becomes an issue when disambiguating very large databases, where certain combinations of surname plus first initial can yield initially undisambiguated clusters of around ten thousand publications.
3.2 Further validation
We further evaluated the performance of our disambiguation method with four additional tests using different data or techniques. While each measures recall or precision, these performance indicators are defined differently here and deviate from our previous validation, but fit better with measures typically reported in past disambiguation work.
We performed a manual disambiguation validation similar to the one in [18]. 100 publication pairs were randomly chosen from all pairs of publications that our algorithm co-clustered. Another 100 random pairs were selected from the set in which both publications of a pair carry the same name but were placed in different clusters. Students were asked to determine, for an author name and a given pair of publications, whether they were written by the same author or by different authors. When uncertain, the student could choose “Not sure”. Although all resources could be used, this is often a challenging task; in particular, voting for “Different authors” frequently required evidence beyond what was easily available. From 138 answers, we obtained 111 “Same authors”, of which 94 pairs were in the same cluster (a recall of about 84.7%), and 27 “Different authors”, all of which were correctly disambiguated into different clusters (a precision of 100%). We point out that a manual disambiguation may be biased towards easy cases that receive a confident answer; however, it does provide further evidence of the suitability of our algorithm.
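The reported recall and precision follow directly from the vote counts; a small worked computation:

```python
# Vote counts from the manual validation described above
same_author_votes = 111   # "Same authors" answers
same_cluster = 94         # of those, pairs the algorithm co-clustered
diff_author_votes = 27    # "Different authors" answers
diff_cluster = 27         # of those, pairs placed in different clusters

recall = same_cluster / same_author_votes     # about 0.847
precision = diff_cluster / diff_author_votes  # 1.0
```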
Another test for precision can be constructed from second-initial metadata, which our disambiguation algorithm does not use (only first initials are considered when clustering the whole WoS). About 4.7 million clusters contain at least two names with second initials. For each such cluster, the most common second initial defines the set of correctly disambiguated publications (names that omit the second initial were ignored). We measure a mean precision of about 95.4%.
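A sketch of this per-cluster precision proxy, assuming each cluster is reduced to the list of second initials observed among its names (the function name and data layout are illustrative):

```python
from collections import Counter

def second_initial_precision(cluster_initials):
    """Precision proxy from second initials: the most common second
    initial in a cluster defines the 'correct' publications.

    `cluster_initials` lists the second initials observed in one cluster;
    names without a second initial are assumed already excluded.
    """
    counts = Counter(cluster_initials)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(cluster_initials)
```

Averaging this quantity over all qualifying clusters yields the mean precision reported above.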
As a third way to evaluate precision, we “artificially” generated ground truth data by merging the sets of publications of two random names and then clustering them. The idea is that while we cannot say anything about the correctness of the resulting clusters for a single name, we can definitely show that the clustering is wrong when a cluster is generated from publications of both names. About 3,000 name pairs led to 26,887 clusters, of which 18 contained both names.
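The corresponding error measure is simply the fraction of clusters that mix publications of both merged names; a minimal sketch with an illustrative data layout:

```python
def contaminated_fraction(clusters):
    """Fraction of clusters mixing publications of both merged names.

    Each cluster is represented as the list of name labels attached to
    its publications (an illustrative structure, not the paper's
    internal format).
    """
    mixed = sum(1 for names in clusters if len(set(names)) > 1)
    return mixed / len(clusters)

# The paper reports 18 mixed clusters out of 26,887, i.e. about 0.07%.
```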
Our final additional validation is an estimate of recall, again for the whole disambiguated WoS. We evaluated about 870,000 arXiv.org publications, their metadata, and full texts. More than half of all publications contained one or more e-mail addresses in their PDFs. An e-mail address is assumed to be a good indicator that, when two publications also share an author name, they refer to the same unique researcher. Both arXiv and WoS provide DOIs for newer publications (starting around the year 2000), so cross-referencing was not an issue. We generated 110,011 “email” clusters, i.e., sets of publications that our disambiguation should also place in one cluster each. The mean recall was 98.1%.
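A sketch of one plausible recall computation for this test, assuming each ground-truth “email” cluster is a list of publication ids and the disambiguation result is a publication-to-cluster mapping (both layouts are illustrative):

```python
from collections import Counter

def mean_recall(email_clusters, pub_to_cluster):
    """Mean recall over 'email' ground-truth clusters.

    For each set of publications sharing an author name and an e-mail
    address, the per-cluster recall is taken as the largest fraction
    that ends up in a single disambiguated cluster; `pub_to_cluster`
    maps publication id -> cluster id (hypothetical format).
    """
    recalls = []
    for pubs in email_clusters:
        counts = Counter(pub_to_cluster[p] for p in pubs)
        recalls.append(counts.most_common(1)[0][1] / len(pubs))
    return sum(recalls) / len(recalls)
```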
3.3 Empirical h-index distribution and theoretical model
Using the optimized parameters, we disambiguated the complete WoS database containing about 5.6 million author names with a unique surname plus first initial. While the true h-index distribution is not exactly known, we can compare it to the subset of rare names – names which we assume require little if any disambiguation. We define rare names as surnames for which the whole WoS contains only one type of first initial, and that initial is itself very rare (q, x, z, u, y, o, and w); this results in 87,000 author names. The disambiguation of the rare names tells us that they indeed represent, to a large extent, unique researchers. Unfortunately, for higher h-index values h > 20 (values in the top 3% when excluding clusters with h = 0, 1), the rare surnames are underrepresented with respect to the whole database (see Figure 5 for the comparison between the rare-dataset and full-dataset h-index distributions). However, this difference is consistent with deviations arising from finite-size effects, since the rare dataset is significantly smaller than the entire dataset.
The empirical distribution P(h) is a mixture of h-indices of scientists with varying discipline citation rates and varying longevity, within mixed age-cohort groups. Hence, it may at first be difficult to interpret the mean value \langle h \rangle as a representative measure for a typical scientist, since a typical scientist should be conditioned on career age and disciplinary factors. Nevertheless, in this section we develop a simple mixing model that predicts the expected frequencies of h, thereby providing insight into several underlying features of the empirical “productivity” distribution P(h).
Our h-index distribution model is based on the following basic assumptions:

1. The number of individuals of “career age” t in the aggregate data sample is given by an exponential distribution P_1(t) = \exp[-t/\lambda_1]/\lambda_1. We note that in this large-scale analysis we have not controlled for censoring bias, since a large number of the careers analyzed are not complete, and so the empirical data likely overrepresent the number of careers with relatively small t.

2. The h-index growth factor g_i \approx \langle h_i(t+1) - h_i(t) \rangle is the characteristic annual change in h_i of a given scientist, and is distributed according to an exponential distribution P_2(g) = \exp[-g/\lambda_2]/\lambda_2. The quantity g captures unaccounted factors such as the author-specific citation rate (due to research quality, reputation, and other career factors), as well as the variation in citation and publication rates across disciplines. For the sake of simplicity, we assume that g_i is uncorrelated with t_i.
Hence, the index h_i = g_i t_i of an individual i is simply given by the product of the career age t_i and the growth factor g_i. The aggregate h-index distribution model P_m(h) is derived from the distribution of the product of two random variables, t and g, distributed exponentially as P_1(t; \lambda_1) and P_2(g; \lambda_2), respectively. Since g \ge 0 and t > 0, the distribution P_m(h) is readily calculated as
P_m(h) = \int_0^{\infty} \frac{dx}{x}\, P_1(x)\, P_2(h/x) = \frac{2}{\lambda_1 \lambda_2} K_0\!\left(2\sqrt{h/(\lambda_1\lambda_2)}\right),
where K_0(x) is the modified Bessel function of the second kind. The probability density function P_m(h) has mean \langle h \rangle = \lambda_1\lambda_2, standard deviation \sqrt{3}\,\langle h \rangle, and asymptotic behavior P_m(h) \sim \exp[-2\sqrt{h/\langle h \rangle}]/h^{1/4} for h \gg 1.
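The stated mean and standard deviation of the product distribution can be checked numerically; a Monte Carlo sketch with illustrative parameter values \lambda_1 = 2.0 and \lambda_2 = 2.5 (so \langle h \rangle = 5):

```python
import random
import math

def sample_h(lam1, lam2, n, seed=42):
    """Monte Carlo sample of h = g * t with t ~ Exp(mean lam1) and
    g ~ Exp(mean lam2); illustrates the product-distribution model
    (a sketch, not the paper's fitting procedure)."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / lam1) * rng.expovariate(1.0 / lam2)
            for _ in range(n)]

hs = sample_h(2.0, 2.5, 200_000)
mean_h = sum(hs) / len(hs)   # should approach lam1 * lam2 = 5
var_h = sum((h - mean_h) ** 2 for h in hs) / len(hs)
std_h = math.sqrt(var_h)     # should approach sqrt(3) * 5, about 8.66
```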
Figure 5(a) shows the empirical distribution P(h) for four datasets, analyzing only clusters with h \ge 2 in order to focus on clusters that have at least two cited papers satisfying our similarity threshold with at least one other paper. Surprisingly, each P(h) is well fit by the theoretical model P_m(h; \lambda_1\lambda_2) with a varying \lambda_1\lambda_2 parameter. The \lambda_1\lambda_2 value was calculated for each binned P(h) using a least-squares method, yielding \lambda_1\lambda_2 = 2.09 (Rare), 1.90 (Rare-Clustered), 5.13 (All), and 3.49 (All-Clustered). The inset demonstrates data collapse for all four P(h/\langle h \rangle) distributions, following from the universal scaling form of K_0(x).
How do these findings compare with general intuition? Our empirical finding deviates significantly from the prediction obtained by combining Lotka’s productivity law [24], which states that the number n of publications follows a Pareto power-law distribution P_p(n) \sim n^{-2}, with the recent observation that the h-index scales as h \sim n^{1/2} [25]; together these imply P_p(h) \sim h^{-3} (corresponding to P_p(\ge h) \sim h^{-2}).
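The h^{-3} scaling follows from a simple change of variables; for completeness:

```latex
% With n = h^2 (from h \sim n^{1/2}), conservation of probability gives
P_p(h)\,dh = P_p(n)\,dn
\quad\Longrightarrow\quad
P_p(h) = P_p\!\left(n(h)\right)\frac{dn}{dh}
       \propto (h^2)^{-2} \cdot 2h = 2\,h^{-3},
% and integrating the tail recovers the cumulative form:
P_p(\ge h) = \int_h^{\infty} P_p(h')\,dh' \propto h^{-2}.
```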
Figure 5(b) compares the complementary cumulative distribution P(\ge h) for the empirical data (the 6,498,286 clusters with h \ge 2 identified by applying the disambiguation algorithm to the entire WoS dataset) with the theoretical Pareto distribution P_p(\ge h) = 1/h^2. There is a crossover between the two P(\ge h) curves around h \approx 64 (corresponding to the 99.9th percentile), indicating that for h > 64 we observe significantly fewer clusters with a given h value than predicted by Lotka’s productivity law. For example, Lotka’s law predicts a 100-fold increase in the number of scientific profiles with h larger than the 1-per-million frequency, h \ge 185. This discrepancy likely reflects the finite productivity lifecycle of scientific careers, which is not accounted for in models predicting scale-free Pareto distributions.
So how do these empirical results improve our understanding of how the h-index should be used? We show that the sampling bias encountered in small-scale studies [28], and even large-scale studies [25], significantly discounts the frequency of careers with relatively small h. We observe a monotonically decreasing P(h) with a heavy tail; e.g., only 10% of the clusters have h \ge 10. This means that the h-index is a noisy comparative metric when h is small, since a difference \delta h \sim 1 can cause an extremely large change in any ranking between scientists in a realistic academic ranking scenario. Furthermore, our model suggests that disentangling the net h-index from its time-dependent and discipline-dependent factors leads to a more fundamental question: controlling for age and disciplinary factors, what is the distribution of g? Does the distribution of g vary dramatically across age and disciplinary cohorts? This could provide better insight into the interplay between impact, career length [29], and the survival probability of academics [30], [31].