# Feature analysis of multidisciplinary scientific collaboration patterns based on PNAS

*EPJ Data Science* **volume 7**, Article number: 5 (2018)

## Abstract

Collaboration patterns are often considered to differ from discipline to discipline. Meanwhile, collaboration across disciplines has emerged as a prominent feature of modern scientific research and has given rise to several interdisciplinary fields. We analyze the features of collaboration within and among the biological, physical and social sciences based on 52,803 papers published in the multidisciplinary journal PNAS from 1999 to 2013. In these data, we found similar transitivity and assortativity of collaboration patterns, as well as an identical distribution type for collaborators per author and for papers per author, namely a mixture of generalized Poisson and power-law distributions. In addition, we found that interdisciplinary research is undertaken by a considerable fraction of authors, not just those with many collaborators or those with many papers. This case study provides a window for understanding aspects of multidisciplinary and interdisciplinary collaboration patterns.

## Introduction

The natural and social sciences provide methodical approaches to study, predict and explain natural phenomena and sociality (human behaviors and psychological states) respectively [1]. The specialization of knowledge in these sciences forms various disciplines. Meanwhile, to solve problems whose solutions are beyond the scope of a single discipline, researchers need to integrate data, techniques, concepts, and theories from several disciplines [2–5]. Interactions between disciplines give rise to interdisciplinary fields, blur the boundary between the natural and social sciences, and produce many important scientific breakthroughs [6–8].

Studying collaboration patterns within and across disciplines or sciences helps us understand the diversity of cooperative behaviors and the ways in which knowledge is fused. Papers in multidisciplinary journals provide an informative and reliable basis for such a study, because research in the natural and social sciences is communicated mainly through papers [9–12]. Here we investigated these patterns based on 52,803 papers published in *Proceedings of the National Academy of Sciences* (PNAS) over the years 1999–2013.^{Footnote 1} The content of the dataset spans three science categories: the social sciences and the two principal branches of the natural sciences, viz. the biological and physical sciences.

Collaboration relationships can be expressed by graphs, termed coauthorship networks, so the patterns can be studied from a network perspective. Coauthorship networks from different scientific fields share certain similarities, such as partial transitivity of coauthorship, homophily in the number of collaborators, and a right-skewed distribution of collaborators per author [13–19]. These commonalities also appear in the collaboration networks of the three author sets drawn from the three science categories of PNAS. We looked further into the regularities behind these commonalities and their causes. We found that the distribution of collaborators per author and that of papers per author follow the same distribution type: a mixture of a generalized Poisson distribution and a power-law. We provide a possible explanation for this distribution type and for these commonalities through the diversity of authors’ abilities to attract collaborations.

A range of previous works discussed quantitative indexes of interdisciplinarity for sciences [20–22], for disciplines [23–26], for universities [27], for journals [28, 29], and for research teams [30]. Some works addressed the correlation between interdisciplinarity and scientific impact [31–34] (e.g. the ability to attract citations [35–37]). Following the general ideas of these references, we studied the interdisciplinary activities of PNAS through the co-occurrence of disciplines on papers, and through indexes calculated from that co-occurrence, such as Rao-Stirling diversity [38] and betweenness centrality [39].

We further studied the collaboration patterns across disciplines, and found that a considerable proportion of authors and papers in the physical and social sciences are involved in interdisciplinary research. The multidisciplinary coauthorship network extracted from the data has a giant component, which contains more than 88%, 80% and 71% of the authors in the biological, physical and social sciences respectively. A considerable number of authors contribute to the formation of the giant component. The contributions of author activity and productivity to its formation increase over time. The high extent of interdisciplinarity shown by the case study might not be representative of general collaboration patterns, because authors may submit more interdisciplinary work to multidisciplinary journals than to domain-specific ones.

This report is structured as follows: the data processing is described in Sect. 2; the similarities and interactions are analyzed in Sect. 3; and the discussion and conclusion are drawn in Sect. 4.

## The data

### Reason for using the data

The case study involves two concepts, namely multidisciplinarity (researchers from different disciplines study within their disciplines) and interdisciplinarity (study beyond disciplinary boundaries) [40]. Multidisciplinarity can be viewed as a combination of disciplines, and interdisciplinarity as a merging of them. A multidisciplinary journal whose scope covers the natural and social sciences can be used to analyze the interactions between science categories. Such a journal can also be used to compare the collaboration patterns of multiple disciplines and find similarities. PNAS publishes high-quality research papers and provides reliable discipline information for them. The journal also provides a high-quality data source for analyzing worldwide collaboration patterns, because nearly half of its papers come from authors outside the United States.

The multidisciplinary journals *Science*, *Nature* and *Nature Communications* do not provide discipline information for their papers. *Journal of the Royal Society Interface* focuses on cross-disciplinary research at the interface between the physical and life sciences, but does not cover the social sciences. Our analysis is restricted to PNAS, which limits our findings. For example, research in the social sciences is communicated not only through papers but also through books [11, 12]. Hence the results must be interpreted carefully, as the patterns of researchers who publish papers in the chosen journal. However, given the influence and representativeness of PNAS, the case study can contribute to understanding aspects of multidisciplinary and interdisciplinary collaboration patterns.

### Discipline information

Most papers of the dataset have been classified into three first-class disciplines (biological, physical, and social sciences) and 39 second-class disciplines (Table 1). Interdisciplinary papers are classified into several disciplines. The data contain 43,304 biological papers (including 3957 biophysics papers), which account for 82.01% of the total. The data also contain 5987 physical papers and 1310 social papers. There are 2961 interdisciplinary papers belonging to more than one second-class discipline, which account for 5.61% of the total. The large difference in discipline proportions does not reflect a preference on the part of PNAS. In reality, far more researchers work in the natural sciences (especially the biological sciences) than in the social sciences [41]. There are 1842 papers that are classified only into first-class disciplines. For these papers, we regard the second-class discipline as missing, whereas our previous work [42] regarded it as the same as the first-class discipline. Hence the data in Table 1 differ from those in Reference [42].

Based on the discipline information of papers, we constructed a network to express the relationship between the first-class and the second-class disciplines (Fig. 1), where two disciplines are connected if they are the first-class and the second-class disciplines of a paper. We can also construct a network to express the interactions between the second-class disciplines (Fig. 2), where each node is a discipline and two nodes are connected if there is a paper belonging to them simultaneously. These networks could evolve with the discipline information of newly published papers. So using the latest data, one may have a more comprehensive view.
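The construction of the discipline interaction network can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the study: the function name `discipline_cooccurrence` and the toy paper records are hypothetical, and each paper is assumed to arrive as a list of its second-class disciplines.

```python
from collections import defaultdict
from itertools import combinations


def discipline_cooccurrence(paper_disciplines):
    """Map each discipline to the set of disciplines sharing at least one paper with it."""
    neighbors = defaultdict(set)
    for disciplines in paper_disciplines:
        unique = sorted(set(disciplines))
        for d in unique:
            neighbors[d]  # touch the key so single-discipline papers still create a node
        for a, b in combinations(unique, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)


# Hypothetical toy records, not taken from the PNAS data.
papers = [
    ["Applied mathematics", "Chemistry"],
    ["Applied mathematics", "Anthropology"],
    ["Chemistry"],
    ["Genetics"],
]
network = discipline_cooccurrence(papers)

# Share of papers assigned to more than one discipline.
interdisciplinary_share = sum(len(set(p)) > 1 for p in papers) / len(papers)
```

Because the network is rebuilt from the paper records alone, rerunning the same function on newer records would update it without any other change.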

### Coauthorship

Identifying ground-truth authors, termed disambiguating author names, is an important, time-consuming, but necessary step in coauthorship analysis. Several methods use the information in the names provided on papers (e.g. initial-based methods [43]). The dominant misidentification of initial-based methods comes from merging two or more different authors into one. Such merging deflates the number of unique authors and inflates the size of the giant component relative to the ground truth. Requiring additional information (e.g. email addresses) helps to reduce merging errors, but makes the information harder to collect.
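The merging error of initial-based keys can be illustrated with a small sketch. The names below are invented, the helper names `initial_key` and `merged_groups` are hypothetical, and names are assumed to be written as "Given Surname"; real disambiguation methods such as those in [43] are more elaborate.

```python
from collections import defaultdict


def initial_key(name):
    """Reduce 'Given [Middle] Surname' to 'surname, first initial' (assumed name layout)."""
    parts = name.split()
    return f"{parts[-1].lower()}, {parts[0][0].lower()}"


def merged_groups(names, key):
    """Return the groups of distinct name strings that a key function treats as one author."""
    groups = defaultdict(set)
    for name in names:
        groups[key(name)].add(name)
    return {k: v for k, v in groups.items() if len(v) > 1}


# Invented names for illustration only.
names = ["Wei Zhang", "Wen Zhang", "Maria Lopez", "Wei Zhang"]
by_initial = merged_groups(names, initial_key)  # merges Wei Zhang and Wen Zhang
by_full_name = merged_groups(names, str.lower)  # no distinct strings are merged
```

Full-name matching avoids this particular merge, but it still cannot separate two different people who both publish as "Wei Zhang", which is the name repetition problem.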

In PNAS 1999–2013, 93.1% of authors provide a full first name, so the names provided on papers are used directly to identify authors. However, utilizing the surname and the initial of the first given name generates many merging errors in name disambiguation [44]. The proportion of these authors in the data is 2.9%, and the proportion of these authors further conditioned on publishing more than one paper is 0.3%. Meanwhile, even full names still produce merging errors if some authors provide exactly the same name. Chinese names were found to account for name repetition [44]. We calculated the proportion of names with a given name of fewer than six characters and a surname among the 100 most common Chinese surnames.^{Footnote 2} The proportion of these authors in the data is 2.7%, and that of these authors further conditioned on publishing more than one paper is 1.1%. The small values of these four proportions show that the impact of name repetition is limited. These proportions for specific subsets of the data are listed in Table 2.

The method adopted here splits one author into two or more if the author’s name is not given consistently. Splitting underestimates the giant component size and the indexes used as evidence for the universality of interdisciplinary research. Hence the results in Sects. 3.5 and 3.6 can be regarded as conservative. In addition, the inaccuracy caused by the adopted method does not change the ground-truth distribution type of collaborators per author or that of papers per author [44].

## Data analysis

### Network properties

Coauthorship is an *n*-ary relation, \(n\in\mathbb{Z}^{+}\), hence it can be expressed by a hypergraph, a generalization of a graph in which an edge (termed a hyperedge) can join any number of nodes. We represent authors as nodes, and the author group of each paper (paper team) as a hyperedge. We can then extract a coauthorship network from the hypergraph as a simple graph, in which an edge joins every two nodes in each hyperedge and multiple edges are treated as one. The terms “degree” and “hyperdegree” of a node denote the number of collaborators and the number of papers of the corresponding author respectively.
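The projection from hypergraph to simple graph can be sketched as follows. The function name `degrees_and_hyperdegrees` and the toy paper teams are hypothetical; each hyperedge is assumed to be given as a list of author identifiers.

```python
from collections import defaultdict
from itertools import combinations


def degrees_and_hyperdegrees(paper_teams):
    """Return (degree, hyperdegree) dictionaries for a list of paper teams (hyperedges)."""
    collaborators = defaultdict(set)
    hyperdegree = defaultdict(int)
    for team in paper_teams:
        members = set(team)
        for author in members:
            hyperdegree[author] += 1  # one more paper for this author
            collaborators[author]     # ensure solo authors appear as isolated nodes
        for a, b in combinations(members, 2):
            # Storing neighbors in sets treats multiple edges as one.
            collaborators[a].add(b)
            collaborators[b].add(a)
    degree = {author: len(others) for author, others in collaborators.items()}
    return degree, dict(hyperdegree)


# Hypothetical paper teams.
teams = [["A", "B", "C"], ["A", "B"], ["A", "D"], ["E"]]
degree, hyperdegree = degrees_and_hyperdegrees(teams)
```

In this toy example author A has three papers but also three collaborators, while B has two papers and two collaborators, which shows that the two indexes need not coincide in general.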

The data show that the average paper team sizes of the biological sciences (6.624) and the physical sciences (5.254) are larger than that of the social sciences (4.634). This ordering fits the reality that research teams are usually larger in the natural sciences and smaller in the social sciences [41]. Now let us consider the coauthorship networks of the papers in specific disciplines or science categories. All of these networks are highly clustered and assortative, and their average shortest path lengths scale as the logarithm of their number of nodes (\(\log \mathrm{NN} \approx \mathrm{AP}\) in Table 3). These properties do not mean that all of the networks are small-world. The network of the social sciences is an exception: it has no component containing more than 10% of its authors. However, this does not mean that research in the social sciences is carried out in isolation. In fact, 71.5% of the authors in the social sciences belong to the giant component of the coauthorship network generated from the whole data. Analyzing the collaborations of authors within a single discipline therefore has limitations, so we carried out the analysis over all disciplines together.

### Degree and hyperdegree

We aggregated degree and hyperdegree over the whole data (not restricted to a single science category), and examined the degree and hyperdegree distributions of the three author sets (drawn from the three science categories respectively). Although the collaboration level differs from one science category to another, all of the distributions show a hooked head, a fat tail, and a cross-over between them, which can be viewed as a common feature of coauthorship networks (Fig. 3). The head and the tail can be fitted by a log-normal distribution and a power-law distribution respectively [45].

These distributions can also be fitted, as a whole, by a mixture of a generalized Poisson distribution and a power-law distribution. The fitting parameters are listed in Table 4. We performed a two-sample Kolmogorov-Smirnov (KS) test to compare the distributions of two data vectors: the node indexes (i.e. degrees or hyperdegrees) and samples drawn from the corresponding fitted distribution. The null hypothesis is that the two data vectors come from the same distribution. The *p*-value of each fit shows that the test cannot reject the null hypothesis at the 5% significance level. Note that the \(\chi^{2}\) goodness-of-fit test is not suitable here, because of the small number of large-degree authors.
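The statistic behind the two-sample KS test is the largest gap between two empirical cumulative distribution functions. The sketch below computes it in plain Python and compares it with the standard large-sample 5% critical value \(1.358\sqrt{(n+m)/(nm)}\). The function names and the two toy samples are hypothetical, and the fitted-mixture sampler that would produce the second vector in practice is not shown.

```python
from bisect import bisect_right
from math import sqrt


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # share of sample_a that is <= x
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_critical_5pct(n, m):
    """Large-sample approximation of the critical value at the 5% level."""
    return 1.358 * sqrt((n + m) / (n * m))


# Toy vectors: observed degrees and draws from a fitted distribution (both invented).
observed = [1, 2, 2, 3, 5]
simulated = [1, 2, 3, 3, 4, 8]
d = ks_statistic(observed, simulated)
rejected = d > ks_critical_5pct(len(observed), len(simulated))
```

For these toy vectors the statistic stays below the critical value, so the null hypothesis of a common distribution would not be rejected.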

Regarding authors as samples, a mixture distribution means that the samples come from different populations: the collaboration patterns of authors with few collaborators and papers differ from those of authors with many. Reference [46] gives a possible, discipline-independent explanation for the mixture type that emerges in empirical degree distributions. Following the same general ideas, a similar explanation can be given for hyperdegree distributions, as follows.

Whether a researcher collaborates with another to publish a paper can be regarded as a “yes/no” decision. The hyperdegree of a researcher is then equal to the number of successes in a sequence of decisions made by the candidates who want to coauthor with that researcher. Suppose that the number of those candidates is *n* and that each candidate collaborates with probability *p*. If the decisions are independent, the hyperdegree follows a binomial distribution \(B(n, {p})\), and hence, for large *n* and small *p*, approximately a Poisson distribution with expected value *np* (Poisson limit theorem). The value of *np* varies from author to author, owing to the diversity of authors’ abilities to attract collaborators.
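The Poisson limit can be checked numerically. The sketch below holds the expected value *np* fixed at 2 (an arbitrary illustrative choice) and measures the largest pointwise gap between the binomial and Poisson probability mass functions as *n* grows; the function names are hypothetical.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Probability of k successes in n independent yes/no decisions."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, mean):
    """Poisson probability of k events with the given expected value."""
    return exp(-mean) * mean**k / factorial(k)


def max_pmf_gap(n, mean, k_max=30):
    """Largest pointwise gap between B(n, mean/n) and Poisson(mean) for k = 0..k_max."""
    p = mean / n
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, mean)) for k in range(k_max + 1)
    )


# The gap shrinks as the number of candidates n grows with np held at 2.
gaps = [max_pmf_gap(n, mean=2.0) for n in (10, 100, 1000)]
```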

Decisions of authors could be dependent. For example, collaborating with researchers who have publishing experience helps to publish a paper. Hence we can regard hyperdegree as a random variable following a generalized Poisson distribution (which allows the occurrence probability of an event to involve memory [47]). In empirical data, most hyperdegrees are around their mode. Hence we can suppose that they follow generalized Poisson distributions with expected values around their mode, and so form the generalized Poisson part of a hyperdegree distribution. A few authors go through a cumulative process of publishing papers, which skews the hyperdegree distribution to the right and forms a fat tail.
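The text does not state the formula of the generalized Poisson distribution. As an illustration only, the sketch below assumes the common Consul-Jain parameterization, \(P(X=k)=\theta(\theta+k\lambda)^{k-1}e^{-\theta-k\lambda}/k!\) with \(\theta>0\) and \(0\le\lambda<1\), whose mean is \(\theta/(1-\lambda)\); setting \(\lambda=0\) recovers the ordinary Poisson distribution. The parameter values are arbitrary and are not the fitted values of Table 4.

```python
from math import exp, lgamma, log


def generalized_poisson_pmf(k, theta, lam):
    """Consul-Jain generalized Poisson pmf (assumed form), evaluated in log space."""
    rate = theta + k * lam
    return exp(log(theta) + (k - 1) * log(rate) - rate - lgamma(k + 1))


# Arbitrary illustrative parameters: theta > 0 and 0 <= lam < 1.
theta, lam = 2.0, 0.3
support = range(400)  # the tail beyond this point is negligible for these parameters
total = sum(generalized_poisson_pmf(k, theta, lam) for k in support)
mean = sum(k * generalized_poisson_pmf(k, theta, lam) for k in support)
```

With a positive \(\lambda\) the mean exceeds \(\theta\), which is one way to read the remark that earlier events make later ones more likely.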

### Transitivity of coauthorship

Transitivity in society means that “the friend of my friend is also my friend”, which is a typical feature of social affiliation networks. In academic society, the collaborators of an author are likely to become acquainted and hence to coauthor with each other. For example, organizational and institutional contexts drive the formation of transitive coauthorship, and so contribute to the clustering structures that emerge in coauthorship networks.

The transitivity of a network can be quantified by two indexes in graph theory, namely global clustering coefficient (the fraction of connected triples of nodes which also form “triangles”) and local clustering coefficient (the probability of a node’s two neighbors connecting). High transitivity is a common feature of coauthorship networks [13].
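Both indexes can be computed directly from an adjacency structure, as in the following sketch. The toy graph is hypothetical, and assigning a local coefficient of 0 to nodes with fewer than two neighbors is a convention adopted here, not something stated in the text.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of a node's neighbors that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here for nodes with fewer than two neighbors
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)


def global_clustering(adj):
    """Fraction of connected triples (paths of length two) that close into triangles."""
    closed = 0
    triples = 0
    for neighbors in adj.values():
        k = len(neighbors)
        triples += k * (k - 1) // 2
        closed += sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return closed / triples if triples else 0.0


# Hypothetical coauthorship network: a triangle A-B-C plus a pendant author D.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
local = {node: local_clustering(adj, node) for node in adj}
overall = global_clustering(adj)
```

In this toy graph the highest-degree author A has the lowest nonzero local coefficient, because the denominator grows quadratically with degree.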

To what extent is transitivity due to the activity of authors in academic society? The activity can be partly reflected through the number of collaborators, namely degree. Hence, the extent can be sketched through the correlation coefficients between degree and local clustering coefficient. Note that correlation coefficients indicate the extent of a linear relationship between two variables or their ranks. The coefficients of variables *X* and *Y* generally do not completely characterize their correlation, unless the conditional expected value of *Y* given *X*, denoted by \(E(Y|X)\), is a linear or approximately linear function of *X*. The conditional expected value of local clustering coefficient given degree is the average local clustering coefficient of *k*-degree nodes, denoted by \(\mathrm{CC}(k)\). The approximately linear trend of \(\mathrm{CC}(k)\) shown in Fig. 4 supports the validity of the correlation analysis in Table 5. The decreasing trend cannot be deduced from degree information alone: the denominator of the local clustering coefficient of a node grows quadratically with its degree, but the numerator cannot be calculated from degree information.

Does the decreasing trend of \(\mathrm{CC}(k)\) mean that activity depresses transitivity? A positive answer would be against common sense. In PNAS 1999–2013, 74.62% of authors publish only one paper in the data, and the team sizes of 99.9% of papers fall in the generalized Poisson part, namely are around the average paper team size 6.028. The boundary of the generalized Poisson part is detected by the boundary point detection algorithm for probability density functions in Reference [46] (listed in the Appendix). Hence the local clustering coefficients of most small-degree authors are close to 1 (Fig. 4). A few authors collaborate over a long period, and their degree accumulates over papers. The collaborators of such an author on different papers may not collaborate with each other, which decreases the author’s local clustering coefficient. Hence the puzzling trend does not contradict common sense; it reflects the inadequacy of measuring transitivity, a dynamical property, by counting “triangles” on a static network.

To design a more reasonable index of transitivity, let us return to the original meaning of transitivity in coauthorship: the probability that two collaborators of a researcher who have not yet coauthored will coauthor in the future. This probability can be calculated for dynamic hypergraphs of collaborations using time information. Averaging the probability over authors measures the global transitivity, the value of which is quite low in each science category (Table 5). Note that the calculation is limited to PNAS 1999–2013, and transitivity may occur in other journals or in other time periods, so the values of transitivity here may be underestimated. The increasing trend of the transitivity probability of *k*-degree authors (\(\mathrm{TC}(k)\) in Fig. 4) means that activity contributes to transitivity. This fits common sense: a researcher with many collaborators is likely to introduce those collaborators to one another.
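One possible way to compute such a time-aware transitivity probability is sketched below. The exact definition used for Table 5 is not spelled out in the text, so the details are assumptions made for illustration: papers are processed in publication order, a pair of collaborators of an author counts as "open" once both are collaborators and have not coauthored, and the pair counts as "closed" when the two later coauthor for the first time. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def transitivity_probability(timed_teams):
    """Per-author share of open collaborator pairs that later coauthor.

    `timed_teams` is a list of paper teams in publication order (an assumed input format).
    Authors who never have an open pair are omitted from the result.
    """
    adj = defaultdict(set)        # coauthorship so far
    ever_open = defaultdict(set)  # pairs of an author's collaborators seen unconnected
    closed = defaultdict(set)     # open pairs that coauthored later
    for team in timed_teams:
        members = sorted(set(team))
        # A first-time coauthor pair closes a triad for every collaborator they already share.
        for u, v in combinations(members, 2):
            if v not in adj[u]:
                for a in adj[u] & adj[v]:
                    closed[a].add(frozenset((u, v)))
        for u, v in combinations(members, 2):
            adj[u].add(v)
            adj[v].add(u)
        # Only members of this paper can have gained collaborators, so only they gain open pairs.
        for a in members:
            for u, v in combinations(sorted(adj[a]), 2):
                if v not in adj[u]:
                    ever_open[a].add(frozenset((u, v)))
    return {a: len(closed[a]) / len(pairs) for a, pairs in ever_open.items() if pairs}


# Hypothetical papers in publication order.
papers = [["A", "B"], ["A", "C"], ["B", "C"], ["A", "D"]]
tc = transitivity_probability(papers)
```

Here B and C are both collaborators of A before they write a paper together, so one of A's three open pairs is closed. Averaging such values over authors, or over authors of equal degree, would give quantities in the spirit of the global index and of \(\mathrm{TC}(k)\).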

### Homophily of coauthorship

Coauthorship builds on features that researchers have in common, including interests and geography. The homophily phenomenon appears in many social relations, and is called assortative mixing in network science [14]. Do authors of each science category prefer to coauthor with others who are similar in social activity or productivity? The social activity and productivity of authors can be quantified by two indexes, namely degree and hyperdegree respectively. The preference with respect to an index can then be sketched through the correlation coefficient between two variables, namely the index of an author and the average index of the author’s neighbors. A positive correlation means assortative mixing, a negative one disassortative mixing, and zero no preference.
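The correlation between an author's index and the average index of the author's neighbors can be sketched as follows, here with degree as the index and the Pearson coefficient as the correlation measure. The helper names and the star-shaped toy network are hypothetical; a star is deliberately disassortative, whereas the coauthorship networks studied here show positive values for degree.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def neighbor_average(adj, index):
    """Average index over each node's neighbors (isolated nodes are skipped)."""
    return {a: sum(index[b] for b in nbrs) / len(nbrs) for a, nbrs in adj.items() if nbrs}


def average_by_value(index, neighbor_avg):
    """Average neighbor index for nodes sharing the same index value, e.g. DN(k) for degree."""
    buckets = defaultdict(list)
    for a, avg in neighbor_avg.items():
        buckets[index[a]].append(avg)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}


# Hypothetical star network: hub H collaborates with three otherwise isolated authors.
adj = {"H": {"x", "y", "z"}, "x": {"H"}, "y": {"H"}, "z": {"H"}}
degree = {a: len(nbrs) for a, nbrs in adj.items()}
nbr_avg = neighbor_average(adj, degree)
authors = sorted(nbr_avg)
r = pearson([degree[a] for a in authors], [nbr_avg[a] for a in authors])
dn = average_by_value(degree, nbr_avg)
```

Replacing `degree` with a hyperdegree dictionary gives the corresponding quantities for productivity.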

Degree assortativity is a feature of coauthorship networks [14]. Does it mean that sociable researchers (with many collaborators) preferentially coauthor with other sociable researchers, and unsociable ones with unsociable ones? In a previous study [48], we showed that 99.5% of the top 5.99% most sociable authors (measured by degree) have coauthored with another such author. The proportion may even be underestimated, because these authors probably coauthored before 1999 or in other situations. Note that the splitting and merging errors of the name disambiguation method affect the proportion to some extent. Even so, the proportion is still remarkable.

However, if sociable researchers coauthored only with sociable ones, then there would exist many sociable researchers, which contradicts the empirical degree distributions. Now let us analyze the influence of the social activity of authors on degree assortativity. For the authors with *k*-degree, denote the average degree of their neighbors by \(\mathrm{DN}(k)\). There exists a trend change in \(\mathrm{DN}(k)\) of each empirical dataset: the head part has a clear increasing trend, but the tail part does not (Fig. 4). It means that degree assortativity is mainly contributed by small-degree authors.

The tipping point of the trend of \(\mathrm{DN}(k)\) is detected by the boundary point detection algorithm for general functions in Reference [46] (listed in the Appendix). Inputs of the algorithm are \(\mathrm{DN}(k)\), \(g(\cdot)=\log(\cdot)\) and \(h(x)= a_{1} x^{3} + a_{2} x^{2} + a_{3} x + a_{4}\) (\(x, a_{i}\in {\mathbb{R}}\), \(i=1,\ldots,4\)). Using those inputs is based on the observation of \(\mathrm{DN}(k)\). Degrees of most authors are around their mode 5, and only a few authors have a large degree. Hence the neighbors of an author are likely to be small degree authors. Therefore, for small degree authors, the degree differences between those authors and their neighbors are small, and large for large degree authors, which leads to the trend change of \(\mathrm{DN}(k)\).

The correlation coefficient between hyperdegree and the average hyperdegree of neighbors is around zero in each science category (Table 5). It means that the choice of collaborators is free of the factor of productivity. For the authors with *k*-hyperdegree, denote the average hyperdegree of their neighbors by \(\mathrm{HN}(k)\). In reality, members of a research team may have various scientific ages (newcomers, incumbents), and so different hyperdegrees. Since collaborations mainly happen within a research team, the collaborators of an author can have various hyperdegrees, which appears as the stable trend of \(\mathrm{HN}(k)\).

Because the average value of \(\mathrm{HN}(k)\) is larger than 2 while 74.62% of authors have only one paper in the data, we can infer that a large fraction of authors publish their first paper in the data together with at least one author who has published a paper in PNAS 1999–2013. The proportions of these authors are 79.22%, 71.17% and 65.12% in the biological, physical and social sciences respectively. The proportions may be overestimated, because some of these authors may have published papers in PNAS before 1999.

### Interdisciplinarity at discipline level

The co-category proportion measures the activity of interdisciplinary research. Interdisciplinary papers were published by 49.2%, 46.0% and 7.3% of the authors in the social, physical and biological sciences respectively. A common belief is that social scientists tend to do research alone. The proportion for the social sciences shows that this belief does not hold in PNAS. Reference [49] also shows that there has been a move towards increased interdisciplinarity in the social sciences in recent decades.

The above analysis could be applied to the second-class disciplines to obtain a higher-resolution result. However, some disciplines have only a few papers, e.g. only 17 papers in political science, so the analysis for those disciplines loses statistical meaning. Hence we took another perspective and analyzed the interactions among the second-class disciplines by visualizing them as the network in Fig. 2. The network is connected, i.e. no discipline is isolated. The top three nodes of this network in terms of degree, and also in terms of betweenness centrality, are Applied mathematics, Chemistry and Anthropology (Table 1). It means that the theories, methods and problems of those disciplines are directly or indirectly used or studied by many disciplines. For each first-class discipline, we contracted its second-class disciplines into one node, and calculated the betweenness centrality of the contracted node. The resulting betweenness centralities (Biological sciences 47.51, Physical sciences 163.81, Social sciences 161.72) support the above analysis.

The co-category proportion only describes interdisciplinary activities. Now let us measure the discipline diversity of interdisciplinary research in each science category through the Rao-Stirling index [38] \(\varDelta =\sum_{i,j(i\neq j)} d^{\alpha}_{ij} (p_{i}p_{j})^{\beta}\), where \(p_{i}\) and \(p_{j}\) are the proportional representations of the papers/authors in science categories *i* and *j*, and \(d_{ij}\) is the level of difference attributed to categories *i* and *j*. Discipline information is used to classify authors into science categories: an author is classified into a discipline, and so into the corresponding science category, if one of the author’s papers belongs to that discipline. Note that an author can be classified into several science categories if the author’s papers belong to more than one discipline. Here we let \(\alpha=\beta=d_{ij}=1\) for all *i* and *j*, hence the calculated Rao-Stirling index measures the balance-weighted variety of interdisciplinary research at the level of science categories. The index in author view and that in paper view show that the discipline diversity of interdisciplinary research in the social sciences and that in the physical sciences are much higher than that in the biological sciences (Fig. 5).
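The index is straightforward to evaluate from a set of proportions, as the sketch below shows. The function name `rao_stirling` and the proportions are hypothetical and are not the values behind Fig. 5. When \(\alpha=\beta=d_{ij}=1\) and the proportions sum to one, the sum over ordered pairs reduces to \(1-\sum_{i} p_{i}^{2}\), which the example uses as a check.

```python
def rao_stirling(proportions, distance=None, alpha=1.0, beta=1.0):
    """Sum of d_ij**alpha * (p_i * p_j)**beta over ordered pairs i != j.

    `proportions` maps category -> p_i. `distance` maps category -> category -> d_ij;
    if omitted, every d_ij is taken to be 1.
    """
    total = 0.0
    for i, p_i in proportions.items():
        for j, p_j in proportions.items():
            if i == j:
                continue
            d_ij = 1.0 if distance is None else distance[i][j]
            total += (d_ij**alpha) * (p_i * p_j) ** beta
    return total


# Hypothetical proportions, chosen only to contrast an unbalanced and a balanced case.
skewed = {"biological": 0.80, "physical": 0.15, "social": 0.05}
balanced = {"biological": 1 / 3, "physical": 1 / 3, "social": 1 / 3}
delta_skewed = rao_stirling(skewed)
delta_balanced = rao_stirling(balanced)
```

The balanced case yields the larger value, which is the sense in which the index rewards an even spread over categories.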

### Interdisciplinarity at author level

We analyzed the relationship between author degree/hyperdegree and the probability of doing interdisciplinary research, and the relationship between paper team size and the probability of being an interdisciplinary paper. Figure 6 shows that in each science category, interdisciplinary research is not just carried out by authors with a large degree or those with a large hyperdegree.

Figure 6 also shows that large degree or hyperdegree authors are likely to engage in interdisciplinary research, and that a paper with a large team size is likely to be an interdisciplinary one. These phenomena might seem to be expected by chance. Take a set of elements (collaborators, papers) of several classes, and select a subset randomly. Then a larger subset is more likely to contain elements from more than one class. This reasoning, though plausible, is incorrect, because scientists do not select topics and collaborators randomly. Research costs (investments of time and effort) make scientists tend to work within their familiar fields. In addition, the reasoning assumes that the pool of potential collaborators is limited to the empirical data, which does not hold in reality.

We analyzed the giant component of the coauthorship network of PNAS 1999–2013, which contains more than 86.8% of authors. It contains 71.5%, 76.7% and 88.9% of the authors of the social, physical and biological sciences respectively (Fig. 7(e)). Note that the author misidentification caused by initial-based methods inflates the size of the giant component relative to the ground truth [44]. Hence we identified authors by the names provided on papers (which is likely to split one author into two) to obtain a conservative result.

Interdisciplinary and multidisciplinary research contribute to a giant component that contains most authors of each science category. We analyzed the relationship between the author proportion of the giant component and author activity/productivity. We removed authors in descending order of degree and of hyperdegree respectively, and calculated the proportion of the giant component after each removal. From the curve relating the proportion of removed authors to that of the giant component, we find that the formation of the giant component is contributed by a considerable number of authors, e.g. the top 10% of authors ranked by degree (Fig. 8). We also considered the relationship in three time periods, viz. 1999–2003, 2004–2008 and 2009–2013. The curve shifts to the left over time, which means that author activity and productivity play increasingly important roles in the formation of the giant component.
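The removal experiment can be sketched as follows, with degree as the ranking index. Several details are assumptions made for illustration: the giant component is measured as a share of all original authors, ties in degree are broken by author name, and the toy network and function names are hypothetical.

```python
def giant_component_share(adj, removed=frozenset()):
    """Share of all original nodes lying in the largest component once `removed` is deleted."""
    seen = set(removed)
    largest = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        largest = max(largest, size)
    return largest / len(adj)


def removal_curve(adj, fractions):
    """Giant component share after removing the top fraction of authors ranked by degree."""
    ranked = sorted(adj, key=lambda a: (-len(adj[a]), a))
    curve = []
    for f in fractions:
        removed = set(ranked[: int(round(f * len(ranked)))])
        curve.append((f, giant_component_share(adj, removed)))
    return curve


# Hypothetical network: a hub H with four collaborators, plus a separate pair e-f.
adj = {
    "H": {"a", "b", "c", "d"},
    "a": {"H"}, "b": {"H"}, "c": {"H"}, "d": {"H"},
    "e": {"f"}, "f": {"e"},
}
curve = removal_curve(adj, [0.0, 1 / 7])
```

Ranking by hyperdegree instead only requires a different sort key, and restricting the input to the papers of one time period gives the period-specific curves.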

## Discussion and conclusions

Our case study on PNAS 1999–2013 verifies the similar transitivity and assortativity of collaboration patterns in the biological, physical and social sciences. The data demonstrate that the degree distributions of the three science categories are of an identical type, namely a mixture of a generalized Poisson distribution and a power-law. This also holds for hyperdegree. We provided an explanation for the emergence of this distribution type through authors’ “yes/no” decisions and their different abilities to attract collaborations.

The data show that a considerable number of authors pursue interdisciplinary research, and that the giant component of the coauthorship network of PNAS 1999–2013 contains most authors of each science category. We took a network perspective to analyze the interactions among the second-class disciplines, and quantified their interdisciplinarity by network indexes such as degree and betweenness centrality. We found that specific second-class disciplines (such as Applied mathematics and Anthropology) play an important role in interdisciplinary research.

The case study contributes to understanding multidisciplinary and interdisciplinary collaboration patterns, owing to the importance of PNAS and to the accurate discipline information of its papers. The selection of data might affect the details of our findings about interdisciplinarity. Our results should not be interpreted as the patterns of researchers in general. For example, we cannot expect to observe a high extent of interdisciplinarity by analyzing a domain-specific journal. We close the case study by asking a question: What are the grounds of interdisciplinary research? While a thorough discussion of this question is beyond the scope of this paper, the following provides a simple discussion.

Disciplines tend to fragment as the sciences develop, splitting into sub-disciplines and specific topics. Although their research objects differ, they share research paradigms, which can be grouped into four categories, namely theoretical research, experiment, simulation, and data-driven research [50]. Meanwhile, many scientific problems are too complex to be understood through the methodology of a single discipline. Integrating theoretical and methodological perspectives drawn from different disciplines creates a unified methodology for research problems, and even a vocabulary used to present concepts in specific disciplines [51], which drives the formation of transdisciplinary disciplines [52].

Systems science, as a typical transdisciplinary discipline, studies systems from simple to complex, from the natural to the social sciences. The parts of a system and the relations between them can be abstracted as networks. The rapid development of research on networks (models, algorithms, …) has bred a new discipline, namely network science. Some researchers from biological, physical and social fields investigate their respective problems within a network framework [53], as does our case study.

Following the above, one would think that common research paradigms and methodologies, especially those integrated as transdisciplinary disciplines, give grounds for the interactions between science categories and for the formation of giant components in coauthorship networks. Analyzing paper content seems a promising way to validate the universality of those paradigms and methodologies. Over half the papers of PNAS 1999–2013 contain the topic words “system” and “control” [42]. A high proportion of papers containing a topic word reflects, to some degree, the typicality of the topic. However, it is not easy to say what the relation is between a paper containing the word “system” and a paper applying research results of systems science. Hence validating the universality at the semantic level is a subject for further study.
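The gap between word occurrence and semantic use can be seen in a minimal sketch that computes the share of texts containing a topic word. The example abstracts and the function name are hypothetical and serve only to illustrate the kind of surface-level count discussed above; they are not taken from the PNAS data or from [42].

```python
import re


def share_containing(texts, word):
    """Fraction of texts containing `word` as a whole token (case-insensitive)."""
    if not texts:
        return 0.0
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = sum(1 for text in texts if pattern.search(text))
    return hits / len(texts)


# Toy abstracts (hypothetical), not drawn from the dataset.
toy_abstracts = [
    "A control system for gene expression.",
    "The immune system responds to infection.",
    "Optimal control of epidemic spreading.",
    "Protein folding kinetics.",
]

print(share_containing(toy_abstracts, "system"))
print(share_containing(toy_abstracts, "control"))
```

Both toy abstracts that mention “system” are counted equally, although only one of them could plausibly be said to draw on systems science, which is why a word count alone cannot settle the question.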

## Notes

- 1.
- 2.
According to Wikipedia, people with the 100 most common Chinese surnames account for 84.77% of the total Chinese population.

## References

- 1.
Weingart P (2012) A short history of knowledge formations. In: Frodeman R, Thompson Klein J, Mitcham C (eds) The Oxford handbook of interdisciplinarity. Oxford University Press, Oxford, pp 3–14

- 2.
Cooper G (2013) A disciplinary matter: critical sociology, academic governance and interdisciplinarity. Sociology 47(1):74–89

- 3.
Hurd JM (1992) Interdisciplinary research in the sciences: implications for library organizations. Coll Res Libr 53(4):283–297

- 4.
National Academies (U.S.), Committee on Facilitating Interdisciplinary Research (2004) Facilitating interdisciplinary research. National Academy Press, Washington. Retrieved from http://www.nap.edu/books/0309094356/html/

- 5.
Hadorn GH, Pohl C, Bammer G (2012) Solving problems through transdisciplinary research. In: Frodeman R, Thompson Klein J, Mitcham C (eds) The Oxford handbook of interdisciplinarity. Oxford University Press, Oxford, pp 431–452

- 6.
Liu Y, Rafols I, Rousseau R (2012) A framework for knowledge integration and diffusion. J Doc 68(1):31–44

- 7.
Siedlok F, Hibbert P (2014) The organization of interdisciplinary research: modes, drivers and barriers. Int J Manag Rev 16(2):194–210

- 8.
Gooch D, Vasalou A, Benton L (2017) Impact in interdisciplinary and cross-sector research: opportunities and challenges. J Assoc Inf Sci Technol 68(2):378–391

- 9.
Larivière V, Gingras Y, Archambault É (2006) Canadian collaboration networks: a comparative analysis of the natural sciences, social sciences and the humanities. Scientometrics 68(3):519–533

- 10.
Moody J (2004) The structure of a social science collaboration network: disciplinary cohesion from 1963 to 1999. Am Sociol Rev 69(2):213–238

- 11.
Glänzel W, Schoepflin U (1999) A bibliometric study of reference literature in the sciences and social sciences. Inf Process Manag 35(1):31–44

- 12.
Hicks D (1999) The difficulty of achieving full coverage of international social science literature and the bibliometric consequences. Scientometrics 44(2):193–215

- 13.
Newman M (2001) The structure of scientific collaboration networks. Proc Natl Acad Sci USA 98:404–409

- 14.
Newman M (2002) Assortative mixing in networks. Phys Rev Lett 89:208701

- 15.
Barabási AL, Jeong H, Néda Z, Ravasz E, Schubert A, Vicsek T (2002) Evolution of the social network of scientific collaborations. Physica A 311:590–614

- 16.
Newman M (2004) Coauthorship networks and patterns of scientific collaboration. Proc Natl Acad Sci USA 101:5200–5205

- 17.
Sarigöl E, Pfitzner R, Scholtes I, Garas A, Schweitzer F (2014) Predicting scientific success based on coauthorship networks. EPJ Data Sci 2014:9

- 18.
Xie Z, Li JP (2016) A geometric graph model for coauthorship networks. J Informetr 10:299–311

- 19.
Tomasello MV, Vaccario G, Schweitzer F (2017) Data-driven modeling of collaboration networks: a cross-domain analysis. EPJ Data Sci 6:22

- 20.
Braun T, Schubert A (2003) A quantitative view on the coming of age of interdisciplinarity in the sciences, 1980–1999. Scientometrics 58(1):183–189

- 21.
Porter AL, Roessner JD, Cohen AS, Perreault M (2006) Interdisciplinary research: meaning, metrics and nurture. Res Eval 15(3):187–195

- 22.
Levitt JM, Thelwall M, Oppenheim C (2011) Variations between subjects in the extent to which the social sciences have become more interdisciplinary. J Assoc Inf Sci Technol 62(6):1118–1129

- 23.
Porter AL, Rafols I (2009) Is science becoming more interdisciplinary? Measuring and mapping six research fields over time. Scientometrics 81(3):719–745

- 24.
Chen S, Arsenault C, Gingras Y, Larivière V (2015) Exploring the interdisciplinary evolution of a discipline: the case of biochemistry and molecular biology. Scientometrics 102(2):1307–1323

- 25.
Rafols I, Meyer M (2010) Diversity and network coherence as indicators of interdisciplinarity: case studies in bionanoscience. Scientometrics 82(2):263–287

- 26.
Abramo G, D’Angelo CA, Costa F (2012) Identifying interdisciplinarity through the disciplinary classification of coauthors of scientific publications. J Assoc Inf Sci Technol 63(11):2206–2222

- 27.
Bordons M, Zulueta MA, Romero F, Barrigón S (1999) Measuring interdisciplinary collaboration within a university: the effects of the multidisciplinary research programme. Scientometrics 46(3):383–398

- 28.
Leydesdorff L, Goldstone RL (2014) Interdisciplinarity at the journal and specialty level: the changing knowledge bases of the journal Cognitive Science. J Assoc Inf Sci Technol 65(1):164–177

- 29.
Zhang L, Rousseau R, Glänzel W (2015) Diversity of references as an indicator for interdisciplinarity of journals: taking similarity between subject fields into account. J Assoc Inf Sci Technol 67(5):1257–1265

- 30.
Lungeanu A, Huang Y, Contractor NS (2014) Understanding the assembly of interdisciplinary teams and its impact on performance. J Informetr 8(1):59–70

- 31.
Larivière V, Gingras Y (2010) On the relationship between interdisciplinarity and scientific impact. J Assoc Inf Sci Technol 61(1):126–131

- 32.
Larivière V, Haustein S, Börner K (2015) Long-distance interdisciplinarity leads to higher scientific impact. PLoS ONE 10(3):e0122565

- 33.
Rinia EJ, van Leeuwen TN, van Raan AFJ (2002) Impact measures of interdisciplinary research in physics. Scientometrics 53(2):241–248

- 34.
Wan J, Thijs B, Glänzel W (2015) Interdisciplinarity and impact: distinct effects of variety, balance, and disparity. PLoS ONE 10(5):e0127298

- 35.
Levitt JM, Thelwall M (2009) The most highly cited library and information science articles: interdisciplinarity, first authors and citation patterns. Scientometrics 78(1):45–67

- 36.
Levitt JM, Thelwall M (2008) Is multidisciplinary research more highly cited? A macrolevel study. J Assoc Inf Sci Technol 59(12):1973–1984

- 37.
Chen S, Arsenault C, Larivière V (2015) Are top-cited papers more interdisciplinary? J Informetr 9(4):1034–1046

- 38.
Stirling A (2007) A general framework for analyzing diversity in science, technology and society. J R Soc Interface 4(5):707–719

- 39.
Leydesdorff L (2007) Betweenness centrality as an indicator of the interdisciplinarity of scientific journals. J Assoc Inf Sci Technol 58(9):1303–1319

- 40.
Van den Besselaar P, Heimeriks G (2001) Disciplinary, multidisciplinary, interdisciplinary: concepts and indicators. In: ISSI, pp 705–716

- 41.
Kagan J (2009) The three cultures: natural sciences, social sciences, and the humanities in the 21st century. Cambridge University Press, Cambridge

- 42.
Xie Z, Duan XJ, Zhang PY (2015) Quantitative analysis of the interdisciplinarity of applied mathematics. PLoS ONE 10(9):e0137424

- 43.
Milojević S (2013) Accuracy of simple, initials-based methods for author name disambiguation. J Informetr 7(4):767–773

- 44.
Kim J, Diesner J (2016) Distortive effects of initial-based name disambiguation on measurements of large-scale coauthorship networks. J Assoc Inf Sci Technol 67(6):1446–1461

- 45.
Milojević S (2010) Modes of collaboration in modern science: beyond power laws and preferential attachment. J Assoc Inf Sci Technol 61(7):1410–1423

- 46.
Xie Z, Li JP, Dong EM, Yi DY (2018) Modelling transition phenomena of scientific coauthorship networks. J Assoc Inf Sci Technol 69(2):305–317

- 47.
Consul PC, Jain GC (1973) A generalization of the Poisson distribution. Technometrics 15(4):791–799

- 48.
Xie Z, Xie ZL, Li M, Li JP, Yi DY (2017) Modeling the coevolution between citations and coauthorship of scientific papers. Scientometrics 112:483–507

- 49.
Levitt JM, Thelwall M, Oppenheim C (2011) Variations between subjects in the extent to which the social sciences have become more interdisciplinary. J Assoc Inf Sci Technol 62(6):1118–1129

- 50.
Hey T, Tansley S, Tolle KM (2009) The fourth paradigm: data-intensive scientific discovery. Microsoft Research, Redmond

- 51.
Haythornthwaite C (2006) Learning and knowledge networks in interdisciplinary collaborations. J Assoc Inf Sci Technol 57(8):1079–1092

- 52.
Grauwin S, Beslon G, Fleury É, Franceschelli S, Robardet C, Rouquier JB, Jensen P (2012) Complex systems science: dreams of universality, interdisciplinarity reality. J Assoc Inf Sci Technol 63(7):1327–1338

- 53.
Brier S (2013) Cybersemiotics: a new foundation for transdisciplinary theory of information, cognition, meaningful communication and the interaction between nature and culture. Integr Rev 9:222–263

### Acknowledgements

We thank the anonymous reviewers for their valuable suggestions and great help.

### Availability of data and materials

The data are freely available from the PNAS website http://www.pnas.org. Feel free to contact the corresponding author if you need more information.

### Funding

ZX acknowledges support from National Science Foundation of China (NSFC) Grant No. 61773020.

## Author information

### Contributions

The first three authors have contributed equally to this work. All authors conceived and designed the research. ZX and ML wrote the paper. ZX and JPL analyzed the data. OYZZ acquired the data. ZX and XJD wrote the discussion. All authors discussed the research and approved the final version of the manuscript.

### Corresponding author

Correspondence to Zheng Xie.

## Ethics declarations

### Competing interests

The authors declare that they have no competing interests.

## Additional information

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## About this article

### Keywords

- Collaboration pattern
- Interdiscipline
- Hypergraph
- Complex network