The classical origin of modern mathematics
- Floriana Gargiulo^{1},
- Auguste Caen^{2},
- Renaud Lambiotte^{1} and
- Timoteo Carletti^{1}
DOI: 10.1140/epjds/s13688-016-0088-y
© Gargiulo et al. 2016
Received: 18 March 2016
Accepted: 5 August 2016
Published: 16 August 2016
Abstract
This paper introduces a data-driven methodology to study the historical evolution of mathematical thinking and its spatial spreading. To do so, we have collected and integrated data from different online academic datasets. In its final form, the database includes a large number (\(N\sim200\mbox{K}\)) of advisor-student relationships, with affiliations and keywords on their research topic, over several centuries, from the 14th century until today. We focus on two different issues: the evolving importance of countries and of research disciplines over time. Moreover, we study the database at three levels: its global statistics, the mesoscale networks connecting countries and disciplines, and the genealogical level.
Keywords
academic genealogies; history of mathematics; directed acyclic graphs
1 Introduction
The statistical analysis of scientific databases, including those of the American Physical Society, Scopus, the arXiv and ISI Web of Knowledge, has become increasingly popular in the complex systems community in recent years. Important contributions include the development of appropriate scientometric measures to evaluate the scientific impact of scholars, journals and academic institutions [1–5] and to predict the future success of authors [6, 7] and papers [8]. In parallel, the structure of collaboration has attracted much attention, and collaboration networks have become a central example for the study of complex networks, thanks to the high quality and availability of the datasets [9]. From a dynamical point of view, different papers [10, 11] studied the mobility of researchers during their academic careers, showing that the statistical properties of their mobility patterns are mainly determined by simple features, such as geographical distance, university rankings and cultural similarity.
Limitations of the aforementioned datasets include their relatively narrow time window, which extends over 100 years at best, and the difficulty of disambiguating author names, and thus of correctly distinguishing career paths across time. The original motivation of this paper was to address these issues by performing an extended study of the Mathematics Genealogy Project, a very large, curated genealogical academic corpus [12]. The dataset, whose basic statistics have already been analysed elsewhere [13, 14], extends over several centuries and contains information that allows us to retrieve not only the direct genealogical mentor-student links, but also university affiliations at different points of a career and the research domains. Data from the same website have already been used to assess the role of mentorship on scientific productivity [15] and to study the prestige of university departments [16].
Our main goal is to analyse the history of modern mathematics, through the processes of birth, death, fusion and fission of research fields across time and space. In particular, we focus on the temporal evolution of the roles and importance of countries and of disciplines, on the structure of ‘scientific families’ and on the impact of genealogy on the development of scientific paradigms. As is often the case when performing a data-driven analysis of historical facts [17, 18], the dataset is expected to be incomplete and to present biases, mainly for the older data. In the present case, the website collects the data in two ways: a participative method, based on the spontaneous registration of scholars (who can also register their students and their mentors), and a curated method, based on historical facts and performed by the creators of the website.
The presence of biases calls for the use of appropriate statistical measures, preferably based on rankings rather than on absolute values. In this work, we have also introduced data-mining methods to correct and enrich the data structure. A first contribution of this work is thus methodological, with the design of a setup that could be applied to other systems. We have then performed an analysis of the system at three levels of granularity. First, a global analysis investigates the fully aggregated ‘demography’ (population in terms of countries and disciplines) of the database, with the aim of classifying countries and disciplines according to their normalised activity behaviour. Tracking the evolution of the rankings helps identify transition points in the history of mathematics, associated with emerging fields of research. Second, we have constructed directed weighted networks where nodes are scholars endowed with a set of attributes (thesis defence date, thesis defence location, thesis disciplines) and linked to other nodes using the genealogy associated with the mentor-student relation. This ‘mesoscale’ network allows us to investigate the relationships between the attributes and to identify a strong hierarchical structure in the scientific production in terms of countries, as well as its evolution over time. Finally, using an approach typical of kinship network studies [19, 20], we focus on the statistical properties of the tree structure of the genealogy in terms of family structures. We conclude by showing the presence of strong memory effects in the network morphogenesis.
To summarise, this paper has a twofold goal: first, to propose, in the framework of data science, new tools to collect and analyse historical databases; second, and complementing the former, to provide a narrative on the history of mathematics as extracted from data. In both cases, this work opens interesting perspectives. Because of their generality, the presented tools could be used to study other databases with a genealogical structure, for instance in the case of bibliometrics or Wikipedia studies. In addition, our results provide a first glimpse of the potential use of data and algorithms in the study of the history of science. An important future step would consist in complementing and interpreting this data-driven view with that of epistemologists and historians of science, as briefly outlined in the conclusions.
2 Dataset and associated networks
The core of our dataset has been extracted from the website of the ‘Mathematics Genealogy Project’. It is one of the largest academic genealogies available on the web, consisting of approximately 200K non-isolated scientists (186,505) with information on their mentors and students. The data cover a period from the 14th century to the present day. For a majority of mathematicians, we have detailed information about their PhD, including the title (for 88% of the scholars), the classification according to the 93 classes proposed by the American Mathematical Society [21] (for 43% of the scholars), the university delivering the degree and the year of its defence. However, because a large part of the database is spontaneously filled in by the scientists, the data are imperfect and attributes may be wrong or missing. A first step has thus consisted in comparing the database with additional data from Wikipedia [22]. In particular, we first downloaded, when available, the Wikipedia pages of all the scholars present in the Mathematics Genealogy Project database. Disambiguation of the names is ensured by the fact that Wikipedia pages have a direct link to the Mathematics Genealogy Project site. In the text of the Wikipedia pages, we then searched for the keywords associated with the AMS classification in order to expand the information about authors. This external dataset allowed us to assign a discipline to 54% of the mathematicians.
For more recent entries, we retrieved the affiliations from the Scopus profiles of scientists [23]. Notice that we only extracted the information required for our needs and that additional information, e.g. about their scientific impact or their geographical links, could be collected in order to address other research questions.
It is worth noting that our analyses are biased by the current scientific and socio-political environment. First, countries’ borders have changed over time. In the Mathematics Genealogy Project, the location of the PhD defence is determined according to the position of the university in the current geo-political setting. We kept this country classification in our analyses, but it would be interesting in the future to consider, for example, the resilience of the system to border shifts. Similarly, the concept of discipline is also very delicate to define on such a long time scale [24], and we decided here to use the current classification from the AMS for all authors.
After this preliminary phase, we have enriched the information available for the authors by developing algorithms aimed at correcting the dates and assigning a discipline to each thesis. The algorithm for fixing errors in temporal entries is based on the topological structure of the genealogical network and uses the available statistics on the time difference between the PhDs of mentor and student to identify and suitably correct wrong time sequences (e.g. the cases where the mentor completed their PhD after their student, or where the time elapsed between the mentor’s and the student’s PhD is too large). The missing disciplines (not previously extracted from Wikipedia) have been learned from the thesis title using a Bayesian supervised dictionary learning technique.
As previously stated, these algorithms, summarised in Additional file 1 (Sections I.B-I.D), are general and could be applied in other contexts. After the enrichment, all the scholars of the database have a corrected date, 88% of them have an associated discipline and 94% an associated country. As a next step, we have exploited the enriched database in order to study the geographical and temporal evolution of mathematics. Different data representations, described below, have been adopted to mine different types of information from the dataset.
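The date-consistency check described above can be illustrated with a minimal sketch. This is not the algorithm of Additional file 1: the plausibility bounds `MIN_GAP` and `MAX_GAP`, the function names and the toy scholars A to D are all assumptions made for illustration. The sketch flags mentor-student links whose PhD-year gap is implausible and suggests a corrected year from the mentors' years plus the median plausible gap.

```python
from statistics import median

# Hypothetical plausibility bounds (in years) for the gap between a mentor's
# PhD and a student's PhD; the source derives its criteria from the data.
MIN_GAP = 1
MAX_GAP = 60


def flag_suspicious_links(links, phd_year, min_gap=MIN_GAP, max_gap=MAX_GAP):
    """Return (mentor, student, gap) for links with an implausible gap."""
    suspicious = []
    for mentor, student in links:
        if mentor not in phd_year or student not in phd_year:
            continue  # links with a missing date cannot be checked
        gap = phd_year[student] - phd_year[mentor]
        if gap < min_gap or gap > max_gap:
            suspicious.append((mentor, student, gap))
    return suspicious


def typical_gap(links, phd_year, min_gap=MIN_GAP, max_gap=MAX_GAP):
    """Median mentor-student gap, computed over the plausible links only."""
    gaps = [phd_year[s] - phd_year[m]
            for m, s in links if m in phd_year and s in phd_year]
    return median(g for g in gaps if min_gap <= g <= max_gap)


def suggest_year(student, links, phd_year, gap):
    """Suggest a PhD year as the median mentor year plus a typical gap."""
    mentor_years = [phd_year[m] for m, s in links
                    if s == student and m in phd_year]
    if not mentor_years:
        return None
    return round(median(mentor_years) + gap)


# Toy chain A -> B -> C -> D in which the year recorded for C is wrong.
links = [("A", "B"), ("B", "C"), ("C", "D")]
phd_year = {"A": 1800, "B": 1825, "C": 1790, "D": 1875}

print(flag_suspicious_links(links, phd_year))
# [('B', 'C', -35), ('C', 'D', 85)]
gap = typical_gap(links, phd_year)
print(suggest_year("C", links, phd_year, gap))  # 1850
```

In this toy example both flagged links involve C, which points to C as the entry to correct; a fuller treatment would also have to decide which endpoint of a flagged link is wrong.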
2.1 The mesoscale networks
2.2 The genealogical tree and its partitions into families
The genealogical graph is the most obvious representation of our dataset, consisting in a directed acyclic graph [25] linking a mentor to her/his students. This automatically defines the structure of hierarchical generations. Notice however that the structure of our data is not simply a tree, due to the several cases where a student has two advisors. A very common procedure in kinship studies is to cut the genealogical directed acyclic graphs into linear trees (alliances) where each individual has a single progenitor (the mother for representing the uterine links and the father for the agnatic ones). In this representation, the links between alliances represent the matrimonial structures between the different alliances in the society. In our context, when a scientist has more than one advisor, it is not clear which links should be cut to retrieve the original ancestors (our dataset does not allow us to distinguish the principal supervisor from the secondary one, if any). We thus propose a method to reproduce the optimal ancestry lines and to identify the important families in the genealogy. The method, fully described in Additional file 1 (Section I.D), is based on the decomposition of the network into pure linear trees, and their statistical clustering based on probabilistic arguments. Roughly speaking, given two nodes A and B that can be linked in more than one way, thus implying the presence of non-trivial loops, we assign to every link in such paths the probability that A and B will be disconnected if the link is removed. We then select the links to be removed by maximising the probability that A and B are still linked. The resulting partition of the graph identifies 84 families; remarkably, the 24 most populated families cover 65% of the scientific population in the database. Let us observe that alternative methods for family identification do exist, see for instance [26, 27].
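As a small illustration of the structure discussed above (and not of the probabilistic family decomposition of Additional file 1), the following sketch computes the generation of each scholar in a directed acyclic graph of mentor-student links and lists the scholars with more than one advisor, i.e. the points where links would have to be cut to obtain trees. Defining the generation as the longest path from a root is an assumption of this sketch, as are the function names and the toy data.

```python
from collections import defaultdict, deque


def generations(links):
    """Longest-path depth of every node in a DAG of (mentor, student) links."""
    students = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for mentor, student in links:
        students[mentor].append(student)
        indegree[student] += 1
        nodes.update((mentor, student))

    depth = {node: 0 for node in nodes}
    # Kahn's algorithm: start from the roots (scholars with no recorded mentor).
    queue = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in students[node]:
            depth[child] = max(depth[child], depth[node] + 1)
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if visited != len(nodes):
        raise ValueError("the links contain a cycle, so this is not a DAG")
    return depth


def multi_advisor_nodes(links):
    """Map each scholar with more than one advisor to the sorted advisors."""
    advisors = defaultdict(set)
    for mentor, student in links:
        advisors[student].add(mentor)
    return {s: sorted(a) for s, a in advisors.items() if len(a) > 1}


# Toy genealogy: D has three advisors, so the graph is not a tree.
links = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("A", "D"), ("D", "E")]

print(generations(links) == {"A": 0, "B": 1, "C": 1, "D": 2, "E": 3})  # True
print(multi_advisor_nodes(links))  # {'D': ['A', 'B', 'C']}
```

The cycle check doubles as a data-quality test: a cycle in the mentor-student links can only come from erroneous entries.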
3 Results
3.1 Global statistics
Additional information on how the total number of mathematicians compares with that of scientists would make these results more meaningful, but this type of information is difficult to retrieve from electronic archives.
To capture the rise and fall of countries or disciplines, we have compared the rankings of the top 10 countries and disciplines in different time periods. Standard indicators for rank comparison, such as the Kendall tau index, cannot be applied here since the elements in the top-k lists are not conserved in time [28]. For this reason, we have used a distance measure based on a modified version of the Jaccard index that allows ranked sets to be compared, \(J(\mathit{rank}_{1},\mathit{rank}_{2})\) (more information is provided in Additional file 1, Section A). As for the original Jaccard index, the modified version is such that \(J(\mathit{rank}_{1},\mathit{rank}_{2})=1\) when the rankings \(\mathit{rank}_{1}\) and \(\mathit{rank}_{2}\) are completely equivalent, and takes the value 0 when they are not correlated at all. This quantity is then transformed into a distance by taking \(d_{J}=1-J\). Increases in the distance measure \(d_{J}\) indicate major reshaping of the rankings.
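The exact definition of the modified Jaccard index is given in Additional file 1 and is not reproduced here. The sketch below uses a stand-in that only shares the two stated limits: it averages the ordinary Jaccard index over the top-d prefixes of the two rankings, so that it equals 1 for identical rankings and 0 for rankings with no element in common. The function names and the toy rankings are assumptions.

```python
def ranked_jaccard(rank1, rank2):
    """Average Jaccard index of the top-d prefixes, for d = 1..k.

    Stand-in for the modified index of the source, not its definition.
    """
    k = max(len(rank1), len(rank2))
    if k == 0:
        return 1.0  # two empty rankings are treated as identical
    total = 0.0
    for d in range(1, k + 1):
        top1, top2 = set(rank1[:d]), set(rank2[:d])
        total += len(top1 & top2) / len(top1 | top2)
    return total / k


def ranked_distance(rank1, rank2):
    """Distance d_J = 1 - J between two ranked lists."""
    return 1.0 - ranked_jaccard(rank1, rank2)


# Toy rankings for two consecutive periods (labels are placeholders).
period_1 = ["a", "b", "c"]
period_2 = ["b", "a", "c"]

print(ranked_jaccard(period_1, period_1))        # 1.0
print(ranked_jaccard(period_1, ["x", "y", "z"]))  # 0.0
print(round(ranked_distance(period_1, period_2), 3))  # 0.333
```

Because every prefix contributes, a swap at the top of the list moves the distance more than the same swap at the bottom, which is the behaviour one wants when tracking changes in a top-10 ranking.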
3.2 Mesoscale networks
3.2.1 Network of countries
The network of countries can be used to represent the knowledge flows from one country to another, associated with a student trained in one country becoming a professor and PhD supervisor in another country. The network has a few important hubs, which are the centres of gravity of scientific research (USA, Germany, Russia, UK). Each of these hubs tends to be surrounded by a community of countries. These communities can be associated with historical divisions, for instance a large block connected to USA scientific production, the Commonwealth nations, the ex-Soviet bloc and the central European countries. The betweenness of countries allows us to detect countries at the interface between different communities, such as France, connecting the central European countries with the USA-centred community, or Poland, connecting European research and the ex-Soviet area.
More information about this network, in particular the properties of the aggregated transition networks concerning the whole historical period, can be found in Additional file 1 (Section III).
3.2.2 The transition network of disciplines
The transition network of disciplines represents transfers of knowledge from one scientific discipline (that of the mentor) to another (that of the student). The structure of this graph is quite homogeneous in terms of degree, and four major communities can be identified using standard community detection algorithms working on the topological structure of the weighted network [29]: computer science, geometry, analysis and physics. Each community represents disciplines that exchange more knowledge among themselves than with other research fields, and can therefore be interpreted as a scientific paradigm (in Thomas Kuhn's sense) at a certain period.
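A transition network of this kind can be assembled from the mentor-student links by counting, for each link, the pair (attribute of the mentor, attribute of the student). The sketch below is a minimal illustration under that assumption; the function name, the choice to skip links with a missing attribute and the toy labels are not taken from the source. Passing a country attribute instead of a discipline attribute gives a network of countries in the same way.

```python
from collections import Counter


def transition_network(links, attribute):
    """Count weighted directed edges between attribute values.

    links     -- iterable of (mentor, student) pairs
    attribute -- dict mapping a scholar to a discipline or a country
    Links for which either endpoint has no attribute are skipped.
    """
    weights = Counter()
    for mentor, student in links:
        source = attribute.get(mentor)
        target = attribute.get(student)
        if source is None or target is None:
            continue
        weights[(source, target)] += 1
    return weights


# Toy data: scholar E has no known discipline, so the link C -> E is skipped.
links = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "E")]
discipline = {"A": "geometry", "B": "geometry", "C": "analysis", "D": "analysis"}

network = transition_network(links, discipline)
print(network[("geometry", "analysis")])  # 2
print(network[("geometry", "geometry")])  # 1
```

The resulting dictionary of edge weights can be passed to any graph library for community detection or betweenness computations.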
3.3 The genealogical structure
This last section is devoted to the study of the genealogy tree reconstructed from our data and of its relevance in the evolution of the history of mathematical science.
The aggregated network between the families, reminiscent of the alliance networks of kinship studies defined in [19, 31], can be described using some typical topological indicators [20]: (1) the endogamy index \(\epsilon_{0}\), describing the fraction of loops in the network (links within the same family); (2) the concentration index \(c_{x}\), denoting the heterogeneity of the concentration of links between pairs of families (\(c_{x}=1\) when all links are concentrated on a single pair and \(c_{x}=1/n^{2}\) when links are homogeneously distributed among the n families); (3) the network symmetry index \(s_{x}\), which varies from 0 in the case of total link imbalance, namely when the outgoing and incoming fluxes are very different from each other, to 1 in the case of perfect symmetry of fluxes. To assess the relevance of these indicators for our genealogy network, we compared them with the values expected under a random multinomial reshuffling, used as a null model (see Figure 10B). We observe that, while the symmetry is a structural property, being unchanged by the reshuffling, the endogamy and the concentration are typical signatures of this network; moreover, they are much higher than in traditional kinship networks [20]. These results imply that the obtained scientific families are structurally very distant from one another and that their relationships are very hierarchical (being mediated by the largest families).
This strong separation between the genealogical families can be a signature of the existence of tacit knowledge in mathematics [32]. It would be interesting to study the historical development of the kinship structure in order to better address this phenomenon.
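To make the endogamy and concentration indices introduced above concrete, here is a minimal sketch on a toy family network. The endogamy index is computed as the weighted fraction of within-family links. For the concentration index, the sketch assumes a sum of squared link shares over ordered family pairs, which is one form that reproduces the two stated limits (1 for a single pair, \(1/n^{2}\) for a homogeneous distribution); the precise definitions are those of [20]. The symmetry index is left out because the text does not specify it closely enough.

```python
def endogamy_index(weights):
    """Fraction of link weight on loops, i.e. links within one family."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("the network has no links")
    loops = sum(w for (src, dst), w in weights.items() if src == dst)
    return loops / total


def concentration_index(weights):
    """Sum of squared link shares over ordered pairs of families.

    Assumed form: equals 1 when all links sit on a single pair and 1/n**2
    when they are spread evenly over all n*n ordered pairs.
    """
    total = sum(weights.values())
    if total == 0:
        raise ValueError("the network has no links")
    return sum((w / total) ** 2 for w in weights.values())


# Toy networks over n = 2 families F1 and F2 (weights are link counts).
homogeneous = {("F1", "F1"): 5, ("F1", "F2"): 5, ("F2", "F1"): 5, ("F2", "F2"): 5}
single_pair = {("F1", "F2"): 20}

print(endogamy_index(homogeneous))       # 0.5
print(concentration_index(homogeneous))  # 0.25
print(concentration_index(single_pair))  # 1.0
```

Comparing such values with those obtained after reshuffling the links at random is what separates structural effects from genuine signatures of the network.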
4 Conclusions
In this paper, we have presented a data-driven study of the history of mathematical science, based on the Mathematics Genealogy Project. A first important aspect has been the cleaning and correction of the incomplete and sometimes inaccurate dataset. This operation was performed by means of machine learning and by incorporating data from other sources, including Wikipedia.
We have then considered three different approaches to analyse the data: a demographic approach analysing the time evolution of the prevalence of certain attributes (i.e. countries or disciplines); a mesoscale network approach focusing on the connections between these attributes; and a ‘kinship’ approach based on the clustering of genealogical trees. Our analysis reveals important transition points in the history of mathematics and allows us to categorise countries according to their capacity to attract, export and self-maintain knowledge. Moreover, the community structure of the network of disciplines allows us to better describe the transformation of knowledge across time. Finally, we have also identified important scientific families, associating them with their founders, and described their geographical and disciplinary distribution.
Interesting lines of research for the future include the integration of additional datasets, based on different methodologies, to extend the scope of this work beyond the mathematical sciences.
Another research direction, still connected to the history of mathematics, could be to analyse how the scientific labour market reacts to exogenous events [33, 34], or to study innovation dynamics, for instance the impact of the computer age, the Internet or peer review practices on disciplinary prevalence.
Finally, it would also be worthwhile to build abstract agent-based models of innovation diffusion that could be calibrated and implemented within this framework, in order to forecast future events and thus add a predictive character to this dataset.
Other interesting research directions could include the analysis of gender roles in scientific production, using methods similar to the ones proposed in [35, 36].
Declarations
Acknowledgements
The work of FG, TC and RL presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
- Bergstrom CT, West JD, Wiseman MA (2008) The eigenfactor metrics. J Neurosci 28(45):11433-11434
- Ramasco JJ, Dorogovtsev SN, Pastor-Satorras R (2004) Self-organization of collaboration networks. Phys Rev E 70:036106
- Pan RK, Kaski K, Fortunato S (2012) World citation and collaboration networks: uncovering the role of geography in science. Sci Rep 2:902
- Radicchi F, Fortunato S, Markines B, Vespignani A (2009) Diffusion of scientific credits and the ranking of scientists. Phys Rev E 80:056103
- Radicchi F, Castellano C (2011) Rescaling citations of publications in physics. Phys Rev E 83:046116
- Wang D, Song C, Barabási AL (2013) Quantifying long-term scientific impact. Science 342(6154):127-132
- Acuna DE, Allesina S, Kording KP (2012) Future impact: predicting scientific success. Nature 489(7415):201-202
- Shen HW, Wang D, Song C, Barabási AL (2014) Modeling and predicting popularity dynamics via reinforced Poisson processes. In: AAAI 2014, pp 291-297
- Newman ME (2001) The structure of scientific collaboration networks. Proc Natl Acad Sci USA 98:404-409
- Deville P, Wang D, Sinatra R, Song C, Blondel VD, Barabási AL (2014) Career on the move: geography, stratification, and scientific impact. Sci Rep 4:4770
- Gargiulo F, Carletti T (2014) Driving forces of researchers mobility. Sci Rep 4:4860
- The mathematical genealogy. http://genealogy.math.ndsu.nodak.edu/index.php
- Narayan P (2011) Mathematics genealogy network. Thesis, University of Oxford. http://people.maths.ox.ac.uk/porterm/research/priya_thesis_final.pdf
- Engin A, Gunes MH, Yuksel M (2011) Analysis of academic ties: a case study of mathematics genealogy. In: 2011 IEEE GLOBECOM Workshops (GC Wkshps). IEEE Press, New York
- Dean MR, Ottino JM, Amaral LAN (2010) The role of mentorship in protégé performance. Nature 465(7298):622-626
- Myers SA, Mucha PJ, Porter MA (2011) Mathematical genealogy and department prestige. Chaos 21(4):041104
- Schich M, Song C, Ahn YY, Mirsky A, Martino M, Barabási AL, Helbing D (2014) A network framework of cultural history. Science 345(6196):558-562
- Sinatra R, Deville P, Szell M, Wang D, Barabási AL (2015) A century of physics. Nat Phys 11(10):791-796
- White D, Jorion P (1992) Representing and analyzing kinship: a new approach. Curr Anthropol 33:454-462
- Roth C, Gargiulo F, Bringé A, Hamberger K (2013) Random alliance networks. Soc Netw 35(3):394-405
- 2010 mathematics subject classification. http://www.ams.org/msc/msc2010.html
- Wikipedia. http://www.wikipedia.org
- Scopus. http://www.scopus.com
- Sugimoto CR, Weingart S (2015) The kaleidoscope of disciplinarity. J Doc 71(4):775-794
- Karrer B, Newman M (2009) Random acyclic networks. Phys Rev Lett 102(12):128701
- Karloff H, Shirley KE (2013) Maximum entropy summary trees. Comput Graph Forum 32(3):71-80
- Clough JR, Gollings J, Loach TV, Evans TS (2015) Transitive reduction of citation networks. J Complex Netw 3(2):189-203
- Fagin R, Kumar R, Sivakumar D (2003) Comparing top k lists. SIAM J Discrete Math 17(1):134-160
- Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008(10):P10008
- Rosvall M, Bergstrom CT (2010) Mapping change in large networks. PLoS ONE 5(1):e8694
- Hamberger K, Houseman M, White D (2011) Kinship network analysis. In: Carrington P, Scott J (eds) The SAGE handbook of social network analysis. Sage, London, pp 533-549
- Polanyi M (1969) Personal knowledge: towards a post-critical philosophy. University of Chicago Press, Chicago
- Borjas GJ, Doran KB (2012) The collapse of the Soviet Union and the productivity of American mathematicians. Q J Econ 127:1143-1203
- Moser P, Voena A, Waldinger F (2014) German-Jewish emigres and US invention. Am Econ Rev 104:3222-3255
- Larivière V, Ni C, Gingras Y, Cronin B, Sugimoto CR (2013) Bibliometrics: global gender disparities in science. Nature 504(7479):211-213
- King MM, Bergstrom CT, Correll SJ, Jacquet J, West JD (2016) Men set their own cites high: gender and self-citation across fields and over time. arXiv:1607.00376