Testing the hypothesis of preferential attachment in social network formation
- Thomas House^{1, 2},
- Jonathan M Read^{3},
- Leon Danon^{4} and
- Matthew J Keeling^{2}
DOI: 10.1140/epjds/s13688-015-0052-2
© House et al. 2015
Received: 2 July 2015
Accepted: 28 September 2015
Published: 9 October 2015
Abstract
The hypothesis of preferential attachment (PA) - whereby better connected individuals make more connections - is hotly debated, particularly in the context of epidemiological networks. The simplest models of PA, for example, are incompatible with the eradication of any disease through population-level control measures such as random vaccination. Typically, evidence has been sought for the presence or absence of preferential attachment via asymptotic power-law behaviour. Here, we present a general statistical method to test directly for evidence of PA in count data and apply this to data for contacts relevant to the spread of respiratory diseases. We find that while standard methods for model selection prefer a form of PA, careful analysis of the best fitting PA models allows for a level of contact heterogeneity that in fact allows control of respiratory diseases. Our approach is based on a flexible but numerically cheap likelihood-based model that could in principle be applied to other integer data where the hypothesis of PA is of interest.
Keywords
MLE, phase-type distribution, model selection, spectral methods
1 Introduction
1.1 Contact heterogeneity in infectious disease epidemiology
Infectious pathogens that spread via contact between people are a major cause of human disease, driving attempts to understand their epidemiology [1]. Much theoretical work on infectious disease dynamics has been focused on the role of heterogeneity in the human population [2], which is often conceptualised as a network of epidemiologically relevant contacts [3–5].
1.2 Data
Of course, whether such a theoretical possibility matters for the study of infectious diseases depends on the actual variance in degree for epidemiologically relevant contacts. While 20th century models of infectious disease were often based on strong a priori assumptions about mixing patterns [1], various methods for measurement of contact patterns now exist and were reviewed by Read et al. [10]. As well as direct measurement of individuals through surveys [11] it is possible to improve coverage through snowball and respondent-driven sampling [12, 13], to make use of the extremely large datasets produced by electronic sensors [14, 15], and also to combine aggregate data [16, 17].
These empirical observations of high heterogeneity in contact number, together with theoretical results about \(R_{0}\), present a paradox for infectious disease epidemiology. Is the extreme heterogeneity in observed contact patterns indicative of PA? If so, does that imply that \(R_{0}>1\) for almost any finite level of person-to-person transmissibility, meaning that our theoretical understanding of infectious disease epidemiology is somehow severely lacking?
1.3 Preferential attachment and power laws in empirical data
Recent years have seen a debate about the level of heterogeneity that exists in a variety of observed networks. A particularly influential paper by Barabási and Albert [22] considered a model of network formation in which many new nodes are added to a small existing network. These new nodes connect preferentially to nodes that have more links in the existing network, leading to the asymptotic result (2) with \(\gamma=3\). In this way preferential attachment is intimately linked with, but not always equivalent to, asymptotic power-law behaviour.
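The growth mechanism described above can be illustrated with a minimal Python sketch. It is an illustration only and not code from [22]: the function name `preferential_attachment_degrees`, the choice of seed network (a complete graph on \(m+1\) nodes) and the requirement that each new node links to \(m\) distinct targets are assumptions made here for concreteness.

```python
import random


def preferential_attachment_degrees(n_nodes, m=2, seed=0):
    """Grow a network by preferential attachment and return node degrees.

    Assumptions (illustrative, not from the source): the seed network is a
    complete graph on m + 1 nodes, and each new node links to m distinct
    existing nodes chosen with probability proportional to current degree.
    """
    rng = random.Random(seed)
    degrees = [0] * n_nodes
    # 'stubs' holds one entry per edge end, so a uniform draw from it
    # selects a node with probability proportional to its degree.
    stubs = []

    # Seed network: complete graph on nodes 0..m.
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            stubs.extend((i, j))
            degrees[i] += 1
            degrees[j] += 1

    # Add the remaining nodes one at a time.
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            stubs.extend((new, t))
            degrees[new] += 1
            degrees[t] += 1
    return degrees
```

Calling `preferential_attachment_degrees(2000, m=2)` and tabulating the result gives a degree distribution whose right-hand tail can be compared with the asymptotic form discussed above; a single finite simulation should not be expected to reproduce the asymptotic exponent exactly.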
Simple power-law relationships have been claimed for numerous real-world systems, and a critical review of these claims by Clauset et al. [23] used maximum-likelihood fitting of distribution tails to power-law distributions to show varying levels of statistical support for claims in the literature. In the context of discrete data, pioneering work by Zipf [24] found power-laws in word frequencies; considering the count of unique words in Moby Dick both Newman [25] and Clauset et al. [23] agree that the statistical evidence for Zipf’s power-law distribution in this context is strong. On the other hand, the in- and out-degrees of E. coli metabolic networks have been claimed to follow a power law [26], but this is disputed by the analyses of Huss and Holme [27] and Clauset et al. [23].
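As a rough illustration of what maximum-likelihood fitting of a distribution tail involves, the following Python sketch implements the standard closed-form estimator for a continuous power law above a fixed threshold. It is a simplified sketch under stated assumptions: the threshold `x_min` is taken as given rather than estimated, no goodness-of-fit test is performed, and the continuous formula is used even though the examples above concern discrete data. The function names are hypothetical.

```python
import math
import random


def fit_power_law_tail(data, x_min):
    """Continuous power-law MLE for the tail x >= x_min.

    Returns (alpha_hat, std_err, n_tail), using
    alpha_hat = 1 + n / sum(log(x_i / x_min)) and the asymptotic
    standard error (alpha_hat - 1) / sqrt(n).
    """
    tail = [x for x in data if x >= x_min]
    n = len(tail)
    log_sum = sum(math.log(x / x_min) for x in tail)
    alpha_hat = 1.0 + n / log_sum
    std_err = (alpha_hat - 1.0) / math.sqrt(n)
    return alpha_hat, std_err, n


def sample_power_law(n, alpha, x_min, seed=0):
    """Draw n values with density proportional to x**(-alpha) for x >= x_min,
    by inverse-transform sampling (synthetic data for checking the fit)."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]
```

Fitting synthetic draws from `sample_power_law` is a useful sanity check before applying the estimator to real data, where the choice of `x_min` and a comparison against alternative heavy-tailed distributions matter at least as much as the point estimate.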
The debate around presence or absence of power laws in real data continues, perhaps most strongly in the context of networks. For example, Barabási [28] writes that preferential attachment is network science’s “most profuse concept,” and that “the impact of preferential attachment is hard to miss.” At the same time, Stumpf and Porter [29] argue that “most reported power laws lack statistical support and mechanistic backing.”
1.4 Testing preferential attachment directly
In this work, we attempt to test the hypothesis of preferential attachment in social contact data directly, rather than via asymptotic power law behaviour. We make use of previously collected data on social encounters specifically designed to measure heterogeneity in numbers of contacts amongst the British population, and fit mechanistic models of different complexity to these data. We determine that models with significant levels of preferential attachment have better evidential support from the data than models without.
2 Methods
2.1 Social Contact Survey data
A cross-sectional study was conducted between May 2009 and October 2010, recruiting households and individuals through postal and online questionnaires, supported by a large random-address mailshot and a modest online and media promotion [30, 31]. Questionnaires asked respondents to report on the number of distinct individuals they encountered the previous day: their contacts. Respondents were able to report contacts either as individuals or as members of a group with a reported size. Allowing the reporting of groups was a deliberate design choice with three aims: to permit the easy reporting of large numbers of contacts; to avoid the approach taken by previous studies [11], which imposed a high burden on respondents with large numbers of contacts; and to ensure the best capture of the right-hand tail of the degree distribution. In general, we expect that such data will become increasingly available due to the epidemiological importance of this tail (e.g. the study of Read et al. [21]).
In total, completed questionnaires were received from 5,388 participants in Great Britain, 3,901 of which were from postal surveys. There was some bias in demographic representation: most notably, younger age groups and males were generally under-represented (see Danon et al. [31] for more details). The data are available at http://wrap.warwick.ac.uk/54273/.
2.2 Generalised preferential attachment
2.3 Phase-type holding times
The question is then posed as to an appropriate distribution from which to draw the holding times \(\{T_{i}\}\) for the amount of time spent making new contacts on the day for which individuals provide data. In previous work [30] on a related model of contact formation we considered holding times \(T_{i}\) that were log-normally distributed. This provided a good fit to data, but was computationally intensive and lacked a mechanistic interpretation. We therefore consider here a class of distributions for the holding times that is highly flexible, but which has analytic and numerical benefits - the distributions of phase type [35]. Phase-type distributions are dense in the space of positive-valued probability distributions [36], meaning that they can be made arbitrarily close to any other distribution. They have a mechanistic interpretation and allow for analytic manipulations that greatly reduce the numerical cost of likelihood evaluation.
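The mechanistic interpretation of a phase-type distribution, as the time to absorption of a continuous-time Markov chain with finitely many transient phases, can be sketched in a few lines of Python. This is an illustrative sampler only and not the likelihood-based method of this paper: the function name `sample_phase_type`, the argument names `alpha` (initial phase probabilities, assumed to sum to one) and `S` (sub-generator matrix over the transient phases) are notation assumed here.

```python
import random


def sample_phase_type(alpha, S, rng):
    """Draw one time to absorption from a phase-type distribution.

    alpha : initial probabilities over the p transient phases.
    S     : p x p sub-generator (list of lists); off-diagonal entries are
            transition rates between phases, diagonal entries are minus
            the total rate of leaving each phase.
    rng   : a random.Random instance.
    """
    p = len(alpha)
    state = rng.choices(range(p), weights=alpha)[0]
    t = 0.0
    while True:
        total_rate = -S[state][state]
        # Exponentially distributed sojourn in the current phase.
        t += rng.expovariate(total_rate)
        # Rates into other phases; whatever remains is the absorption rate.
        weights = [S[state][j] if j != state else 0.0 for j in range(p)]
        exit_rate = max(total_rate - sum(weights), 0.0)
        nxt = rng.choices(range(p + 1), weights=weights + [exit_rate])[0]
        if nxt == p:  # index p stands for the absorbing state
            return t
        state = nxt
```

For example, `alpha = [1.0, 0.0]` with `S = [[-2.0, 2.0], [0.0, -2.0]]` describes two exponential phases of rate 2 in series (an Erlang distribution with mean 1), while a diagonal `S` with a mixed `alpha` gives a hyperexponential mixture. Simulation of this kind is useful for intuition and checking, whereas likelihood evaluation relies on the analytic structure of these distributions.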
In general, however, combination of (10) and (6) is not the most numerically efficient method for calculation of the overall probability mass function for final number of contacts \(K_{i}(T_{i})\) and a different approach is needed.
2.4 Numerically efficient model solution
2.5 Model likelihood, fitting and selection
We work with the likelihood function (20) using standard statistical methodology. Numerical maximum likelihood estimation was performed using simulated annealing run from multiple starting points to ensure the global optimum was obtained. Model selection was performed using AIC [38] and BIC [39], as well as likelihood ratio tests [40] on pairs of models where this test was informative. This was done since each approach involves different trade-offs between model fit and complexity, and to check that our conclusions about PA are not overly sensitive to the precise method used. Uncertainty in model parameters was quantified using confidence intervals obtained through bootstrapping the data, and uncertainty in model outputs such as the predicted degree distribution was quantified using a parametric bootstrap.
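The different trade-offs made by these criteria can be seen in a small Python sketch. The definitions of AIC, BIC and the likelihood ratio statistic are the standard ones; the two log-likelihood values and the model labels below are hypothetical numbers chosen for illustration, and taking the number of observations to be the 5,388 survey participants is an assumption about how the criteria would be applied.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_lik


def likelihood_ratio_statistic(log_lik_null, log_lik_alt):
    """Twice the log-likelihood gain of the larger (alternative) model."""
    return 2.0 * (log_lik_alt - log_lik_null)


def deltas(scores):
    """Differences from the minimum score across models."""
    best = min(scores.values())
    return {name: s - best for name, s in scores.items()}


# Hypothetical maximised log-likelihoods and parameter counts.
n_obs = 5388  # assumed number of observations
models = {"small": (-20000.0, 9), "large": (-19980.0, 20)}

delta_aic = deltas({m: aic(ll, k) for m, (ll, k) in models.items()})
delta_bic = deltas({m: bic(ll, k, n_obs) for m, (ll, k) in models.items()})
lr_stat = likelihood_ratio_statistic(models["small"][0], models["large"][0])
```

With these illustrative numbers AIC favours the larger model while BIC, whose per-parameter penalty is \(\ln n\) rather than 2, favours the smaller one. This is why it is worth checking that a conclusion holds under more than one criterion.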
3 Results and discussion
Comparison of models with different numbers of phases, with and without preferential attachment (PA), together with: number of parameters; differences in AIC and BIC values compared to the overall minimum; and the lowest divergent moment for models with PA
(Phases, PA) | No. of parameters | ΔAIC | ΔBIC | Lowest divergent moment |
---|---|---|---|---|
(1,No) | 1 | 2.2 × 10^{3} | 2.1 × 10^{3} | – |
(2,No) | 4 | 2.1 × 10^{2} | 1.5 × 10^{2} | – |
(3,No) | 8 | 1.2 × 10^{2} | 83 | – |
(4,No) | 13 | 42 | 38 | – |
(5,No) | 19 | 23 | 58 | – |
(6,No) | 26 | 27 | 1.1 × 10^{2} | – |
(1,Yes) | 2 | 1.9 × 10^{2} | 1.1 × 10^{2} | 3 |
(2,Yes) | 5 | 1.3 × 10^{2} | 72 | 4 |
(3,Yes) | 9 | 31 | \(\mathbf{[0]}\) | 3 |
(4,Yes) | 14 | 11 | 14 | 3 |
(5,Yes) | 20 | \(\mathbf{[0]}\) | 42 | 3 |
(6,Yes) | 27 | 9 | 97 | 3 |
For the 3-phase model with PA, \(\tau= 0.018\ [0.012,0.026]\); if we set \(\tau=0\) but leave the other parameters at their fitted values, then the total number of contacts per person is reduced to 64% of its original value. For the 5-phase model with PA, \(\tau= 0.026\ [0.019,0.036]\); if we set \(\tau=0\) but leave the other parameters at their fitted values, then the total number of contacts per person is reduced to 58% of its original value. In both of these models, therefore, a substantial fraction of the contacts can be attributed to PA.
We also calculate that the second moment does not diverge in any of the fitted models, which helps to resolve the epidemiological paradox that we introduced at the start of this paper. PA is empirically supported, and is also mechanistically plausible since existing social contacts give more opportunities for future social contact. Combined with a sufficiently detailed phase-based mechanistic model of the contexts in which social contacts are made, however, PA does not imply a divergent second moment for the distribution of contacts relevant for the spread of directly transmitted infections. This means that our understanding of how basic epidemiological quantities like the basic reproductive ratio, \(R_{0}\), are related to contact networks does not need to be revised in the light of empirical evidence.
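A short worked relation may help to make the link between divergent moments and \(R_{0}\) concrete. It assumes, purely for illustration, that the degree distribution has a power-law tail \(p_{k}\sim Ck^{-\gamma}\), and it uses the common heuristic that \(R_{0}\) on a network scales with the ratio of the second to the first moment of the degree distribution. The symbols \(C\), \(m\) and \(m^{*}\) are notation introduced here.

\[
\langle k^{m}\rangle=\sum_{k}k^{m}p_{k}\sim C\sum_{k}k^{m-\gamma}<\infty
\iff m<\gamma-1,
\qquad
R_{0}\propto\frac{\langle k^{2}\rangle}{\langle k\rangle}.
\]

Under this assumption the lowest divergent integer moment is \(m^{*}=\lceil\gamma-1\rceil\). A value \(m^{*}=3\) then corresponds to \(3<\gamma\leq4\), for which \(\langle k^{2}\rangle\) is finite, whereas \(\gamma=3\) gives \(m^{*}=2\) and a divergent second moment.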
As a final observation, we believe that as computational resources for fitting models to data improve, it will in general be easier to test the hypothesis of PA directly in all kinds of data, rather than looking for asymptotic power laws.
Declarations
Acknowledgements
The Social Contact Survey was funded by the Medical Research Council, grant number G0701256. TH and MJK are supported by the Engineering and Physical Sciences Research Council. JMR and MJK are supported by the Economic and Social Research Council, grant ES/K004255/1. LD is supported by the Leverhulme Trust.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
- Anderson RM, May RM (1991) Infectious diseases of humans. Oxford University Press, Oxford
- Diekmann O, Heesterbeek JAP, Britton T (2012) Mathematical tools for understanding infectious disease dynamics. Princeton University Press, Princeton
- Bansal S, Grenfell BT, Meyers LA (2007) When individual behaviour matters: homogeneous and network models in epidemiology. J R Soc Interface 4(16):879-891
- Danon L, Ford AP, House T, Jewell CP, Keeling MJ, Roberts GO, Ross JV, Vernon MC (2011) Networks and the epidemiology of infectious disease. Interdiscip Perspect Infect Dis 2011:284909
- Pellis L, Ball F, Bansal S, Eames K, House T, Isham V, Trapman P (2014) Eight challenges for network epidemic models. Epidemics 10:58-62. doi:10.1016/j.epidem.2014.07.003
- Diekmann O, Heesterbeek JAP (2000) Mathematical epidemiology of infectious diseases: model building, analysis and interpretation. Wiley, New York
- Pastor-Satorras R, Vespignani A (2001) Epidemic dynamics and endemic states in complex networks. Phys Rev E 63:066117
- May RM, Lloyd AL (2001) Infection dynamics on scale-free networks. Phys Rev E 64:066112
- Durrett R (2010) Some features of the spread of epidemics and information on a random graph. Proc Natl Acad Sci USA 107(10):4491-4498
- Read JM, Edmunds WJ, Riley S, Lessler J, Cummings DAT (2012) Close encounters of the infectious kind: methods to measure social mixing behaviour. Epidemiol Infect 140(12):2117-2130
- Mossong J, Hens N, Jit M, Beutels P, Auranen K, Mikolajczyk R, Massari M, Salmaso S, Tomba GS, Wallinga J, Heijne J, Sadkowska-Todys M, Rosinska M, Edmunds WJ (2008) Social contacts and mixing patterns relevant to the spread of infectious diseases. PLoS Med 5(3):381-391
- Goodman LA (1961) Snowball sampling. Ann Math Stat 32:148-170
- Heckathorn DD (1997) Respondent-driven sampling: a new approach to the study of hidden populations. Soc Probl 44:174-199
- Salathé M, Kazandjieva M, Lee JW, Levis P, Feldman MW, Jones JH (2010) A high-resolution human contact network for infectious disease transmission. Proc Natl Acad Sci USA 107(51):22020-22025
- Isella L, Stehlé J, Barrat A, Cattuto C, Pinton J, Van den Broeck W (2011) What’s in a crowd? Analysis of face-to-face behavioral networks. J Theor Biol 271(1):166-180
- Eubank S, Guclu H, Kumar VSA, Marathe MV, Srinivasan A, Toroczkai Z, Wang N (2004) Modelling disease outbreaks in realistic urban social networks. Nature 429(6988):180-184
- Eubank S, Barrett C, Beckman R, Bisset K, Durbeck L, Kuhlman C, Lewis B, Marathe A, Marathe M, Stretz P (2010) Detail in network models of epidemiology: are we there yet? Journal of Biological Dynamics 4(5):446-455
- Fournet J, Barrat A (2014) Contact patterns among high school students. PLoS ONE 9(9):e107878
- Schneeberger A, Mercer CH, Gregson SAJ, Ferguson NM, Nyamukapa CA, Anderson RM, Johnson AM, Garnett GP (2004) Scale-free networks and sexually transmitted diseases: a description of observed patterns of sexual contacts in Britain and Zimbabwe. Sex Transm Dis 31(6):380-387
- Leigh Brown AJ, Lycett SJ, Weinert L, Hughes GJ, Fearnhill E, Dunn DT (2011) Transmission network parameters estimated from HIV sequences for a nationwide epidemic. J Infect Dis 204(9):1463-1469
- Read JM, Lessler J, Riley S, Wang S, Tan LJ, Kwok KO, Guan Y, Jiang CQ, Cummings DAT (2014) Social mixing patterns in rural and urban areas of southern China. Proc R Soc B 281(1785):20140268
- Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509-512
- Clauset A, Shalizi CR, Newman MEJ (2009) Power-law distributions in empirical data. SIAM Rev 51(4):661-703
- Zipf GK (1949) Human behavior and the principle of least effort: an introduction to human ecology. Addison-Wesley, Reading
- Newman MEJ (2005) Power laws, Pareto distributions and Zipf’s law. Contemp Phys 46(5):323-351
- Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL (2000) The large-scale organization of metabolic networks. Nature 407(6804):651-654
- Huss M, Holme P (2007) Currency and commodity metabolites: their identification and relation to the modularity of metabolic networks. IET Syst Biol 1(5):280-285
- Barabási AL (2012) Network science: luck or reason. Nature 489(7417):507-508
- Stumpf MPH, Porter MA (2012) Critical truths about power laws. Science 335(6069):665-666
- Danon L, House T, Read JM, Keeling MJ (2012) Social encounter networks: collective properties and disease transmission. J R Soc Interface 9(76):2826-2833
- Danon L, Read JM, House T, Vernon MC, Keeling MJ (2013) Social encounter networks: characterizing Great Britain. Proc R Soc B 280(1765):20131037
- Durrett R (2007) Random graph dynamics. Cambridge University Press, Cambridge
- Simkin MV, Roychowdhury VP (2011) Re-inventing Willis. Phys Rep 502(1):1-35
- Yule GU (1925) A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philos Trans R Soc Lond B, Contain Pap Biol Character 213:21-87
- Neuts MF (1981) Matrix-geometric solutions in stochastic models: an algorithmic approach. Johns Hopkins University Press, Baltimore
- Neuts MF (1975) Probability distributions of phase type. In: Liber amicorum Professor emeritus Dr. H. Florin. Katholieke Universiteit Leuven, Departement Wiskunde, Leuven, pp 173-206
- Bailey NTJ (1957) The mathematical theory of epidemics. Griffin, London
- Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716-723
- Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461-464
- Neyman J, Pearson ES (1933) On the problem of the most efficient tests of statistical hypotheses. Philos Trans R Soc Lond A, Math Phys Eng Sci 231:289-337