Hypothesis and formalisation | Notation | Description | Data source |
---|---|---|---|
H0: Uniform hypothesis \(t_{ij} = 1 \) | – | All co-occurrences are equally probable, i.e. every edition i covers the same concept as edition j with a constant probability | – |
H1: Shared language family \(t_{ij} = \lvert f_{i} \cap f_{j} \rvert \) | \(f_{i}\) is the set of branches describing the full language family profile of language i, \(t_{ij}\) is the count of shared branches in the family tree of i and j | Language communities of linguistically related languages will show more co-editing similarity | The data on language family classification was taken from English Wikipedia infoboxes of articles on each of 110 languages, such as ‘Hebrew language’ |
H2: Bilingual population within a country \(t_{ij} = \frac{1}{N_{ij}} \sum_{A} p(i)_{A} p(j)_{A} \) | \(p(i)_{A}\), \(p(j)_{A}\) are proportions of speakers of i, j in a country A, \(N_{ij}\) is the number of countries where i, j are co-spoken | Multilingual editors belong to multiple cultural communities and might serve as bridges between them. The more bilinguals speaking i and j live in the same country, the higher the transition belief | Territory-language information was downloaded from [51], and is based on the data from the World Bank, Ethnologue, FactBook, and other sources, including per-country census data |
H3: Geographical proximity of languages \(t_{ij} = \frac{1}{N_{ij}} \sum_{A,B} \frac{ d_{\mathrm {min}} }{ d_{AB}} \) | \(N_{ij}\) is the number of country pairs where i or j are spoken as a primary language, \(d_{AB}\) is the Euclidean distance between each pair of countries, and \(d_{\mathrm{min}}\) is the smallest distance between countries in the dataset | The smaller the distance between speakers of i and j living in separate countries, the higher the chances for languages i, j to cover the same concept. We consider one (primary) language per country | Distance between countries is computed as Euclidean distance in kilometers between country capitals [52] |
H4: Gravity law: demographic force attracting language communities \(t_{ij} = \frac{1}{N_{ij}} \sum_{A,B} \frac{m_{A,i} m_{B,j}}{d_{AB}^{2}} \) | \(m_{A,i}\) is the number of speakers of the primary language i in a country A, \(d_{AB}\) is the Euclidean distance between each pair of countries, \(N_{ij}\) is the number of country pairs where i or j are spoken as a primary language | The larger the language-speaking population and the smaller the distance between the countries A, B, the stronger the attraction between i and j. Based on the countries’ primary languages | Country population data is taken from CIA Factbook [52] |
H5: Shared religion \(t_{ij}= \left\{\begin{array}{l@{\quad}l} 1, & \text{if } r_{i}=r_{j}\\ 0, & \text{otherwise} \end{array} \right.\) | \(r_{i}\) is the dominating religion of a language community. It is defined as the most common religion in the list of countries whose primary language is i | Cultures which profess the same religion will show consistent interest in the same topics | The data on world religions was taken from the most recent 2010 Report on Religious Diversity provided by the Pew Research Center [53] |
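
As a rough illustration of how the transition beliefs in the table could be computed, the sketch below implements H1, H2, H4 and H5 in Python. It is not the authors' code: the function names, the dictionary-based inputs and the toy values are hypothetical, and two readings of the table are assumptions. For H2, \(N_{ij}\) is taken as the number of countries present in both proportion maps. For H4, \(N_{ij}\) is taken as the number of pairs (A, B) with A ≠ B, where i is primary in A and j is primary in B.

```python
from itertools import product


def shared_family(f_i, f_j):
    """H1: number of family-tree branches shared by languages i and j."""
    return len(set(f_i) & set(f_j))


def bilingual_belief(p_i, p_j):
    """H2: mean of p(i)_A * p(j)_A over countries where i and j are co-spoken.

    p_i and p_j map a country code to the proportion of speakers there.
    Assumption: a country counts as co-spoken if it appears in both maps.
    """
    shared = [a for a in p_i if a in p_j]
    if not shared:
        return 0.0
    return sum(p_i[a] * p_j[a] for a in shared) / len(shared)


def gravity_belief(m_i, m_j, dist):
    """H4: mean of m_{A,i} * m_{B,j} / d_AB^2 over country pairs (A, B).

    m_i and m_j map a country code to the number of primary-language speakers.
    dist maps frozenset({A, B}) to the distance between A and B.
    Assumption: pairs with A == B are skipped, since d_AB would be zero.
    """
    pairs = [(a, b) for a, b in product(m_i, m_j) if a != b]
    if not pairs:
        return 0.0
    total = sum(m_i[a] * m_j[b] / dist[frozenset((a, b))] ** 2 for a, b in pairs)
    return total / len(pairs)


def same_religion(r_i, r_j):
    """H5: 1 if both language communities share a dominating religion, else 0."""
    return 1 if r_i == r_j else 0


# Toy inputs, invented purely for illustration (not taken from the data sources).
h1 = shared_family({"b1", "b2", "b3"}, {"b1", "b2", "b4"})
h2 = bilingual_belief({"A": 0.5, "B": 0.2}, {"A": 0.4, "C": 0.9})
h4 = gravity_belief({"A": 10.0}, {"B": 20.0}, {frozenset(("A", "B")): 10.0})
h5 = same_religion("r1", "r1")
```

Each function returns a single unnormalised belief \(t_{ij}\) for one language pair; filling a full belief matrix would amount to calling the chosen function for every pair (i, j).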