In this section, we describe the procedure for extracting cultural similarities from co-editing activity in Wikipedia, and present the network of significant shared interests between 110 language communities. We begin by summarising our pre-analysis check of whether the language-concept overlap in Wikipedia is random.
4.1 Testing for non-randomness of co-editing patterns
Theoretically, each concept covered in Wikipedia could exist in all 288 language editions of the encyclopedia. This is possible because Wikipedia does not censor topic inclusion depending on the language of the edition, and anyone is free to contribute an article on any topic of significance. In practice, however, such complete coverage is very rare, and most concepts are covered in a limited set of language editions. Is this set of languages random? To answer this question, we analyse matrices of language co-occurrences based on a 6.5% random sample of the data (200,748 concepts).
We construct the matrix of empirical co-occurrences \(C_{ij}\), based on the probability that languages i and j have an article on the same concept. We also construct a synthetic dataset in which we preserve the distribution of languages and the number of concepts, \(N = 200{,}748\), but allow languages to co-occur at random. We use the resulting data to produce the matrix of random co-occurrences \(C^{\mathrm{rand}}_{ij}\), and compare it to the empirical matrix \(C_{ij}\). Our null model corresponds to the belief that in Wikipedia each concept has an equal chance of being covered by any language, with larger editions sharing concepts more frequently purely because of their size. Comparing the two matrices gives us a preliminary intuition of the extent to which co-editing patterns are non-random.
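The comparison between empirical and randomised co-occurrences can be illustrated with a minimal sketch. It assumes one possible reading of the null model, in which every language keeps its total number of articles but those articles are reassigned to concepts uniformly at random; the toy data and the function names (`cooccurrence`, `randomise`, `language_counts`) are hypothetical and not taken from our pipeline.

```python
import random
from collections import Counter
from itertools import combinations


def language_counts(concept_langs):
    """Number of concepts covered by each language."""
    return Counter(lang for langs in concept_langs for lang in langs)


def cooccurrence(concept_langs):
    """Fraction of concepts in which each unordered language pair co-occurs."""
    n_concepts = len(concept_langs)
    counts = Counter()
    for langs in concept_langs:
        for pair in combinations(sorted(langs), 2):
            counts[pair] += 1
    return {pair: c / n_concepts for pair, c in counts.items()}


def randomise(concept_langs, seed=0):
    """Assumed null model: keep each language's article count and the number
    of concepts, but place each language's articles on random concepts."""
    rng = random.Random(seed)
    n_concepts = len(concept_langs)
    shuffled = [set() for _ in range(n_concepts)]
    for lang, k in language_counts(concept_langs).items():
        for idx in rng.sample(range(n_concepts), k):
            shuffled[idx].add(lang)
    return shuffled


# Toy data (made up): each set lists the languages covering one concept.
observed_concepts = [{"en", "de"}, {"en", "de", "fr"}, {"en"}, {"ja"}, {"en", "fr"}]
random_concepts = randomise(observed_concepts)

c_obs = cooccurrence(observed_concepts)   # empirical C_ij
c_rand = cooccurrence(random_concepts)    # random C_ij^rand
```

In this sketch, pairs whose entry in `c_obs` is much larger than in `c_rand` co-occur more often than edition sizes alone would suggest.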
We establish that language dyads do not edit articles about the same concept (co-occur) by chance. Large editions share concepts more frequently than expected: in the data, EN-DE and EN-FR overlap in 45% of cases, whereas the null model predicts only 15%. Unsurprisingly, the overlap between editions in the data decreases with the size of the editions. One notable exception is the Japanese edition, which, despite being among the ten largest Wikipedias, co-occurs with other top editions noticeably less frequently. Conversely, the Uzbek edition, although among the ten smallest in the dataset, shows high concept overlap with large editions. By simply plotting frequencies of co-occurrences, we do not observe any local blocks or clusters among either large or small editions (see Figure A1 in Additional file 1).
These overlap differences are statistically significant, and the null model explains only 1,386 out of 11,990 language pairs (11% of observed data, 95% confidence level). Such low explained variation suggests that concept overlap is not random and cannot be explained by edition sizes alone. Instead, there are non-random, possibly cultural, processes that influence which languages cover which concepts on Wikipedia. Having evidence that the data contain a signal, we continue our investigation by performing network analysis.
4.2 Inferring the network of shared interest
We look for the languages that are consistently interested in editing articles on the same topics by comparing the differences between observed and expected co-editing activity on each concept. We give a z-score to every language pair, and compare it to the threshold of significance to filter out insignificant pairs. This logic is demonstrated in Figure 1. The result is a weighted undirected network of languages, where languages are connected based on shared information interest.
We first compute the empirical weight \(w_{ij}^{c}\) of a link between languages i, j which co-edit a concept c:
$$ w_{ij}^{c} = k_{i}^{c} k_{j}^{c}. $$
(1)
Here, \(k_{i}^{c}\) is the number of edits to the concept c in the language edition i, which we use as a proxy for the amount of editing work invested in the concept. We compute this weight for all concepts and all language permutations. To determine which links are statistically significant, and which exist purely by chance or due to size effects, we construct a null model in which we assume that links between languages i and j are random.
Let the total editing probability of a language be \(p_{i} = \frac{1}{M} \sum_{c}k_{i}^{c}\), where M is the total number of edits for all concepts and language editions. Then the expected weight \(\mathrm {E}[w_{ij}^{c}]\) of the link between languages i and j co-editing the same concept c is:
$$ \mathrm {E}\bigl[w_{ij}^{c}\bigr] = n_{c} (n_{c} - 1) p_{i} p_{j}, $$
(2)
where \(n_{c}\) is the total number of edits to a concept from all language editions. To compare the difference between observed and expected link weights, we compute a z-score \(z_{ij}^{c}\) for each concept and pair of languages i, j, defined as
$$ z_{ij}^{c} = \frac{w_{ij}^{c} - \mathrm {E}[w_{ij}^{c}]}{\sigma_{ij}^{c}}, $$
(3)
where \(\sigma_{ij}^{c}\) is the standard deviation of the expected link weight [42].
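Equations (1) to (3) can be traced on a toy example. The following is a minimal sketch, assuming edit counts are stored as a nested dictionary; the data, the function names, and in particular the value of `sigma` are hypothetical. The standard deviation \(\sigma_{ij}^{c}\) is defined in [42] and is not derived here, so the sketch takes it as an input.

```python
def edit_probabilities(edits):
    """edits: {concept: {language: k_i^c}} -> {language: p_i}, with
    p_i = (1/M) * sum_c k_i^c and M the total number of edits."""
    total_edits = sum(k for per_lang in edits.values() for k in per_lang.values())
    p = {}
    for per_lang in edits.values():
        for lang, k in per_lang.items():
            p[lang] = p.get(lang, 0.0) + k / total_edits
    return p


def observed_weight(per_lang, i, j):
    """Eq. (1): w_ij^c = k_i^c * k_j^c."""
    return per_lang.get(i, 0) * per_lang.get(j, 0)


def expected_weight(per_lang, p, i, j):
    """Eq. (2): E[w_ij^c] = n_c * (n_c - 1) * p_i * p_j."""
    n_c = sum(per_lang.values())
    return n_c * (n_c - 1) * p[i] * p[j]


def z_score(per_lang, p, i, j, sigma):
    """Eq. (3); sigma must be supplied by the caller (see [42])."""
    return (observed_weight(per_lang, i, j) - expected_weight(per_lang, p, i, j)) / sigma


# Toy edit counts (made up): two concepts, three language editions.
edits = {"c1": {"en": 4, "de": 2}, "c2": {"en": 3, "fr": 1}}
p = edit_probabilities(edits)

w = observed_weight(edits["c1"], "en", "de")       # 4 * 2
e = expected_weight(edits["c1"], p, "en", "de")    # 6 * 5 * 0.7 * 0.2
z = z_score(edits["c1"], p, "en", "de", sigma=2.0)  # sigma is a placeholder value
```

Here the observed weight exceeds the expected one, so the pair receives a positive z-score for this concept, whatever positive value \(\sigma_{ij}^{c}\) takes.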
Finally, to find the cumulative z-score for a pair of languages i, j, we sum their z-scores over all concepts
$$ z_{ij} = \sum_{c}z_{ij}^{c}. $$
(4)
The relationship between i and j is significant if the cumulative probability of their total z-score, \(z_{ij}\), falls in the right tail beyond \(p = 1 - 0.05 / N\), where N is the total number of languages. This is the Bonferroni correction [43], which we use to account for multiple comparisons and size effects in the data. For a single standard normal variable, this level corresponds to a z-score of 3.32. Since \(z_{ij}\) is a sum over many independent variables, its distribution can be approximated by a normal distribution whose standard deviation grows as \(\sqrt{L}\), so the threshold for link significance in the right tail is \(t = 3.32 \sqrt{L}\), where \(L = 3{,}066{,}736\) is the number of concepts. We create a link between a pair of languages i, j if the observed z-score, \(z_{ij}\), is above the threshold t [42].
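The threshold arithmetic can be checked with a short sketch that uses only the Python standard library. The constants are those given in the text; the per-concept z-scores and the language labels `A`, `B`, `C` are made-up values for illustration only.

```python
from math import sqrt
from statistics import NormalDist

N_LANGUAGES = 110        # N in the text
N_CONCEPTS = 3_066_736   # L in the text
ALPHA = 0.05

# Bonferroni-corrected level for one standard normal variable (about 3.32).
z_single = NormalDist().inv_cdf(1 - ALPHA / N_LANGUAGES)

# A sum of L independent unit-variance z-scores has standard deviation
# sqrt(L), which gives the threshold t = z_single * sqrt(L).
threshold = z_single * sqrt(N_CONCEPTS)


def cumulative_z(per_concept_z):
    """Eq. (4): sum per-concept z-scores for each language pair."""
    totals = {}
    for scores in per_concept_z.values():
        for pair, z in scores.items():
            totals[pair] = totals.get(pair, 0.0) + z
    return totals


# Hypothetical per-concept z-scores for two language pairs.
per_concept_z = {
    "c1": {("A", "B"): 4000.0, ("A", "C"): -3.0},
    "c2": {("A", "B"): 2500.0, ("A", "C"): 1.0},
}
totals = cumulative_z(per_concept_z)

# Keep only the pairs whose cumulative z-score exceeds the threshold.
links = {pair: z for pair, z in totals.items() if z > threshold}
```

In this toy input only the pair `("A", "B")` passes the threshold, which is of the order of \(5.8 \times 10^{3}\).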
We use the resulting z-scores to build a network of shared topical interests, where the edges are weighted by the similarity of interest, quantified via z-scores. In summary, this approach allows us to discover significant language pairs of shared interest while accounting for editions of different sizes and avoiding over-representation of the large editions [42].
Other methods exist to extract significant weights in graphs. For example, [44] used the hypergeometric distribution to find the expected link weights for bipartite networks and measured the global p-value. Serrano et al. [45] used a disparity filtering method to infer significant weights in networks. Similarly to our work, [46] derived pair-wise connection probabilities from the configuration model and used the p-value to measure the statistical significance of the links.
The network consists of 110 nodes (language editions) and 11,986 undirected edges, and is almost a complete graph. This means that most languages show at least some similarity in the concepts they edit; however, the strength of similarity differs greatly across language pairs. The distribution of edge weights is highly skewed, with the lowest z-score between Korean and Buginese and the highest between Javanese and Indonesian.
4.3 Clustering the network of significant shared interests
We use the Infomap algorithm [47] to identify language communities that are most similar in their interests. We release a random walker on the network, and allow it to travel across links with probability proportional to their weights. By measuring how long the random walker spends in each part of the network, we are able to identify clusters of languages with strong internal connections [47]. Additionally, we compare these results with those of the Louvain clustering algorithm [48] and establish that both methods show high agreement.
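The random-walk ingredient of this procedure can be sketched in a few lines. The code below is not Infomap itself: it only computes how often a walker that follows links in proportion to their weights visits each node, which is the quantity Infomap builds on. The function name `visit_frequencies`, the node labels, and the edge weights are hypothetical.

```python
def visit_frequencies(weights, steps=200):
    """Long-run visit frequencies of a random walker on an undirected
    weighted graph, obtained by repeatedly applying the transition rule
    'move to a neighbour with probability proportional to link weight'."""
    nodes = sorted({n for pair in weights for n in pair})
    neighbours = {n: {} for n in nodes}
    strength = {n: 0.0 for n in nodes}
    for (a, b), w in weights.items():
        neighbours[a][b] = w
        neighbours[b][a] = w
        strength[a] += w
        strength[b] += w
    freq = {n: 1.0 / len(nodes) for n in nodes}  # start uniformly
    for _ in range(steps):
        nxt = {n: 0.0 for n in nodes}
        for a in nodes:
            for b, w in neighbours[a].items():
                nxt[b] += freq[a] * w / strength[a]
        freq = nxt
    return freq


# Toy network (made-up weights): a tightly linked triangle A-B-C
# and a weakly attached node D.
weights = {("A", "B"): 5.0, ("B", "C"): 4.0, ("A", "C"): 3.0, ("C", "D"): 1.0}
freq = visit_frequencies(weights)
```

On a connected undirected network such as this one, the visit frequency of a node converges to its share of the total link weight, so the walker spends most of its time inside the strongly connected triangle and rarely visits `D`. Infomap uses such flow patterns to decide where cluster boundaries lie.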
Our cluster analysis suggests that no language community is completely separated from other communities; in fact, there are significant topics of common interest between almost any two languages. We find 21 clusters of two or more languages, plus 9 languages that are identified as separate clusters (see SI for full information on the clusters). Notably, English forms a cluster of its own, and this independent standing suggests little similarity of interest between English and other languages. This is an interesting finding in light of the recent discussions on whether English is becoming a global language and the most suitable lingua franca for cross-national communication [49].
The resulting network is visualised in Figure 2. The links within clusters are weighted according to the amount of positive deviation of z-score per language pair from the threshold of randomness. Stronger weights indicate higher similarity. The links are significant at the 99% level. The inter-cluster links should be interpreted with care in the context of this study, as they are weighted according to the aggregated strength of connection between all nodes of both clusters. The network is undirected since it depicts mutual topical interest of both language communities, which is inherently bidirectional. For visualisation purposes, we display only the strongest inter-cluster links and 23 language clusters. Cluster membership information is detailed in Table A1 in Additional file 2.
Cluster interpretation. Visual inspection of the language clusters suggests a number of hypotheses that might explain this network configuration. For example, (1) geographical proximity might explain the Swedish-Norwegian-Danish-Faroese-Finnish-Icelandic cluster (light blue), since these languages are mostly spoken in the Nordic countries. Other groups of languages form around (2) a local lingua franca, which is often an official language of a multilingual country, and include other regional languages that are spoken as second or even third languages within the local community. In this way, Indonesian and Malay form a cluster with Javanese and Sundanese (brown), which are the two largest regional languages of Indonesia. Similarly, one of the largest clusters in the network (purple) consists of 11 languages native to India, where multilingualism is especially common, since one might need to use different languages for contacts with the state government, with the local community, and at home [49]. Another interesting example is the cluster of languages primarily spoken in the Middle Eastern countries (yellow), which, apart from geographical proximity, are closely intertwined due to (3) a shared religious tradition. Finally, some clusters illustrate (4) recent changes in the sociopolitical situation, which can also be partially traced through bilingualism. Following the civil war of the 1990s in former Yugoslavia, its former official language, Serbo-Croatian, has been replaced by three separate languages: Serbian, Croatian, and Bosnian (green cluster). Notably, there is still a separate Serbo-Croatian Wikipedia edition. To give another example, Russian held a privileged position in the former Soviet Union, being the language of the ideology and a priority language to learn at school [49]. Even twenty years after the dissolution of the Soviet Union, Russian remains an important language of exchange between the post-Soviet countries. The similarity of interests between speakers of Russian and of the languages spoken in nearby countries, as seen in the magenta cluster, therefore comes as little surprise.
We use this anecdotal interpretation of the clusters to inform our hypotheses about the mechanisms that affect the formation of co-editing similarities. In the next section, we build on these initial interpretations and formulate them as quantifiable hypotheses. To evaluate the validity of the hypotheses, we compare their plausibility against one another using a statistical inference approach.