Linguistic neighbourhoods: explaining cultural borders on Wikipedia through multilingual co-editing activity

Samoilenko, Anna; Karimi, Fariba; Edler, Daniel; Kunegis, Jérôme; Strohmaier, Markus

doi:10.1140/epjds/s13688-016-0070-8

EPJ Data Science

Table 1 Formalisation of hypotheses to explain the probability of language dyads to co-edit a Wikipedia article about the same concept


Hypothesis and formalisation	Notation	Description	Data source
H0: Uniform hypothesis \(t_{ij} = 1 \)	–	All co-occurrences are equally probable, i.e. every edition i covers the same concept as edition j with a constant probability	–
H1: Shared language family \(t_{ij} = \|f_{i} \cap f_{j}\| \)	\(f_{i}\) is the set of branches describing the full language family profile of language i, \(t_{ij}\) is the count of shared branches in the family tree of i and j	Language communities of linguistically related languages will show more co-editing similarity	The data on language family classification was taken from English Wikipedia infoboxes of articles on each of 110 languages, such as ‘Hebrew language’
H2: Bilingual population within a country \(t_{ij} = \frac{1}{N_{ij}} \sum_{A} p(i)_{A} p(j)_{A} \)	\(p(i)_{A}\), \(p(j)_{A}\) are proportions of speakers of i, j in a country A, \(N_{ij}\) is the number of countries where i, j are co-spoken	Multilingual editors belong to multiple cultural communities and might serve as bridges between them. The more bilinguals speaking i and j live in the same country, the higher the transition belief	Territory-language information was downloaded from [51], and is based on the data from the World Bank, Ethnologue, FactBook, and other sources, including per-country census data
H3: Geographical proximity of languages \(t_{ij} = \frac{1}{N_{ij}} \sum_{A,B} \frac{ d_{\mathrm {min}} }{ d_{AB}} \)	\(N_{ij}\) is the number of country permutations where i or j are spoken as primary language, \(d_{AB}\) is Euclidean distance between each pair of countries, and \(d_{\mathrm{min}}\) is the smallest distance between countries in the dataset	The smaller the distance between speakers of i and j living in separate countries, the higher the chances for languages i, j to cover the same concept. We consider one (primary) language per country	Distance between countries is computed as Euclidean distance in kilometers between country capitals [52]
H4: Gravity law - demographic force attracting language communities \(t_{ij} = \frac{1}{N_{ij}} \sum_{A,B} \frac{m_{A,i} m_{B,j}}{d_{AB}^{2}} \)	\(m_{A,i}\) is the number of speakers of the primary language i in a country A, \(d_{AB}\) is Euclidean distance between each pair of countries, \(N_{ij}\) is the number of country pairs where i or j are spoken as primary language	The larger the language-speaking population and the smaller the distance between the countries A, B, the stronger the attraction between i and j. Based on the countries’ primary languages	Country population data is taken from CIA Factbook [52]
H5: Shared religion \(t_{ij}= \left\{\begin{array}{l@{\quad}l} 1, & \text{if } r_{i}=r_{j}\\ 0, & \text{otherwise} \end{array} \right.\)	\(r_{i}\) is the dominating religion of a language community. It is defined as the most common religion in the list of countries whose primary language is i	Cultures which profess the same religion will show consistent interest in the same topics	The data on world religions was taken from the most recent 2010 Report on Religious Diversity provided by the Pew Research Center [53]
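
The formalisations in the table can be made concrete for a single language dyad. The following is a minimal Python sketch, not the authors' implementation. The function names, the dictionary-based inputs and the toy values are assumptions made for illustration. In particular, H2 treats a country as co-spoken when both languages have an entry for it, and H4 pairs every country whose primary language is i with every country whose primary language is j.

```python
def shared_family(f_i, f_j):
    """H1: count of family-tree branches shared by languages i and j."""
    return len(set(f_i) & set(f_j))


def bilingual_population(p_i, p_j):
    """H2: mean product of speaker proportions over co-spoken countries.

    p_i, p_j map a country code to the proportion of its population
    speaking i (respectively j). Assumption: a country is co-spoken
    when it appears in both mappings.
    """
    co_spoken = set(p_i) & set(p_j)
    if not co_spoken:
        return 0.0
    return sum(p_i[a] * p_j[a] for a in co_spoken) / len(co_spoken)


def gravity(m_i, m_j, dist):
    """H4: mean of m_A,i * m_B,j / d_AB^2 over country pairs (A, B).

    m_i, m_j map a country code to the number of speakers of its
    primary language; dist maps a pair (A, B) to a distance in km.
    Assumption: A ranges over countries with primary language i and
    B over countries with primary language j.
    """
    pairs = [(a, b) for a in m_i for b in m_j if a != b]
    if not pairs:
        return 0.0
    total = sum(m_i[a] * m_j[b] / dist[(a, b)] ** 2 for a, b in pairs)
    return total / len(pairs)


def shared_religion(r_i, r_j):
    """H5: 1 if both communities share a dominating religion, else 0."""
    return 1 if r_i == r_j else 0


# Toy inputs with hypothetical branch labels and country codes.
print(shared_family({"A", "A1", "A1a"}, {"A", "A1", "A1b"}))
print(bilingual_population({"X": 0.5, "Y": 0.2}, {"X": 0.4, "Z": 0.9}))
print(gravity({"X": 10}, {"Y": 20}, {("X", "Y"): 10.0}))
print(shared_religion("r1", "r2"))
```

Each function returns one cell \(t_{ij}\) of the corresponding hypothesis matrix; filling the full matrix would mean evaluating it for every ordered pair of distinct languages.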

The hypotheses aim to explain the values of link weights (z-scores) in the network of co-editing similarity (see Figure 2 for an illustration). The transition probability matrices are square with dimensions N = 110, corresponding to the number of language editions studied. The diagonal is empty, since self-loops are not allowed. The value \(t_{ij}\) expresses the hypothesised probability that Wikipedia language editions i and j cover the same concept. After construction, the hypothesis matrices undergo Laplacian smoothing of weight 1 (for HypTrails hypothesis testing only) and are then normalised row-wise. The process is illustrated in Figure 3. The results of hypothesis testing are presented in Figure 4 for the HypTrails approach and in Figure 2 for the MRQAP approach, and are discussed in Sections 5.2 and 5.3, respectively.
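
The smoothing and row-wise normalisation described in this note might be sketched as follows. This is an illustrative assumption rather than the authors' code: the function name is hypothetical, smoothing is taken to mean adding a constant pseudo-count of 1 to every off-diagonal cell, and the diagonal is kept at zero because self-loops are not allowed.

```python
def smooth_and_normalise(matrix, weight=1.0):
    """Add a pseudo-count to off-diagonal cells, then normalise each row.

    matrix is a square list of lists of hypothesis values t_ij.
    Assumption: the diagonal stays at zero (no self-loops) and is
    excluded from smoothing.
    """
    result = []
    for i, row in enumerate(matrix):
        smoothed = [
            0.0 if i == j else value + weight
            for j, value in enumerate(row)
        ]
        total = sum(smoothed)
        # Guard against an all-zero row (only possible when weight is 0).
        result.append([v / total if total else 0.0 for v in smoothed])
    return result


# Toy 3 x 3 hypothesis matrix with an empty diagonal.
t = [
    [0, 1, 3],
    [1, 0, 0],
    [3, 0, 0],
]
for row in smooth_and_normalise(t):
    print(row)
```

After this step every row sums to 1, so each row can be read as a distribution over the other language editions. Smoothing also ensures that dyads with a hypothesis value of zero keep a small non-zero probability.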
