Figure 1 | EPJ Data Science

Figure 1

From: Mapping language literacy at scale: a case study on Facebook

Creating the online language literacy estimate. Our methodology produces the online language literacy estimate (OLLE) in three major steps: 1. (A)–(C) The set of LoFF words (lower-frequency frequent words) is determined algorithmically from vocabulary popularity in the language corpus; the red bands in (A)–(C) indicate the selected sets of LoFF words in the three most widely used languages in our data (English, Spanish, Arabic). 2. (D)–(F) The normalized total occurrence of LoFF words in the Facebook dataset from each country is used as a language-specific online literacy estimate for that country. (D)–(F) show the strong correlations found between our estimates and countries’ officially reported literacy rates in English, Spanish, and Arabic, respectively. 3. (G) The calibrated global estimates, OLLE, are generated after addressing language group bias and are shown here to correlate strongly with reported literacy rates (Spearman’s rank correlation coefficient \(\rho =0.78\), 95% CI \([0.69, 0.84]\), \(p<0.001\)). Error bounds represent the 95% confidence intervals. In (D)–(G), each dot represents a country, with the x value indicating the country’s raw (D)–(F) or calibrated (G) literacy estimate and the y value the country’s officially reported literacy rate.
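
Steps 2 and 3 of the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the caption does not say how the occurrence count is normalized, so dividing by the total token count is an assumption, and the helper names `loff_rate`, `average_ranks`, and `spearman_rho` are hypothetical. The LoFF selection in step 1 and the calibration for language group bias in step 3 are not reproduced here.

```python
from collections import Counter


def loff_rate(tokens, loff_words):
    """Share of tokens that fall in the LoFF set.

    Assumption: "normalized" means divided by the total token count;
    the caption does not specify the normalization.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[w] for w in set(loff_words))
    return hits / len(tokens)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5
```

In use, one `loff_rate` value per country would form the x values of panels (D)–(F), and `spearman_rho` applied to those values and the officially reported literacy rates would give a rank correlation of the kind quoted in the caption. The confidence interval and p-value are outside the scope of this sketch.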