Estimating educational outcomes from students’ short texts on social media

Digital traces have become an essential source of data in social sciences because they provide new insights into human behavior and allow studies to be conducted on a larger scale. One particular area of interest is the estimation of various users’ characteristics from their texts on social media. Although it has been established that basic categorical attributes could be effectively predicted from social media posts, the extent to which it applies to more complex continuous characteristics is less understood. In this research, we used data from a nationally representative panel of students to predict their educational outcomes measured by standardized tests from short texts on a popular Russian social networking site VK. We combined unsupervised learning of word embeddings on a large corpus of VK posts with a simple, supervised model trained on individual posts. The resulting model was able to distinguish between posts written by high- and low-performing students with an accuracy of 94%. We then applied the model to reproduce the ranking of 914 high schools from 3 cities and of the 100 largest universities in Russia. We also showed that the same model could predict academic performance from tweets as well as from VK posts. Finally, we explored predictors of high and low academic performance to obtain insights into the factors associated with different educational outcomes.


Introduction
It is already well-known that various human attributes, from socio-economic status to sexual preferences, can be predicted from digital traces on social media platforms.However, in most cases, predictive models rely on specific features of a particular source of digital trace data.For instance, it was shown in Ref. [1] that a wide range of human traits could be effectively predicted from likes on Facebook, while in Ref. [2] the number of Twitter followers was used (along with other features) to predict users' income.The use of features such as likes or number of followers means that the application of the resulting model is limited to a particular platform, e.g.Facebook or Twitter.Some forms of digital traces are, in contrast, ubiquitous across multiple platforms.For instance, short texts written by users can be found in the form of tweets, posts on Facebook, comments on message boards, reviews on e-commerce sites, etc.In this paper, we explore the predictive power of short texts with respect to the academic performance of users.First, we build a model that predicts the educational outcomes of users from their posts on a popular Russian social networking site, VK, and then test its predictive performance in different contexts.In particular, we use it to predict the rankings of schools and universities based on the public posts of their students on VK.We also test the generalizability of the model by applying it to users' tweets rather then VK posts.
We use data from a representative Russian panel study titled "Trajectories in Education and Careers" (TrEC) [3] that tracks 4,400 students who participated in the Programme for International Student Assessment (PISA) [4] in 2012.In addition to survey data, this data set contains information about public posts on VK for those participants who agreed to share their VK data (N = 3,483).We use this data to build a model that predicts the PISA scores of users based on the content of their posts.We combine unsupervised learning of word embeddings on a large corpus of VK posts (1.9B tokens) with a simple supervised model trained on individual posts (see Methods for details).
In the next step, we collect data about the academic performance of students from the 100 largest Russian universities and from 914 schools from 3 cities (see Methods).The performance is measured as results of the Unified State Examination (USE), a mandatory examination for all high school graduates in Russia.This performance is available in aggregated form, i.e. for each school the average USE score of its graduates is known and for each university the average USE score of its enrollees is known.We then find VK users who indicated in their profiles that they are from these schools and universities (N = 154, 637) and apply our model to their public posts.This allows us to rank schools and universities according to their estimated scores from social media posts and then compare the result with an actual ranking based on USE scores.
To further test the generalizability of the model, we select users who indicate a link to their Twitter account in their VK profiles.We download the tweets of these users and check if their academic performance can still be predicted by using the same model applied to tweets instead of VK posts.
Finally, we use our model to explore the variation in the language of users with different educational outcomes.To this end, we compute predicted scores for individual words.This approach is similar to the open-vocabulary method [5], however, the use of continuous word representations makes the results less sensitive to the frequency at which a word appears in the training data set.In fact, it even allows computing of reliable scores for words that are absent in the training data set as long as they are frequent enough in the corpus on which word embeddings are trained.

TrEC data
We used data from the Russian Longitudinal Panel Study of Educational and Occupational Trajectories (TrEC) [3].The study tracks 4,400 students from 42 Russian regions who took the PISA test in 2012 [4].We used PISA reading scores as a measure of students' academic performance.PISA defines reading literacy as "understanding, using, reflecting on and engaging with written texts in order to achieve one's goals, to develop one's knowledge and potential, and to participate in society" and considers it a foundation for achievement in other subject areas within the educational system and also a prerequisite for successful participation in most areas of adult life [6].PISA scores are scaled so that the OECD average is 500 and the standard deviation is 100, while every 40 score points roughly correspond to the equivalent of one year of formal schooling [4].
In 2018, publicly available information from the social networking site VK was collected for 3,483 TrEC participants who provided informed consent for the use of this data for research purposes.Note that while the initial sample was representative of the 9th-grade high school Russian students in 2012, the social network data is not necessarily representative.There were no publicly available posts for 498 users.The median number of public posts for remaining users was 35.We removed posts that contain URLs from our data set to account for potentially automated postings, and we also excluded re-posts and posts with no text.This resulted in the final data set of 130,575 posts from 2,468 users.

Model
We trained the fastText model [7] on the VK corpus (1.9B tokens, vocabulary size is 2.5M) to obtain vector representations of Russian words (the model is available at [8]).We then represented each post as a 300-dimensional vector by averaging over all its constituent words.We used this representation to train a linear regression model to predict the PISA scores of the posts' authors.The model was trained on individual posts and the predicted value for users was computed as an average predicted value of all their posts.

Schools and university data
VK provides an application programming interface (API) that enables the downloading of information systematically from the site.In particular, downloading user profiles from particular educational institutions and within selected age ranges is possible.For each user, obtaining a list of their public posts is also possible.According to the VK Terms of Service: "Publishing any content on his/her own personal page, including personal information, the User understands and accepts that this information may be available to other Internet users, taking into account the architecture and functionality of the Site."The VK team confirmed that their public API could be used for research purposes.
We created a list of high schools in Saint Petersburg (N = 601), Samara (N = 214), and Tomsk (N = 99) and then accessed the IDs of users who indicated on VK that they graduated from one of these schools.We removed profiles with no friends from the same school, profiles that already belong to the TrEC data set, and users who indicated several schools in their profiles.The public posts of the remaining users were downloaded and our model was applied to them to get a prediction of the users' academic performance.We then estimated the educational outcomes of a school by averaging the predictions of its students' performance.Overall, 1,064,371 posts from 38,833 users were used at this stage of analysis.The same procedure was performed to obtain a prediction of academic performance for students from the 100 largest universities in Russia (N users = 115, 804; N posts = 6, 508, 770).Moscow State University was excluded from the analysis as it is known to be a default choice for bots and fake profiles.Even the application of the aforementioned filtering method does not allow reliable data to be obtained for this university, i.e. there is still an order of magnitude more user profiles than the real number of students for a given cohort.
We used data from the web portal "Schools of Saint Petersburg" [9] to obtain the average performance of schools' graduates in the Unified State Examination (USE).This is a mandatory state examination that all school graduates should pass in Russia.The USE scores for Samara were provided by the web portal "Zeus" [10].The USE scores of the Tomsk schools along with the data on university enrollees [11] were collected by the Higher School of Economics.This information was used to check if the scores predicted from social media data correspond to the ranking of schools and universities based on their USE results.Note that this means that the model was tested not only on a different set of users but that the measure of academic performance was also different from the training settings.

Twitter data
VK allows users to indicate links to other social media accounts, including Twitter, in their profiles.Only a small proportion of users provide links to their Twitter accounts.In our sample information, about 665 Twitter accounts were available for the Saint-Petersburg data set (less for other cities) and 2,836 Twitter accounts were available for the university data set.This allowed the analysis to be performed only for the university data set.The latest tweets of these 2,836 users were downloaded via Twitter's API.Note that unlike tweets, VK posts are not limited to 140 or 280 characters.However, most VK posts are short texts (85.2% of the posts are less than 140 characters and 92.6% of the posts are less than 280 characters in our sample).

Exploring differential language use
Thanks to the simplicity of our model, it is possible to obtain meaningful scores for individual words.By construction, the post score is equal to the average scores of its constituent words.This means that it is enough to explore the individual word scores to interpret the model.We ranked all 2.5M words from our dictionary based on the predicted scores and used this ranking to compare the language of users with different educational outcomes.This approach is similar to the open-vocabulary approach [5,12] because it operates on a level of individual words and allows one to find relations that are not captured with traditional closed dictionaries.The main difference is that our approach uses continuous, rather than discrete, word representations and incorporates rich knowledge about the language structure learned from the training of unsupervised word embeddings.

Prediction
We first explored the predictive power of common text features with respect to academic performance.We found a small negative effect for the use of capitalized words (P = 2 × 10 −3 ), emojis (P = 7 × 10 −3 ), and exclamations (P = 0.05), as seen in Fig. 1.The use of Latin characters (P = 5 × 10 −3 ), average post length (P = 2 × 10 −4 ), word length (P = 4 × 10 −4 ), and vocabulary size (P < 10 −10 ) are positively correlated with academic performance.The strongest correlation was found for the information entropy of users' texts (Pearson's r = 0.20, P < 10 −15 ).We used a TF-IDF model to obtain a base-line prediction of academic performance from the users' posts.We selected the 1000 most common unigrams and bigrams from our corpus, excluding stop words, for the Russian language.We then applied a TF-IDF transformation to represent posts as 1000-dimensional vectors and then trained a linear regression model on individual posts to predict the academic performance of their authors.The correlation between predicted and real scores is r = 0.285.Here, and for the following models, we report results on the user level obtained using leave-one-out cross-validation, i.e. scores for posts of a certain user were obtained from the model trained on posts of all other users.We obtained significantly better results with a model that used word-embeddings (see Methods).We also find that embeddings trained on the VK corpus outperform models trained on the Wikipedia and Common Crawl corpora (Table 1).0.420 The predictive power of a model depends on the number of posts available for each user (see Fig. 2).If only one post is available per user, the predictive power is rather low (r = 0.237).However, it increases with the number of posts available, reaching r = 0.541 for 20 posts per user.

Transfer
Figure 3 shows the correlation between the predicted performance of schools (a-c) and universities (d) and the USE scores of their graduates or enrollees.In all four cases, we find a relatively strong signal despite the fact that the VK sample might not be representative and that academic performance was measured differently than in training settings, being available only in aggregated form and from secondary sources.
Intriguingly, we find that the substitution of VK posts by tweets doesn't substantially alter the resulting performance (see Fig. 4)).For fair comparison, we use VK data only for those users for whom Twitter data was also available.This is why the performance of the model is substantially lower than in Fig. 3d, where all available VK data was used.

Differential language use
We explored the resulting model by selecting 400 words with the highest and lowest scores that appear at least 5 times in the training corpus.A t-SNE representation of the embeddings produced by our model helped [13] to identify several thematic clusters (Fig. 5).High performing clusters include English words (above, saying, yours, must), words related to literature (Dandelion, Bradbury, Fahrenheit, Orwell, Huxley, Faulkner, Nabokov, Brodsky, Camus, Mann, Shelley, Shakespeare), words related to reading (read, reread, published, book, volume), words related to physics (universe, hole, string, theory, quantum, Einstein, Newton, Hawking), and words related to thinking processes including various synonyms of "thinking" and "remembering." Low performing clusters include common spelling errors and typos, names of popular computer games, words related to military service (army, to serve, military oath), horoscopes (Aries, Sagittarius), and cars and road accidents (traffic collision, General Administration for Traffic Safety, wheels, tuning).The use of continuous word representations allows one to compute scores even for words that were not present in the training data set.We computed the scores for all 2.5M words in our vector model and made it available for further exploration [8].This could be used for exploratory analysis to get insights into differential use of language with respect to academic performance and could be applied to various domains, from literature to politics or food (Fig. 6)

Discussion
In traditional cross-validation settings, one data set is randomly split into two parts: the model is trained on one part, and then its performance is assessed using the other part.In our case, we use two different data sources (VK and Twitter) and different measures of academic performance (PISA scores and USE scores).Some additional limitations were that the PISA scores were collected in 2012, while most of the posts were written much later, and that the USE scores were available only at the aggregated level.Despite these limitations, we found a relatively strong signal in the data.For instance, we were able to explain 69% of the variation in universities' scores using information about VK posts of users from these universities.While the result for tweets was significantly lower (R 2 = 0.26), this was probably at least partly due to the smaller sample size.Also, note that the prediction of individual scores substantially depends on the number of available posts.If only one post per user is used for prediction, then 6% of the variation in their academic performance could be explained by our model.This number rises to 29% if 20 posts are used.
The ability to predict the ranking of educational organizations might seem trivial given that direct ranking information is readily available.However, USE scores are measured only once per cohort, making it extremely hard to estimate any added value provided by an educational organization.A comparison of predicted scores for the same users at different points in time might shed light on factors contributing to the students' progress.
We demonstrate how domain-specific unsupervised learning of word embeddings allows predictive models to be trained using relatively small labeled data sets.One reason is that even words that are rare in the training or testing data sets could be valuable for prediction.For instance, even if the word "Newt" never occurs in the training data set, the model could assign a higher score to posts containing it.This would happen if the model learns from the training data set that words from the Harry Potter universe are indicative of high performing students and learns from the unsupervised training that "Newt" belongs to this category, i.e. this word is close to other Harry Potter related words in the vector space.This might make the use of continuous word representation preferable to common approaches relying on counting word frequencies.As our approach does not depend on a particular language, source of texts, or target variable (i.e.academic performance could be substituted by income or depression), it could be applied to a wide variety of settings.
Our results also suggest that models trained on text data could be successfully transferred from one data source to another.While this certainly might be useful in some applications, it also means there is a greater risk to users' privacy.If users of platform A do not disclose an attribute X on it, then there is no data to train a model to predict X from digital traces on platform A. However, if X is disclosed on platform B, and both platforms collect short texts from users, then it becomes possible to predict X from digital traces on A given access to data from B. In recent years, face recognition technology has raised particular privacy concerns because of its potential omnipresence and the inability of people to hide from it.In a similar way, digital traces in the form of short texts are ubiquitous, and our results suggest that they allow, if not to identify a person, then at least to predict potentially sensitive private attributes.

Figure 1 :
Figure 1: Pearson correlation between common text features and academic performance.The use of capitalized words, emojis and exclamations (average number per post normalized by the post length in tokens) is negatively correlated with performance.The use of Latin characters, average post and word length, vocabulary size and entropy of users' texts are positively correlated with academic performance.

Figure 2 :
Figure 2:The predictive power of the model depending on the number of posts used per user.First, users with at least 20 posts were selected.Then, for each user, N of their posts were selected to predict their academic performance (N = 1, ..., 20).The shaded region corresponds to the bootstrapped 90% confidence interval.

Figure 3 :
Figure 3: Correlation between the predicted and real performance of schools and universities.Pearson's correlation coefficients between predicted school scores and the USE scores of their graduates were computed for Saint-Petersburg (a), Samara (b) and Tomsk (c).The correlation between predicted university scores and the USE scores of their enrollees was also computed for the 100 largest Russian universities (d).

Figure 4 :
Figure 4: Comparison of predictions based on VK and Twitter data.While estimates from Twitter and VK vary for individual universities the overall performance of the model is similar for both cases.Note that the performance of the model is rather low due to the limited number of users per university for whom both VK and Twitter data is available.

Figure 5 :
Figure 5: t-SNE representation of the words with the highest and lowest scores from the training data set.High performing clusters (orange) include English words and words related to literature, physics, or thinking processes.Low performing clusters (green) include spelling errors and words related to horoscopes, military service, or cars and road accidents.

Figure 6 :
Figure 6: Ranking of selected words by their predicted score (translated from Russian).The application of the model to words from different domains confirms its face validity.

Table 1 :
Predictive power of the models measured as Pearson correlation between real and predicted outcomes.The results were computed using leave-one-out cross-validation.