The presence of occupational structure in online texts based on word embedding NLP models

Research on social stratification is closely linked to analysing the prestige associated with different occupations. This research focuses on the positions of occupations in the semantic space represented by large amounts of textual data. The results are compared to standard results in social stratification to see whether the classical results are reproduced and if additional insights can be gained into the social positions of occupations. The paper gives an affirmative answer to both questions. The results show fundamental similarity of the occupational structure obtained from text analysis to the structure described by prestige and social distance scales. While our research reinforces many theories and empirical findings of the traditional body of literature on social stratification and, in particular, occupational hierarchy, it pointed to the importance of a factor not discussed in the main line of stratification literature so far: the power and organizational aspect.


Introduction
Analysis and positioning of occupations in social space has a long history in social research.
Social stratification models use occupations as a standard way of operationalizing the position of people in the society. Most of the stratification models rely on massive survey data.
However, developments in information technology, in particular data science and natural language processing (NLP), together with the rapid growth of computing capacity, provide new types of data sources. NLP methods, like the word embedding used in this analysis, open up the opportunity to examine society through written or digitalized texts.
The language used by a social group mirrors the group's cultural frame of mind (Kozlowski et al. 2019). These texts inform us about people's ways of thinking, feelings and knowledge (Evans-Aceves 2016). Billions of digitalized or originally digital texts are available for analysis, all of which depict mentality, opinion and values. Sources of texts range from social media posts, through online newspapers and forums, to whole books of classic literature or scientific papers. Thus, the analysis of these huge corpora can help us understand how people in a given culture perceive and think about any kind of topic.
Our paper focuses on the positions of occupations in the semantic space represented by large amounts of textual data. The results are compared to standard results in social stratification to see whether the classical results are reproduced and if additional insights can be gained into the social positions of occupations. The paper gives an affirmative answer to both questions.
The main contribution of this paper is to show that social structures, in particular the stratification of occupations (established so far on the basis of purposively collected data), do exist in and can be derived from large text corpora using methods of unsupervised learning. Further, the most important factors organizing this stratification can be inferred not from theoretical considerations but from the semantic space depicted in the text corpora.
In the first part of the paper we briefly review how social scientists measure the position of people in society. We also discuss the basics of NLP, especially word embedding models, and give a short review of how occupations have been analysed so far using NLP methods. In the Data and Methods chapter, we describe the large digitalized corpora we used in the analysis and the specification of the model with which we extracted the latent dimensions of occupations from these corpora. This part is followed by the analysis and the results. The paper closes with a discussion of how these findings reinforce and extend our understanding of the societal positions of occupations.

Theoretical Background
Occupations and social structure
Social class and social stratification have been widely used concepts since the early years of sociology. Some variants of these concepts are theory driven, others rely on empirical data. Some of them use categories to describe people's positions in the social structure, others apply continuous scales. In social stratification research, occupation is routinely used to link the positions of individuals to their membership in a social stratum. In industrialized societies, occupation is a very strong indicator of social standing and, as it tends to be more stable than income, it serves as a much better proxy for the position of an individual (Connelly et al 2016). Thus, the goal of this research is to classify occupations in a way that mirrors the stratification of society. Multiple approaches exist to the measurement of occupational position in social space. Some theories use occupation to create vertical hierarchies with continuous scales (Ganzeboom-Treiman 1996); others use it to create discrete stratification categories with both horizontal and vertical dimensions (Goldthorpe et al 1982, Rose-Harrison 2007). Researchers have used various measurements for the classification of occupations to create their stratification models. Based on Bukodi et al (2011), we can divide these measurements into two types: one uses subjective indicators, the other works with objective indicators. The scale of Goldthorpe and Hope (1972) belongs to the former category. They applied a synthetic scale of subjective opinions to measure the general desirability of occupations. Treiman (1977) also used questions on subjective perceptions, and from these he created the Standard International Occupational Prestige Scale (SIOPS), which is a widely used analytical scale.
The International Socio-Economic Index (ISEI) (Ganzeboom-Treiman 1996) and the Cambridge scale (Prandy-Lambert 2013, and Meraviglia et al 2016) are good examples of the other type of scale, which uses objective data in its measurement. ISEI builds on the educational level and mean income of occupations to create its hierarchy. The Cambridge scale uses the marriage-table-based social distance of occupations to map their hierarchy. Chan and Goldthorpe (2004) applied a similar methodology, but theirs was built on close friendship data rather than marriage tables. In their interpretation the scale measures the hierarchy of social status. Meraviglia and her colleagues (2016) argue that all continuous measures of social stratification are indicators of the same latent dimensions.
But which characteristics of occupations matter? The answer varies from one social stratification model to the other. The Erikson-Goldthorpe-Portocarero (EGP) model (Erikson et al 1979), which is one of the well-known occupation-based stratification models, is built on employment relations in the labour market. The market and the work situation (e.g. level of income, economic security, authority level) are the dimensions that determine class position (Connelly et al 2016). Along with education, income also plays an important role in the construction of the ISEI scale. In the case of the SIOPS scale, occupations are ordered by their prestige, which is measured by the subjective judgement of respondents to large-scale surveys.
In this paper, we explore how occupation structure could be measured through online texts.
Our approach is data-driven, as we unfold the different layers of occupational structure from online digitalized texts rather than from purposively collected data. From this viewpoint, the closest of the abovementioned models is the Cambridge Scale. However, we focus not on the social ties but on the semantic ties of occupations. In the next subchapter, we introduce the novel text mining techniques with which we can examine the semantic ties of occupations and, through these, study the structure of society.

Text as data and word embedding models
Process-produced data are generated by text messages, phone calls, the usage of public transport with digital tickets, social media posts and bank transfers, all of which leave digital marks in the databases of different systems. These data are not generated by the users with the understanding that they will be part of some analysis. Thus, they mirror the behaviour of individuals better than data from classical surveys or other research, where self-reported responses can be biased by the interview situation, by social desirability and by limits on recalling past events (Lazer and Radford 2017). For that very reason, the analysis of these data can be exceptionally interesting for social research.
This information is stored in very diverse formats, from pictures, through videos or voice recordings, to numbers, and the majority of these data are stored in or can be transformed into textual formats. Text analysis has always had an important place in the field of sociology. From line-by-line reading and analysis at the birth of the discipline, through coding and linking of the text by the researcher (Bales 1950), to digital and partly automated coding of smaller corpora (Hays 1960), it was always part of sociology, which, according to Savage and Burrows (2007), defined its expertise through its own methods. However, these classic analytical methods could not handle large-scale corpora with thousands of millions of words. The methodological knowledge needed for the analysis of large text data had to be imported from computational linguistics, data science and computer science. In parallel with the increase in the amount of digital data, computational power and artificial intelligence have also developed. New methods that aim at the processing of large digital corpora emerge and are continuously elaborated. Sociologists have to incorporate these methods; otherwise, they would miss the opportunity to interpret such sources of data.
Just like the partly automated methods of earlier times, automated text analysis and natural language processing combine qualitative and quantitative approaches. The latest methods provide the depth of qualitative analysis together with the large number of observations typical of quantitative analysis. However, one consequence of the fact that these textual data mostly record observed behaviour is that their structure and relevance (their 'noisiness') are not as well suited to analysis as those of data collected by traditional techniques. The phase of data cleaning and structuring includes important decisions by the researcher. These decisions can influence the internal and external validity of the results; thus, detailed documentation and a description of the arguments behind these decisions are extremely important for making the research transparent.
Simpler methods of text analysis focus only on the words of the corpus, as if they had no relation to the surrounding words and sentences, but more complex methods can also take the structure of the text into account. Some of these methods are based on the 'bag-of-words' model, which means that words are treated together with their environments, namely a given number of words around them. The size (the number of words) of the environment is defined by the researcher and can be any positive integer, though environments that are too wide can cause loss of context. The environment is examined for each word, as a sliding window moving through the whole corpus, and the result of the method is based on the complex co-occurrences of words.
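The sliding-window idea can be illustrated with a minimal sketch. The function name `context_windows`, the window size of 2 and the sample sentence are illustrative assumptions, not part of the models used in this paper; real training pipelines add tokenization, subsampling and other steps.

```python
def context_windows(tokens, window=2):
    """Yield (target, context) pairs using a symmetric window of `window` words."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        yield target, left + right


# Toy sentence (hypothetical, lower-cased and split on whitespace)
sentence = "last night the senator went to the theatre".split()
pairs = list(context_windows(sentence, window=2))

for target, context in pairs:
    print(target, context)
```

Each word is paired with its neighbours, so the co-occurrence information available to the model depends directly on the window size chosen by the researcher.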
The method we used in our analysis is a neural network-based word embedding model (Mikolov et al 2013). This method helps the researcher to understand the deeper meaning of texts by modelling the semantic meaning of words. The position of a word is defined by its context, an approach with a non-computerized linguistic theoretical basis originated by Firth (1957). The word embedding model projects the position of each word of a corpus into a low-dimensional vector space. The most popular method, Word2Vec (Mikolov et al 2013), uses a neural network-based logistic classifier to estimate the word positions. The words of the corpus are positioned in this semantic vector space, where we can calculate the contextual proximity of words. This proximity does not rely only on the co-occurrences of the words, but also on the co-occurrences of the contexts of words. Several word embedding methods are available (e.g. Word2Vec by Mikolov et al (2013), GloVe by Pennington et al (2014), and Fasttext by Joulin et al (2016)) to train on textual data and to establish proximities. In each method, the position of a word defines its meaning in the semantic space. Two words with similar environments will be close to each other; thus, words with similar meaning will be nearby.
Proximities of words are frequently defined by the cosine of the angle formed by the vectors of the words. Standard metrics like Euclidean distance could be misleading here, because the length of each word vector correlates strongly with the frequency of the word within the corpus (and it also depends on the context variability) (Schakel-Wilson 2015). Kozlowski et al (2019) showed that these proximities can be successfully used for the analysis of culture. The starting point of their analysis was the theoretical foundation that language (and texts) mirrors the way of thinking of those who use it. Thus, the analysis of written texts enables researchers to draw conclusions about the society the texts originate from. They showed that in word embedding methods, we can create dimensions of social inequality from the proximity of words that represent the two extreme values of a given inequality (e.g., poor-rich; male-female). Mirroring this proximity to other words, we can

Data and Methods
In the previous section we presented the basics of word embedding models and showed how these models can be used to analyze social phenomena. In this research we use pre-trained word vector models. These widely used word vectors are publicly available, which makes our results reproducible. The embeddings are trained on large-scale corpora, which is important as previous research showed that the accuracy and validity of word embeddings (measured on word analogies) depend strongly on corpus size (Mikolov et al 2013). These pre-trained vector spaces are frequently used in NLP tasks, and previous studies have also confirmed that they can be used to study social processes and social context.
Researchers have validated with surveys that vector space models trained on large and general corpora can be used to measure cultural patterns (Kozlowski et al 2019) or even stereotypes against social groups (Joseph-Morgan 2020).
We used three pre-trained vector spaces in the analysis. The first vector model was trained on the English-language texts of the Common Crawl (CC) corpus, a huge web archive that contains raw web page data, metadata and text extractions. The raw web pages can be anything, from a news site, through a blog or a university page, to pages like Amazon Books. As the authors state, they provide "a copy of the internet". It consists of one petabyte of data, collected between 2011 and 2017. As the data do not contain the geo-location of the websites, they might include web sites from all over the world. In the initial corpus 600 billion tokens were identified, and the vector space consists of 2 million words positioned in a 300-dimensional space. The model was trained with the Fasttext algorithm (Joulin et al 2016).
The second vector space we used is Wikinews, which was trained on a combined corpus of the English Wikipedia (saved in 2017), the UMBC WebBase corpus, and another corpus, which contains all the news from statmt.org. The UMBC corpus contains high-quality English paragraphs derived from the Stanford WebBase project and contains 100 million web pages from 2007. The statmt.org corpus contains political and economic commentary crawled from the web site Project Syndicate. The combined corpus is quite diverse and has 16 billion tokens. The vector space consists of one million words positioned in a 300-dimensional space and was trained with the Fasttext algorithm (Joulin et al 2016). Thus, the number of dimensions and the training method of the two vector spaces were the same.
We used a third vector space, which was also built on the combined corpus of the Wikinews sources, but in this third vector space sub-word information was also taken into account during the training phase of the model. This means that partly identical words, or words with the same root, like sociology and society, tend to be closer to each other in this vector space. We will refer to this vector space later as Wikinews Subwords. With this vector space, we utilize the innovation of the Fasttext algorithm, namely that it can account for sub-word information.
Although the first two vector spaces were also trained with the Fasttext algorithm, sub-word information was not taken into account there; thus, the method was closer to a word2vec solution (which cannot handle sub-word information).
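To illustrate why sub-word information pulls words such as sociology and society together, the following sketch lists the character n-grams the two words share. It assumes the commonly described Fasttext convention of wrapping each word in boundary markers and using n-grams of length 3 to 6; the function name `char_ngrams` is hypothetical, and the actual settings of the pre-trained models are not stated here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word wrapped in '<' and '>' markers."""
    marked = f"<{word}>"
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }


# N-grams shared by two words with a common root
shared = char_ngrams("sociology") & char_ngrams("society")
print(sorted(shared))
```

In a sub-word model a word vector is composed from the vectors of such n-grams, so words sharing many n-grams start from similar representations even before their contexts are considered.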
The pre-trained vectors we used in this paper are trained on general English corpora. We could not narrow the geographical focus, as we do not know the geographical distribution of the authors of the texts. However, based on other results on this topic (see Treiman 1977), there are no significant differences between the prestige scores of different developed countries.
Altogether 234 occupations (see Table A1 for the list) were selected for the analysis, and we used the most common 200,000 words of each vector space. In the ISCO classification, more than 7000 occupations are listed, but the number of one-word occupations was around 750. We manually checked all of these and selected the more than 200 that were not overly unique or rare (like chieftain). These occupations cover both the vertical and horizontal aspects of occupational space. Although we tried to create a gender-balanced occupational list, male occupations are overrepresented according to our qualitative estimations.
Some of the pre-selected occupations were not among the most common 200,000 words, so we had to omit them. In the end, of these 234 occupations, 204 were detected in CC and 207 in Wikinews (202 occupations were available in both corpora). We located the positions of these 204 and 207 occupations in the respective vector spaces. The same methods were applied to each vector space (CC, Wikinews, Wikinews Subwords): the cosine similarities of each pair of occupations were computed in the 300-dimensional vector space. These cosine similarities represent the semantic closeness of the occupations. Table 1 shows a small part of the similarity matrix.
As we have mentioned in the Introduction, one of the main goals of our research was to extract the most important dimensions that structure the occupations in the semantic field. To fulfil this goal, we applied factor analysis with rotation, using the similarity matrix (instead of the often applied correlation matrix) as input. As a robustness test, we repeated our computations with different factor analysis methods and different rotation techniques, and the differences between the results were quite small. The presented results are based on a minres (Minimum Residual) factor analysis technique and varimax rotation (Revelle 2018). The following analyses are based on the factor loadings resulting from this methodology.
Due to the exploratory nature of the research, we had no strong assumption about the number of factors to be extracted. The decision on the number of factors was based on empirical tests and also on practical considerations. We decided to select more than one factor, as we wanted to understand the most important dimensions behind the structure of the occupations and not only the main dimension. At the same time, we decided to select a maximum of 5 factors in order to preserve interpretability. Average residuals for the similarity matrix (RMSR value) and Chi-square-based fit indices were used to test the statistical validity of the models, and external measures (like the ISEI scale) were applied for the comparison of the results to test criterion validity. Overall, we found that the 2-, 3-, 4- and 5-factor solutions are all worth investigating.
In the later analysis, we detail the 3-factor solution as it looked the most promising one.
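The general workflow (similarity matrix as input, loading extraction, varimax rotation, residual check) can be sketched as follows. This is not the minres procedure used for the presented results: for brevity the sketch extracts loadings from the leading eigenpairs of the similarity matrix, which is a simpler, swapped-in technique. The toy vectors, the function names and the RMSR definition (root mean square of the off-diagonal residuals) are assumptions made for illustration.

```python
import numpy as np


def extract_loadings(sim, n_factors):
    """Unrotated loadings from the leading eigenpairs of a symmetric similarity matrix."""
    eigvals, eigvecs = np.linalg.eigh(sim)
    top = np.argsort(eigvals)[::-1][:n_factors]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    n_rows, n_cols = loadings.shape
    rotation = np.eye(n_cols)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / n_rows
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                      # nearest orthogonal rotation
        if s.sum() < criterion * (1 + tol):    # stop when the criterion no longer improves
            break
        criterion = s.sum()
    return loadings @ rotation


def rmsr(sim, loadings):
    """Root mean square of the off-diagonal residuals of the reproduced matrix."""
    residual = sim - loadings @ loadings.T
    off_diag = residual[~np.eye(len(sim), dtype=bool)]
    return float(np.sqrt(np.mean(off_diag ** 2)))


# Toy data: 6 hypothetical "occupations" with random unit vectors in 10 dimensions
rng = np.random.default_rng(0)
vectors = rng.normal(size=(6, 10))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = vectors @ vectors.T               # cosine similarity matrix

unrotated = extract_loadings(similarity, n_factors=2)
rotated = varimax(unrotated)
print(rotated.round(2), rmsr(similarity, rotated))
```

Because varimax is an orthogonal rotation, it redistributes the loadings across factors without changing how well the factor model reproduces the similarity matrix, so the residual measure is the same before and after rotation.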
We used different methods for the robustness test of the models. We compared the consistency

Example 1.
Last night the SENATOR went to the theatre.
This evening the TYPIST wanted to go bowling.
We can assume that different cultural activities are closer to specific occupations, as occupation strongly correlates with status, power and money. A senator might also go bowling but is more likely to go to the opera or the theatre than a typist.

Example 2.
Half of the company's DATA_SCIENTISTS graduated from Ivy League schools.
The plan of the WAITRESS was to attend evening school next year.
The situation is the same in the second example. Usually a waitress does not graduate from an Ivy League school, and data scientists do not attend evening schools.
Beyond the intuitive understanding of these examples, we tested them on our data. We tested the closeness of occupations to certain activities with the cosine similarity of the word vectors of the occupation and the activity. In the CC vector space, the cosine similarity of the occupation senator with the word theatre is 0.21, while the same measure for the typist is 0.12. For bowling, the senator's cosine similarity is 0.05 but the typist's value is 0.16. Thus, the senator is closer to the high-end cultural activity, while the typist is closer to the more popular one. These results strengthen the intuitive assumption, namely that in these contexts, the presence of different occupations has different likelihoods.
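The comparison above rests on the cosine similarity of two word vectors. A minimal sketch of the computation follows; the three-dimensional vectors are invented toy values chosen only to mimic the direction of the reported pattern, not entries of the CC vector space (which has 300 dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (not real model output)
toy = {
    "senator": [0.9, 0.1, 0.3],
    "typist": [0.2, 0.8, 0.3],
    "theatre": [0.8, 0.2, 0.1],
    "bowling": [0.1, 0.9, 0.2],
}

for occupation in ("senator", "typist"):
    for activity in ("theatre", "bowling"):
        score = cosine_similarity(toy[occupation], toy[activity])
        print(occupation, activity, round(score, 2))
```

Since the measure depends only on the angle between the vectors, rescaling a vector (for example, a longer vector for a more frequent word) does not change the result.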
At the same time, it is important to emphasize the different logic of word embedding similarity and similarities of occupational hierarchies created by social scientists.

Common Crawl
First, we present the results from the Common Crawl corpus. Of the occupations on our list, doctor was the most frequent item. Overall, it was the 1496th most frequent word in the vocabulary of the corpus. Driver, writer, cook, judge, editor, lawyer, professor and attorney were also frequent. We can observe a pattern: occupations with higher prestige are more frequent in this corpus.
As we have mentioned above, we first calculated the cosine similarities of the 204 occupations that were among the most frequent 200,000 words of the vector space. Then we used this similarity matrix as input to extract factors, based on which we detected the main structural dimensions of the occupational semantic space. We tested the model with different numbers of factors. In the case of the two-factor solution the Root Mean Square Residual (RMSR) was 0.07. The explained variances of the two factors were quite similar. In both dimensions, knowledge is an important factor. Based on the occupations with the highest loadings on a given factor, the first dimension is closer to the domain of the media (e.g., commentator, editor), and the second is closer to the domain of science (see Table A2).
We have also tested the correlation of the factors of the two- and three-factor models. We found that the correlation of the first factors of the two- and three-factor solutions was 0.9, and the correlation between the second factors was the same. Table 3 shows the occupations with the highest and lowest factor loadings on a given factor of the three-factor model. Interpreting the three factors, we found that the first two factors were quite similar, but with some important differences. In the case of the first factor the role of institutional power seems to be more important; the chancellor or the dean are good examples of that. The second factor is structured more on the basis of the knowledge and educational level associated with the occupations, while the third factor is built on the power levels and organizational capacities of the occupations.
Footnote 2: A usual way to create a factor model is to start from a raw data source, calculate the covariance/correlation matrix, and then calculate the factor loadings and estimate the factor scores based on these loadings. In this paper, we start from a similarity matrix and calculate the factor loadings. As we do not have raw data here, we could not calculate the factor scores. This has one important implication: rotated factor scores are statistically independent, but factor loadings are not. That is why we have a strong correlation between the extracted factors.
Table 3. Occupations with highest and lowest loadings, 3-factor solution, CC corpus
For a deeper understanding of the results we further analyzed the first dimension of the three-factor solution. In the rest of the paper, we refer to this dimension as the Occupation Semantic Position Scale (OSPS). Figure 1 shows the scatterplot of the ISEI and the OSPS scales. For all pairs of occupations, we checked whether they are in the same rank order on the two scales. The result shows that in 75 percent of the occupation pairs the order was the same. Thus, we can assume that the proximity of occupations in online texts strongly correlates with the expected educational level and the average income of the selected occupation, which are the basic dimensions of the ISEI prestige score.
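The pairwise rank-order comparison can be sketched as below. The function name, the toy scores and the handling of ties (tied pairs are skipped) are assumptions for illustration; the text does not specify how ties were treated.

```python
from itertools import combinations


def pairwise_order_agreement(scale_a, scale_b):
    """Share of item pairs that are ordered the same way on both scales."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(scale_a)), 2):
        diff_a = scale_a[i] - scale_a[j]
        diff_b = scale_b[i] - scale_b[j]
        if diff_a == 0 or diff_b == 0:
            continue  # assumption: pairs tied on either scale are skipped
        total += 1
        if diff_a * diff_b > 0:
            agree += 1
    return agree / total


# Hypothetical scores for four occupations on two scales
isei_like = [10, 20, 30, 40]
osps_like = [0.1, 0.3, 0.2, 0.4]
share = pairwise_order_agreement(isei_like, osps_like)
print(share)
```

When there are no ties, this share is a linear transformation of Kendall's tau (tau = 2 x share - 1), so a share of 75 percent corresponds to a tau of about 0.5.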
We have to emphasize that the word embedding method is unsupervised, which means that the researchers do not feed external information into the model. Accordingly, we have neither used the ISEI prestige scores as an input to the model nor optimized the varimax rotation for them. Thus, these results are based only on the information contained in the online texts.
At the same time, we have found remarkable differences. Some occupations, like doctor, dentist, pharmacist or solicitor, were positioned quite low on the OSPS but high on the ISEI. The reason is that the position of an occupation on the OSPS does not depend only on the prestige of the occupation, but is also affected by the reflection of the domain that surrounds the occupation. For example, being a dentist is a high-prestige job, paired with a high educational level and high income, but (1) being sick is not a positive situation (a feeling that can be mirrored in the texts) and (2) everybody can be sick, irrespective of their social status: health care professionals provide services to the general public, which means they have links to all levels of the social structure. As health-related occupations are all affected by these circumstances, this can be the reason that they are scored lower.
In the case of the 4-factor model, the first three dimensions were quite similar to those that we found in the 3-factor solution. The main structuring dimension of the fourth factor was gender: the occupations with the five highest loadings were receptionist, waitress, babysitter, manicurist and hairdresser. In the 5-factor model, we still have not detected wage as an organizing dimension of any factor. What we found was that health-related occupations score high on the fifth dimension, which thus looks like a domain-specific one. We could also observe that as we increase the number of factors in the model, the correlation of the first factor with the ISEI becomes lower and lower.

Wikinews
To test the robustness of our results we repeated our analysis on a different corpus, namely the Wikinews corpus. In this corpus, the most frequent occupation was editor, but judge, politician and lawyer were also frequent, as were journalist, writer and singer. Most of these are higher-prestige occupations, which are related to the domains of politics, media and culture.
For comparison, we ran the same factor analyses as on the CC-based embedding. The results were more similar than we expected. In the case of the 3-factor solution, the correlation of the first factors of the two corpora was 0.97, the correlation of the second factors was 0.93 and that of the third factors was 0.82. These results suggest that the factors in the two corpora show a similar structure of occupations.
With a more detailed analysis we could find minor differences between the first factors of the Wikinews and CC corpora. Some manual-labor occupations, like locksmith and dishwasher, got higher scores in the Wikinews corpus, and some of the literature- and art-related occupations, like poet, novelist, composer or painter, scored higher in the CC corpus.
Table 4. Occupations with highest and lowest loadings, 3-factor solution, Wikinews
The interpretation of the first three dimensions is quite similar to that in the CC corpus. The first factor shows a mixed organizing pattern built of power and knowledge. In the case of the second factor, the science-related occupations score high. The dimension behind the third factor is about power level and organizational capacity. The correlation of the first dimension with the ISEI score was 0.71 (see Figure A1), and we found that with the above described methodology, 74 percent of the occupation pairs are in the same order in the Wikinews-based first factor hierarchy and on the ISEI scale. In addition to the similarities, we also found differences: some animal- and farm-related occupations (e.g., breeder, fisher, planter) score much higher on the semantic scale, and some health-related occupations (e.g., doctor, surgeon, dentist, pharmacist) score higher on the ISEI scale.
We have also tested the 4- and 5-factor solutions here. Similar to the result for the CC corpus, the 4th factor can be interpreted as the gender dimension: occupations like nanny, hairdresser, receptionist, babysitter or waitress score high there. Just as in the CC corpus, the 5th dimension was a domain-related one. It is interesting, however, that in the current (Wikinews) corpus it was not the health domain that characterized the scale, but the domain of media and culture, with highly scored occupations like novelist, poet, singer, composer, dramatist, lyricist, or writer.

Wikinews with sub-word information
The last word embedding we tested was also built on the Wikinews corpus, but the training phase of this model also took into account sub-word information. With this solution, partly identical words or words with the same root are closer to each other in the vector space. The same 3-factor solution was applied here and the interpretation of the three factors is on the whole the same as in the previous cases. (For more details about these factors, see Table A3 in the Appendix).
The interpretation of the factors showed that institutional power is an important aspect in the first factor, but knowledge also matters there. The second factor was related to the knowledge and educational level associated with the occupations, while the third factor was scaled on the power levels and organizational capacities of the occupations. This later factor is close to the domain of politics.
The correlation of the first dimension with the ISEI score was 0.78. According to the rank order, 77 percent of the occupation pairs were the same on both scales, namely in the first factor of this corpus and the ISEI. The occupations, which are much higher on the semantic scale are rancher, planter, and astrologist. Other occupations are underestimated compared to the ISEI: such as in the case of the CC corpus, these are domain specific occupations. Some of them are health-related occupations, such as dentist, doctor, pharmacist and surgeon; some are financial occupations, like banker or accountant; and some are judicial system related occupations like judge, lawyer or solicitor.
We also tested the 4- and 5-factor solutions here. The 4th factor again showed the gender dimension, with high scores for occupations like nanny, hairdresser, receptionist, babysitter and waitress. The 5th factor was again domain-related, namely the domain of media and culture, with high scores for occupations like novelist, poet, singer, composer, dramatist, lyricist, and writer, just as in the case of the Wikinews corpus.

Robustness - Stability of occupational positions in different vector spaces
The correlation of the factor loadings across different embeddings appears to be very strong. [...] were the following: masseur, dishwasher, rheumatologist, manicurist, zookeeper, editor, bender, locksmith, dentist, and tanner. We cannot observe a clear organizing principle, but some of these occupations, like tanner or bender, are quite rare now.
We calculated the correlation between the Wikinews frequency of occupations and the stability measure; its value was 0.59. This result parallels earlier findings that frequent words are more stable over time (Hamilton-Leskovec-Jurafsky 2016). Our results show that this holds not only for temporal analysis but also for the comparison of different corpora (and embeddings) created at approximately the same time. Stability also correlated positively with the ISEI score (Pearson r=0.36, p=0.00). The direction of the correlation suggests that the positions of more prestigious occupations are more stable across corpora, but this result should be treated with caution, as the effect exists partly because more prestigious occupations are also more frequent (at least in the two corpora we used). However, even after controlling for word frequency, the correlation between ISEI score and stability remains significant (r=0.19, p=0.000).
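One common way to control for a third variable is the first-order partial correlation, sketched below. This is an assumption for illustration: the source does not say which procedure was used to control for word frequency. With x as stability, y as the ISEI score and z as frequency, the formula is r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The function names and all toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy values invented for illustration; they are not data from the paper.
toy_stability = [0.62, 0.71, 0.55, 0.80, 0.90, 0.66]
toy_isei = [30, 45, 25, 60, 85, 50]
toy_log_frequency = [2.1, 3.0, 1.8, 3.2, 4.5, 2.5]

print(pearson(toy_stability, toy_isei))
print(partial_correlation(toy_stability, toy_isei, toy_log_frequency))
```

If the partial correlation stays clearly above zero, the association between stability and prestige is not explained by frequency alone, which is the reading given to the reported r=0.19.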

Discussion
We raised two questions about the usefulness of word embedding based semantic analysis for describing occupational structure, in particular occupational rankings: are the results comparable with standard results, and is it possible to gain additional insights into the social positions of occupations? Both questions have been given affirmative answers. The results show a fundamental similarity between the social structure obtained from text analysis and the structure described by Ganzeboom and Treiman (1996). A more detailed analysis, however, also reveals some differences.
Our paper focused more on methodological aspects and put less emphasis on the substantive analysis of the results. But a first, superficial analysis revealed an interesting dimension of the occupational structure: the power and organizational aspect. As far as we know, the importance of this factor is not discussed in the main line of stratification literature in sociology.
It has been widely discussed (Johnson 2016) that power is an important component of the prestige of an occupation. Our results, however, point to the interplay between knowledge and organizational capacity. In the 3-factor solution, each factor is characterized by the presence of one or both of these, and power presents itself as a combination of knowledge and organizational capacity. It is no surprise that knowledge in itself is a fundamental dimension, but it does seem quite novel that organizational capacity in itself is a contributing dimension. In his classic work, Freidson (1984) distinguishes two types of elites: knowledge and administrative elites. Waring (2014) reappraised Freidson's model and added two extra elite types, the corporate and the governance elite. Our third factor mirrors the importance of this governance elite as an important factor that structures the occupational space.
The results proved quite stable, as repeating the analyses on two different corpora yielded strongly similar results. Correlations of the factors between the two corpora were high and substantively significant. After aligning the second corpus to the first, we found strong similarities in the positions of the occupations across corpora. Although we do not have data for measuring other stability indicators, we know from other studies (Hamilton-Leskovec-Jurafsky 2016) that concept stability is lower for words that are frequently used in different contexts, which is called polysemy in linguistics. It is also known that the position of a concept changes over time (Kozlowski et al. 2019), so further analysis may also take into account the time period during which the original corpora were collected.
We decided to use pre-trained word embeddings in this paper rather than training our own. These pre-trained embeddings are available to everyone, so it is easy to reproduce our results and take further steps in this area. One shortcoming of this approach is that we could not narrow the geographical focus of the results, and we could not influence what types of texts were included in the training set. However, previous studies have shown (Treiman 1977) that prestige scores are highly correlated across developed countries, so our general approach might not lead to significant biases. The validity of the results is also supported by the fact that the results from different corpora and word embeddings were similar.
Nevertheless, a logical next step would be to repeat this analysis with self-trained word embeddings, which would give us stronger control over the selected texts. Training our own models has a further advantage: we could pre-process the texts before calculating the vector spaces. For social science analysis, pre-processed texts could work better, as the information is more focused and there is less noise in those texts. We could also add bi-grams to the model, which might be essential for capturing two-word occupations like "social scientist." Further studies are needed to understand how pre-processing influences word embedding features and how this affects social science-related analysis.
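The bi-gram idea mentioned above can be sketched as a simple pre-processing step. This is a minimal illustration under stated assumptions: the list of two-word occupations is given in advance, the tokens are already lower-cased and split, and the two words are joined with an underscore so that the embedding model treats them as one vocabulary item. The function name `merge_bigrams` and the example sentence are hypothetical; in practice the phrases could also be detected statistically from the corpus.

```python
def merge_bigrams(tokens, phrases, joiner="_"):
    """Join known two-word phrases into single tokens, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(joiner.join(pair))  # e.g. "social_scientist"
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase list and sentence, for illustration only.
occupation_phrases = {("social", "scientist")}
sentence = "the social scientist met a dentist".split()
print(merge_bigrams(sentence, occupation_phrases))
# prints ['the', 'social_scientist', 'met', 'a', 'dentist']
```

After such a step, a two-word occupation receives its own vector instead of being split into the vectors of "social" and "scientist".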
[...] social sciences, although it has already been demonstrated that unsupervised learning methods such as the analysis of word embeddings are able to find interesting patterns and generate new hypotheses (Nelson 2020). Both qualitative and quantitative approaches are needed to fully exploit this potential in understanding societies.

Table A3. Occupations with highest loadings, 3-factor solution, Wikinews_subwords

Figure A1. Scatterplot of word embedding based occupation prestige score (Occupation Semantic Position Scale - OSPS) from the CC vector space and ISEI