### 3.1 The higher education space

The Higher Education Space (HES) relates pairs of degree programs that exhibit a positive and statistically significant co-occurrence relationship in the applicants’ preference lists [30–33]. To that end, we first estimate the strength of the relationship between two degree programs using the *ϕ*-correlation. We then apply two statistical tests to discard links whose statistical significance does not allow us to claim that such a relationship exists, either because the sample size is small or because the observed correlation can be due to pure chance.

The *ϕ*-correlation index between a pair of degree programs *i* and *j*, which we denote \(\phi _{ij}\), can be computed as:

$$ \phi _{ij}=\frac{M_{ij}Z-M_{i}M_{j}}{\sqrt{M_{j}M_{i}(Z-M_{i})(Z-M _{j})}}, $$

(1)

where \(M_{ij}\) corresponds to the observed number of co-occurrences between degree programs *i* and *j*, \(M_{i}\) is the total number of co-occurrences in which degree program *i* participates (\(M_{i}=\sum_{j} M _{ij}\)), and *Z* is the total number of co-occurrences in the dataset (\(Z=\sum_{i} M_{i}/2\)). Positive/negative values of \(\phi _{ij}\) indicate that increasing numbers in the prevalence of each degree program are likely to result in an increase/decrease in the number of co-occurrences between them. We discard all negatively correlated relationships, since these edges indicate pairs of degree programs whose co-occurrence pattern cannot be explained by the prevalence of each degree program alone.
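As an illustration of Eq. (1), the following minimal sketch computes \(\phi _{ij}\) from a small co-occurrence matrix. The function name `phi_matrix` and the toy matrix are hypothetical and chosen only for this example; the sketch assumes a symmetric matrix with a zero diagonal, so that each co-occurrence is counted twice when summing the row totals.

```python
import math

def phi_matrix(M):
    """Phi-correlation for every pair (i, j) of a symmetric co-occurrence matrix.

    M is a list of lists with a zero diagonal (illustrative input format).
    """
    n = len(M)
    totals = [sum(row) for row in M]   # M_i: co-occurrences involving program i
    Z = sum(totals) / 2                # each co-occurrence appears in two row totals
    phi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue               # self-connections are not considered
            denom = math.sqrt(
                totals[i] * totals[j] * (Z - totals[i]) * (Z - totals[j])
            )
            if denom > 0:              # leave phi at 0 when the index is undefined
                phi[i][j] = (M[i][j] * Z - totals[i] * totals[j]) / denom
    return phi

# Toy data: four degree programs (values are made up for illustration).
M = [
    [0, 3, 1, 0],
    [3, 0, 0, 1],
    [1, 0, 0, 2],
    [0, 1, 2, 0],
]
phi = phi_matrix(M)
# Keep only positively correlated pairs, as described in the text.
positive_links = [(i, j) for i in range(4) for j in range(i + 1, 4) if phi[i][j] > 0]
```

In this toy case only the pairs (0, 1) and (2, 3) co-occur more often than their prevalence alone would suggest, so only those two links survive the sign filter.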

Next we filter links by performing two statistical tests. The first tests whether measured *ϕ*-correlations can be explained by pure chance alone, while the second tests if the identified *ϕ* values are different from zero given the sample size and the associated standard error.

The first test is performed by comparing the observed \(\phi _{ij}\) with a null distribution obtained from an ensemble of \(N=1000\) randomly generated networks. Each random network is generated by shuffling the preferences of the candidates in each year while keeping constant the number of preferences of each candidate and the number of times each degree program was chosen in a year [34]. For each randomization *k* we compute the \(\tilde{\phi }^{k}_{ij}\) associated with a pair of degree programs. The ensemble of such values forms the null distribution \(\tilde{\varPhi }_{ij} = \{\tilde{\phi }_{ij}^{1}, \tilde{\phi }_{ij}^{2},\ldots,\tilde{\phi }_{ij}^{N}\}\). Using statistical inference methods [35] we estimate the *p*-value associated with \(\phi _{ij}\) by calculating the upper-tail probability of obtaining a value equal to or greater than \(\phi _{ij}\) from the cumulative frequency of the null distribution \(\tilde{\varPhi }_{ij}\). We discard links with a *p*-value > 0.05.
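The randomization test can be sketched as follows. This is only one possible way to satisfy the two constraints stated above (it swaps degree programs between pairs of candidates), and it is not necessarily the procedure of [34]. The function names, the `(candidate, program)` input format, and the toy data are assumptions made for illustration; preference order is ignored, and in practice the shuffle would be applied separately to each year.

```python
import random
from collections import Counter

def shuffle_preferences(pairs, n_swaps, rng):
    """Randomize (candidate, program) pairs by swapping programs between candidates.

    Every swap keeps the number of preferences per candidate and the number of
    times each program is chosen unchanged. Input pairs are assumed unique.
    """
    pairs = list(pairs)
    existing = set(pairs)
    for _ in range(n_swaps):
        a, b = rng.sample(range(len(pairs)), 2)
        (c1, p1), (c2, p2) = pairs[a], pairs[b]
        new1, new2 = (c1, p2), (c2, p1)
        # Skip swaps that change nothing or would duplicate a preference.
        if c1 == c2 or p1 == p2 or new1 in existing or new2 in existing:
            continue
        existing.difference_update([pairs[a], pairs[b]])
        existing.update([new1, new2])
        pairs[a], pairs[b] = new1, new2
    return pairs

def upper_tail_p_value(observed, null_values):
    """Share of null values that are equal to or greater than the observed one."""
    return sum(v >= observed for v in null_values) / len(null_values)

def degree_counts(pairs):
    """Preferences per candidate and choices per program (used to check the shuffle)."""
    return Counter(c for c, _ in pairs), Counter(p for _, p in pairs)

# Toy data: candidates a-d and programs W-Z (made up for illustration).
prefs = [("a", "X"), ("a", "Y"), ("b", "Y"), ("b", "Z"),
         ("c", "X"), ("c", "Z"), ("d", "W")]
shuffled = shuffle_preferences(prefs, n_swaps=200, rng=random.Random(0))
p_example = upper_tail_p_value(0.3, [0.1, 0.2, 0.3, 0.4])
```

A link would then be kept only if the *p*-value computed from its own null ensemble does not exceed 0.05.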

Secondly, since the magnitude of observations varies across different degree programs we use a *t*-test to infer whether the positive correlations are significantly distinguishable from zero. To that end, we compute:

$$ t_{ij}=\phi _{ij}\frac{\sqrt{D-2}}{\sqrt{1-\phi _{ij}^{2}}}, $$

(2)

where \(D-2\) represents the degrees of freedom, which we take as \(D=\max(M_{i},M_{j})\) [30]. We consider only links that are statistically significant with *p*-value ≤0.05 (\(t _{ij} = 2.06\) for \(D=25\), one tailed).
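Eq. (2) can be evaluated directly, as in the sketch below. The function names are hypothetical, and the critical value is passed in by the caller (for example, taken from a *t*-table for \(D-2\) degrees of freedom) rather than computed here.

```python
import math

def t_statistic(phi, m_i, m_j):
    """t-value of Eq. (2), with D taken as max(M_i, M_j)."""
    d = max(m_i, m_j)
    if d <= 2 or abs(phi) >= 1:
        raise ValueError("t is undefined for D <= 2 or |phi| >= 1")
    return phi * math.sqrt(d - 2) / math.sqrt(1 - phi ** 2)

def is_significant(phi, m_i, m_j, t_critical):
    """True when the link passes the test for a caller-supplied critical value."""
    return t_statistic(phi, m_i, m_j) >= t_critical

# Illustrative numbers only: a moderate correlation observed on few co-occurrences.
t_example = t_statistic(0.5, 27, 10)          # D = 27, so 25 degrees of freedom
weak_link_passes = is_significant(0.3, 25, 12, t_critical=2.06)
```

The example shows why the test matters for small samples: a correlation of 0.3 with \(D=25\) yields a *t*-value of about 1.5, below the threshold quoted in the text, so that link would be discarded.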

To sum up, we keep only links with a positive *ϕ*-correlation that pass two statistical significance tests at the *p*-value ≤ 0.05 level. The first tests whether the identified *ϕ*-correlations are not just the result of pure chance, while the second discards links for which the number of observations does not allow us to claim that the *ϕ*-correlation is significantly different from zero. Finally, we discard self-connections from the analysis, as we are interested only in relationships between different degree programs.

Figure 1 shows the HES network structures for Portugal and Chile. Nodes represent degree programs and are colored according to the nine groups of the first level of the ISCED classification: Arts and Humanities (dark blue), Social Sciences (dark green), Sciences (dark purple), Engineering (dark yellow), Agriculture (pink), Education (red), Services (light purple), and Health (light blue). The size of the nodes is proportional to the number of observations.

The PHES network (Fig. 1(a)) results from all application preferences between 2008 and 2015, since no major changes occurred in the system during that time interval. By contrast, the CHES network analysis is divided into two periods, due to the addition of nine new universities in 2012 (see Additional file 1). The first period (Fig. 1(b)) considers applications between 2006 and 2011, while the second (Fig. 1(c)) analyzes those between 2012 and 2017.

The PHES and CHES networks are sparse (between 2% and 5% of the maximum number of relationships possible) and highly clustered (clustering coefficients between 0.46 and 0.49) when compared to random networks with a similar density of links. The high clustering coefficient invites the use of network science methods (*e.g.*, modularity-based network partition algorithms) to derive a classification/grouping of degree programs (see Fig. 2 and related discussion below). Each network exhibits a diameter between 6 and 11 links, and an average path length (APL) between 3.94 and 4.22. Both CHES networks have fewer nodes than the PHES network (177 and 179 against 312) but relatively similar connectivity per degree program (7.44 and 6.72 against 8.51). There are common elements in all three networks, *viz.* the existence of three main clusters: one dominated by degree programs in Engineering; a second one that involves degree programs in Biology, Sciences, and Health; and a third with a strong representation of degree programs in Arts and Humanities, and Social Sciences.
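The reported sparsity can be cross-checked from the node counts and connectivity values quoted above. The sketch below assumes that "connectivity per degree program" means the average degree \(\langle k \rangle\) of an undirected network, in which case the density equals \(\langle k \rangle/(n-1)\). It also assumes the two CHES values are listed in the same order as the node counts. The function name is hypothetical.

```python
def density_from_mean_degree(n_nodes, mean_degree):
    """Density of an undirected network: 2E / (n (n - 1)) = <k> / (n - 1)."""
    return mean_degree / (n_nodes - 1)

# (number of nodes, average degree) as quoted in the text; the pairing of the
# two CHES periods with their values is an assumption.
networks = {
    "PHES": (312, 8.51),
    "CHES (first period)": (177, 7.44),
    "CHES (second period)": (179, 6.72),
}
densities = {
    name: density_from_mean_degree(n, k) for name, (n, k) in networks.items()
}
for name, value in densities.items():
    print(f"{name}: {value:.1%}")
```

Under these assumptions all three densities fall inside the 2% to 5% range stated in the text, whichever way the two CHES values are paired.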

Overall, the HES is characterized by a doughnut-shaped structure with a few degree programs occupying a central region connecting opposite sides of the network. This topology is not new, and similar networks were obtained when mapping science and research areas [36, 37]. Nonetheless, the above structures can have relevant implications for higher education policy development. For example, the central role of Economics and Management (Commercial Engineering in Chile) in connecting the Engineering, Arts and Humanities, and Social Sciences clusters might hint at potential trans-disciplinary crossings when designing future changes in the system [38–40].

As mentioned above, the high clustering levels in all three networks invite a classification/grouping of degree programs based on the network structure of the HES. Figure 2 shows the best partitions obtained using the Louvain algorithm [41], where nodes of the PHES (a) and CHES ((b) and (c)) are colored according to the partition to which they belong. The best PHES partition has a modularity of 0.71 and explains 88% of the intra-group connectivity. When compared with the ISCED classification, these values correspond to an improvement of 42% in modularity and of 27% in intra-group connectivity. Likewise, the best partitions of both CHES networks exhibit a modularity of 0.67, explaining 83% of the intra-group connectivity, with an improvement of 67.5% over the ISCED classification (see Fig. 2(d)–(f)).
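To make the two quality measures concrete, the sketch below evaluates a given partition on a toy network. It assumes an unweighted, undirected network and uses the standard Newman modularity. Reading "intra-group connectivity" as the share of links whose two endpoints fall in the same group is our interpretation. The function names and the toy network are hypothetical, and the sketch only scores a partition; it does not implement the Louvain search itself.

```python
def modularity(edges, community):
    """Newman modularity of a partition of an unweighted, undirected network."""
    m = len(edges)
    degree, intra_links = {}, {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            intra_links[c] = intra_links.get(c, 0) + 1
    degree_sum = {}
    for node, k in degree.items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0) + k
    # Sum over groups: observed share of links minus the share expected by chance.
    return sum(
        intra_links.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

def intra_group_fraction(edges, community):
    """Share of links that connect two nodes of the same group."""
    return sum(community[u] == community[v] for u, v in edges) / len(edges)

# Toy network: two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q = modularity(edges, community)
share = intra_group_fraction(edges, community)
```

The same two functions can score any labelling of the nodes, for instance a data-driven partition against a labelling by ISCED field, which is the kind of comparison reported above.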

The International Standard Classification of Education (ISCED) [26, 27] was developed to facilitate comparative statistics between different countries. It is also commonly used in academic studies and nationwide reports on the state of higher education. The ISCED premise is to group degree programs according to their course content; it represents neither the applicants’ nor the educators’ perspective. This premise contrasts with the data-driven and network-based approach derived here, which stems only from the applicants’ perspective.

Figure 2(d)–(f) shows the composition of each HES group according to the ISCED classification of its constituents. Colors among similar groups (\(C_{1}\) to \(C_{8}\)) of different HES are kept consistent to ease comparison. Groups of similar color match groups located in similar regions of the PHES and CHES. For example, group 1 (\(C_{1}\)) in the PHES is composed of 14 degree programs from the Science Education Field, 1 degree program from the Agriculture field, 42 degree programs from the Engineering field, and 2 degree programs from the Services field. Communities have been named in order to make their composition comparable across the CHES and PHES networks, when possible. The observed diversity of ISCED scientific fields in each community shows that administrators and policy makers should take care in devising policies based on sectoral analyses developed by scientific fields only. This is especially relevant for policies aimed at solving access inefficiencies of the higher education system. This note of caution will be reinforced by the results found for positive assortment in the HES (see the next section).

As expected, there are differences and similarities among the three HES. Firstly, the number of communities differs between the PHES (8) and the CHES (between 6 and 9), which might be explained by the size of each network and degree program diversity (see Additional file 1 for more details about each system). Secondly, the organization of the CHES network seems to have changed in the second time period, becoming more similar to the PHES network. This conjecture is backed up only by the number of communities and by visual inspection, and requires future validation, but it raises interesting questions: 1) does globalization of higher education [42–45] lead different HES to evolve towards similar structures? and 2) since these structures are based on applicants’ choices, are they adapting quickly to societal transformations and is policy on higher education able to follow suit?

### 3.2 Feature assortment in the higher education space

The Higher Education Space (HES) is estimated solely from the applicants’ choices and is completely agnostic to the particular features that characterize each degree program. Thus, the emergence of three coherent and similar networks, in two different countries and for different time periods, naturally raises the question of what explains the emergence of these structures. The answer likely lies in a multiplicity of factors, some of which we briefly explore here by matching the HES network structures with available data on descriptive features of degree programs, e.g., gender balance or unemployment levels (*cf.* Sect. 2.2). It is important to keep in mind that other factors involved in the applicants’ choices can certainly help to explain the structure of the HES. However, due to data limitations and the scope of this manuscript, such exploration is left for future work.

Figure 3(a)–(c) shows the PHES (a) and the CHES (b) and (c) where each degree program is colored according to the gender balance in 2015 (Fig. 3(a)), 2011 (Fig. 3(b)), and 2017 (Fig. 3(c)). Orange (gray) tones identify an above-average representation of female (male) applicants. The distribution of gender prevalence among degree programs is neither random nor uniform; it is clustered, resulting in the predominance of one gender over the other in particular regions of the HES. Similar patterns are observed for all other features, such as application scores, unemployment levels, demand-supply ratio, mobility, and first-year dropout rates (see Additional file 1).

Figure 3(d)–(p) explores, quantitatively, these clustering patterns (*i.e.*, positive assortment) over the HES. To that end, we compute, for each feature, the autocorrelations between pairs of degree programs at different distances in the HES network (*i.e.*, measured by the minimum number *n* of links that form a path from one degree program to the other). Bars represent the autocorrelation averaged over all observation years, and error bars the standard error in the estimation of the coefficients. For example, an autocorrelation of 0.75 at \(n = 1\) for gender dominance means that degree programs separated by one link exhibit, on average, 75% of the proportion of female students of a focal degree program. Positive (negative) autocorrelation coefficients are shown in green (red). Bars in light colors indicate an autocorrelation that is not significantly different from zero (failed a *t*-test with \(p > 0.05\)).
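One way to compute such a distance-dependent autocorrelation is sketched below. The exact estimator is not specified in the text, so this sketch assumes a Pearson correlation between the feature values at the two ends of every pair of nodes whose shortest-path distance is exactly *n*. The function names, the adjacency-dictionary input, and the toy network are hypothetical.

```python
import math
from collections import deque

def shortest_path_lengths(adj, source):
    """Breadth-first search distances (in links) from source to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def autocorrelation_at_distance(adj, feature, n):
    """Pearson correlation of a feature over all node pairs at distance n.

    Each pair is counted in both directions, which keeps the measure symmetric.
    Returns None when the correlation is undefined.
    """
    xs, ys = [], []
    for u in adj:
        for v, d in shortest_path_lengths(adj, u).items():
            if d == n:
                xs.append(feature[u])
                ys.append(feature[v])
    if len(xs) < 2:
        return None
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

# Toy network: a chain of four programs with a steadily increasing feature.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
feature = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
r_near = autocorrelation_at_distance(adj, feature, 1)
r_far = autocorrelation_at_distance(adj, feature, 3)
```

In the toy chain, neighboring programs have similar feature values (positive coefficient at \(n=1\)), while the two ends of the chain are dissimilar (negative coefficient at \(n=3\)), which is the qualitative signature of positive assortment decaying with distance.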

These positive/negative relationships between pairs of degree programs seem to corroborate previous findings [3, 4], in that some groups of students tend to choose similar preferences based on similar determinants of choice. For example, a positive assortment in gender balance (Fig. 3(d)–(f)) confirms the existence of different preferences between gender groups, as found in [5–7, 46]. More importantly, a non-trivial contribution of this approach is that it shows *How* and *Where* these similarities spread through the network and how neighbouring degree programs (nodes) influence or contaminate each other; in other words, how features spill over throughout the network structure of the HES. Returning to the gender balance example, Fig. 3(d)–(f) confirms what was already noticeable by visual inspection: the more female applicants apply to a degree program, the more female applicants are observed in neighbouring degree programs, when compared with the average prevalence of female applicants in the entire system. This relationship is positive, significant up to two links, and holds for both Portugal and Chile. Positive autocorrelations, up to two neighbours, are also found, although not as strong, for application scores (Fig. 3(g)–(i)) and demand-supply ratio (Fig. 3(j)–(l)), in both countries.

Due to data availability, autocorrelation patterns for unemployment levels (Fig. 3(m)) and first-year dropout rates (Fig. 3(p)) are calculated for the PHES only. Both show behavioural patterns similar to those of the previous features, although the positive relationship in unemployment levels extends to three links of distance instead of two. Again, due to data constraints, the student mobility feature is only analyzed for the CHES (Fig. 3(n)–(o)). The positive relationship observed in geographical mobility seemingly vanishes more quickly with network distance, although it remains statistically significant at distance = 2 and is zero for larger distances. Two possible explanations for the lack of a positive autocorrelation away from the first neighbors are: (1) most applicants assign a small weight to distance as a factor in the choice of a degree program, and (2) the majority of applicants have a tendency to apply to degree programs that minimize the distance to their place of origin. Although previous research seems to support the second hypothesis [47–52], a more in-depth future analysis is needed to answer this question conclusively.

In sum, all features exhibit positive autocorrelations that extend up to two or three links of separation. The Higher Education Space captures information embedded in the interplay between degree programs, which is revealed by studying the preference patterns of applicants. These results are a natural outcome of all the information applicants carry at the moment of their choices [53] (i.e., either contextual information used in the decision making or inherent characteristics of applicants), which in turn modulates the topology of the HES.

### 3.3 Temporal variations in features

In the previous section, we have shown *How* and *Where* certain degree programs are positively correlated, in several features, as a function of the network distance between them. In this section, we examine how temporal changes in these features can spill over throughout the HES. By understanding the *When* of the autocorrelation patterns, it is possible, for instance, to perceive how external shocks propagate through the system. As an example, we take the particular case of the building sector in Portugal, one of the most affected by the financial crisis that hit the country between 2010 and 2014 (a crisis that was preceded by a downward path since the beginning of the millennium and the global financial crisis of 2008 [54]).

Figure 4(a)–(b) shows, for the PHES, the temporal variation in the demand-supply ratio for Civil Engineering (a) and Architecture (b) between 2008 and 2015. Also shown (light gray) are the temporal variations of their closest direct neighbors in the Higher Education Space network (their average is highlighted in red). After the economic and financial crisis, the construction industry was one of the most negatively affected [55, 56]. *A priori* (without knowing the structure of the network), one could expect that both Civil Engineering and Architecture would suffer a similar impact on their demand-supply ratio given their close market relationship. However, a closer inspection of Fig. 4(a)–(b) shows that the negative impact on the demand for Civil Engineering is not observed for Architecture. More importantly, in both cases, the variations are consistent with the average behaviour of the nearest connected degree programs (temporal spillovers). This confirms and reinforces the above finding that the two belong to different clusters (Architecture being closer to degree programs in Arts and Humanities than to those in Engineering), *cf.* Fig. 1.

The spatial autocorrelation patterns, concerning the temporal variations of features, help to explain how the observed changes affect entire regions of the network in different ways and in different time periods. For example, a clearly discernible pattern in Fig. 4(c)–(d) reveals that variations in the demand-supply ratio reversed from one part of the network to the other in two distinct time periods (2010/11, Fig. 4(c), and 2014/15, Fig. 4(d)). These temporal spillovers are confirmed by the autocorrelation patterns of the yearly time variations of each feature, over all degree programs in the PHES (Fig. 5(a)–(b)). There are positive effects in time that remain up to two links of separation in the demand-supply ratio and application scores, suggesting not only that these two features changed over time (thus reacting to conjunctural changes) but also that those changes spill over to their neighbours.

However, we do not find autocorrelation patterns among the temporal variations for all features. Certain features, such as the demand-supply ratio (Fig. 5(a), (d), and (g)) and application scores (Fig. 5(b), (e), and (h)), show a synchronous variation over time, suggesting that they respond to contextual changes. On the other hand, gender balance (Fig. 5(c), (f), and (i)) does not change over time, suggesting that it is likely to respond to more long-term structural changes, e.g., cultural mechanisms and other socio-economic factors. Moreover, it is important to state that such effects, although relevant in magnitude, do not appear to be significant for the application scores in the CHES network (see Fig. 5(e) and (h)).

### 3.4 Measuring unemployment similarity

Thus far, we have identified prevailing autocorrelation patterns of features describing degree programs and applicants in both spatial distribution and temporal variations. But how informative is the Higher Education Space about the higher education system? For example, can we explain the expected unemployment level of a degree program just by looking at its connections in the HES?

To explore this question we use a propensity score matching identification strategy. We define the treatment as the link between two degree programs in the HES. Thus, pairs of degree programs in the treatment group are necessarily connected in the HES, and pairs in the control groups are not. Then, we compare the difference in unemployment levels in the treatment group against several control groups. To generate the control groups we match to each pair in the treatment group a second, unconnected pair of degree programs with an equivalent level of similarity in terms of features. In this way, we build five control groups: (1) gender level, (2) application scores, (3) demand-supply levels, (4) a control group with degree programs of the same ISCED field, and (5) all features combined. In addition, we build a randomly sampled control group, where pairs of nodes are taken at random disregarding any similarity.
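The construction of one control group can be illustrated with the simplified sketch below. It swaps the propensity score for a plain nearest-neighbor match on the absolute difference of a single feature, with replacement, so it should be read as an approximation of the idea rather than the procedure used here. All names and numbers are hypothetical.

```python
from itertools import combinations

def matched_control_pairs(nodes, links, feature):
    """For each linked pair, pick the unlinked pair with the closest feature gap.

    Matching is done with replacement: one unlinked pair can serve as the
    control for several linked pairs.
    """
    linked = {frozenset(link) for link in links}
    candidates = [p for p in combinations(nodes, 2) if frozenset(p) not in linked]
    controls = []
    for u, v in links:
        target_gap = abs(feature[u] - feature[v])
        best = min(
            candidates,
            key=lambda p: abs(abs(feature[p[0]] - feature[p[1]]) - target_gap),
        )
        controls.append(best)
    return controls

def mean_abs_difference(pairs, outcome):
    """Average absolute difference of an outcome (e.g. unemployment) over pairs."""
    return sum(abs(outcome[u] - outcome[v]) for u, v in pairs) / len(pairs)

# Toy data: four programs, one HES link, a matching feature and an outcome (in %).
nodes = ["A", "B", "C", "D"]
links = [("A", "B")]
female_share = {"A": 50, "B": 60, "C": 20, "D": 30}
unemployment = {"A": 5, "B": 6, "C": 4, "D": 12}

controls = matched_control_pairs(nodes, links, female_share)
treated_gap = mean_abs_difference(links, unemployment)
control_gap = mean_abs_difference(controls, unemployment)
```

In this made-up example the linked pair and its matched control are equally similar in gender balance, yet the linked pair is much closer in unemployment, which is the kind of contrast the comparison between treatment and control groups is designed to detect.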

In Fig. 6, rows show the average of the absolute difference in unemployment levels between pairs of degree programs for each control group. In all cases, the differences are smaller for the treatment group (vertical black line) when compared to the control groups (all differences are statistically significant; *t*-test between the averages of the two groups with an upper-bound *p*-value of 0.001). These findings support the hypothesis that the HES represents a similarity mapping between degree programs from the applicants’ perspective that cannot be obtained by estimating similarities using traditional features alone (e.g. gender, application scores or demand-supply). In other words, the network structure of the Higher Education Space captures information that enables us to improve our understanding of higher education systems.

We should note that nodes in these networks do not incorporate any information about the institutions. These specificities can potentially change the results of the current model, especially in those cases where factors, such as the prestige of higher education institutions, the societal value of degree programs (e.g. medicine), and the relative location of institutions to their recruitment base can greatly impact the applicants’ choices [58] and consequently, the structural organization of the HES.