Novel embeddings improve the prediction of risk perception

We assess whether the classic psychometric paradigm of risk perception can be improved or supplanted by novel approaches relying on language embeddings. To this end, we introduce the Basel Risk Norms, a large data set covering 1004 distinct sources of risk (e.g., vaccination, nuclear energy, artificial intelligence), and compare the psychometric paradigm against novel text and free-association embeddings in predicting risk perception. We find that an ensemble model combining text and free association rivals the predictive accuracy of the psychometric paradigm, captures additional affect- and frequency-related dimensions of risk perception not accounted for by the classic approach, and has a greater range of applicability to real-world text data, such as news headlines. Overall, our results establish the ensemble of text and free-association embeddings as a promising new tool for researchers and policymakers to track real-world risk perception.

Supplementary Information: The online version contains supplementary material available at 10.1140/epjds/s13688-024-00478-x.


Data collection
We followed common procedures used in the risk perception literature to obtain data for the psychometric paradigm [e.g., 1, 2]. The pre-registration for the study is available at https://osf.io/6m7xr. In what follows, we investigate the sensitivity of our results to various factors surrounding data collection. We focus on two main factors: the impact of psychometric item ordering, which could affect both predictive accuracy and inter-item correlations, and the impact of training set size (with a focus on predictive accuracy).

Impact of psychometric item ordering
In our survey, half of the participants received the psychometric items in the order presented below (order 1) for each risk and the other half received them in the reverse order (order 2). The main reason for doing this was to investigate whether ordering actually impacts participant responses, which, to our knowledge, has not been done before, and could affect data quality.
1. Voluntary-Involuntary: Are individuals exposed to this risk voluntarily or involuntarily?
2. Immediate-Delayed: Is death from this risk immediate or delayed?
3. Known-Unknown: Is this risk known or unknown to the individuals exposed to this risk?
4. Known-Unknown (Sci.): Is this risk known or unknown to science?
5. Controllable-Uncontrollable: Is this risk controllable or uncontrollable for the individual exposed to the risk?
6. New-Old: Is this risk new or old?
7. Chronic-Catastrophic: Is this a risk that kills one person at a time (chronic) or a risk that kills large numbers of people at once (catastrophic)?
8. Calm-Dread: Is this a risk that individuals can reason about calmly or is it one that they have great dread for?
9. Not-fatal-Fatal: How fatal are the consequences of this risk?
To evaluate potential differences between the two orderings, we carried out several analyses. First, we focus on the psychometric ratings alone. To investigate whether psychometric ordering had a statistically significant impact on responses, we take the individual ratings for each risk source and psychometric item, split them into two groups (order 1 and order 2), and run an independent-samples t-test on each pair of groups. This amounted to 9,036 t-tests (1,004 risks times 9 psychometric items), of which 11.6% showed a significant difference at α = .05. This is more than twice the rate of type I errors expected under the null, suggesting a small influence of ordering on average responses.
Four out of the nine items (Immediate-Delayed, Voluntary-Involuntary, Calm-Dread, and Known-Unknown) account for almost 60% of the significant differences.However, overall, the difference in the average responses was small (average Cohen's d = .09).
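The per-pair comparison described above, an independent-samples t-test followed by Cohen's d, can be sketched as follows. The ratings here are simulated stand-ins for illustration, not the actual survey data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated stand-in: ratings from 20 participants per ordering for one
# risk-item pair (the actual analysis repeats this for all 9,036 pairs).
order1 = rng.normal(4.0, 1.5, size=20)
order2 = rng.normal(4.3, 1.5, size=20)

# Independent-samples t-test for a difference in mean ratings
t, p = stats.ttest_ind(order1, order2)

# Cohen's d using the pooled standard deviation
pooled_sd = np.sqrt((order1.var(ddof=1) + order2.var(ddof=1)) / 2)
d = (order1.mean() - order2.mean()) / pooled_sd
```

Repeating this over all 1,004 risks and 9 items yields the 9,036 tests; at α = .05, roughly 5% of them should come out significant by chance alone, which is the baseline the observed 11.6% is compared against.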
Furthermore, the average ratings in the nine psychometric items showed very high Pearson correlations of, on average, 0.88.
We further evaluated the robustness of the inter-item correlations between the two orderings because this has implications for the sensitivity of principal component analyses (PCA) often performed within the psychometric paradigm [cf. 1].
Figure 1 shows the correlations across risks between psychometric item ratings for both orderings. We observed very similar patterns of correlations but also small differences, ranging from δ < .001 (Immediate-Delayed and Chronic-Catastrophic) to δ = .21 (Immediate-Delayed and Known-Unknown), with an overall average absolute difference of δ = .08.
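The inter-item correlation comparison can be sketched as follows, using simulated mean ratings in place of the actual norms:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated stand-in: mean ratings for 1004 risks on 9 psychometric items,
# with order 2 modeled as a noisy copy of order 1.
ratings1 = rng.normal(size=(1004, 9))
ratings2 = ratings1 + rng.normal(scale=0.3, size=(1004, 9))

corr1 = np.corrcoef(ratings1, rowvar=False)  # 9x9 inter-item correlations, order 1
corr2 = np.corrcoef(ratings2, rowvar=False)  # 9x9 inter-item correlations, order 2

# Average absolute difference over the off-diagonal entries
off_diag = ~np.eye(9, dtype=bool)
mean_abs_delta = np.abs(corr1 - corr2)[off_diag].mean()
```

The per-pair differences (the analogue of the δ values reported above) are the entries of `np.abs(corr1 - corr2)`.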
Finally, we evaluated potential differences in the accuracy of predicting risk perception (see Figure 2). We observed that Psychometric 2 achieved a 6.4 percentage points higher accuracy than Psychometric 1 and a 1.1 percentage points higher accuracy than the aggregate psychometric model using both orders. This means that the reversed order is better at capturing risk perception than the original order. This may have contributed to the higher performance of the psychometric model in the Basel Risk Norms compared to the data of [2] because the latter relied only on the first ordering. The notable differences in predictive accuracy between the two orders have two noteworthy implications. First, other orderings of psychometric items could result in even higher predictive accuracy for the psychometric model. Second, the two orderings may capture distinct aspects of risk perception, suggesting that they might best be used in tandem rather than aggregated. To test the latter, we evaluated the concatenation of both orderings, Psychometric 1 & Psychometric 2, as a predictive model. We observed that the concatenated model outperformed the aggregate model by 1.6 percentage points, a small but significant effect (t = 4.00, p < .001).

Fig. 2 Investigating the impact of psychometric item ordering on performance. Psychometric 1 is obtained from participants that received order 1 (as listed in the text). Psychometric 2 participants received the reverse order. Psychometric is an aggregate of orders 1 and 2 (as used in the main analysis), and Psychometric 1 & Psychometric 2 is the concatenation of both orderings. Error bars are adjusted 95% confidence intervals [3].
In sum, psychometric item ordering had a measurable impact on average responses, inter-item correlations, and predictive accuracy. However, the differences between orderings were overall small in magnitude. Furthermore, although the slightly higher accuracy of the concatenated model compared to the aggregate model may justify using the concatenated model from the perspective of predictive accuracy, this choice would disadvantage our analysis in other ways. Specifically, it would limit interpretability, given that we possess no information on how the item ordering affects the content of the responses to the psychometric items, and it would limit comparability to previous work, in particular the study by [2]. We believe that the small gains in accuracy do not outweigh these costs, and so chose to use the aggregate model.
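The aggregate-versus-concatenated contrast can be illustrated with a small simulation (toy ratings, not the actual data; elastic net is used, as in the main analysis):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n = 500
# Toy ratings: order 2 is a noisy variant of order 1, and the target
# draws on partly distinct information from each ordering.
psy1 = rng.normal(size=(n, 9))
psy2 = psy1 + rng.normal(scale=0.8, size=(n, 9))
y = psy1[:, 0] + psy2[:, 1] + rng.normal(scale=0.3, size=n)

aggregate = (psy1 + psy2) / 2           # averaging the two orderings
concatenated = np.hstack([psy1, psy2])  # keeping both orderings as predictors

r2_agg = cross_val_score(ElasticNetCV(cv=5), aggregate, y, cv=5).mean()
r2_cat = cross_val_score(ElasticNetCV(cv=5), concatenated, y, cv=5).mean()
```

In this toy setup, concatenation outperforms aggregation precisely because the two orderings carry partly distinct signal; when the orderings are interchangeable, averaging loses nothing and halves the feature count.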

The impact of training set size on predictive accuracy
In planning the data collection of the Basel Risk Norms, we investigated the potential of increasing predictive accuracy by increasing the training set size. We trained different models on different portions of the data of [2] and recorded the accuracy of predicting risk perception (see Figure 3; green lines). The analysis showed significant potential for higher accuracy, with accuracy values increasing systematically with larger training set sizes. The increasing accuracy is likely due to a decreasing role of model overfitting. This potential for increased accuracy suggested by the reanalysis of the data of [2] was largely realized by the larger Basel Risk Norms. Figure 3 also shows the accuracies of the different models for the Basel Risk Norms, which demonstrate clear performance increases for the larger training set sizes. Three additional results concerning the relationship between training set size and predictive accuracy in the Basel Risk Norms are worth noting. First, the accuracies appear to taper off for larger training set sizes. One important implication of this is that comparisons between the low-dimensional psychometric model and the high-dimensional embedding models are fairer using the larger Basel Risk Norms. Second, the accuracy of the psychometric model is systematically higher for the Basel Risk Norms compared to the data of [2] for any training set size. This difference likely reflects the substantial increase in reliability due to a larger number of ratings. Third, embedding accuracies for small training sets are worse for the Basel Risk Norms than the data of [2] when considering all risks, and better when considering only the risks shared across data sets. These results are consistent with the higher risk rating reliabilities of the Basel Risk Norms but also suggest that the newly introduced risks result in a larger diversity of risks, making it harder to generalize from training to test set.
Overall, by increasing the size of the risk set, we boosted the performance of all models, thus permitting a fairer comparison of model performance due to less model overfitting.
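The training-set-size analysis amounts to a learning curve, which can be sketched as follows. The features and ratings here are synthetic stand-ins for the embeddings and risk perception ratings:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Toy stand-ins: 1004 "risks" with 200-dim features and a rating that
# depends on only a few of those dimensions plus noise.
X = rng.normal(size=(1004, 200))
y = X[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=1004)

scores = {}
for n_train in (100, 300, 700):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=n_train, random_state=0)
    model = ElasticNetCV(cv=5).fit(X_tr, y_tr)
    scores[n_train] = r2_score(y_te, model.predict(X_te))
```

The pattern to expect mirrors the text: test R-squared rises with training set size as regularized coefficients are estimated more reliably, then tapers off as the curve approaches the noise ceiling.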

Model comparison
In this section, we provide additional information concerning the sensitivity of our model comparison results to various analytic choices. We first justify our decision to focus only on the results of a linear regression algorithm (elastic net) in the main paper, instead of more flexible nonlinear methods such as the popular gradient boosting. We next motivate our decision to use a groupwise scaling technique during pre-processing, instead of more traditional approaches to scaling preceding regularized regression, such as standardization. Finally, for completeness, we provide a comprehensive statistical analysis of the differences between all pairwise model combinations.

Elastic net versus gradient boosting
In addition to elastic net regression, we evaluated the predictive accuracies of the different models using Scikit-Learn's gradient boosting regressor [4]. Gradient boosting is a popular nonlinear algorithm that builds an additive model out of regression trees in a forward stagewise fashion. In many cases, gradient boosting can outperform linear models, especially when more training samples are available.
We observed that for all but one model, gradient boosting was at best equal to and, in many cases, clearly worse than the linear model (see Figure 4). The only exception was the low-dimensional psychometric model, which saw a small increase in predictive accuracy of 2.6 percentage points on the Basel Risk Norms data. Interestingly, we also see the impact of the increased training set size, with the additional risks in our norm set reducing the relative advantage of elastic net over gradient boosting. This indicates that, with a sufficient number of samples, the more flexible gradient boosting model could perhaps outperform elastic net.
Overall, regularized linear regression emerged as the superior model, which is consistent with the relatively low ratio of data points to features.
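The elastic net versus gradient boosting comparison can be sketched on toy data. Here the target is genuinely linear and the data-to-feature ratio is low, loosely mirroring the regime described above; the data set is synthetic, not the risk norms:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score

# Toy linear problem with many features relative to samples.
X, y = make_regression(n_samples=300, n_features=100, n_informative=20,
                       noise=5.0, random_state=0)

linear_r2 = cross_val_score(ElasticNetCV(cv=5), X, y,
                            cv=5, scoring="r2").mean()
boosted_r2 = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                             cv=5, scoring="r2").mean()
```

With a linear data-generating process and few samples per feature, the regularized linear model wins; tree ensembles need more data to approximate smooth linear relationships, which is consistent with the pattern reported in Figure 4.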

Evaluating embedding scaling approaches
When relying on regularization techniques, such as elastic net regularization, it is common practice to standardize the predictors to even out their contribution to the regularization penalty.However, we based our analysis on unstandardized embeddings.
We did this to allow for a fair comparison between the free-association and text embeddings. The free-association embedding (SWOW) was trained using singular value decomposition, which by design allocates variance very unevenly across the embedding dimensions. Standardizing SWOW would thus remove an important prior on the importance of the embedding dimensions, which can result in reduced predictive accuracy. To quantify the potential negative effect of standardization on SWOW, and a potentially positive effect for the other embedding models, we explicitly compared the predictive accuracy of every model with standardized and unstandardized dimensions for both risk norm sets (Bhatia, 2019, and the Basel Risk Norms).
As can be seen in Figure 5, standardizing did indeed negatively impact SWOW for both norm sets (Bhatia, 2019: t = −3.09, p = .003; Basel Risk Norms: t = −3.68, p < .001). In terms of the text embeddings, the effect of standardizing was mixed, with a negative effect for GloVe on [2]'s data (t = −3.04, p = .003), and smaller positive effects on the Basel Risk Norms for GloVe (t = 2.26, p = .027) and fastText (t = 2.05, p = .043). Psychometric was not significantly affected. In light of these findings, we chose not to standardize the models in our analysis.
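The variance argument above can be made concrete with a toy SVD-style embedding (simulated, not the actual SWOW vectors):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Toy SVD-style embedding: variance falls off sharply across dimensions,
# mimicking the uneven variance allocation described for SWOW.
n_items, n_dims = 500, 50
embedding = rng.normal(size=(n_items, n_dims)) / np.arange(1, n_dims + 1)

var_raw = embedding.var(axis=0)
var_std = StandardScaler().fit_transform(embedding).var(axis=0)

# Raw dimensions carry very different variances (an implicit importance
# prior); after standardizing, every dimension has variance 1.
```

Because elastic net penalizes all coefficients equally, the raw variance profile acts as a prior: high-variance (early SVD) dimensions can earn large effective contributions more cheaply. Standardizing erases that prior, which is the mechanism proposed for SWOW's accuracy loss.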

Statistical tests
The comparison of models was carried out using the procedure described in [3] (see also [5]). It involves calculating the differences in model performance across the same 100 (10x10) train-test splits for each pair of models and testing the null hypothesis that the mean difference equals zero using an adjusted paired t-test that accounts for the dependence between train-test splits.
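Assuming the adjustment in [3] is the corrected resampled t-test of Nadeau and Bengio (the standard correction for overlapping cross-validation training sets), the procedure can be sketched as:

```python
import numpy as np
from scipy import stats

def corrected_resampled_ttest(diffs, n_train, n_test):
    """Corrected resampled t-test for k repeated train/test splits.

    The variance of the mean difference is inflated by (1/k + n_test/n_train)
    to account for the dependence induced by overlapping training sets.
    For 10x10-fold CV, k = 100 and n_test / n_train = 1/9.
    """
    diffs = np.asarray(diffs, dtype=float)
    k = len(diffs)
    mean = diffs.mean()
    var = diffs.var(ddof=1)
    t = mean / np.sqrt(var * (1.0 / k + n_test / n_train))
    p = 2 * stats.t.sf(abs(t), df=k - 1)
    return t, p
```

The inputs are the per-split performance differences between two models; under the naive paired t-test the same differences would yield an anti-conservative p-value, because the splits are not independent.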
To give an overview of all possible model comparisons, Figure 6 shows the differences in R-squared predictive accuracy for all pairs of individual and ensemble models (y-axis models minus x-axis models) with nonsignificant differences displayed as white.
Several important insights emerge from the patterns of results. First, the patterns of results are highly similar between the data of [2] and the Basel Risk Norms, with the exception of the larger number of significant differences for the Basel Risk Norms due to the higher reliability and larger data set size. Second, ensembles containing the psychometric model outperform ensembles without the psychometric model, as indicated by the bright rectangles in the bottom-left corners. Third, there is only one model not significantly different from the psychometric model, GloVe & SWOW, attesting to the strong performance of SWOW in capturing important aspects of risk perception.

Word norms
Our interpretability analysis identified unaccounted dimensions of risk by relying on a set of word norms.For this purpose, we selected a set of norms from [6] that we hypothesized to be related to risk perception.Table 1 provides an overview of these norms and lists the individual sources.As reported in the main text, these norms are able to predict 64.3% of risk perception variance (with 32% of the norm data imputed using Word2Vec to deal with missing norm data on certain risks), establishing their usefulness for revealing the key aspects of risk perception.
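The text states that missing norm values were imputed using Word2Vec; the general idea, predicting a norm from the embeddings of risks where it is observed, can be sketched with generic embeddings and ridge regression (both stand-ins, not the actual pipeline):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
# Stand-ins: 100 risks with 20-dim embeddings; one word norm (say, valence)
# is observed for 70 risks and missing for the remaining 30.
emb = rng.normal(size=(100, 20))
norm = emb[:, 0] + 0.1 * rng.normal(size=100)  # toy norm tied to the embedding
observed, missing = np.arange(70), np.arange(70, 100)

# Fit on risks with observed norm values, then fill in the missing entries.
model = Ridge(alpha=1.0).fit(emb[observed], norm[observed])
imputed = model.predict(emb[missing])
```

In the actual analysis this fills in the 32% of norm data that was unavailable for certain risks before the norms are used to predict risk perception.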

Figure 3
Evaluating how test performance varies with training set size for 3 data (sub-)sets: (i) Basel Risk Norms (All), which refers to our full data set of 1,004 risks, (ii) Basel Risk Norms (Bhatia Set), referring to our data limited to the same 306 risks as used in [2], and (iii) the data of [2]. Test sets are composed of all remaining risks in the data. Train-test splits were sampled randomly (i.e., bootstrapped), with 10 repetitions per model per training set size. Error bars are 95% confidence intervals.

Figure 4
Pairwise differences between elastic net and gradient boosting using 10x10-fold cross-validation. Cross-validation via [2]'s risk norms (306 risks) is colored cyan and points obtained using the Basel Risk Norms (1004 risks) are colored purple. Error bars are adjusted 95% confidence intervals [3].

Figure 5
Pairwise differences between standardized and unstandardized models using 10x10-fold cross-validation and elastic net regression. Cross-validation via [2]'s risk norms (306 risks) is colored cyan and points obtained using the Basel Risk Norms (1004 risks) are colored purple. Error bars are adjusted 95% confidence intervals [3].

Figure 6
Heatmap illustrating the differences in 10x10-fold cross-validation R-squared between all pairwise model combinations using elastic net regression (y-axis models minus x-axis models). White squares reflect mean differences that do not significantly differ from zero. The top panel shows the results for the data of [2] and the bottom panel the results for the Basel Risk Norms.

Figure 1
Investigating the impact of psychometric item ordering on inter-item correlations. A. Order 1 inter-item correlations. B. Order 2 inter-item correlations. C. Order 1 minus order 2 inter-item correlations.
