Understanding peace through the world news

Peace is a principal dimension of well-being and is the way out of inequity and violence. Thus, its measurement has drawn the attention of researchers, policymakers, and peacekeepers. In recent years, novel digital data streams have drastically changed research in this field. The current study exploits information extracted from a new digital database, the Global Database of Events, Language, and Tone (GDELT), to capture peace through the Global Peace Index (GPI). Applying predictive machine learning models, we demonstrate that news media attention from GDELT can be used as a proxy for measuring GPI at a monthly level. Additionally, we use explainable AI techniques to obtain the most important variables that drive the predictions. This analysis highlights each country's profile and provides explanations for the predictions, particularly for the errors and the events that drive them. We believe that digital data exploited by researchers, policymakers, and peacekeepers, with data science tools as powerful as machine learning, could contribute to maximizing the societal benefits and minimizing the risks to peace. Supplementary Information: The online version contains supplementary material available at 10.1140/epjds/s13688-022-00315-z.

• Militarization contains: "Military expenditure as a percentage of GDP", "Number of armed services personnel per 100,000 people", "Volume of transfers of major conventional weapons as recipient (imports) per 100,000 people", "Volume of transfers of major conventional weapons as supplier (exports) per 100,000 people", "Financial contribution to UN peacekeeping missions", "Nuclear and heavy weapons capabilities", and "Ease of access to small arms and light weapons".

II. SUPPLEMENTARY NOTE 2: TOPICS OF GDELT
The GDELT event categories we use are related to 20 topics, as described below. For each topic, we provide a short description and a few examples of event categories:
• Make Public Statement refers to public statements expressed verbally or in action, such as "Make statement", "Make pessimistic comment", and "Decline comment".
• Appeal refers to requests, proposals, suggestions, and appeals, such as "Appeal for material cooperation", "Appeal for economic cooperation", and "Appeal to others to settle dispute".
• Express Intent To Cooperate refers to offers, promises, agreements, or other indications of willingness or commitment to cooperate, such as "Express intent to engage in material cooperation" and "Express intent to provide material aid".
• Consult refers to consultations and meetings, such as "Discuss by telephone" and "Host a visit".
• Engage In Diplomatic Cooperation refers to initiating, resuming, improving, or expanding diplomatic, nonmaterial cooperation or exchange, such as "Sign formal agreement" and "Praise or endorse".
• Engage In Material Cooperation refers to initiating, resuming, improving, or expanding material cooperation or exchange, such as "Cooperate economically" and "Share intelligence or information".
• Provide Aid refers to provisions and extensions of material aid, such as "Provide economic aid" and "Provide humanitarian aid".
• Yield refers to yieldings and concessions, such as "Accede to requests or demands for political reform", "De-escalate military engagement", and "Return, release".
• Investigate refers to non-covert investigations, such as "Investigate crime, corruption" and "Investigate human rights abuses".
• Demand refers to demands and orders, such as "Demand political reform" and "Demand settling of dispute".
• Disapprove refers to expressions of disapproval, objections, and complaints, such as "Criticize or denounce" and "Complain officially".
• Reject refers to rejections and refusals, such as "Reject request or demand for material aid" and "Reject mediation".
• Threaten refers to threats and coercive or forceful warnings with serious potential repercussions, such as "Threaten with military force" and "Threaten with administrative sanctions".
• Protest refers to civilian demonstrations and other collective actions carried out as protests, such as "Demonstrate or rally" and "Conduct strike or boycott".
• Exhibit Force Posture refers to military or police moves that fall short of the actual use of force, such as "Exhibit military or police power" and "Increase military alert status".
• Reduce Relations refers to reductions in normal, routine, or cooperative relations, such as "Reduce or break diplomatic relations" and "Halt negotiations".
• Coerce refers to repression or violence against civilians, their rights, or their property, such as "Arrest, detain" and "Seize or damage property".
• Assault refers to the use of different forms of violence, such as "Conduct non-military bombing" and "Abduct, hijack, take hostage".
• Fight refers to uses of conventional force and acts of war, such as "Use conventional military force" and "Fight with small arms and light weapons".
• Engage In Unconventional Mass Violence refers to uses of unconventional force meant to cause mass destruction, casualties, and suffering, such as "Engage in ethnic cleansing" and "Detonate nuclear weapons".

III. SUPPLEMENTARY NOTE 3: PREDICTION MODELS

Linear regression
Linear regression, one of the simplest and most widely used regression techniques, calculates the estimators of the regression coefficients (the predicted weights) by minimizing the sum of squared residuals [1]. One of its main advantages is the ease of interpreting results.
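As an illustration (not the paper's code), a minimal scikit-learn sketch on hypothetical synthetic data, where ordinary least squares recovers the generating weights:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data: y = 3 + 2*x1 - 1*x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 + 2 * X[:, 0] - 1 * X[:, 1]

# Minimizing the sum of squared residuals recovers the weights exactly here.
model = LinearRegression().fit(X, y)
print(model.intercept_)  # ≈ 3.0
print(model.coef_)       # ≈ [2.0, -1.0]
```

The fitted coefficients are directly interpretable as the marginal effect of each variable, which is the ease-of-interpretation advantage noted above.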

Elastic Net
Elastic Net is a regularized variable selection regression method. One of the essential advantages of Elastic Net is that it combines penalization techniques from the Lasso and Ridge regression methods into a single algorithm [2]. Lasso regression penalizes the sum of absolute values of the coefficients (L1 penalty), Ridge regression penalizes the sum of squared coefficients (L2 penalty), while Elastic Net imposes both L1 and L2 penalties. This means that Elastic Net can completely remove weak variables, as Lasso does, or reduce them by bringing them closer to zero, as Ridge does. Therefore, it does not lose valuable information, but still imposes penalties to lessen the impact of certain variables.
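A minimal sketch of this behavior, assuming scikit-learn and hypothetical data (in scikit-learn's parametrization, alpha is the overall regularization strength and l1_ratio mixes the L1 and L2 penalties):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Hypothetical data: only the first two of five variables carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.normal(size=1000)

# The combined L1/L2 penalty shrinks the informative coefficients and
# typically drives the pure-noise coefficients to (or very near) zero.
model = ElasticNet(alpha=0.5, l1_ratio=0.9).fit(X, y)
print(model.coef_)
```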

Decision Tree
Decision trees are used to visually and explicitly represent decisions in the form of a tree structure. A decision tree is called a regression tree when the dependent variable takes continuous values [2]. The goal of using a regression tree is to create a model that can predict the value of the dependent variable by learning simple decision rules inferred from the training data. The regression tree induction algorithm divides the dataset into smaller data groups while an associated decision tree is incrementally developed. The final tree consists of decision nodes and leaf nodes. A decision node has two or more branches, each representing values of the variable tested. A leaf node represents a decision on the value of the dependent variable. The topmost decision node, called the root node, corresponds to the most important variable. The main difference between a regression tree and a classification tree is that for regression problems the splitting objective is to minimize the variance within each partition.
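A minimal sketch with scikit-learn on a hypothetical step function, where the learned decision rule is a single threshold:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical step-function target: the root split should land near x = 5.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(300, 1))
y = np.where(X[:, 0] < 5, 1.0, 3.0)

# Each leaf predicts the mean of its partition; splits minimize variance.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[2.0], [8.0]]))  # ≈ [1.0, 3.0]
```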

Support Vector Regression (SVR)
SVR [3] is a regression learning approach which, in contrast to regression algorithms that simply minimize the error between the real and predicted values, uses a symmetrical loss function that equally penalizes overestimates and underestimates. In particular, it forms a tube symmetrically around the estimated function (hyperplane): errors whose absolute value exceeds a certain threshold are penalized both above and below the estimate, while those within the tube receive no penalty. The most commonly used kernel for finding the hyperplane is the Radial Basis Function (RBF) kernel, which we also use in our analysis. One of the main advantages of SVR is that its computational complexity does not depend on the dimensionality of the input space. Moreover, it has excellent generalization capability and provides high prediction accuracy.
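A minimal sketch with scikit-learn on a hypothetical curve; epsilon is the half-width of the tube and C the penalty on errors outside it:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical noise-free curve to regress.
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

# Errors inside the epsilon-tube are not penalized; C controls how
# strongly errors outside the tube are punished.
model = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, y)
print(model.predict([[0.0], [np.pi / 2]]))  # ≈ [0.0, 1.0]
```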

Random Forest
Random Forest limits a Decision Tree's risk of overfitting the training data [2]. As the names "Tree" and "Forest" imply, a Random Regression Forest is essentially a collection of individual Regression Trees that operate as a whole. A single Regression Tree is built on the entire dataset, using all the variables of interest. On the contrary, Random Forest builds multiple Regression Trees by randomly selecting observations and subsets of variables, and then averages the predictions over all trees. Individually, the predictions made by the Regression Trees may not be accurate, but combined they are, on average, closer to the true value.
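A minimal sketch with scikit-learn on hypothetical data, showing the row-and-variable subsampling described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical nonlinear target over four variables.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=300)

# Each tree sees a bootstrap sample of the rows and, at every split,
# a random subset of the variables; the forest averages the trees.
forest = RandomForestRegressor(n_estimators=100, max_features=2,
                               random_state=0).fit(X, y)
print(forest.score(X, y))  # training R-squared, close to 1
```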

Extreme Gradient Boosting (XGBoost)
XGBoost [4] is a scalable machine learning regression system for tree boosting. It uses a gradient descent algorithm and incorporates a regularized model to prevent overfitting. Compared to Random Forest, which builds each tree independently and combines them in parallel, XGBoost uses boosting, combining weak learners (often shallow decision trees; in the extreme case, trees with only one split, called decision stumps) sequentially, so that each new tree corrects the errors of the previous ones. In particular, XGBoost learns from the mistakes made by the current model and improves performance at each step until there is no scope for further improvement. Its main advantages are that it is fast to execute and achieves high accuracy.
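XGBoost itself is a separate library; as a sketch of the underlying boosting idea only (not XGBoost's actual implementation), each decision stump below is fit to the residuals of the current ensemble on hypothetical data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical one-dimensional target.
rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(300, 1))
y = X[:, 0] ** 2

# Gradient boosting for squared loss: each stump (a depth-1 tree) is fit
# to the residuals of the current ensemble, then added with a small
# learning rate, so every new tree corrects its predecessors' errors.
learning_rate = 0.1
pred = np.zeros_like(y)
for _ in range(200):
    stump = DecisionTreeRegressor(max_depth=1).fit(X, y - pred)
    pred += learning_rate * stump.predict(X)

print(np.mean((y - pred) ** 2))  # training error shrinks as trees accumulate
```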

IV. SUPPLEMENTARY NOTE 4: HYPERPARAMETERS
The hyperparameters we tune for Elastic Net are α, the relative importance of the L1 (Lasso) and L2 (Ridge) penalties, and λ, the amount of regularization used in the model. For Decision Tree, we tune the complexity parameters max_depth, the maximum depth of the tree; min_samples_split, the minimum number of samples required to split an internal node; and min_samples_leaf, the minimum number of samples required at a leaf node. For Random Forest, similarly to Decision Tree, we tune max_depth, min_samples_split, and min_samples_leaf. We also tune n_estimators, the number of trees in the model, and max_features, the number of variables to consider when looking for the best split. For XGBoost, we tune n_estimators, similarly to Random Forest, and max_depth, similarly to Decision Tree. We also tune learning_rate, which shrinks the weight of the new tree at each boosting step, protecting against overfitting and local minima, and colsample_bytree, the fraction of columns to be subsampled, which affects the speed of the algorithm and prevents overfitting. Last, for the SVR RBF model, we tune the regularization parameter C, which penalizes the model for making an error, and gamma, which defines how far the influence of a single training example reaches.
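As an illustration of how such tuning can be set up (the grid values below are hypothetical, not the paper's), a scikit-learn grid search over the Random Forest hyperparameters named above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical data and an illustrative grid of candidate values.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 2 * X[:, 1] + 0.1 * rng.normal(size=200)

grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 6],
    "min_samples_split": [2, 5],
    "max_features": [2, 3],
}
# Every combination is evaluated with 3-fold cross-validation.
search = GridSearchCV(RandomForestRegressor(random_state=0), grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```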

V. SUPPLEMENTARY NOTE 5: PERFORMANCE INDICATORS
We consider the following indicators to assess the performance of the prediction models with respect to the ground-truth GPI values. Our notation is as follows: y_t denotes the observed value of the GPI at time t, x_t denotes the value predicted by the model at time t, \bar{y} denotes the mean of the values y_t, and similarly \bar{x} denotes the mean of the values x_t.

Pearson Correlation, a measure of the linear dependence between two variables during a time period [t_1, t_n], is defined as:

r = \frac{\sum_{t=t_1}^{t_n} (y_t - \bar{y})(x_t - \bar{x})}{\sqrt{\sum_{t=t_1}^{t_n} (y_t - \bar{y})^2} \sqrt{\sum_{t=t_1}^{t_n} (x_t - \bar{x})^2}}    (1)

Root Mean Square Error (RMSE), a measure of prediction accuracy that represents the square root of the second sample moment of the differences between predicted and actual values, is defined as:

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=t_1}^{t_n} (x_t - y_t)^2}    (2)

Mean Absolute Percentage Error (MAPE), a measure of prediction accuracy between predicted and true values, is defined as:

\mathrm{MAPE} = \frac{100}{n} \sum_{t=t_1}^{t_n} \left| \frac{y_t - x_t}{y_t} \right|    (3)

VI. SUPPLEMENTARY NOTE 6: LINEAR MODELS RESULTS

The median Pearson Correlation for the Linear models for the 1-month-ahead predictions is 0.069, and the median MAPE is 39.273. These results demonstrate that the Linear models perform worse not only than the XGBoost models (0.521 and 1.593, respectively) but also than the Elastic Net models (0.327 and 1.997, respectively), already from the 1-month-ahead predictions.
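The three performance indicators defined in Supplementary Note 5 map directly onto a few lines of NumPy; a sketch on hypothetical observed and predicted values:

```python
import numpy as np

# Hypothetical observed GPI values y_t and model predictions x_t.
y = np.array([1.8, 2.1, 2.4, 2.0, 1.9])
x = np.array([1.7, 2.2, 2.3, 2.1, 1.8])

pearson = np.corrcoef(y, x)[0, 1]              # linear dependence of y and x
rmse = np.sqrt(np.mean((x - y) ** 2))          # root mean square error
mape = 100 * np.mean(np.abs((y - x) / y))      # mean absolute percentage error
print(pearson, rmse, mape)  # here rmse is exactly 0.1
```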