This section describes the proposed Bayesian model and provides details on model training and inference. Since Gaussian Processes (GPs) form the basis of our model, a brief background on them is also provided.
4.1 Model intuition
We propose a semi-parametric model given as: \({\textbf{y}} = {\mathbf{B}}{\textbf{x}} + f(\textbf{x},{\mathbf{s}},u, t) + \epsilon \). The first term models the linear relationship between the EO covariates (x) and the targets (y), where B is the coefficient matrix of the linear model. The multiple regression targets correspond to household energy access indicators (e.g., electricity, gas, etc.). The second term employs a non-linear functional mapping, based on a GP, between an augmented covariate vector and y. The augmented covariate vector includes x, the spatio-temporal coordinates (s, t), and an urban-rural indicator (u).
GPs belong to the class of Bayesian models in which the choice of kernel functions enables one to learn highly nonlinear relationships between the covariates and the target variables [40]. GPs can be made more flexible and interpretable by combining (adding, multiplying, or convolving) different kernels, where each kernel models a certain effect within the individual covariates.
We propose a specialized kernel for our GP model, with the following form:
$$ {\mathbf{K}}_{mo} = \underbrace{{\mathbf{K}}_{c} + ({ \mathbf{K}}_{sp} * {{\mathbf{K}}}_{ur}) + { \mathbf{K}}_{t}}_{ \text{covariate effect}} \odot \underbrace{{ \mathbf{K}}_{\ell}}_{ \substack{\text{multi-target}\\ \text{effect}}}. $$
(3)
The first kernel in (3) models three types of effects in an additive form: an EO covariate effect \({\mathbf{K}}_{c}\); a spatial auto-correlation effect with urban-rural delineation, \({\mathbf{K}}_{sp} * {{\mathbf{K}}}_{ur}\), which assigns more weight to spatially proximal and similar microregions (i.e., in EO data, an urban location might derive some similarity from nearby rural locations as well as from other nearby urban locations); and a temporal recency effect, \({\mathbf{K}}_{t}\), which assigns more weight to recent observations. The second kernel, \({\mathbf{K}}_{\ell}\), provides the multi-target formalism by exploiting correlations across the different targets.
The rationale for using such a specialized kernel is that additive kernels are known to extrapolate well to unseen test data [41, 42], and we empirically demonstrate better performance of our model compared to existing works.
Model training involves estimating the optimal values of the coefficient matrix B and the hyper-parameters associated with the kernel \({\mathbf{K}}_{mo}\) in (3), and is done by maximizing the marginalized log-likelihood of the training data. Elastic-net regularization is employed on the linear model to prevent learning from spurious features and to avoid overfitting on limited training data [43]. We perform out-of-sample spatial and temporal validation to test our model’s generalizability.
4.2 Model details
Notation
For a given microregion, indexed by c, the covariate vector, target vector, spatial coordinates, and the urban-rural indicator are denoted by \({\mathbf{x}}_{t}^{(c)}\), \({\mathbf{y}}_{t}^{(c)}\), \({\mathbf{s}}^{(c)}\), and \(u^{(c)}\), respectively, and are collectively denoted as \({\mathbf{z}}^{(c)}_{t}\). Note that the covariate vectors and target vectors are also indexed by time t, denoting the corresponding years. Each individual target will be denoted by \(y^{(c)}_{to}\). For notational simplicity, we will drop the superscript c to denote a typical microregion, unless needed. In general, we will use a lower-case bold symbol to denote a vector, an upper-case bold symbol to denote a matrix, and a lower-case normal symbol to denote a scalar value. Collections (or sets) of entities will be denoted using calligraphic symbols, e.g., \(\mathcal{X}\), \(\mathcal{Y}\). The oth entry of a vector, e.g., x, will be denoted as \(x_{o}\).
4.2.1 Model description
The proposed semi-parametric model is written as:
$$ {\mathbf{y}}_{t} = {\mathbf{B}} {\mathbf{x}}_{t} + f({ \mathbf{x}}_{t},{\mathbf{s}},u, t) + \epsilon, $$
(4)
where B is the coefficient matrix for the linear component and ϵ denotes the unexplained noise and is modeled as a zero-mean Gaussian random variable, i.e., \(\epsilon \sim N(0,\sigma ^{2}_{n})\). The function \(f()\) captures the non-linear dependencies between the covariates and the residual vector, \(\boldsymbol{\delta}_{t}\), where \(\boldsymbol{\delta}_{t} = ({\mathbf{y}}_{t} - {\mathbf{B}}{\mathbf{x}}_{t})\), and is modeled using a Gaussian Process.
Background on Gaussian Processes (GP)
A GP is a Bayesian formulation to learn non-parametric, non-linear functions through the use of kernels. A GP allows placing a stochastic prior on the function \(f({\mathbf{z}}_{t})\), where \({\mathbf{z}}_{t} \equiv ({\mathbf{x}}_{t},{\mathbf{s}},u, t)\). The GP prior is completely specified by a mean function, \(m(\cdot )\), and a positive-definite kernel function \(k(\cdot ,\cdot )\). The mean function represents the expected value of \(f()\), i.e., \(m({\mathbf{z}}_{t}) = \mathbb{E}[f({\mathbf{z}}_{t})]\), and is often set to 0, i.e., \(m({\mathbf{z}}_{t}) = 0\). The kernel function defines the covariance between any two realizations of \(f()\), i.e.,
$$ k\bigl({\mathbf{z}}_{t},{\mathbf{z}}^{\prime}_{t} \bigr) = \mathbb{E}\bigl[f({\mathbf{z}}_{t})f\bigl({ \mathbf{z}}^{\prime}_{t}\bigr)\bigr] $$
(5)
assuming a zero mean function.
The definition of GP specifies that for any finite collection of inputs, \(\mathcal{Z} = ({\mathbf{z}}^{c_{1}}_{t_{1}},{\mathbf{z}}^{c_{2}}_{t_{2}}, \ldots ,{\mathbf{z}}^{c_{n}}_{t_{n}})\) the vector of function values, \({\mathbf{f}}(\mathcal{Z}) = (f({\mathbf{z}}^{c_{1}}_{t_{1}}),f({\mathbf{z}}^{c_{2}}_{t_{2}}), \ldots , f({\mathbf{z}}^{c_{n}}_{t_{n}}))\), follow a multivariate Gaussian distribution, i.e.,
$$ {\mathbf{f}}(\mathcal{Z}) \sim N({\mathbf{0}},{\mathbf{K}}_{\mathcal{Z},\mathcal{Z}}), $$
(6)
where \({\mathbf{K}}_{\mathcal{Z},\mathcal{Z}}\) is an \((n \times n)\) covariance matrix whose ijth entry is equal to \(k({\mathbf{z}}^{c_{i}}_{t_{i}},{\mathbf{z}}^{c_{j}}_{t_{j}})\).
For a single output, indexed by o, a GP regression model (GPR) can be defined by assuming that the targets are modeled as:
$$ \boldsymbol{\delta}_{o} \sim N\bigl({\mathbf{f}}(\mathcal{Z}),\sigma ^{2}_{n}{ \mathbf{I}}\bigr), $$
(7)
where I is the \((n \times n)\) identity matrix. Using (6) and (7), one can marginalize out \({\mathbf{f}}(\mathcal{Z})\), such that:
$$\begin{aligned} \begin{aligned} p(\boldsymbol{\delta}_{o}\vert \mathcal{Z}) & = \int p\bigl( \boldsymbol{\delta}_{o}\vert {\mathbf{f}}( \mathcal{Z})\bigr)p\bigl({\mathbf{f}}( \mathcal{Z})\bigr)\,d{\mathbf{f}} \\ & = N\bigl(0,{\mathbf{K}}_{\mathcal{Z},\mathcal{Z}} + \sigma ^{2}_{n}{ \mathbf{I}}\bigr). \end{aligned} \end{aligned}$$
(8)
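As a concrete illustration (not the authors' implementation), the marginal covariance in (8) can be assembled from any kernel function as in the following NumPy sketch; the function and variable names are ours:

```python
import numpy as np

def marginal_covariance(Z, k, sigma_n):
    """Covariance of the marginal p(delta_o | Z) = N(0, K_ZZ + sigma_n^2 I), as in (8)."""
    n = len(Z)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(Z[i], Z[j])      # kernel evaluation k(z_i, z_j), as in (5)
    return K + sigma_n**2 * np.eye(n)
```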
4.2.2 Choice of kernel function
Our kernel function is formulated as follows:
$$ k\bigl({\mathbf{z}}^{c_{i}}_{t_{i}},{\mathbf{z}}^{c_{j}}_{t_{j}} \bigr) = k_{f}\bigl({\mathbf{x}}^{c_{i}}_{t_{i}},{ \mathbf{x}}^{c_{j}}_{t_{j}}\bigr) + \bigl(k_{sp}\bigl({ \mathbf{s}}^{c_{i}},{\mathbf{s}}^{c_{j}}\bigr) \times k_{ur} \bigl(u^{c_{i}},u^{c_{j}}\bigr)\bigr) + k_{t}(t_{i},t_{j}), $$
(9)
where \(k_{f}\), \(k_{sp}\), \(k_{ur}\) and \(k_{t}\) denote the kernels that capture similarity in the covariates, spatial autocorrelation, urban-rural delineation, and temporal recency, respectively. We use the squared exponential kernel for \(k_{f}\), \(k_{sp}\) and \(k_{t}\); it is the most widely used kernel function owing to its ability to learn smooth non-linear functional relationships [40]. The individual kernel specifications are given as follows:
$$\begin{aligned}& k_{f}\bigl({\mathbf{x}}^{c_{i}}_{t_{i}},{ \mathbf{x}}^{c_{j}}_{t_{j}}\bigr) = \sigma _{f}^{2} \exp \biggl({- \frac{ \Vert {\mathbf{x}}^{c_{i}}_{t_{i}}-{\mathbf{x}}^{c_{j}}_{t_{j}} \Vert ^{2}}{2\ell _{f}^{2}}} \biggr), \end{aligned}$$
(10)
$$\begin{aligned}& k_{sp}\bigl({\mathbf{s}}^{c_{i}},{\mathbf{s}}^{c_{j}} \bigr) = \sigma _{sp}^{2} \exp \biggl({- \frac{ \Vert {\mathbf{s}}^{c_{i}}-{\mathbf{s}}^{c_{j}} \Vert ^{2}}{2\ell _{sp}^{2}}} \biggr), \end{aligned}$$
(11)
$$\begin{aligned}& k_{t}(t_{i},t_{j}) = \sigma _{t}^{2} \exp \biggl({- \frac{(t_{i}-t_{j})^{2}}{2\ell _{t}^{2}}} \biggr). \end{aligned}$$
(12)
The urban-rural delineation is modeled by \(k_{ur}\), which is specified as the following categorical kernel:
$$ k_{ur}\bigl(u^{c_{i}},u^{c_{j}}\bigr) = \textstyle\begin{cases} 1 & \text{if } u^{c_{i}} = u^{c_{j}}, \\ 0 & \text{otherwise}. \end{cases} $$
(13)
The scalars \(\sigma _{f}\), \(\ell _{f}\), \(\sigma _{sp}\), \(\ell _{sp}\), \(\sigma _{t}\), \(\ell _{t}\) are the hyper-parameters of the kernel functions and are estimated from the data, as described later.
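For concreteness, a minimal Python sketch of the composite covariate-effect kernel (9) with the component kernels (10)-(13) is given below; the dictionary-based input representation and all variable names are illustrative assumptions, not part of the model specification:

```python
import numpy as np

def sq_exp(a, b, sigma, ell):
    """Squared exponential kernel, as in (10)-(12)."""
    return sigma**2 * np.exp(-np.sum((np.asarray(a) - np.asarray(b))**2) / (2.0 * ell**2))

def k_ur(u_i, u_j):
    """Categorical urban-rural kernel (13)."""
    return 1.0 if u_i == u_j else 0.0

def k_covariate_effect(z_i, z_j, hp):
    """Covariate-effect kernel k(z_i, z_j) in (9), with hyper-parameters in the dict hp."""
    return (sq_exp(z_i['x'], z_j['x'], hp['sigma_f'], hp['ell_f'])
            + sq_exp(z_i['s'], z_j['s'], hp['sigma_sp'], hp['ell_sp']) * k_ur(z_i['u'], z_j['u'])
            + sq_exp(z_i['t'], z_j['t'], hp['sigma_t'], hp['ell_t']))
```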
Feature selection
To perform feature selection on the EO data, we employ an Automatic Relevance Determination (ARD) kernel in our model. ARD kernels are effective at selecting a small explanatory subset of features from a large set of irrelevant features by regularizing the solution space using a data-dependent prior [44]. Note that the feature kernel in (10) uses a single global characteristic length scale (\(\ell _{f}\)). In ARD, however, each feature has its own characteristic length scale, denoted by \(\ell _{fr}\) for the rth feature. The ARD feature kernel is given as:
$$ k_{f}\bigl({\mathbf{x}}^{c_{i}}_{t_{i}},{\mathbf{x}}^{c_{j}}_{t_{j}}\bigr) = \sigma _{f}^{2} \exp \biggl(-\frac{1}{2}\bigl({\mathbf{x}}^{c_{i}}_{t_{i}}-{\mathbf{x}}^{c_{j}}_{t_{j}}\bigr)^{\top }P^{-1} \bigl({\mathbf{x}}^{c_{i}}_{t_{i}}-{\mathbf{x}}^{c_{j}}_{t_{j}} \bigr) \biggr),\qquad P = \operatorname{diag}\bigl(\ell _{f1}^{2},\ell _{f2}^{2},\ldots \bigr). $$
(14)
The inverse of the length scale of each feature, i.e., \(\frac{1}{\ell _{fr}}\), is used as a proxy for feature relevance [40].
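A corresponding sketch of the ARD feature kernel (14) and of the relevance proxy, again with illustrative names, is:

```python
import numpy as np

def k_f_ard(x_i, x_j, sigma_f, ell_f):
    """ARD feature kernel (14); ell_f is a vector of per-feature length scales."""
    d = (np.asarray(x_i) - np.asarray(x_j)) / np.asarray(ell_f)
    return sigma_f**2 * np.exp(-0.5 * np.dot(d, d))

def feature_relevance(ell_f):
    """1 / ell_fr as a proxy for the relevance of the r-th feature."""
    return 1.0 / np.asarray(ell_f)
```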
4.2.3 Handling multiple targets
The GP regression model described above can only handle a single target. Since the problem studied in this paper involves multiple targets, we present the following scheme, adopted from [45], to exploit the correlations among the targets in the regression model. In this formulation, each instance, consisting of a covariate vector and an m-length target vector, is converted into m instances with a scalar target value. We introduce an additional discrete covariate, ℓ, which corresponds to the index of the target. For example, a covariate vector and m-length target vector pair, given as \(\langle{\mathbf{z}}^{(c)}_{t}; \boldsymbol{\delta}^{(c)}_{t}\rangle \), is transformed into m pairs as follows:
$$ \bigl\langle {\mathbf{z}}^{(c)}_{t}; \boldsymbol{\delta}^{(c)}_{t}\bigr\rangle \Rightarrow \textstyle\begin{cases} \langle ({\mathbf{z}}^{(c)}_{t}, 1); \delta ^{(c)}_{t1}\rangle, \\ \langle ({\mathbf{z}}^{(c)}_{t}, 2); \delta ^{(c)}_{t2}\rangle, \\ \vdots \\ \langle ({\mathbf{z}}^{(c)}_{t}, m); \delta ^{(c)}_{tm}\rangle. \end{cases} $$
(15)
Note that the target is transformed into a scalar. We denote the augmented covariate vector as \(\bar{\mathbf{z}}^{(c)}_{to} \equiv ({\mathbf{z}}^{(c)}_{t}, o)\). The extra covariate is handled by multiplying the kernel function, \(k()\), in (9) with a target-specific kernel function, \(k_{\ell}()\), to obtain the final kernel function:
$$ \bar{k}\bigl(\bar{\mathbf{z}}^{c_{i}}_{t_{i}},\bar{ \mathbf{z}}^{c_{j}}_{t_{j}}\bigr) = k\bigl({ \mathbf{z}}^{c_{i}}_{t_{i}},{ \mathbf{z}}^{c_{j}}_{t_{j}}\bigr) \times k_{\ell}(o_{i},o_{j}). $$
(16)
Note that the resulting covariance matrix for an augmented single-target data set can be expressed as:
$$ {\mathbf{K}}_{\bar{\mathcal{Z}},\bar{\mathcal{Z}}} = {\mathbf{K}}_{\mathcal{Z}, \mathcal{Z}} \otimes { \mathbf{K}}_{\ell }, $$
(17)
where ⊗ denotes the Kronecker product between the \((n \times n)\) covariance matrix, \({\mathbf{K}}_{\mathcal{Z},\mathcal{Z}}\), and the \((m \times m)\) matrix \({\mathbf{K}}_{\ell}\), such that \(k_{\ell}(o_{i},o_{j}) = {\mathbf{K}}_{\ell}[o_{i},o_{j}]\). For a GP, \({\mathbf{K}}_{\bar{\mathcal{Z}},\bar{\mathcal{Z}}}\) needs to be positive-definite, which means that \({\mathbf{K}}_{\ell}\) should also be positive-definite.
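The augmentation in (15) and the Kronecker-structured covariance in (17) can be sketched as follows (illustrative only; the ordering convention of the augmented instances is our assumption):

```python
import numpy as np

def augment_instance(z, delta):
    """Turn one (z, m-length delta) pair into m scalar-target pairs, as in (15)."""
    return [((z, o + 1), delta[o]) for o in range(len(delta))]

def augmented_covariance(K_zz, K_ell):
    """K_{Zbar,Zbar} = K_{Z,Z} (Kronecker product) K_ell, as in (17); rows and columns
    are ordered so that the target index varies fastest within each training instance."""
    return np.kron(K_zz, K_ell)
```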
The \(m^{2}\) entries in \({\mathbf{K}}_{\ell}\) can be thought of as the hyper-parameters of the kernel function in (16) and can be learnt from the training data. However, instead of treating each entry as a hyper-parameter, we consider a parameterization of \({\mathbf{K}}_{\ell}\) using fewer hyper-parameters. In particular, we consider a spherical parameterization [46] of \({\mathbf{K}}_{\ell}\), given as follows:
$$\begin{aligned} {\mathbf{K}}_{\ell} = {\mathbf{S}}^{\top }{\mathbf{S}}, \end{aligned}$$
(18)
where S is an upper triangular matrix of size \((m \times m)\), whose oth column contains the spherical coordinates of a point on the unit hypersphere in \(\mathbb{R}^{o}\), followed by \((m-o)\) zeros. For example, for \(m = 4\):
$$ {\mathbf{S}} = \begin{pmatrix} 1 & \cos \phi ^{(1)} & \cos \phi ^{(2)} & \cos \phi ^{(4)} \\ 0 & \sin \phi ^{(1)} & \sin \phi ^{(2)}\cos \phi ^{(3)} & \sin \phi ^{(4)}\cos \phi ^{(5)} \\ 0 & 0 & \sin \phi ^{(2)}\sin \phi ^{(3)} & \sin \phi ^{(4)}\sin \phi ^{(5)}\cos \phi ^{(6)} \\ 0 & 0 & 0 & \sin \phi ^{(4)}\sin \phi ^{(5)}\sin \phi ^{(6)} \end{pmatrix}. $$
(19)
Here, \(\phi ^{(1)}, \phi ^{(2)}, \ldots \) are the hyper-parameters that parameterize the matrix S. For m targets, one requires \(\frac{m(m - 1)}{2}\) hyper-parameters to specify S. The spherical parameterization has three advantages. First, it allows us to parameterize an \((m \times m)\) matrix using only \(\frac{m(m - 1)}{2}\) hyper-parameters. Second, it ensures that the resulting matrix \({\mathbf{K}}_{\ell}\) is positive-definite. Finally, the off-diagonal entries of \({\mathbf{K}}_{\ell}\) encode the correlations among the targets and can be interpreted as such after training the model.
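A sketch of how the angles \(\phi ^{(\cdot )}\) generate S and \({\mathbf{K}}_{\ell}\) per (18)-(19) is given below; the column-wise consumption of the flat angle vector is our assumption:

```python
import numpy as np

def spherical_K_ell(phi, m):
    """Build K_ell = S^T S from m*(m-1)//2 angles, as in (18)-(19).
    S is upper triangular with unit-norm columns, so K_ell is positive-definite
    with unit diagonal."""
    S = np.zeros((m, m))
    S[0, 0] = 1.0
    idx = 0
    for o in range(1, m):                 # columns 2..m (0-based index o)
        angles = phi[idx:idx + o]
        idx += o
        prod = 1.0
        for r in range(o):
            S[r, o] = prod * np.cos(angles[r])
            prod *= np.sin(angles[r])
        S[o, o] = prod                    # last nonzero entry: product of sines
    return S.T @ S
```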
4.2.4 Model training
The parameters of the proposed model consist of the coefficient matrix for the linear model, B, the variance term for the observational likelihood in (7), \(\sigma _{n}\), the kernel hyperparameters, \(\ell _{f}\), \(\sigma _{f}\), \(\ell _{sp}\), \(\sigma _{sp}\), \(\ell _{t}\), \(\sigma _{t}\) (see (10), (11), (12)), and the spherical coordinates in the upper-triangular entries of S.
We assume that the training data consists of n instances, \(\mathcal{Z} = ({\mathbf{z}}^{(c_{1})}_{t_{1}},{\mathbf{z}}^{(c_{2})}_{t_{2}}, \ldots ,{\mathbf{z}}^{(c_{n})}_{t_{n}})\), where each \({\mathbf{z}}^{(c_{i})}_{t_{i}} \equiv ({\mathbf{x}}^{(c_{i})}_{t_{i}},{\mathbf{s}}^{(c_{i})},u^{(c_{i})},t_{i})\), and the corresponding targets \(\mathcal{Y} = ({\mathbf{y}}^{(c_{1})}_{t_{1}},{\mathbf{y}}^{(c_{2})}_{t_{2}}, \ldots ,{\mathbf{y}}^{(c_{n})}_{t_{n}})\). The linear coefficient matrix B is first estimated using a regularized least squares estimation procedure, with the loss function defined as:
$$ J({\mathbf{B}}) = \frac{1}{2n} \Vert {\mathbf{Y}} - {\mathbf{XB}} \Vert ^{2}_{F} + \alpha \lambda \Vert {\mathbf{B}} \Vert ^{2}_{F} + \alpha (1-\lambda ) \vert { \mathbf{B}} \vert , $$
(20)
where \(\Vert \cdot \Vert ^{2}_{F}\) and \(\vert \cdot \vert \) denote the square of the Frobenius norm and the \(l_{1}\) norm of a matrix, respectively. X is the covariate matrix consisting of the covariate vectors, i.e., \({\mathbf{X}} = ({\mathbf{x}}^{(c_{1})}_{t_{1}}, {\mathbf{x}}^{(c_{2})}_{t_{2}}, \ldots , {\mathbf{x}}^{(c_{n})}_{t_{n}})^{\top}\), and Y is the target matrix consisting of the target vectors, i.e., \({\mathbf{Y}} = ({\mathbf{y}}^{(c_{1})}_{t_{1}}, {\mathbf{y}}^{(c_{2})}_{t_{2}}, \ldots , {\mathbf{y}}^{(c_{n})}_{t_{n}})^{\top}\). While the first term in (20) is the standard least squares loss, the second and third terms act as an elastic-net regularizer on the coefficients, employed to reduce the impact of spurious features and to avoid overfitting [47], i.e., the situation in which a model performs well on in-sample data but poorly on out-of-sample points. The scalars α and λ are the regularization parameters and are tuned using cross-validation on the training data; in this study, the tuned values of α and λ are 0.1 and 0.5, respectively. The loss function in (20) is minimized using a coordinate descent algorithm.
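As an illustration, the linear component could be fit with scikit-learn's ElasticNet as a stand-in for (20); note that sklearn's penalty convention differs slightly from (20) (its l1_ratio weights the \(l_{1}\) term and its \(l_{2}\) term carries a factor of 1/2), so the hyperparameter mapping and the per-target loop below are approximations, not the authors' code:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_linear_component(X, Y, alpha=0.1, lam=0.5):
    """Return B (d x m): one column of coefficients per target, fit by coordinate descent."""
    m = Y.shape[1]
    B = np.zeros((X.shape[1], m))
    for o in range(m):
        model = ElasticNet(alpha=alpha, l1_ratio=1.0 - lam, fit_intercept=False)
        model.fit(X, Y[:, o])
        B[:, o] = model.coef_
    return B

# Residuals passed to the GP stage: delta = Y - X @ B
```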
After estimating the optimal coefficients in B, the hyperparameters associated with the GP are estimated by maximizing the marginal log-likelihood of the residuals, using the marginal likelihood in (8). For each training instance, the residual vector is defined as \(\boldsymbol{\delta}^{(c_{i})}_{t_{i}} = {\mathbf{y}}^{(c_{i})}_{t_{i}} - { \mathbf{B}}^{\top}{\mathbf{x}}^{(c_{i})}_{t_{i}}\). Let \(\bar{\mathcal{Z}}\) denote the training data set in which every training instance is augmented according to (15), and let \(\bar{\boldsymbol{\delta}}\) be the vector containing all the scalar targets. Given that the marginalized conditional distribution \(p(\bar{\boldsymbol{\delta}}\vert \bar{\mathcal{Z}})\) is a multivariate Gaussian with zero mean and covariance \(({\mathbf{K}}_{\bar{\mathcal{Z}},\bar{\mathcal{Z}}} + \sigma _{n}^{2}I)\) (see (8)), the marginalized log-likelihood can be expressed as:
$$\begin{aligned} \begin{aligned} \log{p(\bar{\boldsymbol{\delta}}\vert \bar{ \mathcal{Z}})} = {}& - \frac{1}{2}\bar{\boldsymbol{\delta}}^{\top} \bigl({\mathbf{K}}_{\bar{\mathcal{Z}}, \bar{\mathcal{Z}}}+ \sigma _{n}^{2}I \bigr)^{-1}\bar{\boldsymbol{\delta}} \\ &{} -\frac{1}{2}\log \bigl\vert \bigl({\mathbf{K}}_{\bar{\mathcal{Z}}, \bar{\mathcal{Z}}} + \sigma _{n}^{2}I\bigr) \bigr\vert - \frac{nm}{2} \log{2\pi}. \end{aligned} \end{aligned}$$
(21)
The marginalized log-likelihood is maximized with respect to the kernel hyperparameters and \(\sigma _{n}\), using stochastic gradient descent [48].
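For reference, a NumPy sketch of evaluating (21) via a Cholesky factorization is shown below (illustrative only; in practice this objective is maximized over the kernel hyperparameters and \(\sigma _{n}\) with a gradient-based optimizer):

```python
import numpy as np

def log_marginal_likelihood(delta_bar, K_aug, sigma_n):
    """Evaluate (21). delta_bar: stacked residuals (length n*m); K_aug: K_{Zbar,Zbar}."""
    N = delta_bar.shape[0]                         # N = n * m
    C = K_aug + sigma_n**2 * np.eye(N)
    L = np.linalg.cholesky(C)                      # C = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, delta_bar))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))     # log |C|
    return -0.5 * delta_bar @ alpha - 0.5 * log_det - 0.5 * N * np.log(2 * np.pi)
```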
4.2.5 Model inference
To infer any target for a microregion at a new time instance, we use the GP formulation to estimate the posterior distribution for the target, conditioned on the training data set, \((\mathcal{Z}, \mathcal{Y})\). Let the covariates for the test instance be denoted as \({\mathbf{z}}_{*} = ({\mathbf{x}}_{*},{\mathbf{s}}_{*},u_{*},t_{*})\). For the oth target, the posterior distribution of \(y_{*o}\) is a Gaussian distribution, whose mean, \(\bar{y}_{*o}\), and variance, \(\operatorname{var}[y_{*o}]\) are given by the following expressions [40]:
$$\begin{aligned}& \bar{y}_{*o} = {\mathbf{b}}_{o}^{\top }{ \mathbf{x}}_{*} + {\mathbf{k}}_{*}^{\top}\bigl({ \mathbf{K}}_{\bar{\mathcal{Z}},\bar{\mathcal{Z}}} + \sigma _{n}^{2}I \bigr)^{-1} \bar{\boldsymbol{\delta}}, \end{aligned}$$
(22)
$$\begin{aligned}& \operatorname{var}[y_{*o}] = k_{**} - {\mathbf{k}}_{*}^{\top} \bigl({\mathbf{K}}_{ \bar{\mathcal{Z}},\bar{\mathcal{Z}}} + \sigma _{n}^{2}I \bigr)^{-1}{\mathbf{k}}_{*} + \sigma ^{2}_{n}, \end{aligned}$$
(23)
where \({\mathbf{b}}_{o}\) corresponds to the oth column of the coefficient matrix, B. The vector \({\mathbf{k}}_{*}\) contains the kernel function evaluation between every augmented training instance and the test instance, and the scalar \(k_{**}\) is the kernel function evaluation for the test instance with itself.
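A sketch of the predictive equations (22)-(23), reusing a Cholesky factor L of \(({\mathbf{K}}_{\bar{\mathcal{Z}},\bar{\mathcal{Z}}} + \sigma _{n}^{2}I)\), is given below; the argument names are illustrative:

```python
import numpy as np

def predict_target(b_o, x_star, k_star, k_star_star, L, delta_bar, sigma_n):
    """b_o: o-th column of B; k_star: kernel evaluations between the test point and
    the augmented training set; k_star_star: kernel of the test point with itself."""
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, delta_bar))
    mean = b_o @ x_star + k_star @ alpha                 # posterior mean, (22)
    v = np.linalg.solve(L, k_star)
    var = k_star_star - v @ v + sigma_n**2               # posterior variance, (23)
    return mean, var
```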