This section describes the proposed Bayesian model and provides details on model training and inference. Since Gaussian Processes (GPs) form the basis of our model, a brief background on them is also included.

### 4.1 Model intuition

We propose a semi-parametric model given as: \({\textbf{y}} = {\mathbf{B}}{\textbf{x}} + f(\textbf{x},{\mathbf{s}},u, t) + \epsilon \). The first term models the linear relationship between EO covariates (**x**) and the targets (**y**), where **B** is the coefficient matrix for the linear model. The multiple targets of regression correspond to household energy access indicators (e.g., electricity, gas). The second term employs a non-linear functional mapping, based on a GP, between an augmented covariate vector and **y**. The augmented covariate vector includes **x**, the spatio-temporal coordinates (**s**, *t*), and an urban-rural indicator (*u*).

GPs belong to the class of *Bayesian models*, where the choice of kernel functions enables one to learn highly nonlinear relationships between the covariates and target variables [40]. GPs can be made more flexible and interpretable by combining (adding, multiplying, or convolving) different kernels, where each kernel models a certain effect within individual covariates.

We propose a specialized kernel for our GP model, with the following form:

$$ {\mathbf{K}}_{mo} = \underbrace{{\mathbf{K}}_{c} + ({ \mathbf{K}}_{sp} * {{\mathbf{K}}}_{ur}) + { \mathbf{K}}_{t}}_{ \text{covariate effect}} \odot \underbrace{{ \mathbf{K}}_{\ell}}_{ \substack{\text{multi-target}\\ \text{effect}}}. $$

(3)

The first kernel in (3) models three types of effects in an additive form: an *EO covariate effect*, \({\mathbf{K}}_{c}\); a spatial auto-correlation effect with urban-rural delineation, \({\mathbf{K}}_{sp} * {{\mathbf{K}}}_{ur}\), which assigns more weight to *spatially proximal* and similar microregions (i.e., in EO data, an urban location might derive some similarity from nearby rural locations and also from other nearby urban locations); and a *temporal recency* effect, \({\mathbf{K}}_{t}\), which assigns more weight to recent observations. The second kernel, \({\mathbf{K}}_{\ell}\), provides the *multi-target formalism* by exploiting correlations across the different targets.

The rationale for using such a specialized kernel is that additive kernels are known to extrapolate well to unseen test data [41, 42]; we also empirically demonstrate better performance of our model compared to existing works.

Model training involves estimating the optimal values of the coefficient matrix **B** and the hyper-parameters associated with the kernel \({\mathbf{K}}_{mo}\) in (3), and is done by maximizing the marginalized log-likelihood of the training data. Elastic-net regularization is employed on the linear model to prevent learning from spurious features and to avoid overfitting on limited training data [43]. We perform out-of-sample spatial and temporal validation to test our model’s generalizability.

### 4.2 Model details

### Notation

For a given microregion, indexed by *c*, the covariate vector, target vector, spatial coordinates, and the urban-rural indicator are denoted by \({\mathbf{x}}_{t}^{(c)}\), \({\mathbf{y}}_{t}^{(c)}\), \({\mathbf{s}}^{(c)}\), and \(u^{(c)}\), respectively, and are collectively denoted as \({\mathbf{z}}^{(c)}_{t}\). Note that the covariate vectors and target vectors are also indexed by time *t*, denoting the corresponding years. Each individual target will be denoted by \(y^{(c)}_{to}\). For notational simplicity, we will drop the superscript *c* to denote a typical microregion, unless needed. In general, we will use a lower-case bold symbol to denote a vector, an upper-case bold symbol to denote a matrix, and a lower-case normal symbol to denote a scalar value. Collections (or sets) of entities will be denoted using calligraphic symbols, e.g., \(\mathcal{X}\), \(\mathcal{Y}\). The *o*th entry of a vector, e.g., **x**, will be denoted as \(x_{o}\).

#### 4.2.1 Model description

The proposed semi-parametric model is written as:

$$ {\mathbf{y}}_{t} = {\mathbf{B}} {\mathbf{x}}_{t} + f({ \mathbf{x}}_{t},{\mathbf{s}},u, t) + \epsilon, $$

(4)

where **B** is the coefficient matrix for the linear component and *ϵ* denotes the unexplained noise, modeled as a zero-mean Gaussian random variable, i.e., \(\epsilon \sim N(0,\sigma ^{2}_{n})\). The function \(f()\) captures the non-linear dependencies between the covariates and the residual vector, \(\boldsymbol{\delta}_{t} = ({\mathbf{y}}_{t} - {\mathbf{B}}{\mathbf{x}}_{t})\), and is modeled using a Gaussian Process.

### Background on Gaussian Processes (GP)

A GP is a Bayesian formulation for learning non-parametric, non-linear functions through the use of kernels. A GP allows placing a stochastic prior on the function \(f({\mathbf{z}}_{t})\), where \({\mathbf{z}}_{t} \equiv ({\mathbf{x}}_{t},{\mathbf{s}},u, t)\). The GP prior is completely specified by a mean function, \(m(\cdot )\), and a positive-definite kernel function, \(k(\cdot ,\cdot )\). The mean function represents the expected value of \(f()\), i.e., \(m({\mathbf{z}}_{t}) = \mathbb{E}[f({\mathbf{z}}_{t})]\), and is often set to 0, i.e., \(m({\mathbf{z}}_{t}) = 0\). The kernel function defines the covariance between any two realizations of \(f()\), i.e.,

$$ k\bigl({\mathbf{z}}_{t},{\mathbf{z}}^{\prime}_{t} \bigr) = \mathbb{E}\bigl[f({\mathbf{z}}_{t})f\bigl({ \mathbf{z}}^{\prime}_{t}\bigr)\bigr] $$

(5)

assuming a zero mean function.

The definition of a GP specifies that for any finite collection of inputs, \(\mathcal{Z} = ({\mathbf{z}}^{c_{1}}_{t_{1}},{\mathbf{z}}^{c_{2}}_{t_{2}}, \ldots ,{\mathbf{z}}^{c_{n}}_{t_{n}})\), the vector of function values, \({\mathbf{f}}(\mathcal{Z}) = (f({\mathbf{z}}^{c_{1}}_{t_{1}}),f({\mathbf{z}}^{c_{2}}_{t_{2}}), \ldots , f({\mathbf{z}}^{c_{n}}_{t_{n}}))\), follows a multivariate Gaussian distribution, i.e.,

$$ {\mathbf{f}}(\mathcal{Z}) \sim N({\mathbf{0}},{\mathbf{K}}_{\mathcal{Z},\mathcal{Z}}), $$

(6)

where \({\mathbf{K}}_{\mathcal{Z},\mathcal{Z}}\) is an \((n \times n)\) covariance matrix, such that the *ij*th entry is equal to \(k({\mathbf{z}}^{c_{i}}_{t_{i}},{\mathbf{z}}^{c_{j}}_{t_{j}})\).
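To make (6) concrete, the following NumPy sketch builds the covariance matrix for a small set of toy inputs with a generic squared-exponential kernel (a stand-in for the specialized kernel defined later in (9); the inputs and hyper-parameter values are illustrative only) and draws one function sample from the GP prior:

```python
import numpy as np

def sq_exp_kernel(Z1, Z2, sigma=1.0, length=1.0):
    """Squared-exponential kernel: sigma^2 * exp(-||z - z'||^2 / (2 * length^2))."""
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return sigma ** 2 * np.exp(-d2 / (2.0 * length ** 2))

rng = np.random.default_rng(0)
Z = rng.uniform(size=(50, 3))        # toy inputs: n = 50 instances, 3 covariates
K = sq_exp_kernel(Z, Z)              # (n x n) covariance matrix K_{Z,Z} of Eq. (6)
f = rng.multivariate_normal(np.zeros(len(Z)), K + 1e-8 * np.eye(len(Z)))  # f(Z) ~ N(0, K)
```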

For a single output, indexed by *o*, a GP regression model (GPR) can be defined by assuming that the targets are modeled as:

$$ \boldsymbol{\delta}_{o} \sim N\bigl({\mathbf{f}}(\mathcal{Z}),\sigma ^{2}_{n}{ \mathbf{I}}\bigr), $$

(7)

where **I** is the \((n \times n)\) identity matrix. Using (6) and (7), one can marginalize out \({\mathbf{f}}(\mathcal{Z})\), such that:

$$\begin{aligned} \begin{aligned} p(\boldsymbol{\delta}_{o}\vert \mathcal{Z}) & = \int p\bigl( \boldsymbol{\delta}_{o}\vert {\mathbf{f}}( \mathcal{Z})\bigr)p\bigl({\mathbf{f}}( \mathcal{Z})\bigr)\,d{\mathbf{f}} \\ & = N\bigl(0,{\mathbf{K}}_{\mathcal{Z},\mathcal{Z}} + \sigma ^{2}_{n}{ \mathbf{I}}\bigr). \end{aligned} \end{aligned}$$

(8)

#### 4.2.2 Choice of kernel function

Our kernel function is formulated as follows:

$$ k\bigl({\mathbf{z}}^{c_{i}}_{t_{i}},{\mathbf{z}}^{c_{j}}_{t_{j}} \bigr) = k_{f}\bigl({\mathbf{x}}^{c_{i}}_{t_{i}},{ \mathbf{x}}^{c_{j}}_{t_{j}}\bigr) + \bigl(k_{sp}\bigl({ \mathbf{s}}^{c_{i}},{\mathbf{s}}^{c_{j}}\bigr) \times k_{ur} \bigl(u^{c_{i}},u^{c_{j}}\bigr)\bigr) + k_{t}(t_{i},t_{j}), $$

(9)

where \(k_{f}\), \(k_{sp}\), \(k_{ur}\) and \(k_{t}\) denote the kernels that capture the similarity in covariates, spatial autocorrelation, urban-rural delineation, and temporal recency, respectively. We use the *squared exponential* kernel function for \(k_{f}\), \(k_{sp}\) and \(k_{t}\); it is the most widely used kernel function because of its ability to learn smooth non-linear functional relationships [40]. The individual kernel specifications are given as follows:

$$\begin{aligned}& k_{f}\bigl({\mathbf{x}}^{c_{i}}_{t_{i}},{ \mathbf{x}}^{c_{j}}_{t_{j}}\bigr) = \sigma _{f}^{2} \exp \biggl({- \frac{ \Vert {\mathbf{x}}^{c_{i}}_{t_{i}}-{\mathbf{x}}^{c_{j}}_{t_{j}} \Vert ^{2}}{2\ell _{f}^{2}}} \biggr), \end{aligned}$$

(10)

$$\begin{aligned}& k_{sp}\bigl({\mathbf{s}}^{c_{i}},{\mathbf{s}}^{c_{j}} \bigr) = \sigma _{sp}^{2} \exp \biggl({- \frac{ \Vert {\mathbf{s}}^{c_{i}}-{\mathbf{s}}^{c_{j}} \Vert ^{2}}{2\ell _{sp}^{2}}} \biggr), \end{aligned}$$

(11)

$$\begin{aligned}& k_{t}(t_{i},t_{j}) = \sigma _{t}^{2} \exp \biggl({- \frac{(t_{i}-t_{j})^{2}}{2\ell _{t}^{2}}} \biggr). \end{aligned}$$

(12)

The urban-rural delineation is modeled by \(k_{ur}\), which is specified as the following categorical kernel:

$$ k_{ur}\bigl(u^{c_{i}},u^{c_{j}}\bigr) = \textstyle\begin{cases} 1 & \text{if } u^{c_{i}} = u^{c_{j}}, \\ 0 & \text{otherwise}. \end{cases} $$

(13)

The scalars \(\sigma _{f}\), \(\ell _{f}\), \(\sigma _{sp}\), \(\ell _{sp}\), \(\sigma _{t}\), \(\ell _{t}\) are the hyper-parameters of the kernel functions and are estimated from the data, as described later.
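As a minimal sketch of (9)–(13) in code (the hyper-parameter values, coordinates, and variable names below are illustrative placeholders, not values used in this study):

```python
import numpy as np

def sq_exp(a, b, sigma, ell):
    """Squared-exponential kernel, as in Eqs. (10)-(12)."""
    d2 = np.sum((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2)
    return sigma ** 2 * np.exp(-d2 / (2.0 * ell ** 2))

def k_covariate(z_i, z_j, hp):
    """Composite kernel of Eq. (9): covariate + (spatial x urban-rural) + temporal."""
    x_i, s_i, u_i, t_i = z_i
    x_j, s_j, u_j, t_j = z_j
    k_f  = sq_exp(x_i, x_j, hp["sigma_f"],  hp["ell_f"])     # EO covariate similarity, Eq. (10)
    k_sp = sq_exp(s_i, s_j, hp["sigma_sp"], hp["ell_sp"])    # spatial autocorrelation, Eq. (11)
    k_ur = 1.0 if u_i == u_j else 0.0                        # urban-rural delineation, Eq. (13)
    k_t  = sq_exp([t_i], [t_j], hp["sigma_t"], hp["ell_t"])  # temporal recency, Eq. (12)
    return k_f + k_sp * k_ur + k_t

hp = dict(sigma_f=1.0, ell_f=2.0, sigma_sp=1.0, ell_sp=0.5, sigma_t=1.0, ell_t=3.0)
z_i = (np.array([0.2, 1.3]), np.array([-46.6, -23.5]), 1, 2010)   # (x, s, u, t)
z_j = (np.array([0.1, 1.1]), np.array([-46.7, -23.6]), 0, 2018)
print(k_covariate(z_i, z_j, hp))
```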

### Feature selection

To perform feature selection on EO data, we employ an Automatic Relevance Determination (ARD) kernel in our model. ARD kernels are effective at selecting a small explanatory subset of features from a large set containing irrelevant features, by regularizing the solution space using a data-dependent prior [44]. Note that the feature kernel in (10) uses a single global characteristic length scale (\(\ell _{f}\)). In ARD, however, each feature has its own characteristic length scale, denoted by \(\ell _{fr}\) for the *r*th feature. The feature kernel for ARD is given as:

$$ k_{f}\bigl({\mathbf{x}}^{c_{i}}_{t_{i}},{\mathbf{x}}^{c_{j}}_{t_{j}}\bigr) = \sigma _{f}^{2} \exp \biggl(-\frac{1}{2}\bigl({\mathbf{x}}^{c_{i}}_{t_{i}}-{\mathbf{x}}^{c_{j}}_{t_{j}}\bigr)^{\top }{\mathbf{P}}^{-1}\bigl({\mathbf{x}}^{c_{i}}_{t_{i}}-{\mathbf{x}}^{c_{j}}_{t_{j}}\bigr) \biggr),\qquad {\mathbf{P}} = \operatorname{diag}\bigl(\ell _{f1}^{2},\ell _{f2}^{2},\ldots \bigr). $$

(14)

The inverse of the length scale of each feature, i.e., \(\frac{1}{\ell _{fr}}\), is used as a proxy for feature relevance [40].
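A corresponding sketch of the ARD kernel (14), with per-feature relevance scores computed as the inverse length scales (the length-scale values shown are purely illustrative):

```python
import numpy as np

def ard_kernel(x_i, x_j, sigma_f, lengthscales):
    """ARD squared-exponential kernel, Eq. (14), with one length scale per feature."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    P_inv = np.diag(1.0 / np.asarray(lengthscales, dtype=float) ** 2)  # P = diag(l_f1^2, l_f2^2, ...)
    return sigma_f ** 2 * np.exp(-0.5 * diff @ P_inv @ diff)

lengthscales = np.array([0.3, 5.0, 1.2])   # learnt per-feature length scales (illustrative)
relevance = 1.0 / lengthscales             # larger value => more relevant feature
```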

#### 4.2.3 Handling multiple targets

The GP regression model described above can only handle a single target. Since the problem studied in this paper involves multiple targets, we present the following scheme, adopted from [45], to exploit the correlations among the targets in the regression model. In this formulation, each instance, consisting of a covariate vector and an *m*-length target vector, is converted into *m* instances with a scalar target value. We introduce an additional discrete covariate that corresponds to the index of the target (denoted *o* below). For example, a covariate and *m*-length target vector pair given as \(\langle{\mathbf{z}}^{(c)}_{t}; \boldsymbol{\delta}^{(c)}_{t}\rangle \) is transformed into *m* pairs as follows:

$$ \bigl\langle {\mathbf{z}}^{(c)}_{t}; \boldsymbol{\delta}^{(c)}_{t}\bigr\rangle \Rightarrow \textstyle\begin{cases} \langle ({\mathbf{z}}^{(c)}_{t}, 1); \delta ^{(c)}_{t1}\rangle, \\ \langle ({\mathbf{z}}^{(c)}_{t}, 2); \delta ^{(c)}_{t2}\rangle, \\ \vdots \\ \langle ({\mathbf{z}}^{(c)}_{t}, m); \delta ^{(c)}_{tm}\rangle. \end{cases} $$

(15)
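In code, the transformation in (15) amounts to replicating each instance *m* times and appending the target index; a minimal sketch (assuming the residual vectors have already been computed):

```python
import numpy as np

def augment(Z, Delta):
    """Eq. (15): turn n instances with m-length residual vectors into n*m single-target instances."""
    Z_bar, delta_bar = [], []
    for z, delta in zip(Z, Delta):              # Delta has shape (n, m)
        for o, d in enumerate(delta, start=1):
            Z_bar.append((z, o))                # augmented covariate (z, o)
            delta_bar.append(d)                 # scalar target
    return Z_bar, np.asarray(delta_bar)
```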

Note that the target is transformed into a scalar. We denote the augmented covariate vector as \(\bar{\mathbf{z}}^{(c)}_{to} \equiv ({\mathbf{z}}^{(c)}_{t}, o)\). The extra covariate is handled by multiplying the kernel function, \(k()\), in (9) with a target-specific kernel function, \(k_{\ell}()\), to obtain the final kernel function:

$$ \bar{k}\bigl(\bar{\mathbf{z}}^{c_{i}}_{t_{i}},\bar{ \mathbf{z}}^{c_{j}}_{t_{j}}\bigr) = k\bigl({ \mathbf{z}}^{c_{i}}_{t_{i}},{ \mathbf{z}}^{c_{j}}_{t_{j}}\bigr) \times k_{\ell}(o_{i},o_{j}). $$

(16)

Note that the resulting covariance matrix for the augmented single-target data set can be expressed as:

$$ {\mathbf{K}}_{\bar{\mathcal{Z}},\bar{\mathcal{Z}}} = {\mathbf{K}}_{\mathcal{Z}, \mathcal{Z}} \otimes { \mathbf{K}}_{\ell }, $$

(17)

where ⊗ denotes the Kronecker product between the \((n \times n)\) covariance matrix \({\mathbf{K}}_{\mathcal{Z},\mathcal{Z}}\) and the \((m \times m)\) matrix \({\mathbf{K}}_{\ell}\), such that \(k_{\ell}(o_{i},o_{j}) = {\mathbf{K}}_{\ell}[o_{i},o_{j}]\). For a GP, \({\mathbf{K}}_{\bar{\mathcal{Z}},\bar{\mathcal{Z}}}\) needs to be positive-definite, which means that \({\mathbf{K}}_{\ell}\) should also be positive-definite.
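Assuming the augmented instances are ordered so that the *m* copies of each original instance are adjacent (as produced by the augmentation sketch above), (17) can be formed directly; the matrices below are placeholders for illustration:

```python
import numpy as np

n, m = 4, 3
K_zz = np.eye(n) + 0.5                        # placeholder (n x n) PD covariance over inputs
K_l = np.array([[1.0, 0.6, 0.2],
                [0.6, 1.0, 0.4],
                [0.2, 0.4, 1.0]])             # placeholder (m x m) target-correlation matrix
K_bar = np.kron(K_zz, K_l)                    # (nm x nm) augmented covariance of Eq. (17)
assert np.all(np.linalg.eigvalsh(K_bar) > 0)  # Kronecker product of PD matrices stays PD
```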

The \(m^{2}\) entries in \({\mathbf{K}}_{\ell}\) can be thought of as the hyper-parameters of the kernel function in (16) and can be learnt from the training data. However, instead of treating each entry as a hyper-parameter, we consider a parameterization of \({\mathbf{K}}_{\ell}\) using fewer hyper-parameters. In particular, we consider a *spherical parameterization* [46] of \({\mathbf{K}}_{\ell}\), given as follows:

$$\begin{aligned} {\mathbf{K}}_{\ell} = {\mathbf{S}}^{\top }{\mathbf{S}}, \end{aligned}$$

(18)

where **S** is an upper triangular matrix of size \((m \times m)\), whose *o*th column contains the coordinates of a point on the unit hypersphere in \(\mathbb{R}^{o}\) (expressed via spherical coordinates), followed by \((m-o)\) zeros. For example, for \(m = 4\):

$$ {\mathbf{S}} = \begin{bmatrix} 1 & \cos{\phi ^{(1)}} & \cos{\phi ^{(2)}} & \cos{\phi ^{(4)}} \\ 0 & \sin{\phi ^{(1)}} & \sin{\phi ^{(2)}}\cos{\phi ^{(3)}} & \sin{\phi ^{(4)}}\cos{\phi ^{(5)}} \\ 0 & 0 & \sin{\phi ^{(2)}}\sin{\phi ^{(3)}} & \sin{\phi ^{(4)}}\sin{\phi ^{(5)}}\cos{\phi ^{(6)}} \\ 0 & 0 & 0 & \sin{\phi ^{(4)}}\sin{\phi ^{(5)}}\sin{\phi ^{(6)}} \end{bmatrix}. $$

(19)

Here, \(\phi ^{(1)}, \phi ^{(2)}, \ldots \) are the hyper-parameters that parameterize the matrix **S**. For *m* targets, one would require \(\frac{m(m - 1)}{2}\) hyper-parameters to specify **S**. The spherical parameterization has three advantages. First, it allows us to parameterize an \((m \times m)\) matrix using only \(\frac{m(m - 1)}{2}\) hyper-parameters. Second, it ensures that the resulting matrix \({\mathbf{K}}_{\ell}\) is positive-definite. And finally, the off-diagonal entries of \({\mathbf{K}}_{\ell}\) encode the correlation among the targets and can be interpreted as such after training the model.
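A sketch of how **S** and \({\mathbf{K}}_{\ell}\) can be constructed from the angles for arbitrary *m* (the angle values are illustrative; in the model they are learnt along with the other hyper-parameters):

```python
import numpy as np

def spherical_S(phi, m):
    """Build the upper-triangular S of Eq. (19) from m(m-1)/2 angles; columns are unit vectors."""
    assert len(phi) == m * (m - 1) // 2
    S = np.zeros((m, m))
    idx = 0
    for o in range(m):                 # column o (0-indexed) uses o angles
        prod = 1.0
        for r in range(o):
            S[r, o] = prod * np.cos(phi[idx + r])
            prod *= np.sin(phi[idx + r])
        S[o, o] = prod
        idx += o
    return S

phi = np.random.default_rng(0).uniform(0.0, np.pi, size=6)  # illustrative angles for m = 4
S = spherical_S(phi, 4)
K_l = S.T @ S          # target kernel of Eq. (18); unit diagonal by construction
```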

#### 4.2.4 Model training

The parameters of the proposed model consist of the coefficient matrix for the linear model, **B**, the noise standard deviation for the observational likelihood in (7), \(\sigma _{n}\), the kernel hyperparameters, \(\ell _{f}\), \(\sigma _{f}\), \(\ell _{sp}\), \(\sigma _{sp}\), \(\ell _{t}\), \(\sigma _{t}\) (see (10), (11), (12)), and the spherical coordinates in the upper-triangular entries of **S**.

We assume that the training data consists of *n* instances, \(\mathcal{Z} = ({\mathbf{z}}^{(c_{1})}_{t_{1}},{\mathbf{z}}^{(c_{2})}_{t_{2}}, \ldots ,{\mathbf{z}}^{(c_{n})}_{t_{n}})\), where each \({\mathbf{z}}^{(c_{i})}_{t_{i}} \equiv ({\mathbf{x}}^{(c_{i})}_{t_{i}},{\mathbf{s}}^{(c_{i})},u^{(c_{i})},t_{i})\), and the corresponding targets \(\mathcal{Y} = ({\mathbf{y}}^{(c_{1})}_{t_{1}},{\mathbf{y}}^{(c_{2})}_{t_{2}}, \ldots ,{\mathbf{y}}^{(c_{n})}_{t_{n}})\). The linear coefficient matrix **B** is first estimated using a regularized least squares estimation procedure, with the *loss function* defined as:

$$ J({\mathbf{B}}) = \frac{1}{2n} \Vert {\mathbf{Y}} - {\mathbf{XB}} \Vert ^{2}_{F} + \alpha \lambda \Vert {\mathbf{B}} \Vert ^{2}_{F} + \alpha (1-\lambda ) \vert { \mathbf{B}} \vert , $$

(20)

where \(\Vert \cdot \Vert ^{2}_{F}\) and \(\vert \cdot \vert \) denote the square of the *Frobenius* norm and the \(l_{1}\) norm of a matrix, respectively. **X** is the covariate matrix consisting of the covariate vectors, i.e., \({\mathbf{X}} = ({\mathbf{x}}^{(c_{1})}_{t_{1}}, {\mathbf{x}}^{(c_{2})}_{t_{2}}, \ldots , {\mathbf{x}}^{(c_{n})}_{t_{n}})^{\top}\), and **Y** is the target matrix consisting of the target vectors, i.e., \({\mathbf{Y}} = ({\mathbf{y}}^{(c_{1})}_{t_{1}}, {\mathbf{y}}^{(c_{2})}_{t_{2}}, \ldots , {\mathbf{y}}^{(c_{n})}_{t_{n}})^{\top}\). While the first term in (20) is the standard least squares loss, the second and third terms act as an *elastic-net* regularizer on the coefficients, which is employed to reduce the impact of spurious features and to avoid overfitting [47], i.e., the situation where a model performs well on in-sample data but poorly on out-of-sample points. The scalars *α* and *λ* are the regularization parameters and are tuned using cross-validation on the training data. In this study, the tuned values for *α* and *λ* are 0.1 and 0.5, respectively. The optimization of the loss function in (20) is done using a coordinate descent algorithm.
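As an illustrative sketch, a per-target elastic-net fit can be obtained with scikit-learn (note that scikit-learn weights the \(l_{1}\) and squared-Frobenius penalties slightly differently from (20), so its `alpha`/`l1_ratio` correspond to our *α* and *λ* only approximately; the data below are synthetic placeholders):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))     # placeholder covariate matrix (n x d)
Y = rng.normal(size=(200, 3))      # placeholder target matrix (n x m)

# One elastic-net fit per target column; coefficients stacked into B with shape (d x m).
B = np.column_stack([
    ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False, max_iter=10000)
    .fit(X, Y[:, o]).coef_
    for o in range(Y.shape[1])
])
residuals = Y - X @ B              # residual matrix passed on to the GP stage
```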

After estimating the optimal coefficients in **B**, the hyperparameters associated with the GP are estimated by maximizing the marginal *log-likelihood* of the residuals, using the marginal likelihood in (8). For each training instance, the residual vector is defined as \(\boldsymbol{\delta}^{(c_{i})}_{t_{i}} = {\mathbf{y}}^{(c_{i})}_{t_{i}} - { \mathbf{B}}^{\top}{\mathbf{x}}^{(c_{i})}_{t_{i}}\). Let \(\bar{\mathcal{Z}}\) denote the training data set in which every training instance is augmented according to (15), and let \(\bar{\boldsymbol{\delta}}\) be the vector containing all the scalar targets. Given that the marginalized conditional distribution of \(\bar{\boldsymbol{\delta}}\) given \(\bar{\mathcal{Z}}\) is a multivariate Gaussian with zero mean and covariance \(({\mathbf{K}}_{\bar{\mathcal{Z}},\bar{\mathcal{Z}}} + \sigma _{n}^{2}I)\) (see (8)), the marginalized log-likelihood can be expressed as:

$$\begin{aligned} \begin{aligned} \log{p(\bar{\boldsymbol{\delta}}\vert \bar{ \mathcal{Z}})} = {}& - \frac{1}{2}\bar{\boldsymbol{\delta}}^{\top} \bigl({\mathbf{K}}_{\bar{\mathcal{Z}}, \bar{\mathcal{Z}}}+ \sigma _{n}^{2}I \bigr)^{-1}\bar{\boldsymbol{\delta}} \\ &{} -\frac{1}{2}\log \bigl\vert \bigl({\mathbf{K}}_{\bar{\mathcal{Z}}, \bar{\mathcal{Z}}} + \sigma _{n}^{2}I\bigr) \bigr\vert - \frac{nm}{2} \log{2\pi}. \end{aligned} \end{aligned}$$

(21)

The marginalized log-likelihood is maximized with respect to the kernel hyperparameters and \(\sigma _{n}\), using stochastic gradient descent [48].
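For reference, (21) can be evaluated numerically and stably with a Cholesky factorization; the sketch below assumes the augmented covariance matrix has already been computed, and in practice this quantity is maximized with respect to the hyper-parameters by a gradient-based optimizer:

```python
import numpy as np

def log_marginal_likelihood(K_bar, delta_bar, sigma_n):
    """Marginal log-likelihood of the residuals, Eq. (21), via a Cholesky factorization."""
    N = len(delta_bar)                                   # N = n * m augmented instances
    L = np.linalg.cholesky(K_bar + sigma_n ** 2 * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, delta_bar))  # (K + sigma_n^2 I)^{-1} delta
    return (-0.5 * delta_bar @ alpha
            - np.sum(np.log(np.diag(L)))                 # = 0.5 * log|K + sigma_n^2 I|
            - 0.5 * N * np.log(2.0 * np.pi))
```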

#### 4.2.5 Model inference

To infer any target for a microregion at a new time instance, we use the GP formulation to estimate the posterior distribution for the target, conditioned on the training data set, \((\mathcal{Z}, \mathcal{Y})\). Let the covariates for the test instance be denoted as \({\mathbf{z}}_{*} = ({\mathbf{x}}_{*},{\mathbf{s}}_{*},u_{*},t_{*})\). For the *o*th target, the posterior distribution of \(y_{*o}\) is a Gaussian distribution, whose mean, \(\bar{y}_{*o}\), and variance, \(\operatorname{var}[y_{*o}]\) are given by the following expressions [40]:

$$\begin{aligned}& \bar{y}_{*o} = {\mathbf{b}}_{o}^{\top }{ \mathbf{x}}_{*} + {\mathbf{k}}_{*}^{\top}\bigl({ \mathbf{K}}_{\bar{\mathcal{Z}},\bar{\mathcal{Z}}} + \sigma _{n}^{2}I \bigr)^{-1} \bar{\boldsymbol{\delta}}, \end{aligned}$$

(22)

$$\begin{aligned}& \operatorname{var}[y_{*o}] = k_{**} - {\mathbf{k}}_{*}^{\top} \bigl({\mathbf{K}}_{ \bar{\mathcal{Z}},\bar{\mathcal{Z}}} + \sigma _{n}^{2}I \bigr)^{-1}{\mathbf{k}}_{*} + \sigma ^{2}_{n}, \end{aligned}$$

(23)

where \({\mathbf{b}}_{o}\) corresponds to the *o*th column of the coefficient matrix, **B**. The vector \({\mathbf{k}}_{*}\) contains the kernel function evaluations between every augmented training instance and the augmented test instance, \(({\mathbf{z}}_{*}, o)\), and the scalar \(k_{**}\) is the kernel function evaluation of the augmented test instance with itself.
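A sketch of the posterior computation in (22)–(23); `kbar` stands for the augmented kernel \(\bar{k}\) in (16), and all variable names are illustrative:

```python
import numpy as np

def predict_target(o, x_star, z_star_bar, Z_bar, delta_bar, B, K_bar, sigma_n, kbar):
    """Posterior mean and variance of the o-th target at a test instance, Eqs. (22)-(23)."""
    A = K_bar + sigma_n ** 2 * np.eye(len(delta_bar))
    k_star = np.array([kbar(z_bar, z_star_bar) for z_bar in Z_bar])   # kernel vs. training set
    k_ss = kbar(z_star_bar, z_star_bar)                               # kernel with itself
    mean = B[:, o] @ x_star + k_star @ np.linalg.solve(A, delta_bar)  # Eq. (22)
    var = k_ss - k_star @ np.linalg.solve(A, k_star) + sigma_n ** 2   # Eq. (23)
    return mean, var
```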