The final module focuses on the computation of the probability distribution for the number of individuals in the target population conditioned on the number of individuals detected by the network and some auxiliary information. Our first observation is that this auxiliary information is absolutely necessary to provide a meaningful inference on the target population, for identifiability reasons similar to those mentioned in Sect. 6.1 to introduce the deduplication module. This auxiliary information will basically be telco market information in the form of penetration rates (ratio of the number of devices to the number of individuals in the target population) and register-based population data. This information provides the necessary link between the number of individuals at the network level and at the target population level. This combination of data sources is indeed desirable not only to produce better and more accurate estimates but also to provide more coherent information among diverse data sources. Notice, however, that this data integration must avoid imposing the findings from one data source on the other, which would preclude new insights about the target population.
In more concrete terms, register-based population figures offer information about society from a concrete demographic perspective (residential population) with a given degree of spatial and time breakdown. Mobile network data, however, provides the opportunity to reach unprecedented spatial and time scales as well as a complementary view on the population (present population). The integration of sources, in our view, must handle these differences with care, bringing similarities and contrasts into the statistical analysis at the same time. In this line of thought, we propose to use hierarchical models (i) to produce probability distributions, (ii) to integrate data sources, and (iii) to account for the uncertainty and the differences of concepts and scales.
We propose a two-staged modelling exercise. Firstly, we assume that there exists an initial time instant \(t_{0}\) at which the register-based target population and the actual population can be assimilated in terms of their physical location. We can assume, e.g., that at 6:00am all devices stay physically at the residential homes declared in the population register. This assumption triggers the first stage, in which we compute a probability distribution for the number of individuals \(\mathbf{N}_{t_{0}}\) of the target population in all regions in terms of the number of individuals \(\mathbf{N}_{t_{0}}^{\mathrm{net}}\) detected by the network and the auxiliary information. Secondly, we assume that individuals displace over the geographical territory independently of the MNO, i.e. subscribers of MNO 1 will show a displacement pattern similar to that of subscribers of MNO 2. This assumption triggers the second stage, in which we provide a probability distribution for the number of individuals \(\mathbf{N}_{t}\) for later times \(t> t_{0}\).
Regarding the origin-destination matrix, we can use the same assumptions to infer the number of individuals moving from one region to another at time instant t, also providing credible intervals as an accuracy indicator.
7.1 Present population at the initial time \(t_{0}\)
For ease of notation we shall drop the time index in this section. The auxiliary information is provided by the penetration rates \(P_{r}^{\mathrm{net}}\) of the MNO and the register-based population \(N_{r}^{\mathrm{reg}}\) at each region r. We shall combine \(N_{r}^{\mathrm{net}}\), \(P_{r}^{\mathrm{net}}\), and \(N_{r}^{\mathrm{reg}}\) to produce the probability distribution for \(\mathbf{N}=(N_{1},\dots ,N_{R})^{T}\). We follow the approach used in the species abundance problem in ecology [63]. This approach clearly distinguishes between the state process and the observation process: the state process is the underlying dynamical process of the population, and the observation process is the procedure by which we get information about the location and timestamp of each individual in the target population. The different pieces of auxiliary information will be integrated at different levels in the hierarchy of the statistical model.
The first level makes use of the detection probability \(p_{r}\) of individuals detected by the telecommunication network in each region r. We model
$$ N_{r}^{\mathrm{net}}\simeq \mathrm{Binomial} (N_{r}, p_{r} ). $$
(21)
The only assumption in model (21) is that the probability of detection \(p_{r}\) is the same for all individuals in region r. This probability of detection basically amounts to the probability of an individual being a subscriber of the given mobile telecommunication network. This assumption will be further discussed below. As a first approximation, we may think of \(p_{r}\) as a probability related to the penetration rate \(P_{r}^{\mathrm{net}}\) of the MNO in region r.
As an overview of the hierarchy of models, we shall firstly consider only the observation process, i.e. no population dynamics (state process) is modelled. Within the hierarchy, we shall introduce progressively deeper degrees of uncertainty on the detection probabilities \(p_{r}\). Finally, we shall also introduce the state process by modelling \(N_{r}\).
At the first level, we shall consider the detection probability \(p_{r}\) as an external parameter taken, e.g., from the national telecommunication regulator (although this is not really the case in practice). The posterior probability distribution for \(N_{r}\) in terms of \(N^{\mathrm{net}}_{r}\) and \(p_{r}\) is then given by
$$ \mathbb{P} \bigl(N_{r}|N_{r}^{\mathrm{net}} \bigr)= \textstyle\begin{cases} 0 & \mbox{if } N_{r} < N_{r}^{\mathrm{net}}, \\ \mathrm{negbin} (N_{r} - N_{r}^{\mathrm{net}};1 - p_{r}, N_{r}^{ \mathrm{net}}+1 ) & \mbox{if } N_{r} \geq N_{r}^{\mathrm{net}}, \end{cases} $$
where \(\mathrm{negbin} (k; p, r )\equiv \binom{k+r-1}{k}p^{k}(1-p)^{r}\) denotes the probability mass function of a negative binomial random variable of values \(k\geq 0\) with parameters p and r. Once we have a distribution, we can provide a point estimator, a posterior variance, a posterior coefficient of variation, a credible interval, and any other indicator computable from the distribution. For example, if we use the MAP criterion (the posterior mode) or the posterior mean we can provide as point estimators
$$\begin{aligned}& \widehat{N}_{r}^{\mathrm{MAP}} = N_{r}^{\mathrm{net}} + \biggl\lfloor \frac{(1-p_{r})\cdot N_{r}^{\mathrm{net}}}{p_{r}} \biggr\rfloor , \end{aligned}$$
(22a)
$$\begin{aligned}& \widehat{N}_{r}^{\mathrm{mean}} = N_{r}^{\mathrm{net}} + \frac{(1-p_{r})\cdot (N_{r}^{\mathrm{net}} + 1)}{p_{r}}. \end{aligned}$$
(22b)
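As an illustration, the following minimal sketch evaluates this first-level posterior numerically. The values of \(N_{r}^{\mathrm{net}}\) and \(p_{r}\) are hypothetical placeholders, not taken from the text; in scipy's parametrisation the paper's \(\mathrm{negbin}(k;1-p_{r},N_{r}^{\mathrm{net}}+1)\) corresponds to nbinom(N_net + 1, p_r).

```python
# Minimal sketch of the first-level posterior (fixed detection probability).
# N_net and p_r below are hypothetical placeholders, not values from the text.
from scipy.stats import nbinom

N_net = 120   # individuals detected by the network in region r (hypothetical)
p_r = 0.35    # externally given detection probability (hypothetical)

# Posterior of the undercount K = N_r - N_r^net: in the paper's notation
# negbin(k; 1 - p_r, N_net + 1), i.e. scipy's nbinom(N_net + 1, p_r).
post = nbinom(N_net + 1, p_r)

N_map = N_net + int((1 - p_r) * N_net / p_r)        # eq. (22a)
N_mean = N_net + (1 - p_r) * (N_net + 1) / p_r      # eq. (22b)
ci_low, ci_high = N_net + post.ppf([0.025, 0.975])  # 95% credible interval
print(N_map, N_mean, (ci_low, ci_high))
```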
Let us now introduce the second level focused on the uncertainty in the detection probability \(p_{r}\). A priori, we can think of a detection probability \(p_{kr}\) per individual k in the target population and try to devise some model to estimate \(p_{kr}\) in terms of auxiliary information (e.g. sociodemographic variables, income, etc.). We would need subscription information related to these variables for the whole target population, which is unattainable. Instead, we may consider that the detection probability \(p_{kr}\) shows a common part for all individuals in region r plus some additional unknown terms, i.e. something like \(p_{kr}=p_{r}+\mathrm{noise}\). At a first stage, we propose to implement this idea by modelling \(p_{r}\simeq \mathrm{Beta} (\alpha _{r},\beta _{r} )\) and choosing the hyperparameters \(\alpha _{r}\) and \(\beta _{r}\) according to the penetration rates \(P_{r}^{\mathrm{net}}\) and the register-based population figures \(N_{r}^{\mathrm{reg}}\).
Notice that the penetration rate is also subject to the problem of device duplicities (individuals having two or more devices). To deduplicate, we make use of the duplicity probabilities \(p_{d}^{(i)}\) computed in Sect. 4 under the same assumptions (at most two devices per individual) and of the posterior location probabilities \(\bar{\gamma }_{dr}\) in region r for each device d. Notice that we have also dropped the time subscript for ease of notation, since we are currently focusing on the initial time \(t_{0}\). We define
$$\begin{aligned}& \Omega _{r}^{(1)}=\frac{\sum_{d=1}^{D}\bar{\gamma }_{dr}\cdot p_{d}^{(1)}}{\sum_{d=1}^{D}\bar{\gamma }_{dr}}, \end{aligned}$$
(23a)
$$\begin{aligned}& \Omega _{r}^{(2)}=\frac{\sum_{d=1}^{D}\bar{\gamma }_{dr}\cdot p_{d}^{(2)}}{\sum_{d=1}^{D}\bar{\gamma }_{dr}}. \end{aligned}$$
(23b)
The deduplicated penetration rates are defined as
$$ \tilde{P}_{r}^{\mathrm{net}}= \biggl(\Omega _{r}^{(1)} + \frac{\Omega _{r}^{(2)}}{2} \biggr)\cdot P_{r}^{\mathrm{net}}. $$
(23c)
To get a feeling for this definition, let us consider a very simple situation. Let us consider \(N_{r}^{(1)} = 10\) individuals in region r with 1 device each, \(N_{r}^{(2)}=3\) individuals in region r with 2 devices each, and \(N_{r}^{(0)} = 2\) individuals in region r with no device. Let us assume that we can measure the penetration rate with certainty, so that \(P_{r}^{\mathrm{net}}=\frac{16}{15}\). The devices are assumed to be neatly detected by the HMM (i.e. \(\bar{\gamma }_{dr}=1-O(\epsilon )\)) and duplicities are also inferred correctly (\(p_{d}^{(2)}=O(\epsilon )\) for the devices of 1-device individuals and \(p_{d}^{(2)}=1-O(\epsilon )\) for the devices of 2-device individuals). Then \(\Omega _{r}^{(1)}=\frac{10}{16} + O(\epsilon )\) and \(\Omega _{r}^{(2)} = \frac{6}{16} + O(\epsilon )\). The deduplicated penetration rate will then be \(\tilde{P}_{r}^{\mathrm{net}}=\frac{13}{15} + O(\epsilon )\), which can be straightforwardly understood as a detection probability for an individual in this network in region r.
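The arithmetic of this toy case can be checked with a few lines of code; the sketch below reproduces the \(\frac{13}{15}\) figure, dropping the \(O(\epsilon )\) terms (all quantities are those of the example above).

```python
# Numerical check of the toy deduplication example (O(epsilon) terms dropped).
n1, n2, n0 = 10, 3, 2                    # individuals with 1, 2 and 0 devices
D = n1 + 2 * n2                          # 16 devices
N = n1 + n2 + n0                         # 15 individuals
P_net = D / N                            # penetration rate 16/15

Omega1 = n1 / D                          # eq. (23a): share of devices held by 1-device individuals, 10/16
Omega2 = 2 * n2 / D                      # eq. (23b): share of devices held by 2-device individuals, 6/16

P_dedup = (Omega1 + Omega2 / 2) * P_net  # eq. (23c)
print(P_dedup, 13 / 15)                  # both equal 0.8666...
```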
Let us now denote by \(N_{r}^{\mathrm{reg}}\) the population of region r according to an external population register. Then, we fix
$$\begin{aligned}& \alpha _{r}+\beta _{r} = N_{r}^{\mathrm{reg}}, \end{aligned}$$
(24a)
$$\begin{aligned}& \frac{\alpha _{r}}{\alpha _{r} + \beta _{r}} = \tilde{P}_{r}^{ \mathrm{net}}, \end{aligned}$$
(24b)
which immediately implies that
$$\begin{aligned}& \alpha _{r} = \tilde{P}_{r}^{\mathrm{net}}\cdot N_{r}^{ \mathrm{reg}}, \end{aligned}$$
(25a)
$$\begin{aligned}& \beta _{r} = \bigl(1 - \tilde{P}_{r}^{\mathrm{net}} \bigr) \cdot N_{r}^{ \mathrm{reg}}. \end{aligned}$$
(25b)
There are several assumptions in this choice. Firstly, on average, we assume that detection takes place with probability \(\tilde{P}_{r}^{\mathrm{net}}\). We find this assumption reasonable. An alternative choice would be to use the mode of the beta distribution instead of the mean. Secondly, detection is undertaken over the register-based population. We assume some coherence between the official population count and the network population count. A cautious reader may object that we do not need a network-based estimate if we already have official data at the same time instant. We can make several comments in this regard:
-
As stated above, a degree of coherence among official estimates, achieved by combining data sources to produce more accurate estimates, is desirable. By using register-based population counts in the hierarchy of models, we are indeed combining both data sources. Notice, however, that in this combination the register-based population is taken as an external input in our model. There exist alternative procedures in which all data sources are combined on an equal footing [64, 65]. We deliberately use the register-based population as an external source and do not intend to re-estimate it by combination with mobile network data.
-
Register-based populations and network-based populations show clearly different time scales. The coherence we demand will be forced only at the given initial time \(t_{0}\) after which the dynamics of the network will provide the time scale of the network-based population counts without further reference to the register-based population.
Thirdly, we assume that the penetration rates \(P_{r}^{\mathrm{net}}\) and the official population counts \(N_{r}^{\mathrm{reg}}\) come without error. Should this not be attainable or realistic, we would need to introduce a new hierarchy level to account for this uncertainty (see below). Lastly, the deduplicated penetration rates are computed through a deterministic procedure (using a mean point estimate); in reality, they are also subject to uncertainty, so yet another hierarchy level should be introduced to account for it.
Then, we can readily compute the posterior distribution for \(N_{r}\):
$$\begin{aligned} \mathbb{P} \bigl(N_{r}|N_{r}^{\mathrm{net}} \bigr) =& \textstyle\begin{cases} 0 & \mbox{if } N_{r} < N_{r}^{\mathrm{net}}, \\ \mathrm{betaNegBin} (N_{r}-N_{r}^{\mathrm{net}};N_{r}^{ \mathrm{net}} + 1, \alpha _{r} - 1, \beta _{r} )& \mbox{if } N_{r} \geq N_{r}^{\mathrm{net}}. \end{cases}\displaystyle \end{aligned}$$
(26)
It is a displaced beta negative binomial distribution
$$\mathrm{betaNegBin}(k; s, \alpha , \beta )\equiv \frac{\Gamma (k+s)}{k!\Gamma (s)} \frac{\mathrm{B}(\alpha + s,\beta + k)}{\mathrm{B}(\alpha ,\beta )} $$
with support in \(N_{r} \geq N_{r}^{\mathrm{net}}\) and parameters \(s = N_{r}^{\mathrm{net}} + 1\), \(\alpha = \alpha _{r} - 1\) and \(\beta =\beta _{r}\). Again, we can provide point estimates as well as posterior variances, credible intervals, etc. Under the MAP and the mean criterion we have
$$\begin{aligned}& \widehat{N}_{r}^{\mathrm{MAP}}= N_{r}^{\mathrm{net}} + \biggl\lfloor \frac{(1-\tilde{P}_{r}^{\mathrm{net}})\cdot N_{r}^{\mathrm{net}}}{\tilde{P}_{r}^{\mathrm{net}}} - \frac{N_{r}^{\mathrm{net}}}{N_{r}^{\mathrm{reg}}\cdot \tilde{P}_{r}^{\mathrm{net}}} \biggr\rfloor , \\& \widehat{N}_{r}^{\mathrm{mean}}=N_{r}^{\mathrm{net}} + \frac{(N_{r}^{\mathrm{net}} + 1)\cdot (1-\tilde{P}_{r}^{\mathrm{net}})\cdot N_{r}^{\mathrm{reg}}}{\tilde{P}_{r}^{\mathrm{net}}\cdot N_{r}^{\mathrm{reg}} - 1}. \end{aligned}$$
The uncertainty is accounted for by computing the posterior variance, the posterior coefficient of variation, or credible intervals in the usual way. Notice that when \(\alpha _{r},\beta _{r}\gg 1\) (i.e., when \(\min (\tilde{P}_{r}^{\mathrm{net}}, 1- \tilde{P}_{r}^{\mathrm{net}}) \cdot N_{r}^{\mathrm{reg}}\gg 1\)) the beta negative binomial distribution (26) reduces to the negative binomial distribution
$$\begin{aligned} \mathbb{P} \bigl(N_{r}|N_{r}^{\mathrm{net}} \bigr) =& \textstyle\begin{cases} 0 & \mbox{if } N_{r} < N_{r}^{\mathrm{net}}, \\ \mathrm{negbin} (N_{r}-N_{r}^{\mathrm{net}}; \frac{\beta _{r}}{\alpha _{r} +\beta _{r} - 1}, N_{r}^{\mathrm{net}} + 1 )& \mbox{if } N_{r} \geq N_{r}^{\mathrm{net}}. \end{cases}\displaystyle \end{aligned}$$
(28)
Note also that \(\frac{\beta _{r}}{\alpha _{r} + \beta _{r} -1}\approx 1 - \tilde{P}_{r}^{ \mathrm{net}}\) so that in this case we do not need the register-based population (this is similar to dropping out the finite population correction factor in sampling theory for large populations). In this case, under the MAP and the mean criterion for this distribution we have
$$\begin{aligned}& \widehat{N}^{\mathrm{MAP}}= N_{r}^{\mathrm{net}} + \biggl\lfloor \frac{(1-\tilde{P}_{r}^{\mathrm{net}})}{\tilde{P}_{r}^{\mathrm{net}}} \cdot N_{r}^{\mathrm{net}} \biggr\rfloor , \\& \widehat{N}^{\mathrm{mean}}=N_{r}^{\mathrm{net}} + \frac{(1-\tilde{P}_{r}^{\mathrm{net}})}{\tilde{P}_{r}^{\mathrm{net}}} \cdot \bigl(N_{r}^{\mathrm{net}} + 1 \bigr). \end{aligned}$$
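For concreteness, a minimal sketch of this second-level computation follows: it evaluates the displaced beta negative binomial posterior (26) on a truncated support and derives the MAP, the mean, and a 95% credible interval. The inputs \(N_{r}^{\mathrm{net}}\), \(\tilde{P}_{r}^{\mathrm{net}}\), and \(N_{r}^{\mathrm{reg}}\) are hypothetical placeholders.

```python
# Minimal sketch of the second-level posterior (26); all inputs are hypothetical.
import numpy as np
from scipy.special import gammaln, betaln

N_net, P_dedup, N_reg = 120, 0.35, 400   # hypothetical region-r inputs
alpha_r = P_dedup * N_reg                # eq. (25a)
beta_r = (1 - P_dedup) * N_reg           # eq. (25b)

def log_beta_negbin(k, s, a, b):
    # log pmf of betaNegBin(k; s, a, b) as defined in the text
    return (gammaln(k + s) - gammaln(k + 1) - gammaln(s)
            + betaln(a + s, b + k) - betaln(a, b))

k = np.arange(0, 20 * N_reg)             # truncated support for N_r - N_r^net
logp = log_beta_negbin(k, N_net + 1, alpha_r - 1, beta_r)
p = np.exp(logp - logp.max())
p /= p.sum()                             # normalise on the truncated support

N_map = N_net + k[np.argmax(p)]
N_mean = N_net + (k * p).sum()
cdf = np.cumsum(p)
ci = N_net + k[np.searchsorted(cdf, [0.025, 0.975])]
print(N_map, N_mean, ci)
```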
Let us now introduce a further level of uncertainty by modelling also the hyperparameters \((\alpha _{r}, \beta _{r})\) so that the relationship between these parameters and the external data sources (penetration rates and register-based population counts) is also uncertain. We can go all the way down the hierarchy, assume a cross-cutting relationship between parameters and some hyperparameters and postulate
$$\begin{aligned}& N_{r}^{\mathrm{net}} \simeq \mathrm{Bin} (N_{r}, p_{r} ),\quad \mbox{for all } r=1,\dots ,R, \end{aligned}$$
(29a)
$$\begin{aligned}& p_{r} \simeq \mathrm{Beta} (\alpha _{r}, \beta _{r} ),\quad \mbox{for all } r=1,\dots ,R, \end{aligned}$$
(29b)
$$\begin{aligned}& \biggl(\mathrm{logit} \biggl( \frac{\alpha _{r}}{\alpha _{r}+\beta _{r}} \biggr), \alpha _{r} + \beta _{r} \biggr) \\& \quad \simeq \mathrm{N} \bigl( \mu _{\gamma r} \bigl(\gamma _{0},\gamma _{1};\bar{P}_{r}^{\mathrm{net}} \bigr), \tau ^{2}_{\gamma } \bigr)\times \mathrm{Gamma} \biggl(1+ \xi , \frac{N_{r}^{\mathrm{reg}}}{\xi } \biggr),\quad \mbox{for all } r=1, \dots ,R, \end{aligned}$$
(29c)
$$\begin{aligned}& \bigl(\log \gamma _{0},\gamma _{1},\tau ^{2}_{\gamma }, \xi \bigr) \simeq \mathrm{f}_{\gamma } \bigl(\log \gamma _{0},\gamma _{1},\tau ^{2}_{\gamma } \bigr)\times \mathrm{f}_{ \xi }(\xi ), \end{aligned}$$
(29d)
where we have denoted \(\mu _{\gamma r}(\gamma _{0},\gamma _{1};\bar{P}_{r}^{\mathrm{net}}) \equiv \log (\gamma _{0} [ \frac{\bar{P}_{r}^{\mathrm{net}}}{1-\bar{P}_{r}^{\mathrm{net}}} ]^{\gamma _{1}} )\) and \(f_{\gamma }\) and \(f_{\xi }\) stand for prior distributions.
The interpretation of this hierarchy is simple. It is just a beta-binomial model in which the beta parameters \(\alpha _{r}\), \(\beta _{r}\) are correlated with the deduplicated penetration rates. This correlation is expressed through a linear regression model upon their logits with common regression parameters across the regions, both the coefficients and the uncertainty degree. On average, the detection probabilities \(p_{r}\) will be the deduplicated penetration rates with uncertainty accounted for by hyperparameters \(\gamma _{0}\), \(\gamma _{1}\), \(\tau _{\gamma }^{2}\). For large population cells, the hyperparameter ξ drops out so that finally the register-based population counts \(N_{r}^{\mathrm{reg}}\) play no role in the model.
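To make the generative structure concrete, the sketch below forward-simulates one region under levels (29a)–(29c) with the hyperparameters \((\gamma _{0},\gamma _{1},\tau _{\gamma },\xi )\) held fixed; all numerical inputs are hypothetical placeholders. Posterior computation under (30) would then proceed by Monte Carlo or MCMC over \(\mathbf{y}\) and the hyperparameters.

```python
# Forward simulation of one region under levels (29a)-(29c) with fixed
# hyperparameters; every numerical value is a hypothetical placeholder.
import numpy as np
from scipy.special import expit, logit

rng = np.random.default_rng(0)
P_dedup, N_reg, N_true = 0.35, 400, 430                # hypothetical region-r inputs
gamma0, gamma1, tau_gamma, xi = 1.0, 1.0, 0.1, 50.0    # fixed hyperparameters

mu = np.log(gamma0) + gamma1 * logit(P_dedup)          # mu_{gamma r} in (29c)
x = rng.normal(mu, tau_gamma)                          # logit of alpha_r/(alpha_r+beta_r)
size = rng.gamma(shape=1 + xi, scale=N_reg / xi)       # alpha_r + beta_r
alpha_r, beta_r = expit(x) * size, (1 - expit(x)) * size

p_r = rng.beta(alpha_r, beta_r)                        # level (29b)
N_net = rng.binomial(N_true, p_r)                      # level (29a)
print(alpha_r, beta_r, p_r, N_net)
```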
Under the specifications (29a)–(29d), after some tedious computations, we can show that the multivariate distribution for the number of individuals N in the target population conditional on the number of individuals \(\mathbf{N}^{\mathrm{net}}\) detected by the network is given by a continuous mixture:
$$ \mathbb{P} \bigl(\mathbf{N}|\mathbf{N}^{\mathrm{net}} \bigr) \propto \int _{\mathbb{R}^{R}}d^{R}\mathbf{y} \omega _{\mathrm{obs}} \bigl( \mathbf{y};\bar{\mathbf{P}}^{\mathrm{net}} \bigr)\prod _{r=1}^{R} \frac{\mathrm{negbin} (N_{r} - N_{r}^{\mathrm{net}}; 1 - p(y_{r}), N_{r}^{\mathrm{net}}+1 )}{p(y_{r})}, $$
(30)
where
-
\(\mathrm{negbin}(k;p,r)\) stands for the probability mass function of the negative binomial distribution for variable k and parameters p and r;
-
\(p(y_{r})\equiv \frac{e^{y_{r}}}{1+e^{y_{r}}}\);
-
\(\omega _{\mathrm{obs}}(\mathbf{y};\mathbf{P}^{\mathrm{net}})=\int _{ \Omega _{\gamma }}\mathrm{d}\log \gamma _{0}\,\mathrm{d}\gamma _{1} \,\mathrm{d}\tau _{\gamma }^{2} \mathrm{f}_{\gamma } (\log \gamma _{0},\gamma _{1},\tau ^{2}_{\gamma } ) \mathrm{n} ( \mathbf{y};{\boldsymbol{\mu }}_{\gamma }(\gamma _{0},\gamma _{1}; \bar{\mathbf{P}}^{\mathrm{net}}), \boldsymbol{\Sigma }_{\gamma } )\) where
-
\(\mathrm{n}(\mathbf{x}; {\boldsymbol{\mu }}, \boldsymbol{\Sigma })\) stands for the probability density function of the multivariate normal distribution for variable x and mean μ and covariance matrix Σ.
-
\(\mu _{\gamma r}(\gamma _{0},\gamma _{1};\bar{P}_{r}^{\mathrm{net}})= \log (\gamma _{0} [ \frac{\bar{P}_{r}^{\mathrm{net}}}{1-\bar{P}_{r}^{\mathrm{net}}} ]^{\gamma _{1}} )\).
-
\(\boldsymbol{\Sigma }_{\gamma }=\tau _{\gamma }^{2} \mathbb{I}_{R\times R}\).
In this derivation, again the assumption \(\alpha _{r},\beta _{r}\gg 1\) is taken for granted. Strictly speaking, we should have included \(\mathbf{P}^{\mathrm{net}}\) as conditioning random variables together with \(\mathbf{N}^{\mathrm{net}}\), but we have opted to keep the notation as simple as possible. To have an expression which can be computed we need to further specify the prior \(\mathrm{f}_{\gamma }\). As a first example, let us consider \(\gamma _{0}=\gamma _{1}=1\) and \(\tau _{\gamma }^{2}\to 0^{+}\). This amounts to having certainty about the values of \(\alpha _{r}\) and \(\beta _{r}\), as above, so that \(\omega _{\mathrm{obs}}(\mathbf{y};\bar{\mathbf{P}}^{\mathrm{net}})= \prod_{r=1}^{R}\delta (y_{r}-\mathrm{logit} (\bar{P}^{\mathrm{net}}_{r} ))\), where \(\delta (\cdot )\) stands for the Dirac delta function. Upon normalization expression (30) reduces to
$$ \mathbb{P} \bigl(\mathbf{N}|\mathbf{N}^{\mathrm{net}} \bigr)=\prod _{r=1}^{R} \mathrm{negbin} \bigl(N_{r}-N_{r}^{\mathrm{net}};1-\tilde{P}_{r}^{ \mathrm{net}}, N_{r}^{\mathrm{net}} + 1 \bigr). $$
(31)
The marginal distribution for region r reduces to (28), which was also obtained above through a direct reasoning.
Finally, we can also introduce the state process. The system is a human population and we can make a common modelling hypothesis to represent the number of individuals \(N_{r}\) in region r of the target population as a Poisson-distributed random variable in terms of the population density, i.e.
$$ N_{r}\simeq \mathrm{Poisson} (A_{r}\sigma _{r} ), $$
(32)
where \(\sigma _{r}\) stands for the population density of region r and \(A_{r}\) denotes the area of region r. We choose to model \(N_{r}\) in terms of the population density to make auxiliary use of some results already found in the literature [6]. Similarly to the observation process, we introduce the following hierarchy:
$$\begin{aligned}& N_{r}^{\mathrm{net}} \simeq \mathrm{Bin} (N_{r}, p_{r} ),\quad \mbox{for all } r=1,\dots ,R, \end{aligned}$$
(33a)
$$\begin{aligned}& N_{r} \simeq \mathrm{Poisson} (A_{r} \sigma _{r} ),\quad \mbox{for all } r=1,\dots ,R, \end{aligned}$$
(33b)
$$\begin{aligned}& p_{r} \simeq \mathrm{Beta} (\alpha _{r}, \beta _{r} ),\quad \mbox{for all } r=1,\dots ,R, \end{aligned}$$
(33c)
$$\begin{aligned}& \sigma _{r} \simeq \mathrm{Gamma} (1 + \zeta _{r}, \theta _{r} ),\quad \mbox{for all } r=1,\dots ,R, \end{aligned}$$
(33d)
where the hyperparameters will express the uncertainty about the register-based population and the detection probability. The values for \(\alpha _{r}\) and \(\beta _{r}\) are taken from (25a)–(25b). Regarding the hyperparameters \(\theta _{r}\) and \(\zeta _{r}\), notice that the modes of the gamma distributions are at \(\zeta _{r}\cdot \theta _{r}\) and the variances are given by \(\mathbb{V} (\sigma _{r} )=(\zeta _{r} + 1)\cdot \theta _{r}^{2}\). We shall parameterise these gamma distributions in terms of the register-based population densities \(\sigma _{r}^{\mathrm{reg}}\) as
$$\begin{aligned}& \zeta _{r}\cdot \theta _{r} = \sigma _{r}^{\mathrm{reg}}+ \Delta \sigma _{r}, \\& \sqrt{(\zeta _{r} + 1)\cdot \theta _{r}^{2}} = \epsilon _{r} \cdot \sigma _{r}^{\mathrm{reg}}, \end{aligned}$$
where \(\epsilon _{r}\) can be viewed as the coefficient of variation for \(\sigma _{r}^{\mathrm{reg}}\) and \(\Delta \sigma _{r}\) can be interpreted as the bias for \(\sigma _{r}^{\mathrm{reg}}\). This parametrization implies that
$$\begin{aligned}& \theta _{r}(\Delta \sigma _{r},\epsilon _{r}) = \frac{\sigma _{r}^{\mathrm{reg}}}{2} \biggl(1+ \frac{\Delta \sigma _{r}}{\sigma _{r}^{\mathrm{reg}}} \biggr) \biggl[ \sqrt{1 + \biggl( \frac{2\epsilon _{r}}{1+ \frac{\Delta \sigma _{r}}{\sigma _{r}^{\mathrm{reg}}}} \biggr)^{2}}-1 \biggr], \\& \zeta _{r}(\Delta \sigma _{r},\epsilon _{r}) = \frac{2}{\sqrt{1+ (\frac{2\epsilon _{r}}{1+\frac{\Delta \sigma _{r}}{\sigma _{r}^{\mathrm{reg}}}} )^{2}}-1}. \end{aligned}$$
(34)
Under assumptions (33a)–(33d) and assuming \(\alpha _{r},\beta _{r}\gg 1\), as above, we get
$$ \mathbb{P} \bigl(\mathbf{N}|\mathbf{N}^{\mathrm{net}} \bigr)=\prod _{r=1}^{R} \mathrm{negbin} \biggl(N_{r}-N_{r}^{\mathrm{net}}; \frac{\beta _{r}}{\alpha _{r}+\beta _{r}}\cdot Q( \theta _{r}), N_{r}^{ \mathrm{net}}+ 1 + \zeta _{r} \biggr), $$
(35)
where \(Q(\theta _{r})\equiv \frac{A_{r}\theta _{r}}{1+A_{r}\theta _{r}}\). The interpretation of this hierarchy is also simple. It is just a Poisson-gamma model in which the gamma parameters have been chosen so that we account for the uncertainty in the register-based population figures \(N_{r}^{\mathrm{reg}}\). Usual point estimators are easily derived from (35):
$$\begin{aligned}& \widehat{N}_{r}^{\mathrm{MAP}} = N_{r}^{\mathrm{net}} + \biggl\lfloor \frac{(1-\tilde{P}_{r}^{\mathrm{net}})\cdot Q(\theta _{r})}{1 - (1-\tilde{P}_{r}^{\mathrm{net}})\cdot Q(\theta _{r})} \bigl(N_{r}^{\mathrm{net}}+\zeta _{r} \bigr) \biggr\rfloor , \\& \widehat{N}_{r}^{\mathrm{mean}} = N_{r}^{\mathrm{net}} + \frac{(1-\tilde{P}_{r}^{\mathrm{net}})\cdot Q(\theta _{r})}{1 - (1-\tilde{P}_{r}^{\mathrm{net}})\cdot Q(\theta _{r})} \cdot \bigl(N_{r}^{\mathrm{net}}+ 1 + \zeta _{r} \bigr). \end{aligned}$$
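A minimal sketch of this state-process computation follows; it obtains \((\theta _{r},\zeta _{r})\) from eq. (34) and the point estimates from eq. (35). The area, register-based density, bias, and coefficient of variation used below are hypothetical placeholders.

```python
# Sketch of the Poisson-gamma state process: hyperparameters from eq. (34)
# and point estimates from eq. (35); all numerical inputs are hypothetical.
import numpy as np

N_net, P_dedup = 120, 0.35     # hypothetical network count and deduplicated penetration rate
A_r, sigma_reg = 4.0, 100.0    # hypothetical area (km^2) and register-based density
delta_sigma, eps = 0.0, 0.10   # assumed bias and coefficient of variation of sigma_reg

root = np.sqrt(1 + (2 * eps / (1 + delta_sigma / sigma_reg)) ** 2)
theta_r = 0.5 * (sigma_reg + delta_sigma) * (root - 1)   # eq. (34), gamma scale
zeta_r = 2 / (root - 1)                                   # eq. (34), gamma shape - 1

Q = A_r * theta_r / (1 + A_r * theta_r)
q = (1 - P_dedup) * Q                                     # negbin parameter in (35)

N_map = N_net + int(q * (N_net + zeta_r) / (1 - q))
N_mean = N_net + q * (N_net + 1 + zeta_r) / (1 - q)
print(theta_r, zeta_r, N_map, N_mean)
```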
Accuracy indicators such as posterior variance or credible intervals are computed from the distribution (35) as usual. Expression (35) contains the uncertainty of both the observation and the state processes. In the limiting case \(\epsilon _{r}\to 0^{+}\) and \(\Delta \sigma _{r}\to 0\), i.e. having certainty about the state process, and with equations (25a)–(25b), we have the Poisson limit of the negative binomial distribution so that
$$ \mathbb{P} \bigl(\mathbf{N}|\mathbf{N}^{\mathrm{net}} \bigr)=\prod _{r=1}^{R} \mathrm{poisson} \bigl(N_{r}-N_{r}^{\mathrm{net}}; \bigl(1-\bar{P}_{r}^{ \mathrm{net}}\bigr)\cdot A_{r}\sigma _{r}^{\mathrm{reg}} \bigr). $$
(36)
The MAP estimator is trivially \(\hat{N}^{\mathrm{MAP}}=N_{r}^{\mathrm{net}}+ \lfloor (1-\bar{P}_{r})A_{r} \sigma _{r}^{\mathrm{reg}} \rfloor \) and the mean estimator is trivially \(\hat{N}^{\mathrm{mean}}=N_{r}^{\mathrm{net}}+(1-\bar{P}_{r})A_{r} \sigma _{r}^{\mathrm{reg}}\), both of which can be readily read as the sum of the individuals detected by the network and the individuals not detected by the network accounted for by the population register.
On the contrary, when \(\epsilon _{r}\to \infty \) (i.e. having no information at all about the state process), we have \(Q(\theta _{r})=1\) and \(\zeta _{r}=0\) so that
$$ \mathbb{P} \bigl(\mathbf{N}|\mathbf{N}^{\mathrm{net}} \bigr)= \prod _{r=1}^{R}\mathrm{negbin} \bigl(N_{r}-N_{r}^{\mathrm{net}};1- \bar{P}_{r}, N_{r}^{\mathrm{net}}+1 \bigr), $$
(37)
which is the same expression as (31), as expected, since having no information about the state process is equivalent to having only the observation process. Notice that we can also introduce more levels in the hierarchy regarding the state process:
$$\begin{aligned}& N_{r}^{\mathrm{net}} \simeq \mathrm{Binomial} (N_{r}, p_{r} ),\quad \mbox{for all } r=1, \dots ,R, \end{aligned}$$
(38a)
$$\begin{aligned}& N_{r} \simeq \mathrm{Poisson} (A_{r} \sigma _{r} ),\quad \mbox{for all } r=1,\dots ,R, \end{aligned}$$
(38b)
$$\begin{aligned}& p_{r} \simeq \mathrm{Beta} (\alpha _{r}, \beta _{r} ),\quad \mbox{for all } r=1,\dots ,R, \end{aligned}$$
(38c)
$$\begin{aligned}& \sigma _{r} \simeq \mathrm{Gamma} \biggl( \zeta +1, \frac{e^{\theta _{r}}}{\zeta } \biggr),\quad \mbox{for all } r=1,\dots ,R, \end{aligned}$$
(38d)
$$\begin{aligned}& \biggl(\mathrm{logit} \biggl( \frac{\alpha _{r}}{\alpha _{r}+\beta _{r}} \biggr), \alpha _{r} + \beta _{r} \biggr) \\& \quad \simeq \mathrm{N} \bigl( \mu _{\gamma r} \bigl(\gamma _{0},\gamma _{1};\bar{P}_{r}^{\mathrm{net}} \bigr), \tau ^{2}_{\gamma } \bigr)\times \mathrm{Gamma} \biggl(1+ \xi , \frac{N_{r}^{\mathrm{reg}}}{\xi } \biggr),\quad \mbox{for all } r=1, \dots ,R, \end{aligned}$$
(38e)
$$\begin{aligned}& \theta _{r} \simeq \mathrm{N} \bigl(\mu _{ \delta r}\bigl(\delta _{0},\delta _{1};\sigma _{r}^{\mathrm{reg}} \bigr), \tau _{ \delta }^{2} \bigr),\quad \mbox{for all } r=1,\dots ,R, \end{aligned}$$
(38f)
$$\begin{aligned}& \bigl(\log \gamma _{0},\gamma _{1},\tau ^{2}_{\gamma }, \xi \bigr) \simeq \mathrm{f}_{\gamma } \bigl(\log \gamma _{0},\gamma _{1},\tau ^{2}_{\gamma } \bigr)\times \mathrm{f}_{ \xi }(\xi ) \end{aligned}$$
(38g)
$$\begin{aligned}& \bigl(\log \delta _{0},\delta _{1},\tau ^{2}_{\delta }, \zeta \bigr) \simeq \mathrm{f}_{\delta } \bigl(\log \delta _{0},\delta _{1},\tau ^{2}_{\delta } \bigr) \times \mathrm{f}_{\zeta }(\zeta ), \end{aligned}$$
(38h)
where we have denoted \(\mu _{\delta r}(\delta _{0},\delta _{1};\sigma _{r}^{\mathrm{reg}}) \equiv \log (\delta _{0} [\sigma _{r}^{\mathrm{reg}} ]^{ \delta _{1}} )\) and \(f_{\gamma }\), \(f_{\xi }\), \(f_{\delta }\), \(f_{\zeta }\) stand for prior distributions.
The interpretation of this hierarchy is also simple. It is just a combined beta-binomial and Poisson-gamma model in which the gamma parameters have been chosen so that the mode is at \(\exp (\theta _{r})\) with an uncertainty degree provided by ζ. Notice that the smaller ζ, the greater the uncertainty about the population density \(\sigma _{r}\). The mode is correlated with the register-based population density \(\sigma _{r}^{\mathrm{reg}}\) through a linear regression.
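As an illustration of this deeper hierarchy, the sketch below forward-simulates the state-process levels (38b), (38d), and (38f) for one region with the hyperparameters \((\delta _{0},\delta _{1},\tau _{\delta },\zeta )\) held fixed; all numerical inputs are hypothetical placeholders.

```python
# Forward simulation of the state-process levels (38b), (38d), (38f) for one
# region with fixed hyperparameters; all numerical inputs are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
A_r, sigma_reg = 4.0, 100.0                              # hypothetical area and register-based density
delta0, delta1, tau_delta, zeta = 1.0, 1.0, 0.05, 50.0   # fixed hyperparameters

mu_delta = np.log(delta0) + delta1 * np.log(sigma_reg)              # mu_{delta r} in (38f)
theta_r = rng.normal(mu_delta, tau_delta)                           # level (38f)
sigma_r = rng.gamma(shape=zeta + 1, scale=np.exp(theta_r) / zeta)   # level (38d), mode exp(theta_r)
N_r = rng.poisson(A_r * sigma_r)                                    # level (38b)
print(theta_r, sigma_r, N_r)
```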
Under the specifications (38a)–(38h), again after some tedious computation, we can show that the multivariate distribution for the number of individuals N in the target population conditional on the number of individuals \(\mathbf{N}^{\mathrm{net}}\) detected by the network is given by
$$\begin{aligned} \mathbb{P} \bigl(\mathbf{N}|\mathbf{N}^{\mathrm{net}} \bigr) \propto & \int _{\mathbb{R}^{R}}d^{R}\mathbf{y} \omega _{ \mathrm{obs}} \bigl(\mathbf{y};\bar{\mathbf{P}}^{\mathrm{net}} \bigr) \prod _{r=1}^{R} \frac{\mathrm{negbin} (N_{r} - N_{r}^{\mathrm{net}}; 1 - p(y_{r}), N_{r}^{\mathrm{net}}+1 )}{p(y_{r})} \\ &{}\times \int _{\mathbb{R}^{R}}d^{R}\mathbf{z} \omega _{ \mathrm{state}} \bigl(\mathbf{z};{\boldsymbol{\sigma }}^{\mathrm{reg}} \bigr)\prod _{r=1}^{R}\mathrm{negbin} \biggl(N_{r}; q \biggl( \frac{A_{r}e^{z_{r}}}{\zeta } \biggr), 1 + \zeta \biggr), \end{aligned}$$
(39)
where
-
\(\mathrm{negbin}(k;p,r)\) stands for the probability mass function of the negative binomial distribution for variable k and parameters p and r;
-
\(p(y_{r})\equiv \frac{e^{y_{r}}}{1+e^{y_{r}}}\);
-
\(\omega _{\mathrm{obs}}(\mathbf{y};\mathbf{P}^{\mathrm{net}})=\int _{ \Omega _{\gamma }}\mathrm{d}\log \gamma _{0}\,\mathrm{d}\gamma _{1} \,\mathrm{d}\tau _{\gamma }^{2} \mathrm{f}_{\gamma } (\log \gamma _{0},\gamma _{1},\tau ^{2}_{\gamma } ) \mathrm{n} ( \mathbf{y};{\boldsymbol{\mu }}_{\gamma }(\gamma _{0},\gamma _{1}; \bar{\mathbf{P}}^{\mathrm{net}}), \boldsymbol{\Sigma }_{\gamma } )\) where
-
\(\mathrm{n}(\mathbf{x}; {\boldsymbol{\mu }}, \boldsymbol{\Sigma })\) stands for the probability density function of the multivariate normal distribution for variable x and mean μ and variance matrix Σ.
-
\(\mu _{\gamma r}(\gamma _{0},\gamma _{1};\bar{P}_{r}^{\mathrm{net}})= \log (\gamma _{0} [ \frac{\bar{P}_{r}^{\mathrm{net}}}{1-\bar{P}_{r}^{\mathrm{net}}} ]^{\gamma _{1}} )\).
-
\(\boldsymbol{\Sigma }_{\gamma }=\tau _{\gamma }^{2} \mathbb{I}_{R\times R}\);
-
\(q (\frac{A_{r}e^{z_{r}}}{\zeta } )\equiv \frac{\frac{A_{r}e^{z_{r}}}{\zeta }}{1+\frac{A_{r}e^{z_{r}}}{\zeta }}\);
-
\(\omega _{\mathrm{state}}(\mathbf{z};{\boldsymbol{\sigma }}^{ \mathrm{reg}})=\int _{\Omega _{\delta ,\zeta }}\mathrm{d}\log \delta _{0} \,\mathrm{d}\delta _{1}\,\mathrm{d}\tau _{\delta }^{2}\,\mathrm{d}\zeta \mathrm{f}_{\delta } (\log \delta _{0},\delta _{1},\tau ^{2}_{ \delta } )\times \mathrm{f}_{\zeta }(\zeta ) n (\mathbf{z};{ \boldsymbol{\mu }}_{\delta }(\delta _{0},\delta _{1};{\boldsymbol{\sigma }}^{ \mathrm{reg}}), \boldsymbol{\Sigma }_{\delta } )\) with
-
\(\mathrm{n}(\mathbf{x}; {\boldsymbol{\mu }}, \boldsymbol{\Sigma })\) stands for the probability density function of the multivariate normal distribution for variable x and mean μ and variance matrix Σ.
-
\(\mu _{\delta r}(\delta _{0},\delta _{1};\sigma _{r}^{\mathrm{reg}})= \log (\delta _{0} [\sigma _{r}^{\mathrm{reg}} ]^{ \delta _{1}} )\).
-
\(\boldsymbol{\Sigma }_{\delta }=\tau _{\delta }^{2} \mathbb{I}_{R\times R}\).
Notice how this expression reveals both factors arising from the observation and the state processes, respectively. When \(\gamma _{0},\gamma _{1},\delta _{0},\delta _{1}\to 1\), \(\zeta \to \zeta ^{*}\), and \(\tau _{\gamma }^{2},\tau _{\delta }^{2}\to 0^{+}\) (i.e. when having fully accurate information about the parameters \(\alpha _{r}\), \(\beta _{r}\) and \(\theta _{r}\)), we have \(\omega _{\mathrm{obs}}(\mathbf{y})=\delta (\mathbf{y}-{\boldsymbol{\mu }}_{ \gamma })\) and \(\omega _{\mathrm{state}}(\mathbf{z})=\delta (\mathbf{z}-{\boldsymbol{\mu }}_{ \delta })\) so that after normalization equation (39) reduces to
$$ \mathbb{P} \bigl(\mathbf{N}|\mathbf{N}^{\mathrm{net}} \bigr)=\prod _{r=1}^{R} \mathrm{negbin} \bigl(N_{r}-N_{r}^{\mathrm{net}}; (1-\bar{P}_{r})\cdot Q_{r}\bigl( \zeta ^{*} \bigr), N_{r}^{\mathrm{net}}+\zeta ^{*}+1 \bigr), $$
(40)
where we have denoted \(Q_{r}(\zeta )\equiv q(\frac{A_{r}\sigma _{r}^{\mathrm{reg}}}{\zeta })\), which is indeed again equation (35).
7.2 Present population at times \(t> t_{0}\)
Now, we propose to produce probability distributions for the number of individuals \(N_{tr}\) in the target population for times \(t>t_{0}\) at region r. Currently, we consider only closed populations, i.e. neither individuals nor devices enter into or leave the territory under analysis along the whole time period. This important restriction is imposed so that the different methods can be introduced progressively and every single aspect of the procedure can be assessed thoroughly. It will have to be lifted in future work (e.g. considering sink and source tiles in the reference grid).
Our reasoning tries to introduce as few assumptions as possible. Thus, we begin by considering a balance equation. Let us denote by \(N_{t,rs}\) the number of individuals moving from region s to region r in the time interval \((t-1, t)\). Then, we can write
$$\begin{aligned} N_{tr} = & N_{t-1r}+\sum_{\substack{r_{t}=1\\r_{t}\neq r}}^{N_{T}}N_{t,rr_{t}} - \sum_{\substack{r_{t}=1\\r_{t}\neq r}}^{N_{T}}N_{t,r_{t}r} \\ = & \sum_{r_{t}= 1}^{N_{T}}\tau _{t,rr_{t}}\cdot N_{t-1r_{t}}, \end{aligned}$$
(41)
where we have defined \(\tau _{t,rs}=\frac{N_{t,rs}}{N_{t-1s}}\) (0 if \(N_{t-1s} = 0\)). Notice that \(\tau _{t,rs}\) can be interpreted as an aggregate transition probability from region s to region r at time interval \((t-1, t)\) in the target population.
We make the assumption that individuals detected by the network move across regions in the same way as individuals in the target population. Thus, we can use
$$\tau _{t,rs}^{\mathrm{net}}\equiv \frac{N^{\mathrm{net}}_{t,rs}}{N^{\mathrm{net}}_{t-1s}}$$
to model \(\tau _{t,rs}\). In particular, as our first choice we shall postulate \(\tau _{t,rs}=\tau _{t,rs}^{\mathrm{net}}\). The probability distributions of \(N^{\mathrm{net}}_{t-1\,s}\) and \([\mathbf{N}^{\mathrm{net}}_{t}]_{sr} = N_{t,rs}^{\mathrm{net}}\) were indeed already computed in the aggregation module (Sect. 6).
Finally, we mention two points. On the one hand, the random variables \(N_{tr}\) are defined recursively in the time index t, so that once we have computed the probability distribution at time \(t_{0}\), we can use (41) to compute the probability distribution at later times \(t>t_{0}\). On the other hand, Monte Carlo techniques should again be used to build these probability distributions. Once we have probability distributions, we can make point estimations and compute accuracy indicators as above (posterior variance, posterior coefficient of variation, credible intervals).
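A minimal Monte Carlo sketch of this recursion follows: posterior samples of \(\mathbf{N}_{t_{0}}\) (here replaced by a crude placeholder) are propagated through (41) with the network-based transition proportions \(\tau _{t,rs}^{\mathrm{net}}\). All numerical inputs are hypothetical; in practice the network counts themselves carry a distribution and would also be sampled.

```python
# Monte Carlo sketch of recursion (41) with tau_t = tau_t^net.
# Posterior samples of N_{t0} and the network counts are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(2)
R, S = 3, 1000                                # regions and Monte Carlo samples

# Placeholder for posterior samples of N_{t0,r} obtained as in Sect. 7.1.
N_t0 = rng.poisson(lam=[150, 300, 50], size=(S, R))

# Network counts at t-1 and origin-destination network counts in (t-1, t):
# element (r, s) of N_net_od counts devices moving from region s to region r.
N_net_prev = np.array([60.0, 120.0, 20.0])
N_net_od = np.array([[50.0,  10.0,  2.0],
                     [ 8.0, 105.0,  3.0],
                     [ 2.0,   5.0, 15.0]])

tau = N_net_od / N_net_prev                   # tau_{t,rs}^net; each column sums to 1
N_t1 = N_t0 @ tau.T                           # eq. (41) applied to every sample

print(N_t1.mean(axis=0))                      # point estimates for N_{t,r}
print(np.percentile(N_t1, [2.5, 97.5], axis=0))  # 95% credible intervals
```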
7.3 Origin-destination matrices
The inference of the origin-destination matrices for the target population is more delicate than that of the present population because the auxiliary information from population registers does not contain this kind of information. Therefore, the statistical models proposed above for the present population estimation cannot be applied. As a first important conclusion we point out that, in our view, National Statistical Plans should start considering what kind of auxiliary information is needed to make a more accurate use of mobile network data and, more generally, of new digital data.
We can provide a simple argument extending the above model to produce credible intervals for the origin-destination matrices. If \(N_{tr}\) and \(\tau _{t,rs}\) denote the number of individuals of the target population at time t in region r and the aggregate transition probability from region s to region r at the time interval \((t-1,t)\), then we can simply define \(N_{t,rs} = N_{t-1s}\times \tau _{t,rs}\) and trivially build the origin-destination matrix for each time interval \((t-1, t)\). Under the same general assumption as before, namely that individuals move across the geographical territory independently of their mobile network operator (or, indeed, of whether they are subscribers at all or carry two devices), we can postulate as a first simple choice \(\tau _{t,rs}=\tau _{t,rs}^{\mathrm{net}}\), as before.
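A sketch of this construction, reusing hypothetical transition proportions and placeholder posterior samples as in the sketch of Sect. 7.2, is the following.

```python
# Sketch of the origin-destination construction N_{t,rs} = N_{t-1,s} * tau_{t,rs},
# using hypothetical transition proportions and placeholder posterior samples.
import numpy as np

rng = np.random.default_rng(3)
N_prev = rng.poisson(lam=[150, 300, 50], size=(1000, 3))    # samples of N_{t-1,s}
tau = np.array([[0.833, 0.083, 0.100],
                [0.133, 0.875, 0.150],
                [0.034, 0.042, 0.750]])                     # hypothetical tau_{t,rs}^net

# One origin-destination matrix per sample: element (r, s) counts moves s -> r.
OD = N_prev[:, None, :] * tau[None, :, :]
print(OD.mean(axis=0))                                      # point estimate of the OD matrix
print(np.percentile(OD, [2.5, 97.5], axis=0).shape)         # bounds of 95% credible intervals
```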
7.4 An example with simulated data
Let us again illustrate this approach with the same example generated with the mobile network event simulator. We consider once more the toy scenario with a population of 500 individuals and 186 subscribers with 218 mobile devices in a territory with a bounding box of \(10\mbox{ km}\times 10\mbox{ km}\) divided into 10 regions as in Fig. 10. The simulator provides the true position of each individual at each time instant so that we can make a comparison with the (synthetic) ground truth.
For the time being, we shall only provide results for the posterior distributions (26), (28), and (35), leaving the full hierarchies for future work. Taking advantage of the simulated ground truth we shall provide results taking as prior information different ranges of \(N^{\mathrm{net}}\) and \(N^{\mathrm{reg}}\) to better appreciate how errors in the input data affect the final estimates. Firstly, we shall consider values \(N^{\mathrm{net}} = (1 + \mathrm{rb}^{\mathrm{net}})\cdot N^{ \mathrm{net}0}\), so that we can investigate the effect of the bias in the input number of individuals detected by the network with respect to their true values \(N^{\mathrm{net}0}\). Secondly, similarly, we shall consider values \(N^{\mathrm{reg}} = (1 + \mathrm{rb}^{\mathrm{reg}})\cdot N^{ \mathrm{reg}0}\), so that we can investigate the effect of the bias in the input number of individuals according to the population register with respect to their true values \(N^{\mathrm{reg}0}\). Finally, for the model with the state process (35), we shall also consider the range of values for the coefficient of variation of \(N^{\mathrm{reg}}\) given by \(\mathrm{cv}^{\mathrm{Nreg}}=0.01,0.05, 0.10, 0.15, 0.20\). In all cases we shall only use the RSS geolocation model with uniform prior.
In Figs. 14, 15, and 16 we represent the credible intervals for the initial number of individuals for different values of \(\mathrm{rb}^{\mathrm{net}}\) and \(\mathrm{rb}^{\mathrm{reg}}\). In the case with the state-process model we have focused on the largest coefficient of variation \(\mathrm{cv}^{\mathrm{Nreg}}=0.2\).
We observe that the uncertainty grows as the bias of the number of individuals according to the population register grows in the positive direction (overestimation). The uncertainty grows in the same fashion with respect to the bias in the number of individuals detected by the network. The sensitivity of the model including the state process (35) is also evident, which argues against modelling the state process in this way: if the state process (the number of individuals in the target population) is not modelled more robustly, errors in the register-based population figures will be translated into the estimates based on mobile network data. In our view, this is a clear example of how prior hypotheses on the generating model for target variables can be dangerous in Official Statistics (which has historically favoured design-based inference over the model-based approach). Now that models need to be used, robustness becomes a priority. Finally, we also see an overestimation effect (intervals displaced upwards) as the biases grow. Further analysis is needed, but in general the computed credible intervals cover the true values fairly accurately.
For the present population at later times and the origin-destination matrices we will see directly in the next section how to integrate all modules to produce final estimates from the initial input data from the telecommunication network.