
Downscaling spatial interaction with socioeconomic attributes


A variety of complex socioeconomic phenomena, such as migration, commuting, and trade, can be abstracted by spatial interaction networks, where nodes represent geographic locations and weighted edges convey the interaction and its strength. However, obtaining fine-grained spatial interaction data is very challenging in practice due to limitations in collection methods and costs, so spatial interaction data such as transportation data and trade data are often only available at a coarse scale. Here, we propose a gravity downscaling (GD) method based on readily accessible socioeconomic data and the gravity law to infer fine-grained interactions from coarse-grained data. GD assumes that interactions at different spatial scales are governed by a similar gravity law, so the parameters estimated from coarse-grained regions can be transferred to fine-grained regions. Results show that GD has an average improvement of 24.6% in Mean Absolute Percentage Error over alternative downscaling methods (i.e., the areal-weighted method and machine learning models) across datasets with different spatial scales and in various regions. Using simple assumptions, GD enables accurate downscaling of spatial interactions, making it applicable to a wide range of fields, including human mobility, transportation, and trade.

1 Introduction

Recently, spatial interaction patterns between regions have attracted wide attention in scientific communities [1–3]. Examples can be found in a variety of domains, from the flow of movement in transportation networks [4, 5] to the spread of epidemics in cities [6, 7] and, more generally, human mobility in urban systems [8–10]. However, fine-grained spatial interaction data representing detailed human activities are difficult to access [11–13]. The primary reason is the high cost of collecting detailed spatial interactions. As the spatial resolution increases, the volume of spatial interaction data grows rapidly, because the number of origin-destination pairs scales with the square of the number of regions, which limits data collection to a few important regions. For example, traffic flows are typically limited to major roads, and trade flows are collected at major checkpoints. Additionally, companies must comply with privacy regulations that prohibit the disclosure of personal or granular spatial interaction data. Therefore, even though many companies that provide location-based services can collect fine-grained spatial interaction data, researchers still have limited access to these datasets, which has become a major obstacle to a large number of geography-related applications, such as traffic flow prediction, disease transmission modeling, and tourism planning [14–16].

Although fine-grained interaction data are not easy to obtain, coarse-grained data are relatively easy to acquire, highlighting the importance of finding a feasible method to downscale the spatial interaction to overcome the resolution limitation. Here, spatial interaction downscaling refers to transforming spatial flows from coarse-grained to finer-grained regions.

Researchers have devoted considerable efforts to estimating granular flow data in recent decades. On the one hand, several classic flow interpolation methods have been proposed to estimate flows between two different spatial zoning schemes [17, 18], from census tracts to Traffic Analysis Zones (TAZs) for example. In these methods, each flow between census tracts from A to B is calculated as the weighted sum of flows between all TAZs overlapping with A or B. The weights are determined by the administrative area or built-up area ratio of the census tract’s overlapping parts to the parent TAZ [17]. On the other hand, machine learning methods are employed to generate flows between locations using multiple geographic features, including population, land uses, roads, and distances [19–21]. While flow interpolation methods achieve promising results in interpolating zoning schemes with similar scales, they demonstrate limited accuracy in downscaling tasks [17]. Flow generation models based on machine learning methods may not perform well on downscaling tasks either, as these models strictly depend on training data and may not be geographically transferable [22] (e.g., from coarse-grained regions to fine-grained regions).

In this work, we propose a gravity downscaling (GD) method for spatial interactions based on the key assumption that interactions are governed by a similar gravity law at different spatial scales [23]. This assumption is partially supported by previous research on (spatial or geometrical) networks, which exhibit self-repeating patterns across scales [24–26]. Furthermore, spatial interaction can be well represented by spatial networks [27]. We can, therefore, estimate the parameters of the gravity model from the coarse-grained region attributes and apply these parameters at the fine-grained scale. The detailed procedure is shown in Fig. 1.

Figure 1

The spatial interaction downscaling diagram. The coarse-grained region attributes and spatial interaction data are fitted to the gravity model to estimate parameters. Subsequently, the fine-grained flow is estimated using these parameters. Then, the estimated fine-grained flow is calibrated to ensure consistency in the total flow. Finally, the model outputs the fine-grained spatial interaction

To validate the proposed method, we use three datasets (two cellphone datasets and one taxi trajectory dataset) with varying scales, regions, and types of spatial interactions. The commonly used areal-weighted flow interpolation method, eXtreme Gradient Boosting (XGBoost) [28], and a Deep Neural Network (DNN) are used as benchmarks. Overall, our method achieves improvements of up to 66.9% on the cellphone datasets and 67.3% on the taxi trajectory dataset compared to the benchmark methods. Moreover, GD demonstrates excellent generalization capabilities for downscaling tasks of diverse scales, whether from the city level to the county level or from the county level to the sub-district level. Additionally, we highlight that GD maintains relatively high accuracy even when only a limited set of attributes (population alone) is available, which makes it particularly valuable in data-scarce situations.

Based on simple assumptions and easily accessible data, our approach demonstrates high accuracy and transferability in estimating fine-grained interactions from coarse-grained interactions. This approach holds substantial value and applicability across multiple fields, including human mobility analysis, transportation planning, urban accessibility assessment, and trade analysis. Furthermore, comparing the classic gravity model with machine learning methods provides valuable insights for advancing research in geospatial data science and related fields.

2 Methods and data

2.1 Model

The gravity model, one of the most commonly used spatial interaction models [23, 29], assumes that the number of individuals that move between two locations per unit of time is proportional to some power of the population of the source and destination locations, and decays with the distance between them. Compared with other spatial interaction models, the gravity model excels in estimating flows, effectively preserving the structure of the spatial interaction network while accurately fitting the distribution of flow distances [30, 31]. Numerous studies have shown that flows between each pair of regions can be well estimated with the enriched gravity model, by extending the population terms to more socioeconomic attributes [32–34]:

$$ \ln T_{mn}=\ln k+{\alpha _{1}}\ln {M_{m}^{1}} +\cdots+ {\alpha _{b}}\ln {M_{m}^{b}} + {\beta _{1}}\ln {M_{n}^{1}} +\cdots+ {\beta _{b}}\ln {M_{n}^{b}} -{ \gamma}\ln d_{mn} $$

where \(T_{mn}\) denotes the interaction intensity between region m and region n (\(m{\neq}n\) since self-interaction is not considered in this study), and \(d_{mn}\) is the distance between them. Each region is supposed to have b attributes, such as population and gross domestic product (GDP), and \(M_{m}^{x}\) represents the \(x^{th}\) attribute of region m. \({\alpha _{1}},\ldots,{\alpha _{b}}\) and \({\beta _{1}},\ldots,{\beta _{b}}\) are the parameters to be estimated. Here, we use a power law distance decay function with an exponent of γ [35, 36]. To eliminate unnecessary variables and establish a more concise model, we apply stepwise regression [37–39] to estimate the parameters \({\alpha _{1}},\ldots,{\alpha _{b}}\), \({\beta _{1}},\ldots,{\beta _{b}}\), γ and k from the coarse-grained data. Then, those parameters are used to infer the fine-grained interactions:

$$ \ln \hat{T}_{ij}=\ln k+{\alpha _{1}}\ln {M_{i}^{1}} +\cdots+ {\alpha _{b}} \ln {M_{i}^{b}} + {\beta _{1}}\ln {M_{j}^{1}} +\cdots+ {\beta _{b}}\ln {M_{j}^{b}} -{\gamma}\ln d_{ij} $$

where \(\hat{T}_{ij}\) is the estimated flow intensity between fine-grained region i and region j. To ensure that the sum of fine-grained flows \(\sum _{i \in m} \sum _{j \in n} T_{ij}^{pred}\) equals the corresponding parent coarse-grained flow \(T_{mn}\), the results are calibrated by

$$ T_{ij}^{pred} = \hat{T}_{ij} \frac{T_{mn}}{\sum _{i \in m} \sum _{j \in n} \hat{T}_{ij}}. $$
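The three steps above (fit on coarse-grained data, transfer the parameters, calibrate to the parent flow) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a population-only gravity model, replaces stepwise regression with plain ordinary least squares on the log-linear form, and uses synthetic numbers. All function names (`fit_gravity`, `predict`, `calibrate`) are hypothetical.

```python
import math

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_gravity(flows):
    """Least-squares fit of ln T = ln k + a ln P_o + b ln P_d - g ln d.

    flows: list of (T, pop_origin, pop_dest, distance) for coarse regions.
    Returns (ln_k, alpha, beta, gamma). Plain OLS is an assumption here;
    the paper uses stepwise regression over several attributes.
    """
    X = [[1.0, math.log(p_o), math.log(p_d), -math.log(d)]
         for _, p_o, p_d, d in flows]
    y = [math.log(T) for T, _, _, _ in flows]
    p = 4
    XtX = [[sum(r[a] * r[c] for r in X) for c in range(p)] for a in range(p)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    return tuple(solve_linear(XtX, Xty))

def predict(params, p_o, p_d, d):
    """Uncalibrated gravity estimate for one origin-destination pair."""
    ln_k, alpha, beta, gamma = params
    return math.exp(ln_k + alpha * math.log(p_o)
                    + beta * math.log(p_d) - gamma * math.log(d))

def calibrate(raw, coarse_total):
    """Rescale fine-grained estimates so they sum to the parent flow."""
    s = sum(raw.values())
    return {pair: v * coarse_total / s for pair, v in raw.items()}

# Synthetic coarse-grained data generated from known parameters.
true_params = (math.log(0.5), 0.8, 0.7, 1.5)
pops = [100.0, 250.0, 400.0, 900.0]
dist = {(0, 1): 10.0, (0, 2): 25.0, (0, 3): 40.0,
        (1, 2): 18.0, (1, 3): 33.0, (2, 3): 12.0}
coarse = []
for m in range(4):
    for n in range(4):
        if m != n:  # self-interaction is excluded, as in the text
            d = dist[(min(m, n), max(m, n))]
            coarse.append((predict(true_params, pops[m], pops[n], d),
                           pops[m], pops[n], d))
params = fit_gravity(coarse)

# Transfer the parameters to the sub-regions of one coarse pair (m, n)
# and calibrate against an illustrative parent flow T_mn = 1200.
origin_pops, dest_pops = [30.0, 70.0], [100.0, 150.0]
fine_dist = {(0, 0): 8.0, (0, 1): 11.0, (1, 0): 9.0, (1, 1): 12.0}
raw = {(i, j): predict(params, origin_pops[i], dest_pops[j], fine_dist[(i, j)])
       for (i, j) in fine_dist}
fine = calibrate(raw, 1200.0)
```

Because calibration multiplies every estimate under a coarse pair by the same factor, it fixes the total but leaves the relative shares among the fine-grained pairs as predicted by the gravity model.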

2.2 Model evaluation

2.2.1 Evaluation metrics

To evaluate the results, we adopt five metrics. Three of them measure accuracy at the level of individual flows. Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) measure the deviation between estimated and true values, with smaller values indicating better accuracy. Common Part of Commuters (CPC) is a similarity measure ranging from 0 to 1, where a value closer to 1 indicates higher similarity. The remaining two are network metrics, introduced after the formulas. In the formulas, n denotes the number of flows being evaluated (not the region index used in Sect. 2.1). The formulas for calculating the three accuracy metrics are as follows:

$$\begin{aligned}& \text{RMSE}=\sqrt{\frac{1}{n}\sum _{i,j}({T_{ij}^{pred}}-T_{ij})^{2}} \end{aligned}$$
$$\begin{aligned}& \text{MAPE}=\frac{1}{n} \sum _{i,j}\left | \frac{{T_{ij}^{pred}}-T_{ij}}{T_{ij}} \right | \end{aligned}$$
$$\begin{aligned}& \text{CPC}= \frac{2\sum _{ij}\mathrm{min}({T_{ij}^{pred}},T_{ij})}{\sum _{ij}{T_{ij}^{pred}}+\sum _{ij}{T_{ij}}} \end{aligned}$$
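As an illustration, the three formulas can be written directly as Python functions. This is a sketch with hypothetical function names and made-up flow values; it assumes the predicted and true flows are given as equally ordered lists and that no true flow is zero (MAPE is undefined otherwise).

```python
import math

def rmse(pred, true):
    """Root mean squared error over all evaluated flows."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def mape(pred, true):
    """Mean absolute percentage error; requires every true flow to be non-zero."""
    return sum(abs((p - t) / t) for p, t in zip(pred, true)) / len(true)

def cpc(pred, true):
    """Common Part of Commuters: 1 means identical flows, 0 means no overlap."""
    common = sum(min(p, t) for p, t in zip(pred, true))
    return 2 * common / (sum(pred) + sum(true))

# Made-up example values.
true_flows = [10.0, 20.0, 30.0]
pred_flows = [12.0, 18.0, 30.0]
print(rmse(pred_flows, true_flows))  # about 1.633
print(mape(pred_flows, true_flows))  # about 0.1
print(cpc(pred_flows, true_flows))   # about 0.967
```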

Two network metrics are used from the complex network perspective [40]. One is the weighted degree centrality [41], and the other is the weighted clustering coefficient. Weighted degree centrality, considering both the number of connections (degree) and the strength of those connections (weights), indicates nodes’ importance within a network. Weighted degree centrality can be calculated by \(\sum _{j} T_{ij} \), where \(T_{ij}\) is the flow intensity. The weighted clustering coefficient gauges the tendency of nodes in a network to form clusters [42]. The weighted clustering coefficient can be computed as \(\frac{1}{k_{i}(k_{i}-1)}\sum _{kj}{(T^{\prime }_{ik}T^{\prime }_{ij}T^{\prime }_{kj})^{ \frac{1}{3}}}\) [43], with \(k_{i}\) representing the number of neighbors of region i and \(T^{\prime }_{ij}\) representing the normalized flow intensity of \(\frac{T_{ij}}{\text{max}(T_{ij})}\).
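A minimal sketch of the two network metrics follows, assuming the flows are stored as a symmetric square matrix (list of lists) with a zero diagonal; for directed flows the row sum gives the outgoing strength only. The function names and the example matrix are hypothetical, and nodes with fewer than two neighbors are assigned a coefficient of 0 by convention.

```python
def weighted_degree(T):
    """Weighted degree centrality: sum of flow intensities in each row."""
    return [sum(row) for row in T]

def weighted_clustering(T):
    """Weighted clustering coefficient based on the geometric mean of
    normalized triangle weights, summed over ordered neighbor pairs."""
    n = len(T)
    w_max = max(max(row) for row in T)
    W = [[v / w_max for v in row] for row in T]  # normalize by the largest flow
    coeffs = []
    for i in range(n):
        nbrs = [j for j in range(n) if j != i and W[i][j] > 0]
        k_i = len(nbrs)
        if k_i < 2:
            coeffs.append(0.0)  # no triangle possible (a convention, assumed)
            continue
        total = sum((W[i][j] * W[i][k] * W[k][j]) ** (1 / 3)
                    for j in nbrs for k in nbrs if j != k)
        coeffs.append(total / (k_i * (k_i - 1)))
    return coeffs

# Made-up triangle: two strong links (8) and one weak link (1).
T = [[0.0, 8.0, 8.0],
     [8.0, 0.0, 1.0],
     [8.0, 1.0, 0.0]]
degrees = weighted_degree(T)         # [16.0, 9.0, 9.0]
clustering = weighted_clustering(T)  # each value is close to 0.5
```

In this example every node lies on the same single triangle, so all three coefficients equal the cube root of the product of the normalized weights, (1 × 1 × 0.125)^(1/3) = 0.5.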

2.2.2 Baselines

Areal-weighted flow interpolation:

The areal-weighted method is a traditional approach for flow interpolation that uses the area of regions as weights to distribute flows [17]. When employed for flow downscaling, the method consists of two steps. First, the weight of a fine-grained region is computed as the ratio of its area to the area of the coarse-grained region that contains it. Then, the flow between two fine-grained regions is obtained by distributing the corresponding flow between their parent coarse-grained regions according to the weights of the origin and destination regions. To be exact, the flow intensity from \(A_{1}\) to \(B_{1}\) in Fig. 1 is \(T_{AB} \times \frac{S_{A_{1}}}{S_{A}} \frac{S_{B_{1}}}{S_{B}}\), where \(S_{A}\) and \(S_{B}\) denote the administrative areas of A and B, and \(S_{A_{1}}\) and \(S_{B_{1}}\) those of \(A_{1}\) and \(B_{1}\).
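The areal-weighted rule is simple enough to check by hand. The sketch below uses a hypothetical function name and made-up areas and flow values; it also shows that, when the sub-regions tile their parents, the distributed flows sum back to the coarse-grained flow.

```python
def areal_weighted_flow(t_ab, area_a1, area_a, area_b1, area_b):
    """Flow from sub-region A1 to sub-region B1, given the parent flow T_AB."""
    return t_ab * (area_a1 / area_a) * (area_b1 / area_b)

# Made-up example: parent flow of 1000 between A (area 200) and B (area 400).
t_ab = 1000.0
sub_a = [50.0, 150.0]    # areas of the sub-regions of A
sub_b = [100.0, 300.0]   # areas of the sub-regions of B

flow_a1_b1 = areal_weighted_flow(t_ab, sub_a[0], sum(sub_a), sub_b[0], sum(sub_b))
total = sum(areal_weighted_flow(t_ab, a, sum(sub_a), b, sum(sub_b))
            for a in sub_a for b in sub_b)
print(flow_a1_b1)  # 62.5
print(total)       # 1000.0
```

Note that neither distance nor population enters the weights, which is the limitation the gravity-based approach is meant to address.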


XGBoost:

XGBoost is widely used in machine learning tasks. In this study, we directly concatenate the features of the origin area, the features of the destination area, and the distance between them as the inputs to downscale the flow. The training set consists of all coarse-grained flows, while the test set consists of all fine-grained flows. As machine learning requires sufficient input features to ensure the accuracy and generalization of the model [44, 45], we use multiple variables including population, GDP, GDP in the primary, secondary, and tertiary industries, GDP index, administrative area, and built-up area as the inputs.
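The input construction described above can be sketched as follows. Only the feature concatenation is shown, not the model training; the function name, the dictionary keys, and the numbers are hypothetical, and only three of the listed attributes are used to keep the example short.

```python
def pair_features(origin, dest, distance_km, keys):
    """Concatenate origin attributes, destination attributes, and distance."""
    return [origin[k] for k in keys] + [dest[k] for k in keys] + [distance_km]

# Made-up attributes for two regions.
keys = ["population", "gdp", "area"]
region_a = {"population": 1.2e6, "gdp": 3.4e10, "area": 850.0}
region_b = {"population": 4.0e5, "gdp": 9.1e9, "area": 1200.0}

row = pair_features(region_a, region_b, 42.0, keys)
# row has 2 * len(keys) + 1 entries: origin block, destination block, distance
```

One such row is built per origin-destination pair; rows from coarse-grained pairs form the training set and rows from fine-grained pairs form the test set.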


DNN:

The DNN used here is similar to Deep Gravity [20]. It has 15 hidden layers, the first 6 with 256 units and the last 9 with 128 units, and uses the LeakyReLU activation function (with a parameter of 0.7). The model takes the same inputs as XGBoost and, after training on coarse-grained interactions for 3000 epochs, is used to estimate flows between fine-grained regions.

2.3 Datasets

Cellphone data of Guangdong

The population flow data are extracted from cellphone location data of 5 million individuals in Guangdong Province, China. The dataset spans from November 1, 2020 to November 15, 2020. A threshold of 30 minutes is set to identify the stay points for each anonymous user, with trips being defined as movement between consecutive stay points. Subsequently, we construct an origin-destination (OD) flow using these trips on a \(500\text{ m}\times 500\text{ m}\) grid. Finally, we aggregate daily flow at three levels: city level (prefecture-level city), county level, and sub-district level. There are a total of 21 cities, 124 counties, and 2,100 sub-districts in the study area. The socioeconomic data (e.g., population, GDP) of these areas are obtained from various statistical yearbooks in 2021.
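The step from stay points to a gridded OD flow can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: stay points are assumed to be already detected and given in projected coordinates (metres), moves that start and end in the same cell are dropped, and the function names and sample coordinates are hypothetical.

```python
from collections import Counter

CELL = 500.0  # grid size in metres, as stated in the text

def to_cell(x, y, cell=CELL):
    """Map projected coordinates (metres) to integer grid indices."""
    return (int(x // cell), int(y // cell))

def od_counts(stay_points_by_user):
    """Count trips between consecutive stay points of each user."""
    counts = Counter()
    for points in stay_points_by_user.values():
        cells = [to_cell(x, y) for x, y in points]
        for origin, dest in zip(cells, cells[1:]):
            if origin != dest:  # skip moves within one cell (an assumption)
                counts[(origin, dest)] += 1
    return counts

# Made-up stay points for two anonymous users.
stays = {
    "u1": [(120.0, 80.0), (1300.0, 90.0), (140.0, 60.0)],
    "u2": [(200.0, 400.0), (1100.0, 450.0)],
}
flows = od_counts(stays)
```

Aggregating to the city, county, or sub-district level then amounts to mapping each grid cell to its administrative unit and summing the counts.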

Cellphone data of Beijing

Similar to the Guangdong cellphone data, we obtain Beijing flow data at the TAZ level. The dataset spans from May 1, 2019, to May 30, 2019. After selecting 21 weekdays, the commuting data were derived by identifying the locations where each user spent most of their work and rest time, and labeling these locations as home and workplace. There are a total of 331 sub-districts and 1,911 TAZs in Beijing. The sub-district flow is aggregated from TAZ-level data. The population data used to downscale flows are obtained from Worldpop [46].

Taxi trajectory data of Beijing

The taxi flow data are extracted from the taxi trajectory data of Beijing from March 1, 2015, to March 7, 2015. The study area consists of sub-districts located within the Fifth Ring Road of Beijing (5RBJ), as it is the most active region for taxis. The pick-up and drop-off points of each taxi trajectory are selected as the origin and destination. Then, the taxi flow is aggregated at the county, sub-district, and TAZ levels. As with the Beijing cellphone data, the population data used for flow downscaling are obtained from Worldpop [46].

Additional file 1 Fig. S1 visualizes the datasets used in experiments.

3 Results

3.1 Model performance

We first compare the results of different methods on the Guangdong cellphone data and then use the cellphone and taxi data of Beijing to verify the generalization of our method. Here, the baseline models are fed with the same data as GD. As shown in Fig. 2, our model performs far better than the areal-weighted method and also outperforms the machine learning methods XGBoost and DNN. Specifically, from the city level to the county level, our gravity downscaling method with multiple variables achieves RMSE improvements of 29.2% and 45.9% compared with the areal-weighted flow interpolation method and DNN, respectively. To investigate the reasons for this outcome, we present the results of models fitted at the city level (refer to Table S1). Although both the DNN and XGBoost models fit the city-level data well, their performance at the county level is inferior to that of GD, which probably indicates overfitting. Our approach outperforms the baselines when using multiple socioeconomic variables, but in many cases the availability of socioeconomic data is limited. Therefore, we further investigate the effectiveness of our method by reducing the number of socioeconomic variables. Figure 2 shows that using only population or GDP can still yield satisfactory results, outperforming the baseline models by an average decrease of 17.5% in RMSE and an average improvement of 23.6% in CPC, which indicates that our method has good practical value.

Figure 2

Downscaling results from city level to county level of Guangdong cellphone data. Here, pop is short for population and area refers to the administrative area. MV means multiple variables, including population, GDP, GDP of primary, secondary and tertiary industries, GDP index (percentage in GDP compared to the previous year), administrative area and built-up area. We evaluated the accuracy of results for each method using RMSE (a), MAPE (b), and CPC (c)

To gain a deeper understanding of different models, we plot the prediction results and highlight the long-range interactions (results above 100 km with colored dots) in Fig. 3. It can be observed that our method not only achieves the most accurate overall estimation results (gray dots), but also performs better in estimating long-range fluxes (colored dots). In contrast, the areal-weighted method and DNN overestimate long-range flows. The histograms on the right side of each scatter plot investigate the distribution of flows within the predicted results (colored) and ground truth (gray).

Figure 3

Ground truth and predicted result from city level to county level in Guangdong. The x- and y-coordinates represent the natural logarithm of the ground truth and the predicted value. The gray dots and the gray dashed line represent the results of all flows and the trend lines of all the results, respectively. The colored regions represent the 95% confidence intervals of the predicted values of flows longer than 100 km. The R-squared value for all the flows is annotated by \(R^{2}\) and the R-squared value for the flows longer than 100 km is annotated by \(R_{l}^{2}\) in the bottom right corner of each figure. The colored bar charts on the right are the histograms of the predicted results, while the gray bars are the histograms of the ground truth. (a) Downscaled by GD with population as variable. (b) By GD with GDP as variable. (c) By GD with multiple variables. (d) By areal-weighted flow interpolation method. (e) By XGBoost with multiple variables. (f) By DNN with multiple variables

For further comparison between the network structure of actual data and the downscaled results, we calculated the weighted degree centrality (Fig. 4 (a)) and weighted clustering coefficient (Fig. 4 (b)) of each region. In terms of \(R^{2}\), GD can better approximate the distribution of the ground truth. Its performance may be attributed to the model’s assumption of self-similarity, enabling the preservation of certain properties of the graph across different scales. Areal-weighted estimation overestimates the weighted clustering coefficients of various regions, possibly due to the neglect of long-range inhibitory effects on human mobility.

Figure 4

Regions’ evaluations from city level to county level in Guangdong across different methods. (a) Weighted degree centrality of results from different methods. (b) Weighted clustering coefficient of results from different methods. In (a) and (b), the histograms above show the distribution of these two metrics. The detailed results are classified into five categories corresponding to the magnitude range of the ground truth: 0%-20%, 20%-40%, 40%-60%, 60%-80%, and 80%-100%, and visually represented using a color gradient from green to red in the maps. In the top left corner of the maps, the fitted R-squared values for predicted values and ground truth are provided. Basemap: ©OpenStreetMap contributors, ©rastertiles/voyager

Figure 5 shows the actual data, the downscaled outcomes, and the absolute percentage error (\(\frac{|T_{ij}^{pred}- T_{ij}|}{T_{ij}}\)). The results of our method are visually closer to the true distribution than those of the areal-weighted method, XGBoost, and DNN. In terms of overall absolute percentage error, our method also has smaller errors than the other methods. Additionally, the areal-weighted interpolation method overestimates the flows in the periphery of Guangdong (Fig. 5(c)), where counties are geographically distant from the main cities (i.e., Shenzhen and Guangzhou). This overestimation may be largely because the areal-weighted interpolation method only considers the area proportion and ignores the distance between regions and their population, the two most important factors affecting spatial interaction [47].

Figure 5

Results from city level to county level in Guangdong. (a) Ground truth of county level. (b) Flow downscaled by GD with multiple variables. (c) By areal-weighted flow interpolation method. (d) By XGBoost with multiple variables. (e) By DNN with multiple variables. (f) Absolute error percentage of flow downscaled by GD with multiple variables. (g) By areal-weighted flow interpolation method. (h) By XGBoost with multiple variables. (i) By DNN with multiple variables. Basemap: ©OpenStreetMap contributors, ©rastertiles/voyager. The flow intensities are sorted in ascending order of the ground truth values, where the first 50% are classified as level 1, 50%-80% as level 2, 80%-90% as level 3, 90%-95% as level 4, and the top 5% as level 5. Note that in any comparison between the ground truth and predicted values, the classification criteria for each level are consistent

3.2 Model generalization

To demonstrate the generalization of the gravity downscaling method, we further conducted tasks using cellphone data and taxi trajectory data of Beijing. Table 1 shows that our method outperforms the areal-weighted flow interpolation method. In the Beijing cellphone dataset, the CPC increased by 73.9%, 10.1%, and 60.2% for the three scales, respectively; in the Beijing taxi trajectory dataset, the CPC for the three scales increased by 22.6%, 13.0%, and 67.3%, respectively.

Table 1 Spatial downscaling result’s evaluation at different scales

Since administrative boundaries are hierarchically organized, we test the downscaling performance between different administrative levels and find that accuracy is relatively high when the source and target scales are adjacent administrative levels. For example, downscaling from the county level to the sub-district level performs better than downscaling from the city level to the sub-district level. This may be attributed to the availability of county-level flow data, which offers more detailed flow volumes for the calibration in Eq. (3). To explore the specific reasons, we analyze the results of GD without calibration, which are reported in Table 2. Without calibration, the CPC of downscaling from the county level to the sub-district level exceeds that of downscaling from the city level to the sub-district level by only 2.5%, compared with 5.6% after calibration. This indicates that having finer-grained data for calibration can improve accuracy.

Table 2 GD without calibration result’s evaluation at different scales

To provide a rough estimate of the applicability of GD, we consider the number of subregions within the parent region (reported in Table 1), which can represent the downscaling factor. Based on our results, we recommend using GD for tasks with a downscaling factor of up to several tens. An excessively large downscaling factor (>100) tends to yield suboptimal results.
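For a rough sense of scale, an average downscaling factor can be computed from the Guangdong unit counts given in Sect. 2.3 (21 cities, 124 counties, 2,100 sub-districts). This is an illustrative calculation only: it assumes the factor is the mean number of sub-units per parent unit, and the values reported in Table 1 may be defined or rounded differently. The function name is hypothetical.

```python
# Unit counts for Guangdong as stated in Sect. 2.3.
units = {"city": 21, "county": 124, "sub-district": 2100}

def mean_downscaling_factor(coarse, fine, counts=units):
    """Average number of fine-grained units per coarse-grained unit."""
    return counts[fine] / counts[coarse]

city_to_county = mean_downscaling_factor("city", "county")                # about 5.9
county_to_subdistrict = mean_downscaling_factor("county", "sub-district")  # about 16.9
city_to_subdistrict = mean_downscaling_factor("city", "sub-district")      # 100.0
```

Under this reading, the two adjacent-level tasks fall within the recommended range of up to several tens, while city to sub-district sits at the upper limit.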

Figure 6 shows the ground truth of spatial interactions and the downscaling results of Beijing cellphone data and taxi data. The process of downscaling from the sub-district level to the TAZ level has yielded better results on both datasets than that from the county level to other levels. The low accuracy of downscaling from the county level may result from the inadequacy of information obtained at this level for predicting local hotspots characterized by low population density but high total flows, such as transportation hubs and tourist attractions.

Figure 6

Results of Beijing cellphone data and taxi data. The ground truth (GT) and predicted flow of Beijing cellphone dataset (a) and taxi dataset (b). Basemap: ©OpenStreetMap contributors, ©rastertiles/voyager. The classification criteria for flow are the same as Fig. 5

4 Conclusion and discussion

To overcome the challenge of obtaining fine-grained spatial interaction data, we propose a method to downscale coarse-grained spatial interaction with accessible socioeconomic data. Based on the assumption that interactions are governed by a similar gravity law at different spatial scales, we achieve spatial interaction downscaling by transferring parameters estimated from coarse-grained regions to fine-grained regions. Our method is shown to be effective on several empirical datasets and in simulation experiments, with an average improvement of 24.6% in RMSE over benchmarks across all datasets and scales in the experiments.

The spatial configurations (i.e., population and flow distributions) in the empirical datasets remain relatively stable. To evaluate the impact of different spatial configurations and model parameters on GD, we conduct experiments using simulated data and present the outcomes in Additional file 1 (Simulated experiments). Results from the simulation experiments indicate that although the distance decay effect and spatial autocorrelation could potentially impact GD’s performance, in the vast majority of practical scenarios, GD maintains a relatively high level of accuracy.

In addition to the gravity model, we have examined the generalized radiation model and the universal opportunity model to perform downscaling tasks from the city level to the county level using cellphone data from Guangdong. From the results, we found that the gravity model exhibited superior performance (Table 3).

Table 3 Downscaling results from city to county level in Guangdong based on different models

There are still some limitations in this study. First, we only present population flow results in the main text. It is worth noting that there are various forms of spatial interactions, including cargo, telecommunication, financial networks, and so on. The gravity model could be extended to different types of flows. In Additional file 1 (Supplementary datasets and results), we validate our method by using the Baidu search index as a representation of information flow. Exploring the gravity model’s potential for downscaling other forms of spatial interactions remains a promising direction. Second, our model achieves relatively poor accuracy when the disparity between the source and target scales is substantial, which may be because the distance decay parameters or functions of spatial interactions vary across different scale regimes [35, 36]. One possible approach to address this issue is to calibrate the distance decay parameter γ based on empirical observations. We explored this parameter calibration method and found that it may not effectively improve accuracy (Additional file 1, Parameters calibration).

In summary, our method can potentially overcome the limitations of accessing fine-grained spatial interaction data, thereby holding substantial value and applicability. Furthermore, the superior performance of our methods, which are primarily based on the classic gravity model rather than machine learning methods, also sheds light on model development in the era of geospatial artificial intelligence (GeoAI) [48].

Data availability

Cellphone data and taxi data are not publicly available to preserve privacy. Aggregated flow data (TAZs, subdistricts, counties, and cities) can be requested from the corresponding author to reproduce the results of this study. The population data used in this study are from the Worldpop [46], which is publicly available at The Baidu Search Index data are available at The code for GD is available at


  1. Hayes MC, Wilson AG (1971) Spatial interaction. Socio-Econ Plan Sci 5(1):73–95.

    Article  Google Scholar 

  2. Tobler W (1975) Spatial interaction patterns. J Environ Syst 6(4):271–301

    Article  Google Scholar 

  3. Ullman EL, Boyce RR, Harris CD (1980) Geography as spatial interaction. University of Washington Press, Seattle

    Google Scholar 

  4. Yan X, Wang W, Gao Z, Lai Y (2017) Universal model of individual and population mobility on diverse spatial scales. Nat Commun 8(1):1639.

    Article  Google Scholar 

  5. Huang J, Levinson D, Wang J, Jin H (2019) Job-worker spatial dynamics in Beijing: insights from smart card data. Cities 86:83–93.

    Article  Google Scholar 

  6. Yuan H-Y, Hossain MP, Tsegaye M, Zhu X, Jia P, Junus A, Wen T-H, Pfeiffer D (2020) Estimating the risk on outbreak spreading of 2019-nCoV in China using transportation data, 2020–02. medRxiv

  7. Pollmann TR, Schönert S, Müller J, Pollmann J, Resconi E, Wiesinger C, Haack C, Shtembari L, Turcati A, Neumair B et al. (2021) The impact of digital contact tracing on the SARS-CoV-2 pandemic—a comprehensive modelling study. EPJ Data Sci 10(1):37.

    Article  Google Scholar 

  8. Tao H, Wang K, Zhuo L, Li X (2019) Re-examining urban region and inferring regional function based on spatial-temporal interaction. Int J Digit Earth 12(3):293–310.

    Article  Google Scholar 

  9. Zhu D, Zhang F, Wang S, Wang Y, Cheng X, Huang Z, Liu Y (2020) Understanding place characteristics in geographic contexts through graph convolutional neural networks. Ann Assoc Am Geogr 110(2):408–420.

    Article  Google Scholar 

  10. Guo H, Zhang W, Du H, Kang C, Liu Y (2022) Understanding China’s urban system evolution from web search index data. EPJ Data Sci 11(1):20.

    Article  Google Scholar 

  11. Pedrycz W, Chen S (2014) Information granularity, big data and computational intelligence, vol 8. Springer, Cham.

    Book  Google Scholar 

  12. Voigt P, Von Dem Bussche A (2017) The EU general data protection regulation (GDPR). Springer, Cham.

    Book  Google Scholar 

  13. Liu Y, Gao S, Yuan Y, Zhang F, Kang C, Kang Y, Wang K (2021) Methods of social sensing for urban studies. In: Urban remote sensing: monitoring, synthesis, and modeling in the urban environment, pp 71–89.

    Chapter  Google Scholar 

  14. Mizzi C, Fabbri A, Rambaldi S, Bertini F, Curti N, Sinigardi S, Luzi R, Venturi G, Davide M, Muratore G et al. (2018) Unraveling pedestrian mobility on a road network using ICTs data during great tourist events. EPJ Data Sci 7(1):44.

    Article  Google Scholar 

  15. Ouyang K, Liang Y, Liu Y, Tong Z, Ruan S, Zheng Y, Rosenblum DS (2022) Fine-grained urban flow inference. IEEE Trans Knowl Data Eng 34(6):2755–2770.

    Article  Google Scholar 

  16. Cardia M, Luca M, Pappalardo L (2022) Enhancing crowd flow prediction in various spatial and temporal granularities. In: Companion proceedings of the web conference 2022, pp 1251–1259.

    Chapter  Google Scholar 

  17. Jang W, Yao X (2011) Interpolating spatial interaction data. Trans GIS 15(4):541–555.

    Article  Google Scholar 

  18. Šimbera J, Aasa A (2019) Areal interpolation of spatial interaction data. In: LBS 2019; adjunct proceedings of the 15th international conference on location-based services/Gartner, Georg; Huang, Haosheng, Wien

    Google Scholar 

  19. Liu Z, Miranda F, Xiong W, Yang J, Wang Q, Silva C (2020) Learning geo-contextual embeddings for commuting flow prediction. Proc AAAI Conf Artif Intell 34(1):808–816.

    Article  Google Scholar 

  20. Simini F, Barlacchi G, Luca M, Pappalardo L (2021) A deep gravity model for mobility flows generation. Nat Commun 12(1):6576.

    Article  Google Scholar 

  21. Mauro G, Luca M, Longa A, Lepri B, Pappalardo L (2022) Generating mobility networks with generative adversarial networks. EPJ Data Sci 11(1):58.

  22. Luca M, Barlacchi G, Lepri B, Pappalardo L (2021) A survey on deep learning for human mobility. ACM Comput Surv 55(1):7:1–7:44.

  23. Anderson JE (2011) The gravity model. Annu Rev Econ 3(1):133–160.

  24. Song C, Havlin S, Makse HA (2005) Self-similarity of complex networks. Nature 433(7024):392–395.

  25. Alessandretti L, Aslak U, Lehmann S (2020) The scales of human mobility. Nature 587(7834):402–407.

  26. Boguñá M, Bonamassa I, De Domenico M, Havlin S, Krioukov D, Serrano MÁ (2021) Network geometry. Nat Rev Phys 3(2):114–135.

  27. Barbosa H, Barthelemy M, Ghoshal G, James CR, Lenormand M, Louail T, Menezes R, Ramasco JJ, Simini F, Tomasini M (2018) Human mobility: Models and applications. Phys Rep 734:1–74.

  28. Chen T, Guestrin C (2016) XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’16. ACM, New York, pp 785–794.

  29. Ravenstein EG (1889) The laws of migration. J R Stat Soc 52(2):241–305.

  30. Lenormand M, Bassolas A, Ramasco JJ (2016) Systematic comparison of trip distribution laws and models. J Transp Geogr 51:158–169.

  31. Stefanouli M, Polyzos S (2017) Gravity vs radiation model: Two approaches on commuting in Greece. Transp Res Proc 24:65–72.

  32. Gil-Pareja S, Llorca-Vivero R, Martínez-Serrano JA (2007) The impact of embassies and consulates on tourism. Tour Manag 28(2):355–360.

  33. Eryiğit M, Kotil E, Eryiğit R (2010) Factors affecting international tourism flows to Turkey: A gravity model approach. Tour Econ 16(3):585–595.

  34. Shen J (2015) Explaining interregional migration changes in China, 1985–2000, using a decomposition approach. Reg Stud 49(7):1176–1192.

  35. Liu Y, Gong L, Tong Q (2014) Quantifying the distance effect in spatial interactions. Acta Sci Nat Univ Pek 50(3):526–534.

  36. Chen Y (2015) The distance-decay function of geographical gravity model: Power law or exponential law? Chaos Solitons Fractals 77:174–189.

  37. Efroymson MA (1960) Multiple regression analysis. In: Mathematical methods for digital computers, pp 191–203.

  38. Halinski RS, Feldt LS (1970) The selection of variables in multiple regression analysis. J Educ Meas 7(3):151–157.

  39. Pope PT, Webster JT (1972) The use of an F-statistic in stepwise regression procedures. Technometrics 14(2):327–340.

  40. Barthélemy M (2011) Spatial networks. Phys Rep 499(1–3):1–101.

  41. Opsahl T, Agneessens F, Skvoretz J (2010) Node centrality in weighted networks: Generalizing degree and shortest paths. Soc Netw 32(3):245–251.

  42. Saramäki J, Kivelä M, Onnela J-P, Kaski K, Kertesz J (2007) Generalizations of the clustering coefficient to weighted complex networks. Phys Rev E 75(2):027105.

  43. Onnela J-P, Saramäki J, Kertész J, Kaski K (2005) Intensity and coherence of motifs in weighted complex networks. Phys Rev E 71(6):065103.

  44. Hall MA (1999) Correlation-based feature selection for machine learning. Thesis, The University of Waikato.

  45. Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H (2017) Feature selection: A data perspective. ACM Comput Surv 50(6):94:1–94:45.

  46. Stevens FR, Gaughan AE, Linard C, Tatem AJ (2015) Disaggregating census data for population mapping using random forests with remotely-sensed and ancillary data. PLoS ONE 10(2):e0107042.

  47. Roy JR, Thill J-C (2003) Spatial interaction modelling. Pap Reg Sci 83(1):339–361.

  48. Janowicz K, Gao S, McKenzie G, Hu Y, Bhaduri B (2020) GeoAI: spatially explicit artificial intelligence techniques for geographic knowledge discovery and beyond. Int J Geogr Inf Sci 34(4):625–636.


The authors thank the two anonymous reviewers for their valuable suggestions. The authors also thank Yuanqiao Hou and Tianyou Cheng for their assistance in processing the Guangdong cellphone data.


This research was supported by grants from the National Natural Science Foundation of China (41830645) and the Fundamental Research Funds for the Central Universities, Peking University.

Author information

CT: Conceptualization, Methodology, Writing – original draft, review & editing. LD: Supervision, Conceptualization, Methodology, Writing – review & editing. HG: Conceptualization, Methodology. XW: Methodology, Writing – review & editing. XC: Writing – review & editing. QD: Writing – review & editing. YL: Supervision, Conceptualization, Writing – review. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Lei Dong.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Tang, C., Dong, L., Guo, H. et al. Downscaling spatial interaction with socioeconomic attributes. EPJ Data Sci. 13, 46 (2024).
