Human mobility prediction with causal and spatial-constrained multi-task network
EPJ Data Science volume 13, Article number: 22 (2024)
Abstract
Modeling human mobility helps to understand how people access resources and come into physical contact with one another in cities, and thus contributes to applications such as urban planning, epidemic control, and location-based advertisement. Next location prediction is a decisive task in individual human mobility modeling and is usually treated as sequence modeling, solved with Markov or RNN-based methods. However, existing models pay little attention to the logic of individual travel decisions and to the reproducibility of the collective behavior of the population. To this end, we propose a Causal and Spatial-constrained Long and Short-term Learner (CSLSL) for next location prediction. CSLSL utilizes a causal structure based on multi-task learning to explicitly model the “when→what→where”, a.k.a. “time→activity→location”, decision logic. We further propose a spatial-constrained loss function as an auxiliary task to ensure consistency between the predicted and actual spatial distributions of travelers’ destinations. Moreover, CSLSL adopts modules named Long and Short-term Capturer (LSC) to learn the transition regularities across different time spans. Extensive experiments on three real-world datasets show promising performance improvements of CSLSL over baselines and confirm the effectiveness of introducing the causality and consistency constraints. The implementation is available at https://github.com/urbanmobility/CSLSL.
1 Introduction
Human mobility modeling aims to explore the regularities and patterns of human behavior [1, 2] and plays a significant role in numerous applications, such as urban planning [3], travel demand management [4, 5], health risk assessment [6], and epidemic spreading modeling and control [7–9]. In the big data era, access to GPS data, mobile phone records, and location-based social networks (LBSNs) provides an unprecedented opportunity to understand and model human mobility [2, 10].
In the human mobility research community, physicists focus on statistical analysis from a macroscopic perspective and have summarized empirical rules [2]. For example, they found that a truncated power-law distribution fits the displacement distribution well [1], and that, despite significant differences in travel patterns, the mobility behavior of most users is predictable [11]. Computer scientists, on the other hand, prefer to model the transition regularities from location sequences, using Markov models [12], recurrent neural networks (RNNs) [13], etc. In summary, statistical physics studies collective behavior at the population level, while deep learning methods emphasize modeling individual travel trajectories. Thus, we can expect that integrating physical domain knowledge into a deep learning model encourages the model to attend to group behaviors and improves its performance at the population level.
Here we place our emphasis on next location prediction, a vital task in human mobility modeling at the individual level [14]. A body of work leverages machine learning methods to tackle this problem because of the sequential nature of mobility behavior. A common thread of these studies is to efficiently capture behavior patterns from sparse data [10, 15–17]. Traditional methods mainly adopt Markov chains to model transition probability matrices across locations, along with techniques like factorization [12, 18, 19] or metric embedding [20]. In recent years, deep learning methods have gained increasing attention in next location prediction, as the recurrent neural network (RNN) has shown its capability to capture sequential dependency. To model multiscale spatiotemporal periodicity, researchers designed attention or gate mechanisms and introduced time and distance interval information [13, 21–24]. A few studies also incorporate semantic information such as location categories to cope with data sparsity [16, 25, 26]. However, methods that capture dependencies only from location sequences struggle to fully fit complex human travel behaviors, especially with sparse data.
To tackle this challenge, we seek to integrate physical knowledge into deep learning methods to enhance the capability of human mobility prediction. Specifically, we propose two physical constraints. The first is summarized as the “when→what→where” causal relationship. “When”, “what”, and “where” are the three core elements of human travel behavior, and the dependencies between them can explain the motivation for moving between locations. For example, as shown in Fig. 1(a), people have specific demands at different times, causing shifts between locations. Considering causal dependencies enables more comprehensive modeling of human mobility. The second constraint is the macro-statistical characteristics that reflect group behavior. Figure 1(b) illustrates the deviation of the displacement distribution modeled by an LSTM from the true distribution in New York City and Tokyo, suggesting that the LSTM tends to focus on shorter, more frequent trips. Ensuring consistency between the model output and the macro-statistical characteristics is expected to improve the model’s capability to fit travel behavior. We summarize these two constraints as the causality and consistency constraints and incorporate them into deep learning models.
To this end, we propose a Causal and Spatial-constrained Long and Short-term Learner (CSLSL), a model integrating the decision logic and the consistency constraints of human mobility modeling. To model the “when→what→where” decision logic, we introduce a causal structure in CSLSL. Based on multi-task learning, the causal structure utilizes three similar network branches to model the regularities of time, activity, and location, respectively. In line with the “when→what→where” logic, we explicitly build connections between the three branches in the causal structure. As for consistency, we propose, as an exploratory step, a spatial-constrained loss to reduce the distance between the predicted and actual locations and thereby indirectly ensure the consistency of the spatial density distribution. In addition, we adopt a Long and Short-term Capturer (LSC) to learn the transition patterns over different time spans. The LSC contains two units that focus on long-term and short-term regularities, respectively.
The main contributions of this work are summarized as follows:

We propose CSLSL to integrate the travel decision logic and the macro-statistical consistency for human mobility modeling. To the best of our knowledge, CSLSL is the first model to learn the causality and consistency constraints for next location prediction.

We introduce a causal structure that captures not only the separate regularities of time, activity, and location, but also the “when→what→where” causal dependencies. In this way, CSLSL models more essential travel logic in addition to sequence relationships.

To ensure consistency in the spatial distribution, we propose a spatial-constrained loss to reduce the gap between the predicted and actual destinations.

We evaluate CSLSL on three real-world datasets to confirm the performance improvements. We also conduct ablation studies and visualization analyses of results such as the displacement distribution to demonstrate the effectiveness of our design.
2 Related work
2.1 Next location prediction
Here we classify the approaches to the next location prediction problem into two categories: traditional and deep learning methods. Traditional methods mainly apply Markov chains (MC) and focus on constructing a better location transition probability matrix [12, 18, 20, 27]. For instance, the factorized personalized Markov chain (FPMC) combines matrix factorization with Markov chains to learn users’ personalized transition matrices [12]. The limitation of MC-based methods lies in the difficulty of capturing long-term and high-order regularities [16, 17].
Deep learning methods have the advantage of learning dense representations and complex dependencies. Recently, RNN-based methods have shown promising performance in mining sequential information. A popular scheme is to incorporate time and distance intervals to help the model learn the spatiotemporal regularities of human mobility. Specifically, these methods integrate spatiotemporal information into the hidden state transition [28], gate mechanisms [21, 22, 29], or self-attention mechanisms [15, 17], and thus exploit spatiotemporal contexts in an implicit manner. To leverage the spatiotemporal contexts explicitly, researchers used spatial and temporal factors as attention weights to select the historical hidden states [24, 30]. Another scheme emphasizes the long-term patterns of human behavior, as in DeepMove [13] and LSTPM [23], which introduce two different components to model long-term and short-term preferences, respectively. A further scheme utilizes semantic information such as location categories to improve the performance of location prediction [16, 26, 31]. However, methods that focus on modeling location transfer patterns in sequences cannot effectively capture complex human decision logic. In our work, we propose a causal structure to explicitly capture the “when→what→where” decision logic.
2.2 Time or activity-jointed location prediction
Methods that jointly predict time or activity learn knowledge from related tasks to improve location prediction. RMTPP [32] combines an RNN and a temporal point process (TPP) to jointly model time and location information. He et al. [25] proposed a twofold approach that predicts the category with the Bayesian Personalized Ranking (BPR) technique and then predicts the category-based location. Krishna et al. [33] utilized two distinct LSTM networks to predict activities and durations. DeepJMT [34] fuses spatiotemporal information and social context to predict time and location with a hierarchical RNN and the TPP technique. Sun et al. [35] proposed a hybrid LSTM and a sequential LSTM with a self-attention mechanism to jointly model location and travel time. The limitation of these approaches is that they attempt to learn the correlation between time, category, and location information implicitly and passively, although this relationship is explicit and can be directly exploited. In contrast, CSLSL explicitly models the causal dependencies between time, category, and location information through two structural designs.
2.3 Statistical physics-informed human mobility modeling
Explicitly integrating knowledge of statistical physics helps to guide model optimization and improve the performance of machine learning methods. For the task of trajectory generation, researchers introduced knowledge of statistical physics to constrain the macroscopic performance of their models, such as the individual trajectory generation models TimeGeo [36] and DITRAS [37], and the flow generation model DeepGravity [38]. Unlike trajectory generation, only a limited amount of work on individual mobility prediction incorporates knowledge of statistical physics. Zhao et al. [30] integrated domain knowledge, specifically a power-law decay for distances and an exponential decay for time intervals, into an attention mechanism to adjust the impact of historical information on the current prediction. However, deep learning-based methods focus more on fitting individual behavior while neglecting group behavior constraints described by macro-statistical characteristics, for example, the regional attractiveness of city blocks. The spatial distribution of model predictions should be consistent with the actual statistical distribution, and predictions closer to the actual location are preferable. To this end, we propose a spatial-constrained loss function to narrow the distance between the predicted and actual locations, thereby encouraging consistency of the spatial distribution.
3 Problem formulation
A person’s travel behavior can be represented as a sequence of locations, associated with the timestamps and her user ID. In LBSN datasets, each location is also associated with its functional category to support the analysis of the user’s activity. Let \(\mathcal{U} = \{u_{1}, \dots, u_{|\mathcal{U}|}\}\), \(\mathcal{L} = \{l_{1}, \dots, l_{|\mathcal{L}|}\}\) and \(\mathcal{C} = \{c_{1}, \dots, c_{|\mathcal{C}|}\}\) denote the sets of users, locations and functional categories, respectively. Each location \(l_{i}\) is associated with its category and geographical coordinates \((c_{i}, \mathrm{lat}_{i}, \mathrm{lon}_{i})\).
Definition 1
(Record)
Record \(r\) is a 3-tuple \((u_{i}, l_{j}, t_{k})\), representing that user \(u_{i}\) visited location \(l_{j}\) at time \(t_{k}\), where \(u_{i}\in \mathcal{U}, l_{j}\in \mathcal{L}\).
Definition 2
(Individual Trajectory)
A person’s trajectory is defined as a record sequence \(\mathcal{R}=\{r_{1}, r_{2}, \dots, r_{|\mathcal{R}|}\}\), which consists of all of the person’s records arranged in chronological order. Note that the time interval between two consecutive records is heterogeneous because travel behavior is irregular.
Definition 3
(Session)
Session \(S\) is a subsequence of records in a time slot. One user’s trajectory \(\mathcal{R}\) can be split into a series of sessions with various strategies. For example, DeepMove adopts a specific time interval between two consecutive records to split the trajectory [13]. Other strategies segment users’ trajectories using a fixed number of records [17, 24] or a meaningful time window such as days, weeks, etc. [23]. We define the session in which the prediction target is located as the short-term session \(S_{p}\) and the previous historical sessions as long-term sessions \(\{S_{q}\},q\in \{1,\dots,p-1\}\).
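As an illustration of the time-window strategy, the following minimal sketch groups one user's records into weekly sessions and separates the long-term sessions from the short-term one. The function name, the tuple layout `(user_id, location_id, datetime)` and the use of ISO calendar weeks are assumptions made for this example, not the authors' implementation.

```python
from collections import defaultdict
from datetime import datetime


def split_into_weekly_sessions(records, min_len=1):
    """Group records (user_id, location_id, datetime) into ISO-week sessions.

    Returns sessions in chronological order; each session is a list of records.
    """
    buckets = defaultdict(list)
    for rec in sorted(records, key=lambda r: r[2]):
        iso = rec[2].isocalendar()
        buckets[(iso[0], iso[1])].append(rec)  # key: (ISO year, ISO week)
    sessions = [buckets[key] for key in sorted(buckets)]
    return [s for s in sessions if len(s) >= min_len]


# Toy trajectory for a hypothetical user "u1".
trajectory = [
    ("u1", "l3", datetime(2012, 4, 3, 8, 30)),
    ("u1", "l7", datetime(2012, 4, 3, 12, 5)),
    ("u1", "l3", datetime(2012, 4, 10, 8, 40)),
    ("u1", "l9", datetime(2012, 4, 12, 19, 0)),
]
sessions = split_into_weekly_sessions(trajectory)
long_term, short_term = sessions[:-1], sessions[-1]  # {S_q} and S_p
```

Here the last session plays the role of the short-term session and all earlier ones form the long-term history.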
The location prediction problem is formulated as follows: given a record sequence of a user \(\mathcal{R}_{t-1}=\{r_{1}, \dots, r_{t-1}\}\), the goal is to predict where the user \(u\) is most likely to go in her next trip. We use \(\hat{l}_{t}\) to denote the predicted next location. Note that the timestamp \(t\) of the next trip is also unknown.
4 Methodology
In this section, we first analyze the causality and consistency constraints in human mobility modeling, and then introduce in detail the design of the proposed model, the Causal and Spatial-constrained Long and Short-term Learner (CSLSL). The architecture of CSLSL is illustrated in Fig. 2. It mainly consists of two parts: an embedding part that learns the representations of arrival time, category, and location from users’ recent and historical records, and a second part that learns the regularities of mobility behavior in a multi-task-learning-based causal module and makes predictions.
4.1 Causality and consistency constraints
A common practice for next location prediction is to discover similar subsequences or location transition relationships from historical records. This is accomplished by integrating context information such as distance or time intervals into an attention or RNN-centered framework [17, 22–24]. We can formulate this mainstream scheme as \(P(\hat{l}_{t}\mid\mathcal{R}_{t-1})\), where \(\mathcal{R}_{t-1}\) is the historical record sequence. Another scheme adopts multi-task learning techniques to jointly predict the next location with time or activity [34, 35], formulated as \(P(\hat{l}_{t}, \hat{c}_{t}, \hat{t}\mid\mathcal{R}_{t-1})=P(\hat{l}_{t}\mid \mathcal{R}_{t-1})P(\hat{c}_{t}\mid\mathcal{R}_{t-1})P(\hat{t}\mid \mathcal{R}_{t-1})\), where we assume that the location category can approximate the type of activity. Although these two schemes combine contextual information to capture hidden regularities of location transition, they ignore the causal dependencies in the context information.
As mentioned above, we regard “when”, “what” and “where” as three crucial elements for describing human mobility [39, 40]. “When” refers to the time the trip takes place, e.g. “midday”. “What” describes the activity people participate in and also gives the reason for the trip, such as “having lunch”. “Where” is the destination of the trip, like “steakhouse”. Periodic activities exist in human mobility and occur at specific times and places, such as going to work in the morning and going to a restaurant for lunch, which reveals the correlation between the three elements. At any given timestamp, various activities are possible, but people are accustomed to doing certain activities at certain times, such as going to the gym in the evening. Similarly, one activity (category) corresponds to multiple locations (POIs), while one location ID corresponds to only one activity, which is also reflected in the dataset. Moreover, our target is location prediction, so location should be the final subtask in order to leverage the predicted time and activity information. Therefore, we summarize a “when→what→where”, a.k.a. “time→activity→location”, causal relationship, which is in line with the coarse-to-fine logic of human decisions. The proposed scheme can be formulated as:
\[
P(\hat{l}_{t}, \hat{c}_{t}, \hat{t}\mid\mathcal{R}_{t-1}) = P(\hat{l}_{t}\mid\hat{c}_{t}, \hat{t}, \mathcal{R}_{t-1})\,P(\hat{c}_{t}\mid\hat{t}, \mathcal{R}_{t-1})\,P(\hat{t}\mid\mathcal{R}_{t-1}) \tag{1}
\]
The scheme explicitly models the dependencies between time, activity, and location, and alleviates the difficulty of location prediction. For example, people are accustomed to going to restaurants rather than bars at midday, that is, \(P(\hat{c}_{t}=\mathrm{restaurant}\mid\hat{t}=\mathrm{midday},\mathcal{R}_{t-1})>P(\hat{c}_{t}=\mathrm{bar}\mid \hat{t}=\mathrm{midday},\mathcal{R}_{t-1})\). Each individual has her personalized \(P(\hat{l}_{t}\mid\hat{c}_{t}=\mathrm{restaurant}, \hat{t}=\mathrm{midday}, \mathcal{R}_{t-1})\), and the causally constrained location distribution is easier to learn than \(P(\hat{l}_{t}\mid\mathcal{R}_{t-1})\). In CSLSL, we introduce a causal structure to implement the scheme. In experiments, we also demonstrate that “time→activity→location” outperforms both “activity→time→location” and “time, activity, location”, the latter of which does not involve logical connections.
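To make the time→activity→location factorization concrete, the toy sketch below chains three conditional tables for a single hypothetical user and marginalizes over time and activity to obtain a location distribution. All probabilities and names are invented for illustration and have no connection to the datasets.

```python
# Hypothetical conditional tables for one user; every number is made up.
p_time = {"midday": 0.6, "evening": 0.4}
p_act_given_time = {
    "midday": {"restaurant": 0.8, "bar": 0.2},
    "evening": {"restaurant": 0.3, "bar": 0.7},
}
p_loc_given_act_time = {
    ("restaurant", "midday"): {"steakhouse": 0.7, "noodle_shop": 0.3},
    ("restaurant", "evening"): {"steakhouse": 0.4, "noodle_shop": 0.6},
    ("bar", "midday"): {"pub": 1.0},
    ("bar", "evening"): {"pub": 1.0},
}


def joint(t, c, l):
    """P(l, c, t) = P(t) * P(c | t) * P(l | c, t)."""
    return p_time[t] * p_act_given_time[t][c] * p_loc_given_act_time[(c, t)].get(l, 0.0)


def location_marginal():
    """Sum the joint over time and activity to get P(l)."""
    out = {}
    for t in p_time:
        for c in p_act_given_time[t]:
            for l in p_loc_given_act_time[(c, t)]:
                out[l] = out.get(l, 0.0) + joint(t, c, l)
    return out


marginal = location_marginal()
```

Conditioning on a predicted time and activity narrows the candidate locations, which is the intuition behind placing location last in the chain.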
On the other hand, integrating physical knowledge provides more information and prior constraints to guide the optimization of deep learning models [41, 42]. In human mobility modeling, one can expect that properly introducing physical laws and domain knowledge would narrow the gap between the output of deep learning-based approaches and the observed macro-statistical characteristics of human behavior. Because it is difficult to apply statistical constraints directly when training deep learning models, here we consider geographic spatial consistency in an indirect way. Specifically, we devise a loss function to constrain the distance between the predicted and actual locations: the closer the predicted location is to the ground truth, the smaller the loss. In this way, we indirectly encourage consistency of the displacement distribution and of the spatial distribution of travelers’ destinations.
4.2 Long and short-term capturer
Human travel behavior has long- and short-cycle repetitive patterns, such as going to work every day and going to the supermarket once a week. Inspired by DeepMove [13] and LSTPM [23], we devise a Long and Short-term Capturer (LSC) to learn the behavioral patterns in different observation cycles. In the whole framework shown in Fig. 2, we apply three LSCs to model the time, activity, and location sequences, respectively.
Let \(\boldsymbol{e}^{l}\in \mathbb{R}^{d^{l}},\boldsymbol{e}^{c}\in \mathbb{R}^{d^{c}},\boldsymbol{e}^{t}\in \mathbb{R}^{d^{t}}\) and \(\boldsymbol{e}^{u}\in \mathbb{R}^{d^{u}}\) denote the embedded representations of location, category, time and user, respectively. Given a historical record sequence \(\mathcal{R}\), CSLSL embeds each record as \((\boldsymbol{e}^{l}, \boldsymbol{e}^{c}, \boldsymbol{e}^{t}, \boldsymbol{e}^{u})\) in hidden spaces. Note that we first convert the continuous timestamp into the hour of the day \(t^{h}\) and the day of the week \(t^{d}\) to represent the daily and weekly periodicity, so that \(\boldsymbol{e}^{t}=\boldsymbol{e}^{t^{h}}\oplus \boldsymbol{e}^{t^{d}}\). The representations in a record are then concatenated, \(\boldsymbol{e}^{r}=\boldsymbol{e}^{l}\oplus \boldsymbol{e}^{c} \oplus \boldsymbol{e}^{t}\oplus \boldsymbol{e}^{u}\). We next split each user’s record sequence into multiple sessions with a certain time window, such as days or weeks. The records in the short-term session and the long-term sessions are represented as \(\widetilde{\boldsymbol{e}}^{r}_{p}=\{\boldsymbol{e}^{r}_{1}, \dots, \boldsymbol{e}^{r}_{t-1}\}\) and \(\{\widetilde{\boldsymbol{e}}^{r}_{q}\}=\{\widetilde{\boldsymbol{e}}^{r}_{1}, \dots,\widetilde{\boldsymbol{e}}^{r}_{p-1}\}\), respectively.
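The time encoding described above can be sketched as follows. The lookup tables stand in for learned embedding matrices and are filled with placeholder values; the helper names and the Monday-based day index are assumptions of this example, and the widths 10 and 20 simply mirror the sizes reported later in the experimental settings.

```python
from datetime import datetime


def time_features(ts):
    """Return (hour_of_day, day_of_week): hour in 0..23, day in 0..6 with Monday = 0."""
    return ts.hour, ts.weekday()


def embed_time(ts, hour_table, day_table):
    """Concatenate the hour and day vectors, i.e. e_t = e_th (+) e_td."""
    hour, day = time_features(ts)
    return hour_table[hour] + day_table[day]  # list concatenation


# Placeholder tables standing in for learned embeddings.
hour_table = [[float(h)] * 10 for h in range(24)]
day_table = [[float(d)] * 20 for d in range(7)]

e_t = embed_time(datetime(2012, 4, 3, 12, 5), hour_table, day_table)
```

Discretizing the timestamp in this way lets records from different weeks share the same hour and weekday vectors, which is what exposes the daily and weekly periodicity to the model.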
Our proposed LSC consists of two capturers that learn the transition regularities in the short-term and long-term sessions, respectively, as shown in Fig. 3. We formulate LSC as:
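One form consistent with the description that follows is sketched here; the operator names \(\mathrm{LongCapturer}\) and \(\mathrm{ShortCapturer}\) are placeholders assumed for illustration and need not match the authors' notation:

\[
\boldsymbol{h}_{S_{p-1}} = \mathrm{LongCapturer}\big(\{\widetilde{\boldsymbol{e}}^{r}_{q}\}, \boldsymbol{h}_{0}\big), \qquad
\boldsymbol{h}_{t} = \mathrm{ShortCapturer}\big(\widetilde{\boldsymbol{e}}^{r}_{p}, \boldsymbol{h}_{S_{p-1}}\big)
\]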
where \(\boldsymbol{h}_{0}\) is the initial hidden state. In the LSC structure, the short-term capturer takes the hidden state \(\boldsymbol{h}_{S_{p-1}}\) of the long-term capturer as its initial hidden state in order to incorporate the historical information. Because the GRU is simple yet efficient in modeling temporal data, we apply one GRU layer in each of the capturers:
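Written as a plain recurrence over record embeddings (a sketch that omits the gate-level details of the GRU), this reads:

\[
\boldsymbol{h}_{i} = \mathrm{GRU}\big(\boldsymbol{e}^{r}_{i}, \boldsymbol{h}_{i-1}\big)
\]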
where \(i\in \{1,2,\dots,S_{p-1}\}\) for the long-term capturer and \(i\in \{1,2,\dots,t\}\) for the short-term capturer.
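The control flow of the LSC, in which the final state of the long-term pass seeds the short-term pass, can be sketched with any recurrent step function. In the example below a scalar `toy_cell` stands in for the GRU, and flattening the long-term sessions into one sequence is an assumption of this sketch rather than a confirmed detail of the model.

```python
import math


def toy_cell(x, h):
    """Stand-in for a GRU step: maps (input, previous state) to a new state."""
    return math.tanh(0.5 * x + 0.5 * h)


def run_capturer(cell, inputs, h0):
    """Run a recurrent cell over a sequence and return the final hidden state."""
    h = h0
    for x in inputs:
        h = cell(x, h)
    return h


def lsc(cell, long_sessions, short_session, h0=0.0):
    """Long-term pass over the history, then short-term pass seeded with its state."""
    history = [x for session in long_sessions for x in session]
    h_long = run_capturer(cell, history, h0)           # long-term capturer
    return run_capturer(cell, short_session, h_long)   # short-term capturer


with_history = lsc(toy_cell, [[1.0, 1.0]], [0.2])
without_history = run_capturer(toy_cell, [0.2], 0.0)
```

The two results differ because the short-term pass inherits information from the history through its initial state.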
4.3 Causal structure
To model the “time→activity→location” logical relationship discussed in Sect. 4.1, we introduce a causal structure based on multi-task learning techniques. As illustrated in Fig. 2, we utilize three branches with the same architecture to model the change patterns of time, activity, and location, respectively. Specifically, in each branch, we feed the same record representations to the LSC module and then pass the output hidden states to the predictor. To explicitly model the summarized causal relation in human travel behavior, we next design two paths for information transfer between tasks. The first path lies between two LSC modules and passes on the task-specific hidden states. The second path lies between two predictors: the predicted result of the upstream task is processed by the converter and then conveyed to the downstream task. Here we use a fully connected layer as both the predictor (P) and the converter (C), that is, \(\boldsymbol{y} = \mathrm{Linear}(\boldsymbol{x})=\boldsymbol{W}\boldsymbol{x}+ \boldsymbol{b}\).
Mathematically, the branch of “time” is formulated as:
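A form consistent with the linear predictor defined above and with the dimensions given next is sketched here; the bias symbol \(\boldsymbol{b}^{(P^{t})}\) is assumed:

\[
\hat{t} = \boldsymbol{W}^{(P^{t})}\boldsymbol{h}^{t}_{t} + \boldsymbol{b}^{(P^{t})}
\]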
where \(\boldsymbol{W}^{(P^{t})}\in \mathbb{R}^{1\times |\boldsymbol{h}^{t}|}\), \(\boldsymbol{h}^{t}_{t}\) is the hidden state of the next time, and \(\hat{t}\) is the predicted time. As the downstream task of “time” in the causal structure, the branch of “activity” can be formulated as:
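Reading the dimensions given next, one consistent sketch is the following, where \(\boldsymbol{c}_{t}\) (the score vector over categories) and the bias terms are symbols assumed for illustration:

\[
\boldsymbol{c}_{t} = \boldsymbol{W}^{(P^{c})}\Big(\boldsymbol{h}^{c}_{t} \oplus \big(\boldsymbol{W}^{(C^{t})}\hat{t} + \boldsymbol{b}^{(C^{t})}\big)\Big) + \boldsymbol{b}^{(P^{c})}, \qquad
\hat{c}_{t} = \arg\max(\boldsymbol{c}_{t})
\]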
where \(\boldsymbol{W}^{(C^{t})}\in \mathbb{R}^{|\boldsymbol{e}^{t}|\times 1}, \boldsymbol{W}^{(P^{c})}\in \mathbb{R}^{|\mathcal{C}|\times (|\boldsymbol{h}^{c}|+|\boldsymbol{e}^{t}|)}\), \(\boldsymbol{h}^{c}_{t}\) is the hidden state of the next activity, and \(\hat{c}_{t}\) is the predicted activity. Finally, we can formulate the branch of “location” as:
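By analogy with the upstream branches, and assuming that the converter consumes the category score vector \(\boldsymbol{c}_{t}\) (an assumed symbol, as are the bias terms), a consistent sketch is:

\[
\boldsymbol{l}_{t} = \boldsymbol{W}^{(P^{l})}\Big(\boldsymbol{h}^{l}_{t} \oplus \big(\boldsymbol{W}^{(C^{c})}\boldsymbol{c}_{t} + \boldsymbol{b}^{(C^{c})}\big)\Big) + \boldsymbol{b}^{(P^{l})}, \qquad
\hat{l}_{t} = \arg\max(\boldsymbol{l}_{t})
\]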
where \(\boldsymbol{W}^{(C^{c})}\in \mathbb{R}^{|\boldsymbol{e}^{c}|\times |\mathcal{C}|}, \boldsymbol{W}^{(P^{l})}\in \mathbb{R}^{|\mathcal{L}| \times (|\boldsymbol{h}^{l}|+|\boldsymbol{e}^{c}|)}\), \(\boldsymbol{h}^{l}_{t}\) is the hidden state of the next location, \(\boldsymbol{l}_{t}\) is the distribution of the predicted location, and \(\hat{l}_{t}\) is the predicted location.
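The predictor-to-predictor path can be traced with plain lists to check that the stated matrix shapes fit together. This is a minimal sketch under several assumptions: the hidden states are given as inputs, the converter for the activity branch consumes raw category scores, all sizes are toy values, and the helper names are hypothetical.

```python
import random


def linear(W, b, x):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init(rows, cols, rng):
    """Small random weight matrix and zero bias."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return W, [0.0] * rows


def argmax(v):
    return max(range(len(v)), key=v.__getitem__)


def causal_heads(h_t, h_c, h_l, params):
    t_hat = linear(*params["P_t"], h_t)              # predictor: |h_t| -> 1
    t_vec = linear(*params["C_t"], t_hat)            # converter: 1 -> |e_t|
    c_scores = linear(*params["P_c"], h_c + t_vec)   # predictor: |h_c|+|e_t| -> |C|
    c_vec = linear(*params["C_c"], c_scores)         # converter: |C| -> |e_c|
    l_scores = linear(*params["P_l"], h_l + c_vec)   # predictor: |h_l|+|e_c| -> |L|
    return t_hat[0], argmax(c_scores), argmax(l_scores), (len(c_scores), len(l_scores))


rng = random.Random(0)
H, E_T, E_C, N_CAT, N_LOC = 6, 3, 4, 5, 8  # toy sizes, not the paper's
params = {
    "P_t": init(1, H, rng),
    "C_t": init(E_T, 1, rng),
    "P_c": init(N_CAT, H + E_T, rng),
    "C_c": init(E_C, N_CAT, rng),
    "P_l": init(N_LOC, H + E_C, rng),
}
h = [rng.uniform(-1, 1) for _ in range(H)]
t_hat, c_hat, l_hat, shapes = causal_heads(h, h, h, params)
```

Each downstream predictor sees its own hidden state concatenated with a converted version of the upstream prediction, which is how the time→activity→location order is imposed.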
4.4 Spatial-constrained loss function
As discussed in Sect. 4.1, to align the spatial distribution of destinations, we propose a spatial-constrained loss function that shortens the distance from the predicted location to the ground truth at the individual level. The distance constraint can be regarded as a self-supervised auxiliary task that integrates geographical information and restricts the candidate set for better next location prediction. Existing methods introduce distance constraints in a regression subtask, directly predicting the geographical coordinates of destinations [43]. In the classification scheme, however, we must look up the coordinates of location IDs to calculate their distance, and this operation is not differentiable. We take inspiration from REINFORCE [44], which introduces the reward into the loss function to train a policy network, and use the physical distance between the predicted location and the ground truth as a coefficient that weights the cross-entropy between the ground truth and the predicted location ID. The spatial-constrained loss function is defined as:
where N is the total number of records and σ is the softmax function.
We next employ the MAE loss for time prediction and the cross-entropy loss for category and location prediction. Thus we have \(L_{t}=\mathrm{MAE}(\hat{t},t)=\sum^{N}_{i=1}|\hat{t}_{i}-t_{i}|\) and \(L_{*}=\mathrm{CrossEntropy}(*)=-\sum^{N}_{i=1} \log(\sigma ( \boldsymbol{*}_{t,i})),*\in \{c,l\}\). The total loss function can then be written as
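A weighted sum consistent with the three weights named next is sketched here, assuming that the location cross-entropy \(L_{l}\) carries unit weight and writing \(L_{s}\) for the spatial-constrained loss:

\[
L = L_{l} + \lambda_{t} L_{t} + \lambda_{c} L_{c} + \lambda_{s} L_{s}
\]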
where \(\lambda _{t}, \lambda _{c}, \lambda _{s}\) are weights for their loss functions.
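The idea of weighting the cross-entropy by a physical distance can be sketched as follows. This is an illustrative reading, not the authors' loss: the haversine distance, the averaging over records, and the use of the top-scoring location as the prediction are all assumptions of the example. The distance acts as a constant coefficient, in the spirit of a REINFORCE reward, so no gradient would need to flow through the coordinate lookup.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    s = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(s))


def log_softmax(scores):
    """Numerically stable log of the softmax."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def spatial_constrained_loss(batch_scores, targets, coords):
    """Mean over records of distance(predicted, actual) * cross-entropy(actual)."""
    total = 0.0
    for scores, y in zip(batch_scores, targets):
        pred = max(range(len(scores)), key=scores.__getitem__)
        dist = haversine_km(coords[pred], coords[y])  # constant w.r.t. the scores
        total += dist * -log_softmax(scores)[y]
    return total / len(targets)


# Three hypothetical locations: index 1 is near index 0, index 2 is far from it.
coords = [(40.0, -74.0), (40.01, -74.0), (41.0, -74.0)]
near_miss = spatial_constrained_loss([[0.0, 2.0, 0.0]], [0], coords)
far_miss = spatial_constrained_loss([[0.0, 0.0, 2.0]], [0], coords)
exact_hit = spatial_constrained_loss([[2.0, 0.0, 0.0]], [0], coords)
```

With identical cross-entropy in the two wrong cases, the far miss is penalized more than the near miss, and a correct top-1 prediction contributes nothing to this auxiliary term.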
4.5 Structure comparison
There are various strategies for task combination in multi-task learning, such as the share-bottom structure [45, 46], the hierarchical structure [34, 47], and the multi-expert structure [46, 48]. Inspired by these structures, we propose five variants, as shown in Fig. 4, to demonstrate the advantages of our causal structure. Note that these variants use the same basic components as CSLSL, such as GRUs and fully connected layers.
Long and Short-term Learner (LSL) is a basic approach with only one branch to predict location. To jointly predict “time”, “activity”, and “location”, Share-Bottom LSL (SBLSL) introduces two additional predictors that share the same bottom LSC module with the original one. Multi-Experts LSL (MELSL) is an advanced version of SBLSL, with a structure similar to the Mixture of Sequential Experts (MoSE) [46]. MELSL employs several GRUs as experts that focus on different aspects of sequence dependencies, and gate networks that combine the relevant aspects for each task.
Unlike the share-bottom structure, Separate LSL (SLSL) employs a separate branch for each task, and the only information shared between tasks is the common record representations. To account for the dependencies between tasks, Hierarchical LSL (HLSL) concatenates the record embedding and the output hidden state of the upstream task as the input of the downstream task. Thus equation (3) changes to:
where \(\boldsymbol{h}^{k}_{i}\) is the hidden state of the \(k\)-th task at the \(i\)-th time step, and equation (2) changes to:
where \([\widetilde{\boldsymbol{e}}^{r}_{p},\widetilde{\boldsymbol{h}}^{k-1}_{p}] = \{\boldsymbol{e}^{r}_{1} \oplus \boldsymbol{h}^{k-1}_{1}, \dots, \boldsymbol{e}^{r}_{t-1} \oplus \boldsymbol{h}^{k-1}_{t-1}\}\).
5 Experiments
5.1 Data description
We leverage three publicly available check-in datasets in the experiments: two datasets from Foursquare [49] in New York (NYC) and Tokyo (TKY) and one dataset from Gowalla [50] in Dallas. Data in NYC and TKY were collected from 3 April 2012 to 16 February 2013, and data in Dallas were collected from 4 February 2009 to 22 October 2010. The numbers of users, locations, and records in the three datasets are summarized in Table 1, where \(|\mathcal{*}|\) denotes the number of ∗. The numbers of location categories \(|\mathcal{C}|\) in NYC and TKY are 400 and 385, while Dallas does not contain category information.
To prepare data for the baselines and the proposed models, we first filter out both users and locations with fewer than 10 records, in line with previous work [13, 23]. We then merge consecutive records with the same user and location on the same day. The statistical information of the raw and processed data is given in Table 1. After preprocessing, the numbers of categories for NYC and TKY are reduced to 308 and 286. For CSLSL and its variants, we split trajectories into weekly sessions because of the data sparsity. In addition, we require that each session contains at least two records and that each user has at least five sessions to guarantee a training/testing split of 8/2, following [13]. All baselines have their own further data preparation strategies, and the model-specific dataset information is also shown in Table 1. It is noteworthy that LSTPM [23] requires at least three records in each session and Flashback [24] requires at least 100 records per user. These practices filter out more sparse data and reduce the difficulty of prediction. Moreover, GETNext requires category information as input and thus cannot be applied to the Dallas dataset.
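The first two preprocessing steps can be sketched as below. The record layout `(user_id, location_id, datetime)` and the function names are assumptions of this example, and the filter is applied in a single pass; whether the original pipeline repeats the filtering until it stabilizes is not stated here.

```python
from collections import Counter
from datetime import datetime


def filter_sparse(records, min_count=10):
    """Drop records whose user or location has fewer than min_count records."""
    users = Counter(r[0] for r in records)
    locs = Counter(r[1] for r in records)
    return [r for r in records if users[r[0]] >= min_count and locs[r[1]] >= min_count]


def merge_consecutive(records):
    """Collapse consecutive records with the same user, location and calendar day."""
    merged = []
    for r in sorted(records, key=lambda rec: (rec[0], rec[2])):
        if merged:
            p = merged[-1]
            if p[0] == r[0] and p[1] == r[1] and p[2].date() == r[2].date():
                continue  # repeat of the previous check-in: keep only the first
        merged.append(r)
    return merged


raw = [
    ("u", "a", datetime(2012, 4, 3, 8, 0)),
    ("u", "a", datetime(2012, 4, 3, 9, 0)),   # merged into the 8:00 record
    ("u", "b", datetime(2012, 4, 3, 10, 0)),
    ("u", "a", datetime(2012, 4, 3, 11, 0)),  # kept: not consecutive with 8:00
    ("u", "a", datetime(2012, 4, 4, 8, 0)),   # kept: different day
]
cleaned = merge_consecutive(raw)
```

Only back-to-back repeats on the same day are collapsed, so a return visit after going elsewhere is preserved as a separate record.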
5.2 Baselines and settings
Baselines. We compare CSLSL with the state-of-the-art baselines:

FPMC [12] is a Markov-based model that uses factorization to learn individual transition matrices.

DeepMove [13] adopts an attention mechanism to learn long-term preference and a GRU module to capture short-term preference.

Flashback [24] uses spatiotemporal distances as attention weights to search the historical hidden states for current prediction.

LSTPM [23] considers temporal similarity and the distance factor to model long-term preferences, and geographical relevance to model short-term preferences.

GeoSAN [15] designs a geography encoder to implicitly capture spatial proximity and introduces a loss function based on importance sampling to better use the informative negative samples.

STAN [17] introduces a two-layer attention architecture with spatiotemporal relation matrices to explicitly capture the spatiotemporal correlations.

GETNext [51] utilizes a GCN to integrate collective movement patterns and a transformer encoder to capture transition regularities. Besides, it introduces location categories as inputs and prediction targets.
Settings. For a convincing comparison between these baselines and CSLSL, we collected the open-source code released by the authors and searched for the optimal hyperparameters in the experiments. It is worth noting that most of the baselines only predict the next location, without the category and time of the visit. Thus, we map the predicted location ID to its category for comparison and exclude time prediction from the performance comparison. Besides, we split users’ trajectories by day and by week for the FPMC model, referred to as FPMCD and FPMCW, respectively. For CSLSL, the dimensions of the representation vectors \(\boldsymbol{e}^{l},\boldsymbol{e}^{c},\boldsymbol{e}^{t^{h}}, \boldsymbol{e}^{t^{d}}\) and \(\boldsymbol{e}^{u}\) are set to 200, 100, 10, 20, and 20 for all datasets. The dimension of the hidden state in all GRUs is set to 600. We use the Adam optimizer with a learning rate of 0.0001, and \(\lambda _{t}\), \(\lambda _{c}\), and \(\lambda _{s}\) are set to 10.
Metrics. In the next location prediction task, what we care about is whether the actual location is among the top \(N\) predictions, \(N\in\{1,5,10\}\). \(\mathrm{Recall}@N\) is the most commonly used metric and is equal to \(\mathrm{Accuracy}@N\) because there are no false positives (FP) or true negatives (TN) in this setting. The definition of \(\mathrm{Recall}@N\) is
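A common per-user form that matches the symbols defined next is sketched here; averaging over users \(u\in\mathcal{U}\) is an assumption of this sketch:

\[
\mathrm{Recall}@N = \frac{1}{|\mathcal{U}|}\sum_{u\in\mathcal{U}} \frac{|\mathcal{L}_{u}^{T}\cap \mathcal{L}_{u}^{P}|}{|\mathcal{L}_{u}^{T}|}
\]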
where \(\mathcal{L}_{u}^{T}\) and \(\mathcal{L}_{u}^{P}\) are the target and top N prediction location sets, respectively.
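When every prediction has exactly one target location, the metric reduces to a hit rate over test records. A minimal sketch under that assumption, with a hypothetical function name and toy inputs:

```python
def recall_at_n(ranked_predictions, targets, n):
    """Fraction of records whose target appears in the top-n ranked predictions."""
    hits = sum(1 for ranked, y in zip(ranked_predictions, targets) if y in ranked[:n])
    return hits / len(targets)


# Each inner list is a ranking of candidate locations, best first.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "a"]
scores = {n: recall_at_n(ranked, targets, n) for n in (1, 2, 3)}
```

The value is non-decreasing in `n`, since enlarging the candidate list can only add hits.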
5.3 Performance comparison with baselines
The experimental results are averaged over 10 independent runs and shown in Table 2. For each city, the results are presented in three blocks, covering the baselines (lines 1–8), the variants (lines 9–14), and the ablations (lines 15–18), respectively. The best performance in each column is highlighted in bold and the second best is underlined. For NYC and TKY, we present the prediction results for categories and locations, while for Dallas, we only show the location prediction results because category information is lacking.
From the experimental results, we observe that the proposed CSLSL shows promising performance compared with the baselines. In terms of \(\mathrm{Recall}@1\) in location prediction, CSLSL achieves average improvements of 27%, 37%, and 43% over the deep learning baselines on the three datasets. For \(\mathrm{Recall}@1\) in category prediction, the improvements are 34% and 23% in NYC and TKY, respectively. Considering \(\mathrm{Recall}@\{5,10\}\) in location prediction, CSLSL performs similarly to LSTPM and Flashback, which filter out over 46%, 18%, and 57% more sparse users than we do on the three datasets, as shown in Table 1. Achieving similar performance under these more challenging dataset settings also reflects the superiority of our model. Disregarding these two models, CSLSL still obtains average improvements of over 20% over the remaining deep learning baselines. Moreover, CSLSL improves \(\mathrm{Recall}@1\) by 8.9% and 11.1% in NYC and TKY compared with GETNext, whose dataset statistics are similar to ours. The poor performance of all models on the Dallas dataset may be due to data sparseness. Even so, CSLSL can still exploit the time-location relationship and the spatial constraints to achieve performance gains.
Among the baselines, we observe that the overall performance of location prediction on TKY is lower than that on NYC. This is probably because the TKY dataset has more users and locations than NYC, which increases the difficulty of mobility prediction. However, CSLSL obtains a larger improvement over the baselines on TKY than on NYC for location prediction. For instance, CSLSL improves by 8.9% on NYC compared with GETNext, while the improvement is 11.1% on TKY. On the other hand, the category predictions of all models are more accurate on TKY than on NYC. We may conclude that the larger improvement on TKY than on NYC is mainly due to the proper modeling of the dependencies.
5.4 Performance comparison with variants
To fairly demonstrate the effectiveness of the proposed causal structure, we develop an ablated version of CSLSL by dropping the spatial-constraint loss, namely CLSL, and compare it with the five variants discussed in Sect. 4.5. We also consider the “what→when→where” relationship by changing the order of the three branches in CLSL from “time→category→location” to “category→time→location”; this variant is named CLSLctl. We present the results of the variants and CLSL in the second and third parts of Table 2. The category prediction results of LSL are obtained in the same way as for the baselines.
Compared with LSL, SBLSL performs similarly in location prediction and slightly better in category prediction, suggesting that the shared bottom of SBLSL has indeed learned the category transfer regularities. However, these learned regularities make no contribution to location prediction. Besides, MELSL performs worse than LSL and SBLSL, possibly because MELSL does not clarify the relationship between tasks and its experts cannot find suitable optimization directions. The better performance of SLSL than SBLSL indicates that separate modules for learning transition relationships are better than a shared one. HLSL achieves the best performance in the second part of Table 2, suggesting that there are dependencies between time, category, and location, and that capturing these dependencies facilitates location prediction.
These variants perform worse than CLSL, suggesting that although they utilize temporal and categorical information, they cannot effectively and autonomously capture the dependencies between time, category, and location. In contrast, the causal structure explicitly captures the dependencies between tasks in two ways, thereby fully exploiting them to improve performance. Moreover, the better performance of CLSL than CLSLctl is in line with expectations, because location depends more strongly on category than on time, so category information brings larger performance gains for location prediction. Therefore, our proposed causal structure explicitly models “time→category→location” rather than “category→time→location”.
5.5 Ablation study
We also conduct ablation studies to examine the contributions of different components in CSLSL. The ablated models include:

LSL: the version that only keeps the location branch.

CLSL: the version that removes the spatial-constraint loss.

CSLSLt: the version that removes the time branch.

CSLSLc: the version that removes the category branch.
As shown in Table 2, CSLSLt achieves better performance than CSLSLc, indicating that the “category→location” relationship imposes stronger dependency constraints than “time→location”. This result is also consistent with the discussion in Sect. 5.4. The best performance of the complete CSLSL demonstrates the significance of the entire “time→category→location” decision logic. Comparing the performance of CLSL and CSLSL, we can confirm that the spatial-constraint loss function has a positive impact on performance. Moreover, LSL achieves decent performance compared with the baselines, probably because it leverages category information and the LSC module is capable of capturing long-term and short-term preferences.
5.6 Results visualization analysis
We conduct result visualization analysis to further understand the effectiveness of the causal structure and the spatial-constrained loss. For the causal structure, we compare the category and location prediction results of GETNext and CSLSL, as shown in Fig. 5. The successfully predicted locations are divided into two parts in the figure based on whether the category prediction is accurate. We observe that, for CSLSL, records in which both the category and the location are predicted correctly account for 21% and 20% of all records on NYC and TKY, respectively. Compared with GETNext, CSLSL correctly predicts 10% and 18% more locations, accompanied by more accurate category predictions, on NYC and TKY, respectively. This intuitively indicates that the causal structure can enhance location prediction through more accurate category prediction. Interestingly, the location can still be predicted correctly when the category is predicted incorrectly. This is because the category information is introduced as additional auxiliary information without imposing mandatory constraints on the location prediction.
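The breakdown described above can be sketched as follows, assuming per-record boolean flags that mark whether the category and the location were predicted correctly. The function name `joint_breakdown`, the group labels, and the toy flags are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter
from typing import Dict, Sequence


def joint_breakdown(cat_correct: Sequence[bool],
                    loc_correct: Sequence[bool]) -> Dict[str, float]:
    """Share of all records in each (category, location) correctness group."""
    if len(cat_correct) != len(loc_correct):
        raise ValueError("both flag sequences must have the same length")
    total = len(cat_correct)
    if total == 0:
        raise ValueError("at least one record is required")
    # Counter returns 0 for combinations that never occur.
    counts = Counter(zip(cat_correct, loc_correct))
    return {
        "both_correct": counts[(True, True)] / total,
        "location_only": counts[(False, True)] / total,
        "category_only": counts[(True, False)] / total,
        "neither": counts[(False, False)] / total,
    }


# Toy flags for five records (assumption, not real results).
cat_flags = [True, True, False, False, True]
loc_flags = [True, False, True, False, True]
shares = joint_breakdown(cat_flags, loc_flags)
```

The "location_only" group corresponds to locations predicted correctly despite a wrong category, which is possible because the category acts as auxiliary information rather than a hard constraint.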
To further explore the relationship between categories and location prediction, we examine the accuracy of location predictions for different categories, as depicted in Fig. 6. The category classification is derived from the Foursquare platform. The results exhibit varying levels of predictability for different categories. For instance, the Community and Government category shows higher accuracy, while Retail demonstrates lower accuracy. This disparity may be attributed to the complex relationship between categories and locations. A greater number of location options within the same category in proximity to the user’s location would result in higher prediction difficulty. Additionally, the periodicity of visits to different categories also affects the accuracy of predictions.
The quantity and frequency of an individual’s visited locations can reflect the predictability of their travel behavior. Therefore, we use entropy to describe the patterns of individual location visits: \(\mathrm{Entropy}(u)=-\sum_{i=1}^{n}{p_{i}}\log{p_{i}}\), where \(p_{i}\) denotes the visit frequency of the ith location and n is the total number of locations visited by user u. Figure 7(a) depicts the correlation between category entropy and location prediction accuracy, while Fig. 7(b) illustrates the relationship between location entropy and location prediction accuracy. The results indicate a negative correlation between entropy and accuracy. Users with higher entropy tend to visit more diverse locations, making their travel harder to predict.
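A minimal sketch of this measure, assuming the usual Shannon form with a leading minus sign and the natural logarithm (the logarithm base is not stated in the text). The function name `visit_entropy` and the toy visit sequences are illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def visit_entropy(visits: Sequence[Hashable]) -> float:
    """Shannon entropy of a user's visit frequencies (natural log, an assumption)."""
    if not visits:
        return 0.0
    total = len(visits)
    counts = Counter(visits)
    # p_i is the share of visits that went to the i-th distinct location.
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical users: one with a strong routine, one visiting places evenly.
regular_user = ["home"] * 8 + ["work"] * 2
diverse_user = ["a", "b", "c", "d"]
regular_entropy = visit_entropy(regular_user)
diverse_entropy = visit_entropy(diverse_user)
```

A user who splits visits evenly over n locations reaches the maximum value log n, while a user with a single location has zero entropy, matching the intuition that higher entropy means less predictable travel.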
Regarding the spatial-constrained loss, we examine whether the distances between predicted and actual locations are successfully constrained, and compare CSLSL with four baselines. As shown in Fig. 8, the predicted locations of CSLSL are closer to the actual locations, which indicates that the proposed loss can successfully constrain the distance errors. Furthermore, we inspect the constraining effect of the proposed loss on spatial consistency. Figure 9(a) compares the predicted displacement with the ground truth. The predicted displacement of CSLSL is closer to the true distribution. This is because the constraint between the predicted and actual locations indirectly ensures the consistency of the predicted displacements with the ground truth. Figure 9(b) shows the prediction error of regional attractiveness. We divided the geographic regions into square grids with side lengths of 500 m and counted the difference between the predicted and actual visits in each grid. As presented in Fig. 9(b), CSLSL has a smaller prediction error of regional attractiveness, suggesting that the proposed loss successfully constrains the spatial consistency.
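The grid-based comparison of regional attractiveness can be sketched as follows. This assumes locations are already projected to planar coordinates in metres and that the per-grid difference is taken as predicted minus actual visit counts; the text does not state whether a signed or absolute difference is used. All names and coordinates are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float]  # (x, y) in metres, projected coordinates (assumption)
Cell = Tuple[int, int]       # integer grid indices


def grid_counts(points: Iterable[Point], side_m: float = 500.0) -> Counter:
    """Count visits falling into each square grid cell of side length side_m."""
    return Counter(
        (math.floor(x / side_m), math.floor(y / side_m)) for x, y in points
    )


def attractiveness_error(predicted: Iterable[Point],
                         actual: Iterable[Point],
                         side_m: float = 500.0) -> Dict[Cell, int]:
    """Per-cell difference between predicted and actual visit counts."""
    pred = grid_counts(predicted, side_m)
    act = grid_counts(actual, side_m)
    # Cells missing from one side count as zero visits on that side.
    return {cell: pred[cell] - act[cell] for cell in set(pred) | set(act)}


# Toy coordinates in metres (assumption, not real check-ins).
actual_pts = [(100.0, 100.0), (200.0, 300.0), (700.0, 100.0)]
predicted_pts = [(120.0, 80.0), (600.0, 200.0), (900.0, 450.0)]
errors = attractiveness_error(predicted_pts, actual_pts)
total_abs_error = sum(abs(v) for v in errors.values())
```

In this toy case one visit is shifted from cell (0, 0) to cell (1, 0), so the signed errors cancel overall while the total absolute error is 2.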
5.7 Sensitivity analysis
We perform sensitivity analysis on the NYC and TKY datasets to examine how the performance of CSLSL is affected by \(\lambda _{*}\), \(*\in \{t,c,s\}\). We first vary \(\lambda _{t}\) and \(\lambda _{c}\) to analyze the effect of the time and category prediction subtasks with a fixed \(\lambda _{s}=1\). Then we fix \(\lambda _{t}\) and \(\lambda _{c}\) and vary \(\lambda _{s}\) to observe the impact of the spatial-constrained auxiliary task. \(\mathrm{Recall}@1\) is chosen as the evaluation metric and the location prediction results are averaged over three runs, as shown in Fig. 10.
From Fig. 10 (a), we can observe that the performance of location prediction is more sensitive to \(\lambda _{c}\) than \(\lambda _{t}\), reflecting that the accurate category prediction exerts more influence on the location prediction accuracy, which is also consistent with our proposed decision logic. In addition, the best performance on NYC is obtained with \(\lambda _{t}=5\) and \(\lambda _{c}=10\) when \(\lambda _{s}=1\), while that on TKY is obtained with \(\lambda _{t}=100\) and \(\lambda _{c}=50\). Figure 10 (b) shows that CSLSL reaches a more stable accuracy on NYC when \(\lambda _{s}=5.0\), while the average accuracy is higher when \(\lambda _{s}=10.0\). The results on TKY show that when \(\lambda _{t}\) and \(\lambda _{c}\) are set to smaller values, better performance can be achieved when \(\lambda _{s}\) is varied. In summary, CSLSL is robust to these parameters and does not suffer from large performance fluctuations with parameter changes.
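To illustrate how the weights enter training, the sketch below assumes the overall objective is a weighted sum of the location loss and the three auxiliary losses. This form, the function name `total_loss`, the toy loss values, and the candidate grids are assumptions for illustration; only the reported NYC setting \(\lambda _{t}=5\), \(\lambda _{c}=10\), \(\lambda _{s}=1\) comes from the text.

```python
from itertools import product


def total_loss(loss_loc: float, loss_time: float, loss_cat: float,
               loss_spatial: float, lambda_t: float = 1.0,
               lambda_c: float = 1.0, lambda_s: float = 1.0) -> float:
    """Weighted sum of task losses (assumed form of the multi-task objective)."""
    return (loss_loc
            + lambda_t * loss_time
            + lambda_c * loss_cat
            + lambda_s * loss_spatial)


# Toy per-batch losses (hypothetical numbers).
losses = dict(loss_loc=2.0, loss_time=0.1, loss_cat=0.5, loss_spatial=0.3)

# Setting reported as best on NYC in the text above.
nyc_loss = total_loss(**losses, lambda_t=5, lambda_c=10, lambda_s=1)

# A sensitivity sweep varies lambda_t and lambda_c with lambda_s fixed at 1.
# The candidate values below are placeholders, not the grid used in the paper.
sweep = [
    (lt, lc, total_loss(**losses, lambda_t=lt, lambda_c=lc, lambda_s=1))
    for lt, lc in product([1, 5, 10], [1, 10, 50])
]
```

In an actual sweep each weight combination would require retraining the model and recording \(\mathrm{Recall}@1\); the list comprehension here only shows how the combinations are enumerated.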
6 Conclusion
In this work, we propose a Causal and Spatial-constrained Long and Short-Term Learner (CSLSL) to incorporate the individual travel decision logic and the group consistency for next location prediction. In CSLSL, we introduce a causal structure based on multi-task learning to explicitly capture the “when→what→where” decision logic and enhance location prediction by fully exploiting temporal and categorical information. We further propose a simple but effective spatial-constrained loss function that acts as a self-supervised auxiliary task to incorporate geographical information and indirectly ensure spatial consistency. We conducted extensive experiments to confirm the effectiveness of the design. Specifically, we compared our model with seven baseline models on three datasets, demonstrating the superior performance of the proposed model. We also conducted variant and ablation experiments to validate the effectiveness of the proposed causal structure and spatial-constraint loss. Furthermore, we performed visualization analyses of the model’s predictions, including the relationship between categories and location predictions, the influence of individual behavioral diversity on predictability, and distance relationships and disparities in spatial distribution. Finally, we conducted sensitivity analysis on the hyperparameters to examine the robustness of our model. Because we evaluated our model on check-in data, the performance improvement was limited by the sparse nature of the data. We expect to experiment on dense datasets with comprehensive travel behavior. Such datasets would exhibit more regular patterns in human behavior, enabling the model to use time and activity information more effectively to enhance location prediction.
Data availability
The Foursquare datasets NYC and TKY can be accessed through the following link: https://sites.google.com/site/yangdingqi/home/foursquare-dataset?authuser=0. The Dallas dataset from Gowalla is available at: https://snap.stanford.edu/data/loc-gowalla.html. The code implementation of this work is accessible online at: https://github.com/urbanmobility/CSLSL.
References
Gonzalez MC, Hidalgo CA, Barabasi AL (2008) Understanding individual human mobility patterns. Nature 453(7196):779–782
Barbosa H, Barthelemy M, Ghoshal G, James CR, Lenormand M, Louail T, Menezes R, Ramasco JJ, Simini F, Tomasini M (2018) Human mobility: models and applications. Phys Rep 734:1–74
Xu F, Li Y, Jin D, Lu J, Song C (2021) Emergence of urban growth patterns from human mobility behavior. Nat Comput Sci 1(12):791–800
Çolak S, Lima A, González MC (2016) Understanding congested travel in urban areas. Nat Commun 7(1):1–8
Xu Y, Çolak S, Kara EC, Moura SJ, González MC (2018) Planning for electric vehicle needs by coupling charging profiles with urban mobility. Nat Energy 3(6):484–493
Xu Y, Jiang S, Li R, Zhang J, Zhao J, Abbar S, González MC (2019) Unraveling environmental justice in ambient pm2.5 exposure in Beijing: a big data approach. Comput Environ Urban Syst 75:12–21
Arenas A, Cota W, Gómez-Gardeñes J, Gómez S, Granell C, Matamalas JT, Soriano-Paños D, Steinegger B (2020) Modeling the spatiotemporal epidemic spreading of COVID-19 and the impact of mobility and social distancing interventions. Phys Rev X 10(4):041055
Jia JS, Lu X, Yuan Y, Xu G, Jia J, Christakis NA (2020) Population flow drives spatio-temporal distribution of COVID-19 in China. Nature 582(7812):389–394
Luca M, Lepri B, Frias-Martinez E, Lutu A (2022) Modeling international mobility using roaming cell phone traces during COVID-19 pandemic. EPJ Data Sci 11(1):22
Luca M, Barlacchi G, Lepri B, Pappalardo L (2020) Deep learning for human mobility: a survey on data and models. ArXiv preprint. arXiv:2012.02825
Song C, Qu Z, Blumm N, Barabási AL (2010) Limits of predictability in human mobility. Science 327(5968):1018–1021
Rendle S, Freudenthaler C, Schmidt-Thieme L (2010) Factorizing personalized Markov chains for next-basket recommendation. In: Proceedings of the 19th international conference on world wide web, pp 811–820
Feng J, Li Y, Zhang C, Sun F, Meng F, Guo A, Jin D (2018) Deepmove: predicting human mobility with attentional recurrent networks. In: Proceedings of the 2018 world wide web conference, pp 1459–1468
Couto Teixeira D, Almeida JM, Viana AC (2021) On estimating the predictability of human mobility: the role of routine. EPJ Data Sci 10(1):49
Lian D, Wu Y, Ge Y, Xie X, Chen E (2020) Geography-aware sequential location recommendation. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp 2009–2019
Guo Q, Sun Z, Zhang J, Theng YL (2020) An attentional recurrent neural network for personalized next location recommendation. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 83–90
Luo Y, Liu Q, Liu Z (2021) Stan: spatiotemporal attention network for next location recommendation. In: Proceedings of the web conference 2021, pp 2177–2185
Cheng C, Yang H, Lyu MR, King I (2013) Where you like to go next: successive point-of-interest recommendation. In: Twenty-third international joint conference on artificial intelligence
He J, Li X, Liao L, Song D, Cheung W (2016) Inferring a personalized next point-of-interest recommendation model with latent behavior patterns. In: Proceedings of the AAAI conference on artificial intelligence, vol 30
Feng S, Li X, Zeng Y, Cong G, Chee YM, Yuan Q (2015) Personalized ranking metric embedding for next new poi recommendation. In: Twenty-fourth international joint conference on artificial intelligence
Manotumruksa J, Macdonald C, Ounis I (2018) A contextual attention recurrent architecture for context-aware venue recommendation. In: The 41st international ACM SIGIR conference on research & development in information retrieval, pp 555–564
Zhao P, Zhu H, Liu Y, Xu J, Li Z, Zhuang F, Sheng VS, Zhou X (2019) Where to go next: a spatio-temporal gated network for next poi recommendation. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 5877–5884
Sun K, Qian T, Chen T, Liang Y, Nguyen QVH, Yin H (2020) Where to go next: modeling long- and short-term user preferences for point-of-interest recommendation. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 214–221
Yang D, Fankhauser B, Rosso P, Cudre-Mauroux P (2020) Location prediction over sparse user mobility traces using rnns: flashback in hidden states! In: Proceedings of the twenty-ninth international joint conference on artificial intelligence, pp 2184–2190
He J, Li X, Liao L (2017) Category-aware next point-of-interest recommendation via listwise Bayesian personalized ranking. In: IJCAI, vol 17, pp 1837–1843
Yu F, Cui L, Guo W, Lu X, Li Q, Lu H (2020) A category-aware deep model for successive poi recommendation on sparse check-in data. In: Proceedings of the web conference 2020, pp 1264–1274
Zhao S, Zhao T, Yang H, Lyu MR, King I (2016) Stellar: spatial-temporal latent ranking for successive point-of-interest recommendation. In: Thirtieth AAAI conference on artificial intelligence
Liu Q, Wu S, Wang L, Tan T (2016) Predicting the next location: a recurrent model with spatial and temporal contexts. In: Thirtieth AAAI conference on artificial intelligence
Kong D, Wu F (2018) Hst-lstm: a hierarchical spatial-temporal long-short term memory network for location prediction. In: IJCAI, vol 18, pp 2341–2347
Zhao K, Zhang Y, Yin H, Wang J, Zheng K, Zhou X, Xing C (2020) Discovering subsequence patterns for next poi recommendation. In: IJCAI, pp 3216–3222
Wang H, Yu Q, Liu Y, Jin D, Li Y (2021) Spatiotemporal urban knowledge graph enabled mobility prediction. Proc ACM Interact Mob Wearable Ubiquitous Technol 5(4):1–24
Du N, Dai H, Trivedi R, Upadhyay U, Gomez-Rodriguez M, Song L (2016) Recurrent marked temporal point processes: embedding event history to vector. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1555–1564
Krishna K, Jain D, Mehta SV, Choudhary S (2018) An lstm based system for prediction of human activities with durations. In: Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies 1(4), pp 1–31
Chen Y, Long C, Cong G, Li C (2020) Context-aware deep model for joint mobility and time prediction. In: Proceedings of the 13th international conference on web search and data mining, pp 106–114
Sun J, Kim J (2021) Joint prediction of next location and travel time from urban vehicle trajectories using long short-term memory neural networks. Transp Res, Part C, Emerg Technol 128:103114
Jiang S, Yang Y, Gupta S, Veneziano D, Athavale S, González MC (2016) The timegeo modeling framework for urban mobility without travel surveys. Proc Natl Acad Sci 113(37):5370–5378
Pappalardo L, Simini F (2018) Data-driven generation of spatio-temporal routines in human mobility. Data Min Knowl Discov 32(3):787–829
Simini F, Barlacchi G, Luca M, Pappalardo L (2021) A deep gravity model for mobility flows generation. Nat Commun 12(1):1–13
Zhang W, Shen Q, Teso S, Lepri B, Passerini A, Bison I, Giunchiglia F (2021) Putting human behavior predictability in context. EPJ Data Sci 10(1):42
Pacheco D, Oliveira M, Chen Z, Barbosa H, FoucaultWelles B, Ghoshal G, Menezes R (2022) Predictability states in human mobility. ArXiv preprint. arXiv:2201.01376
Karniadakis GE, Kevrekidis IG, Lu L, Perdikaris P, Wang S, Yang L (2021) Physics-informed machine learning. Nat Rev Phys 3(6):422–440
Willard J, Jia X, Xu S, Steinbach M, Kumar V (2020) Integrating physics-based modeling with machine learning: a survey, vol 1 pp 1–34. ArXiv preprint. arXiv:2003.04919
Xue H, Salim F, Ren Y, Oliver N (2021) Mobtcast: leveraging auxiliary trajectory forecasting for human mobility prediction. Adv Neural Inf Process Syst 34:30380–30391
Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinf. Learn: 5–32
Ruder S (2017) An overview of multi-task learning in deep neural networks. ArXiv preprint. arXiv:1706.05098
Qin Z, Cheng Y, Zhao Z, Chen Z, Metzler D, Qin J (2020) Multitask mixture of sequential experts for user activity streams. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp 3083–3091
Sanh V, Wolf T, Ruder S (2019) A hierarchical multi-task approach for learning embeddings from semantic tasks. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 6949–6956
Ma J, Zhao Z, Yi X, Chen J, Hong L, Chi EH (2018) Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp 1930–1939
Yang D, Zhang D, Zheng VW, Yu Z (2014) Modeling user activity preference by leveraging user spatial temporal characteristics in lbsns. IEEE Trans Syst Man Cybern Syst 45(1):129–142
Cho E, Myers SA, Leskovec J (2011) Friendship and mobility: user movement in location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1082–1090
Yang S, Liu J, Zhao K (2022) Getnext: trajectory flow map enhanced transformer for next poi recommendation. In: SIGIR
Acknowledgements
The authors thank Wenqing Chen for inspiring a part of the model design.
Funding
This work was jointly supported by the National Natural Science Foundation of China (62102258), Shanghai Pujiang Program (21PJ1407300), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and the Fundamental Research Funds for the Central Universities.
Author information
Contributions
ZH and YX conceived the research and designed the analyses. ZH processed data, conducted experiments, analyzed results, and wrote the paper. SX assisted with the baseline experiments. SX, MW, HW, YX, and YJ provided advice for paper writing. YX supervised the research. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Huang, Z., Xu, S., Wang, M. et al. Human mobility prediction with causal and spatial-constrained multi-task network. EPJ Data Sci. 13, 22 (2024). https://doi.org/10.1140/epjds/s13688-024-00460-7
DOI: https://doi.org/10.1140/epjds/s13688-024-00460-7