Human Mobility Prediction with Causal and Spatial-constrained Multi-task Network

Modeling human mobility helps to understand how people access resources and physically contact each other in cities, and thus contributes to various applications such as urban planning, epidemic control, and location-based advertisement. Next location prediction is a decisive task in individual human mobility modeling and is usually viewed as sequence modeling, solved with Markov or RNN-based methods. However, existing models pay little attention to the logic of individual travel decisions and to reproducing the collective behavior of the population. To this end, we propose a Causal and Spatial-constrained Long and Short-term Learner (CSLSL) for next location prediction. CSLSL utilizes a causal structure based on multi-task learning to explicitly model the "when→what→where", a.k.a. "time→activity→location", decision logic. We further propose a spatial-constrained loss function as an auxiliary task to ensure consistency between the predicted and actual spatial distributions of travelers' destinations. Moreover, CSLSL adopts modules named Long and Short-term Capturer (LSC) to learn the transition regularities across different time spans. Extensive experiments on three real-world datasets show promising performance improvements of CSLSL over baselines and confirm the effectiveness of introducing the causality and consistency constraints. The implementation is available at https://github.com/urbanmobility/CSLSL.


Introduction
Human mobility modeling aims to explore the regularities and patterns of human behavior [1,2] and plays a significant role in numerous applications, such as urban planning [3], travel demand management [4,5], health risk assessment [6], and epidemic spreading modeling and control [7-9]. In the big data era, the accessibility of GPS traces, mobile phone records, and location-based social networks (LBSNs) provides an unprecedented chance to understand and model human mobility [2,10].
In the research community of human mobility, physicists focus on statistical analysis from a macroscopic perspective and have summarized empirical rules [2]. For example, they found that a truncated power-law distribution fits the displacement distribution well [1], and that, despite significant differences in travel patterns, the mobility behaviors of a majority of users are predictable [11]. Computer scientists, on the other hand, prefer to model the transition regularities from location sequences, using Markov models [12], recurrent neural networks (RNNs) [13], etc. In summary, statistical physicists study collective behavior at the population level, while deep learning methods emphasize modeling individual travel trajectories. Thus, we can expect that integrating physical domain knowledge into a deep learning model encourages the model to pay attention to group behaviors and improves its performance at the population level.
Here we place our emphasis on next location prediction, a vital task in human mobility modeling at the individual level [14]. A body of work leverages machine learning methods to tackle this problem due to the sequential nature of mobility behavior. A common thread of these studies is to efficiently capture behavior patterns from sparse data [10,15-17]. Traditional methods mainly adopt Markov chains to model transition probability matrices across locations, along with techniques like factorization [12,18,19] or metric embedding [20]. In recent years, deep learning methods have gained increasing attention in next location prediction as the recurrent neural network (RNN) has demonstrated its capability to capture sequential dependency. To model multi-scale spatio-temporal periodicity, researchers designed attention or gate mechanisms and introduced time and distance interval information [13,21-24]. A few studies also incorporate semantic information such as location categories to cope with data sparsity [16,25,26]. However, methods that capture dependencies only from location sequences struggle to fully fit complex human travel behaviors, especially with sparse data.
To tackle this challenge, we seek to integrate physical knowledge into deep learning methods to enhance the capability of human mobility prediction. Specifically, we propose two physical constraints. The first one is summarized as the "when→what→where" causal relationship. "When", "what", and "where" are the three core elements of human travel behavior, and the dependencies between them can explain the motivation of location transfer. For example, as shown in Fig. 1(a), people have specific demands at different times, causing the shifts between locations. Considering causal dependencies enables more comprehensive modeling of human mobility. The second constraint is the macro-statistical characteristics reflecting group behavior. Figure 1(b) illustrates the deviation of the displacement distribution modeled by an LSTM from the true distribution in New York City and Tokyo, suggesting that the LSTM is biased toward shorter, higher-frequency trips. Ensuring consistency between the model output and the macro-statistical characteristics is expected to improve the model's capability to fit travel behavior. We summarize these two constraints as the causality and consistency constraints and incorporate them into deep learning models.
To this end, we propose a Causal and Spatial-constrained Long and Short-term Learner (CSLSL), a model integrating the decision logic and the consistency constraints of human mobility modeling. To model the "when→what→where" decision logic, we introduce a causal structure in CSLSL. Based on multi-task learning, the causal structure utilizes three similar network branches to model the regularities of time, activity, and location, respectively. In line with the "when→what→where" logic, we explicitly build connections between the three branches in the causal structure. As for consistency, we exploratively propose a spatial-constrained loss to reduce the distance between the predicted and actual locations, and thus indirectly ensure the consistency of the spatial density distribution. In addition, we adopt a Long and Short-term Capturer (LSC) to learn the transition regularities across different time spans.
The main contributions of this work are summarized as follows:
• We propose CSLSL to integrate the travel decision logic and the macro-statistical consistency for human mobility modeling. To the best of our knowledge, CSLSL is the first model to learn the causality and consistency constraints for next location prediction.
• We introduce a causal structure that can capture not only the separate regularities of time, activity, and location, but also the "when→what→where" causal dependencies. In this way, CSLSL models more essential travel logic in addition to sequence relationships.
• To ensure consistency in the spatial distribution, we propose a spatial-constrained loss to reduce the gap between the predicted and actual destinations.
• We evaluate CSLSL on three real-world datasets to confirm the performance improvements. We also conduct ablation studies and visualization analyses of results such as the displacement distribution to demonstrate the effectiveness of our design.

Related work

Next location prediction
Here we classify the approaches to the next location prediction problem into two categories: traditional and deep learning methods. Traditional methods mainly apply Markov chains (MC) and focus on constructing a better location transition probability matrix [12,18,20,27]. For instance, the factorized personalized Markov chain (FPMC) combines the matrix factorization technique with Markov chains to learn users' personalized transition matrices [12]. The limitation of MC-based methods lies in the difficulty of capturing long-term and high-order regularity [16,17]. Deep learning methods have the advantages of learning dense representations and complex dependencies. Recently, RNN-based methods have shown promising performance in mining sequential information. A popular scheme of deep learning methods is incorporating time and distance intervals to assist the model in learning the spatio-temporal regularities of human mobility. Specifically, these methods integrate spatio-temporal information into hidden state transitions [28], gate mechanisms [21,22,29], or self-attention mechanisms [15,17], exploiting spatio-temporal contexts in an implicit manner. To leverage spatio-temporal contexts explicitly, researchers used spatial and temporal factors as attention weights to select the historical hidden states [24,30]. Another scheme emphasizes the long-term patterns of human behavior, such as DeepMove [13] and LSTPM [23], which introduce two different components to model long-term and short-term preferences respectively. Yet another scheme utilizes semantic information such as location categories to improve the performance of location prediction [16,26,31]. However, methods that focus on modeling location transfer patterns in sequences cannot effectively capture complex human decision logic. In our work, we propose a causal structure to explicitly capture the "when→what→where" decision logic.

Time- or activity-jointed location prediction
Methods that jointly predict time or activity learn knowledge from related tasks to improve the prediction performance for location. RMTPP [32] combines an RNN and a temporal point process (TPP) to jointly model time and location information. He et al. [25] proposed a two-fold approach that predicts the category with the Bayesian Personalized Ranking (BPR) technique and then predicts the category-based location. Krishna et al. [33] utilized two distinct LSTM networks to predict activities and durations. DeepJMT [34] fuses spatio-temporal information and social context to predict time and location with a hierarchical RNN and the TPP technique. Sun et al. [35] proposed a hybrid LSTM and a sequential LSTM with a self-attention mechanism to jointly model location and travel time. The limitation of these approaches is that they attempt to implicitly and passively learn the correlation between time, category, and location information, whereas this relationship is explicit and can be directly exploited. In contrast, CSLSL explicitly models the causal dependencies between time, category, and location information through two structural designs.

Statistical physics-informed human mobility modeling
Explicitly integrating knowledge of statistical physics contributes to guiding model optimization and improving the performance of machine learning methods. On the task of trajectory generation, researchers introduced knowledge of statistical physics to constrain the macroscopic performance of their models, such as the individual trajectory generation models TimeGeo [36] and DITRAS [37], and the flow generation model DeepGravity [38]. Unlike the trajectory generation task, only a limited amount of work on individual mobility prediction incorporates knowledge of statistical physics. Zhao et al. [30] integrated domain knowledge, specifically a power-law decay for distances and an exponential decay for time intervals, into an attention mechanism to adjust the impact of historical information on the current prediction. However, deep learning-based methods focus more on fitting individual behavior while neglecting group behavior constraints described by macro-statistical characteristics, for example, the regional attractiveness of city blocks. The spatial distribution of model predictions should be consistent with the actual statistical distribution, and predicted locations closer to the actual location are preferable. Toward this, we propose a spatial-constrained loss function to narrow the distance between the predicted and actual locations, thereby ensuring the consistency of the spatial distribution.

Problem formulation
A person's travel behavior can be represented as a sequence of locations, associated with timestamps and a user ID. In LBSN datasets, each location is also associated with its functional category to support the analysis of user activity. Let $U = \{u_1, \ldots, u_{|U|}\}$, $L = \{l_1, \ldots, l_{|L|}\}$ and $C = \{c_1, \ldots, c_{|C|}\}$ denote the sets of users, locations, and functional categories, respectively. Each location $l_i$ is associated with its category and geographical coordinates $(c_i, lat_i, lon_i)$.
Definition 1 (Record) A record $r$ is a 3-tuple $(u_i, l_j, t_k)$, representing that user $u_i$ visited location $l_j$ at time $t_k$, where $u_i \in U$ and $l_j \in L$.
Definition 2 (Individual Trajectory) A person's trajectory is defined as a record sequence $R = \{r_1, r_2, \ldots, r_{|R|}\}$, which consists of all of the person's records arranged in chronological order. Note that the time interval between two consecutive records is heterogeneous due to irregular travel behavior.

Definition 3 (Session) A session $S$ is a subsequence of records in a time slot. One user's trajectory $R$ can be split into a series of sessions with various strategies. For example, DeepMove adopts a specific time interval between two consecutive records to split the trajectory [13]. Other strategies segment users' trajectories using a fixed number of records [17,24] or a meaningful time window such as days or weeks [23]. We define the session containing the prediction target as the short-term session $S_p$ and the previous historical sessions as long-term sessions $\{S_q\}$, $q \in \{1, \ldots, p-1\}$.
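As a concrete illustration, a weekly session split can be sketched as follows (a minimal sketch; the function and variable names are our own and not taken from the paper's released implementation):

```python
from collections import defaultdict

def split_into_sessions(records, window=7 * 24 * 3600):
    """Group chronologically ordered records into fixed-window sessions.

    `records` holds (user_id, location_id, unix_timestamp) tuples for one
    user; the window defaults to one week.
    """
    sessions = defaultdict(list)
    for record in records:
        _, _, ts = record
        sessions[ts // window].append(record)
    # Sessions in chronological order: the last one is the short-term
    # session S_p, the preceding ones are the long-term sessions {S_q}.
    return [sessions[k] for k in sorted(sessions)]

# Toy trajectory: three records in week 0, one record in week 1.
week = 7 * 24 * 3600
trajectory = [("u1", "l1", 100), ("u1", "l2", 200), ("u1", "l3", 300),
              ("u1", "l1", week + 50)]
sessions = split_into_sessions(trajectory)
```

Other splitting strategies from the definition (fixed record counts, inter-record gaps) would only change the grouping key.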
The location prediction problem is formulated as: given a user's record sequence $R_{t-1} = \{r_1, \ldots, r_{t-1}\}$, the goal is to predict where the user $u$ is most likely to go on her next trip. We use $\hat{l}_t$ to denote the predicted next location. Note that the timestamp of the next trip $t$ is also unknown.

Methodology
In this section, we first analyze the causality and consistency constraints in human mobility modeling, and then elaborate on the design of the proposed model, the Causal and Spatial-constrained Long and Short-term Learner (CSLSL). The architecture of CSLSL is illustrated in Fig. 2. It mainly consists of two parts: an embedding part for learning the representations of arrival time, category, and location from users' recent and historical records; and a second part for learning the regularities of mobility behavior in a causal module based on multi-task learning and making predictions.

Causality and consistency constraints
A common practice for next location prediction is to discover similar subsequences or location transition relationships from historical records. This is accomplished by integrating context information such as distance or time intervals into attention- or RNN-centered frameworks [17,22-24]. We can formulate this mainstream scheme as $P(\hat{l}_t | R_{t-1})$, where $R_{t-1}$ is the historical record sequence. Another scheme adopts multi-task learning techniques to jointly predict the next location with time or activity [34,35], formulated as $P(\hat{l}_t, \hat{c}_t, \hat{t} | R_{t-1}) = P(\hat{l}_t | R_{t-1}) P(\hat{c}_t | R_{t-1}) P(\hat{t} | R_{t-1})$, where we assume that the location category can approximate the type of activity. Although these two schemes combine contextual information to capture hidden regularities of location transition, they ignore the causal dependencies in the context information.

Figure 2
The architecture of the proposed CSLSL model. It considers both long-term and short-term travel preferences and applies three branches with well-designed interconnections to explicitly model the "when→what→where" decision logic.
As aforementioned, we regard "when", "what", and "where" as three crucial elements to describe human mobility [39,40]. "When" refers to the time the trip takes place, e.g. "midday". "What" tells about the activities people participate in and also explains the reason for the trip, such as "having lunch". "Where" is the destination of the trip, like "steakhouse". Periodic activities exist in human mobility and occur at specific times and places, such as going to work in the morning and going to a restaurant for lunch, which reveals the correlation between the three elements. At a specific timestamp, we have various activity choices, but we are accustomed to doing certain activities at certain times, such as going to the gym in the evening. Similarly, one activity (category) corresponds to multiple locations (POIs), while one location ID corresponds to only one activity, which is also reflected in the dataset. Moreover, since our target is location prediction, location should be the final subtask so that it can leverage the predicted time and activity information. Therefore, we summarize a "when→what→where", a.k.a. "time→activity→location", causal relationship, which is in line with the coarse-to-fine logic of human decisions. The proposed scheme can be formulated as:

$$P(\hat{l}_t, \hat{c}_t, \hat{t} | R_{t-1}) = P(\hat{t} | R_{t-1}) P(\hat{c}_t | \hat{t}, R_{t-1}) P(\hat{l}_t | \hat{c}_t, \hat{t}, R_{t-1}) \quad (1)$$

The scheme explicitly models the dependencies between time, activity, and location, and alleviates the difficulty of location prediction. For example, people are accustomed to going to restaurants at midday instead of bars, that is, $P(\hat{c}_t = \text{restaurant} | \hat{t} = \text{midday}, R_{t-1}) > P(\hat{c}_t = \text{bar} | \hat{t} = \text{midday}, R_{t-1})$. Each individual has her personalized $P(\hat{l}_t | \hat{c}_t = \text{restaurant}, \hat{t} = \text{midday}, R_{t-1})$, and this causally constrained location distribution is easier to learn than $P(\hat{l}_t | R_{t-1})$. In CSLSL, we introduce a causal structure to implement the scheme. In experiments, we also demonstrate that "time→activity→location" outperforms "activity→time→location" and "time, activity, location", which does not involve logical connections.

On the other side, integrating physical knowledge provides more information and prior constraints to guide the optimization of deep learning models [41,42]. In human mobility modeling, one can expect that properly introducing physical laws and domain knowledge would narrow the gap between the output of deep learning-based approaches and the observed macro-statistical characteristics of human behavior. Due to the difficulty of applying statistical constraints directly in the training of deep learning models, here we consider geographic spatial consistency in an indirect way. Specifically, we devise a loss function to constrain the distance between the predicted and actual locations. That is, the closer the predicted location is to the ground truth, the smaller the loss. In this way, we can indirectly ensure the consistency of the displacement distribution and of the spatial distribution of travelers' destinations.
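To illustrate why the "time→activity→location" conditioning helps, the toy example below estimates the factorized conditionals from a handful of synthetic check-ins by simple counting (the data and names are invented for illustration; CSLSL learns these conditionals with neural networks rather than count tables):

```python
from collections import Counter

# Toy check-in history of one user: (time_slot, activity, location) triples.
history = [
    ("midday", "restaurant", "steakhouse_A"),
    ("midday", "restaurant", "steakhouse_A"),
    ("midday", "restaurant", "noodle_B"),
    ("evening", "bar", "bar_C"),
    ("evening", "gym", "gym_D"),
]

t_counts = Counter(t for t, _, _ in history)
tc_counts = Counter((t, c) for t, c, _ in history)
tcl_counts = Counter(history)

# Maximum-likelihood estimates of the two conditionals in the factorization.
p_c_given_t = {(t, c): n / t_counts[t] for (t, c), n in tc_counts.items()}
p_l_given_tc = {(t, c, l): n / tc_counts[(t, c)]
                for (t, c, l), n in tcl_counts.items()}

# Conditioning on "midday" collapses the activity choice to "restaurant"
# and leaves only two candidate locations instead of four.
```

The conditioned location distribution has far fewer plausible candidates than the unconditional one, which is exactly the "easier to learn" effect argued above.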

Long and short-term capturer
Human travel behavior has long- and short-cycle repetitive patterns, such as going to work every day and going to the supermarket once a week. Inspired by DeepMove [13] and LSTPM [23], we devise a Long and Short-term Capturer (LSC) to learn the behavioral patterns in different observation cycles. In the whole framework shown in Fig. 2, we apply three LSCs to model the time, activity, and location sequences, respectively.
Let $e^l \in \mathbb{R}^{d_l}$, $e^c \in \mathbb{R}^{d_c}$, $e^t \in \mathbb{R}^{d_t}$ and $e^u \in \mathbb{R}^{d_u}$ denote the embedded representations of location, category, time, and user, respectively. Given a historical record sequence $R$, CSLSL embeds each record as $(e^l, e^c, e^t, e^u)$ in hidden spaces. Note that we first convert the continuous timestamp into the hour of the day $t^h$ and the day of the week $t^d$ to represent the daily and weekly periodicity, so that $e^t = e^{t^h} \oplus e^{t^d}$. The representations of a record are then concatenated together, $e_r = e^l \oplus e^c \oplus e^t \oplus e^u$. We next split each user's record sequence into multiple sessions with a certain time window, such as days or weeks. The records in the short-term session and long-term sessions are represented as $e_{r_p} = \{e_{r_1}, \ldots, e_{r_{t-1}}\}$ and $\{e_{r_q}\} = \{e_{r_1}, \ldots, e_{r_{p-1}}\}$, respectively. Our proposed LSC consists of two capturers that learn the transition regularities in the short-term and long-term sessions, respectively, as shown in Fig. 3. We formulate LSC as:

$$h_{|S_{p-1}|} = \mathrm{Capturer}_{long}(\{e_{r_q}\}, h_0), \qquad h_t = \mathrm{Capturer}_{short}(e_{r_p}, h_{|S_{p-1}|}) \quad (2)$$

where $h_0$ is the initial hidden state. In the LSC structure, the short-term capturer takes the hidden state $h_{|S_{p-1}|}$ of the long-term capturer as its initial hidden state to incorporate the historical information. Because the GRU is simple yet efficient in modeling temporal data, we apply a layer of GRU in both capturers:

$$h_i = \mathrm{GRU}(e_{r_i}, h_{i-1}) \quad (3)$$

where $i \in \{1, 2, \ldots, |S_{p-1}|\}$ for the long-term capturer and $i \in \{1, 2, \ldots, t\}$ for the short-term capturer.
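The hidden-state handoff between the long- and short-term capturers can be sketched with a toy NumPy GRU cell (the weight scales, dimensions, and function names are our own assumptions; the paper's implementation uses trained GRU layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16  # toy record-embedding and hidden-state sizes

def make_gru(d_in, d_h):
    # Random weights for the update (z), reset (r), and candidate (n) gates.
    shapes = {"Wz": (d_h, d_in), "Uz": (d_h, d_h), "bz": (d_h,),
              "Wr": (d_h, d_in), "Ur": (d_h, d_h), "br": (d_h,),
              "Wn": (d_h, d_in), "Un": (d_h, d_h), "bn": (d_h,)}
    return {k: 0.1 * rng.standard_normal(s) for k, s in shapes.items()}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(p, x, h):
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    n = np.tanh(p["Wn"] @ x + p["Un"] @ (r * h) + p["bn"])
    return (1 - z) * h + z * n  # interpolate between old state and candidate

def run_capturer(p, seq, h0):
    h = h0
    for x in seq:
        h = gru_step(p, x, h)
    return h

long_gru, short_gru = make_gru(d_in, d_h), make_gru(d_in, d_h)
long_records = [rng.standard_normal(d_in) for _ in range(5)]   # history {e_rq}
short_records = [rng.standard_normal(d_in) for _ in range(3)]  # current e_rp

# The long-term capturer starts from zeros; its final hidden state seeds
# the short-term capturer, so history flows into the current session.
h_hist = run_capturer(long_gru, long_records, np.zeros(d_h))
h_t = run_capturer(short_gru, short_records, h_hist)
```

The only coupling between the two capturers is the initial-state handoff, which is the design point of LSC.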

Causal structure
To model the "time→activity→location" logical relationship discussed in Sect. 4.1, we introduce a causal structure based on multi-task learning techniques. As illustrated in Fig. 2, we utilize three branches with the same architecture to model the change patterns of time, activity, and location, respectively. Specifically, in each branch, we convey the same record representations to the LSC module and then transfer the output hidden states to the predictor. To explicitly model the summarized causal relation in human travel behavior, we design two paths for information transfer between the tasks. The first path lies between two LSC modules, passing on the task-specific hidden states. The second path lies between two predictors; in this path, the predicted result of the upstream task is processed by the converter and then conveyed to the downstream task. Here we use fully connected layers as the predictor (P) and converter (C), that is, $y = \mathrm{Linear}(x) = Wx + b$. Mathematically, the branch of "time" is formulated as:

$$h^t_t = \mathrm{LSC}^t(e_r), \qquad \hat{t} = \mathrm{P}^t(h^t_t) \quad (4)$$

where $h^t_t$ is the hidden state of the next time, and $\hat{t}$ is the predicted time. As the downstream task of "time" in the causal structure, the branch of "activity" can be formulated as:

$$h^c_t = \mathrm{LSC}^c(e_r, h^t_t), \qquad \hat{c}_t = \mathrm{P}^c(h^c_t \oplus \mathrm{C}^t(\hat{t})) \quad (5)$$

where $h^c_t$ is the hidden state of the next activity, and $\hat{c}_t$ is the predicted activity. Eventually, we can formulate the branch of "location" as:

$$h^l_t = \mathrm{LSC}^l(e_r, h^c_t), \qquad \mathbf{l}_t = \mathrm{P}^l(h^l_t \oplus \mathrm{C}^c(\hat{c}_t)), \qquad \hat{l}_t = \arg\max(\sigma(\mathbf{l}_t)) \quad (6)$$

where $h^l_t$ is the hidden state of the next location, $\mathbf{l}_t$ is the distribution over predicted locations, and $\hat{l}_t$ is the predicted location.
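The predictor/converter information flow described above can be sketched with untrained linear layers (the dimensions, names, and exact concatenation scheme are illustrative assumptions based on the prose description, not the released implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, n_cat, n_loc = 16, 4, 10  # toy hidden size, #categories, #locations

def linear(d_out, d_in):
    # A fully connected layer y = Wx + b with random, untrained weights.
    return 0.1 * rng.standard_normal((d_out, d_in)), np.zeros(d_out)

def apply(layer, x):
    W, b = layer
    return W @ x + b

# Predictors (P) turn hidden states into predictions; converters (C)
# re-embed the upstream prediction for the downstream predictor.
P_time, C_time = linear(1, d_h), linear(d_h, 1)
P_cat, C_cat = linear(n_cat, 2 * d_h), linear(d_h, n_cat)
P_loc = linear(n_loc, 2 * d_h)

# Stand-ins for the hidden states emitted by the three LSC branches.
h_time, h_cat, h_loc = (rng.standard_normal(d_h) for _ in range(3))

t_hat = apply(P_time, h_time)                                             # when
c_logits = apply(P_cat, np.concatenate([h_cat, apply(C_time, t_hat)]))    # what
l_logits = apply(P_loc, np.concatenate([h_loc, apply(C_cat, c_logits)]))  # where
l_hat = int(np.argmax(l_logits))
```

The chain makes the causal ordering concrete: each downstream prediction consumes both its own branch's hidden state and the converted output of the upstream task.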

Spatial-constrained loss function
As discussed in Sect. 4.1, to seek an alignment of the spatial distribution of destinations, we propose a spatial-constrained loss function to shorten the distance from the predicted location to the ground truth at the individual level. The distance constraint can be regarded as a self-supervised auxiliary task, integrating the geographical information and restricting the candidate set for better next location prediction. Existing methods introduce distance constraints in a regression subtask, directly predicting the geographical coordinates of destinations [43]. However, in the classification scheme, we must query the coordinates of location IDs to calculate their distance, and this operation is not differentiable. We draw inspiration from REINFORCE [44], which introduces the reward into the loss function to train a policy network, and consider the distance error as a coefficient that weights the cross-entropy between the ground truth and the predicted location ID by their physical distance. The spatial-constrained loss function is defined as:

$$\mathcal{L}_s = -\frac{1}{N} \sum_{n=1}^{N} d(\hat{l}^{(n)}_t, l^{(n)}_t) \log \sigma(\mathbf{l}^{(n)}_t)_{l^{(n)}_t} \quad (7)$$

where $N$ is the total number of records, $\sigma$ is the softmax function, and $d(\cdot, \cdot)$ is the geographical distance between the predicted and actual locations.
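One plausible reading of this loss, sketched in NumPy with the haversine distance as the physical distance (the batch layout and helper names are our assumptions; as in REINFORCE, the distance weight is treated as a constant, with no gradient flowing through the argmax):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def spatial_constrained_loss(logits, targets, coords):
    """Distance-weighted cross-entropy over a batch.

    logits: (N, |L|) location scores, targets: (N,) true location ids,
    coords: (|L|, 2) latitude/longitude per location id.
    """
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)          # row-wise softmax
    preds = logits.argmax(axis=1)                       # predicted location ids
    dist = haversine_km(coords[preds, 0], coords[preds, 1],
                        coords[targets, 0], coords[targets, 1])
    nll = -np.log(probs[np.arange(len(targets)), targets] + 1e-12)
    # The distance to the ground truth scales the cross-entropy, so far-off
    # predictions are penalized more than nearby ones.
    return np.mean(dist * nll)

# Toy batch of two records over three locations in New York.
coords = np.array([[40.70, -74.00], [40.75, -73.98], [40.90, -73.80]])
logits = np.array([[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]])
targets = np.array([1, 0])  # both predictions miss the ground truth
loss = spatial_constrained_loss(logits, targets, coords)
```

Note that under this reading a spatially correct prediction contributes zero to the auxiliary loss, which is harmless because the standard cross-entropy term below still drives the classification.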
We next employ the MAE loss for time prediction and the cross-entropy loss for category and location prediction:

$$\mathcal{L}_t = \frac{1}{N} \sum_{n=1}^{N} |\hat{t}^{(n)} - t^{(n)}|, \qquad \mathcal{L}_* = -\frac{1}{N} \sum_{n=1}^{N} \log \sigma(\mathbf{*}^{(n)}_t)_{*^{(n)}_t}, \quad * \in \{c, l\} \quad (8)$$

Thus, the total loss function can be written as

$$\mathcal{L} = \mathcal{L}_l + \lambda_t \mathcal{L}_t + \lambda_c \mathcal{L}_c + \lambda_s \mathcal{L}_s \quad (9)$$

where $\lambda_t$, $\lambda_c$, and $\lambda_s$ are the weights of the corresponding loss terms.

Structure comparison
There are various strategies for task combination in multi-task learning, such as the share-bottom structure [45,46], hierarchical structure [34,47], and multi-expert structure [46,48]. Inspired by these structures, we propose five variants, as shown in Fig. 4, to demonstrate the advantages of our causal structure. Note that these variants use the same basic components as CSLSL, such as GRUs and fully connected layers.
Long and Short-term Learner (LSL) is a basic approach with only one branch to predict location. To jointly predict "time", "activity", and "location", Share-Bottom LSL (SBLSL) introduces two additional predictors that share the same bottom LSC module with the original one. Multi-Expert LSL (MELSL) is an advanced version of SBLSL, with a structure similar to the Mixture of Sequential Experts (MoSE) [46]. MELSL employs several GRUs as experts to focus on different aspects of sequence dependencies and gate networks to combine the relevant aspects for each task.
Unlike the share-bottom structure, Separate LSL (SLSL) employs a separate branch for each task, and the only information shared between tasks is the same record representations. Considering the dependencies between tasks, Hierarchical LSL (HLSL) concatenates the record embedding and the output hidden state of the upstream task as the input of the downstream task. Thus equation (3) changes to:

$$h^k_i = \mathrm{GRU}([e_{r_i}, h^{k-1}_i], h^k_{i-1}) \quad (10)$$

where $h^k_i$ is the hidden state of the $k$-th task at the $i$-th time step, and equation (2) changes accordingly, with the concatenated input $[e_{r_p}, h^{k-1}]$ replacing $e_{r_p}$ for the short-term capturer.

Data description
We leverage three publicly available check-in datasets in the experiments: two datasets from Foursquare [49] in New York (NYC) and Tokyo (TKY), and one dataset from Gowalla [50] in Dallas. To prepare data for the baselines and the proposed models, we first filter out both users and locations with fewer than 10 records, in line with previous work [13,23]. We then merge consecutive records with the same user and location on the same day. The statistical information of the raw and processed data is depicted in Table 1. After pre-processing, the numbers of categories for NYC and TKY are reduced to 308 and 286. For CSLSL and its variants, we split trajectories into sessions by week due to the data sparsity. In addition, we require that each session contains at least two records and each user has at least five sessions to guarantee a training/testing split of 8/2, following [13]. All baselines have their own further data preparation strategies, and the model-specific dataset information is also shown in Table 1. It is noteworthy that LSTPM [23] requires at least three records in each session and Flashback [24] limits the minimum number of records per user to 100. These practices filter out more sparse data and reduce the challenge of prediction. Moreover, GETNext requires category information as input, so it cannot work on the Dallas dataset.
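The two pre-processing steps can be sketched as follows (a simplified, single-pass sketch with invented toy records; the released code may iterate the filtering or order records differently):

```python
from collections import Counter

def preprocess(records, min_count=10):
    """Filter sparse users/locations, then merge same-day repeats.

    `records` is a chronologically ordered list of
    (user_id, location_id, day_index) tuples.
    """
    # Step 1: drop users and locations with fewer than `min_count` records.
    user_counts = Counter(u for u, _, _ in records)
    loc_counts = Counter(l for _, l, _ in records)
    kept = [r for r in records
            if user_counts[r[0]] >= min_count and loc_counts[r[1]] >= min_count]
    # Step 2: merge consecutive records that share user, location, and day.
    merged = []
    for r in kept:
        if merged and merged[-1] == r:
            continue
        merged.append(r)
    return merged

# Toy data: a dense user with one same-day repeat, plus one sparse user.
records = ([("u1", "l1", d) for d in range(11)]
           + [("u1", "l1", 10), ("u2", "l2", 0)])
cleaned = preprocess(records)  # sparse user dropped, repeat merged
```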

Baselines and settings
Baselines. We compare CSLSL with the following state-of-the-art baselines:
• FPMC [12] is a Markov-based model that uses factorization to learn individual transition matrices.
• DeepMove [13] adopts an attention mechanism to learn long-term preference and a GRU module to capture short-term preference.
• Flashback [24] uses spatio-temporal distances as attention weights to search the historical hidden states for the current prediction.
• LSTPM [23] considers a temporal similarity and distance factor to model long-term preferences and geographical relevance to model short-term preferences.
• GeoSAN [15] designs a geography encoder to implicitly capture spatial proximity and introduces a loss function based on importance sampling to better use the informative negative samples.
• STAN [17] introduces a two-layer attention architecture with spatio-temporal relation matrices to explicitly capture the spatio-temporal correlations.
• GETNext [51] utilizes a GCN to integrate collective movement patterns and a transformer encoder to capture transition regularities. Besides, it introduces location categories as inputs and prediction targets.
Settings. To convincingly compare these baselines with our CSLSL, we collected the open-source code released by the authors and attempted to find the optimal hyperparameters in the experiments. It is worth noting that most of the baselines only predict the next location, without the category and time of visitation. Thus, we match the predicted location ID to its category for comparison and exclude the comparison of time prediction. Besides, we split users' trajectories by day and by week for the FPMC model, referred to as FPMC-D and FPMC-W, respectively. For CSLSL, the dimensions of the representation vectors $e^l$, $e^c$, $e^{t^h}$, $e^{t^d}$ and $e^u$ are set to 200, 100, 10, 20, and 20 for all datasets. The dimension of the hidden state in all GRUs is set to 600. We use the Adam optimizer with a learning rate of 0.0001, and $\lambda_t$, $\lambda_c$, and $\lambda_s$ are all set to 10.
Metrics. In the next location prediction task, what we care about is whether the actual location is among the top $N$ of our predictions, with $N \in \{1, 5, 10\}$. Recall@$N$ is the most commonly used metric and is equal to Accuracy@$N$ here because there are no false positives (FP) or true negatives (TN). Recall@$N$ is defined as:

$$\mathrm{Recall@}N = \frac{1}{|U|} \sum_{u \in U} \frac{|L^T_u \cap L^P_u|}{|L^T_u|}$$

where $L^T_u$ and $L^P_u$ are the target and top-$N$ predicted location sets, respectively.
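A simplified per-record variant of this metric can be sketched as follows (variable names are our own; here each record contributes a single target location):

```python
def recall_at_n(targets, ranked_predictions, n):
    """Fraction of records whose true location is in the top-n candidates."""
    hits = sum(t in preds[:n] for t, preds in zip(targets, ranked_predictions))
    return hits / len(targets)

# Four records with score-ordered candidate lists.
targets = ["l1", "l2", "l3", "l4"]
ranked = [["l1", "l9", "l8"],   # hit at rank 1
          ["l7", "l2", "l5"],   # hit at rank 2
          ["l6", "l5", "l4"],   # miss within top 3
          ["l4", "l1", "l2"]]   # hit at rank 1
r1 = recall_at_n(targets, ranked, 1)   # 2 of 4 records hit at rank 1
r3 = recall_at_n(targets, ranked, 3)   # 3 of 4 records hit within top 3
```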

Performance comparison with baselines
The experimental results are averaged over 10 independent runs and shown in Table 2.
For each city, the results are presented in three pieces, representing the results of the baselines (lines 1-8), variants (lines 9-14), and ablations (lines 15-18), respectively. The best performance in each column is highlighted in bold text and the second best is underlined. For NYC and TKY, we present the prediction results for categories and locations, while for Dallas, we only show the location prediction results due to the lack of category information.
From the experimental results, we can observe that the proposed CSLSL shows promising performance compared with the baselines. In terms of Recall@1 in location prediction, CSLSL achieves 27%, 37%, and 43% averaged performance improvements over the deep learning baselines on the three datasets. For Recall@1 in category prediction, the improvements are 34% and 23% in NYC and TKY, respectively. Considering Recall@{5, 10} in location prediction, CSLSL achieves performance similar to LSTPM and Flashback, which filter out more than 46%, 18%, and 57% more sparse users than we do on the three datasets, as shown in Table 1. These similar performances under more challenging dataset settings also reflect the superiority of our model. Disregarding these two models, CSLSL still obtains over 20% averaged improvements over the rest of the deep learning baselines. Moreover, CSLSL improves Recall@1 by 8.9% and 11.1% in NYC and TKY compared with GETNext, which has dataset statistics similar to ours. The poor performance of all models on the Dallas dataset may be due to data sparseness. Even so, CSLSL can still utilize the time-location relationship and the spatial constraints to achieve performance gains.
Among the baselines, we observe that the overall performance of location prediction on TKY is lower than that on NYC. This is probably because the TKY dataset has a larger number of users and locations than NYC, increasing the difficulty of mobility prediction. However, compared with the baselines, CSLSL obtains a larger performance improvement for location prediction on TKY than on NYC. For instance, the performance of CSLSL improves by 8.9% on NYC compared with GETNext, while the improvement is 11.1% on TKY. On the other side, the category predictions of all models are more accurate on TKY than on NYC. We may conclude that the larger performance improvement on TKY than on NYC mainly owes to the proper modeling of the dependencies.

Performance comparison with variants
To fairly demonstrate the effectiveness of the proposed causal structure, we develop an ablated version of CSLSL by dropping the spatial-constrained loss, namely CLSL, and compare it with the five variants discussed in Sect. 4.5. Moreover, we also consider the "what→when→where" relationship; thus we change the order of the three branches in CLSL from "time→category→location" to "category→time→location" and name this variant CLSL-ctl. We present the results of the variants and CLSL in the second and third pieces of Table 2. The category prediction results of LSL are obtained in the same way as for the baselines.
Compared with LSL, SBLSL has a similar performance in location prediction and slightly improved performance in category prediction, suggesting that the shared bottom of SBLSL has indeed learned the category transfer regularities. However, these learned regularities make no contribution to location prediction. Besides, the performance of MELSL is weaker than LSL and SBLSL, which may be because MELSL does not clarify the relationship between tasks and its experts cannot find suitable optimization directions. The better performance of SLSL over SBLSL indicates that separate modules for learning transition relationships are better than a shared one. HLSL achieves the best performance in the second piece of Table 2, suggesting that there are dependencies between time, category, and location, and that capturing these dependencies facilitates location prediction.
These variants all perform worse than CLSL, suggesting that although they utilize temporal and categorical information, they cannot effectively and autonomously capture the dependencies between time, category, and location. In contrast, the causal structure explicitly captures the dependencies between tasks in two ways, thereby fully exploiting them to improve performance. Moreover, the better performance of CLSL over CLSL-ctl is in line with expectations, because location depends more strongly on category than on time, and category information can thus bring more performance gains for location prediction. Therefore, our proposed causal structure explicitly models "time→category→location" rather than "category→time→location".

Ablation study
We also conduct ablation studies to examine the contributions of different components in CSLSL. The ablated models include:
• LSL: the version that only keeps the location branch.
• CLSL: the version that removes the spatial-constrained loss.
• CSLSL-t: the version that removes the time branch.
• CSLSL-c: the version that removes the category branch.
As shown in Table 2, we can find that CSLSL-t achieves better performance than CSLSL-c, indicating that the "category→location" relationship imposes stronger dependency constraints than "time→location". This result is also consistent with the discussion in Sect. 5.4. The best performance of the complete CSLSL demonstrates the significance of the entire "time→category→location" decision logic. Comparing the performance of CLSL and CSLSL, we can confirm that the spatial-constrained loss function has a positive impact on performance. Moreover, LSL achieves decent performance compared with the baselines, probably because it leverages category information and its LSC module is capable of capturing long-term and short-term preferences.
Figure 5 Effect analysis of the causal structure. The successfully predicted locations are divided into two parts based on whether the category prediction is accurate. The NYC dataset exhibits higher accuracy in location prediction than the TKY dataset, while category prediction is easier on TKY. CSLSL outperforms GETNext on both datasets, with a larger performance improvement on TKY (18%) than on NYC (10%)

Results visualization analysis
We conduct a visualization analysis of the results to further understand the effectiveness of the causal structure and the spatial-constrained loss. For the causal structure, we compare the category and location prediction results of GETNext and CSLSL, as shown in Fig. 5. The successfully predicted locations are divided into two parts in the figure based on whether the category prediction is accurate. We can observe that for CSLSL, the records with both category and location successfully predicted account for 21% and 20% of all records on NYC and TKY, respectively. Compared with GETNext, CSLSL successfully predicts 10% and 18% more locations with accurately predicted categories on NYC and TKY, respectively. This intuitively indicates that the causal structure can enhance location prediction with more accurate category prediction results. Interestingly, a location can be predicted correctly even with an unsuccessfully predicted category. This is because the category information is introduced as auxiliary information without imposing mandatory constraints on the location prediction.
To further explore the relationship between categories and location prediction, we examine the accuracy of location predictions for different categories, as depicted in Fig. 6. The category classification is derived from the Foursquare platform. The results exhibit varying levels of predictability across categories. For instance, the Community and Government category shows higher accuracy, while Retail demonstrates lower accuracy. This disparity may be attributed to the complex relationship between categories and locations: a greater number of location options of the same category near the user's position results in higher prediction difficulty. Additionally, the periodicity of visits to different categories also affects prediction accuracy.
The quantity and frequency of individuals' visited locations can reflect the predictability of their travel behavior. Therefore, we utilize entropy to describe the patterns of individual location visits: $\mathrm{Entropy}(u) = -\sum_{i=1}^{n} p_i \log p_i$, where $p_i$ denotes the visit frequency of the $i$-th location and $n$ is the total number of locations user $u$ visited. Figure 7(a) depicts the correlation between category entropy and location prediction accuracy, while Fig. 7(b) illustrates the relationship between location entropy and location prediction accuracy. The results indicate a negative correlation between entropy and accuracy. Users with higher entropy tend to visit more diverse locations, making their travel predictions more challenging.
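As an illustration, the entropy above can be computed directly from a user's check-in sequence. The following is a minimal sketch; the function name and toy data are ours, not from the paper:

```python
import math
from collections import Counter

def location_entropy(visits):
    """Shannon entropy of a user's location-visit frequencies.

    visits: sequence of location ids; p_i is the empirical
    frequency of the i-th distinct location.
    """
    total = len(visits)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(visits).values())

# A user who always visits one place has zero entropy, while
# uniform visits over n distinct places give log(n).
single = location_entropy(["home"] * 10)          # zero entropy
uniform = location_entropy(["a", "b", "c", "d"])  # log(4)
```

Higher values thus correspond to users spreading their visits over more locations more evenly, which is exactly the regime where prediction accuracy drops in Fig. 7.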
Regarding the spatial-constrained loss, we examine whether the distances between the predicted and actual locations are successfully constrained, and compare CSLSL with four baselines. As shown in Fig. 8, the predicted locations of CSLSL are closer to the actual locations, which indicates that the proposed loss successfully constrains the distance errors. Furthermore, we inspect the constraining effect of the proposed loss on spatial consistency. Figure 9(a) compares the predicted displacements with the ground truth. It can be seen that the predicted displacement distribution of CSLSL is closer to the true distribution. This is because constraining the predicted locations toward the actual ones indirectly ensures the consistency of the predicted displacements with the ground truth. Figure 9(b) shows the prediction error of regional attractiveness. We divide the geographic region into square grids with side lengths of 500 m and count the difference between the predicted and actual visits in each grid. As presented in Fig. 9(b), CSLSL has a smaller prediction error of regional attractiveness, suggesting that the proposed loss successfully constrains the spatial consistency.
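The grid-based regional attractiveness error described above can be sketched as follows. This is an illustrative implementation under our own assumptions: we use a simple equirectangular approximation to map coordinates into 500 m cells, and the helper names are hypothetical, not from the paper:

```python
import math
from collections import Counter

GRID_SIZE_M = 500            # side length of a square grid cell (meters)
METERS_PER_DEG_LAT = 111_320  # rough meters per degree of latitude

def grid_cell(lat, lon, ref_lat):
    """Map a (lat, lon) coordinate to a 500 m x 500 m cell index,
    using an equirectangular approximation around ref_lat."""
    y = lat * METERS_PER_DEG_LAT
    x = lon * METERS_PER_DEG_LAT * math.cos(math.radians(ref_lat))
    return (int(x // GRID_SIZE_M), int(y // GRID_SIZE_M))

def attractiveness_error(pred_coords, true_coords, ref_lat):
    """Sum over grid cells of |predicted visits - actual visits|."""
    pred = Counter(grid_cell(lat, lon, ref_lat) for lat, lon in pred_coords)
    true = Counter(grid_cell(lat, lon, ref_lat) for lat, lon in true_coords)
    return sum(abs(pred[c] - true[c]) for c in set(pred) | set(true))
```

A perfect prediction yields zero error, while every visit placed in the wrong cell contributes twice (one missing visit in the true cell, one spurious visit in the predicted cell).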

Sensitivity analysis
We perform sensitivity analysis on the NYC and TKY datasets to examine how the performance of CSLSL is affected by λ_*, where * ∈ {t, c, s}. We first vary λ_t and λ_c to analyze the effect of the time and category prediction subtasks with a fixed λ_s = 1. Then we fix λ_t and λ_c and vary λ_s to observe the impact of the spatial-constrained auxiliary task. Recall@1 is chosen as the evaluation metric, and the location prediction results are averaged over three runs, as shown in Fig. 10.
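To make the role of these hyperparameters concrete, a weighted-sum objective of the form below would combine the task losses; the weighted-sum form itself is our assumption for illustration, and the default values are only the best NYC setting reported in the text (λ_t = 5, λ_c = 10 with λ_s = 1):

```python
def total_loss(l_time, l_cat, l_loc, l_spatial,
               lam_t=5.0, lam_c=10.0, lam_s=1.0):
    """Illustrative weighted multi-task objective (assumed form).

    lam_t, lam_c, lam_s weight the time, category, and
    spatial-constrained losses relative to the location loss;
    these are the hyperparameters swept in the sensitivity analysis.
    """
    return l_loc + lam_t * l_time + lam_c * l_cat + lam_s * l_spatial
```

Under this form, increasing λ_c makes category errors dominate the gradient, which is consistent with the observation below that location accuracy is more sensitive to λ_c than to λ_t.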
From Fig. 10(a), we can observe that the location prediction performance is more sensitive to λ_c than to λ_t, reflecting that accurate category prediction exerts more influence on location prediction accuracy, which is consistent with our proposed decision logic. In addition, the best performance on NYC is obtained with λ_t = 5 and λ_c = 10 when λ_s = 1, while that on TKY is obtained with λ_t = 100 and λ_c = 50. Figure 10(b) shows that CSLSL reaches more stable accuracy on NYC when λ_s = 5.0, while the average accuracy is higher when λ_s = 10.0. The results on TKY show that when λ_t and λ_c are set to smaller values, better performance can be achieved as λ_s varies. In summary, CSLSL is robust to these parameters and does not suffer from large performance fluctuations with parameter changes.

Conclusion
In this work, we propose a Causal and Spatial-constrained Long and Short-term Learner (CSLSL) to incorporate the individual travel decision logic and the group consistency for next location prediction. In CSLSL, we introduce a causal structure based on multi-task learning to explicitly capture the "when→what→where" decision logic and enhance location prediction by fully exploiting the temporal and categorical information. We further propose a simple but effective spatial-constrained loss function that acts as a self-supervised auxiliary task to incorporate geographical information and indirectly ensure spatial consistency. We conducted extensive experiments to confirm the effectiveness of the design. Specifically, we compared our model with seven baseline models on three datasets, demonstrating the superior performance of the proposed model. Besides, we conducted variant and ablation experiments to validate the effectiveness of the proposed causal structure and spatial-constrained loss. Furthermore, we performed visualization analyses of the model's predictions, including exploring the relationship between categories and location predictions, analyzing the influence of individual behavioral diversity on predictability, and examining distance relationships and disparities in spatial distribution. Finally, we conducted sensitivity analyses on the hyperparameters to examine the robustness of our model. Although we evaluated our model on check-in data, the performance improvement was limited due to the sparse nature of the data. We expect to experiment on dense datasets covering comprehensive travel behavior. Such datasets would exhibit more regular patterns in human behavior, enabling the model to more effectively utilize time and activity information to enhance location prediction.

Figure 1
Figure 1 Illustrations for causality and spatial consistency in human mobility. (a) An example explaining the "when→what→where" decision logic. (b) The deviation of the displacement distribution modeled via LSTM from the ground truth (GT) in New York City and Tokyo

Figure 3
Figure 3 The illustration of the LSC module. It learns the long-term and short-term trajectory representations that reflect a user's travel preference

Figure 6
Figure 6 Accuracy of location prediction across different categories: (a) 9 coarse-grained categories; (b) Top 20 fine-grained categories with the highest accuracy

Figure 7
Figure 7 Accuracy of location prediction under different entropy: (a) category entropy; (b) location entropy. The blue points represent the average accuracy. The gray points reflect the accuracy distribution among users, and their transparency is normalized based on the maximum number of users in each accuracy segment. The results reveal a negative correlation between entropy and accuracy

Figure 8
Figure 8 Distance distribution between the predicted and target locations. The predicted locations of CSLSL are the closest to the true locations among these models

Figure 10
Figure 10 Analysis of parameter sensitivity. (a) The accuracy heatmap with various λ_t and λ_c. (b) The accuracy line chart with various λ_s. Among these hyperparameters, CSLSL is most sensitive to λ_c. Despite minor performance fluctuations, CSLSL can still obtain an accuracy of over 0.26 on NYC and 0.24 on TKY

Table 1
Statistical information of the three datasets: NYC, TKY, and Dallas. For Dallas, the numbers of users, locations, and records are collected from both the raw and the processed data. Data in NYC and TKY were collected from 3 April 2012 to 16 February 2013, and data in Dallas were collected from 4 February 2009 to 22 October 2010. The numbers of users, locations, and records in the three datasets are summarized in Table 1, where |*| denotes the number of *. The numbers of location categories |C| in NYC and TKY are 400 and 385, respectively, while Dallas does not contain category information.

Table 2
Performance comparison between baselines, CSLSL, its variants and ablations on three real-world datasets