Comparison of home detection algorithms using smartphone GPS data

Estimation of people's home locations using location-based services data from smartphones is a common task in human mobility assessment. However, commonly used home detection algorithms (HDAs) are often arbitrary and unexamined. In this study, we review existing HDAs and examine five HDAs using eight high-quality mobile phone geolocation datasets. These include four commonly used HDAs as well as an HDA proposed in this work. To make quantitative comparisons, we propose three novel metrics to assess the quality of detected home locations and test them on eight datasets across four U.S. cities. We find that all three metrics show a consistent rank of HDAs' performances, with the proposed HDA outperforming the others. We infer that the temporal and spatial continuity of the geolocation data points matters more than the overall size of the data for accurate home detection. We also find that HDAs with high (and similar) performance metrics tend to create results with better consistency and closer to common expectations. Further, the performance deteriorates with decreasing data quality of the devices, though the patterns of relative performance persist. Finally, we show how the differences in home detection can lead to substantial differences in subsequent inferences using two case studies - (i) hurricane evacuation estimation, and (ii) correlation of mobility patterns with socioeconomic status. Our work contributes to improving the transparency of large-scale human mobility assessment applications.


Introduction
Home location detection is an important step in several fields of human mobility analysis such as transportation planning [1], migration and evacuation studies [2,3], accessibility analysis [4], and the theory of human mobility [5,6]. This task involves predicting people's 'home location' based on geolocation data, often collected passively by their devices via location-based services, call detail records (CDRs), social media activity, smart-card transactions, and in-vehicle location trackers [7]. Home detection plays an essential role in understanding large-scale human mobility patterns. For instance, in the event of a hurricane, one needs people's home locations both before and after the disaster to identify their evacuation status [8]. In urban planning, identified home locations serve as the foundational data for vital information including home-based trips [9] and human mobility metrics [10], and this forms the basis for optimizing existing infrastructure [11]. It is consequently important to have a robust understanding of home detection approaches.
Despite its significance, existing studies using home detection algorithms (HDAs) have paid little attention to the effectiveness of their algorithms. Researchers have developed several HDAs for geolocation data of different kinds whose assumptions, methods, and parameters are not necessarily consistent with one another [12]. This raises doubts about the validity of their findings as the error in home detection may propagate to the downstream calculation of home-related metrics such as evacuation counts [3], home-based trip rates [13], and data representativeness figures for accessibility analysis [14].
This issue stems primarily from a lack of ground truth home locations associated with large geolocation datasets. The collection of accurate home locations on a large scale poses significant risks to privacy [15]. Mobility data vendors provide anonymized device identifiers and modify sensitive trajectories to prevent accurate tracking of people's trip origins and destinations [3]. In the absence of ground truth data, it becomes difficult to compare the accuracy of different HDAs using supervised learning methods. Researchers have largely relied on unsupervised methods for home detection, such as rule- and clustering-based HDAs. Small-scale studies such as [12] and [16] have sought to compare the effectiveness of HDAs but have focused only on the parameters of a few HDAs. Further, their small experiments do not provide insights about the impact of study region, study period, and data quality on the performance of the HDAs.
In this study, we tackle this lack of a systematic comparative assessment of commonly used HDAs. In doing so, we contribute to the literature on the home location detection problem in the following ways:
1. We review the state-of-the-art HDAs that use large-scale mobility data, including their benefits, assumptions, and limitations.
2. We propose three intuitive metrics to quantify the quality of the home-location detection results in the absence of ground truth home location information.
3. We develop a comprehensive experiment where a set of HDAs are quantitatively compared in terms of the introduced performance metrics and their sensitivity to the data quality.
4. We propose a new HDA that overcomes some of the limitations of the above methods and shows superior performance.
The framework and experiment design of this study are shown in Fig. 1 and described in detail in the following sections. The main objective is to compare the performance of different HDAs across different input datasets. On the basis of the review of research literature on HDAs, we have selected four popular and unique HDAs, and additionally proposed an HDA for comparison in this study. The testing is done on eight input samples of passively collected smartphone GPS data covering four U.S. metropolitan areas, with different data qualities and different time periods defined by mobility-influencing events. Once home locations are estimated for each combination of the sample dataset and HDA, they are compared using three approximate accuracy metrics proposed in Sect. 3.2. The performance of the HDAs under the different dataset conditions is discussed in Sect. 4.1. In the subsequent sensitivity analysis (Sect. 4.2), the performance metrics are recomputed for different subsamples of the datasets by varying the data quality of the users in the input dataset. Finally, the impacts of these HDAs on subsequent applications, such as hurricane

Home detection algorithms

Literature review
HDAs for mobility data can be categorized on the basis of several characteristics, such as the type of input geolocation data (such as social media and passively collected GPS data), the modeling paradigm (supervised vs. unsupervised and rule-based vs. data-driven), and constraints for filtering the input data. Based on these classifications, some prominent HDAs are reviewed and summarized in Table 1.

Supervised methods
Supervised methods predominantly rely on GPS-based travel surveys that involve the subjects carrying GPS-enabled devices that track movements. In addition to individual-level information such as actual (ground truth) home locations, demographic characteristics, and personal preferences, the devices provide detailed travel entries such as the origin and destination, the departure and arrival time, the trip purpose, and the travel mode. Such mixed methods have been used in many pilot studies [12,18,19]. In some cases, it is also possible to obtain CDR data of specific groups for whom individual-level data is also available, such as employees of a telephone carrier [12]. In other cases, such as in standalone travel diaries like the National Household Travel Survey, the respondents' street addresses are geocoded to coordinates, though other large-scale information is not obtained for them [38]. With true home locations of the small survey sample, it is possible to create sophisticated supervised machine learning models, such as random forests and AdaBoost [21] or artificial neural networks [29].
Although supervised HDAs are powerful, they suffer from a major limitation of training data availability due to privacy reasons. In the recent past, growing pressure from human rights organizations and the subsequent government regulations has made it difficult to obtain actual home locations of individuals at a large scale [39]. GPS surveys and CDR samples used in supervised HDAs are usually very small, often with fewer than 100 subjects [12,18]. Samples also typically consist of specialized volunteer subjects such as students [17] and older patients [40], raising concerns about sample representativeness [16]. In addition, GPS travel surveys do not represent longitudinal data. These issues make supervised HDAs much less popular in the research literature.

Table 1 (excerpt)
Data type | Reference | Home location definition
Passive GPS | [24] | Largest hierarchical cluster of stay points (detected based on Liu et al. (2008))
Passive GPS | [3,25] | Largest cluster of nighttime records using mean-shift clustering
Heuristic
CDR | [26] | Location of the more popular of the two cell towers with the most records during non-work time
CDR | [27] | Most frequently communicated tower during nights of weekdays, and weekends over the study period
CDR | [28] | Most frequent location during night time
CDR | [29] | Most common visited locations during night time
CDR | [30] | Anchor point determination model (cell tower location satisfying specific rules of call count)
Passive GPS | [31] | The centroid of the most visited 20 × 20 m cell during night hours
Smart card | [32] | Center point-based HDA (iteratively updated centroid between pairs of subway stations)
Smart card | [33] | Most visited transit station
Smart card | [34] | Most popular transaction place (overall and active days); place with most nighttime activity
Social media | [35,36] | Place with the most check-ins on 3 social networks
Social media | [37] | Place with the most check-ins during midnight

Basic assumptions
Due to the difficulty in obtaining high-quality home location data at a large scale, researchers have relied heavily on unsupervised HDAs. These methods necessarily depend on a set of assumptions about people's home locations that are found throughout the research literature [3,29,37]. These include:
• People are more likely to stay at their homes during the off-work period. This normally includes nighttime, but in some cases, can be extended to weekends or even after-office hours.
• The most observed place for an individual, especially at night, is usually their home.
These assumptions intuitively make sense, although there are several exceptions, such as people who work from home or who work night shifts. However, since these assumptions are almost always used in unsupervised HDAs, we consider them to be axiomatic.

Dataset types
The input datasets for unsupervised HDAs are abundantly available at scale, including longitudinal data [31,41], although they lack the demographics and travel preferences of the subjects [27,42,43]. Some of the most prominent dataset kinds include the following:
• Social media location data include posts on websites such as Twitter, Foursquare, and Flickr that a user tags with the location of the mentioned place [41]. They are usually available at large scales and in several time periods, but they are usually too spatiotemporally sparse and too biased toward certain demographics for effective home location detection, and access to the data can disappear quickly [44][45][46].
• Smart card data include transactions at payment booths such as at subway stations and inside public transit buses [32]. These are usually anonymized and frequent, but they can only be used to infer public transit mobility patterns adequately as opposed to home locations.
• Call detail records (CDRs) provide geolocation data at the cell tower level. Such datasets are characterized by large spatiotemporal density and coverage, but the quality of the detected homes is subject to the spatial distribution of the cell towers rather than the users' activity patterns [26]. Nonetheless, they have been used extensively to understand people's travel and activity patterns during the recording period [26][27][28].
• Passively collected GPS data are usually obtained from mobile devices such as smartphones and tablets, and from automobiles, that have location-based services (LBS) enabled [43]. GPS data overcomes the main problem with CDRs by providing the exact locations and overcomes that of geotagged posts by providing continuous and high-frequency records. Further, GPS can provide more detailed information about the movements of individuals, including their speed, direction, and stop durations along the way. Therefore, it has seen a substantial increase in availability and use in the last decade. In this study, we use this data kind for our analysis.

Method types
Density-based clustering methods, such as DBSCAN [47] (used in [48]) and mean-shift clustering [49] (used in [50]), are commonly used in the literature to estimate home locations. Mean-shift clustering [49] is a popular density-based clustering method that has been used in several studies [3,20,50,51], probably owing to its simplicity in having just one main parameter: the radius of the flat kernel for kernel density estimation (KDE). DBSCAN [47] and its variants (e.g., [48,52]), on the other hand, have two main parameters: the maximum intra-cluster distance at each iteration and the minimum number of points in an acceptable neighborhood. In both these methods, the results of the clustering can be substantially sensitive to the choice of these parameters [53].
Heuristic algorithms, which rely on various decision rules on the time and frequencies of user records in specific areas during observations, are widely applied to detect home locations [12,16,22,43]. The most intuitive assumption is that users have the most records at home, and their home locations are identified based on the density of the data. Different variants are proposed by shifting the rules, such as determining the home as the location with the highest number of nighttime records or the most distinct days. Li et al. (2008) [54] developed a rule-based method for detecting 'stay points' which represent spatiotemporal regions of low movement and are thus helpful in trip segmentation. These stay points are computed by identifying the breaks in the time gap and distance between the first and last point of a sequential set of points based on given thresholds of time gap (30 min) and distance (200 m). This method was further modified by Sadeghinasr et al. (2019) [43] who clustered these stay points using hierarchical clustering into stay regions and identified home locations as the most visited stay regions during nighttime. Other methods, such as the center-point algorithm by Zou et al. (2018) [32], which uses the midpoint of the first to-subway trip's origin and the last from-subway trip's destination to represent the home location, are easy to compute but have been shown to perform fairly well up to a large radius of tolerance (e.g., [32,44]).
Existing studies that compared different HDAs have already demonstrated that the results are sensitive to the choice of criteria, such as the nighttime period. For instance, Vanhoof et al. [22] primarily focused on assessing the effects of different night periods on the home detection results while ignoring the limitations of the HDA. Pappalardo et al. [12] compared five similar HDAs and validated the results with multiple small-scale datasets, yet they neglected to consider factors such as data quality and period. In contrast, this study focuses on comparing the HDAs, with a particular emphasis on testing across scenarios spanning different regions, data periods, and data quality.

Algorithms used in this study
Five HDAs are compared in this study, including a simple baseline algorithm, three algorithms listed in the 'Passive GPS data' section of Table 1, and a derivative of one of those algorithms as proposed in this study. The steps involved in these algorithms, labeled as A1, ..., A5, are illustrated in Fig. 2. The same input dataset is used for each of these HDAs. For clustering-based methods, the implementations of scikit-learn, a popular Python-based machine learning library, are used. The common set of users resulting from each of these HDAs is used for subsequent performance assessment.

A1: centroid method
This is the simplest of all the considered HDAs and is meant to serve as the baseline for comparison with the other algorithms. In this case, a user's home location is simply computed as the centroid (or alternatively the medoid) of all their nighttime ping locations over the entire study period, following the assumption that a person's most probable location during the night is their home. This is similar to, but not exactly the same as, most popular cell tower-based algorithms in the case of CDR data [27][28][29].
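The centroid and medoid variants of this baseline can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function names and the `(lat, lon)` tuple layout are assumptions, and averaging raw coordinates is only reasonable for a small study region away from the poles and the antimeridian.

```python
import math
from statistics import fmean

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (assumed constant)


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def centroid_home(night_pings):
    """Mean latitude and longitude of all nighttime pings (hypothetical helper)."""
    if not night_pings:
        return None
    return (fmean(lat for lat, _ in night_pings),
            fmean(lon for _, lon in night_pings))


def medoid_home(night_pings):
    """The observed ping with the smallest total distance to all other pings."""
    if not night_pings:
        return None
    return min(night_pings,
               key=lambda p: sum(haversine_m(p, q) for q in night_pings))
```

With three pings at one place and a single distant ping, the centroid is pulled toward the distant ping while the medoid stays on an observed location, which illustrates why the centroid is sensitive to outliers. The medoid as written costs quadratic time in the number of pings.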

A2: grid frequency method
This HDA was used in Zhao et al. (2022) [31]. They first divided the study region into a square grid with cells of 20 × 20 meters. They considered the home location as the centroid of the most visited cell during night hours.
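A grid-frequency rule of this kind can be sketched as below, assuming (as summarized in Table 1) that the home is the center of the 20 × 20 m cell with the most nighttime pings. The local equirectangular projection, the `origin` argument, and the function name are illustrative assumptions, not details taken from [31].

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (assumed constant)


def grid_home(night_pings, origin, cell_m=20.0):
    """Return ((lat, lon), count) for the most visited grid cell.

    night_pings: list of (lat, lon) in degrees.
    origin: (lat0, lon0) reference corner for a local flat projection (assumed).
    """
    if not night_pings:
        return None
    lat0, lon0 = origin
    cos0 = math.cos(math.radians(lat0))
    counts = Counter()
    for lat, lon in night_pings:
        # Project to local meters east (x) and north (y) of the origin.
        x = EARTH_RADIUS_M * math.radians(lon - lon0) * cos0
        y = EARTH_RADIUS_M * math.radians(lat - lat0)
        counts[(math.floor(x / cell_m), math.floor(y / cell_m))] += 1
    (ix, iy), n = counts.most_common(1)[0]
    # Convert the center of the winning cell back to degrees.
    cx, cy = (ix + 0.5) * cell_m, (iy + 0.5) * cell_m
    lat = lat0 + math.degrees(cy / EARTH_RADIUS_M)
    lon = lon0 + math.degrees(cx / (EARTH_RADIUS_M * cos0))
    return (lat, lon), n
```

Ties between equally visited cells are resolved here by first occurrence, which is an arbitrary choice; a production version would more likely use a proper projected coordinate system for the study region.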

A3: all-time clustering method
This method involves finding the most popular cluster of all the pings in the nighttime data taken together without distinguishing the temporal variation in locations during the night. Though several clustering methods exist as explained in Sect. 2.1.2, this method particularly uses mean-shift clustering with the same parameters as in [3,23,50]. All these studies use a flat kernel with a radius of 250 m for KDE. In this study, other parameters in this method such as the sampling strategy for the KDE process and the number of iterations in the hill climb process are controlled to prefer accuracy over runtime speed.
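To make the flat-kernel hill climb concrete, the sketch below implements a small mean-shift procedure in plain Python. It is an illustrative stand-in, not the scikit-learn implementation used in the study: the function names, the convergence tolerance, and the greedy merging of nearby modes are assumptions, and the points are assumed to be already projected to meters.

```python
import math


def mean_shift_modes(points, radius, max_iter=100, tol=1e-3):
    """Shift each point to the mean of its neighbors within `radius` until it settles."""
    modes = []
    for px, py in points:
        cx, cy = px, py
        for _ in range(max_iter):
            near = [(x, y) for x, y in points
                    if math.hypot(x - cx, y - cy) <= radius]
            if not near:  # defensive guard against rounding at the boundary
                break
            nx = sum(x for x, _ in near) / len(near)
            ny = sum(y for _, y in near) / len(near)
            shift = math.hypot(nx - cx, ny - cy)
            cx, cy = nx, ny
            if shift < tol:
                break
        modes.append((cx, cy))
    return modes


def largest_cluster_centroid(points, radius=250.0):
    """Group points whose modes lie within `radius`; return (centroid, size) of the largest group."""
    modes = mean_shift_modes(points, radius)
    centers, members = [], []
    for p, m in zip(points, modes):
        for k, c in enumerate(centers):
            if math.hypot(m[0] - c[0], m[1] - c[1]) <= radius:
                members[k].append(p)
                break
        else:
            centers.append(m)
            members.append([p])
    best = max(members, key=len)
    centroid = (sum(x for x, _ in best) / len(best),
                sum(y for _, y in best) / len(best))
    return centroid, len(best)
```

This version costs quadratic time per iteration, so it only suits small per-user samples; library implementations typically add seeding and spatial indexing to speed up the same idea.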

A4: binned clustering method
HDA A3 uses clustering of all the nighttime points at once, meaning it does not distinguish between the following cases: (i) a scenario where most of the nighttime points are concentrated in a small time period (e.g., 10:00-10:20 PM) where the user might possibly be in movement and thus more likely to enable LBS, and (ii) a scenario where the same number of nighttime points as in case (i) are distributed evenly across the night. It can be argued that the latter case provides more confidence in the inferred home location since it relies on better-sampled data.
To overcome this limitation of A3, we propose an adaptation in the form of A4 where the nighttime points are collected at fixed time intervals over the study period. The centroids of these locations are computed and used as inputs in mean-shift clustering. Similar to A3, the centroid of the largest cluster is labeled the home location. This HDA introduces a parameter in addition to those of A3: the binning period, which is taken as 30 minutes in this study.
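The temporal binning step can be sketched as below. This is a minimal illustration under the assumption that bins are consecutive 30-minute windows of absolute time, so that each night contributes its own bins; the function name and the `(unix_ts, x, y)` layout with projected coordinates are hypothetical.

```python
from collections import defaultdict


def bin_centroids(pings, bin_seconds=1800):
    """Collapse pings into one centroid per non-empty time bin.

    pings: iterable of (unix_ts, x, y), with x and y in projected meters (assumed).
    Returns a list of (x, y) centroids ordered by time bin.
    """
    bins = defaultdict(list)
    for ts, x, y in pings:
        bins[ts // bin_seconds].append((x, y))
    centroids = []
    for key in sorted(bins):
        pts = bins[key]
        centroids.append((sum(x for x, _ in pts) / len(pts),
                          sum(y for _, y in pts) / len(pts)))
    return centroids
```

A burst of many pings recorded within one half hour then counts as a single input point for the clustering step, the same weight as one ping in a quiet half hour, which is the rebalancing that A4 is designed to achieve.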

A5: stay-point method
This HDA was proposed by Sadeghinasr et al. (2019) [24], who used the stay point detection algorithm proposed by Li et al. (2008) [54] to first identify stay points and then clustered the stay points using hierarchical clustering into stay regions by setting a threshold of a maximum intra-cluster distance of 250 m. Then, they considered the home locations as the most visited stay regions during nighttime (8 PM-5 AM) which had a visit duration of at least 3 hours during the nighttime or a total duration of at least 24 hours.
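The stay point detection step can be sketched as follows, using the 200 m and 30 min thresholds mentioned earlier in the literature review. This is one common variant of the rule, written for illustration only; it may differ in detail from [54] and from the study's code, and the function names and the `(ts, lat, lon)` layout are assumptions. The subsequent hierarchical clustering into stay regions and the nighttime duration rules are not shown.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (assumed constant)


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def detect_stay_points(trace, dist_m=200.0, min_dur_s=1800):
    """trace: time-sorted list of (ts, lat, lon). Returns a list of stay dicts."""
    stays = []
    i, n = 0, len(trace)
    while i < n:
        # Extend the run while pings stay within dist_m of the anchor ping i.
        j = i + 1
        while j < n and haversine_m(trace[i][1:], trace[j][1:]) <= dist_m:
            j += 1
        duration = trace[j - 1][0] - trace[i][0]
        if duration >= min_dur_s:
            seg = trace[i:j]
            stays.append({
                "lat": sum(p[1] for p in seg) / len(seg),
                "lon": sum(p[2] for p in seg) / len(seg),
                "arrive": seg[0][0],
                "leave": seg[-1][0],
            })
            i = j       # continue after the detected stay
        else:
            i += 1      # no stay anchored here; try the next ping
    return stays
```

Because the rule depends on the time span between the first and last ping of a run, a sparse trace with long gaps can produce a stay from very few observations, which relates to the concern raised later about thresholds that ignore data continuity.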

Smartphone GPS data
This study uses GPS trace data collected using LBS on smartphones and tablets, aggregated and anonymized by a private vendor. The trace table (illustrated in Table 2) comprises events (called 'pings' here) which include a mobile device's (called 'user' here) anonymized identifier, latitude and longitude of the point, an estimate of the radius of GPS recording error for that ping, and the Unix-style timestamp of the event (seconds passed since Jan 1, 1970 UTC+00:00). More details about the data are provided in the Supplementary Sect. 1 in Additional file 1.
LBS data is usually slightly erroneous due to inaccuracies in the GPS logging system and thus needs preprocessing for better results. The preliminary data filtering done to create the dataset samples includes removing pings with an error radius of more than 50 m, those with segment speed of more than 50 m/s (180 km/h), and those with acceleration outside the range of -10 to 10 m/s^2 (based on works like [55,56]). For reference, for the $i$th ping in the sequence trace with coordinates $\mathbf{x}_i = (x_i, y_i)$ and timestamp $t_i$, its speed is given by $v_i = \frac{d(\mathbf{x}_{i-1}, \mathbf{x}_i)}{t_i - t_{i-1}}$ and the acceleration by $a_i = \frac{v_i - v_{i-1}}{t_i - t_{i-1}}$, where $d$ is the Haversine distance function. By definition, $v_1 = a_1 = a_2 = 0$.
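The filtering rules above can be sketched as a single pass over one user's trace. This is an illustrative reading rather than the study's pipeline: the function names and the `(ts, lat, lon, error_radius_m)` layout are assumptions, acceleration is taken as the change in segment speed over the segment time, and speeds are not recomputed after a ping is dropped.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (assumed constant)


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def kinematics(trace):
    """Speed and acceleration per ping for a time-sorted list of (ts, lat, lon)."""
    n = len(trace)
    v, a = [0.0] * n, [0.0] * n  # first speed and first two accelerations stay 0
    for i in range(1, n):
        dt = trace[i][0] - trace[i - 1][0]
        if dt > 0:  # duplicate timestamps keep a speed of 0 (assumption)
            v[i] = haversine_m(trace[i - 1][1:3], trace[i][1:3]) / dt
    for i in range(2, n):
        dt = trace[i][0] - trace[i - 1][0]
        if dt > 0:
            a[i] = (v[i] - v[i - 1]) / dt
    return v, a


def filter_trace(pings, max_err_m=50.0, max_speed=50.0, max_acc=10.0):
    """Drop inaccurate pings first, then pings with implausible speed or acceleration."""
    kept = [p for p in sorted(pings) if p[3] <= max_err_m]
    v, a = kinematics([p[:3] for p in kept])
    return [p for p, vi, ai in zip(kept, v, a)
            if vi <= max_speed and abs(ai) <= max_acc]
```

Applying the accuracy filter before computing speeds keeps one badly located ping from flagging its well-located neighbors; whether the original preprocessing uses this order is not stated in the text.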

Study regions and periods
Four U.S. metropolitan statistical areas (MSAs) are assessed in this study: (i) Austin, TX, (ii) Baton Rouge, LA, (iii) Houston, TX, and (iv) Indianapolis, IN. The counties included in these MSAs, their total area, and their total population (as of the 2020 5-year estimates of the American Community Survey (ACS)) are shown in Fig. 3.
These regions are chosen from the cities with available land use and smartphone GPS data so as to cover a diverse set of scales and land use patterns. Baton Rouge has a large but sparsely populated MSA, whereas Houston has a much larger MSA. Austin and Indianapolis lie in between but represent cities with very different land use distributions and urban layouts. Houston is known for its sprawling layout, with significant suburban development extending into multiple counties. The city is characterized by a lack of zoning laws, which has led to a unique pattern of residential, commercial, and industrial areas often being interspersed [57]. Austin is characterized by a higher density in the city center, with the urban core being home to a mix of residential, commercial, and cultural facilities (towards mixed-use developments) [58]. By conducting analysis for these four cities with different socio-economic contexts, underlying data characteristics, and scale and complexity, we ensure that our tests are robust and generalizable across various urban settings.
In addition to spatial variation, the datasets used for testing the HDAs are created so as to include temporal variation as well. Particularly, two case studies are chosen to represent the potential temporal difference of HDA outputs before and after two specific events. The first event is Hurricane Ida, which caused damage in southeastern Louisiana upon landfall on August 29, 2021, causing waves of evacuation and displacement around the region, including in Baton Rouge. The periods depicting stability before the landfall, the mobilization period around landfall, and the period long after the event are considered in this analysis.
The second event is the first government-mandated lockdown in Indiana on March 23, 2020, following the outbreak of COVID-19 in the United States, which was known to have drastically reduced mobility. A before-after comparison of the HDAs for these events is deemed useful in explaining the robustness of the HDAs. This is explained in Sect. 4.3.
With these combinations of study regions and periods, a total of eight datasets are prepared for testing the HDAs. These are shown in Table 3. The number of unique devices (referred to as 'users') obtained after cleaning the GPS data and their ratio to the regional population are also shown. Similarly, the number of filtered pings is also shown for each dataset, reflecting the scale of variation in the test datasets. Note that the pings are filtered within the regions' bounding boxes (shown in green dashed outlines in Fig. 3) instead of within the MSA counties for the sake of computational speed.

Performance metrics
In the absence of ground-truth information on device users' home locations, the accuracy of the HDAs is tested using three approximate or pseudo-performance metrics. All these are based on some assumptions that are generally considered valid intuitively and in the literature.

M1: residential detection rate
This metric makes use of the idea that a good HDA should detect more homes in a city's residential areas as opposed to other land use categories such as commercial, industrial, and forests. This metric is given by the proportion of homes detected by a given HDA in the residential area of the region based on its land use distribution (see Supplementary Sect. 2.1 for more details). To offset some potential mislocation errors due to the nature of the GPS data and the often convoluted land use maps, tolerance buffers of different widths are also considered in the calculation. This results in the following definition of the performance metric:

$$M_1(A) = \frac{1}{r_{\max}} \int_0^{r_{\max}} \rho_A(r)\, dr \qquad (1)$$

Here, for buffers of width $r$, ranging from zero to $r_{\max}$, $\rho_A(r)$ is the proportion of homes detected by HDA $A$ that lie in the combined buffered residential area. For instance, a value of $M_1(A) = 0.4$ can be roughly interpreted as 40% of the users' home locations detected by HDA $A$ lying within a region of the city classified as 'residential'. In the subsequent experiments, the value of $r_{\max}$ is taken as 50 m, the same as the maximum allowed error in GPS spatial accuracy as explained in Sect. 3.1.1, with buffer width increments of 5 m.
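One way to compute such a buffered detection rate is sketched below. It assumes the metric is the detection rate averaged over buffer widths from 0 to 50 m in 5 m steps (a normalized area under the rate curve, which is an assumption about the exact definition) and that each detected home's distance to the nearest residential polygon has already been computed elsewhere, with 0 for homes inside a polygon. The function name and inputs are hypothetical.

```python
def residential_detection_rate(dist_to_residential_m, r_max=50.0, step=5.0):
    """Average share of homes within r of residential land, for r from 0 to r_max.

    dist_to_residential_m: one distance per detected home, 0 if inside a
    residential polygon (assumed to be precomputed with a GIS tool).
    """
    n = len(dist_to_residential_m)
    if n == 0:
        raise ValueError("at least one detected home is required")
    widths = [k * step for k in range(int(round(r_max / step)) + 1)]
    # Buffering the residential area by r is equivalent to accepting homes
    # whose distance to the unbuffered area is at most r.
    rho = [sum(d <= r for d in dist_to_residential_m) / n for r in widths]
    # Trapezoidal area under the rate curve, normalized by r_max.
    area = sum((rho[k] + rho[k + 1]) / 2 * step for k in range(len(rho) - 1))
    return area / r_max
```

For four homes at distances 0, 0, 12, and 100 m, the rate is 0.5 up to the 10 m buffer and 0.75 from the 15 m buffer onward, giving a value of 0.6875.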

M2: proximity to daily data
This metric uses the idea that a home location should be the origin/destination of one's daily trips. Given a user's home location detected by a given HDA, this metric involves computing its distance to the closest ping in that user's nighttime pings on each day in the study period using Haversine distance. Then, the median of these daily shortest distances is taken for each user. The cumulative distribution function (CDF) of this median shortest distance is drawn and the normalized area under the curve is computed. This represents the proximity performance metric, given by the following:

$$M_2(A) = \frac{1}{\delta_{\max}} \int_0^{\delta_{\max}} F_A(\delta)\, d\delta, \qquad \delta_{A,i} = q_{0.50}\left(\left\{ \min_{\mathbf{x} \in X_{i,t}} d(h_{A,i}, \mathbf{x}) \right\}_{t=1}^{n_T}\right) \qquad (2)$$

Here, $F_A$ is the CDF of the median shortest distance of the users detected by HDA $A$, $\delta_{A,i}$ is the median shortest distance for the user $i$ whose home location is given by $h_{A,i}$, $X_{i,t}$ is the set of nighttime pings at night $t$, $q_{0.50}$ represents the 0.50 quantile (that is, the median) over all study days up to $n_T$, and $\delta_{\max}$ is a reasonable upper limit, taken as 5 km.
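The proximity metric can be sketched as below. For an empirical CDF, the normalized area under the curve between 0 and the 5 km cap equals the mean over users of max(0, 1 - distance / cap), so the sketch uses that closed form; the study may instead integrate the curve numerically. The function names and the dictionary of nightly pings are illustrative assumptions.

```python
import math
from statistics import fmean, median

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (assumed constant)


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def median_nightly_distance(home, nightly_pings):
    """Median over nights of the distance from home to that night's closest ping.

    nightly_pings: dict mapping a night label to a list of (lat, lon) pings.
    """
    daily_min = [min(haversine_m(home, p) for p in pts)
                 for pts in nightly_pings.values() if pts]
    return median(daily_min)


def proximity_metric(deltas_m, delta_max_m=5000.0):
    """Normalized area under the empirical CDF of per-user median distances."""
    return fmean(max(0.0, 1.0 - d / delta_max_m) for d in deltas_m)
```

A user whose detected home coincides with a ping on most nights contributes a value of 1, while a user whose median distance exceeds the cap contributes 0.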

M3: home stay duration
This metric is based on the idea that people typically spend the majority of their nighttime at their homes. For a given user, we first identify the locations they visit during nighttime hours using a stay region detection method similar to Sadeghinasr et al. (2019) [43] but with an adaptive linkage calculation (details of this method are provided in Supplementary Sect. 2.2). The stay region closest to each user's detected home location is assigned as their 'home region'. With the detected stay regions, the performance metric for each user is simply the ratio of time outside the home region to the total time spent in all stay regions. The overall performance metric is given by the area under the curve of the CDF of this value. Here, $F^{\tau}_A$ is the CDF of the ratio, $r_{A,i}$, of time ($\tau$) spent in the home stay region, $C_{h_{A,i}}$, to the maximum time spent in any stay region $C_{k,i}$ over all the users $i$ detected by HDA $A$, and $K_i$ is the total number of stay regions detected for user $i$. Similar to $M_1$ and $M_2$, a higher value of $M_3$ indicates a better HDA.

Results
The HDAs listed in Sect. 2.2 are compared on the basis of their precision, as approximated by the three performance metrics in Sect. 3.2, and their sensitivity to data quality. These are described in the subsequent sections.

Performance comparison
The visual comparison of the performance metrics M1, M2, and M3 across the HDAs over all the datasets is shown in Fig. 4. The generating curves of these metrics are provided in Supplementary Fig. 2. In Fig. 4, the size of the radar polygons depicts the overall performance of an HDA, while the skewness of the polygons hints at the differences in the behavior of the HDAs across different datasets.

Overall differences by HDA
The findings from the plots in Fig. 4 are diverse and important. First, A1 consistently performs the worst in these overall results. This is expected, as A1 is a very straightforward HDA with several key limitations: (i) It is difficult to find the most frequent place among GPS points due to high data precision; (ii) The centroid may not necessarily be the most probable location; (iii) The results of this method are heavily susceptible to disturbances due to outliers; (iv) This method does not distinguish between spatiotemporal regions of stay and movement. For people with high movement during the night, the mean value of the coordinates can shift the detected home substantially far away from the user's trajectory. This explains why A1 performs substantially worse in the case of M2 compared to the other HDAs, since M2 directly involves computing the distance of the detected home location to the closest nighttime trajectory point.
The performance of the other algorithms is largely similar, with some exceptions. Algorithm A4 consistently performs better than the others, as is evident from the largest radar polygons corresponding to A4 in the three metrics. In particular, although A4 applies a data filtering criterion on top of its base HDA A3 and thus operates on fewer data points than A3, it performs better than A3. This might be attributed to the focus on data quality over quantity by discretizing the data temporally, as explained in Sect. 2.2.3. This is important because it is possible for users to have high LBS activity during traveling (e.g., for navigation services) which may overshadow the location data during stay periods such as at home. Since traveling generally occurs far from home, all HDAs other than A4 are more likely to consider these irrelevant points in the home detection process.
This bias is reduced to a lesser extent in A3 and A5, which rely on clustering. This positive impact of discretization is also evident in terms of space. A2, which is a very simple heuristic that only involves finding the most visited grid cell, i.e., the discretization of space, performs well, with metric values close to those of A4 in most cases.
The rule-based HDA A5 is generally found to perform slightly worse than A3, although this pattern reverses in the case of M2. Both A3 and A5 involve clustering, but the order and kind of clustering are different between the two. It may be argued that the time and distance-based thresholds involved in the stay point detection step of A5 might hamper the performance of the algorithm since those thresholds do not take into account the continuity of the data.

Differences by dataset
The radar plot in Fig. 4 also shows the significant differences in the performance of the same HDA in different datasets. Notably, all the metrics are observed to be the highest in the case of D5. It should be noted that datasets D4, D5, and D6 have the same underlying urban land use and transportation networks. D5 corresponds to the period of reduced mobility and high stay-at-home rates during the surge of COVID-19 in the Indianapolis region. It includes the date of the first death related to COVID-19 recorded in the region on March 16, 2020, and the imposition of the government-mandated lockdown on March 23 [59]. Since people were more likely to stay at home during the period of D5, the data quality for the HDAs was substantially better than in the other datasets, making it easier for all the HDAs to perform the best. This is made further prominent in the stark difference between D4 and D5 in the value of M3, which depends on the time spent at home.
Moreover, the performance metric values for D6 are consistently near the corresponding values of D4 and D5. This makes sense given that the period of D6 is the union of the periods of D4 and D5, which are of equal length. This indicates that the better data quality of D5 does not inordinately skew the performance metrics.
In Baton Rouge, the effect of Hurricane Ida is observed to be small yet important. This is evident in the higher values of M3 for dataset D2, which corresponds to the period close to and immediately after the hurricane landfall, compared to the pre-landfall (D1) and long-term post-landfall (D3) periods. However, the values of M1 and M2 do not vary significantly between D1, D2, and D3.

Differences by metric
The ranges and behaviors of the three performance metrics also shed light on the nature of the analysis of this study. First, M1 has a large range of 0.45 to 0.76. All the tested HDAs perform substantially better than a random uniform HDA where the residential detection rate curve is plotted by simply computing the proportion of land use region covered by residential areas. This is evident in Fig. 4A where the black dashed curve (denoting this uniform random HDA) is significantly smaller than those of the other HDAs in the plot. It must be noted, however, that M1 relies on assumptions about home location that might not always hold true and could have skewed the results. For example, some users may stay at places other than their homes (such as a hotel or a relative's residence). Similarly, the home locations of night-shift workers may be overrepresented in the commercial areas of a city and thus reduce the value of M1.
The case for M 2 is also similar.It has a substantially small range outside of the poorest performing A 1 .This could be attributed to the fact that M 2 is unidirectional in its utility.That is, a small shortest distance of trajectory points from home only serves as a necessary condition for a good HDA, not a sufficient condition.Its computation relies on the distance to the closest point to the trajectory.Since the home locations are detected based on the trajectory itself, it is highly probable for an HDA to produce a high value of M 2 for a set of users who do not travel very long distances.

Sensitivity to data quality
In the previous sections, we observed the difference in the performance of the test HDAs. While it was shown that the continuity of data discretized in space and time substantially influences the goodness of an HDA, there is substantial nuance to the effect of data quality, in terms of overall ping density, on this goodness. In this section, we ask the question: "if an analyst has geolocation data of a specific ping density, which HDA should they choose for their analysis?" Building on the notion of ping density, the data quality of a user in this section is defined as the mean number of pings per night in their data points. Users with more pings on average are expected to have higher quality data and yield better home location detection results. At the same time, however, owing to the nature of mobile phone geolocation data, most users have very few data points, making home detection a difficult task (for reference, see SM). This means that a good HDA should strike a balance between data quantity and data quality.
To achieve this, we recomputed the performance of the HDAs on several subsets of the users by dividing them by their data quality, given by their mean nightly ping count. To simplify the decision-making for HDA choice, we further computed the mean value of the three metrics for the subset of users contained in each bin, given by $M = \frac{1}{3}(M_1 + M_2 + M_3)$. The results of these aggregate metrics are shown in Fig. 5.
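The subsetting and averaging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the user record layout (`nightly_pings`) and the `compute_metrics` callable that returns (M 1, M 2, M 3) for a subset of users are hypothetical placeholders, and the subsets follow the "at least x pings per night" convention used in the caption of Fig. 5.

```python
from statistics import mean

def data_quality(nightly_ping_counts):
    """Data quality of one user: mean number of pings per observed night."""
    return mean(nightly_ping_counts)

def aggregate_metric(m1, m2, m3):
    """Unweighted mean of the three performance metrics."""
    return (m1 + m2 + m3) / 3.0

def performance_by_quality(users, thresholds, compute_metrics):
    """Aggregate metric for users with at least x pings per night, for each x.

    users: list of dicts with a 'nightly_pings' list (hypothetical layout).
    compute_metrics: hypothetical callable, subset -> (m1, m2, m3).
    Thresholds that leave no users are skipped.
    """
    results = {}
    for x in thresholds:
        subset = [u for u in users if data_quality(u["nightly_pings"]) >= x]
        if subset:
            results[x] = aggregate_metric(*compute_metrics(subset))
    return results
```

Passing the metric computation in as a callable keeps the sketch independent of how each metric is actually evaluated.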
The findings of this figure are aligned with those in the previous section. First, we see that at nearly all levels of data quality, the order of performance is largely consistent with the overall results shown in Fig. 4. A 4 still consistently performs the best, closely followed by A 3 and A 2, while A 1 and A 5 perform considerably worse. When the data quality is measured in relative terms, i.e., using the ping count distribution of each dataset, the trends are considerably different (see Supplementary Fig. 3).
Notably, in Fig. 5A, though A 1 performs worse than A 5 in the case of M 1 and M 2, the trend is reversed for M 3. The trends of M 3 also differ from those of the other two metrics in that M 3 decreases with increasing data quality. This is likely because M 3 involves computing the ratio of time spent in the detected stay-at-home region, which for low-quality data users is likely to be the only (or one of the only) stay regions detected, since they do not have enough data to begin with. In contrast, M 1 and M 2 rely on the richness of the data in increasing the likelihood of locating a user in a residential region and the closeness to the trajectory, respectively.
To compare the overall relative performance of the HDAs, we also computed the mean value of each of the three metrics across all the study datasets. The result for one dataset, D 1, is shown in Fig. 5B. Similar results for the other datasets are shown in Supplementary Fig. 4. It can be seen that the opposite trends of M 1 and M 2 versus M 3 are balanced to some extent when their values are averaged. There is a steady but small increase in the value of M as the user data quality increases in D 1. This shows that there is merit in choosing these metrics, as their values do not show any abrupt behavior over different data quality categories.
This comparison is also helpful in choosing the data filtering required for any downstream application of home location detection. For example, suppose we decide that a mean performance value of 0.8 is acceptable in a dataset similar in ping count distribution to D 1 and with an urban land use similar to Baton Rouge. Then, we can refer to Fig. 5B to see that, for HDA A 4, users with at least 50 pings per night would be required for analysis (dotted vertical line). This corresponds to the 13% best quality users of the dataset, since 87% of the users have fewer than 50 pings (right horizontal dotted line).
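The filtering decision in this example can be expressed as a small lookup. The sketch below is illustrative only: the function names are hypothetical, and any curve values supplied to it would have to be read off a figure such as Fig. 5B for the dataset at hand.

```python
def min_quality_threshold(curve, target):
    """Smallest ping threshold whose aggregate metric reaches the target.

    curve: iterable of (pings_per_night_threshold, aggregate_metric) pairs.
    Returns None if no threshold reaches the target.
    """
    for threshold, value in sorted(curve):
        if value >= target:
            return threshold
    return None

def share_retained(user_qualities, threshold):
    """Fraction of users whose mean nightly ping count is at least threshold."""
    kept = sum(1 for q in user_qualities if q >= threshold)
    return kept / len(user_qualities)
```

With a target of 0.8 and a curve that first reaches it at 50 pings per night, `share_retained` then reports how much of the sample survives the filter, which is the trade-off the paragraph above describes.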

Impact on applications
To see how different HDAs would influence applications of human mobility assessment and how our performance metrics could help improve the results, we conduct two experiments on common tasks where smartphone data is considered superior to other sources. These are explained in the subsequent sections.

Hurricane evacuation identification
Large-scale GPS data is used to estimate the evacuation/return patterns during natural disasters [8,31]. In this task, a crucial factor is the distance between individuals' post-disaster stay locations and their original home locations before the disaster.
Here, we calculate this factor based on D 1 (before landfall) and D 3 (aftermath of Hurricane Ida) using the five test HDAs. Then, we estimate the evacuation ratio using a distance threshold: if the distance between an individual's pre- and post-disaster home locations exceeds 1 km, we consider them evacuated.
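The thresholding rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify the distance formula, so great-circle (haversine) distance is assumed here, and the dictionaries mapping user IDs to (latitude, longitude) homes are hypothetical inputs. Only users with a detected home in both periods are counted.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def evacuation_ratio(pre_homes, post_homes, threshold_km=1.0):
    """Share of users whose home moved by more than threshold_km.

    pre_homes, post_homes: dicts of user_id -> (lat, lon) (assumed layout).
    Returns None when no user appears in both periods.
    """
    common = pre_homes.keys() & post_homes.keys()
    if not common:
        return None
    evacuated = sum(
        1 for u in common
        if haversine_km(*pre_homes[u], *post_homes[u]) > threshold_km
    )
    return evacuated / len(common)
```

Running the same function on the homes produced by each HDA gives one evacuation ratio per algorithm, which is the quantity compared across HDAs in this experiment.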
We observe that among the five HDAs, A 1 and A 5 produce significantly different distributions of the distance between pre- and post-disaster homes (see Fig. 6). Even for HDAs with similar CDF curves, it can be seen from Fig. 6B that they can generate significantly different estimates of the evacuation ratio in some areas (e.g., the northwestern part and the southern part of the city). When connecting these results with the observations of Sect. 4.1, we notice that the HDAs with good and similar performance metrics (namely A 3, A 4, and A 2) tend to create similar results. In contrast, A 1 and A 5 result in much higher evacuation rates. Since evacuation rates are essential in assessing policies and equity issues related to home evacuation, in-place sheltering, and disaster recovery, it can be imagined that adopting an arbitrary HDA can have substantial negative impacts on policymaking [60].

COVID-19 policy impacts assessment
GPS-based cell phone location data has been extensively used to evaluate mobility patterns and potential solutions during COVID-19 [61]. These include evaluating alterations in population-wide mobility [62], compliance with COVID-19 policies in various demographic groups [63], and the spread and associated risk of disease from different regions [64]. However, erroneous home location inference may lead to inaccurate assessment of mobility changes and policy compliance of regions or demographics, resulting in resource misallocation and ineffective policies.
To demonstrate the consistency of inferred homes, we report the percentage of users with home locations within the same zone for each HDA (Fig. 7A). We show the results for both an aggregated administrative boundary (county) and a disaggregated one (tract). With HDA A 1, only 47% of the users exhibit a consistent census tract, while with A 5, 56% of such users were observed. For all remaining users, demographic considerations can be inconsistent and imprecise. For every HDA at the spatially aggregated county level, more than 80% of the users are classified within the same county. At both spatial levels, A 4 shows the highest consistency, with A 3 and A 2 being comparable. Therefore, for reliable analysis, this suggests using HDAs A 2, A 3, and A 4 rather than A 1 and A 5, aligning with the findings presented in Sect. 4.1.
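The consistency check above can be sketched as a comparison of zone assignments between two runs of the same HDA. This is an illustration under stated assumptions: the inputs are hypothetical dictionaries mapping user IDs to the 11-digit census tract GEOID of the inferred home in each of two datasets, and the county is derived from the tract GEOID, whose first five digits are the state and county FIPS codes.

```python
def county_of(tract_geoid):
    """State + county FIPS prefix (first 5 digits) of an 11-digit tract GEOID."""
    return tract_geoid[:5]

def zone_consistency_pct(zones_a, zones_b, to_zone=lambda z: z):
    """Percentage of shared users assigned to the same zone in both runs.

    zones_a, zones_b: dicts of user_id -> tract GEOID (assumed layout).
    to_zone: maps a tract GEOID to the comparison zone (identity = tract).
    Returns None when the two runs share no users.
    """
    common = zones_a.keys() & zones_b.keys()
    if not common:
        return None
    same = sum(1 for u in common if to_zone(zones_a[u]) == to_zone(zones_b[u]))
    return 100.0 * same / len(common)
```

Calling `zone_consistency_pct(a, b)` gives the tract-level figure, and `zone_consistency_pct(a, b, to_zone=county_of)` gives the county-level one. County-level consistency can never be lower than tract-level consistency, because two homes in the same tract are necessarily in the same county.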
We further investigate the potential impact of inconsistent home inference on a realistic application, income-based inequality assessment, which relies on demographic information inferred from home location. Income-based inequalities have been extensively examined using cell phone data in aspects such as access to opportunities [65], the well-being of individuals [66], emissions [67], and evacuation [50]. An inadequate HDA may result in the misclassification of users into different income groups, compromising the accuracy of assessing inequalities and characteristics associated with people from specific income groups.
We assess the percentage of users exhibiting income group discrepancies, based on the median income of the census tract of the home inferred by a given HDA in two datasets. Income groups are categorized following the Longitudinal Employer-Household Dynamics (LEHD) Origin-Destination Employment Series (LODES) dataset, comprising three categories based on monthly income: low (less than $1250), mid ($1250-$3333), and high ($3333 and above) [68]. Figure 7B shows the percentage of users with inconsistent income group classification across the two datasets. A minimal proportion of users experienced misclassification between high- and low-income groups. However, a significant number of low-income users were incorrectly classified as middle-income and vice versa, resulting in a blending of categories and inaccurate assessment of behavior. Both A 1 and A 5 exhibit the highest percentage of misclassified users. The consistent performance of A 2, A 3, and A 4 suggests their suitability for studies involving demographics. These findings underscore the importance and precision of the performance metrics, as they align with the results from Sect. 4.1.
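The income-group comparison can be sketched as below. This is an illustration, not the authors' code: the `tract_income` lookup (tract ID to median monthly income in dollars) and the user-to-tract dictionaries are hypothetical inputs, and the handling of the boundary values ($1250 counted as mid, $3333 counted as high) is an assumption made to resolve the overlap in the stated ranges.

```python
def income_group(monthly_income):
    """Three-way income category using the thresholds quoted in the text."""
    if monthly_income < 1250:
        return "low"
    if monthly_income < 3333:
        return "mid"
    return "high"

def income_mismatch_pct(tracts_a, tracts_b, tract_income):
    """Percentage of users whose income group differs between two datasets.

    tracts_a, tracts_b: dicts of user_id -> tract ID of the inferred home.
    tract_income: dict of tract ID -> median monthly income (assumed units).
    Users missing from either dataset, or whose tract has no income value,
    are ignored. Returns None when no user can be compared.
    """
    comparable = [
        u for u in tracts_a.keys() & tracts_b.keys()
        if tracts_a[u] in tract_income and tracts_b[u] in tract_income
    ]
    if not comparable:
        return None
    mismatched = sum(
        1 for u in comparable
        if income_group(tract_income[tracts_a[u]])
        != income_group(tract_income[tracts_b[u]])
    )
    return 100.0 * mismatched / len(comparable)
```

Note that a user whose inferred home moves between two tracts in the same income group does not count as a mismatch, so this percentage is a lower bound on the tract-level inconsistency.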

Discussion and conclusion
In this study, we examine several home detection algorithms (HDAs) for mobile phone geolocation data, an important source that opens novel opportunities on several crucial topics. To evaluate the quality of identified home locations, we propose three performance metrics. Each metric corresponds to a feature that the true home location would likely hold: most identified homes should be located in residential areas (metric M 1), the home should be close to one's trajectory on every day (M 2), and people typically spend most of the nighttime at their homes (M 3). We test four representative HDAs together with one that we propose, and calculate the metrics on eight datasets in four US cities with different urban layouts and population distributions. We also conduct a sensitivity analysis against data density to understand the impact of data quality on the relative quality of the detected home locations.
We find that different HDAs, even those well established in the literature, can lead to significantly different home location results. Among the five HDAs tested in this study, we observe that two of them (A 1 and A 5) consistently perform worse than the other algorithms in all eight datasets. More than 20% of the homes detected by these two HDAs fall outside a 2-mile radius from the home locations estimated by the other three HDAs in the eight datasets. A 1 is a simple centroid-based algorithm that is primarily used with call detailed records (CDR) mobile phone data. Its poor performance can be attributed to its sensitivity to outlier records and a lack of consideration for other data filtering criteria and nuances. A 5 is a more sophisticated algorithm that uses both clustering and a rule-based approach to identify the location of the home. Some or all of its poorer performance might be attributed to the choice of its many parameters. The other three HDAs (A 3, A 4, and A 2) perform similarly to each other. In addition, all three metrics agree with each other in terms of the rank of the performance, which supports the strength of their design.
We also propose a new algorithm (A 4), which is based on A 3 with an additional process that bins pings into 30-minute intervals to consider spatial data continuity. Under our metrics, we report that A 4 consistently performs better than the other HDAs studied. It is worth noting that by adding the binning process, we also manage to reduce computational time compared with A 3. Although computational time is not a big concern for this offline task, it becomes important if the size of the samples is substantially large.
We perform a sensitivity analysis of the data quality to provide useful suggestions to researchers who might encounter different data collection frequencies and sample rates. It is found that the order of relative performance remains largely the same even for different subsets of mobile phone devices ranked by their data quality.
Further, we explore the implications of different HDAs and their corresponding metrics for subsequent applications in human mobility assessment. We use two tasks to build our experiments: evacuation identification and COVID-19 policy impact assessment. There are two main takeaways: first, the use of different HDAs could significantly influence downstream results; second, by preferring HDAs with high (and similar) performance metrics, the results are more consistent and closer to expected behavior.
We expect our work to provide the following value to researchers and practitioners who use HDAs. First, we hope that this study can shed light on a previously unexamined issue: the quality of detected homes and its potential influence on findings in subsequent applications in human mobility assessment. These findings could be different for different fields. For example, in evacuation assessment, geolocation data may not be available for many nights. Using a limited amount of data may impact the quality of the estimated home locations. Second, we recommend that researchers use these metrics to compare the performance of their chosen HDA with others for their use case (geographic location, time period, and data quality) before finalizing that HDA. In urban planning, for example, planners might want to select a different HDA based on the data quality threshold they choose for their planning region to estimate home-based trip rates. Third, we expect our results to establish general ideas about what makes a good HDA. In the literature, we observe that different researchers tend to adopt or even design different HDAs based on their available data. In this case, the finding that data continuity (across different times of the day) matters more than data density can provide useful guidance for their methodology design. Moreover, since HDAs are commonly shared in many applications of passively collected human mobility data, we have created an open-source toolbox [69] to facilitate access to our proposed metrics and different HDAs.
We also recognize some limitations of our study and some related topics that merit further examination. First, due to the absence of large-scale true home locations, our evaluation can only be indirect. Note that this absence is also the motivation for performing home detection, which suggests that this would be a limitation for all HDAs when they are applied in practice. Here, we introduce the COVID-19 scenario to alleviate this issue, as the influence of the lockdown on human mobility is well studied and accepted. Given that the information about people's exact home locations is very sensitive, we expect that the restrictions are unlikely to be fully resolved, but we expect future events to provide opportunities to create more evidence. Second, we recognize that the datasets used in this study may not reflect the nature, quality, and quantity of data available to other researchers. Finally, our proposed metrics are 'necessary' conditions for the detected homes to be good, as they align with our intuition of the features that a real home location would follow. It would be interesting to establish the 'sufficient' conditions for an HDA's results to be acceptable. To establish such standards, we posit the need for more and diverse empirical evidence with our proposed metrics.

Figure 1
Figure 1 Framework of the study. The figure shows the key components of the experiment: HDAs, datasets, and metrics. The cross symbol denotes Cartesian product

Figure 2
Figure 2 Flowchart of the steps of the HDAs compared in this study. The values shaded in grey depict the algorithms' parameters. The dashed lines between two HDAs depict the same or equivalent step between the two HDAs

Figure 3
Figure 3 Study regions showing the metropolitan statistical area (MSA) counties and bounding boxes. The population density of the census block groups as per ACS 2020 data is colored in red. The regions covered in the land use maps are shaded in cyan

Figure 4
Figure 4 Performance metrics for the HDAs across the study datasets. For M 1, the dashed black line represents a uniform random selection algorithm based on the residential area buffers up to 50 m. The datasets of the same city are grouped in cyan

Figure 5
Figure 5 Impact of data quality on HDA performance. Each value x on the x-axis represents the subset of users having at least x pings per night on average. (A) Comparison of the mean value (x) of each metric across all the datasets. The shaded regions correspond to the range x ± σ, where σ is the standard deviation across the datasets. (B) Comparison of the mean of the three metrics for one dataset. For reference, the CDF of the users sorted by the average nightly ping count (x-axis) is shown in the shaded blue curve on the right y-axis

Figure 6 Figure 7
Figure 6 Evacuation identification results under different HDAs. (A) CDF of the distance between homes identified before (dataset D 1) and after Hurricane Ida (D 3) in Baton Rouge MSA. The threshold to classify users as displaced (1 km) is highlighted. (B) Identified percentage of users classified as evacuated by census tracts

Table 1
Summary of commonly used HDAs for different data and algorithm types

Table 2
A sample of the GPS data used in this study.The coordinates have been fuzzed for illustration

Table 3
Description of the study datasets (combinations of region and analysis period)