In XDR, the main unit of analysis is a *network event* [20], which indicates a billing record for a device. Such events include calls, text and multimedia messages, and Internet downloads. Calls and messages are billed individually, whereas Internet connections are billed in batches. The number of TCP/IP network packets sent through a tower may be very high, but billing is performed according to the number of megabytes transmitted. We work with anonymized XDR data, where each network event contains the ID of the tower, a timestamp, and a tokenized device ID. Device IDs are consistent across the dataset, *i.e.*, two events with the same ID belong to the trajectory of the same device.

In transportation, the core unit of analysis is a *trip*, with its corresponding attributes, *e.g.*, trip origin, destination, departure time, traveled distance, purpose, and mode(s) of transportation [13]. Trips can be aggregated into Origin-Destination (OD) matrices, which encode the number of trips from one area of the city to another. These areas can be blocks, neighborhoods, municipalities, or other administrative divisions. Transportation experts usually work with OD matrices that are representative at the county or municipal level, due to the limitations of data-collection methods. For instance, in Santiago, Chile, the last travel survey is representative at the municipal level, meaning that, even though individual trip information is available, only the municipal-level analysis is representative of city behavior.

This paper focuses on a transportation problem: *the inference of mode(s) of transportation for commuting*. Since commuting refers to a recurrent, routine trip, it is possible to go one level up in the analysis pipeline, and move from *trips* to *commuters*. Thus, our main unit of analysis will be devices used by commuters. From now on, we use the terms commuters and users interchangeably.

To solve the problem, we propose a pipeline that takes XDR as input and generates a list of commuters, with their home and work locations and their assigned mode(s) of transportation. The result can be a single mode or a combination (*e.g.*, bus and metro). Figure 1 shows a schematic view of the problem. Starting from XDR, an algorithm infers trips for each device (Section 2.1). These trips are then used to identify home and work locations for a device, which allows us to assign a trip purpose and thus label commuting trips (Section 2.2). Next, each tower is labeled according to its proximity to relevant urban/transportation infrastructure, using crowd-sourced geographical data and public transportation network feeds (Section 2.3). With the set of non-pedestrian commuting trips, a user-tower matrix is built, according to the towers that users connected to while moving. Using the tower labels, some users who do not perform pedestrian trips can be weakly associated with a mode of transportation. These users are considered as seeds for a semi-supervised model (Section 2.4), which is then used to group users into *modal clusters* (Section 2.5). For users who do not provide enough information as input for the model, we identify whether the unlabeled home/work trajectory is pedestrian (Section 2.6). Finally, by aggregating all commuters and their labels, we are able to estimate the distribution of usage of mode(s) of transportation in a city, also known as *modal partition* in transportation terms (Section 2.7).

The remaining part of this section explains each stage of the pipeline in detail.

### 2.1 Trip inference through activity detection

This stage models the task of inferring trips for a given spatio-temporal trajectory using computational geometry techniques and transportation rules. A trip is considered one of the many activities that can be performed within a day. Let \(J_{u}\) be the set of activity tuples for device *u* on a given day:

$$ J_{u} = \bigl\{ \bigl(A_{i}, (t_{iO}, t_{iD}), (p_{iO}, p_{iD}), I_{i}\bigr) \bigr\} . $$

(1)

In the tuple *i*, \(A_{i}\) is an activity type, \(p_{i}\) (and \(t_{i}\)) are the positions (and times) associated to the origin (\(p _{iO}\)) and destination (\(p_{iD}\)) of \(A_{i}\), and \(I_{i}\) is the set of intermediary points in the trajectory from \(p_{iO}\) to \(p_{iD}\). Activities can be of types *trip*, *stay*, and *unknown* (*i.e.*, activities that cannot be classified due to lack of data). The intermediary points are denoted as *within-trip waypoints*.

To identify activities, we assume that, during a day, a set of turning points exists [18]. A turning point is a moment of the day, at a specific position, where the device owner starts an activity (and, by definition, ends the previous one). To build the list of turning points, we define the following vector for each user *u*:

$$ \vec{u} = \bigl[(t_{0}, p_{0}), (t_{1}, p_{1}), \ldots, (t_{n}, p_{n})\bigr], $$

(2)

where each element in *u⃗* corresponds to an event of *u* in a day, with a timestamp *t*, and a tower position *p*. These vectors can be projected into a 2D plane: the *x*-axis is the elapsed time during the day, and the *y*-axis is the accumulated distance from the starting point of the day:

$$ d_{i} = \sum_{j = 1}^{i} E(p_{j}, p_{j-1}), $$

(3)

where *E* is the Euclidean distance function between two points in space.

Next, we build a spatio-temporal trajectory over the events in *u⃗*, from which the turning points will be selected:

$$ T_{u} = \bigl[(t_{i}, d_{i}): \forall i \in [0, n] \bigr]. $$

(4)
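As an illustration of Eqs. (2) to (4), the following minimal sketch turns one day of events into the pairs \((t_{i}, d_{i})\). The function names, the sample events, and the use of projected planar coordinates in meters are assumptions made for the example, not part of the original pipeline.

```python
import math


def euclidean(p, q):
    """Euclidean distance E between two planar points, as in Eq. (3)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def build_trajectory(events):
    """Map time-ordered (t, p) events to the (t_i, d_i) pairs of Eq. (4).

    Assumption: positions are projected coordinates (e.g., meters),
    so the Euclidean distance is meaningful.
    """
    trajectory = []
    accumulated = 0.0
    for i, (t, p) in enumerate(events):
        if i > 0:
            # d_i adds the distance from the previous event (Eq. 3).
            accumulated += euclidean(p, events[i - 1][1])
        trajectory.append((t, accumulated))
    return trajectory


# Hypothetical day: minutes since midnight, tower positions in meters.
events = [(0, (0, 0)), (480, (0, 0)), (510, (3000, 4000)), (1020, (3000, 4000))]
trajectory = build_trajectory(events)
```

In this toy day the accumulated distance stays flat while the device remains at one tower and jumps by 5000 m during the 30-minute displacement, which is the shape that the simplification step exploits.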

To identify turning points, we simplify \(T_{u}\) into \(S_{u}\) using the Visvalingam–Whyatt line simplification algorithm [21]: the points identified as relevant by the algorithm are considered turning points. The algorithm assigns a weight to each point in the trajectory, and keeps only those with a weight above a given threshold. The weight of a point is defined as the area of the triangle formed by the previous, current, and following points. The greater the weight, the greater the importance of the point. By definition, the starting and ending points of the trajectory have infinite weight, and thus they are always present in the result of the algorithm, regardless of the threshold. Note that prior work [18] used a different algorithm (Ramer–Douglas–Peucker); however, Visvalingam–Whyatt has a threshold with interpretable units, namely, distance multiplied by time. The points from \(T_{u}\) not included in \(S_{u}\) are saved into \(I_{i}\) as waypoints.
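The simplification step can be sketched as below. This is a straightforward quadratic-time version of Visvalingam–Whyatt written for readability, not necessarily the implementation used in the pipeline; the function names, the sample trajectory, and the threshold value (in minutes times meters) are assumptions.

```python
def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c (shoelace formula)."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def visvalingam_whyatt(points, threshold):
    """Split points into (turning_points, waypoints).

    Repeatedly drops the interior point with the smallest weight until
    every remaining interior point has a weight above the threshold.
    The first and last points are never candidates for removal.
    """
    kept = list(points)
    removed = []
    while len(kept) > 2:
        # Weight of each interior point, using its current neighbors.
        areas = [triangle_area(kept[i - 1], kept[i], kept[i + 1])
                 for i in range(1, len(kept) - 1)]
        smallest = min(range(len(areas)), key=areas.__getitem__)
        if areas[smallest] > threshold:
            break
        removed.append(kept.pop(smallest + 1))
    removed.sort()
    return kept, removed


# Hypothetical trajectory: (minutes, accumulated meters).
T_u = [(0, 0), (240, 0), (480, 0), (510, 5000), (1020, 5000)]
S_u, waypoints = visvalingam_whyatt(T_u, threshold=1000.0)
```

Here the event at minute 240 lies on a flat stretch, so its triangle has zero area and it is dropped from \(S_{u}\), while the points that bound the displacement are kept.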

The points in \(S_{u}\) are chained to build a list of segments that represent activities. To do so, we employ a set of rules. *Unknown* segments are those where the total covered distance is greater than 50 kilometers. In those cases we cannot distinguish between trips and unknown situations, such as when mobile phones are connected to distant towers that are on top of a hill, or when the mobile number is associated with a vehicle (*e.g.*, a taxi). While these are indeed displacements in time and space, they are so large at the city scale that the dataset may have missed intermediate events (*e.g.*, due to connection to WiFi networks). *Stays* are stationary activities. Some stationary activities involve displacements (*e.g.*, working/studying on a big campus), but the speed of movement is much slower than when performing a trip. Thus, if there is a displacement, but its duration is greater than 180 minutes, we still identify the activity as a stay. Finally, *trips* are segments that are neither *unknown* nor *stay*.
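A minimal sketch of these rules follows. The 50 km and 180 minute thresholds come from the text; treating a segment with zero displacement as a stay, the rule order, and all names are assumptions of this example.

```python
MAX_TRIP_DISTANCE_KM = 50.0    # threshold stated in the text
MAX_TRIP_DURATION_MIN = 180.0  # threshold stated in the text


def classify_segment(start, end):
    """Label the segment between two consecutive turning points.

    Each turning point is (minutes, accumulated_km).
    """
    duration = end[0] - start[0]
    distance = end[1] - start[1]
    if distance > MAX_TRIP_DISTANCE_KM:
        return "unknown"
    if distance == 0 or duration > MAX_TRIP_DURATION_MIN:
        # Assumption: no displacement at all also counts as a stay.
        return "stay"
    return "trip"


def classify_segments(turning_points):
    """Chain consecutive turning points into labeled segments."""
    pairs = zip(turning_points, turning_points[1:])
    return [(classify_segment(a, b), a, b) for a, b in pairs]


# Hypothetical turning points: (minutes, accumulated km).
S_u = [(0, 0.0), (480, 0.0), (510, 5.0), (1020, 5.5), (1080, 70.0)]
labels = [label for label, _, _ in classify_segments(S_u)]
```

The third segment covers 0.5 km in 510 minutes, so it is labeled as a stay despite the displacement, which mirrors the campus example above.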

We merge contiguous segments tagged with the same activity. Two or more segments are merged into an activity by keeping the first time and position in the sequence as origin, and the last time and position in the sequence as destination. Additionally, there is a special case when two *trips* surround a *stay*. If the duration of the latter is less than or equal to 15 minutes, its activity is changed to *trip*, and merged accordingly. This scenario corresponds to situations when users in public transport make a connection, or when vehicles are stuck in traffic.
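The merging rules can be sketched as follows. The tuple layout `(label, (t_O, t_D), (p_O, p_D))`, the function names, and the merge, relabel, merge order are assumptions; bookkeeping of waypoints is omitted for brevity.

```python
from itertools import groupby

SHORT_STAY_MIN = 15  # threshold stated in the text


def merge_runs(segments):
    """Merge contiguous segments that share the same activity label."""
    merged = []
    for label, run in groupby(segments, key=lambda s: s[0]):
        run = list(run)
        first, last = run[0], run[-1]
        # Keep the first origin and the last destination of the run.
        merged.append((label, (first[1][0], last[1][1]), (first[2][0], last[2][1])))
    return merged


def absorb_short_stays(activities):
    """Relabel stays of at most 15 minutes that sit between two trips."""
    out = list(activities)
    for i in range(1, len(out) - 1):
        label, (t0, t1), pos = out[i]
        between_trips = out[i - 1][0] == "trip" and out[i + 1][0] == "trip"
        if label == "stay" and between_trips and t1 - t0 <= SHORT_STAY_MIN:
            out[i] = ("trip", (t0, t1), pos)
    return out


def merge_activities(segments):
    """Merge, absorb short stays, then merge the resulting trips."""
    return merge_runs(absorb_short_stays(merge_runs(segments)))


# Hypothetical day with a 10-minute connection at tower "B".
segments = [
    ("stay", (0, 480), ("A", "A")),
    ("trip", (480, 500), ("A", "B")),
    ("stay", (500, 510), ("B", "B")),
    ("trip", (510, 540), ("B", "C")),
    ("stay", (540, 1020), ("C", "C")),
]
activities = merge_activities(segments)
```

The two trips and the short connection collapse into a single trip from "A" to "C".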

After merging all activities, each user has a set of activities \(J_{u}\) for each day in the dataset. Note that the turning points of merged activities are saved as waypoints in \(I_{i}\).

### 2.2 Trip purpose

A commuting activity is a *trip* within two *stays*: one at home and one at work. This implies that, for a given device, we need to infer these two important locations: home and work.

In general, people follow daily routines where they spend most nights at home, and most daytime hours at work on business days. This makes it possible to infer these important locations in several ways, such as heuristics [22] or pattern recognition [23]. Given that we seek interpretability in all stages of the pipeline, we implemented the heuristics defined in [22]: home is the most frequent area with stays at night, and work is a frequent area with stays at work hours that is more distant than others. This procedure allows us to add a label to each trip in a set of activities \(J_{u}\): whether it is a commuting trip or not.
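One possible reading of these heuristics is sketched below. The hour ranges, the number of work candidates, the input layout, and the function name are all assumptions of this example; the exact definitions are those of [22], which are not reproduced here.

```python
from collections import Counter
import math

NIGHT_HOURS = set(range(0, 6)) | {22, 23}  # assumption, not from the text
WORK_HOURS = set(range(9, 18))             # assumption, not from the text
TOP_CANDIDATES = 3                         # assumption, not from the text


def infer_home_work(stays, tower_xy):
    """Return (home, work) tower IDs, or None where evidence is missing.

    stays: list of (tower_id, hour_of_day, is_business_day) tuples.
    tower_xy: {tower_id: (x, y)} in projected coordinates.
    """
    night = Counter(t for t, h, _ in stays if h in NIGHT_HOURS)
    if not night:
        return None, None
    home = night.most_common(1)[0][0]
    day = Counter(t for t, h, biz in stays
                  if biz and h in WORK_HOURS and t != home)
    if not day:
        return home, None
    # Among frequent daytime places, prefer the one farthest from home.
    candidates = [t for t, _ in day.most_common(TOP_CANDIDATES)]
    work = max(candidates, key=lambda t: math.dist(tower_xy[t], tower_xy[home]))
    return home, work


# Hypothetical towers (meters) and stays.
tower_xy = {"H": (0, 0), "C": (500, 0), "W": (6000, 0)}
stays = ([("H", 23, True)] * 5 + [("C", 2, True)] + [("W", 10, True)] * 4
         + [("C", 13, True)] * 2 + [("W", 11, False)])
home, work = infer_home_work(stays, tower_xy)
```

Once home and work are known, a trip whose origin and destination stays are at these two locations can be flagged as a commuting trip.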

### 2.3 Tower labeling using urban infrastructure data

In parallel to trip inference, we associate towers to modes of transportation as a way to provide weak labels to the inference process.

A tower provides connectivity to the devices around it; however, those devices may be in different contexts. For instance, the people connected to a tower installed within a metro station are more likely to be commuting than the people connected to a tower within a park. In a similar way, the people connected to a tower near a highway are more likely to be commuting by car than people connected to a tower on a main street, where several bus services are available. Likewise, car drivers are more likely to be on local streets than on main ones [24], and passenger loads imply that, even if there are more cars than buses on main streets, a device on a main street with bus routes is arguably more likely to be in a bus than in a car. We leave bikes out of the analysis, for reasons explained in the Future Work section.

Taking these assumptions into account, we associate each tower with one or more modes of transportation according to its proximity to urban infrastructure: highways and secondary streets are associated with cars; primary streets, with buses, if there are available bus routes; and bus corridors, with buses. Towers near surface-level metro lines are also associated with metro. This last distinction is relevant as the underground metro network has dedicated towers identified as such. As a result, for each mode of transportation *m*, we have a set of towers \(T_{m}\) that contains the corresponding associated towers. To sharpen the distinction between modes of transportation, we filter each \(T_{m}\) by removing towers that belong to more than one set.
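The association and filtering rules can be sketched as follows. The example assumes that the proximity test has already been computed, so each tower arrives with the set of infrastructure types found nearby; the category names, the input layout, and the function names are hypothetical.

```python
# Hypothetical infrastructure categories mapped to modes, following the text.
INFRA_TO_MODE = {
    "highway": "car",
    "secondary_street": "car",
    "bus_corridor": "bus",
    "surface_metro": "metro",
    "underground_metro": "metro",  # dedicated underground towers
}


def modes_for_tower(nearby, has_bus_routes):
    """Modes associated with one tower, given nearby infrastructure types."""
    modes = {INFRA_TO_MODE[kind] for kind in nearby if kind in INFRA_TO_MODE}
    # Primary streets only count for buses when bus routes are available.
    if "primary_street" in nearby and has_bus_routes:
        modes.add("bus")
    return modes


def build_mode_sets(towers):
    """Build T_m per mode, dropping towers that belong to several sets.

    towers: {tower_id: (set_of_nearby_infrastructure, has_bus_routes)}
    """
    t_m = {"car": set(), "bus": set(), "metro": set()}
    for tower_id, (nearby, has_bus_routes) in towers.items():
        modes = modes_for_tower(nearby, has_bus_routes)
        if len(modes) == 1:
            t_m[next(iter(modes))].add(tower_id)
    return t_m


towers = {
    "t1": ({"highway"}, False),
    "t2": ({"primary_street"}, True),
    "t3": ({"primary_street"}, False),
    "t4": ({"surface_metro", "bus_corridor"}, False),
    "t5": ({"underground_metro"}, False),
}
T_m = build_mode_sets(towers)
```

Tower "t4" is close to both metro and bus infrastructure, so it is removed from every set; "t3" has no bus routes and remains unlabeled.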

### 2.4 User modeling from trajectories

Given the granularity of XDR, identifying the mode(s) of transportation of a single trip is unlikely to succeed. Consider a trip that lasts 45 minutes: according to a typical granularity of 15 minutes between records, in the best scenario this trip has three events: a trip start, a *within-trip waypoint*, and a trip end. Hence, to identify the mode(s) of transportation, we have only one event: the *within-trip waypoint*. To overcome this limitation, we propose to aggregate commuting trips, which, being recurrent, provide a more complete picture of the urban infrastructure associated with user routines.

The first step is to build a *waypoint matrix* *W*, defined as:

$$ w_{i,j} = \frac{\# \text{ of }\textit{within-trip}\text{ events of user } u _{i} \text{ at tower } t_{j}}{\# \text{ of }\textit{within-trip}\text{ events of user } u_{i}}. $$

(5)

This schema is equivalent to the row-wise normalized document-term matrices found in Information Retrieval [14], where users are the equivalent of documents, and towers are the equivalent of terms.
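Equation (5) amounts to counting within-trip events per user and tower and dividing by the user's total. A minimal sketch, with hypothetical identifiers and a dense list-of-lists layout chosen only for readability:

```python
from collections import Counter


def waypoint_matrix(waypoints):
    """Build the row-normalized matrix W of Eq. (5).

    waypoints: list of (user_id, tower_id) pairs, one per within-trip event.
    Returns (users, towers, W), where W[i][j] refers to users[i], towers[j].
    """
    users = sorted({u for u, _ in waypoints})
    towers = sorted({t for _, t in waypoints})
    counts = Counter(waypoints)                 # events per (user, tower)
    totals = Counter(u for u, _ in waypoints)   # events per user
    W = [[counts[(u, t)] / totals[u] for t in towers] for u in users]
    return users, towers, W


waypoints = [("u1", "a"), ("u1", "a"), ("u1", "b"),
             ("u2", "c"), ("u2", "c"), ("u2", "b"), ("u2", "b")]
users, towers, W = waypoint_matrix(waypoints)
```

Each row sums to one, as in a row-normalized document-term matrix. A real dataset would call for a sparse representation, since most users connect to only a few towers.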

Our hypothesis is that, by decomposing *W* with matrix factorization, we will effectively arrange towers into clusters (or latent components) according to their co-occurrence in users’ daily routines. To do so, we decompose this matrix into two:

$$ W = A \times B, $$

(6)

where *A* is a \(|u|\times k\) matrix that encodes *k* user latent features for \(|u|\) users, and *B* is a \(k \times |t|\) matrix that encodes *k* latent tower features for \(|t|\) different cell towers. As the matrix decomposition method, we use Non-Negative Matrix Factorization (NMF) [25, 26], which applies here because all \(w_{i,j} \geq 0\) by definition. We choose NMF over other matrix decomposition methods such as SVD [27] (usually used to perform Principal Component Analysis or PCA [28]) because it has shown superior performance for the task of clustering [19, 29, 30, 31]. Moreover, the non-negativity constraint results in more interpretable latent features, since any user (row) or tower (column) in the *W* matrix can be represented as a weighted sum of parts [25, 29], all positive or zero. Then, using NMF, the matrix *W* is decomposed into two non-negative matrices, which gives a lower-rank approximation of *W*, such that \(W \approx A \times B\) [32]. The NMF problem has been formulated in several ways (*e.g.*, with Frobenius and Kullback-Leibler losses [25, 31]), and different methods have been proposed to solve it (*e.g.*, multiplicative updates, coordinate descent, *etc.* [33]). In our work, we start with the traditional formulation based on the Frobenius norm in order to further extend it to incorporate constraints based on our data. Figure 2 shows a diagram that explains the rationale behind using NMF to cluster users into modes of transportation.

NMF can be formalized as the following optimization problem:

$$ \min_{A,B} \Vert W - A \times B \Vert _{F}, $$

(7)

subject to *A* and *B* being non-negative, where the number of columns in *A* and the number of rows in *B* correspond to the rank *k* of the desired lower-rank approximation. In the original algorithm, the parameter *k* must be chosen manually, and its value should be decided jointly between data scientists and domain experts. In prior work [19], we found that, for several values of *k*, the clusters determined by NMF were of two types: urban areas delimited by contiguity, and transportation networks. As such, there is potential in guiding the algorithm to focus only on transportation features. To encode this prior information in the NMF algorithm, we use a semi-supervised approach named Topic-Supervised NMF (TS-NMF) [16]. With this method, we can give the algorithm examples of users whose cluster is already known. This information is based on how we associated towers with modes of transportation in the previous step. Thus, we propose to use \(k = 3\), where the clusters are *metro*, *bus*, and *car*. Based on MacMillan *et al.* [16], the previous formulation now incorporates constraints on users and modes of transportation in a matrix *L*, and thus the original NMF formulation extends to:

$$ \min_{A,B} \bigl\Vert W - (L \odot A) \times B \bigr\Vert _{F}^{2}, $$

(8)

where ⊙ is the Hadamard product operator. The matrix *L* contains the user labels, defined as:

$$ L_{u,m}= \textstyle\begin{cases} 1, & \text{if } P(u | m) \geq h, \\ 0, & \text{otherwise}, \end{cases} $$

(9)

where *h* is a threshold probability (*e.g.*, 0.8). The factor \(P(u | m)\) is the probability of user *u* being strongly associated to a specific mode of transportation *m*, given the set of towers \(T_{m}\), defined as:

$$ P(u | m) = \sum_{t \in T_{m}} w_{u, t}. $$

(10)
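Equations (9) and (10) can be computed directly from *W* and the tower sets \(T_{m}\). The sketch below uses hypothetical tower identifiers and a toy matrix; the function name and the fixed mode order are assumptions.

```python
MODES = ["metro", "bus", "car"]  # column order of P and L in this sketch


def label_matrix(W, towers, mode_towers, h=0.8):
    """Return (P, L) following Eqs. (10) and (9).

    W: row-normalized user-by-tower matrix (list of lists).
    towers: tower ID of each column of W.
    mode_towers: {mode: set of tower IDs}, i.e., the sets T_m.
    """
    P, L = [], []
    for row in W:
        # P(u | m): share of the user's waypoints at towers in T_m.
        p_row = [sum(w for w, t in zip(row, towers) if t in mode_towers[m])
                 for m in MODES]
        P.append(p_row)
        L.append([1 if p >= h else 0 for p in p_row])
    return P, L


towers = ["a", "b", "c", "d"]  # tower "d" has no mode label
mode_towers = {"metro": {"a"}, "bus": {"b"}, "car": {"c"}}
W = [
    [0.9, 0.1, 0.0, 0.0],      # mostly metro towers
    [0.25, 0.25, 0.25, 0.25],  # no dominant mode
    [0.0, 0.125, 0.875, 0.0],  # mostly car towers
]
P, L = label_matrix(W, towers, mode_towers, h=0.8)
```

Only the first and third users exceed the threshold and act as seeds. The second user receives an all-zero row, meaning that no seed label is available for that user.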

If other prior information is known, such as socio-economic information (*e.g.*, census data) of the area in which user *u* lives (inferred in a previous step), then \(P(u | m)\) can be updated, for instance, using Bayes' theorem.
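Returning to the objective in Eq. (8), one simple way to minimize it is with Lee-Seung style multiplicative updates, initializing *A* so that entries masked by *L* start at zero (multiplicative updates keep zeros at zero). This is an illustrative solver using NumPy, not necessarily the one used in [16] or in this work. The function name, the toy data, and in particular the treatment of users without any seed label as unconstrained (a row of ones in *L*) are assumptions of the sketch.

```python
import numpy as np


def ts_nmf(W, L, n_iter=500, eps=1e-9, seed=0):
    """Approximately minimize ||W - (L * A) @ B||_F^2 with A, B >= 0.

    Assumption: rows of L that are all zero (users without a seed
    label) are treated as unconstrained and replaced by rows of ones.
    """
    W = np.asarray(W, dtype=float)
    L = np.asarray(L, dtype=float).copy()
    L[L.sum(axis=1) == 0] = 1.0
    rng = np.random.default_rng(seed)
    k = L.shape[1]
    # Masked entries of A start at zero and remain zero under the
    # multiplicative updates, so L * A == A throughout.
    A = rng.random((W.shape[0], k)) * L
    B = rng.random((k, W.shape[1]))
    for _ in range(n_iter):
        A *= (W @ B.T) / (A @ B @ B.T + eps)
        B *= (A.T @ W) / (A.T @ A @ B + eps)
    return A, B


# Toy data: three seed users (metro, bus, car) and one unlabeled user.
W = [[0.5, 0.5, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0],
     [0.25, 0.25, 0.25, 0.25, 0.0]]
L = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
A, B = ts_nmf(W, L)
```

Because the seeds pin each latent component to one mode, the columns of *A* can be read as metro, bus, and car, in the order used to build *L*.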

### 2.5 Inference of mode(s) of transportation

In this step we work with the matrix *A*, which contains the user associations with latent components. We first normalize the matrix row-wise, to convert it into a matrix \(A'\) of probabilities, such that \(a_{u,m}\) is the inferred probability of user *u* choosing mode of transportation *m* for his/her commuting.

Since the values of \(a_{u,m}\) lie in the continuous range \([0,1]\), we still need to define the decision boundaries to classify *u* as a user of specific modes of transportation. One way to translate these values into a label such as “metro” or “bus and metro” is to perform an additional clustering step on these associations. This step requires a manually specified parameter \(k'\), which should be higher than the number of latent components *k* from the previous stage; otherwise, intermodality will not be detected. Hence, we use k-means [34] to obtain *modal clusters*, which effectively quantize the rows of \(A'\).
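The normalization and quantization steps can be sketched as follows. The example implements plain Lloyd iterations with a deterministic farthest-point initialization so that the result is reproducible; the initialization, the value \(k' = 4\), the toy matrix, and the function names are assumptions, and a library implementation of k-means would serve equally well.

```python
import math


def normalize_rows(A):
    """Turn each row of A into probabilities (all-zero rows stay zero)."""
    out = []
    for row in A:
        total = sum(row)
        out.append([v / total if total > 0 else 0.0 for v in row])
    return out


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all current centroids.
        centroids.append(max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical user associations; columns are metro, bus, car.
A = [[9, 1, 0], [8, 2, 0], [0, 1, 9], [0, 2, 8],
     [5, 5, 0], [4.8, 5.2, 0], [0, 10, 0], [0.5, 9, 0.5]]
A_prime = normalize_rows(A)
labels, centroids = kmeans(A_prime, k=4)
```

With \(k' = 4 > k = 3\), the users who split their associations between metro and bus form their own cluster, which is how a combined mode becomes visible.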

After quantization, a transportation expert may examine the centroids of each modal cluster, and then label them, incorporating their knowledge about the city into the model. For instance, a centroid \(\{\text{metro}: 0.6, \text{bus}: 0.3, \text{car}: 0.1\}\) may be labeled either as “metro” or “metro and bus,” depending on how the users closest to that centroid are distributed across the city. Then, users are assigned a modal cluster label based on the expert’s interpretation.

As a result of this step, each user has a tag that identifies their mode(s) of transportation for commuting.

### 2.6 Identification of pedestrian trips

It is possible that some users do not generate *within-trip* events due to how they consume data from the Internet, or due to short trips that do not allow the billing cycle to capture events in the middle of a trip. In this step, we determine whether the users who were not assigned a specific mode of transportation are pedestrian commuters. Users who fit neither case are flagged with a null value.

Pedestrian trips have decision variables that differ from other modes of transportation, including distance, available infrastructure, and safety [35]. Of these factors, distance is arguably the most critical. As such, we label users as pedestrian or not based on their distance from home to work. This distance may be manually selected by knowing the typical walk distances in a city, through transportation studies [36] (which indicates a typical maximum of 750 meters for pedestrian trips), or fitted using regression if there is access to a labeled set of trips. In both cases, care must be taken due to the characteristics of mobile phone network data: trips have starting and destination *towers*, not specific locations.
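A minimal sketch of the distance rule follows, using the 750 meter value mentioned above as the manually selected threshold. The use of the haversine distance between home and work tower coordinates, the sample coordinates, and the function names are assumptions; as noted, the distance is measured between towers rather than actual locations, so the threshold may need adjustment.

```python
import math

MAX_WALK_METERS = 750.0  # typical maximum cited in the text


def haversine_m(a, b):
    """Great-circle distance in meters between (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))


def pedestrian_flag(home, work, max_walk=MAX_WALK_METERS):
    """Return "pedestrian" for short home-work distances, else None."""
    if home is None or work is None:
        return None
    return "pedestrian" if haversine_m(home, work) <= max_walk else None


# Hypothetical tower coordinates (lat, lon).
near = pedestrian_flag((-33.4400, -70.6500), (-33.4440, -70.6500))
far = pedestrian_flag((-33.4400, -70.6500), (-33.5000, -70.6500))
```

The first pair of towers is roughly 445 meters apart and is flagged as pedestrian; the second pair is several kilometers apart and keeps the null flag.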

### 2.7 Estimation of the modal partition

At this stage, each commuter has either a modal cluster, a pedestrian flag, or a null flag. We discard users with a null flag, as we cannot classify them into a specific mode of transportation. Then, we aggregate users to estimate the *modal partition*, *i.e.*, the transportation term that denotes the distribution of mode of transportation usage. In particular, we follow two aggregation strategies: first, according to home locations; second, according to home/work location pairs in an Origin-Destination (OD) matrix. Here, a location can be any administrative unit; surveys are typically representative at the county or municipal level. These aggregations are commonly used by transportation experts in their day-to-day work [13], and, with this pipeline, we expect to generate data that is coherent and comparable to that collected through surveys, with greater granularity levels, whether spatial, temporal, or both.
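Both aggregation strategies reduce to counting labels per group and normalizing. A minimal sketch, where the record layout, the area names, and the function name are hypothetical:

```python
from collections import Counter, defaultdict


def modal_partition(commuters, key):
    """Share of each mode per group; commuters with a null mode are dropped.

    commuters: list of dicts with "home", "work", and "mode" entries.
    key: function mapping a commuter to its aggregation group.
    """
    counts = defaultdict(Counter)
    for c in commuters:
        if c["mode"] is None:
            continue
        counts[key(c)][c["mode"]] += 1
    return {group: {m: n / sum(cnt.values()) for m, n in cnt.items()}
            for group, cnt in counts.items()}


commuters = [
    {"home": "A", "work": "B", "mode": "metro"},
    {"home": "A", "work": "B", "mode": "bus"},
    {"home": "A", "work": "C", "mode": "metro"},
    {"home": "B", "work": "C", "mode": "pedestrian"},
    {"home": "A", "work": "B", "mode": None},
    {"home": "A", "work": "C", "mode": "car"},
]
by_home = modal_partition(commuters, key=lambda c: c["home"])
by_od = modal_partition(commuters, key=lambda c: (c["home"], c["work"]))
```

Passing a different `key` (for example, one that also includes a time window) would give the finer temporal granularity mentioned above, provided each group keeps enough commuters to be meaningful.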