This section introduces the three methods used in this article: measuring social mixing, modeling the attraction to malls, and clustering malls according to their customer profiles.

### 3.1 Quantifying social mixing

Our motivation is that malls promote social mixing, understood as the co-location of people of different socioeconomic status in the same spaces. This should be reflected in the choice of malls, which is traditionally understood as distance-based: if social mixing is an important factor when choosing which mall to visit, then of two malls at equivalent distance, the one with the highest social mixing should be preferred. To test this, we first need to quantify social mixing. We do so by analyzing segregation.

Several models of segregation exist. Recently, Louf and Barthelemy [11] proposed a model built on a null model of spatial segregation, in which the exposure of a population *α* to a population *β* is defined as:

$$ E_{\alpha\beta} = \frac{1}{N_{\alpha}} \sum_{m = 1}^{M} n_{\alpha }(m)r_{\beta}(m) $$

(1)

with

$$ r_{\beta} (m) = \frac{n_{\beta}(m) / N_{\beta}}{n(m) / N}, $$

(2)

where *m* is the mall index, *M* is the number of malls, \(N_{\alpha}\) is the total number of people belonging to category *α*, \(n_{\alpha}(m)\) is the total number of visitors to mall *m* belonging to category *α*, \(r_{\beta}(m)\) is the *representation* of category *β* in mall *m* as computed by Eq. (2), \(n(m)\) is the total number of visitors to mall *m*, and *N* is the total number of people. The exposure metric is interpreted as follows: if \(E_{\alpha\beta} > 1\), then subpopulations *α* and *β* meet within malls more often than expected without segregation, that is, social mixing happens. Conversely, if \(E_{\alpha\beta} < 1\), then the two categories are segregated from each other.
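As an illustration, Eqs. (1) and (2) can be computed from a table of visitor counts per mall and category. The sketch below is minimal and rests on assumptions that are not part of the original method: the function name `exposure`, the toy counts, and the simplification that \(N_{\alpha}\) and *N* are obtained by summing visitors over all malls.

```python
from collections import defaultdict

def exposure(visits, alpha, beta):
    """Exposure E_{alpha,beta} of category alpha to category beta (Eqs. 1-2).

    visits maps mall -> {category: number of visitors}.
    Assumption: category totals are the sums of visitors over all malls.
    """
    totals = defaultdict(int)            # N_c for every category c
    for counts in visits.values():
        for cat, n in counts.items():
            totals[cat] += n
    N = sum(totals.values())             # total number of people

    e = 0.0
    for counts in visits.values():
        n_m = sum(counts.values())       # n(m)
        if n_m == 0:
            continue                     # an empty mall contributes nothing
        # Representation of beta in mall m, Eq. (2).
        r_beta = (counts.get(beta, 0) / totals[beta]) / (n_m / N)
        e += counts.get(alpha, 0) * r_beta
    return e / totals[alpha]

# Hypothetical toy data: two malls, two categories.
mixed = {"A": {"low": 50, "high": 50}, "B": {"low": 50, "high": 50}}
segregated = {"A": {"low": 100, "high": 0}, "B": {"low": 0, "high": 100}}

print(exposure(mixed, "low", "high"))       # 1.0: no segregation
print(exposure(segregated, "low", "high"))  # 0.0: the groups never meet
```

In the fully segregated toy case the exposure of a category to itself rises above 1, which is the counterpart of the cross-category value dropping to 0.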

To define visitor categories, we bin users into percentiles according to their socio-economic characteristics. These percentiles are based on the Human Development Index (HDI). When not directly available, this index can be estimated with a method proposed by the United Nations [39]. Its formula includes income distribution, life expectancy and education. As such, the value of *E* depends on the distribution of HDI percentiles among mall visitors, where the HDI of a visitor is determined by their home location.

While *E* is a measure of segregation/mixing, we still need a measure of the social diversity within a specific mall. To this end, we use the Shannon Entropy \(S_{m}\) with respect to the percentiles of HDI of its visitors:

$$ S_{m} = -\sum_{q = 1}^{Q} p_{q} \log p_{q}, $$

(3)

where \(S_{m}\) is the entropy of mall *m*, *Q* is the number of HDI percentile bins, and \(p_{q}\) is the fraction of visitors to *m* that belong to HDI percentile *q*.

Note that, by definition, the representation term in the model compares the relative population that visits a mall to the value expected in an unsegregated city [11]. This implies random interactions with respect to social status. As found in earlier work, mall visits are strongly influenced by distance, and thus interactions may not follow a random pattern. Therefore, we compare the observed social mixing against a null model in which visitors always choose their nearest mall.
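A short sketch of the entropy computation follows. It uses the conventional leading minus sign of the Shannon entropy, so that values are non-negative. The function name, the use of the natural logarithm, and the toy counts are assumptions made for illustration; the text does not fix the base of the logarithm.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy of a mall's visitors over HDI bins (natural log).

    counts[q] is the number of visitors in HDI percentile bin q.
    """
    total = sum(counts)
    # Empty bins are skipped, following the convention 0 * log(0) = 0.
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical malls with four HDI bins each.
diverse = [25, 25, 25, 25]      # visitors spread evenly over the bins
homogeneous = [100, 0, 0, 0]    # visitors from a single bin

print(shannon_entropy(diverse) > shannon_entropy(homogeneous))  # True
```

With *Q* bins the entropy ranges from 0 (all visitors in one bin) to \(\log Q\) (uniform distribution).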
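The nearest-mall null model can be sketched as a reassignment of every visitor to the mall closest to their home. The code below is only an illustration under stated assumptions: planar coordinates with Euclidean distance, the function name `nearest_mall_counts`, and the toy locations are all hypothetical, and the text does not specify the distance metric.

```python
import math
from collections import Counter

def nearest_mall_counts(homes, malls):
    """Visit counts when every visitor goes to the closest mall.

    homes: list of ((x, y), category) pairs, one per visitor.
    malls: {mall_id: (x, y)}.
    Returns {mall_id: Counter({category: count})}.
    """
    counts = {m: Counter() for m in malls}
    for home, category in homes:
        # Assumption: straight-line distance in planar coordinates.
        nearest = min(malls, key=lambda m: math.dist(home, malls[m]))
        counts[nearest][category] += 1
    return counts

# Hypothetical city on a line, with a mall at each end.
malls = {"A": (0, 0), "B": (10, 0)}
homes = [((0, 0), "low"), ((1, 0), "low"), ((4, 0), "high"),
         ((9, 0), "high"), ((10, 0), "high")]

null_counts = nearest_mall_counts(homes, malls)
print({m: dict(c) for m, c in null_counts.items()})
```

The resulting counts have the same shape as observed visit counts, so exposure and entropy can be computed on both and compared.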

### 3.2 A gravity mobility model for mall visits

The gravity model of flow has been extensively used to model human mobility in different contexts [28, 29, 33], and in particular for retailing [31, 32]. It considers the flow between two nodes \((i, j)\) as directly proportional to some powers of their populations, and inversely proportional to some power of the distance between them:

$$ F_{ij}=G \frac{M_{i}^{\alpha}M_{j}^{\beta}}{D_{ij}^{\gamma}}, $$

(4)

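where *G* is a proportionality constant, \(M_{i}\) is the population of a square grid cell in the city, computed from census data, \(M_{j}\) is the mall size in terms of total rental space, and \(D_{ij}\) is the distance between the center of the grid cell and the mall. The exponents *α*, *β* and *γ* are parameters to be fitted; they are unrelated to the category labels *α* and *β* of Section 3.1.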
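To make the role of each term concrete, Eq. (4) can be evaluated directly. In this minimal sketch the function name and all parameter values are hypothetical, chosen only to show how the flow responds to distance.

```python
def gravity_flow(G, pop, size, dist, alpha, beta, gamma):
    """Flow F_ij between grid cell i and mall j, following Eq. (4)."""
    return G * pop**alpha * size**beta / dist**gamma

# Hypothetical values: a cell of 1000 inhabitants and a mall of size 50.
near = gravity_flow(G=1.0, pop=1000, size=50, dist=2.0,
                    alpha=1.0, beta=1.0, gamma=2.0)
far = gravity_flow(G=1.0, pop=1000, size=50, dist=4.0,
                   alpha=1.0, beta=1.0, gamma=2.0)

print(near / far)  # 4.0: with gamma = 2, doubling the distance quarters the flow
```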

The traditional approach for fitting this model consists of applying a logarithmic transformation, leading to a linear model on the logarithms of the variables:

$$ \log(F_{ij}) = \log(G) + \alpha\log(M_{i}) + \beta \log(M_{j}) - \gamma\log (D_{ij}) + \epsilon_{ij}, $$

(5)

where \(\epsilon_{ij}\) represents an additive, independent error term. This linearized model can be fitted through ordinary least squares (OLS) [40]. However, this approach has several limitations: it cannot model zero observations (which must be discarded, since their logarithm is undefined), and the estimated coefficients can be significantly biased under heteroskedasticity [41]. As an alternative, we replace the linear regression with a Generalized Linear Model (GLM) [42] with a Negative Binomial distribution for count data:

$$ \mathbf{E}[F_{ij}] = \exp\bigl[\log(G) +\alpha\log(M_{i}) + \beta\log(M_{j}) - \gamma\log(D_{ij})\bigr]. $$

(6)

This Negative Binomial GLM is fitted by maximizing the log-likelihood function. This maximization does not have a closed-form solution, but since the problem is convex, standard optimization techniques such as gradient descent or iteratively reweighted least squares (IWLS) are guaranteed to converge.

To account for the social mixing factor, in addition to the baseline gravity model we consider a distance-modulated model, where the distance to a mall varies according to its social diversity: malls with higher entropy appear closer. This model is specified as:
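The objective being maximized can be written down in a few lines. The sketch below evaluates the Negative Binomial log-likelihood for the mean in Eq. (6); it does not perform the optimization itself. The parameterization by a shape parameter `r` (variance \(\mu + \mu^{2}/r\)), treated here as known, is an assumption, as are the function names and the toy observations.

```python
import math

def expected_flow(log_G, alpha, beta, gamma, pop, size, dist):
    """Mean of the GLM, Eq. (6): exponential of the linear predictor."""
    return math.exp(log_G + alpha * math.log(pop)
                    + beta * math.log(size) - gamma * math.log(dist))

def nb_log_pmf(y, mu, r):
    """Log-probability of count y under a Negative Binomial distribution
    with mean mu and shape r (assumed parameterization)."""
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def log_likelihood(params, rows, r):
    """Sum of log-probabilities over (flow, pop, size, dist) observations."""
    log_G, alpha, beta, gamma = params
    return sum(
        nb_log_pmf(f, expected_flow(log_G, alpha, beta, gamma, p, s, d), r)
        for f, p, s, d in rows
    )

# Hypothetical observations; the third one has zero flow and is kept.
rows = [(4, 100, 4, 10), (1, 100, 4, 20), (0, 50, 2, 100), (10, 200, 5, 10)]

good = log_likelihood((0.0, 1.0, 1.0, 2.0), rows, r=2.0)
bad = log_likelihood((0.0, 1.0, 1.0, 0.0), rows, r=2.0)  # ignores distance
print(good > bad)  # True
```

Unlike the log-linear form of Eq. (5), the zero-flow observation contributes a finite term to the likelihood, so it does not need to be discarded.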

$$ \mathbf{E}[F_{ij}] = \exp \biggl[\log(G) +\alpha\log(M_{i}) + \beta\log (M_{j}) - \gamma\frac{\log(D_{ij})}{S_{j}} \biggr], $$

(7)

where \(S_{j}\) is the social diversity (entropy) of mall *j* according to its distribution of HDI percentiles (Eq. (3)). This modulated distance makes it possible to differentiate malls that, with respect to a visitor, are at the same distance but have different social mixing properties.
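The effect of the modulation can be checked numerically by comparing two malls that differ only in entropy. The following sketch evaluates the mean in Eq. (7); the function name and every parameter value are hypothetical.

```python
import math

def modulated_flow(log_G, alpha, beta, gamma, pop, size, dist, entropy):
    """Expected flow under Eq. (7): log-distance divided by mall entropy."""
    return math.exp(log_G + alpha * math.log(pop) + beta * math.log(size)
                    - gamma * math.log(dist) / entropy)

# Two hypothetical malls of equal size at equal distance from the same cell.
common = dict(log_G=0.0, alpha=1.0, beta=1.0, gamma=2.0,
              pop=1000, size=50, dist=5.0)
low_diversity = modulated_flow(entropy=1.0, **common)
high_diversity = modulated_flow(entropy=2.0, **common)

print(high_diversity > low_diversity)  # True
```

Note that this ordering relies on \(\log(D_{ij})\) being positive, that is, on distances larger than 1 in the chosen units; for an entropy of exactly 1 the expression reduces to the baseline model of Eq. (6).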

### 3.3 Clustering malls according to customer profiles

To better understand the motivations behind mall selection, we built a co-visitation network representing common mall customers. This is a weighted directed network whose nodes \(v_{i}\) are the 16 malls, while the weighted edges \((v_{i}, v_{j})\) between them represent the conditional probability of visiting mall \(v_{j}\) given that someone visited mall \(v_{i}\). We built a similarity matrix *S* between malls, where \(S_{ij}\) is the Kolmogorov–Smirnov distance between the customer profile distributions of malls \(v_{i}\) and \(v_{j}\). We then built a Logit model describing the conditional probability that a customer visits mall \(v_{j}\) given that they also visit mall \(v_{i}\). We fitted this model using a logistic regression:

$$ \mathbf{E}[p_{j|i}] = \bigl(1 + \exp\bigl[-\log(G) - \beta \log(M_{j}) - \lambda\log (S_{ij}) + \gamma \log(D_{ij})\bigr]\bigr)^{-1}, $$

(8)

where *G* is a parameter representing the *odds* of the event in the logistic regression.
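The ingredients of this section can be illustrated on toy data: the edge weights of the co-visitation network, the Kolmogorov–Smirnov distance between two customer profiles, and the probability in Eq. (8). This is a sketch under assumptions: the function names and the data are hypothetical, and customer profiles are taken to be discrete distributions over the same ordered bins (for example HDI percentiles), which the text does not state explicitly.

```python
import math
from itertools import permutations

def covisitation_weights(visits):
    """Edge weights P(visit j | visit i) from {customer: set of malls}."""
    malls = set().union(*visits.values())
    visitors = {m: {c for c, ms in visits.items() if m in ms} for m in malls}
    return {(i, j): len(visitors[i] & visitors[j]) / len(visitors[i])
            for i, j in permutations(sorted(malls), 2)}

def ks_distance(p, q):
    """Kolmogorov-Smirnov distance: largest gap between the cumulative
    distributions of p and q, given over the same ordered bins."""
    cp = cq = worst = 0.0
    for a, b in zip(p, q):
        cp += a
        cq += b
        worst = max(worst, abs(cp - cq))
    return worst

def covisit_probability(log_G, beta, lam, gamma, size_j, s_ij, d_ij):
    """Eq. (8): logistic function of the linear predictor."""
    z = (log_G + beta * math.log(size_j) + lam * math.log(s_ij)
         - gamma * math.log(d_ij))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical customers and the malls each one visited.
visits = {"u1": {"A", "B"}, "u2": {"A"}, "u3": {"A", "B", "C"}, "u4": {"C"}}
weights = covisitation_weights(visits)
print(weights[("A", "B")], weights[("B", "A")])  # asymmetric, hence directed

print(ks_distance([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]))  # 1.0
```

The example makes the direction of the edges visible: every visitor of mall B also visits mall A, but not the other way around.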