Measuring the effect of node aggregation on community detection

The nodes of a complex network are often aggregated, deliberately or not, because of technical, ethical or legal limitations, or for privacy reasons. A common example is geographic position: one may uncover communities in a network of places, or of individuals identified with their typical geographical position, and then aggregate these places into larger entities, such as municipalities, thus obtaining another network. The communities found in the networks obtained at various levels of aggregation may exhibit various degrees of similarity, from full alignment to complete independence. This is akin to the problem of ecological and atomic fallacies in statistics, or to the Modifiable Areal Unit Problem in geography. We identify the class of community detection algorithms most suitable to cope with node aggregation, and develop an aggregability index that captures to what extent the aggregation preserves the community structure. We illustrate its relevance on real-world examples (mobile phone and Twitter reply-to networks). Our main message is that any node-partitioning analysis performed on aggregated networks should be interpreted with caution, as the outcome may be strongly influenced by the level of aggregation.


Introduction
The last few years have seen a spectacular rise in the production of large datasets. These newly available sources of information have opened new avenues for research on modeling human behavior and social interactions. For instance, community detection in large-scale social networks has been under intense investigation in the last decade [1]. However, the nodes are often available only in aggregated form prior to any analysis, and to the best of our knowledge the impact of node aggregation on community detection has not yet been discussed.
Our motivating example in this article is the case of geolocalized social networks. A social network may be projected on a spatial territory and create a network of places. A community detection method applied to this network allows a partition into socially coherent entities. The position of a node may be a specific pair of coordinates, or be coarsened to an administrative unit (e.g. a municipality), a technological unit (e.g. the area covered by a telecommunication tower), or a regular cell (e.g. a square or hexagon). The reason for coarsening may lie in a coarse collection method, privacy concerns [2], or even relevance for a specific purpose. Indeed, one may want to study how municipalities interact with one another and form larger natural entities. This could be relevant, for example, when shaping councils where representatives from municipalities with dense ties to one another would discuss common policies. In this case the community structure is to be computed on the municipality-level network, and not on a finer level. If provided with the citizen-level network (with homes or GPS locations as nodes) as primary dataset, one then has to aggregate the data accordingly. In other circumstances, the citizen-level communities are desired and only the municipality-level dataset is available, raising the question of whether the communities on both networks will share some resemblance.
Deducing wrong statistical patterns on individuals from patterns observed at the level of aggregated categories of individuals is generically called an ecological fallacy, with Simpson's paradox [3,4] and Robinson's paradox [5] as well-known examples. In geography, a particular form of such a fallacy is called the Modifiable Areal Unit Problem (MAUP). In the earliest occurrence of MAUP, Gehlke and Biehl [6] showed that the value of the correlation coefficient of geolocalized features was influenced by the size of the spatial units used in their analyses. Openshaw further showed that the results of quantitative spatial models and statistics may depend highly on the size and shape of the basic spatial units used [7]. This problem has been broadly studied and is the object of an extensive literature; see [8] for a review. Conversely, the atomic fallacy can be seen as the bias generated by extrapolating patterns present at the individual level to the level of the group to which those individuals or their geographical entities belong.
In this article, we warn against the danger of atomic and ecological fallacies in the field of community detection, and measure quantitatively the impact of node aggregation on the community structure in networks. We first show that some community detection methods are more suitable than others when computing communities on aggregated networks. Then, we introduce the aggregability index, a quantitative proxy for the robustness of the community structure of a given graph with respect to given node aggregation groups.
Although our main conclusions apply to any optimal node partitioning problem and any node aggregation, we focus on the community structure (in the sense of blocks of tightly connected nodes) of aggregated geolocalized networks for the sake of the example. We illustrate our methodology on a dataset of geolocalized tweets in Belgium, and on a mobile phone dataset from one provider in Belgium. We observe that the community structure of the Twitter dataset is highly sensitive to aggregation, being significantly different at the finest and coarser scales, while the mobile phone dataset offers robust aggregability properties with respect to its community structure.

Some theoretical considerations
Assume we want to detect communities in a weighted, undirected graph $G$, understood as a (nonoverlapping) partition $\mathcal{C}$ of the nodes of $G$. Let us assume that we are also interested in optimizing a certain criterion, capturing structural patterns of interest, typically a high density of edges inside the communities and a low density across communities. Other criteria are also possible: for instance, one may want to detect core-periphery structure or general stochastic block models [9,10,11]. We want to underline here that there is a variety of possible criteria whose relevance is strongly dependent on the network and the application. For instance, some methods integrate a resolution parameter that imposes a preference for small or large communities [12,13]. Some methods based on comparison with a generative model for the graph are highly dependent on the choice of a model [14]. Even more broadly, different goals for community detection may lead to entirely different objective functions [15]. As many of those methods proceed by optimizing a goodness criterion, we talk of "the optimal partition" to denote the communities found to optimize the criterion of interest. We suppose for simplicity that the partition is unique and can be discovered effectively, although in practice most algorithms are only heuristics.
Assume moreover that a coarsened graph $G'$ is obtained from the aggregation of the nodes and edges of $G$, following an aggregating partition $\mathcal{P}$. In other words, if $\mathcal{P}$ partitions the nodes of $G$ into $k$ classes, then $G'$ has $k$ nodes, and the weight of the edge (if any) between node $i$ and node $j$ of $G'$ is the sum of the weights of all edges of $G$ between nodes in class $I$ and class $J$ of the aggregating partition $\mathcal{P}$. In particular, node $i$ of $G'$ has a self-loop aggregating the weight of all the edges inside class $I$, as there can be interactions between different nodes of the same class, giving

$$w'_{ij} = \sum_{u \in I,\, v \in J} w_{uv},$$

where $w_{uv}$ is the weight of the link between nodes $u$ and $v$ in the initial disaggregated network.
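The aggregation rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the experiments: the function name `aggregate`, the edge-dictionary representation and the 4-node example are all assumptions made here for concreteness.

```python
from collections import defaultdict

def aggregate(edges, group_of):
    """Sum edge weights between aggregation classes.

    edges:    dict mapping an undirected pair (u, v) to its weight w_uv
    group_of: dict mapping each node to its class in the aggregating partition
    Returns a dict mapping a sorted class pair (I, J) to the aggregated
    weight; pairs with I == J are the self-loops of the coarse graph.
    """
    agg = defaultdict(float)
    for (u, v), w in edges.items():
        I, J = sorted((group_of[u], group_of[v]))
        agg[(I, J)] += w
    return dict(agg)

# Hypothetical 4-node example: a path a-b-c-d plus a chord a-c.
edges = {("a", "b"): 1.0, ("b", "c"): 2.0, ("c", "d"): 1.0, ("a", "c"): 0.5}
group_of = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
coarse = aggregate(edges, group_of)
```

Here the edge a-b becomes a self-loop on class X, the edge c-d a self-loop on class Y, and the two edges crossing the classes are merged into a single weighted edge, so the total weight of the network is conserved.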
In general, we want to understand the relationship between the communities of $G$ and $G'$. A specific case of interest is when we want to know the communities of $G$ yet we only have access to $G'$. Clearly the best scenario is when the aggregation classes are subsets of the optimal communities in $G$. In this way, the aggregation transforms the optimal community partition $\mathcal{C}$ of $G$ into a (possibly non-optimal) community partition $\mathcal{C}'$ of $G'$. Assuming the knowledge of the community partition $\mathcal{C}'$, it is then possible to recover $\mathcal{C}$ by de-aggregation, i.e. by replacing every node in $G'$ by its aggregation class in $G$. In other words, if $f_P : \mathrm{Nodes}(G) \to \mathrm{Nodes}(G')$ is the aggregation function relating every node of the original graph $G$ to its corresponding node in the aggregated graph $G'$, then every community $C$ in $\mathcal{C}$ is of the form $C = f_P^{-1}(C')$ for some community $C'$ in $G'$. If, moreover, the community partition $\mathcal{C}'$ is also optimal in $G'$, then we have a natural way to recover the community structure of the original $G$: first compute $\mathcal{C}'$, then de-aggregate it to $\mathcal{C}$. However, whether $\mathcal{C}'$ is optimal for $G'$ depends on the definition of communities.
This can be guaranteed if the objective function, evaluated on a given graph $G$ and incumbent community partition $\mathcal{C}$, depends only on the graph obtained by aggregating $G$ with respect to $\mathcal{C}$. In other words, we require that the objective function depends only on the total weight of all edges between any pair of communities (including from a community to itself), but not on the way those links are distributed inside a community or between communities. We call such a function an edge-counting function.
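This natural result is proved simply. Since $G'$ is obtained from $G$ by aggregation with respect to a partition $\mathcal{P}$, and the partition $\mathcal{C}$ is coarser than $\mathcal{P}$, the aggregation of $G$ with respect to $\mathcal{C}$ coincides with the aggregation of $G'$ with respect to $\mathcal{C}'$. Therefore, the edge-counting objective function takes the same value for $(G, \mathcal{C})$ and $(G', \mathcal{C}')$. The same holds for any partition of $G'$ and its de-aggregation to $G$, so a partition of $G'$ better than $\mathcal{C}'$ would de-aggregate to a partition of $G$ better than $\mathcal{C}$. Hence, if $\mathcal{C}$ is optimal for $G$, then so is $\mathcal{C}'$ for $G'$.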
Despite its simplicity, this first result suggests that some methods in the literature are more appropriate than others in the presence of node aggregation. Such edge-counting criteria include modularity [16], Potts models [17], linearized partition stability [13], Infomap [18], conductance [19], Normalized Cuts [20], and their natural extensions to weighted graphs.
On the other hand, methods based on counting paths rather than edges, and therefore depending on the way edges are distributed and not only the number or total weights, such as Markov clustering [21], Walktrap [22], partition stability [13], etc., should be used with the greatest caution in case of aggregated data.
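The invariance of an edge-counting criterion under a community-respecting aggregation can be checked numerically. The sketch below uses weighted modularity as the edge-counting function, written directly in terms of the total internal weight and total strength of each community. The helper names and the small two-triangle graph are hypothetical choices made for illustration.

```python
from collections import defaultdict

def aggregate(edges, group_of):
    """Sum edge weights between aggregation classes (self-loops for I == J)."""
    agg = defaultdict(float)
    for (u, v), w in edges.items():
        I, J = sorted((group_of[u], group_of[v]))
        agg[(I, J)] += w
    return dict(agg)

def modularity(edges, community_of):
    """Weighted modularity from community-level totals only (edge-counting)."""
    m = sum(edges.values())
    internal = defaultdict(float)   # weight of edges inside each community
    strength = defaultdict(float)   # total weighted degree of each community
    for (u, v), w in edges.items():
        cu, cv = community_of[u], community_of[v]
        strength[cu] += w
        strength[cv] += w           # a self-loop adds 2w to its community
        if cu == cv:
            internal[cu] += w
    return sum(internal[c] / m - (strength[c] / (2 * m)) ** 2 for c in strength)

# Two triangles {1,2,3} and {4,5,6} joined by the edge 3-4, unit weights.
edges = {(1, 2): 1.0, (1, 3): 1.0, (2, 3): 1.0,
         (4, 5): 1.0, (4, 6): 1.0, (5, 6): 1.0, (3, 4): 1.0}
fine_communities = {1: "left", 2: "left", 3: "left",
                    4: "right", 5: "right", 6: "right"}

# Aggregation classes that are subsets of the communities.
group_of = {1: "A", 2: "A", 3: "B", 4: "C", 5: "D", 6: "D"}
coarse_edges = aggregate(edges, group_of)
coarse_communities = {"A": "left", "B": "left", "C": "right", "D": "right"}

q_fine = modularity(edges, fine_communities)
q_coarse = modularity(coarse_edges, coarse_communities)
```

Both values coincide because the function only ever sees the total weight inside and between communities, which the aggregation leaves unchanged. A path-counting criterion would not pass this check in general.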
However, even an edge-counting objective function cannot preserve the community structure in the context of arbitrary aggregation classes. Assume for instance, that aggregation classes are chosen randomly, every node being attributed with uniform probability to one of the classes. Then, it is reasonable to assume that the aggregated graph will behave like a complete graph with all edges of similar weight, thus exhibiting no structure, or communities created only by small random fluctuations in the weights, retaining no information from the optimal communities of G. One can also generate examples where well chosen classes generate a graph with entirely different, yet statistically significant, community structure. See Fig. 1 for an illustration on a toy 4-node network, where different aggregations induce different community structures on the fine-scale network, that may or may not coincide with the community structure computed directly on the fine-scale network.
A more general example is built with the Kronecker product of an $n_1$-node graph $G_1$ and an $n_2$-node graph $G_2$. In the product graph, whose node set is the Cartesian product of the two individual node sets, a node $(i, j)$ is connected to the node $(i', j')$ if $i$ and $i'$ are neighbours in $G_1$, and $j$ and $j'$ are neighbours in $G_2$. If the graphs are weighted, then the weight of an edge in the product graph is simply the product of the weights of the corresponding edges in $G_1$ and $G_2$. The product graph can be aggregated in two natural ways, one that retrieves $G_1$ as aggregated graph, and another that retrieves $G_2$. Assume that the fine-grained network is the product graph of $G_1$ and $G_2$. Both aggregated graphs $G_1$ and $G_2$ may have a significant community structure, so community detection on both aggregations will provide interesting, distinct insights into the underlying fine-grained network.
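The two natural aggregations of a Kronecker product graph can be sketched as follows. This is an illustrative example on hypothetical small adjacency matrices. Note that each aggregation retrieves the corresponding factor graph only up to a global scaling of the weights, equal to the sum of the entries of the other factor's adjacency matrix.

```python
def kron(A1, A2):
    """Kronecker product of two adjacency matrices (lists of lists).

    Node (i, j) of the product graph is stored at index i * n2 + j.
    """
    n1, n2 = len(A1), len(A2)
    return [[A1[i][k] * A2[j][l] for k in range(n1) for l in range(n2)]
            for i in range(n1) for j in range(n2)]

def aggregate_matrix(A, group_of, k):
    """Sum the entries of A between each pair of the k aggregation classes."""
    B = [[0] * k for _ in range(k)]
    for u, row in enumerate(A):
        for v, w in enumerate(row):
            B[group_of[u]][group_of[v]] += w
    return B

# Hypothetical factors: a 3-node path and a single weighted edge.
A1 = [[0, 1, 0],
      [1, 0, 1],
      [0, 1, 0]]
A2 = [[0, 3],
      [3, 0]]
n1, n2 = len(A1), len(A2)
product = kron(A1, A2)

# Group product nodes by their first coordinate, then by their second.
by_first = aggregate_matrix(product, [u // n2 for u in range(n1 * n2)], n1)
by_second = aggregate_matrix(product, [u % n2 for u in range(n1 * n2)], n2)
```

Since a global rescaling of the weights leaves criteria such as modularity unchanged, the communities found on each aggregation are those of the corresponding factor graph.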
A real-life analogy would involve, for instance, aggregating a social network according either to geographical location, or to age class: both may exhibit relevant community structures which can be lifted back to the social network, revealing for instance communities of people of diverse ages who tend to live in close areas, or communities of similar ages living across the country. Both community partitions offer interesting insights on the network, and at least one of the two differs from the communities found directly on the social network.
Figure 1: Community detection over two examples of aggregations of the same 4-node network. Self-loops in aggregated networks are omitted for clarity. We assume that the community detection criterion is such that each aggregated network admits the trivial two-community partition as optimal. The community structures on the aggregated networks lift to two possible partitions of the 4-node network. The community structure of the 4-node network could coincide with either of the two, or with the 4-community partition, according to the respective weights of the edges. On the depicted example, it may coincide with the same-colour communities.

Between the two extremes of identical or completely different community structures, one finds situations where the aggregating partition is more or less related to the optimal communities in $G$, and therefore where node aggregation is expected to perturb the community detection to a greater or lesser degree.
We propose a metric that captures to what extent node aggregation preserves community detection: the aggregability index, $\eta$, defined as the fraction of the information required to identify the community of a randomly chosen node that is provided by the knowledge of its aggregation group:

$$\eta = \frac{I(\mathcal{C}; \mathcal{P})}{H(\mathcal{C})}.$$

Here $H(\mathcal{C})$ is the Shannon entropy of the community partition, computed in the following way. As a thought experiment, pick a node uniformly at random in $G$. The community of the node is a random variable with Shannon entropy $H(\mathcal{C}) = -\sum_{C \in \mathcal{C}} P(C) \log P(C)$, with the probability $P(C)$ of a community $C$ being proportional to its number of nodes. Similarly, $I(\mathcal{C}; \mathcal{P})$ is the Shannon mutual information between the community in the partition $\mathcal{C}$ and the aggregation group in $\mathcal{P}$ of a randomly picked node of $G$. Our newly defined aggregability index, $\eta$, ranges from 0 to 1. In the $\eta = 0$ limit, the aggregation groups are independent of the communities, which implies in particular that each node is aggregated with nodes from other communities. In the $\eta = 1$ limit, the aggregation groups are subsets of the communities, thus any edge-counting criterion will preserve the community structure.
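As a minimal sketch, the aggregability index can be computed from two label lists, one giving the community and one giving the aggregation group of each node. The function names and the 8-node example are assumptions made here for illustration. The mutual information is obtained from the identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the label of a node picked uniformly at random."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two labelings of the same nodes."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def aggregability(communities, groups):
    """Fraction of the community information carried by the aggregation group.

    Assumes at least two communities, so that the entropy is non-zero.
    """
    return mutual_information(communities, groups) / entropy(communities)

# Hypothetical example: 8 nodes in two communities of 4 nodes each.
communities = [0, 0, 0, 0, 1, 1, 1, 1]
nested_groups = [0, 0, 1, 1, 2, 2, 3, 3]     # each group inside one community
crossing_groups = [0, 1, 0, 1, 0, 1, 0, 1]   # each group split evenly

eta_nested = aggregability(communities, nested_groups)
eta_crossing = aggregability(communities, crossing_groups)
```

The nested groups sit at the $\eta = 1$ limit, whereas the crossing groups carry no information about the communities and sit at the $\eta = 0$ limit.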
In the next sections we show empirically how the aggregability index correlates with the actual distortion in the optimal communities found for the original and aggregated networks, on two datasets that, albeit embedded in the same geographical area (Belgium), reveal different behaviors with respect to aggregation. In both cases we know a network $G$, aggregate it according to administrative units or regular squares, compute the aggregability index and observe the distortion of the communities found to be optimal in the new networks.

Methods
We now describe the datasets, the definition of community and the way to compare partitions in an empirical approach. An explanation about the territory where both datasets were taken, with a visual illustration, is given in section SA.1. of the Supplementary Information.

Twitter networks
Our first dataset is composed of 291,552 tweets (short messages) between 18,327 on-line users collected from Twitter, obtained as described in Supplementary Information SA.2. From this network, called $N_0$, we created a list of aggregated networks. The territory of Belgium is divided into 589 municipalities, and used to be divided into 2,675 smaller municipalities until a merge took place in 1979. We first build two aggregated versions, where nodes represent former ($N_{fm}$) and current ($N_m$) municipalities, respectively. We also applied a regular grid of 125 m square cells (resulting in the aggregated network $N_{125}$), and increasingly coarser square grids of cell size 250 m to 32 km, corresponding to networks $N_{250}$ to $N_{32k}$ respectively. The numbers of nodes and links are described in Table 1 of the Supplementary Information (SA.3.).

Phone networks
Our second dataset counts the number of phone calls between towers in the territory of Brabant, a former administrative unit of 111 municipalities including and surrounding Brussels, the capital of Belgium. The derived undirected network, called $M_t$, is composed of 1,168 nodes (towers), with an edge between two towers counting the number of communications between the towers in either direction, for a total of 13M communications over the network. Further aggregated networks were derived similarly to the Twitter dataset, as described in Table 2.

Linearized stability maximization
Communities are intuitively meant here as groups of strongly interconnected nodes with comparatively few connections between the groups. Among the very many formalizations of this concept, one of the most popular is modularity [23], quantifying the goodness of a given partition $P$ of nodes as

$$Q_P = \frac{1}{2m} \sum_{C \in P} \sum_{i,j \in C} \left[ A_{ij} - \frac{k_i k_j}{2m} \right],$$

where $m$ is the sum of all weights of the network's edges, $k_i$ represents the degree of node $i$, $A_{ij}$ is the weighted adjacency matrix of the network, and $C \in P$ represents a community of the partition. We use a generalization, called linearized partition stability [13], or equivalently Potts model [17], which introduces a resolution parameter $\rho$ varying from 0 to $\infty$ as follows:

$$r_{\mathrm{lin}}(\rho, P) = (1 - \rho) + \mathrm{trace}\left[ P^T \left( \rho \frac{A}{2m} - \pi^T \pi \right) P \right],$$

where $A$ is the weighted adjacency matrix, $P_{ij}$ is equal to 1 if node $i$ is in community $j$ in partition $P$ and 0 otherwise, and $\pi$ is defined as the vector of normalized nodes' weighted strengths, that is, $\pi = \mathbf{1}^T A / 2m$. Here $\mathbf{1}$ is the $N \times 1$ vector of ones and $m$ is the total weight of all edges of the network. At $\rho = 0$, single nodes are optimal as communities, while partitions with larger communities emerge for increasing values of $\rho$, until a single community is optimal at $\rho \to \infty$. For $\rho = 1$, the linearized stability is the modularity, $r_{\mathrm{lin}}(1, P) = Q_P$. The resolution parameter $\rho$ is hereafter called timescale, because linearized stability is formally derived in [13] as capturing the ability of incumbent communities to retain the flow of a diffusion of random walkers across the network for a timescale of the order of $\rho$. Like most community detection criteria, linearized stability is NP-hard to optimize except for extreme values of $\rho$, and we use the Louvain method [24,26] as a heuristic.
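The role of the timescale can be illustrated by evaluating the criterion directly on a small graph. The sketch below assumes the usual form of linearized stability, $(1 - \rho) + \sum_{C} \sum_{i,j \in C} [\rho A_{ij}/2m - \pi_i \pi_j]$, which reduces to modularity at $\rho = 1$. It only scores given partitions and does not perform the Louvain optimization. The function name and the two-triangle graph are hypothetical.

```python
def linearized_stability(A, labels, rho):
    """Linearized stability of a partition (assumed form, see lead-in).

    A:      symmetric weighted adjacency matrix as a list of lists
    labels: labels[i] is the community of node i
    rho:    timescale (resolution) parameter
    """
    n = len(A)
    two_m = sum(sum(row) for row in A)                    # equals 2m
    pi = [sum(A[i][j] for i in range(n)) / two_m for j in range(n)]
    total = 1.0 - rho
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                total += rho * A[i][j] / two_m - pi[i] * pi[j]
    return total

# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3, unit weights.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
singletons = [0, 1, 2, 3, 4, 5]
triangles = [0, 0, 0, 1, 1, 1]
one_block = [0, 0, 0, 0, 0, 0]

scores = {rho: {name: linearized_stability(A, labels, rho)
                for name, labels in [("singletons", singletons),
                                     ("triangles", triangles),
                                     ("one_block", one_block)]}
          for rho in (0.0, 1.0, 10.0)}
```

Among these three candidate partitions, the singletons score best at $\rho = 0$, the two triangles at $\rho = 1$, and the single block at $\rho = 10$, in line with the behaviour described above.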

Normalized mutual information for comparing partitions
To evaluate how similar two partitions $\mathcal{C}$ and $\mathcal{D}$ of the same set of nodes are, we compute the normalized mutual information [27] between the two partitions, as

$$\mathrm{NMI}(\mathcal{C}, \mathcal{D}) = \frac{2\, I(\mathcal{C}; \mathcal{D})}{H(\mathcal{C}) + H(\mathcal{D})},$$

where $I(\mathcal{C}; \mathcal{D})$ denotes the mutual information between the two partitions, i.e. between the community in $\mathcal{C}$ and the community in $\mathcal{D}$ of a randomly picked node of the graph. Similarly, $H(\mathcal{X})$ denotes the Shannon entropy of each partition, i.e. the Shannon entropy of the community of a randomly picked node of the graph. The NMI takes values between 0, for independent (thus maximally dissimilar) partitions, and 1, for identical partitions.
In our case, we also want to be able to compare community partitions at different levels of aggregation, say the optimal partitions $\mathcal{C}$ and $\mathcal{D}$ of networks $N_0$ and $N_{125}$, respectively. In this case, we lift the communities of $N_{125}$ into communities of $N_0$, replacing each node of $N_{125}$ by its aggregation group in $N_0$. We call $\mathcal{D}'$ this partition of the nodes of $N_0$. We now compare the two partitions $\mathcal{C}$ and $\mathcal{D}'$ with the quantity $\mathrm{NMI}(\mathcal{C}, \mathcal{D}')$, which we will also denote $\mathrm{NMI}(\mathcal{C}, \mathcal{D})$ by abuse of notation.
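The lifting step and the comparison can be sketched as follows. The normalization used here, twice the mutual information divided by the sum of the two entropies, is an assumption about the exact variant of NMI. The function names and the 6-node example are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the label of a node picked uniformly at random."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def nmi(x, y):
    """Normalized mutual information, assuming the form 2 I / (H(x) + H(y)).

    Assumes at least one of the two partitions has more than one community.
    """
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))
    return 2 * (hx + hy - hxy) / (hx + hy)

def lift(coarse_communities, group_of):
    """Give each fine node the community of its aggregation group."""
    return [coarse_communities[g] for g in group_of]

# Hypothetical example: 6 fine nodes aggregated in 3 groups of 2 nodes.
group_of = [0, 0, 1, 1, 2, 2]            # aggregation group of each fine node
fine_partition = [0, 0, 0, 1, 1, 1]      # communities found on the fine network
coarse_partition = {0: "a", 1: "a", 2: "b"}   # communities of the coarse nodes

lifted = lift(coarse_partition, group_of)
score = nmi(fine_partition, lifted)
```

In this example the middle aggregation group straddles the two fine communities, so the lifted partition cannot match the fine one exactly and the score lies strictly between 0 and 1.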

Results
In the following we show how the aggregation process over the Twitter and phone call networks strongly affects the community partition in the former case, and mildly so in the latter. We also show how the magnitude of this distortion, as the aggregation grid becomes coarser and coarser, correlates well with the proposed aggregability index. Figure 2-a shows the communities extracted from the network $N_m$ of municipalities, using a timescale $\rho = 1$. Each of Figures 2-b to 2-f shows the spatial footprint of one community of individual Twitter users. We have used a timescale $\rho = 10$, in order to illustrate the case with the number of communities most similar to the $N_m$ network. We obtained 5 communities, comparable with the 7 in the network of municipalities (Fig. 2-a). The color intensity in each municipality represents the proportion of users belonging to the community being represented.

Twitter networks
Some communities of $N_0$ (for example those represented in Figures 2-b and 2-c) show a remarkable geographical dispersion, and in particular do not seem to match any communities of $N_m$. In order to analyse quantitatively the effect of aggregating data, we systematically test different levels of spatial aggregation, all at the same timescale parameter $\rho = 1$. Figure 3 shows communities at different levels of aggregation: municipalities, former (smaller) municipalities and square cells of size 1 km, 2 km, 4 km and 8 km. As the aggregation groups become larger and larger, they straddle several communities, forcing a rearrangement of the communities and resulting in another partition.
We can see that as the areas are increasingly aggregated, some communities gathering distant places, such as the light green community having people in separated provinces in Fig. 3-a to 3-c, are re-arranged into geographically close communities (light green in Fig. 3-f). One may surmise that as the aggregation groups grow, communities depend more and more on the groups with the largest numbers of members, which absorb small isolated communities. The physical space where no event has been recorded is depicted in white. At the finest levels, users are represented as a single point (their average position), therefore virtually all space is white. As the aggregation scale increases, the white space is progressively removed, being merged with neighbouring space with non-zero activity. We observe that this effect is more visible in areas with low levels of activity, such as the southern part of the country.
The normalized mutual information (NMI) between the disaggregated network $N_0$ and several aggregated networks is depicted in Figure 5. Starting from the first point (125 m), with little aggregation, we observe that the NMI already drops rather steeply, even though there is some fit (NMI ≈ 0.7) between the communities displayed by aggregated units of 125 m and the non-aggregated ones. Values of NMI continue to decrease with the size of the aggregation.

Mobile phone networks
The mobile phone call dataset allows us to examine another type of communication dataset, differing not only in the geographical area (Brussels and surroundings, rather than the whole of Belgium) but also in the nature of the technical medium, possibly inducing different social behaviour. The difference also comes from the format of the data, where the nodes in the finest network represent towers rather than individual users. The timescale parameter for this study is set to $\rho = 0.75$, as suggested by another study on the same dataset [30]. We observe that the similarity between the communities found on the two levels of aggregation is higher than the similarity observed in the Twitter network between the disaggregated network of users ($N_0$) and the aggregated versions (see Fig. 5). In Fig. 5 we also notice that the NMI between the communities found on $M_t$ and on versions aggregated with larger and larger cells is consistently higher than in the case of the Twitter dataset.

The aggregability index
For both datasets we compare, in Fig. 5, the results of community detection on networks of square cells of sides 125 m, 250 m, 500 m, 1 km, 2 km, 4 km, 8 km, 16 km and 32 km, along with the aggregability indices for the same networks.
The aggregability index, $\eta$, requires the knowledge of the optimal communities at the finest level and of the aggregation groups, but not of the optimal communities of the aggregated graph, and measures to what extent every aggregating group is a subset of a community. Therefore, low values of $\eta$ can be seen as a warning signal that communities on the aggregated network (once lifted to the original network) will be significantly different from the original communities. In Fig. 5 we observe, indeed, that the value of $\eta$ for mobile phone calls stays remarkably steady until the aggregation scale of 1 or 2 km, while the $\eta$ value for the Twitter dataset drops much faster, and so does the NMI between the community partitions at different scales, as expected.

Figure 5: Circles show the evolution of the normalized mutual information, NMI, between communities found in the network prior to aggregation and communities found in aggregated networks at several square sizes. Squares show the evolution of the aggregability index, $\eta$, between the communities at the finest level and the aggregation groups, for the same square sizes. For Twitter data (in blue), the initial level corresponds to user centroids and the timescale was kept at $\rho = 1$. For mobile phone data (in pink), the initial level corresponds to cell towers and the timescale was kept constant at $\rho = 0.75$.

Discussion
In this paper, we have studied the impact of data aggregation on community detection in networks.
We have shown that data aggregation can preserve the community structure, destroy it, or reveal another relevant community structure. In our empirical illustrations, we have addressed specifically the case of the spatial aggregation of nodes with geographical coordinates, in line with the well-known Modifiable Areal Unit Problem in geography. However, aggregation may be performed according to other meta-data, for instance age, obtaining a network of communication between the different age groups. This may be relevant, for instance, when evaluating the possibility of transmission of a disease with an age-dependent transmission probability. One may further refine those sets with gender, etc. Likewise, a partition of a social network according to socio-economic level may be used to assess social mobility and social integration. In each case, the corresponding community structure will take on a particular relevance, distinct from the meaning of the community structure found in the disaggregated data. Aggregation may also occur due to technical, ethical or legal limitations, in which case the change of community structure may be seen as an unwanted distortion.
We have identified a specific class of community detection criteria, the edge-counting criteria, that are especially suitable to avoid needless distortion in the presence of an aggregation that respects the community structure of the disaggregated data.
We have shown that the relationship between the community structure and the aggregation partitioning (quantified by the aggregability index) is a good predictor of the difference between the community structures on the original and aggregated networks (measured by the NMI). Therefore, this measure can be performed over non-aggregated networks in order to know, in advance, to what extent a given aggregation may destroy the original community structure on the data. However, let us emphasize again that a low aggregability index is not necessarily unwanted, depending on whether the aggregation is deliberate or imposed.
As to explaining why the two datasets behave differently with respect to the same aggregation strategy, one can only formulate hypotheses. While mobile phone calls generally presuppose a previous social interaction, this constraint is absent, or present to a lesser extent, in the Twitter dataset. Previous works have shown the correlation between mobile phone interactions and geographic proximity [28,29], while in the present study we observe that communities of Twitter users tend to be geographically diffuse.
Further differences between the datasets include the heterogeneous density of events in the Twitter network, resulting in a different effect of the white space aggregation, and the different geographic area (Belgium or the surroundings of Brussels). Even more importantly, the mobile phone dataset has towers as the finest scale, which already aggregate a large number of users. On the Twitter dataset, we can observe that the geographically diffuse communities appear especially at the lower aggregation scales, while from 1-2 km upwards the communities are more geographically localized, and remain more stable under aggregation.
In conclusion, the pre-aggregation of data seems to have the side-effect of promoting communities that are geographically localized and stable under further aggregation. It is therefore possible that the robustness and stability of the present mobile phone dataset is due to the pre-aggregation at the tower level. In this sense, the geographically localized communities observed in mobile phone data or other relational data in the literature are possibly partly connected to the pre-aggregation bias, if present [31,32,33,34].
Nowadays, many datasets need to be aggregated prior to data sharing, among other things for privacy reasons. Therefore, the studied datasets are only accessed by researchers after some kind of aggregation has occurred, sometimes with little control over how the aggregation is performed. Results of community detection, as we have shown in this paper, or indeed of any complex network analysis, should therefore be interpreted with caution, bearing in mind that they may be strongly influenced by the format of the data that was shared.