The previous Sections 3 and 4 showed how the privacy-by-design methodology can be applied to guarantee individual privacy in a setting with a central trusted aggregation center that collects data and, before releasing it, applies a privacy transformation strategy to enable collective analyses in a privacy-aware fashion.

However, the privacy-by-design paradigm can also be applied successfully to distributed analytical systems, where an untrusted central station collects aggregate statistics computed by individual nodes, each of which observes a stream of data. In this section we discuss an instance of this case [22]; in particular, we show how the privacy-by-design methodology can help in the design of a privacy-aware distributed analytical processing framework for the aggregation of movement data. The data collector nodes are on-board location devices in vehicles that continuously trace the vehicles' positions and periodically send statistical information about their movements to a central station. The central station, which we call *coordinator*, stores the received statistical information and computes a summary of the traffic conditions of the whole territory, based on the information collected from the data collectors.

We show how privacy can be enforced before the data leaves the users, while preserving the utility of analyses performed at the collective level after the transformation. This example brings evidence that the privacy-by-design model has the potential of delivering high data protection combined with high quality even in massively distributed techno-social systems. As discussed in Section 3, the aim of this framework is to provide both *individual* privacy protection, through the differential privacy model, and acceptable *collective* data utility.

### 5.1 State-of-the-art on privacy-preserving distributed data analytics

A privacy model particularly suitable for guaranteeing individual privacy while answering aggregate queries is *differential privacy* [36]. Recently, much attention has been paid to the use of differential privacy for distributed private data analysis. In this setting, *n* parties, each holding some sensitive data, wish to compute aggregate statistics over all parties' data, with or without a centralized coordinator. [37], [38] prove that when computing the sum of all parties' inputs without a central coordinator, any differentially private multi-party protocol with a small number of rounds and a small number of messages must have large error. Rastogi et al. [39] and Chan et al. [40] consider the problem of privately aggregating sums over multiple time periods. Both assume a malicious coordinator and combine encryption with differential privacy in the design of their privacy-preserving data aggregation methods. Compared with their work, we focus on a semi-honest coordinator, with the aim of designing privacy-preserving techniques that add meaningful noise to improve data utility. Furthermore, both [39], [40] consider aggregate-sum queries as the main utility function, whereas we consider a network-flow-based analysis of the collected data. Different utility models lead to different designs of privacy-preserving techniques. Our method can be further hardened against a malicious coordinator by applying the encryption methods of [39], [40].

### 5.2 Attack and privacy model

As in the case analyzed in Section 3, we consider as sensitive information any data from which the typical mobility behavior of a user may be inferred. This information is considered sensitive for two main reasons: (1) typical movements can be used to identify the drivers of specific vehicles even when a simple de-identification of the individuals in the system is applied; and (2) the places visited by a driver could reveal particularly sensitive locations, such as clinics and hospitals, and routine locations such as the user's home and workplace.

The assumption is that each node in the system is honest; in other words, attacks at the node level are not considered. Instead, potential attacks come from any intruder between the node and the coordinator (i.e., attacks during the communications) and from any intruder at the coordinator site, so the privacy-preserving technique has to guarantee privacy even against malicious behavior of the coordinator. For example, the coordinator may be able to obtain real mobility statistics from other sources, such as public datasets on the web, or through personal knowledge about a specific participant, as in the previously (and widely) discussed linking attack.

The solution proposed in [22] is based on *differential privacy*, a model of randomization introduced by Dwork in [41]. The general idea of this paradigm is that the privacy risk for a respondent should not increase as a result of appearing in a statistical database; differential privacy ensures, in fact, that the ability of an adversary to inflict harm is essentially the same, independently of whether any individual opts in to, or opts out of, the dataset. This privacy model is called *ε*-differential privacy, after the level of privacy *ε* it guarantees. Note that when *ε* tends to 1, very little perturbation is introduced, which yields low privacy protection; on the contrary, better privacy guarantees are obtained when *ε* tends to zero. Differential privacy assures a record owner that any privacy breach will not be a result of participating in the database, since anything, or almost anything, that is learnable from the database with his record is also learnable from the one without his data. Moreover, [41] formally proves that *ε*-differential privacy provides a guarantee against adversaries with arbitrary background knowledge; thus, in this case we do not need to define any explicit background knowledge for the attackers.

Here, we do not provide the formal definition of this paradigm; we only point out that the differential privacy mechanism works by adding appropriately chosen random noise (from a specific distribution) to the true answer, and then returning the perturbed answer. A slight variant of this model is (*ε*, *δ*)-differential privacy, where the noise is bounded at the cost of introducing a privacy loss *δ*. A key notion used by differential privacy mechanisms is the *sensitivity* of a query, which provides a way to calibrate the magnitude of the noise distribution to the type of query. The sensitivity measures the maximum distance between the results of the same query executed on two close datasets, i.e., datasets differing in one single element (either a user or an event). As an example, consider a count query on a medical dataset, which returns the number of patients having a particular disease. The result of the query performed on two close datasets, i.e., datasets differing in exactly one patient, can change at most by 1; thus, in this case (and, more generally, for count queries), the sensitivity is 1.
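The Laplace mechanism for a count query can be sketched in a few lines of Python. This is a minimal illustration of the general principle, not the implementation of [22]; the function names are hypothetical, and the Laplace draw uses the standard identity that the difference of two i.i.d. exponential variables is Laplace-distributed.

```python
import random

def laplace_noise(scale):
    """Sample zero-mean Laplace noise with the given scale.

    The difference of two i.i.d. exponential draws with rate 1/scale
    is Laplace(0, scale)-distributed.
    """
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(true_count, epsilon, sensitivity=1.0):
    """Laplace mechanism: return the true answer plus noise whose scale
    is calibrated as sensitivity / epsilon (sensitivity 1 for count queries)."""
    return true_count + laplace_noise(sensitivity / epsilon)
```

A smaller *ε* enlarges the noise scale, trading accuracy for privacy, which matches the behavior of *ε* discussed above.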

The questions are: *How can we hide the event that the user moved from a location* a *to a location* b *in a time interval* τ*? And how can we hide the real count of moves in that time window?* In other words, *how can we enable collective movement data aggregation for mobility analysis while guaranteeing individual privacy protection?* The solution reported here is based on (*ε*, *δ*)-differential privacy, and provides a good balance between privacy and data utility.

### 5.3 Privacy-preserving technique

First of all, all participants must share a common partition of the examined territory; for this purpose, it is possible to use an existing division of the territory (e.g., census sectors, road segments, etc.) or to determine a data-driven partition such as the Voronoi tessellation introduced in Section 3.3. Once the partition is shared, each trajectory is generalized as a sequence of crossed areas (i.e., a sequence of movements). For convenience, this information is mapped onto a *frequency vector* linked to the partition. Unfortunately, releasing the frequencies of moves instead of raw trajectory data to the coordinator is not privacy-preserving, as an intruder may still infer the sensitive typical movements of the driver. As an example, the attacker could learn the driver's most frequent move; this information can be very sensitive, because such a move usually corresponds to the user's trips between home and workplace. Thus, the proposed solution relies on the differential privacy mechanism, using a Laplace distribution [36]. At the end of the predefined time interval *τ*, before sending the frequency vector to the coordinator, the node extracts, for each element of the vector, noise from the Laplace distribution and adds it to the original value in that position. At the end of this step the node V_j has transformed its frequency vector f_{V_j} into its private version \tilde{f}_{V_j}. This ensures *ε*-differential privacy. This simple general strategy has two drawbacks: first, the added noise, although with small probability, can be arbitrarily large; second, noise drawn from the Laplace distribution could generate negative frequency counts of moves, which do not make sense in mobility scenarios. To fix these two problems, it is possible to bound the noise drawn from the Laplace distribution, leading to an (*ε*, *δ*)-differential privacy schema.
In particular, for each value x of the vector f_{V_j}, it is possible to bound the drawn noise to the interval [-x, x]. In other words, for any original frequency f_{V_j}[i] = x, its perturbed version after adding noise lies in the interval [0, 2x]. This approach satisfies (*ε*, *δ*)-differential privacy, where *δ* measures the privacy loss. Note that, since in a distributed environment a crucial problem is the communication overhead, it is possible to reduce the amount of transmitted information, i.e., the size of the frequency vectors. A possible solution to this problem is reported in [22]; since it is beyond the purpose of the current paper, we omit the details.
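The node-side pipeline (generalized trajectory → frequency vector → bounded perturbation) can be sketched as follows. This is a hypothetical illustration, not the code of [22]: the function names are invented, and the bounding of the noise to [-x, x] is implemented by re-drawing until the sample fits, which is one simple way to realize the bound described above.

```python
import random
from collections import Counter

def laplace_noise(scale):
    # difference of two i.i.d. exponential draws is Laplace-distributed
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def frequency_vector(zones_crossed, movements):
    """Map a generalized trajectory (sequence of crossed areas) onto the
    shared frequency vector: one counter per possible movement (zone pair)."""
    counts = Counter(zip(zones_crossed, zones_crossed[1:]))
    return [counts.get(move, 0) for move in movements]

def perturb(freq_vector, epsilon):
    """For each entry x, add Laplace noise bounded to [-x, x], so the
    perturbed frequency stays in [0, 2x] (never negative)."""
    private = []
    for x in freq_vector:
        if x == 0:
            private.append(0)  # the bound [-0, 0] forces zero noise
            continue
        noise = laplace_noise(1.0 / epsilon)
        while abs(noise) > x:  # re-draw until the sample fits the bound
            noise = laplace_noise(1.0 / epsilon)
        private.append(x + noise)
    return private
```

By construction every perturbed count is non-negative and at most twice the original, which addresses both drawbacks mentioned above at the cost of the privacy loss *δ*.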

### 5.4 Analytical quality

So far we have presented the formal guarantees of individual privacy preservation, but it remains to show whether the individually transformed values are still useful once they are collected and aggregated by the coordinator, i.e., whether they are still suitable for analysis at the collective level. In the proposed framework, the coordinator collects the perturbed frequency vectors from all the vehicles in the time interval *τ* and sums them movement by movement. This yields the global frequency vector, which represents the flow value for each link of the spatial tessellation. Since the privacy transformation operates on the entries of the frequency vectors, and hence on the flows, we compare two measures before and after the transformation: (1) the *Flow per Link*, i.e., the directed volume of traffic between two adjacent zones; and (2) the *Flow per Zone*, i.e., the sum of the incoming and outgoing flows of a zone. The following results refer to the application of this technique to a large dataset of GPS vehicle traces, collected from 1st May to 31st May 2011 in the geographical area around Pisa, in central Italy. It comprises around 4,200 vehicles, generating around 15,700 trips. The interval *τ* is one day, so at the end of each day the global frequency vector represents the sum of all the trajectories crossing each link.
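The coordinator-side aggregation and the two measures can be sketched as below; this is an illustrative reading of the summation described above, with invented helper names, not the evaluated implementation.

```python
def aggregate(perturbed_vectors):
    """Sum the perturbed frequency vectors received in the time interval tau,
    movement by movement, into the global frequency vector (Flow per Link)."""
    global_flows = [0.0] * len(perturbed_vectors[0])
    for vector in perturbed_vectors:
        for i, value in enumerate(vector):
            global_flows[i] += value
    return global_flows

def flow_per_zone(global_flows, movements):
    """Flow per Zone: sum of the incoming and outgoing flows of each zone,
    where movements[i] is the (origin, destination) pair of the i-th link."""
    zone_flow = {}
    for (origin, destination), flow in zip(movements, global_flows):
        zone_flow[origin] = zone_flow.get(origin, 0.0) + flow
        zone_flow[destination] = zone_flow.get(destination, 0.0) + flow
    return zone_flow
```

Since the Laplace noise added by each node is zero-mean, the individual errors tend to cancel out in the sum, which is why the aggregated flows remain close to the true ones.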

Figure 6 shows the resulting Complementary Cumulative Distribution Functions (CCDFs) of the privacy transformations obtained by varying *ε* from 0.9 to 0.01. Figure 6(left) shows the reconstructed flows per link: for a fixed flow value (*x*), we count the number of links (*y*) that have that flow. Figure 6(right) shows the distribution of the sum of flows passing through each zone: given a flow value (*x*), it shows how many zones (*y*) present that total flow. From the distributions we can notice how the privacy transformation preserves the distribution of the original flows very well, even for the more restrictive values of the parameter *ε*. Also when considering several flows together, such as those incident to a given zone (Figure 6(right)), the distributions are well preserved for all the privacy transformations. These results reveal how a method that *locally* perturbs values can still yield very high utility at the *collective* level.
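A comparison of this kind can be reproduced with a simple complementary cumulative count over the flow values. The sketch below assumes the usual CCDF reading (for each flow value *x*, how many links or zones carry a flow of at least *x*); `ccdf` is an illustrative helper, not part of [22].

```python
def ccdf(flow_values):
    """For each distinct flow value x, count how many links (or zones)
    carry a flow of at least x, as in a CCDF comparison plot."""
    points = sorted(set(flow_values))
    return [(x, sum(1 for v in flow_values if v >= x)) for x in points]
```

Computing `ccdf` on the original and on the perturbed global frequency vectors, and plotting the two curves, gives the kind of before/after comparison shown in Figure 6.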

Qualitatively, Figure 7 shows a visual comparison of the results of the privacy transformation with the original ones. This is an example of two kinds of visual analyses that can be performed on mobility data. Since the global complementary cumulative distribution functions are comparable, we choose a very low value (ε = 0.01) to emphasize the very good quality of the mobility analyses that an analyst can obtain even when the data are transformed with a very low *ε* value, i.e., with stronger privacy protection. In Figure 7(A) and (B), each flow is drawn as an arrow whose thickness is proportional to the volume of trajectories observed on the link. From the figure it is evident that the relevant flows are preserved in the transformed global frequency vector, revealing the major highways and urban centers. Similarly, the *Flow per Zone* is also preserved, as shown in Figure 7(C) and (D), where the flow of each cell is rendered as a circle whose radius is proportional to the difference from the median value of each global frequency vector. The maps allow us to recognize dense areas (red circles, above the median) separated by sparse areas (blue circles, below the median). The high-density traffic zones follow the highways and the major city centers along their routes. The two comparisons above give the intuition that, while the transformation protects individual sensitive information, the utility of the data is preserved.