4.1 Graphical displays for exploratory data analysis
Exploratory data analysis is used to summarize the basic features of the underlying data in a study. Graphical displays of exploratory statistics can describe univariate characteristics, such as the marginal distribution of a single variable (indicating its central tendency, e.g. mean and median, and its skewness and kurtosis) or the dispersion of the variable (indicating its range and quantiles). The most common plots of univariate statistics are histograms and box plots. For bivariate (or multivariate) analysis, exploratory statistics are used to summarize the relationships between different variables, which can be displayed graphically in scatter plots, heat maps and contour plots. Here, we demonstrate how the proposed anonymization methods can be used to produce common graphical outputs useful in exploratory statistical analysis.
The first example is the graphical representation of the distribution of a variable in a histogram. To construct a histogram, the range of a variable is divided into a number of (typically equal-sized) intervals that define the widths of the histogram’s bins, and the frequency of the values falling in each of these intervals defines the height/area of the bin. If the range of a variable is divided into relatively small intervals, or if the variable includes some extreme outlier values, then only one or a few observations might fall into certain intervals. Histogram bars with low counts are potentially disclosive, as one might be able to infer the small range within which a certain individual-level record lies and, with potentially low uncertainty, to estimate its exact value. To ameliorate the risk of disclosure, the stated anonymization techniques are used to generate privacy-preserving histograms. Firstly, we apply the low counts suppression rule. Using this method, any histogram bar that includes fewer counts than an agreed threshold is suppressed from the plot. In panels (ii) of Fig. 1, we show the histograms of variables X and Y after suppressing any bars with fewer than three counts. Secondly, we apply generalization by increasing the width of the intervals that divide up the range of a variable, and redistributing the variable’s values into those new bins (see panels (iii) in Fig. 1). Increasing the width of the intervals decreases the number of bins, but also decreases the probability of having bins with low counts. Any bins that still have sub-threshold counts remain subject to the suppression rule. Thirdly, we apply deterministic anonymization, with the value of \(k=3\) set for the number of nearest neighbours, and produce the histograms of the estimated scaled centroids (see panels (iv) in Fig. 1). Finally, we apply the probabilistic anonymization method by adding to each variable a normally distributed error with standard deviation equal to 0.25σ (where σ is the standard deviation of the original real variable) and generating histograms of the noisy data (see panels (v) in Fig. 1).
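The suppression and generalization rules for histograms can be illustrated with a minimal sketch. It is not the implementation used to produce Fig. 1: the function names, the simulated normal variable and the bin widths of 0.2 and 0.4 are assumptions chosen only for illustration, and the threshold of three follows the text.

```python
import math
import random

def histogram_counts(values, width, origin=0.0):
    """Count values per equal-width bin; bin i covers [origin + i*width, origin + (i+1)*width)."""
    counts = {}
    for v in values:
        i = math.floor((v - origin) / width)
        counts[i] = counts.get(i, 0) + 1
    return counts

def suppress_low_counts(counts, threshold=3):
    """Low counts suppression: keep only bins holding at least `threshold` values."""
    return {i: c for i, c in counts.items() if c >= threshold}

rng = random.Random(1)
x = [rng.gauss(10.0, 1.0) for _ in range(500)]  # simulated data, not the paper's datasets

raw = histogram_counts(x, width=0.2)
suppressed = suppress_low_counts(raw, threshold=3)
# Generalization: double the bin width first, then apply the same suppression rule.
generalized = suppress_low_counts(histogram_counts(x, width=0.4), threshold=3)

lost_narrow = sum(raw.values()) - sum(suppressed.values())
lost_wide = sum(raw.values()) - sum(generalized.values())
print(f"values hidden: {lost_narrow} (width 0.2), {lost_wide} (width 0.4)")
```

Because every narrow bin is nested in a wide bin here, any value that survives suppression with the narrow bins also survives with the wide ones, so widening the bins can only reduce the number of hidden values.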
When the parameters of the anonymization techniques are selected carefully to mitigate the disclosure risk while preserving the shape of the underlying distributions, the generated histograms are both informative and secure. However, an ill-considered choice of parameters can result in a distorted histogram. For example, let us assume that an analyst decides to use suppression with a large value of k (i.e. much greater than 3) to generate a privacy-preserving histogram of a variable with a long-tailed skewed distribution (like the log-normal distribution of variable X in Fig. 1(B)). Then, all the bins in the tail with counts lower than the selected threshold k will be suppressed, and the displayed histogram will appear more symmetrical, obscuring the original asymmetrical shape. The shape can also be distorted when generalization creates unusually wide bins. Note, however, that the bins do not all have to be of the same width. Given how we define our histograms, the area of a bar corresponds to the frequency, and therefore the widths of different bins can be manipulated in a way that preserves the overall shape of the distribution while hiding unique and outlying observations. Note also that in Fig. 1 we display the density of each bin (that is, its frequency divided by its width) rather than the frequency itself.
The other two anonymization techniques, deterministic and probabilistic, generally seem to perform well in generating privacy-preserving histograms. However, avoiding seriously inappropriate values of the control parameters in each specific data setting is particularly important for the probabilistic method, because large values can distort the apparent distribution. One example of such deformation is the histogram of variable Y in panel (v) of Fig. 1(C), which fails to preserve the original bimodal nature of the variable (see the histogram in red in panel (i) of Fig. 1(C)). In that example, a lower level of noise (i.e. noise with smaller variability) might be able to retain both the data utility and the individual privacy. Another disadvantage of these two methods is that they can both, in principle, generate anonymized values that fall outside the original (or theoretically defined) ranges characterising the underlying data. In most cases this is not a great problem, as such values are generally very few in number. Furthermore, for the deterministic method it can only happen after the final rescaling, when the underlying distribution generates an unusually large rescaling factor. Nevertheless, the fact that this can, in theory, occur should not be forgotten, and it can be problematic when the methods are applied to bounded (e.g. beta) or semi-bounded (e.g. log-normal) distributions. In such cases, some bins may be generated with values falling outside the initial boundaries (see for example the bins with values below zero and above one in panels (iv) and (v) of Fig. 1(C)).
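The out-of-range behaviour of the probabilistic method can be reproduced with a short sketch. It assumes a U-shaped beta variable with hypothetical shape parameters (0.5, 0.5) standing in for a bounded bimodal variable; the noise standard deviation of 0.25σ follows the text, while the function name `add_noise` and the sample size are illustrative only.

```python
import random
import statistics

def add_noise(values, q, rng):
    """Add independent N(0, (q*sigma)^2) noise, where sigma is the sample sd of `values`."""
    sigma = statistics.stdev(values)
    return [v + rng.gauss(0.0, q * sigma) for v in values]

rng = random.Random(7)
# Hypothetical bounded variable on [0, 1]; the shape parameters are an assumption.
y = [rng.betavariate(0.5, 0.5) for _ in range(500)]

noisy = add_noise(y, q=0.25, rng=rng)
out_of_range = sum(1 for v in noisy if v < 0.0 or v > 1.0)
print(f"{out_of_range} of {len(noisy)} noisy values fall outside [0, 1]")
```

Because much of the mass of such a variable sits close to the boundaries, even moderate noise pushes some anonymized values below zero or above one, which is the effect visible in the histograms of a bounded variable.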
The second example of displaying exploratory statistics in graphs is the visualization of data in scatter plots. A scatter plot is highly informative, as it indicates the dependence (linear or non-linear relationship) between two variables and their correlation, if any. However, as a corollary, a scatter plot can also be potentially disclosive, as it provides a graphic representation of the exact coordinates of two variables from a set of data. A scatter plot can be rendered privacy-preserving if the actual values of the two variables (particularly in outliers) are concealed, and it remains informative if the graphic generated by the selected anonymization technique faithfully demonstrates the underlying statistical properties of the displayed variables. Figure 2 shows the privacy-preserving scatter plots of the data from the three datasets, generated using k-anonymization (panels (ii) and (iii)), deterministic anonymization (panels (iv)) and probabilistic anonymization (panels (v)). The privacy-preserving scatter plots can be compared with the potentially disclosive scatter plots of the raw data (panels (i)).
We mentioned earlier that k-anonymization can be applied to plots where data are grouped either in bins (i.e. in one dimension) or in grid cells (i.e. in two dimensions). So here, in order to produce the scatter plots of panels (ii) in Fig. 2, we first generate a 30 by 30 density grid matrix (or similarly a generalized 15 by 15 matrix for the scatter plots of panels (iii)) based on the ranges of the X and Y variables, and suppress any data that lie in grid cells with fewer than three counts. We then plot data points at the centers of the remaining cells (i.e. those having at least three observations) and size the dots by the density of observations in their cell. The resulting panels show that k-anonymization under-performs in the case of scatter plots, as there is a significant reduction of the information presented in the plots (i.e. more than half of the data are aggregated and/or suppressed).
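The grid-based construction of these scatter plots can be sketched as follows. This is an illustrative reading of the procedure rather than the code behind Fig. 2: the function name `grid_scatter` and the simulated bivariate data are assumptions, while the 30 by 30 and 15 by 15 grids and the threshold of three follow the text.

```python
import random

def grid_scatter(xs, ys, n_cells=30, threshold=3):
    """Return (x_centre, y_centre, count) for every grid cell holding >= threshold points."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    dx = (x_max - x_min) / n_cells
    dy = (y_max - y_min) / n_cells
    counts = {}
    for x, y in zip(xs, ys):
        # Clamp the index so that the maximum value falls in the last cell.
        i = min(int((x - x_min) / dx), n_cells - 1)
        j = min(int((y - y_min) / dy), n_cells - 1)
        counts[(i, j)] = counts.get((i, j), 0) + 1
    return [
        (x_min + (i + 0.5) * dx, y_min + (j + 0.5) * dy, c)
        for (i, j), c in sorted(counts.items())
        if c >= threshold
    ]

rng = random.Random(3)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]  # simulated data
ys = [x + rng.gauss(0.0, 1.0) for x in xs]

fine = grid_scatter(xs, ys, n_cells=30)     # suppression only
coarse = grid_scatter(xs, ys, n_cells=15)   # generalization, then suppression
kept_fine = sum(c for _, _, c in fine)
kept_coarse = sum(c for _, _, c in coarse)
print(f"observations represented: {kept_fine} (30x30), {kept_coarse} (15x15)")
```

Each returned triple gives the position of one dot and the count that would set its size; all observations in suppressed cells are simply absent from the plot.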
On the other hand, the deterministic and, to some extent, the probabilistic anonymization can retain the underlying dispersion of the original data in the 2-dimensional space. Based on a visual inspection, statistical properties such as the mean, the variance, the covariance and the correlation of the variables appear to remain approximately stable, while the exact values of the raw data cannot be identified. A characteristic of the deterministic anonymization is that the observations in a cluster of size k which is isolated from the bulk of the data share the same set of \(k-1\) nearest neighbours and thus have identical centroids. It is therefore possible that fewer points are visible in a scatter plot of scaled centroids than in the scatter plot of the corresponding raw data. The centroids that coincide with other centroids are shown by black dots in panels (iv) of Fig. 2. For the probabilistic anonymization, it is important to select an appropriate value for the parameter q, in order to keep the noisy data within an area that does not extend too far beyond the convex hull of the original values. This condition is important when the convex hull of the data defines the boundaries of bounded (or semi-bounded) distributions. A violation of that condition is observed in panel (v) of Fig. 2, where the noisy counterpart of variable Y has values outside its initial \([0,1]\) range.
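The coincidence of centroids in an isolated cluster can be demonstrated with a minimal sketch of a nearest-neighbour centroid scheme. Several details are assumptions rather than the method as specified elsewhere in the paper: distances are Euclidean on the unstandardized variables, and the final rescaling stretches each coordinate about its mean so that its standard deviation matches that of the original variable. All function names and the simulated data are hypothetical.

```python
import math
import random
import statistics

def knn_centroids(points, k=3):
    """Centroid of each point together with its k-1 nearest neighbours (Euclidean distance)."""
    centroids = []
    for p in points:
        # p is at distance 0 from itself, so the k closest points include p.
        group = sorted(points, key=lambda other: math.dist(p, other))[:k]
        # fsum is order-independent, so identical groups give bit-identical centroids.
        centroids.append(tuple(math.fsum(coord) / k for coord in zip(*group)))
    return centroids

def rescale_to_sd(values, target_sd):
    """Stretch values about their mean so that their sample sd equals target_sd (assumed rule)."""
    m = statistics.fmean(values)
    factor = target_sd / statistics.stdev(values)
    return [m + factor * (v - m) for v in values]

def deterministic_anonymize(points, k=3):
    centroids = knn_centroids(points, k)
    scaled = []
    for dim in range(len(points[0])):
        target = statistics.stdev([p[dim] for p in points])
        scaled.append(rescale_to_sd([c[dim] for c in centroids], target))
    return list(zip(*scaled))

rng = random.Random(11)
bulk = []
for _ in range(200):
    x = rng.gauss(0.0, 1.0)
    bulk.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))
cluster = [(8.0, 16.0), (8.1, 16.0), (8.0, 16.1)]  # isolated cluster of size k = 3
points = bulk + cluster

anonymized = deterministic_anonymize(points, k=3)
distinct = len(set(anonymized))
print(f"{len(points)} raw points -> {distinct} distinct scaled centroids")
```

The three isolated points are each other's nearest neighbours, so they collapse onto a single scaled centroid, and the anonymized plot shows fewer distinct points than the raw one.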
Another widely used graphical representation of data is the heat map or image plot. A heat map is a 2-dimensional display of the values contained in a density grid matrix. Each grid cell is shown in a colour that indicates the density of the values in that cell. To generate a heat map plot of bivariate data, we first create their density grid matrix: we split the range of each variable into equal sub-intervals and record the count of values falling in each cell in a 2-dimensional matrix (i.e. the density grid matrix). The pixels in the heat map plot then reflect the density in each cell of the grid, with a colour determined by the observed count. Using density grid matrices we can also display similar information in contour plots (i.e. topographic maps). A contour plot is a graphical display for representing a 3-dimensional surface by plotting constant z slices, called contours, on a 2-dimensional area. Given the values of z, which are the densities in the grid matrix, the lines that form the contours connect \((x,y)\) coordinates where the same z values occur.
Panels (i) in the top rows of Figs. 3(A), 3(B) and 3(C) show the heat map plots of X and Y formed by using a 30 by 30 density grid matrix. In panels (ii), we show the privacy-preserving heat map plots produced by suppressing the grid cells with fewer than three counts (for example, in Fig. 3(A), panel (ii), 164 cells out of the 900 have been suppressed). In panels (iii) we display the privacy-preserving heat maps produced by using a 15 by 15 density grid of the X and Y variables (i.e. generalization) and then suppressing any cells with fewer than three counts (for example, in Fig. 3(A), panel (iii), 42 out of 225 cells have been suppressed). Panels (iv) and (v) represent the privacy-preserving heat maps produced using the 30 by 30 density grid matrices of the deterministically and probabilistically anonymized data respectively. The bottom rows of Figs. 3(A), 3(B) and 3(C) show the contour plots produced from the same density grid matrices as the corresponding heat map plots.
The last example we present here of a common visualization used in exploratory data analysis is the box plot. A box plot visually shows the distribution and skewness of numerical data by displaying the data quartiles (or percentiles) and averages. Potentially disclosive observations on a box plot are the outliers and the minimum and maximum points which define the ends of the whiskers. To protect such information from being disclosed, we apply the three anonymization techniques and generate privacy-preserving box plots. For the k-anonymization, we take the 30 by 30 density grid matrix of variables X and Y (or similarly the 15 by 15 matrix for the case of generalization), suppress the grid cells with fewer than three counts, and then use the remaining values of each variable separately to produce their privacy-protected box plots. For the deterministic and the probabilistic methods, we generate the scaled centroids and the noisy values respectively, and then use those to produce the protected box plots. Figure 4 shows the box plots generated by suppression (panels (ii)), generalization and suppression (panels (iii)), deterministic anonymization (panels (iv)) and probabilistic anonymization (panels (v)) for the variables of the three simulated datasets.
From Fig. 4 we conclude that the k-anonymization suppresses the observations with a high risk of identification (e.g. outliers). However, it produces an undesirable information loss, as it tends to shrink the variables’ range (compare for example the whiskers of the box plots of variable Y in panels (ii) and (iii) of Fig. 4(A) with the whiskers of the actual box plot shown in panel (i)). Another disadvantage of that method (as we also observed earlier in the case of histograms) is that it can suppress a significant number of observations and can therefore deform the underlying distribution and distort its statistical characteristics. One example of such distortion is shown in the box plot of variable Y in dataset D3, whose mean is shifted to the left under suppression (compare the mean of the box plots of Y between panels (i) and (ii) in Fig. 4(C)). This shift leads to the unwanted conversion of a symmetrical distribution to a less symmetrical one.
On the other hand, the other two methods, the deterministic and the probabilistic anonymization, seem to perform better than the k-anonymization. One observation is that both methods still display outlier points. This is not a concern, however, as the location (and also the number) of those points differs from that of the outlier points in the original box plot. The only observed limitation of the probabilistic technique is that a high level of noise might generate box plots with whiskers longer than those of the underlying data, denoting an expanded range for the variable (see for example the box plot of variable Y in panel (v) of Fig. 4(C)).
4.2 Regression plot diagnostics
In linear regression analysis, key statistical properties of the model, such as likelihood ratio tests, Wald tests, associated p-values, and the \(R^{2}\) statistic, can provide information about the extent to which a model and its various components provide an adequate representation of its generating data. Among the approaches to exploring model fit, diagnostic plots are some of the most important; they typically provide a quick and intuitive way to look at key aspects of model fit. For example, they can provide information about assumptions of linearity or constancy of variance. They can also reveal unexpected patterns in the data and can give insights on how the model structure may be improved. One important class of diagnostic plots makes use of regression residuals, which represent the discrepancy between the observed outcome and the value predicted by the fitted model in individual records. However, by combining the model covariates, the estimated regression coefficients, and knowledge of the basic model structure (such as the link function), one can generate the predicted values and then, by adding the vector of residuals, directly obtain the exact values of the outcome variable [27]. This is, by definition, disclosive. Furthermore, one of the most informative and widely used diagnostic tools is the plot of residuals versus fitted (or predicted) values. With no other information at all, this plot can allow direct calculation of the actual values of the outcome variable. Because of the analytic value of such plots (whilst recognising the basic disclosure risk they create), we use the anonymization approaches to generate privacy- and information-preserving regression plot diagnostics. To do this, we first fit the regression model to the actual data and then use an anonymization process to anonymize the regression outcomes, and hence allow visualization of the diagnostic plots while mitigating the risk of disclosure. This approach can be used to visualize plots of residuals, metrics of leverage, influence measures and their relationships with fitted values.
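The disclosure route through a residuals versus fitted plot is simple arithmetic for a model with an identity link, as the following sketch shows. The simulated data, the coefficients used to generate them and the helper `fit_simple_ols` are assumptions made for illustration.

```python
import random
import statistics

def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

rng = random.Random(5)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)]       # simulated covariate
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]  # simulated outcome

b0, b1 = fit_simple_ols(xs, ys)
fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# An observer who reads both coordinates of each plotted point recovers the outcome.
recovered = [f + r for f, r in zip(fitted, residuals)]
max_error = max(abs(y - r) for y, r in zip(ys, recovered))
print(f"largest reconstruction error: {max_error:.2e}")
```

The reconstruction error is at the level of floating-point rounding, which is why the unprotected plot is treated as disclosive.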
Figures 5(A), 6(A) and 7(A) show the plots of residuals against fitted values for the data from datasets D1, D2 and D3 respectively. For a number of classes of model, for example those with a Gaussian error and identity link (like the fitted model of the relationship between X and Y from dataset D1 which is given by \(\hat{Y}=\beta _{0} + \beta _{1} X\)), this type of plot can provide a convenient way to explore the basic assumptions of linearity and homoscedasticity (i.e. the residuals have constant variance) [28]. If the residuals are symmetrically distributed around the horizontal axis without nonlinear patterns and without a systematic trend in their dispersion, then it can reasonably be concluded that there is little or no evidence against the assumption that the regression relationship is linear and homoscedastic. Thus, Fig. 5(A) validates the assumptions of linearity and homoscedasticity, Fig. 6(A) provides evidence of a non-linear relationship between X and Y, and Fig. 7(A) indicates that there is no relationship between X and Y (i.e. the solution of a simple linear regression model is a horizontal line at \(\hat{Y} \approx 0.5\)).
Figures 5(B), 6(B) and 7(B) present normal quantile-quantile (QQ) plots for the residuals of the regression outcomes of data from datasets D1, D2 and D3 respectively. The QQ plots compare the observed distribution of residuals against a theoretical normal distribution by plotting their quantiles (percentiles). If the residuals do not seriously deviate from the line of identity, then we may conclude that they are consistent with a normal distribution. Thus, Fig. 5(B) confirms the normality of the residuals (as we have deliberately created the error term to follow a standard normal distribution; see Sect. 3), while Figs. 6(B) and 7(B) indicate that the residuals do not follow a normal distribution (for dataset D2 we already know that the residuals follow a uniform distribution; see Sect. 3). Finally, the plots in Figs. 5(C), 6(C) and 7(C) display the residuals against leverage, together with lines of constant Cook’s distance, which is a measure of local influence [29, 30]. A large Cook’s distance could be due to a large residual, high leverage or both. For the three examples shown in Figs. 5(C), 6(C) and 7(C), all cases are well inside the Cook’s distance lines, which are not visible in the selected narrow window frames.
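To make the statement that a large Cook’s distance can come from a large residual, high leverage or both concrete, the sketch below computes both quantities for a simple linear model. It uses the standard textbook expressions \(h_{i} = 1/n + (x_{i}-\bar{x})^{2}/S_{xx}\) and \(D_{i} = \frac{e_{i}^{2}}{p\,s^{2}} \cdot \frac{h_{i}}{(1-h_{i})^{2}}\), which are not taken from this section; the function name and the simulated data are assumptions.

```python
import random
import statistics

def leverage_and_cooks(xs, ys):
    """Residuals, leverage h_i and Cook's distance D_i for the model y = b0 + b1 * x."""
    n, p = len(xs), 2  # p = number of estimated coefficients
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate
    leverage = [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]
    cooks = [e * e / (p * s2) * h / (1.0 - h) ** 2
             for e, h in zip(residuals, leverage)]
    return residuals, leverage, cooks

rng = random.Random(21)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)]        # simulated data
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

residuals, leverage, cooks = leverage_and_cooks(xs, ys)
print(f"max leverage {max(leverage):.3f}, max Cook's distance {max(cooks):.3f}")
```

The distance is the product of a squared-residual term and a leverage term, so either factor alone can make it large; the leverages sum to the number of coefficients, which is a convenient check on the calculation.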
Those three regression diagnostic plots can be considered as scatter plots, as they present the dispersion of points in a 2-dimensional space. Therefore, the three anonymization techniques are applied here to generate privacy-preserving diagnostic plots for the regression model assumptions in the same way as they are used to produce privacy-protected scatter plots. Consequently, their performance is similar to what we observed in scatter plots. In other words, the k-anonymization causes a significant information loss (which is evident in almost all panels (ii) and (iii) of Figs. 5–7), while the deterministic and the probabilistic approaches preserve most of the actual characteristics of the plots. An exception is that the addition of random noise, through the probabilistic anonymization, to the points displayed in QQ plots might obscure some characteristics of the actual trajectory of the points around the line of identity.
4.3 The effect of parameter k of the k-anonymization process
We assess the performance of the k-anonymization in generating privacy-preserving visualizations when varying the value of the parameter k. We use values of k equal to 3, 5, 7 and 9 and produce visualizations of exploratory statistics and regression diagnostics. For simplicity, we only use the data from dataset D1. In Fig. 8 we show that, as k increases, the utility of the plots degrades, with the exception of the histograms. As the two variables follow a normal distribution, the suppression rule suppresses, on average, an equal number of bins from each of the two tails of their histograms. Thus, the histograms remain in some sense informative, even for \(k=9\), as the symmetrical shape and the mean of the distributions remain stable. For all the other graphs (i.e. scatter, heat map, contour and box plots) there is an inevitable information loss, even when \(k=3\). This loss grows as k increases and eventually results in non-informative plots, which are of no use for statistical analysis or for drawing conclusions.
Moreover, when the k-anonymization is applied for the generation of privacy-preserving graphs, we need to examine not only the impact of the parameter k on the protection (and consequently the distortion) of the individual-level data, but also the level of generalization. For the scatter, heat map, contour and box plots of Fig. 8, we use a 30 by 30 density grid matrix and suppress any grid cells with fewer counts than the selected value of k. For the histograms, we divide the range of each variable into sub-intervals (which define the bins) of width equal to 0.2, and then suppress any bins with fewer than k observations. In Fig. 9, we present the same plots as those in Fig. 8, with the same variation of the value of parameter k, but with a doubled level of generalization. In other words, we use a 15 by 15 density grid matrix for the generation of scatter, heat map, contour and box plots and bins of width equal to 0.4 for the generation of histograms. As a result, there is again a loss of information in the graphical outputs, but this loss is lower than the loss we observed previously.
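The joint effect of k and the level of generalization can be explored with a small sensitivity sweep. This is a sketch on simulated bivariate normal data, not a reproduction of Figs. 8 and 9: the function name `retained_fraction`, the sample size and the data-generating model are assumptions, while the grid sizes and the values of k follow the text.

```python
import random

def retained_fraction(xs, ys, n_cells, k):
    """Share of observations left after suppressing grid cells with fewer than k points."""
    x_min, y_min = min(xs), min(ys)
    dx = (max(xs) - x_min) / n_cells
    dy = (max(ys) - y_min) / n_cells
    counts = {}
    for x, y in zip(xs, ys):
        cell = (min(int((x - x_min) / dx), n_cells - 1),
                min(int((y - y_min) / dy), n_cells - 1))
        counts[cell] = counts.get(cell, 0) + 1
    return sum(c for c in counts.values() if c >= k) / len(xs)

rng = random.Random(8)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]  # simulated data
ys = [x + rng.gauss(0.0, 1.0) for x in xs]

table = {
    n_cells: [retained_fraction(xs, ys, n_cells, k) for k in (3, 5, 7, 9)]
    for n_cells in (30, 15)
}
for n_cells, fractions in table.items():
    print(f"{n_cells}x{n_cells} grid:", [round(f, 2) for f in fractions])
```

Within each grid the retained share can only fall as k rises, since a higher threshold suppresses a superset of cells; how much the coarser grid recovers depends on the sample size and the spread of the data.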
Using the same values of parameter k and the same considerations of generalization, we reproduce the regression diagnostic plots for the regression outcomes of the linear model between the variables X and Y of dataset D1. As those graphs are of the same type as scatter plots, the information loss is significant and thus the statistical utility of the plots is negligible (see Figs. 10 and 11).
4.4 The effect of parameter k of the deterministic anonymization process
We analyse the performance of the deterministic anonymization in generating privacy-preserving visualizations with different values of the parameter k. Here we use values of k equal to 5, 10, 20 and 50. Figure 12 displays the generated protected graphs that can be used for exploratory statistical analysis. We observe that, as the value of parameter k increases, the anonymized data are shifted towards their center of mass (compare panels from left to right in Fig. 12(A), (B), and (C)). This behaviour causes some level of information loss, as the trend line that best describes the linear relationship between the variables X and Y deviates slightly from its actual position (compare red and grey trend lines in Fig. 12(A)). However, this information loss is significantly lower than the information loss which can be introduced through the use of the k-anonymization. Another observation is that the number of centroids that are placed in identical positions (black dots in Fig. 12(A)) decreases as the value of k increases.
The tendency of the centroids to be “attracted” towards their center of mass as the value of k increases is also seen in the residuals versus fitted values and the residuals versus leverage plots (Figs. 13(A) and (C)). This attraction does not considerably change the structure of the displayed points on the graphs (at least for the case of the bivariate normal distribution), and therefore valid conclusions about the linearity of the model, the homoscedasticity of the errors and the existence of any influential points could be derived. However, for large values of k (e.g. greater than 10), the extreme quantile values of the residuals presented in QQ plots disappear (see for example the right panel in Fig. 13(B)), and therefore such plots are not informative for the diagnosis of the residuals’ normality. It should be noted, though, that small values of k (e.g. values between 3 and 10) can be used to generate privacy- and utility-preserving visualizations.
4.5 The effect of parameter q of the probabilistic anonymization process
A sensitivity analysis is also performed for different levels of noise added to the underlying variables through the probabilistic anonymization. We select values of parameter q equal to 0.1, 0.5, 1, and \(\sqrt{2}\). When \(q=1\) the variance of the added noise is equal to the variance of the real variable; when \(q=\sqrt{2}\), the variance of the noise is double the variance of the real variable. Figure 14 shows the privacy-preserving data visualizations for exploratory analysis, generated by the probabilistic anonymization with variation of parameter q. In this figure, we observe that the data points spread out as q increases. This is expected, given the increase in the variability of the data. Therefore, large values of q (e.g. greater than 1) generate large noise, which distorts the displayed data and misrepresents the underlying statistical properties. For example, the heat map plot in the right panel of Fig. 14(B) obscures the apparent correlation of the variables which is observed in the heat map plot of the actual variables in the left panel. Similar deformation is observed in the generated histograms (Figs. 14(D) and (E)), which still indicate the symmetry of the normal distributions and retain their average, but expand their variance.
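The widening of the distributions and the weakening of the apparent correlation can be quantified with a short derivation. It assumes that the noise terms \(\varepsilon_{X} \sim N(0, q^{2}\sigma_{X}^{2})\) and \(\varepsilon_{Y} \sim N(0, q^{2}\sigma_{Y}^{2})\) are independent of the data and of each other, and that the same q is used for both variables; the starred symbols \(X^{*}=X+\varepsilon_{X}\) and \(Y^{*}=Y+\varepsilon_{Y}\) and the correlation \(\rho\) of the original variables are notation introduced here for illustration.

\[
\operatorname{Var}(X^{*}) = \operatorname{Var}(X) + \operatorname{Var}(\varepsilon_{X}) = (1+q^{2})\,\sigma_{X}^{2},
\qquad
\operatorname{corr}(X^{*},Y^{*}) = \frac{\operatorname{Cov}(X,Y)}{(1+q^{2})\,\sigma_{X}\sigma_{Y}} = \frac{\rho}{1+q^{2}} .
\]

Under these assumptions the means are unchanged, the variance of each noisy variable is inflated by the factor \(1+q^{2}\), and the correlation is attenuated by the same factor: the multiplier \(1/(1+q^{2})\) is approximately 0.99 for \(q=0.1\), 0.8 for \(q=0.5\), 0.5 for \(q=1\) and 1/3 for \(q=\sqrt{2}\).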
In Fig. 15 we show how the variation of parameter q affects the utility of regression diagnostic plots. Here, as we use the two normally distributed variables from dataset D1, the generated plots of the residuals versus fitted values (Fig. 15(A)) preserve the correlation between the two quantities (even for \(q=\sqrt{2}\)), as the independent noise added to each vector follows a normal distribution. Therefore, valid inferences about the model linearity and homoscedasticity can be derived. On the other hand, the addition of random noise to the quantiles of residuals (Fig. 15(B)), and to the leverage and residuals (Fig. 15(C)), deforms the underlying characteristics of the corresponding plots when q increases. Thus, the privacy-preserving QQ plots cannot provide evidence for the errors’ normality. Similarly, the noisy points displayed in the residuals versus leverage plots spread in the vertical direction of the graph as q increases, and hence may produce leverage points which do not exist in the real data.
4.6 The performance of the techniques in different sample sizes
We acknowledge that it is a difficult task to protect the privacy of individuals in datasets of small sample sizes without losing information. Here, we use the three anonymization techniques to generate privacy-preserving plots for data with small sample sizes to illustrate their performance. We simulate three sets of two normally distributed variables, each having the same statistical characteristics as those of dataset D1, with 50, 100 and 300 observations respectively. Figures 16, 18 and 20 show the generated privacy-preserving plots of exploratory statistics, and Figs. 17, 19 and 21 show the generated privacy-preserving regression diagnostic plots for each sample size respectively. As we can see from those six figures, the k-anonymization cannot be used on data with small sample sizes, because it eliminates too much information from the plots. The deterministic anonymization performs quite well in almost all of the instances, while the probabilistic anonymization also performs well, but a careful selection of the parameter q is crucial to prevent unwanted over-anonymization.