What did you see? A study to measure personalization in Google’s search engine

In this paper we present the results of the project “#Datenspende” where during the German election in 2017 more than 4000 people contributed their search results regarding keywords connected to the German election campaign. Analyzing the donated result lists we show that the room for personalization of the search results is very small. Thus the opportunity for the effect mentioned in Eli Pariser’s filter bubble theory to occur in this data is also very small, to a degree that it is negligible. We achieved these results by applying various similarity measures to the result lists that were donated. The first approach, using the number of common results as a similarity measure, showed that the space for personalization is less than two results out of ten on average when searching for persons and at most four regarding the search for parties. Application of other, more specific measures shows that the space is indeed smaller, so that the presence of filter bubbles is not evident. Moreover this project is also a proof of concept, as it enables society to permanently monitor a search engine’s degree of personalization for any desired search terms. The general design can also be transferred to intermediaries, if appropriate APIs restrict selective access to contents relevant to the study in order to establish a similar degree of trustworthiness.

the possibilities and dangers of a so-called algorithmically generated filter bubble increase steadily.

The model of algorithmically generated and reinforced filter bubbles
The term filter bubble in the context of search engines is a partial concept of the filter bubble theory by Eli Pariser. In his 2011 book, "The Filter Bubble: What the Internet Is Hiding from You" [21], he explained that two of his friends had received significantly different results when searching for "BP" on the online search platform Google while the oil rig Deepwater Horizon was losing oil in the Gulf of Mexico in 2010. Therefore the internet activist pointed out the possible dangers of so-called filter bubbles. He developed a theory according to which personalized algorithms in social media tend to display content to individuals that corresponds to the previous views of the user, so that different information spheres can form, in which different contents or opinions prevail. In short, individual filtering of the information flow can lead to groups or individuals being informed about different facts, i.e. living in "a unique universe of information" [21, p. 9]. This is especially problematic if the respective content is politically extreme in nature and if a one-sided perspective results in impairment or total deterioration of citizens' discursive capabilities. A filter bubble in this sense is a selection of news that corresponds to one's own perspectives, which could potentially lead to solidification of one's own position in the political sphere.
Filter bubbles can be understood as even more advanced concepts. Selecting websites by means of state censorship can also create filter bubbles by restricting information. While this censorship is supported by algorithms, this does not constitute a filter bubble through algorithm-based personalization which is hypothesized in this article. This kind of restriction of information and its possible consequences for filter bubble formation are not examined here. After all, a search engine operator or a social media platform could knowingly and intentionally limit the data base in a certain direction, thus presenting all users with the same content while only offering a selective extract of reality. This option is also not examined here.

When are algorithmically generated and amplified filter bubbles dangerous?
Eli Pariser's filter bubble theory, with its unsettling consequences for society, is based on these four basic mechanisms [21]:
1. Personalization: An individually customized selection of contents, which achieves a new level of granularity and previously unknown scalability.
2. Minor overlap of respective new/different results: A low or non-existent overlap of filter bubbles, i.e. news and information from one group remain unknown in another.
3. Contents: The nature of those contents, which essentially only becomes problematic with politically charged topics and drastically different perspectives.
4. Isolation from other sources of information: The groups of people whose respective news situation displays homogeneous, politically charged and one-sided perspectives rarely use other sources of information, or only those which place them in extremely similar filter bubbles.
The stronger those four mechanisms manifest themselves, the stronger the filter bubble effect grows, including its potential harmful consequences for society. The degree of personalization is essential, as politically relevant filter bubbles do not emerge if personalization of an algorithm responsible for selecting news is low. High personalization and verified filter bubbles do not necessarily take political effect if either their contents are not political in nature or users make use of other sources of information as well. For instance, information delivered to citizens of different languages is free of overlap by definition if the results are displayed in those languages; regardless, content-wise those citizens are not in any way embedded in filter bubbles.

Revise Eli Pariser's filter bubble theory
As algorithms are capable of controlling the flow of information directed towards users, they are assigned a gatekeeper role similar to journalists in traditional journalism (see [17]). As a result, it is necessary to examine how powerful the algorithmically generated and hardened filter bubbles on various intermediaries and search engines actually are. The number of reliable studies is relatively low. An important German study by the Hans Bredow Institute offers a positive answer to the question of the informational mix: sources of information today are diverse and capable of pervading algorithmically generated and hardened filter bubbles with other news and information [22]. Also noteworthy are newer results that indicate the absence of filter bubbles in Google searches (compare [12]). It is pointed out that algorithms offer a possibility to burst open filter bubbles if such a functionality is explicitly implemented. To our knowledge, apart from anecdotal examinations, a quantitative evaluation of the degree of personalization for a larger user base had not been conducted up until 2017. For example, in the context of a Slate article Jacob Weisberg asked only five persons to search for topics and found results to be very similar [32]. Vital questions of the degree of personalization and overlap of single news flows can only be resolved with a large user base. Executing such an investigation appears imperative, especially in light of the debate regarding the influence of filter bubbles in social networks, which was sparked in 2016 after Donald Trump's presidential election victory; unfortunately, due to insufficient APIs, this is currently not possible. Given the major political event of the federal elections in Germany, we decided to realize the project "#Datenspende: Google und die Bundestagswahl 2017" (referenced as Datenspende) in order to find out whether Google already personalizes search results, as has often been speculated.
This project should therefore make the first of the four basic mechanisms for a filter bubble measurable.
The first section of this paper deals with the design of the project Datenspende and the general framework of this project, the data preparation and the resulting datasets. In the following section, Personalization, an overview of the recent research in the field of personalization (and regionalization) of the results from search engines is given. Furthermore, the investigations to determine possible personalization effects are described. In the section Discussion the results are put together and a conclusion summarizes the results.

Study design
The study design and the basics of the collection are explained in the following section, including the structure of the data, important terms and the preparation of the data basis.

Software structure and enrollment
The plug-in to collect the data was made available for the Internet browsers Chrome and Firefox in order to achieve a market coverage in Germany of over 60% [27]. All necessary insights into the source code of the plug-in were published at the beginning of the project via GitHub. The plug-in searched for 16 search terms at fixed search times (4:00, 8:00, 12:00, 16:00, 20:00 and 24:00), if the browser was open at that time. The search queries to Google and Google News ran automatically and the personal results of the donors were automatically sent to our server. For each user and timestamp, 16 search terms were queried twice and the first page of each search result was submitted. The search terms were limited to the seven major parties and their respective party leaders (see Table 1). As can be seen in Fig. 1, after downloading the plug-in, users were free to decide whether they wanted to be informed about future donations or whether they should run in the background as far as possible.
Figure 1 Browser plug-in, immediately before initiating a search for a user, whose first search result page would then be transferred to the provided server structure and is thus "donated"
Information regarding the project and the related call for the data donation was distributed via our project partners' communication channels as well as our media partner Spiegel Online [15]. As a result, 4384 plug-in installations took place. The resulting search results are freely accessible to the populace for analytical purposes (see section: Availability of data). It should be pointed out that the results that can be seen in the final report are not necessarily representative, as the data donors were recruited voluntarily and by self-selection. For the most vital findings however, especially regarding the degree of personalization, we assume that they would not change much if the user base were representative. It should also be noted that an automated search for about a dozen search terms can have an influence on the search engine algorithm itself. On Google Trends, over the runtime of our data collection and for the search terms "Dietmar Bartsch" and "Katrin Göring-Eckardt", it can clearly be seen that the search request volume was increased by it (see Fig. 2). Since the search requests were performed automatically and none of the offered links were actively clicked, we suspect the effects to be low enough to be negligible. However, lacking exact knowledge of the underlying algorithm, this cannot be proven and has to remain unevaluated.
The following terms are also used in this report: Study period: This is the period from 21.08.2017 to 24.09.2017, taking into account only weekdays and, as the only weekend, the election weekend (for more information, see next chapter and especially Fig. 5). The investigation period thus includes 27 days.
Search time/time stamp: By a search time/time stamp we mean a day within the investigation period and the corresponding time, which can be 12:00, 16:00 or 20:00. We limit ourselves to these times because significantly fewer users were searching at the other search times. The total number of search times/time stamps is 81 (three different times on each of 27 days).
(Search) result list: By a results list we understand the set of URLs that were delivered to a user for a given search term and a defined search time.
Topstories and organic search results: Topstories are the up to three news articles that Google sometimes (see Fig. 3) delivers at the top of a regular Google search query. In addition to the pure textual information, a corresponding figure is also displayed. The remaining search results are hereinafter referred to as organic search results. Figure 3 shows that there are almost always some result lists without Topstories, but the majority of result lists contain Topstories. Since for every timestamp there are result lists with and without Topstories and these are not comparable, we decided to drop the Topstories to make the lists comparable. Any decision on how similar result lists with and without Topstories are would be arbitrary, especially since some measures include the respective position, which raises the question of how to deal with missing Topstories. A more specific analysis of these Topstories would perhaps also lead to interesting results.
There are entries that are incorrect or different from the standard. Also the plug-in did not run smoothly from the beginning and produced partly erroneous data. Therefore, we describe the necessary data preparation in the next section.

Data preparation and datasets
The first version of the plug-in for the Firefox browser assigned the same ID to all users. Since we also wanted to analyze the changes in search result lists over time, we decided to ignore this data in order to have a uniform data basis. This means that 34% of all donated URLs are removed from Google searches.
A first analysis of the available data showed further irregularities. For example, the database contained search result lists that did not correspond to the expected standard length (10 entries for a pure Google search). Further, the data basis contains some data records with 200 entries in the result lists. This is due to the ability to set the number of search results displayed on the first page in the Google Account. We have shortened these lists to the usual 10 plus any Topstories displayed. Other errors are due to incorrect programming of the first Firefox plug-in, which generated search result lists with the same URL everywhere. To clean the data basis we always dropped the whole (erroneous) result list, to guarantee that the remaining lists are correct in total.
The same is true for URLs that merely contained a reference to the corresponding URL on Google (google.de/url) or contained a URL entry that only refers to "google", i.e. did not contain an entire link; these entries refer to a private search result. It was also noticeable that a number of search result lists contained larger numbers of URLs of websites in other languages.
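The cleanup rules described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the function names, the representation of a result list as a plain list of URL strings, and the exact host check are hypothetical, and Topstories are assumed to have been separated beforehand.

```python
from urllib.parse import urlparse

MAX_ORGANIC = 10  # usual number of organic results on the first page


def is_google_reference(url):
    """True if an entry only points back to Google (e.g. google.de/url)
    or consists of the bare word "google". Host list is an assumption."""
    if url.strip().lower() == "google":
        return True
    parsed = urlparse(url if "://" in url else "//" + url)
    host = parsed.netloc.lower()
    is_google_host = any(
        host == d or host.endswith("." + d) for d in ("google.de", "google.com")
    )
    return is_google_host and parsed.path.rstrip("/") in ("", "/url")


def clean_result_list(urls):
    """Return the cleaned list, or None if the whole list is dropped."""
    if not urls:
        return None
    # Same URL everywhere: artefact attributed to the first Firefox plug-in.
    if len(urls) > 1 and len(set(urls)) == 1:
        return None
    # A single Google-only entry invalidates the whole list.
    if any(is_google_reference(u) for u in urls):
        return None
    # Over-long lists (e.g. 200 entries) are cut to the usual length.
    return urls[:MAX_ORGANIC]
```

Dropping whole lists rather than single URLs keeps every remaining list internally consistent, which matters for the position-based measures used later.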
These cleanup steps together reduced Google search data records by 19.1%. For the later calculations three data sets were built. The first contains the data that was only cleaned as already explained (All). To filter out foreign results, we used the user's presumed language, which can be determined from the field "published" of the top story. German users were given a time in German ("vor kurzem", "vor 4 Stunden"). As soon as one of these details was written in German, we marked this result list as German and built a German data set from this data. Our second data set is thus limited to German-language result lists (German). The final one had an IP location in Berlin in addition to the German characteristic (Berlin). An overview of the adjusted data sets is given in Table 2.
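A minimal sketch of the language heuristic, assuming that each result list carries the "published" strings of its top stories and that a leading "vor " (as in the two examples above) is a sufficient German marker. Both the function name and the prefix rule are illustrative assumptions; the actual test used in the project is not specified here.

```python
def is_german_result_list(published_values):
    """Mark a result list as German if at least one 'published' value
    of its top stories is phrased in German (assumed: starts with 'vor ')."""
    return any(
        value.strip().lower().startswith("vor ")
        for value in published_values
        if value
    )
```

A list without any top story cannot be classified this way, so under this sketch it would not enter the German data set.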
Since it turned out that Google not only regionalized by IP address, every URL in the dataset Berlin was manually tagged for regionality. Thus it is possible to reduce the Berlin dataset to URLs which do have a regional character; we call this dataset Berlin regional.

Threats to validity
Data cleaning: Data cleaning can lead to incorrect datasets if search result lists are inconsistent or shortened. We have minimized this effect by not removing single URLs but completely removing such faulty lists.
Noise: Due to the study design we have no influence on the log-in behavior, the IP addresses and the geo-regions of the users, who were all volunteers with an installed plug-in for Firefox or Chrome. This source of potential noise cannot be excluded, since it was impossible to run the searches with fresh accounts and different log-in statuses. Nevertheless the amount of result lists submitted per time stamp ensures that the results are reliable.
Carry-over effects: The data was cleaned with great care, but noise and the carry-over effect described by [13] cannot be excluded completely. In our study the searches were performed at time stamps that are four hours apart, but they were executed in a slot of approximately 30 minutes, so this effect also cannot be excluded completely.

Hard facts
The data was collected from volunteer donors who installed the data collection tool into the Mozilla Firefox or Google Chrome browser on their machine(s). One would therefore expect most data to be collected during normal workdays and a decline in the amount of data collected during weekends. Figure 4 shows that these expectations hold: the valleys in this figure correspond to the much lower number of URLs we received in result lists during weekends. Therefore only the search results for workdays were considered for the study. On average, for each day under consideration there were 506.9 users who contributed search results.
The distribution of the users who contributed data is shown in Fig. 5. This map shows that we successfully acquired data contributors across the whole of Germany.

Personalization
First we give the definitions of personalization and regionalization used throughout this paper. Next, this section addresses the question of whether, and to which degree, we can detect personalization or regionalization in the given data. To this end, different measures are applied to the data, measuring the similarity between result lists for each time stamp.

Definition of personalization and regionalization
Regarding the term of (preselected) personalization, this article follows the definition given by [34], according to which personalization allows the selection of content that has not yet been clicked by the user, but which is associated with users with similar interests. Algorithmically speaking, this is based on so-called "recommendation systems", which determine the interests of a currently searching user from other people who have shown similar click behavior in the past. It is also plausible that, according to their own click behavior and together with known categorizations of clicked content, a profile is compiled for each person, saying, for instance: This person prefers news about sports and business, reads medium-length text and news that are not older than a day (for a detailed overview see Google's patent for "Personalized search", [30]). Since 2000 there have been research results from the field of information retrieval (IR) which demonstrate the advantages, e.g. the effectiveness, of personalized search results [6,16,24]. What is missing, however, is a systematic analysis of the possible risks. Since the number of web pages associated with a search term is more than 10 for almost all queries and at the same time the highest ranked web pages shown receive the greatest attention from users (web searches: [10]; Google: [20]), it is essential that search engines filter the possible search results by selection and sorting. Certainly one of the most important filters is the user's language, while topicality and popularity play an additional role as well as, to a lesser extent, embedment in the entire WWW (e.g. measured by the PageRank algorithm [3]). In 2004 the search engine Google launched a test version of a personalized search engine [14], slowly transferred it from the test environment into day-to-day operation from November 2005 on [7], and since 2009 Google has spoken of a "personalized search for all" [8].
In the current privacy policies of Google the personalized searches are clearly addressed: "We use the information we collect to customize our services for you, including providing recommendations, personalized content, and customized search results" [9].
Here the number of used signals seems to increase steadily: in 2011 Pariser wrote that more than 50 signals are used [21, p. 2], Google itself mentions the user's language, geolocation, history of search queries, and their Google+ social connections [25], and today it seems to be already more than 200. Considering the vast number of users, this can only be achieved algorithmically, using different modes of machine learning, which only form statistical models [35]. At the latest since May 2012, when Google published in its privacy policy that all of Google's services can share their information about the users [33], it could be expected that users logged in to their Google accounts will tend to receive more personalized search results than users who are not logged in.
An important point, which also attracts a lot of attention in IR, is the relationship between location and relevance of search results [2]; this concept is called regionalization. With regard to Internet searches, regionalization is the selection of websites for a whole group of people who are currently searching from a certain region or who are known to come from a certain region but who do not necessarily mention a region in their search query. For instance, the current location can roughly be derived from the searching device's IP address, or more accurately from smartphone location information or from the profile known to the search engine [4,29]. The delivered websites themselves are clearly related to the location of interest specified by Google, which can be the case, for example, if a nearby location's name appears on the website repeatedly. It is important to note that regionalization on a particularly small scale can be counted towards personalization, for example, if a selection of regional websites is delivered to each person of a household while differing from the selection for their neighbors. However, if the results refer to a larger group, such as cities or federal states, this paper does not count these results as personalization, as they are too extensive for a filter bubble as mentioned before.

Computation
To examine the degree of possible personalization, measures have to be defined to compare the search results for every keyword and time stamp. The search results consist of 8 to 10 URLs for each keyword and time stamp (see section "Study design"). In the following, different similarity measures are applied and finally, depending on the aggregation, the respective mean values are calculated.
For this investigation four different similarity measures were used. In a first step we calculate the number of common results for each pair of result lists belonging to one time stamp, the so-called commons. This gives an overview of how many search results (i.e. URLs) can be personal to a user. The next measure applied is the deviation per rank, where we calculate the percentage of results that change at each rank (position 1 up to 10 in the result list). The third measure we used is the longest common subsequence (LCS), to get information on whether there are identical sublists in the result lists. The LCS reflects ordering only within the shared subsequence and not the ordering of the result lists as a whole. So we refined this by applying the measure Kendall τ to the result lists, which gives information about the ordering of the result lists.

Commons of result lists
One of the most intuitive similarity measures for a pair of result lists is the number of URLs common to two lists, i.e. let $l_1$ and $l_2$ be two lists of search results, then:
\[ \mathrm{commons}(l_1, l_2) := \left| \{ u \mid u \in l_1 \text{ and } u \in l_2 \} \right| . \]
We use this measure to get evidence about the space for personalization in the search results. To this end, we calculate the average length of the result lists and subtract the average number of common results we found in the result lists. This delivers a measure for the space for personalization in the results, i.e. the number of potentially unique URLs for the user. The results we got are shown in Fig. 6 for the parties and Fig. 7 for the persons.
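As a rough illustration of this measure, the following sketch computes the commons of two lists and the resulting space for personalization for all result lists of one search term and time stamp. The function names and the list-of-strings representation are assumptions made for the example.

```python
from itertools import combinations
from statistics import mean


def commons(l1, l2):
    """Number of URLs that occur in both result lists."""
    return len(set(l1) & set(l2))


def space_for_personalization(result_lists):
    """Average list length minus average pairwise commons.

    Expects at least two result lists for one search term and time stamp.
    """
    avg_length = mean(len(l) for l in result_lists)
    avg_commons = mean(
        commons(a, b) for a, b in combinations(result_lists, 2)
    )
    return avg_length - avg_commons


# Three users, three URLs each; one user sees "d" instead of "c".
lists = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
print(space_for_personalization(lists))
```

In this toy case the pairwise commons are 2, 3 and 2, so on average less than one of the three positions is left for user-specific URLs.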
Shown here are the results for the whole dataset (labeled "All" in the figures) and three subsets of the data. First the result lists identified by the location indicator as German results (labeled "German" in the figures), second the result lists identified as Berlin results (labeled "Berlin" in the figures) and lastly we tagged the URLs by hand to definitely identify URLs from the Berlin area (labeled "Berlin regional" in the figures). This analysis was in that depth only performed for the city of Berlin, because this is the only city for which we have enough data to do such an analysis seriously.

Figure 6 The space which is available for personalization is the difference between the average result list length and the average amount of commons per tuple for the parties in the respective data set
In the different datasets the average length of the result lists (for search results for persons and parties compare the corresponding tables in the Appendix) varied; therefore the delta between the amount of common results and the average length of the result list was plotted as mentioned before. This delta can be seen as the respective space which is available for personalization, since the result lists differ from the average by this delta. This shows that about six URLs, out of nine on average, are identical in the result lists each user gets at every time stamp. Figure 6 shows that the space for personalization is reduced when search results are restricted to more and more local areas. The highest values can be seen for the dataset consisting of all data, while the lowest are shown when the results are restricted to results of the Berlin area; the results for the dataset "German" are somewhere in the middle. The largest space for personalization is shown here for the party "CSU", which is only active (and eligible) in Bavaria.
Remarkable is the difference in the results for persons shown in Fig. 7. We see the same "shrinking" of the space for personalization when restricting the results to more regional areas, but the space for personalization is also far smaller on the whole. While the maximum value for personalization for parties in the whole dataset is approximately 4.5 (CSU, see Fig. 6), for persons we see a maximum value of approximately 2 (Angela Merkel and Alexander Gauland, see Fig. 7). It is noteworthy that for persons the drastic reduction in the space for personalization is due to the restriction to the German dataset, whereas in the search for parties all three restrictions show a clear effect.
It can be stated that the more we restrict the dataset to local data, the less space for personalization there is.

Deviation per rank
Personalization is not only possible by quantity, but also by the position where the search results are presented to the user, especially when the number of clicks increases dramatically with rising position (web searches: [10]; Google: [20]).
Hannak et al. [13] conducted a study in which they tried to quantify personalization in search engines. Among other things, they compared the results of 200 Google users with a parallel, neutral search. It was checked rank by rank, i.e. position by position in the result list, whether the users got the same or different results as the scientists. A similar rank-based approach was applied to our data to obtain an appropriate measure for the sorting order of the delivered results. Since, according to our study design, there is no Amazon Mechanical Turk data to compare to, we used this rank-based measure in the following way.
First the search result lists were compared in pairs, for every rank, which is the position in the result list, with one another (separated by search term and time stamp) to count how many of the URLs are identical in each rank, then the mean value over the study period was determined.
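The metric we use is that used by [13]. Let $\delta$ be the equality indicator derived from the discrete metric, i.e. for a set $M$ and $x, y \in M$:
\[ \delta(x, y) := \begin{cases} 1 & \text{if } x = y, \\ 0 & \text{otherwise.} \end{cases} \]
Let $L := \{l_1, l_2, \ldots, l_n\}$ be the set of result lists for a given time stamp and search query, where every result list consists of (up to 10) URLs,
\[ l_i = (e_{i,1}, e_{i,2}, \ldots, e_{i,10}). \]
Then the deviation $\Delta_k$ of rank $k$, for $1 \le k \le 10$, is
\[ \Delta_k = 1 - \frac{\sum_{1 \le i, j \le n,\; i \ne j} \delta(e_{i,k}, e_{j,k})}{n(n-1)} . \]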
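The pairwise comparison per rank can be sketched as follows. This is a minimal illustration, assuming that result lists are plain lists of URL strings and, as an additional assumption not taken from the text, that a pair of lists is skipped at a rank where one of them has no entry.

```python
from itertools import combinations


def deviation_per_rank(result_lists, max_rank=10):
    """Share of list pairs whose URLs differ at each rank (0.0 to 1.0).

    Returns one value per rank; None where no pair has entries at that rank.
    """
    deviations = []
    for k in range(max_rank):
        pairs = 0
        identical = 0
        for a, b in combinations(result_lists, 2):
            # Assumption: only pairs with an entry at rank k are compared.
            if k < len(a) and k < len(b):
                pairs += 1
                if a[k] == b[k]:
                    identical += 1
        deviations.append(1 - identical / pairs if pairs else None)
    return deviations


lists = [["a", "b"], ["a", "c"], ["a", "b"]]
print(deviation_per_rank(lists, max_rank=2))
```

Counting unordered pairs gives the same ratio as counting all ordered pairs with different indices, since every unordered pair then appears exactly twice in both numerator and denominator.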
As the results of the searches for parties and persons are very similar in each case, they have been aggregated. Tables 3 and 4 show by how many percent the search results change at each position of the result list. Since the position of individual URLs is changed by manually removing regional URLs (if URL 2 is removed, Pos 3 becomes Pos 2 etc.), this analysis has not been applied to Berlin regional.
Table 3 Average pairwise deviation (in %) at each rank searching for a party in the respective data set for all result lists, German results and results originated in Berlin
We see the effect that the deviation decreases the more the results are restricted to a local area. The highest values are seen for all result lists, the lowest for result lists originating in Berlin, and those for Germany lie somewhat in the middle. Also remarkable is the difference between parties and persons in the changes at the first position of the result list.
Even if both tables show a clear increase in the pairwise deviations, they differ significantly in both gradient and maximum. While the mean values in the search for parties (see Table 3) increase monotonically with rising ranks, this is not the case at the top of the lists for queries for persons (see positions 1 and 2 in Table 4): there, the percentage of different URLs on position 1 is slightly higher than on position 2. Also, the mean value here already settles at a value of 70%, whereas the percentage deviations of the parties rise far above 80%. The parties thus show a significantly greater diversity with regard to the individual positions and thus the sorting.

Longest common subsequence
Another often used measure of similarity for ordered lists is the longest common subsequence, LCS. Given two (ordered) lists, e.g. ranks, l 1 and l 2 , then a sequence s := s 1 , . . . , s p is a longest common subsequence if s is a subsequence of l 1 and l 2 and p is maximal.
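For illustration, the length p of a longest common subsequence of two result lists can be computed with the standard dynamic programming scheme. The sketch below is a generic textbook formulation, not necessarily the implementation used in the study; the function name is chosen for this example.

```python
def lcs_length(l1, l2):
    """Length of a longest common subsequence of two ordered lists."""
    # prev[j] holds the LCS length of the processed prefix of l1 and l2[:j].
    prev = [0] * (len(l2) + 1)
    for x in l1:
        cur = [0]
        for j, y in enumerate(l2, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


# The first two results are swapped, the rest is identical.
print(lcs_length(["a", "b", "c", "d"], ["b", "a", "c", "d"]))  # prints 3
```

Note that both lists in the example share all four URLs, so their commons value is 4, while the swap at the top reduces the LCS length to 3.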
Applications for this similarity measure are text analysis [1], investigations regarding the clustering of genome sequences [18] and recently the detection of trajectories especially for mobile devices [19] or the automatic cleaning of incomplete city names in data bases [23]. Other applications cover the treatment of high-frequency financial data [11].
Here we apply this similarity measure to the search results we got for every key and time stamp during the study period. So we get a measure which is in a certain sense sharper than the measure commons for the fact that now also the ordering of the results is taken into account. In Fig. 8 and Fig. 9 the results are plotted for persons and parties respectively.
We observe that the mean length of the LCS for persons is significantly larger than the mean length of the LCS for parties, a result similar to the one we have already seen for the similarity measure commons. Regardless of whether we consider parties or persons, we see that the mean length of the LCS grows the more we restrict the dataset to local data. For example, the party CSU, which is only active in Bavaria, shows the shortest mean length of the LCS over all parties, with a mean LCS length of less than two.

Kendall τ
When comparing ranked lists the measure Kendall τ can be used to measure the degree of equal ordering for the two lists. Unlike the Spearmann correlation coefficient, Kendall τ takes only into account the occurrences of differently orientated appearances of items in the in the two lists and not their difference.
Given two lists $x$ and $y$ of length $n$, Kendall τ is computed as follows. Let $P$ be the set of all tuples $(i, j)$ with $1 \le i, j \le n$. Then $c$, the number of concordant pairs, is defined as $c := |\{(i, j) \in P \mid x_i < x_j \text{ and } y_i < y_j\}|$. The number of discordant pairs is defined analogously as the number of tuples in $P$ with $x_i < x_j$ and $y_i > y_j$. Elements of one list that are not present in the other list are handled as so-called "ties".
This τ coefficient ranges from -1 (inverse ranks) to +1 (identical ranks), and a value of 0 indicates no correlation. The application of Kendall τ is a further refinement over the measure LCS. Kendall τ was used by [13] to analyze the personalization of web searches and earlier by [5] to compare the results of different search engines. The latter used a modification of Kendall τ which only varies from 0 to 1. This measure was also used by [31] and applied to results of search engines. A more detailed and more mathematical discussion of this measure can be found in [28].
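The counting of concordant and discordant pairs can be sketched in a few lines of Python. The sketch makes two simplifying assumptions that are not taken from the study: only URLs present in both lists are compared, so the treatment of ties is left out, and the coefficient is normalized as (concordant - discordant) / (concordant + discordant). The function name `kendall_tau` and the example lists are hypothetical.

```python
from itertools import combinations


def kendall_tau(list_a, list_b):
    """Kendall tau restricted to the URLs that both lists contain.

    Assumptions: each URL occurs at most once per list, and URLs that
    are missing from one list are ignored instead of counted as ties.
    Returns None if fewer than two URLs are shared.
    """
    rank_a = {url: rank for rank, url in enumerate(list_a)}
    rank_b = {url: rank for rank, url in enumerate(list_b)}
    common = [url for url in list_a if url in rank_b]
    concordant = discordant = 0
    for u, v in combinations(common, 2):
        # Positive product: same relative order in both lists.
        sign = (rank_a[u] - rank_a[v]) * (rank_b[u] - rank_b[v])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    if concordant + discordant == 0:
        return None
    return (concordant - discordant) / (concordant + discordant)


same_order = kendall_tau(["u1", "u2", "u3"], ["u1", "u2", "u3"])
reverse_order = kendall_tau(["u1", "u2", "u3"], ["u3", "u2", "u1"])
one_swap = kendall_tau(["u1", "u2", "u3", "u4"], ["u1", "u3", "u2", "u5"])
print(same_order, reverse_order, one_swap)
```

Identical orderings give +1 and reversed orderings give -1. Under these assumptions, two lists that share only two URLs in opposite order also yield -1, which shows how sensitive the coefficient is when the overlap is small.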
We apply the Kendall τ measure to the result lists we got from the Google searches. As seen before when applying the similarity measure commons, the space for personalization is rather small: the users get mostly the same search results.
Kendall τ, as a measure for the orientation of two different search result lists, shows that the result lists not only consist of mostly the same links (as seen with the similarity measure commons) but are also, to a very high degree, ordered in the same way. Most of the values for Kendall τ are close to 0.9. This is most clearly visible for the German and Berlin results, as shown for the persons in Fig. 10. Here we see a slightly different result for the search key Angela Merkel, who is the only person with a larger international impact compared to the other persons taking part in the German election campaign. The results for the parties under the similarity measure Kendall τ are shown in Fig. 11. As with the results for LCS, the Bavarian party CSU gets different (lower) values than the other parties.
Up to now we have seen that the mean values of Kendall τ for parties and persons are positive, which means that the result lists are predominantly ordered in the same way. In a next step we computed Kendall τ over all result lists and binned the values in order to reveal their distribution. Figure 12 shows this distribution as histogram data. For a point $n/10$ on the x-axis, the y-axis shows the percentage of all values of Kendall τ lying in the interval $((n-1)/10, n/10]$. The three results for Kendall τ are shown over all result lists in the data set, grouped by the result lists which are associated with users in Berlin, users in Germany and all users, respectively. It can be seen that there is only little difference between the regional results.
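The binning into half-open intervals can be illustrated with a short Python sketch. It assumes that the Kendall τ values are available as plain floats; the function name `tau_histogram` and the sample values are hypothetical. The result is keyed by the integer n, so that the corresponding point on the x-axis is n/10.

```python
import math


def tau_histogram(taus):
    """Percentage of tau values falling into ((n-1)/10, n/10] for each n."""
    if not taus:
        return {}
    counts = {}
    for tau in taus:
        # ceil maps a value in ((n-1)/10, n/10] to n; rounding first
        # guards against floating point noise such as 0.7 * 10 > 7.
        n = math.ceil(round(tau * 10, 9))
        counts[n] = counts.get(n, 0) + 1
    total = len(taus)
    return {n: 100.0 * count / total for n, count in sorted(counts.items())}


sample_histogram = tau_histogram([0.95, 1.0, 0.9, 0.7])
print(sample_histogram)
```

Because the intervals are closed on the right, a value of exactly 0.9 is counted in the bin ending at 0.9, while 0.95 and 1.0 both fall into the bin ending at 1.0.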

Discussion
When investigating the datasets of the project Datenspende, the main objective was to determine how large the amount of personalization within the Google search results could be. Result lists were analyzed for every time stamp and key, i.e. for every time stamp and key we collected the result lists of all users and applied different measures of similarity to them.
The first interesting measure is the (mean) number of common links the users were presented with at every time stamp. That way we could determine the (mean) number of possibly individual links users got. As could be seen in the previous sections, this number was comparably small: for parties we got a maximum of about four URLs, and for persons there are mostly fewer than two URLs which are potentially personalized (see Fig. 6 and Fig. 7).
There is thus an interesting difference between searching for persons and searching for parties. For parties, the space for personalization ranges between 0.5 and 4.5 (see Fig. 6). The largest space for personalization is seen for the party CSU, which is only active in Bavaria.
Furthermore, for the analysis with the similarity measure commons we divided the data into four different datasets of decreasing size: All data, German data, Berlin data, and Berlin data regional. We observe that the space for personalization decreases when we restrict the dataset to more local data. The largest space for personalization is seen if we take into account all data, which also covers search queries that did not originate in Germany.
When we investigate the same situation for searches regarding keys for persons (see Fig. 7), the space for personalization is even smaller and ranges from 0.5 to 2.
Another possible form of personalization is the position at which a certain search result is presented to the user. We addressed this problem by applying the measure deviation per rank, which gives the percentage of results that change at each rank (usually one to ten), to the lists of search results. These results are shown in Tables 3 and 4. They show that for the search for parties we observe a relatively small amount of changes in the first two positions of the corresponding result lists, whereas when searching for persons the deviation per rank for the first four positions is less than 50%.
The next similarity measure applied to (ranked) lists is the longest common subsequence (LCS). For persons the results are shown in Fig. 8. It can be observed that the length of the LCS for each person increases when we restrict ourselves to more and more regional datasets. A similar result holds for result lists for parties, but with significantly lower values for LCS. As observed before, the values for the Bavarian (regional) party CSU are smaller than for all the other parties.
The LCS captures the ordering of the search result lists only to a limited extent, since results outside the common subsequence do not contribute. The next measure we used is Kendall τ, a measure which also observes whether the lists of search results are ordered in the same way or not.
The similarity measure Kendall τ also takes into account the ordering of the URLs contained in the result lists. Therefore this measure is "sharper" than the measure commons. Our first observation is that, taking the mean of Kendall τ over all time stamps for parties and persons, all the values are comparably large (mostly above 0.8 for persons and 0.7 for parties) and all positive, which means that the ordering of the result lists is overall consistent. Besides this, the values lie in a very small range. Fig. 11 shows that the highly regionalized party CSU has the smallest values of all parties.
For persons this measure shows that the results have a very similar ordering for all persons. An interesting observation in Fig. 10 is that the values over all elements in the dataset, which also includes results obtained in countries other than Germany, show a significant deviation for the keyword Angela Merkel. The result for Angela Merkel over all data is the lowest in the field. We did indeed observe negative values for the keyword Angela Merkel when analyzing the whole dataset. However, this was only the case for (non-German) result lists that shared just two common elements, which appeared in reverse order.
Summarizing the results, it can be claimed that, on the basis of the data we got, the space for a personalization of search results by Google is relatively small. Moreover, we see the space for personalization decrease the more we restrict the data under investigation to a more regional basis. Thus we could show that one of the most important pillars of the filter bubble theory has been switched off by Google. Only regionalization seems to exist, and without a fine-grained personalization, disjoint information spheres hardly arise.
We can state that with such a "data donation", we have created a methodology for society as a whole to examine a black box like the Google search for important characteristics, without needing insight into the exact procedure of the algorithm.

A.1 Commons persons
Tables 5-8 show the common results for the result lists regarding persons for All, German, Berlin and Berlin without regional results.
Tables 9-12 show the common results for the result lists regarding parties for All, German, Berlin and Berlin without regional results.

A.5 Kendall τ
The only case where we observed negative values for Kendall τ is for the keyword "Angela Merkel". Here we see negative values when all data is under consideration. For the German and Berlin data the values of Kendall τ are positive.

Figure 13: Kendall τ for the keyword "Angela Merkel". Negative values occur only when all data is under consideration; for the German and Berlin data the values are positive.