What did you see? A study to measure personalization in Google’s search engine

Krafft, Tobias D.; Gamer, Michael; Zweig, Katharina A.

doi:10.1140/epjds/s13688-019-0217-5

Regular article
Open access
Published: 16 December 2019

EPJ Data Science volume 8, Article number: 38 (2019)

Abstract

In this paper we present the results of the project “#Datenspende” where during the German election in 2017 more than 4000 people contributed their search results regarding keywords connected to the German election campaign.

Analyzing the donated result lists, we show that the room for personalization of the search results is very small. Thus the opportunity for the effect described in Eli Pariser’s filter bubble theory to occur in this data is also very small, to the point of being negligible. We obtained these results by applying various similarity measures to the donated result lists. The first approach, which uses the number of common results as a similarity measure, showed that the space for personalization is on average less than two results out of ten when searching for persons and at most four when searching for parties. Applying other, more specific measures shows that the space is in fact even smaller, so that the presence of filter bubbles is not evident.

Moreover, this project is also a proof of concept, as it enables society to permanently monitor a search engine’s degree of personalization for any desired search terms. The general design can also be transferred to intermediaries, if appropriate APIs allow selective access to the content relevant to the study, in order to establish a similar degree of trustworthiness.

1 Main text

Political opinion formation as well as general access to information have changed significantly due to digitalization. Information sources such as newspapers, TV and radio, where large parts of the population read, heard, or saw the same news and interpretations of these news stories, are being replaced by more and more diverse and personalized media offerings as a result of digital transformation. This personalization is enabled by algorithmic decision making systems (ADM systems): here, an algorithm, not a human being, decides which contents users might be interested in, and only these are offered to them on various news platforms and social networks. The same is true for personalized search engines like Google, Yahoo or Bing: with more than 3200 billion search queries in 2016 [26], a human-made sorting is impossible. As a result, the possibilities and dangers of a so-called algorithmically generated filter bubble increase steadily.

1.1 The model of algorithmically generated and reinforced filter bubbles

The term filter bubble in the context of search engines is a partial concept of the filter bubble theory by Eli Pariser. In his 2011 book, “The Filter Bubble: What the Internet Is Hiding from You” [21], he explained that two of his friends had received significantly different results when searching for “BP” on the online search platform Google while the oil rig Deepwater Horizon was losing oil in the Gulf of Mexico in 2010. On this basis the internet activist pointed out the possible dangers of so-called filter bubbles. He developed a theory according to which personalized algorithms in social media tend to display content to individuals that corresponds to the previous views of the user, so that different information spheres can form, in which different contents or opinions prevail. In short, individual filtering of the information flow can lead to groups or individuals being informed about different facts, i.e. living in “a unique universe of information” [21, p. 9]. This is especially problematic if the respective content is politically extreme in nature and if a one-sided perspective results in impairment or total deterioration of citizens’ discursive capabilities. A filter bubble in this sense is a selection of news that corresponds to one’s own perspectives, which could potentially lead to solidification of one’s own position in the political sphere.

Filter bubbles can also be understood in a broader sense. Selecting websites by means of state censorship can also create filter bubbles by restricting information. While this censorship is supported by algorithms, it does not constitute a filter bubble through algorithm-based personalization, which is the kind hypothesized in this article. This kind of restriction of information and its possible consequences for filter bubble formation are not examined here. Likewise, a search engine operator or a social media platform could knowingly and intentionally limit the database in a certain direction, thus presenting all users with the same content while only offering a selective extract of reality. This option is also not examined here.

1.2 When are algorithmically generated and amplified filter bubbles dangerous?

Eli Pariser’s filter bubble theory, with its unsettling consequences for society, is based on these four basic mechanisms [21]:

1. Personalization: An individually customized selection of contents, which achieves a new level of granularity and previously unknown scalability.
2. Minor overlap of respective new/different results: A low or non-existent overlap of filter bubbles, i.e. news and information from one group remain unknown in another.
3. Contents: The nature of the contents, which essentially only becomes problematic with politically charged topics and drastically different perspectives.
4. Isolation from other sources of information: Groups of people whose news situation displays homogeneous, politically charged and one-sided perspectives rarely use other sources of information, or use only those that place them in extremely similar filter bubbles.

The more strongly these four mechanisms manifest themselves, the stronger the filter bubble effect grows, including its potentially harmful consequences for society. The degree of personalization is essential, as politically relevant filter bubbles do not emerge if the personalization of an algorithm responsible for selecting news is low. Even high personalization and verified filter bubbles do not necessarily take political effect if either their contents are not political in nature or users make use of other sources of information as well. For instance, information delivered to citizens speaking different languages is free of overlap by definition if the results are displayed in those languages. Nevertheless, in terms of content those citizens are not in any way embedded in filter bubbles.
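To make the notion of overlap concrete, the following minimal Python sketch counts the results two users have in common, the first similarity measure named in the abstract. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, results are compared by exact URL equality, and each result list is taken to be the first page of ten results.

```python
def common_results(list_a, list_b):
    """Number of URLs that appear in both result lists, ignoring rank."""
    return len(set(list_a) & set(list_b))


def room_for_personalization(list_a, list_b):
    """Results in the longer list that are not shared with the other list."""
    return max(len(list_a), len(list_b)) - common_results(list_a, list_b)


# Hypothetical result lists for two users and the same search term.
user_a = [f"https://example.org/{i}" for i in range(10)]
user_b = user_a[:8] + ["https://example.net/x", "https://example.net/y"]

print(common_results(user_a, user_b))            # 8
print(room_for_personalization(user_a, user_b))  # 2
```

In this toy example eight of ten results are shared, so at most two positions could differ because of personalization. Since the count ignores rank, it can only bound the room for personalization from above.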

1.3 Revise Eli Pariser’s filter bubble theory

As algorithms are capable of controlling the flow of information directed towards users, they are assigned a gatekeeper role similar to journalists in traditional journalism (see [17]). As a result, it is necessary to examine how powerful the algorithmically generated and hardened filter bubbles on various intermediaries and search engines actually are. The number of reliable studies is relatively low. An important German study by the Hans Bredow Institute offers a positive answer to the question of the informational mix: sources of information today are diverse and capable of pervading algorithmically generated and hardened filter bubbles with other news and information [22]. Also noteworthy are newer results that indicate the absence of filter bubbles in Google searches (compare [12]). It is pointed out that algorithms offer a possibility to burst open filter bubbles if such a functionality is explicitly implemented. To our knowledge, apart from anecdotal examinations, a quantitative evaluation of the degree of personalization for a larger user base had not been conducted up until 2017. For example, in the context of a Slate article Jacob Weisberg asked only five persons to search for topics and found the results to be very similar [32]. Vital questions about the degree of personalization and the overlap of individual news flows can only be resolved with a large user base. Executing such an investigation appears imperative, especially in light of the debate regarding the influence of filter bubbles in social networks, which was sparked in 2016 after Donald Trump’s presidential election victory. Unfortunately, due to insufficient APIs, this is currently not possible there. Given the major political event of the federal elections in Germany, we decided to realize the “#Datenspende: Google und die Bundestagswahl 2017”^{Footnote 1} (referenced as Datenspende) project in order to find out whether Google already personalizes search results, as has often been speculated. This project should therefore make the first of the four basic mechanisms for a filter bubble measurable.

The first section of this paper deals with the design of the Datenspende project and its general framework, the data preparation and the resulting datasets. The section Personalization then gives an overview of recent research in the field of personalization (and regionalization) of search engine results and describes the investigations carried out to determine possible personalization effects. In the section Discussion the results are brought together, and a conclusion summarizes them.

2 Study design

The study design and the basics of the collection are explained in the following section, including the structure of the data, important terms and the preparation of the data basis.

2.1 Software structure and enrollment

The plug-in to collect the data was made available for the Internet browsers Chrome and Firefox in order to achieve a market coverage in Germany of over 60% [27]. All necessary insights into the source code of the plug-in were published at the beginning of the project via GitHub.^{Footnote 2} The plug-in searched for 16 search terms at fixed search times (4:00, 8:00, 12:00, 16:00, 20:00 and 24:00), if the browser was open at that time. The search queries to Google and Google News ran automatically and the personal results of the donors were automatically sent to our server. For each user and timestamp, 16 search terms were queried twice and the first page of each search result was submitted.
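The collection schedule described above can be sketched as follows. This is a minimal Python illustration, not the plug-in's published source: the function names, the engine labels and the placeholder search terms are assumptions made for the example, and 24:00 is modelled as hour 0.

```python
from datetime import datetime

# Fixed crawl slots named in the text; 24:00 is represented as hour 0.
SLOT_HOURS = (0, 4, 8, 12, 16, 20)
# Hypothetical labels for the two services queried (Google and Google News).
ENGINES = ("google", "google_news")


def is_slot(now: datetime) -> bool:
    """True if `now` falls on one of the fixed slot times (minute resolution)."""
    return now.hour in SLOT_HOURS and now.minute == 0


def build_jobs(terms, now, browser_open):
    """Return (term, engine, slot) tuples due at `now`, or [] if nothing runs."""
    if not (browser_open and is_slot(now)):
        return []
    slot = now.strftime("%Y-%m-%d %H:%M")
    return [(term, engine, slot) for term in terms for engine in ENGINES]


# Placeholder terms; the real 16 search terms are listed in Table 1.
terms = [f"term_{i}" for i in range(1, 17)]
jobs = build_jobs(terms, datetime(2017, 9, 24, 8, 0), browser_open=True)
print(len(jobs))  # 32
```

With 16 terms and two services, one slot yields 32 result pages per donor. A slot is skipped entirely when the browser is closed, which is why the number of donations per time slot can vary.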

The search terms were limited to the seven major parties and their respective party leaders (see Table 1). As can be seen in Fig. 1, after downloading the plug-in, users were free to decide whether they wanted to be informed about future donations or whether the plug-in should run in the background as far as possible.

Table 1 All 16 search terms used for the Datenspende project
