 Research
 Open access
Evolving demographics: a dynamic clustering approach to analyze residential segregation in Berlin
EPJ Data Science volume 13, Article number: 21 (2024)
Abstract
This paper examines residential segregation in Berlin over time using a dynamic clustering approach. Previous research has examined residential segregation in Berlin at a high level of spatial and temporal aggregation and statically, i.e. not over time. We propose a methodology to investigate the existence of clusters of residential areas according to migration background, age group, gender, and socioeconomic dimension over time. To this end, we have developed a sequential mixed methods approach that includes a multivariate kernel density estimation technique to estimate the density of subpopulations and a dynamic cluster analysis to discover spatial patterns of residential segregation over time (2009–2020). The dynamic analysis shows the emergence of clusters on the dimensions of migration background, age group, gender and socioeconomic variables. We also identified a structural change in 2015, resulting in a new cluster in Berlin that reflects the changing distribution of subpopulations with a particular migratory background. Finally, we discuss the findings of this study in relation to previous research and suggest possibilities for policy applications and future research using a dynamic clustering approach for analyzing changes in residential segregation at the city level.
1 Introduction
This manuscript examines the phenomenon of residential segregation in Berlin from a dynamic perspective, using data science to identify patterns in its human geography.
Previous research on residential segregation in Berlin has analyzed its different dimensions. For example, research has focused on residential segregation driven by ethnicity [1–29]; residential segregation driven by age group [1, 6, 7, 12, 20, 21, 23, 24, 29]; residential segregation driven by gender [4, 8, 15]; segregation driven by socioeconomic factors [2, 3, 10, 12–15, 17, 18, 20, 23, 27, 29]; as well as residential segregation driven by digital segregation [24]. Demography, economics, sociology, geography, and ethnography, among other disciplines, have explored all these dimensions. They all support the notion that there is an uneven, clustered, or patchy spatial distribution of subpopulations in residential areas of Berlin.
We focus on the case of Berlin, a city in which historical events have changed both the city and its society. First, we are motivated by the Berlin case because a large body of research examines changes in residential segregation before and during the fall of the Berlin Wall. In fact, before the fall of the Wall, studies were based exclusively on data from West Berlin, as no unified public statistics were available. Then, after the fall of the Wall, data from both parts of the city became available for the first time, and research focused on understanding the structural changes that occurred as a result of the city’s reunification. Finally, Berlin has served as a reference point for other comparative studies of residential segregation in other German cities in the context of immigration policy. Further details on the historical and research development of the phenomenon of residential segregation in Berlin can be found in [3, 29, 30].
Second, from a conceptual point of view, residential segregation occurs over time and space. However, to the best of our knowledge, there is no research in the case of Berlin that has carried out an analysis that includes these two components. For example, in Helbig’s research on residential segregation [23], the author conducts a time series analysis but does not include the geographical dimension of residential areas. Another example is the work of Marcińczak and Bernt [28], in which they use hierarchical clustering on temporal data, which does not allow the identification of the emergence of new clusters or the disappearance of clusters over time. New methodological approaches are therefore needed to study the dynamics of residential segregation.
Third, we are motivated to explore the possible impact of the 2015 European migrant crisis on residential segregation in Berlin. Because Berlin was the city that received the highest number of refugees in Germany during the crisis, we are interested in examining whether the migration process led to the creation or elimination of clusters that shape the demographic composition of residential areas. Such changes may include, for example, changes in the number of clusters over time (i.e. macrodynamics) and internal changes in subpopulations over time and space (i.e. microdynamics). Past studies looking at this period have not reported structural or internal changes within clusters; one possible reason is their use of aggregate data or static clustering methods.
To reveal both structural and internal changes within clusters, we propose an exploratory analysis based on data science using a dynamic clustering algorithm. The objectives of this paper are:

Estimate population density according to dimensions such as migration background, age group, gender, and socioeconomics.

Dynamically determine the number of clusters over time according to dimensions such as migration background, age group, gender, and those reflecting the socioeconomic conditions.

Identify structural as well as intracluster changes over time.

Determine the variables that are over- or underrepresented for each cluster in a given year.
As a result, our approach has led to discoveries about residential segregation. First, at a macro or structural level, a new cluster emerged in 2015, which we interpret as a result of the so-called European migrant crisis. Second, at a micro or intracluster level, we report which subpopulations are under- or overrepresented in each cluster over time, revealing a rich dynamic of change in the city. By applying data science principles, it is possible to explore the phenomenon of residential segregation in an unsupervised and dynamic way. The contribution of this research is to present, for the first time, a dynamic analysis of the existing clusters in Berlin: the approach allows us to dynamically determine the number of clusters and the attributes that are most important over time. In this sense, an approach based on data science offers a wide field of application for the identification of changes in the city.
The remainder of the article is organized as follows. Section 2 provides an overview of previous research. Section 3 presents the methodological approach used to estimate population densities in residential areas and perform dynamic clustering based on calculating different subpopulations in Berlin. Section 4 presents the results, the clusters found, the criteria used to validate them, and the interpretation of the results. Finally, Sect. 5 presents a discussion and conclusions based on the research objectives.
2 Literature review
Data science is used to address complex problems related to sociological, economic and demographic factors. In particular, it is used to study residential segregation using unsupervised approaches. Multivariate and unsupervised methods are often preferred because there is no single view or way of quantifying residential segregation, and there is no baseline or ground truth for conducting supervised analyses.
A range of methodological approaches from the field of data science have been employed to study residential segregation. Spatial concentration patterns have long been studied using a factorial approach [31, 32], which is now better known as factorial ecology [33]. Modern urban data science approaches also use this method. For instance, Benassi et al. [34] developed a composite index using multiple principal component analyses, reviving this approach. Recently, unsupervised machine learning methods have been employed to recognize patterns of residential segregation. For example, Olteanu-Raimond et al. [35] used traditional self-organizing maps, a type of neural network, to identify emerging patterns. Other researchers (see for example [36]) have used data science to improve the visualisation of changes in segregation and diversity in 61 major US cities between 1990 and 2020. Finally, Masías et al. [29] used unsupervised algorithms commonly used in image processing and remote sensing to generate visualizations and human-understandable information, based on concepts of cognitive psychology.
Studying residential segregation from a data science perspective, taking into account its spatial and temporal dimensions, is a multidimensional problem. In the case of Berlin, for example, several dimensions of residential segregation have been studied. These include the ethnic dimension, which has been studied in Germany under the concept of migration background [1–29]; age or age-group segregation [1, 6, 7, 12, 20, 21, 23, 24, 29], which corresponds to the fact that different age groups are clustered in different parts of the city; gender segregation [4, 8, 15], which is a phenomenon associated with unbalanced gender ratios across space; social or socioeconomic segregation [2, 3, 10, 12–15, 17, 18, 20, 23, 27, 29], where people are grouped with others with similar socioeconomic characteristics, reflecting their economic opportunities; and the digital segregation dimension [24], which attempts to map access to social media and other digital technologies. The different emphases of some previous publications are summarised below (see Table 1).
As seen in Table 1, most previous studies focus on the ethnic aspect of residential segregation. The study of segregation by age group is the second most common. The third most studied aspect is social segregation. Finally, the least researched aspects are gender segregation and digital segregation. However, as can be seen, several previous studies have examined more than one dimension at a time. Among the works cited, we would like to highlight the following:

Kemper [6, 7], Arin [3], and Yamamoto [5, 37] contribute to a conceptual, empirical and historical understanding of the emergence of ethnic residential segregation in Berlin from a geographical and economic perspective.

Nakagawa [4, 8] and Kröhnert [15] understand gender segregation as a consequence of ongoing migration processes within Germany from a sociodemographic perspective.

Although Helbig’s [23] contribution does not consider the demographic spatial dimension of residential areas (i.e. estimates of population density across residential areas), his work emphasizes the temporal dimension of residential segregation.

Marcińczak [28] conducted a cluster analysis of residential segregation in Berlin, using hierarchical cluster analysis to examine several years of demographic data. However, no dynamic analysis of cluster formation in Berlin was carried out, because the clustering method used is static. Furthermore, this work is guided by a predefined interpretation of the clusters.

Kurtenbach’s key study [24] explored digital segregation in Berlin using data from a social media service designed to organise community life in neighbourhoods.

Finally, innovative techniques for estimating population densities in residential areas in Berlin have been developed. For example, Groß [21] has developed methods for estimating anonymized spatial densities at a higher resolution. Building on this work, Masías et al. [29] have used non-negative matrix factorization to study different facets of residential segregation.
Previous studies on the city of Berlin have adopted a non-dynamic approach. The lack of dynamic clustering methods has led other researchers to use, for example, hierarchical cluster analysis, which does not allow the temporal aspect to be taken into account.
In this context, we aim to perform a data analysis that has the advantage of not being a black box. It takes the existing dynamics into account and allows a more direct interpretation of the clusters, which is not the case with factor analysis methods or black-box machine learning methods.
3 Methods
3.1 Methodological approach
The proposed methodological approach is threefold: first, the spatial density of diverse subpopulations over Berlin, excluding non-residential areas, is estimated using Multivariate Kernel Density Estimation; second, the spatial densities estimated in the first step are analyzed via a Dynamic Fuzzy C-Means clustering method; finally, human-readable information about the composition of the clusters and their interpretation is generated. The flow chart in Fig. 1 summarizes the methodological approach followed throughout this work.
3.1.1 Data source
The register of residents (Einwohnerregister) from 2009 to 2020, available at the Statistical Office of Brandenburg (see www.statistik-berlin-brandenburg.de), was used as the input for this step. We only used the information regarding the migration background, age group, gender, and socioeconomic demographics for each LOR spatial planning area (i.e., Die lebensweltlich orientierten Räume).
As an indirect measure of the ethnic dimension of residential segregation, the category of migration background has often been used in German sociology. It was first defined in 2005, when it was used in the microcensuses. The official definition used in 2005 is as follows: an individual with a migrant background is defined as “all migrants who entered the current territory of the Federal Republic of Germany after 1949, and all foreigners born in Germany and all those born in Germany as Germans with at least one parent who immigrated to Germany or who was born as a foreigner in Germany” [38, p.6]. In this context, the migrant background is instead referred to as a statistical category based on citizenship and an indirect record of the place of birth of the individual’s parents.
Information on the demographic distribution by age group and gender in each LOR area for each year was also used. To estimate the density of males and females in the city of Berlin, we used data on the sex of individuals and information on the number of people in a given age group living in a given LOR. Sex ratios vary from country to country, and international or internal migration processes may produce age and sex structures that differ from those of the destination population. Because this phenomenon has previously been reported in Berlin (see [15]), we explore the possibility of residential segregation by gender.
Finally, people experiencing economic hardship in Germany are entitled to receive social benefits as defined in the Second and Third Book of the Social Code (SGB II and SGB III). In principle, any EU or non-EU citizen with a valid residence permit is entitled to SGB II and SGB III benefits after working in Germany for at least one year.
The SGB II (Sozialgesetzbuch Zweites Buch) and the SGB III (Sozialgesetzbuch Drittes Buch) are the two most fundamental laws of the German social security system. SGB II, or “Hartz IV”, deals with social benefits for unemployed or low-income persons. SGB II also regulates the payment of unemployment benefits and social assistance. SGB III deals with employment promotion, vocational training, and education. It is aimed at helping people to find and keep a job and to improve their vocational skills. SGB III provides various measures to support job seekers, such as job counselling, placement services, and vocational training programs. Funding provisions and support for companies to create jobs and train their employees are also included.
In summary, while SGB II focuses on providing financial assistance to those in need, SGB III aims to promote employment and vocational training. For this paper, the number of people who obtained benefits under SGB II and SGB III in a given year and city location is considered a proxy measure of social or socioeconomic segregation.
3.1.2 Multivariate kernel density estimation in the presence of measurement error
The spatial density of inhabitants is estimated following the work of [39], where a method is proposed to estimate the population density over areas with arbitrary shapes. That method is, in turn, based on a previous publication of the same author in which demographic estimates from rectangular spatial grids of different sizes are computed while introducing measurement error and data anonymization [21]. Other notable areas of application of the present method are the estimation of ethnic minority settlement areas [21], regional childcare demand estimates [40], regional election analyses [41], and estimates of the incidence of Coronavirus infections over time and space [42].
In the present study, Berlin is divided into spatial units (Planungsräume) whose centroids provide the spatial coordinates (measured in degrees). The technique of [39] is then applied, using the LOR (lebensweltlich orientierte Räume) areas, to the aggregated number of inhabitants with distinct migratory origins, age, gender, and socioeconomic conditions living in each of those spatial units. To obtain corrected density estimates, non-residential areas were excluded from the analysis (see [43]).
The model used in this work to estimate the corrected spatial density from heaped data (i.e., the arbitrary aggregation of data in a spatial area) in polygons of an arbitrary shape is based on a non-parametric estimation method: the Multivariate Kernel Density Estimation technique. This approach estimates, from a finite sample, the joint probability density function of two or more continuous random variables. In simpler words, it is used to estimate the distribution or spread of the data across more than one dimension when only a finite number of data points are available.
Let \(X=\{X_{1},X_{2}, \dots ,X_{n}\}\) be a sample from a multivariate random variable whose probability distribution is described by the unknown density function \(f(x)\) to be estimated. Each random variable is two-dimensional in our case, i.e., \(X_{i}=(X_{i1},X_{i2})\), \(i=1, \dots , n\), where \(X_{i1}\) and \(X_{i2}\) are the longitude and latitude coordinates, respectively, and X is the set containing all the available spatial coordinates. Then, the multivariate kernel density estimate at the two-dimensional point x is defined to be:
\[ \hat{f}_{H}(x)=\frac{1}{n|H|^{\frac{1}{2}}}\sum_{i=1}^{n}K(x-X_{i}) \tag{1} \]
where:

\(|\cdot |\) denotes the determinant.

\(K(\cdot )\) is the kernel, a symmetric multivariate density function. This function assigns weights to the observed data points based on their distance from the point where we want to estimate the density. We use the standard multivariate normal kernel, i.e., \(K(x)=(2\pi )^{-\frac{d}{2}}e^{-\frac{1}{2}x^{T}H^{-1}x}\).

H is the \(d\times d\) bandwidth matrix ^{Footnote 1}, which is symmetric and positive definite. It controls the window size in each dimension over which the kernel function operates. A small bandwidth will result in a density estimate that is very sensitive to the data (potentially too sensitive, resulting in overfitting). In contrast, a large bandwidth may smooth out important features of the data (underfitting). Therefore, the choice of H is critically important for the accuracy of the kernel density estimates. The selection of the bandwidth matrix is discussed extensively in the literature. Here, we use the approach of Wand and Jones, as is done in [21].
In short, a kernel function is centered at each data point; it returns high values for points close to that data point and low values for points far away. The final density estimate at point x is the average of the contributions from all these kernel functions centered at each data point \(X_{i}\). In this way, the density is high where many data points are close together and low where the data points are spread out.
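The averaging of kernels described above can be illustrated with a minimal NumPy sketch. It assumes a Gaussian kernel with a full bandwidth matrix H; the function name `kde_gaussian`, the toy coordinates, and the diagonal bandwidth values are hypothetical and do not reproduce the bandwidth selection of [21].

```python
import numpy as np

def kde_gaussian(x, sample, H):
    """Gaussian kernel density estimate with bandwidth matrix H.

    x      : (m, d) array of evaluation points
    sample : (n, d) array of data points X_1, ..., X_n
    H      : (d, d) symmetric positive-definite bandwidth matrix
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    sample = np.atleast_2d(np.asarray(sample, dtype=float))
    d = sample.shape[1]
    H_inv = np.linalg.inv(H)
    # Normalizing constant of a d-dimensional normal density with covariance H.
    norm = (2.0 * np.pi) ** (-d / 2.0) / np.sqrt(np.linalg.det(H))
    diff = x[:, None, :] - sample[None, :, :]            # shape (m, n, d)
    quad = np.einsum("mnd,de,mne->mn", diff, H_inv, diff)
    # Average the n kernel contributions at every evaluation point.
    return norm * np.exp(-0.5 * quad).mean(axis=1)

# Toy usage: three illustrative longitude/latitude pairs (not real data).
points = np.array([[13.40, 52.52], [13.41, 52.50], [13.35, 52.49]])
H_toy = np.diag([1e-4, 1e-4])                            # assumed bandwidth
density_at_center = kde_gaussian(points.mean(axis=0), points, H_toy)
```

A full bandwidth matrix, rather than a single scalar, lets the smoothing differ between longitude and latitude and follow correlated directions.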
Since we have data spatially aggregated for each area of the city, rather than the exact coordinates, we use the approach of Groß et al. [21], which introduces measurement error to produce estimates of population density while anonymising the sensitive data. Formally, the actual values \(X=\{X_{1},X_{2}, \dots ,X_{n}\}\) are unknown, and only the aggregated values over each area can be utilized, which are denoted by \(W=\{W_{1},W_{2}, \dots ,W_{n}\}\). Each \(W_{i}\) can be seen as a measurement, with an introduced error, of the actual coordinates of individual i, where \(i=1, \dots , n\). The objective is to estimate the density \(f(x)\), from which X is drawn, using only the values \(W_{i}\).
A naive kernel density estimator, which would use the aggregated values as the real coordinates in Equation (1), may lead to a spiky density far from the actual density of the true data. This effect becomes more noticeable as the sample size increases. Therefore, a model that accounts for the measurement error must be used. Under the assumption that the anonymization process is known, a measurement error model for W can be defined as \(\pi (W\mid X)=\prod_{i=1}^{n}\pi (W_{i}\mid X_{i})\), where \(\pi (W\mid X)\) refers to the conditional distribution of W given X, and
\[ \pi (W_{i}\mid X_{i})= \begin{cases} 1 & \text{if } X_{i}\in \operatorname{area}(W_{i}), \\ 0 & \text{otherwise,} \end{cases} \tag{2} \]
with \(\operatorname{area}(W_{i})\) being the set of coordinates that lie within the area to which \(W_{i}\) belongs. Using the Bayes theorem formulation \(\pi (X_{i}\mid W_{i})\propto \pi (W_{i}\mid X_{i})\pi (X_{i})\) (i.e. the probability of \(X_{i}\) given \(W_{i}\) is proportional to the probability of \(W_{i}\) given \(X_{i}\) times the probability of \(X_{i}\)), pseudo-samples of \(X_{i}\) can be drawn from \(\pi (X_{i}\mid W_{i})\), which are used to estimate the density function \(f(x)\). In particular, following an iterative procedure, \(X_{i}\) is drawn from the known conditional distribution \(\pi (W_{i}\mid X_{i})\) using \(\pi (X_{i})\) as a weight. Since \(f(X_{i})\) is unknown, and thus \(\pi (X_{i})\) as well, the multivariate kernel density estimator \(\hat{f}_{H}(x)\) defined in Equation (1) is used instead. At the beginning of the procedure, an estimate \(\hat{f}^{(0)}_{H}(x)\) is calculated according to Equation (1) from the artificial geocoordinates \(W_{i}\). After drawing the pseudo-samples, the multivariate kernel density estimator is applied to these samples to estimate the density function \(\hat{f}^{(1)}_{H}(x)\). In the following iterations, the density estimate \(\hat{f}^{(N+1)}_{H}(x)\) is recalculated using the pseudo-samples drawn in the previous iteration N. In this way, the pseudo-samples provide a way to fill in the information lost due to data aggregation, and the density estimate is refined in each iteration. For more details on the steps of the algorithm, see [21].
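To make the iterative idea more concrete, the following is a simplified sketch rather than the algorithm of [21]. It assumes that each area is an axis-aligned rectangle (the study uses polygons of arbitrary shape), that the bandwidth matrix H is fixed in advance, and that pseudo-samples are obtained by weighting uniformly drawn candidates with the current density estimate. The names `refine_pseudo_samples`, `n_cand` and the helper `_kde` are hypothetical.

```python
import numpy as np

def _kde(x, sample, H):
    """Gaussian kernel density estimate at points x (helper, see Equation (1))."""
    d = sample.shape[1]
    H_inv = np.linalg.inv(H)
    norm = (2.0 * np.pi) ** (-d / 2.0) / np.sqrt(np.linalg.det(H))
    diff = x[:, None, :] - sample[None, :, :]
    quad = np.einsum("mnd,de,mne->mn", diff, H_inv, diff)
    return norm * np.exp(-0.5 * quad).mean(axis=1)

def refine_pseudo_samples(boxes, counts, H, n_iter=5, n_cand=200, seed=0):
    """Iteratively redraw individual coordinates inside their known areas.

    boxes  : sequence of (xmin, ymin, xmax, ymax), one rectangle per area (assumption)
    counts : number of inhabitants recorded for each area
    H      : fixed (2, 2) bandwidth matrix (assumption)
    """
    rng = np.random.default_rng(seed)
    boxes = np.asarray(boxes, dtype=float)
    lo, hi = boxes[:, :2], boxes[:, 2:]
    # Iteration 0: every individual sits on the centroid W_i of its area.
    sample = np.repeat((lo + hi) / 2.0, counts, axis=0)
    for _ in range(n_iter):
        parts = []
        for a, k in enumerate(counts):
            # Candidates inside the area satisfy pi(W_i | X_i) = 1.
            cand = rng.uniform(lo[a], hi[a], size=(n_cand, 2))
            # The current density estimate plays the role of the weight pi(X_i).
            w = _kde(cand, sample, H) + 1e-300
            idx = rng.choice(n_cand, size=k, replace=True, p=w / w.sum())
            parts.append(cand[idx])
        sample = np.vstack(parts)        # pseudo-samples for the next iteration
    return sample
```

Because candidates are only ever drawn inside the area an individual was recorded in, the refined coordinates never leave that area; only their placement within it follows the estimated density.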
3.1.3 Dynamic fuzzy c-means
This dynamic clustering algorithm, presented by Crespo and Weber [44] in 2005, relies on updating the structure of the current clusters based on relevant changes in the dynamic data. The period between the creation of a cluster structure and its update is called a cycle, and its definition makes it possible to adapt the methodology to any probabilistic clustering algorithm, i.e., any clustering algorithm that determines degrees of membership. The degree of membership of an item to the clusters is used to identify changes in the structure of the clusters.
Changes in the structure of the clusters can be the creation of new clusters, the elimination of clusters, or the movement of cluster centers. The following are the basic steps of the Dynamic Fuzzy C-Means:

1. Run the fuzzy cmeans algorithm using the initial data set.

2. Receive new data and merge it with the current data.

3. Look for relevant changes in the structure of clusters.

4. If relevant changes exist, update the structure of clusters.

5. Repeat until no new data arrive.
In what follows, a detailed description of the mathematical aspects of the algorithm used here is provided. Let \(X_{0}\) be the initial data set and \(X_{1}, X_{2}, \dots , X_{t}\) be the new data sets the algorithm receives in each cycle \(t > 0\). In the beginning, the traditional fuzzy c-means algorithm is run on the first data set, \(X_{0}\), with \(c \geq 2\) clusters and fuzzifier \(m > 1\), so that it produces c clusters with their respective centers \(\mathbf{v}_{j}\), for \(j = 1,2, \dots ,c\), and the membership matrix \(\mathbf{M}_{n\times c}^{0}\), where n is the number of data points in \(X_{0}\). The components of this membership matrix are the membership degrees, i.e., its component at position \((i,j)\), \(i=1, \dots ,n\), \(j = 1,2, \dots ,c\), is the membership degree \(\mu _{i,j}\) of the data point \(x_{i}\in X_{0}\) to cluster j.
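As an illustration of this initial step, the sketch below implements the standard fuzzy c-means updates with Euclidean distance and a fuzzifier m greater than 1. It is a generic textbook version, not the implementation used in this study; the function names, the random initialization and the stopping tolerance are assumptions.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """Membership degrees of every point in X to every cluster center."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                  # avoid division by zero
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fcm_centers(X, U, m=2.0):
    """Cluster centers as membership-weighted means of the data."""
    W = U ** m
    return (W.T @ X) / W.sum(axis=0)[:, None]

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Alternate membership and center updates until the centers stabilize."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: c distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(n_iter):
        U = fcm_memberships(X, centers, m)
        new_centers = fcm_centers(X, U, m)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    # Rows of the returned matrix sum to one (one row per data point).
    return centers, fcm_memberships(X, centers, m)
```

The same `fcm_memberships` function can be applied to a new data chunk with the current centers held fixed, which yields the membership degrees of the new objects needed in the next step.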
Let \(X_{t}\) be the new data chunk arriving into the dataset at cycle \(t>0\), which could produce changes in the current structure of the clusters because it contains data points that are not well classified by the current clusters. Let \(c_{t}\) be the number of clusters at cycle t, \(n_{t}\) the number of objects in the dataset \(X_{t}\) and \(i=1, \dots ,n_{t}\) the index of the new objects. To identify the data points producing changes, the following must be calculated:

pairwise distance \(d(\mathbf{v}_{j},\mathbf{v}_{k})\) between each pair of the current centers \(\mathbf{v}_{j}\) and \(\mathbf{v}_{k}\), for all \(j,k = 1,2, \dots ,c_{t}\).

distance \(d(\mathbf{x}_{i},\mathbf{v}_{j})\) between the new data point \(\mathbf{x}_{i}\in X_{t}\) and the current centers \(\mathbf{v}_{j}\), for all \(i = 1,\dots ,n_{t}\) and \(j = 1,2,\dots ,c_{t}\).

the membership degree \(\hat{\mu}_{i,j}\) of the new object \(\mathbf{x}_{i}\in X_{t}\) to the cluster j, for all \(i = 1,\dots ,n_{t}\) and \(j = 1,2,\dots ,c_{t}\).
Then, conditions shown in Equation (3) and Equation (4) must be evaluated on the new data to detect objects of \(X_{t}\) that are incorrectly assigned to the current clusters, i.e., those objects that would involve a change in the current structure.
where \(\alpha > 0\) is a threshold parameter fixed beforehand by the decision maker or dynamically determined by the algorithm.
The two above conditions are used to define the indicator function (see Equation (5)), that is equal to one if, and only if, the data point \(\mathbf{x}_{i} \in X_{t}\) is correctly classified by the current structure:
If at least one new data object is not well classified, the criterion defined in Equation (6) is applied to decide whether new clusters should be created or if, conversely, moving the current centers is sufficient:
where \(\beta \in [0,1]\) is another threshold parameter that can be fixed in advance or adjusted dynamically, and \(|\cdot |\) represents the number of elements of a set. Whenever the condition defined in Equation (6) is fulfilled, new clusters must be created; otherwise, it is enough to update the centers of the current clusters.
If many new objects cannot be correctly assigned to the current clusters, i.e., new clusters are to be created, the optimum number of new clusters has to be determined. To do so, we select the number of clusters that maximizes the structure strength [45], as is done in the original paper [44]. Nevertheless, any other procedure could be used to find the new number of clusters. Once the optimum number is determined, the fuzzy c-means algorithm is run from scratch using that number on the total dataset.
In other cases, when it is sufficient to move the current centers of the clusters, the current centers are combined with those representing the new data. The cluster centers representing only the new data are calculated using Equation (7) and combined with the previous centers as defined in Equation (8):
where \(\lambda _{j}\) is determined by Equation (9):
Note that a data point \(\mathbf{x}_{i}\) is assigned to a cluster \(C_{j}\) if and only if \(j = \operatorname{arg\,max} _{k = 1,2,\dots ,c_{t}}\{\mu _{i,k}\}\), where \(C_{j}\) is the set of data points that belong to cluster j, \(\forall j = 1,2,\dots ,c_{t}\).
As a last step of the algorithm, a cluster is deleted if it has gone a predefined number of cycles, T, without receiving new objects. For this purpose, each cluster has a counter that records the number of cycles it has been active without any update. When the counter reaches the value T, the cluster is deleted by removing its center and all the data belonging to it from the data set.
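The deletion rule can be sketched in a few lines of plain Python. The function name `prune_idle_clusters`, the argument layout and the use of hard labels (the arg max assignment described above) are illustrative assumptions and are not taken from [44].

```python
def prune_idle_clusters(centers, labels, data, idle, received_new, T):
    """Delete clusters that have gone T cycles without receiving new objects.

    centers      : list of cluster centers
    labels       : cluster index of each data point (arg max of its memberships)
    data         : list of data points, aligned with labels
    idle         : idle-cycle counter of each cluster before this cycle
    received_new : set of cluster indices that received new objects this cycle
    T            : number of idle cycles after which a cluster is deleted
    """
    # Reset the counter of updated clusters, increment all others.
    idle = [0 if j in received_new else idle[j] + 1 for j in range(len(centers))]
    keep = [j for j in range(len(centers)) if idle[j] < T]
    remap = {old: new for new, old in enumerate(keep)}
    # Data belonging to a deleted cluster are removed together with its center.
    rows = [i for i, lab in enumerate(labels) if lab in remap]
    return (
        [centers[j] for j in keep],
        [remap[labels[i]] for i in rows],
        [data[i] for i in rows],
        [idle[j] for j in keep],
    )
```

Choosing T larger than the number of cycles in the study period means that no cluster is ever removed by this rule.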
3.1.4 Cluster interpretation
To characterize a cluster by a numerical variable X, value tests (v-tests) are computed for each such variable using the following statistic:
\[ t=\frac{\overline{X_{C}}-\overline{X}}{\sqrt{s^{2}/n_{C}}} \tag{10} \]
where \(\overline{X}\) is the mean of the variable X in the entire dataset, \(\overline{X_{C}}\) is the mean of X within the cluster C, \(n_{C}\) is the number of objects in C, and \(s^{2}\) is the global variance of X. The statistic follows a Student’s t-distribution with \(n_{C}-1\) degrees of freedom, denoted by \(t_{n_{C}-1}\).
The v-test indicates which variables characterize the clusters. If the absolute value of the statistic in Equation (10) for a variable X in a cluster C is larger than 1.96, the variable is interpreted as characterizing the cluster. Additionally, the larger the absolute value of the statistic, the better that variable characterizes the cluster, and the sign of the test indicates whether the variable is underrepresented (i.e., a negative sign) or overrepresented (i.e., a positive sign) in the given cluster, in comparison with all the data available for a given year. This statistic is very intuitive, as specific subpopulations may be over- or underrepresented when all are compared at the city level.
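A minimal sketch of the computation follows. It assumes that the statistic is the difference between the cluster mean and the global mean, divided by the square root of the global variance over the cluster size, as the definitions above suggest. The function name `v_test`, the use of the population variance, and the toy values are hypothetical.

```python
import math

def v_test(values, in_cluster):
    """Value test of one variable for one cluster.

    values     : the variable X for all spatial units in a given year
    in_cluster : booleans marking the units that belong to cluster C
    """
    n = len(values)
    mean_all = sum(values) / n
    # Global variance of X (population form, an assumption of this sketch).
    var_all = sum((v - mean_all) ** 2 for v in values) / n
    cluster_vals = [v for v, flag in zip(values, in_cluster) if flag]
    n_c = len(cluster_vals)
    mean_c = sum(cluster_vals) / n_c
    return (mean_c - mean_all) / math.sqrt(var_all / n_c)

# Toy usage: densities of one subpopulation in six areas (made-up numbers).
densities = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
stat = v_test(densities, [False, False, False, True, True, True])
overrepresented = stat > 1.96
underrepresented = stat < -1.96
```

A positive statistic means the cluster mean lies above the city-wide mean; the 1.96 threshold is then applied to its absolute value.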
4 Results
4.1 Results of the multivariate kernel density analysis
The results of the multivariate application of the kernel density estimation method are presented in Table 2, which shows in aggregate form, over the years measured, the mean values of the variables studied (i.e., migration background, age group, gender, and socioeconomic factors), the standard deviation, the minimum, the maximum, and selected percentiles.
As seen in Table 2, the residential densities of the German subpopulations have the highest mean values, while those of the Chinese subpopulations have the lowest mean values. It can also be observed that the population densities of individuals with a migrant background from Turkey have a higher standard deviation over the years, meaning there have been changes in the residential densities over time. Subpopulations with a migration background from Vietnam reach the maximum residential density, which can be interpreted as these communities concentrating in specific residential areas, while those from Ukraine reach the minimum residential density. These statistics also show that, over time, Berlin has an average population of young adults aged between 30 and 35, and the subpopulation of elderly people aged over ninety has lower average values.
In addition, the highest values are found in the population aged 50–55 years, which is the population living in common areas of the city. Finally, there are only marginal differences in the distribution between men and women in all the descriptive statistics. However, it should be noted that in some areas of the city, the female population peaks almost twice as high as the male population. Similarly, SGB II and SGB III show similar population densities reflecting socioeconomic problems, although SGB II shows slightly higher values in some descriptive statistics.
4.2 Results based on the dynamic fuzzy c-means
4.2.1 Cluster validation
The Bezdek partition coefficient, an indicator defined by James Bezdek, was used to validate and quantify the quality of clustering solutions on our time-varying data sets. The Bezdek partition coefficient of a fuzzy c-partition of n data points is defined as [46]:
\[ PC=\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c}u_{ik}^{2} \tag{11} \]
where \(u_{i k}\) is the membership of object i to cluster k, and c is the number of clusters. This index takes the value of 1 when the clusters are perfectly differentiated and each object belongs to a single cluster only, and the value \(1 /c\) when each object belongs equally to every cluster, so that the clusters cannot be distinguished. Therefore, the extreme values of the Bezdek partition coefficient allow the quality of the clustering solution to be evaluated. The partition coefficient also depends on the number of clusters: the more clusters there are, the lower the value of the Bezdek index tends to be, which means that the clustering is fuzzy, since its value is close to \(1/c\). Figure 2 plots the evolution of the Bezdek partition coefficient over the years 2009 to 2020 for the dimensions considered.
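The two extreme cases can be checked with a short sketch that takes the partition coefficient to be the mean, over data points, of the sum of squared memberships. The function name `partition_coefficient` and the toy membership matrices are hypothetical.

```python
def partition_coefficient(U):
    """Bezdek partition coefficient of a membership matrix.

    U : list of n rows, each holding the c membership degrees of one object
        (every row is assumed to sum to one).
    """
    n = len(U)
    return sum(u * u for row in U for u in row) / n

# Crisp partition: every object belongs to exactly one cluster.
crisp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Completely fuzzy partition: every object belongs equally to all c = 4 clusters.
fuzzy = [[0.25, 0.25, 0.25, 0.25] for _ in range(5)]

pc_crisp = partition_coefficient(crisp)   # upper bound of the index
pc_fuzzy = partition_coefficient(fuzzy)   # lower bound 1/c for c = 4
```

Because the lower bound 1/c shrinks as c grows, values obtained with different numbers of clusters should be compared with some care.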
The dynamic clustering algorithm tries to make decisions that do not worsen the Bezdek partition coefficient too much so that the partition continues to have a good level of quality. The initial cluster number of the algorithm was chosen considering the best Bezdek partition index obtained. The period of cycle T has been set to 20 in order to keep the data up to date. At the same time, the different variables are updated since, as explained above, the algorithm deletes the classes that are not updated. This parameter avoids deleting data in the classes that are not updated, and this allows keeping the data for the analysis of new incoming data blocks.
The Bezdek partition coefficient indicates that the dynamic clustering solutions for the age group dimensions and the socioeconomic variables improve over time. However, the coefficient behaves differently in the case of dynamic clustering based on migration background and clustering based on gender variables. In the case of dynamic clustering based on variables describing migration origin, it can be observed that the coefficient decreases until 2015 when the dynamic clustering algorithm detects the emergence of a new cluster, which reflects a new cluster structure. After this year, the coefficient improved and remained relatively stable but declined after 2018.
In the case of dynamic clustering based on gender variables, the Bezdek partition coefficient remains relatively stable over the years. It was only in 2017 that the coefficient values started to fall, but this reflects only a certain instability of the clusters. As we will see below, those clusters arise due more to changing population densities across the city than to uneven differences in the density of men and women in Berlin’s residential areas, which are often remarkably similar.
The Bezdek partition coefficient generally shows that the dynamic clustering solutions obtained improve over time. It also shows us that when a decrease in the coefficient is observed, the emergence of a new cluster structure can be expected, in this case, when the migration background variables are considered. Therefore, the solutions obtained from the data represent valid cluster solutions.
4.2.2 Clustering results
To characterize the change of the clusters considering migration background, age group, gender, or socioeconomic variables, we present the clustering results for the years 2009 and 2020 in terms of their Mean in Cluster (MIC, the mean of a given variable in a given cluster) and the v-test, which together indicate whether a variable is under- or overrepresented in a given cluster and year, along with the corresponding statistical significance. Finally, the size of the clusters is presented in normalized terms (see Fig. 3) and in absolute terms (see the tables in the Appendix).^{Footnote 2}
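This section does not restate how the two statistics are computed. The sketch below assumes the common definition of the value test used for describing clusters: the difference between the cluster mean and the overall mean, divided by the standard error of a mean drawn without replacement. It also assumes crisp cluster assignments and the population variance; the fuzzy memberships used in the study may be handled differently. The function names and the toy data are hypothetical.

```python
from math import sqrt


def mean_in_cluster(values, in_cluster):
    """MIC: mean of a variable over the objects assigned to the cluster."""
    selected = [v for v, flag in zip(values, in_cluster) if flag]
    return sum(selected) / len(selected)


def v_test(values, in_cluster):
    """Value test of one variable for one cluster (assumed definition)."""
    n = len(values)
    n_k = sum(1 for flag in in_cluster if flag)
    if not 0 < n_k < n:
        raise ValueError("the cluster must be a non-empty proper subset")
    overall_mean = sum(values) / n
    # Population variance of the variable over all objects (an assumption).
    variance = sum((v - overall_mean) ** 2 for v in values) / n
    # Standard error with a finite-population correction.
    standard_error = sqrt((n - n_k) / (n - 1) * variance / n_k)
    return (mean_in_cluster(values, in_cluster) - overall_mean) / standard_error


# Toy densities for six areas; the last three form the cluster.
densities = [1, 2, 3, 10, 11, 12]
cluster = [False, False, False, True, True, True]
print(mean_in_cluster(densities, cluster), v_test(densities, cluster))
```

Under this definition a positive v-test means the cluster mean lies above the overall mean, and the complementary group of areas receives the same value with the opposite sign.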
To visualize the micro changes and trends in the clusters over time, we generated a series of bump charts to describe which categories of variables were significant over time in each cluster. A Bump Chart “shows how quantitative category rankings have changed over time. They are typically structured around a temporal x-axis with equal intervals from the earliest to the latest. Quantitative rankings are plotted using joined-up lines that effectively connect consecutive points positioned along a y-axis (typically top = first)” [47]. After evaluation, each v-test value is assigned a rank, and each variable’s ranks for a given year are plotted in descending order. The graph also shows that the values are divided into groups based on a threshold to determine whether the variable is overrepresented, underrepresented, or not significant in a given cluster. To do this, the graphs use the critical values (i.e., 1.96 and −1.96 for a two-tailed test at a 5% significance level). Therefore:

If the v-test value is greater than 1.96, a variable is considered to be overrepresented in a given cluster.

If the v-test value is less than −1.96, a variable is considered to be underrepresented.

If the v-test value is between −1.96 and 1.96, then a variable is not considered significant.
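The three rules above can be sketched as a small helper. The function name `classify_v_test` is illustrative; the example values are those reported for the age-group Cluster 0 in 2009.

```python
CRITICAL_VALUE = 1.96  # two-tailed test at the 5% significance level


def classify_v_test(v, critical=CRITICAL_VALUE):
    """Label a variable in a cluster according to its v-test value."""
    if v > critical:
        return "overrepresented"
    if v < -critical:
        return "underrepresented"
    # Values between -critical and critical are treated as not significant.
    return "not significant"


for label, value in [("80 to 85", 31.307), ("30 to 35", -25.043), ("60 to 65", 0.491)]:
    print(label, classify_v_test(value))
```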
The bump chart is used here to visualize the micro dynamics of residential segregation. On the y-axis, the names of the variables are listed according to the value obtained for each year and cluster. This gives an informative picture of the composition of the clusters: the y-axis represents the ranking of the variables, the x-axis represents the years, and the connecting lines show how the ranking of the different categories changes over time.
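A minimal sketch of the ranking step behind such a chart is given below. It ranks variables within each year by descending v-test, using three of the values reported for Cluster 7 in the migration-background results. Restricting the example to three variables is a simplification for illustration, so the ranks are relative to this subset only; the function name `rank_by_v_test` is hypothetical, and plotting is left out.

```python
def rank_by_v_test(v_tests_by_year):
    """Rank variables within each year, 1 = highest v-test (top of the chart)."""
    ranks = {}
    for year, v_tests in v_tests_by_year.items():
        ordered = sorted(v_tests, key=v_tests.get, reverse=True)
        ranks[year] = {name: position for position, name in enumerate(ordered, start=1)}
    return ranks


# Subset of the v-test values reported for Cluster 7.
cluster7 = {
    2015: {"Syria": 142.006, "China": 131.278, "Kazakhstan": 7.84},
    2020: {"Syria": 185.59, "China": 193.024, "Kazakhstan": 16.089},
}

ranks = rank_by_v_test(cluster7)
for year in sorted(ranks):
    print(year, ranks[year])
```

Each variable then contributes one line to the chart, connecting its rank in consecutive years; in this subset, Syria and China swap the first two positions between 2015 and 2020.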
Results based on the migratory background.
In terms of migrant background and clustered residential areas, the city of Berlin has a diverse and mixed population. The dynamic cluster analysis shows seven clusters from 2009 to 2014 and a total of eight clusters from 2015 to 2020. In 2015, a change in the cluster structure was detected, corresponding to the emergence of Cluster 7. From a qualitative point of view, it can be seen that this change occurred in the same year as the so-called European migration crisis.
Cluster 0 is characterized by all migrant-related variables being underrepresented (see Fig. 4a). In the year 2009, the three most underrepresented variables correspond to Germans without a background of migration (MIC = 7.165; v-test = −96.753; p = 0.000), Poland (MIC = 2.181; v-test = −96.19; p = 0.000) and other subpopulations, while for the year 2020, the most underrepresented variables correspond to Poland (MIC = 2.052; v-test = −313.965; p = 0.000), Germans with no migration history (MIC = 6.301; v-test = −311.816; p = 0.000) and Syria (MIC = 1.702; v-test = −283.379; p = 0.000).
Cluster 1 is also characterized by the underrepresentation of all variables related to the migrant background (see Fig. 4b). In 2009, the three most underrepresented variables corresponded to other minorities (MIC = 6.332; v-test = −30.513; p = 0.000), Syrians (MIC = 5.136; v-test = −29.968; p = 0.000) and Ukrainian subpopulations (MIC = 5.981; v-test = −29.255; p = 0.000). The least underrepresented are the USA (MIC = 12.166; v-test = −3.303; p = 0.001), Iran (MIC = 9.331; v-test = −10.596; p = 0.000) and China (MIC = 6.855; v-test = −16.39; p = 0.000). For 2020, the most underrepresented variables are the other minorities category (MIC = 5.132; v-test = −129.563; p = 0.000), Poland (MIC = 6.504; v-test = −122.126; p = 0.000), and Italy (MIC = 4.154; v-test = −121.158; p = 0.000).
In Cluster 2, all variables are overrepresented, except for Kazakhstan, which ranks last and is underrepresented for all measured years (see Fig. 4c). The most overrepresented subpopulations in 2009 are Iran (MIC = 76.504; v-test = 123.307; p = 0.000), Ukraine (MIC = 56.573; v-test = 104.875; p = 0.000), China (MIC = 54.082; v-test = 84.846; p = 0.000), USA (MIC = 60.44; v-test = 84.445; p = 0.000) and Austria (MIC = 51.335; v-test = 83.014; p = 0.000), among other subpopulations. Similarly, the most overrepresented variables for the year 2020 correspond to those of Iran (MIC = 74.076; v-test = 394.988; p = 0.000), Ukraine (MIC = 62.733; v-test = 365.308; p = 0.000), China (MIC = 56.408; v-test = 291.015; p = 0.000), Greece (MIC = 54.289; v-test = 243.657; p = 0.000) and Austria (MIC = 52.133; v-test = 241.398; p = 0.000), among other variables that characterize this cluster.
In Cluster 3, as all variables are statistically significant, all variables characterize this cluster (see Fig. 4d). In 2009, all the variables of the migratory background were overrepresented, as in the case of Poland (MIC = 35.064; v-test = 73.191; p = 0.000), Croatia (MIC = 31.789; v-test = 56.777; p = 0.000), Syria (MIC = 27.023; v-test = 49.38; p = 0.000), Russia (MIC = 32.033; v-test = 49.372; p = 0.000) and Serbia (MIC = 32.033; v-test = 49.372; p = 0.000). From 2015 to 2020, a change was observed as the USA, France, and Spain subpopulations became underrepresented. It is also observed that between 2015 and 2020, the United Kingdom no longer represents this cluster. For the year 2020, it is observed that the subpopulations of Poland (MIC = 32.751; v-test = 251.125; p = 0.000), Croatia (MIC = 27.117; v-test = 171.742; p = 0.000), Serbia (MIC = 25.932; v-test = 144.585; p = 0.000), Syria (MIC = 20.281; v-test = 121.066; p = 0.000), and Bosnia and Herzegovina (MIC = 23.835; v-test = 120.809; p = 0.000) are the five most overrepresented subpopulations in this cluster.
Most of the variables in Cluster 4 are underrepresented, although there are a few overrepresented variables (see Fig. 4e). In 2009, the variables Kazakhstan (MIC = 19.642; v-test = 21.459; p = 0.000), Germans without a migration background (MIC = 18.297; v-test = 14.523; p = 0.000), and Vietnam (MIC = 16.58; v-test = 14.282; p = 0.000) are overrepresented. For the same year, the most underrepresented variables are the subpopulations of France (MIC = 6.218; v-test = −17.707; p = 0.000), Italy (MIC = 6.797; v-test = −17.258; p = 0.000), USA (MIC = 7.334; v-test = −16.322; p = 0.000), Spain (MIC = 5.953; v-test = −16.258; p = 0.000) and UK (MIC = 7.218; v-test = −15.911; p = 0.000), among others. By 2020, the only overrepresented subpopulation is the German subpopulation without a migration background (MIC = 15.596; v-test = 8.687; p = 0.000), while the Spanish (MIC = 5.938; v-test = −60.339; p = 0.000), French (MIC = 5.938; v-test = −58.254; p = 0.000) and Italian subpopulations (MIC = 7.178; v-test = −56.994; p = 0.000) are the most underrepresented.
There is a mixture of over- and underrepresented variables in Cluster 5 (see Fig. 4f). The most overrepresented subpopulations in 2009 are Kazakhstan (MIC = 46.609; v-test = 91.192; p = 0.000), Vietnam (MIC = 35.34; v-test = 62.19; p = 0.000) and Germans without a migration background (MIC = 23.93; v-test = 41.097; p = 0.000), among others. In the same year, the most underrepresented subpopulations are the USA (MIC = 8.197; v-test = −13.493; p = 0.000), France (MIC = 8.265; v-test = −11.823; p = 0.000) and Spain (MIC = 8.126; v-test = −10.444; p = 0.000), among other subpopulations. For the year 2020, the three most overrepresented subpopulations are Kazakhstan (MIC = 48.736; v-test = 346.537; p = 0.000), Vietnam (MIC = 45.111; v-test = 269.945; p = 0.000) and Russia (MIC = 31.451; v-test = 244.861; p = 0.000), while the USA is the most underrepresented (MIC = 5.524; v-test = −67.792; p = 0.000), followed by France (MIC = 5.78; v-test = −58.401; p = 0.000) and UK (MIC = 6.468; v-test = −55.138; p = 0.000). Finally, the variables of Romania, Croatia, Iran, Bulgaria, and other minorities are not always characteristic of this cluster over time.
During 2009, Cluster 6 was characterized by almost all subpopulations being overrepresented, except for Kazakhstan (MIC = 10.477; v-test = −7.066; p = 0.000), which was the only one underrepresented (see Fig. 4g). In 2020, the three most overrepresented subpopulations were Spain (MIC = 79.323; v-test = 479.471; p = 0.000), France (MIC = 80.279; v-test = 479.408; p = 0.000), and Italy (MIC = 71.934; v-test = 470.824; p = 0.000), along with other subpopulations.
Most interestingly, the dynamic cluster analysis revealed the emergence of Cluster 7 in 2015. In this cluster, all subpopulations are representative and overrepresented. In 2015, the most overrepresented variable in this cluster is Syria (MIC = 30.966; v-test = 142.006; p = 0.000), the second most overrepresented is China (MIC = 31.461; v-test = 131.278; p = 0.000), and the third most overrepresented is other minorities (MIC = 37.489; v-test = 13072; p = 0.000). The least overrepresented variable is Kazakhstan (MIC = 13.077; v-test = 7.84; p = 0.000). In the year 2020, this cluster is characterized by China as the most overrepresented variable (MIC = 32.429; v-test = 193.024; p = 0.000). The second most overrepresented variable is Syria (MIC = 28.793; v-test = 185.59; p = 0.000), and the third most overrepresented variable is Croatia (MIC = 32.348; v-test = 185.12; p = 0.000). The least overrepresented variable is again Kazakhstan (MIC = 13.718; v-test = 16.089; p = 0.000).
Finally, the visualization of the clusters is shown in the maps in Fig. 5, and the normalized size of clusters over time is shown in Fig. 3a.
Results based on the age group.
The dynamic cluster analysis revealed that the population of Berlin is grouped in residential areas in a structure of four different clusters of age groups. Qualitatively, Cluster 3 is located in the city centre, Cluster 2 is located around the city centre, surrounded by Cluster 0. Finally, Cluster 1 is located on the city’s outskirts.
Cluster 0 has only a few overrepresented variables and several underrepresented ones (see Fig. 6a). Analysis using the value test shows that in 2009, subpopulations in Cluster 0 aged 80 to 85 are the most overrepresented (MIC = 14.578; v-test = 31.307; p = 0.000), subpopulations aged 30 to 35 are the most underrepresented (MIC = 9.79; v-test = −25.043; p = 0.000), and subpopulations aged 60 to 65 are not significant (MIC = 13.201; v-test = 0.491; p = 0.623). For the year 2020, subpopulations between 85 and 90 years are the most overrepresented (MIC = 13.909; v-test = 104.682; p = 0.000), and subpopulations between 30 and 35 years are the most underrepresented (MIC = 9.327; v-test = −92.725; p = 0.000), showing ageing of the cluster compared to 2009.
Cluster 1 has all variables underrepresented (see Fig. 6b). For the year 2009 in Cluster 1, the subpopulations between 30 and 35 are the least underrepresented (MIC = 2.804; v-test = −96.097; p = 0.000), and the most underrepresented are the subpopulations between 80 and 85 (MIC = 5.244; v-test = −130.039; p = 0.000). By 2020, the least underrepresented subpopulations in Cluster 1 are those between 30 and 35 (MIC = 2.541; v-test = −334.159; p = 0.000), and the most underrepresented are those between 85 and 90 (MIC = 5.062; v-test = −437.36; p = 0.000).
Cluster 2 has all variables overrepresented during 2009 (see Fig. 6c). For the year 2020, the most overrepresented age groups are the 80- to 85-year-olds (MIC = 22.479; v-test = 325.474; p = 0.000), and the least overrepresented groups are the 30- to 35-year-olds (MIC = 23.894; v-test = 161.203; p = 0.000).
Similarly, in Cluster 3, all variables are overrepresented (see Fig. 6d). For the year 2009, the subpopulations from 35 to 40 are the most overrepresented (MIC = 51.23; v-test = 142.501; p = 0.000), and the subpopulations from 85 to 90 are the least overrepresented (MIC = 18.939; v-test = 49.795; p = 0.000), showing that it is a representative cluster of adults. Similarly, in the year 2020, the subpopulations ranging from 35 to 40 are the most overrepresented (MIC = 51.42; v-test = 492.401; p = 0.000), and subpopulations over 90 are the least overrepresented (MIC = 17.28; v-test = 150.838; p = 0.000).
The visualization of the clusters on the map of Berlin is shown in Fig. 7, and the normalized size of clusters over time is shown in Fig. 3b. Interestingly, there is no change in the number of clusters over time, but there is an increase in the overall population density. This means that the cluster structure based on the age group dimension remains stable over the period observed.
Results based on gender.
The results of the cluster analysis allowed the identification of three clusters.
In general, the cluster analysis shows that the clusters represent the population density in residential areas. In other words, the clusters divided into male and female population densities correspond to Berlin’s more or less densified areas. In the case of the clusters, the marginal differences over time are reported below.
In Cluster 0, both variables are overrepresented (see Fig. 8a). For the year 2009, both male (MIC = 14.92; v-test = 139.279; p = 0.000) and female (MIC = 14.92; v-test = 139.279; p = 0.000) subpopulations are equally overrepresented in this cluster. For the year 2020, the male population (MIC = 45.058; v-test = 471.478; p = 0.000) is more overrepresented than the female population (MIC = 461.205; v-test = 461.205; p = 0.000).
Cluster 1 has both variables underrepresented (see Fig. 8b). Both male and female residential population densities reached the same values in 2009 (MIC = 6.55; v-test = −137.675; p = 0.000). However, for 2020, the male population (MIC = 6.316; v-test = −471.675; p = 0.000) is only slightly more underrepresented than the female population (MIC = 6.233; v-test = −456.146; p = 0.000).
In Cluster 2, both variables are overrepresented (see Fig. 8c). Both male and female populations had the same spatial density for 2009 (MIC = 21.696; v-test = 50.164; p = 0.000). However, for 2020, male populations (MIC = 21.66; v-test = 179.541; p = 0.000) are more overrepresented than female populations (MIC = 21.458; v-test = 169.838; p = 0.000).
The visualization of the clusters on the map of Berlin is shown in Fig. 9, and the normalized size of clusters over time can be seen in Fig. 3c.
Results based on socioeconomics.
For the socioeconomic dimension, the cluster analysis resulted in the identification of four clusters.
From a qualitative point of view, Cluster 3 represents the places with the most significant socioeconomic problems. It can be seen that the areas corresponding to Clusters 2 and 3 are larger in 2009, after the global subprime crisis, and that at the onset of the COVID-19 pandemic in 2020 these clusters have slightly different shapes.
In Cluster 0, both variables are overrepresented and statistically significant (see Fig. 10a). In 2009, SGB III was the most overrepresented (MIC = 13.564; v-test = 15.813; p = 0.000), and SGB II was the least overrepresented (MIC = 12.404; v-test = 6.638; p = 0.000). For the year 2020, SGB III is the most overrepresented (MIC = 12.727; v-test = 68.251; p = 0.000), and SGB II is the least overrepresented (MIC = 12.423; v-test = 54.825; p = 0.000), with a decrease in the former group, and an increase in the latter since 2009.
On the contrary, in Cluster 1, the variables are underrepresented and statistically significant (see Fig. 10b). For 2009, Cluster 1 had SGB II as the least underrepresented socioeconomic variable (MIC = 2.388; v-test = −123.634; p = 0.000), and SGB III as the most underrepresented (MIC = 3.107; v-test = −130.744; p = 0.000). Similarly, for 2020, SGB II was the least underrepresented (MIC = 3.012; v-test = −411.076; p = 0.000), and SGB III was the most underrepresented (MIC = 3.227; v-test = −424.48; p = 0.000).
In Cluster 2, both variables are overrepresented and statistically significant (see Fig. 10c). For 2009, SGB III was the most overrepresented (MIC = 25.641; v-test = 89.369; p = 0.000), and SGB II was the least overrepresented (MIC = 26.481; v-test = 83.642; p = 0.000). Similarly, in 2020, SGB III was the most overrepresented (MIC = 25.985; v-test = 318.287; p = 0.000), and SGB II was the least overrepresented (MIC = 26.169; v-test = 304.748; p = 0.000).
Finally, in Cluster 3, both variables are statistically significant and overrepresented (see Fig. 10d). In 2009, SGB II was the most overrepresented (MIC = 55.375; v-test = 125.056; p = 0.000) and SGB III was the least overrepresented (MIC = 46.651; v-test = 113.809; p = 0.000). The same situation occurred in 2020, where SGB II was the most overrepresented (MIC = 54.557; v-test = 419.17; p = 0.000), and SGB III was the least overrepresented (MIC = 49.921; v-test = 396.516; p = 0.000). The map of Berlin is shown in Fig. 11, and Fig. 3d shows the normalized size of clusters over time.
In summary, the maps show that residential segregation in Berlin is a phenomenon that can be visualized on a geographical level. The analysis also detected the emergence of a cluster when analysing the migration background of Berlin’s populations. Finally, the results show that the clusters have small movements because the composition of the clusters changes over time and space.
5 Discussion and conclusion
This study aimed to examine the phenomenon of residential segregation from a dynamic point of view. According to our approach, residential segregation can be explored from different angles, for example, from the side of migration background, age group, gender, or other variables describing the economic situation of the population under study.
To open the discussion, we would like to recall that several studies have been carried out on the spatial distribution of Berlin’s subpopulations. In particular, we believe that the reporting of spatial densities excluding non-residential areas, the separate analysis of dimensions that have already been documented by several researchers, and the use of dynamic rather than static cluster analysis are aspects that can help different disciplines, especially those that are looking for novel methodological ways to identify changes in population structure from cohort data. In this context, we briefly discuss some of the findings and then summarise the research undertaken.
5.1 Comparison with previous research
To provide a more comprehensive overview of the results, we will compare our findings with those of other researchers who have independently addressed the issue of residential segregation. First, we would like to stress that the proposed methodology allows us to identify changes that we can label macro and micro. By macro changes, we refer to the possibility of clusters appearing, moving, or disappearing over time. By micro changes, we refer to the internal changes that can occur in the composition of each cluster, which we have operationalized and visualized using bump charts. Table 3 summarizes the main results of our approach, together with selected previous studies.
5.1.1 Macro dynamics
The results allowed us to establish that there is evidence of a structural change over the period analyzed. Within this structural change, a new cluster emerged in 2015, coinciding with the peak of the migration wave in the context of the European migration crisis. Given the nature of the dynamic clustering algorithm we use, which uses all past data to assess whether a change in cluster structure is taking place, identifying the emergence of a new cluster structure requires an event at the demographic level that makes it possible. The migration crisis in Europe and Germany’s unprecedented refugee policy make the structural change we detect in Berlin a plausible interpretation of the data analysis results. The ability to detect the presence of residential segregation is the most salient finding of this study, as it demonstrates that the methodology can help to identify new patterns of residential segregation.
5.1.2 Micro dynamics
At the micro level, the bump charts show that some clusters have developed overrepresented subpopulations over time, others only underrepresented subpopulations, and others a combination of both. The main trends identified can be summarized as follows:

Concerning ethnic residential segregation: In terms of micro-dynamic changes, the proposed method allows us to study the changes within each cluster. The richness of the results allows us to observe the overrepresented subpopulations in each cluster and the changes in the classification of each cluster, allowing us to observe the dynamics over time. The results are consistent. They continue to show the results of the now long-past migration waves of “temporary” guest workers (i.e. the so-called Gastarbeiter) from Turkey and Lebanon. However, it is only in the present work that we can observe the positioning of the Syrian and Chinese migrant subpopulations as the most overrepresented subpopulations as part of Cluster 7. Both Syrian refugees and asylum seekers from China are a known reality of recent immigration to Berlin. For example, Kate Martyr [48], an editor and video producer at DW’s Asia desk, reports on the surge in asylum applications from China to Germany, particularly from the oppressed Uighur minority. Finally, we observe the increase or decrease of the spatial areas occupied by the clusters in Berlin as the normalized cluster size changes, which was noticeable in 2015 due to the structural change of the clusters.

Concerning age-group segregation: In general, the bump charts show slight changes in the ranking of the categories of variables describing the phenomenon of age segregation. Age segregation is a demographic phenomenon characterized in detail by Yamamoto, Kemper, and Nakagawa, who used data available before and after the fall of the Berlin Wall. Nakagawa found two clusters in West Berlin, characterized by higher adult densities in outer Berlin compared to populations in inner Berlin. Kemper compared East and West Berlin before and after reunification and found different degrees of segregation in these two areas. Finally, Masías et al. [29] find four clusters with different age group distributions in the city.
Through the application of dynamic analysis, our study confirms the existence of age group segregation, which materialises in the four clusters we have found. The maps we present do not show idealized concentric zones, as suggested by earlier studies such as Nakagawa’s, but more complex-shaped clusters that can be observed visually. We find that older people are concentrated in the peripheral areas of Berlin, spatially surrounding the other groups within the city, as seen in the maps provided. We also identified areas where young adults are found and clusters where children are overrepresented. We also observe that the standardized size of the clusters does not change significantly over time, which can be interpreted to mean that the spatial areas these clusters occupy remain relatively stable. This is highly consistent with the observation of Nakagawa, who stated that “residential segregation by age group is a very real phenomenon” [49, p. 134]. In our results, we show in greater detail that the phenomenon of residential age segregation is present in Berlin.

On socioeconomic residential segregation: We observed that the ranking of the variables remained stable, i.e. in the same ranking position in all clusters during the years studied. Compared to previous research, some similarities can be observed in the locations with the highest rate of people claiming state subsidies (see the maps published by Blokland [27], Fig. 13.3 on p. 257). Finally, we would like to report that we have observed a qualitative change that can be seen in the 2020 map, where cluster areas take on new shapes. The results of the method show that, at least visually, there are socioeconomically disadvantaged areas that only expanded in the years 2009 and 2020, which is reflected in the size of the clusters. We should bear in mind that 2009 was part of the subprime financial crisis and that in 2020 the economy was under the stress of the COVID-19 outbreak. We believe that the change in cluster shapes may be related to the global COVID-19 pandemic, when many individuals in Berlin started to apply for social welfare. However, more research is needed to link this qualitative observation to a cause-effect relationship.

On residential segregation by gender: We found changes in the variables describing population densities by gender. The data analysis shows three clusters representing different densities of male and female individuals. However, we observe cluster densities that reflect a slight imbalance between females and males. Finally, the normalized cluster size does not vary significantly over time, which means that the spatial areas of the clusters have neither shrunk nor expanded throughout the observation period.
Kröhnert and Vollmer [15] have argued that women from rural areas in Germany migrate to large cities such as Berlin more than men, who remain in rural areas. Under this hypothesis, one might expect the emergence of clusters formed by women who are internal migrants, reflecting this phenomenon, which is still unknown to us. However, the variation in high, medium and low population density described by the clusters seems to reflect the variation in population density as a whole. Some changes are numerically small, but qualitatively significant for monitoring the expansion of gender residential segregation observed in other geographical regions (e.g. for examining population sex ratios in China and Saudi Arabia over time and space). Perhaps because the sex ratios in Germany are mostly balanced, the phenomenon can be observed when comparing rural areas with urban areas or between eastern and western Germany, i.e. when looking at data at the country level.
In this study, an analysis was conducted using a dynamic approach to describe the phenomenon of residential segregation in Berlin. As described in this paper, residential segregation is a complex dynamic phenomenon in which different facets of Berlin’s subpopulations are over- or underrepresented in clusters across the city. We believe that the use of dynamic cluster analysis may be of particular interest to researchers who would like to find patterns that emerge from the data rather than trying to explain or predict a variable from a survey or a multivariate index, as in both cases, it can be understood as a supervised analysis problem, which by definition involves the creation of a variable or index that directly represents residential segregation. In our methodological and theoretical approach, patterns emerge from data based on multivariate and non-black-box analysis methods.
Thus, at a high conceptual level, the analysis shows that there is no such thing as a subpopulation that isolates itself in residential areas. Instead, it can be represented as a multivariate phenomenon where clusters can be observed on the dimension of migration background, age groups, socioeconomic groups, or the dimension of gender. These dimensions may have causal relationships with each other. However, in this study, we have taken a more focused approach to represent the phenomenon of residential segregation, which has been extensively documented in Berlin, rather than generating an explanatory model, as is commonly attempted.
5.2 Future research
Future research would aim to apply this approach to data from other cities in Germany and worldwide. In particular, it would be exciting to study demographic changes in migration crises or birth shortages and case diffusion processes in times of pandemics. It would also be interesting to include other variables that represent the neighbourhood, the quality of life of people, or the transport systems. In this way, the representation of residential segregation would also have associated elements of the city’s infrastructure. Another idea is to analyze the clusters by looking at several dimensions together, for example, age and migration background, rather than studying them independently. But we explore these dimensions separately to illustrate the general thrust of our approach and also to contribute to protecting geoprivacy.
5.3 Practical applications
In this context, we believe that the approach followed in this study has multiple practical applications. Several tools describe the demography of Berlin, and some of them focus on measuring the integration of migratory subpopulations, such as the so-called “Integration Indicator Report”, which is based on data provided by Der Mikrozensus, das Sozioökonomische Panel, and the Programme for International Student Assessment [50]. Also, the annual indicators published by the Federal Statistical Office (Destatis) [51], the German Expert Council on Integration and Migration (https://www.svrmigration.de/jahresgutachten/) and the annual reports of the Organisation for Economic Cooperation and Development Integration Monitoring [52] provide an overview of the migration situation of the diverse communities in the different countries.
The differential aspect of our approach is that it allows us to observe demographic changes in residential areas over time from a global perspective. In addition, different sets of variables can be analyzed separately or together. For example, as shown in Table 1, little is known about digital segregation. The possibility of analyzing the dynamics of clusters allows for a better understanding of the impacts of territorial policies and social interventions. We believe that the future availability of spatial databases describing the information and communication technologies used will make it possible to generate new representations of the relationship between the different aspects describing the phenomenon of residential segregation.
5.4 Limitations
The clustering algorithm makes it possible to detect structural changes, but it does not provide direct knowledge of the exact size of the clusters in each period. It gives cluster sizes only as proportions, because at each re-evaluation the algorithm aggregates data from previous periods in its updating process. This can be seen as a limitation of the analytical approach, although the same strategy yields a more robust assessment of dynamic changes. The dynamic fuzzy clustering algorithm updates the clusters by incorporating the data evaluated in previous cluster updates; that is, at each stage it analyzes all previously evaluated data together with the new data, as if the whole set were new. With this strategy, the algorithm monitors structural changes rather than assessing year-on-year changes in the clusters: it detects a change in cluster composition when new data appear that differ from the previously observed classes. What matters, therefore, is how different the distribution of the aggregated data is at each point in time. We can conclude that a detected change is not local but significant, because the algorithm detected it while considering all the previously available and evaluated data.
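The cumulative updating described above, and the reason it yields proportions rather than per-period cluster sizes, can be illustrated with a minimal sketch. It is not the dynamic fuzzy clustering procedure used in this study: it assumes plain fuzzy c-means refitted on all data seen so far, with the previous centers as a warm start. The function names, the fuzzifier value m = 2, and the synthetic two-dimensional data are illustrative assumptions.

```python
import numpy as np


def memberships(X, centers, m=2.0):
    """Fuzzy membership matrix U (n x c); every row sums to 1."""
    # Squared distance from each point to each center, floored to avoid division by zero.
    d2 = np.maximum(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), 1e-12)
    inv = d2 ** (-1.0 / (m - 1.0))  # equals distance ** (-2 / (m - 1))
    return inv / inv.sum(axis=1, keepdims=True)


def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0, init_centers=None):
    """Plain fuzzy c-means; returns (centers, U)."""
    rng = np.random.default_rng(seed)
    if init_centers is None:
        # Start from c distinct data points chosen at random.
        centers = X[rng.choice(len(X), size=c, replace=False)]
    else:
        centers = np.asarray(init_centers, dtype=float)
    for _ in range(n_iter):
        W = memberships(X, centers, m) ** m
        # Each center is the mean of the data weighted by u_ik ** m.
        new_centers = (W.T @ X) / W.sum(axis=0)[:, None]
        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(X, centers, m)


def cumulative_cluster_shares(batches, c, m=2.0):
    """After each new period, refit on ALL data seen so far (assumed scheme)."""
    seen, centers, shares = [], None, []
    for batch in batches:
        seen.append(batch)
        X = np.vstack(seen)
        centers, U = fuzzy_c_means(X, c, m=m, init_centers=centers)
        # Share of total membership per cluster: a proportion of the
        # cumulative data, not the cluster's size in the current period.
        shares.append(U.sum(axis=0) / len(X))
    return centers, shares


# Synthetic example: period 2 adds observations to only one of two groups.
rng = np.random.default_rng(42)
year_1 = np.vstack([rng.normal([0, 0], 0.3, (50, 2)),
                    rng.normal([5, 5], 0.3, (50, 2))])
year_2 = rng.normal([0, 0], 0.3, (100, 2))
centers, shares = cumulative_cluster_shares([year_1, year_2], c=2)
for t, s in enumerate(shares, start=1):
    print(f"period {t}: cluster shares = {np.round(s, 2)}")
```

Because each refit pools the new period with all earlier ones, the reported shares describe the cumulative data. In a scheme like this, a shift in the shares signals that the newly added data are distributed differently from what was seen before, but the shares cannot be read as the size of a cluster in a single year.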
5.5 Conclusion
In this paper, we have proposed a methodology to explore and describe the demography of residential areas in Berlin. The proposed methodology allows us to make new observations on how different subpopulations are distributed across residential areas. In addition, because the analysis is carried out over time, we gained new insights into the changing internal composition of clusters, their rich diversity, and structural changes. We conclude that this novel approach, based on data science principles, allows us to identify patterns of residential segregation in Berlin in a more unified way. We encourage other researchers to develop new hypotheses about the demographic changes observed in residential areas and the factors that might explain them.
Data availability
The data that support the findings of this study come from the register of residents (Einwohnerregister) for 2009 to 2020, available at the Statistical Office of Brandenburg (see www.statistik-berlin-brandenburg.de). However, restrictions apply to the availability of these data, which were used under license for the current study and are not publicly available.
Notes
In our study \(d=2\).
In this Section, we have chosen to present the results with additional textual elaboration. This decision was made to make the results accessible to all readers, including those who are visually impaired or unable to perceive visual representations such as maps. By providing detailed written descriptions, we hope to improve the comprehensibility and completeness of the information presented.
Abbreviations
LOR: spatial planning area (i.e., Die Lebensweltlich orientierten Räume)
MIC: mean in cluster
SGB II: Sozialgesetzbuch Zweites Buch (German Social Code, Book II)
SGB III: Sozialgesetzbuch Drittes Buch (German Social Code, Book III)
v-test: value-tests
References
Nebe JM (1988) Residential segregation of ethnic groups in West German cities. Cities 5(3):235–244
Häußermann H (2013) Berlin: von der geteilten zur gespaltenen Stadt?: sozialräumlicher Wandel seit 1990. Springer, Berlin
Arin C (1991) The housing market and housing policies for the migrant labor population in West Berlin. In: Huttman E (ed) Urban housing segregation of minorities in western Europe and the United States. Duke University Press, Durham, pp 199–214
Nakagawa S (1993) Applying cohort analysis to residential segregation by age group in Berlin (West). Geogr Pol 61:133–142
Yamamoto K (1993) Spatial segregation of ethnic minorities in German cities. Geogr Rev Jpn, Ser B 66(2):127–155
Kemper FJ (1998) Restructuring of housing and ethnic segregation: recent developments in Berlin. Urban Stud 35(10):1765–1789
Kemper FJ (1998) Residential segregation and housing in Berlin: changes since unification. GeoJournal 46(1):17–28
Nakagawa S (1999) Internal migration in the territory of the former German democratic republic before German unification. Regional Views 12:15–26
Friedrichs J (2000) Ethnische Segregation im Kontext allgemeiner Segregationsprozesse in der Stadt. In: Harth A, Scheller G, Tessin W (eds) Stadt und soziale Ungleichheit. VS Verlag für Sozialwissenschaften, pp 174–196
Häußermann H (2007) Berlin: from divided into fragmented city. HAGAR: Studies in Culture, Polity & Identities 7(1)
Schönwälder K, Söhn J (2009) Immigrant settlement structures in Germany: general patterns and urban levels of concentration of major groups. Urban Stud 46(7):1439–1460
Häußermann H, Kronauer M, Gornig M (2008) Desintegration und soziale Kohäsion in Berlin. Hans-Böckler-Stiftung, Düsseldorf
Friedrichs J, Triemer S (2009) Gespaltene Städte?: soziale und ethnische Segregation in deutschen Großstädten. Springer, Wiesbaden
Geraedts J (2009) Döner versus Curry Wurst - Segregation versus integration: Comparing two neighbourhoods in Multi Cultural Berlin [Master Thesis]. Available from https://theses.ubn.ru.nl/bitstream/handle/123456789/3081/Geraedts%2c_Joske_1.pdf?sequence=1
Kröhnert S, Vollmer S (2012) Gender-specific migration from eastern to western Germany: where have all the young women gone? Int Migr 50(5):95–112
Zimmermann KF, Constant A, Schüller S (2014) Ethnic Spatial Dispersion and Immigrant Identity. In: Beiträge zur Jahrestagung des Vereins für Socialpolitik 2014: Evidenzbasierte Wirtschaftspolitik - Session: Migration II. ZBW - Deutsche Zentralbibliothek für Wirtschaftswissenschaften, Leibniz-Informationszentrum Wirtschaft, Kiel und Hamburg, p. 0–25. No. E05-V1
Kapphan A (2013) Das arme Berlin: Sozialräumliche Polarisierung, Armutskonzentration und Ausgrenzung in den 1990er Jahren, vol 18. Springer, Wiesbaden
Mayer M (2013) New lines of division in the new Berlin. In: The Berlin reader. Transcript verlag, pp 95–106
Glitz A (2014) Ethnic segregation in Germany. Labour Econ 29:28–40
Jaczewska B, Grzegorczyk A (2016) Residential segregation of metropolitan areas of Warsaw, Berlin and Paris. Geogr Pol 89(2):141–168
Groß M, Rendtel U, Schmid T, Schmon S, Tzavidis N (2017) Estimating the density of ethnic minorities and aged people in Berlin: multivariate kernel density estimation applied to sensitive georeferenced administrative data protected via measurement error. J R Stat Soc, Ser A, Stat Soc 180(1):161–183
OECD (2018) Working Together for Local Integration of Migrants and Refugees in Berlin. OECD Publishing. Available from https://books.google.de/books?id=BiVtDwAAQBAJ
Helbig M, Jähnen S (2018) Wie brüchig ist die soziale Architektur unserer Städte? Trends und Analysen der Segregation in 74 deutschen Städten. WZB Discussion Paper
Kurtenbach S (2019) Digitale Segregation. Sozialräumliche Muster der Nutzung digitaler Nachbarschaftsplattformen. In: Heinze R, Kurtenbach S, Üblacker J (eds) Digitalisierung und Nachbarschaft. Erosion des Zusammenlebens oder neue Vergemeinschaftung. Nomos, Baden-Baden, pp 115–142
Heider B, Stroms P, Koch J, Siedentop S (2020) Where do immigrants move in Germany? The role of international migration in regional disparities in population development. Popul Space Place 26(8):1–19
Bartzokas-Tsiompras A, Photis Y (2020) Does neighborhood walkability affect ethnic diversity in Berlin? Insights from a spatial modeling approach. Eur J Geogr 11(1):163–187
Blokland T, Vief R (2021) Making Sense of Segregation in a Well-Connected City: The Case of Berlin. Urban Socio-Economic Segregation and Income Inequality, 249
Marcińczak S, Bernt M (2021) Immigration, segregation and neighborhood change in Berlin. Cities, 103417
Masías V, Stier J, Navarro P, Valle MA et al. (2023) A novel methodological approach for analyzing the multifaceted phenomenon of residential segregation: the case of Berlin. Cities 141:104465
Arandelovic B, Bogunovich D (2014) City profile: Berlin. Cities 37:1–26
Gosnell HF, Schmidt MJ (1936) Factorial and correlational analysis of the 1934 vote in Chicago. J Am Stat Assoc 31(195):507–518
Price DO (1941) Factor analysis in the study of metropolitan centers. Soc Forces 20:449
Sweetser FL (1965) Factorial ecology: Helsinki, 1960. Demography 2(1):372–385
Benassi F, Bonifazi C, Heins F, Lipizzi F, Strozza S (2020) Comparing residential segregation of migrant populations in selected European urban and metropolitan areas. Spat Demogr 8:269–290
Olteanu M, Hazan A, Cottrell M, Randon-Furling J (2020) Multidimensional urban segregation: toward a neural network measure. Neural Comput Appl 32:18179–18191
Dmowska A, Stepinski TF (2023) Spatiotemporal changes in racial segregation and diversity in large US cities from 1990 to 2020: a visual data analysis. EPJ Data Sci 12(1):30
Yamamoto K (1983) Dynamics of population and spatial segregation in Munich. Keizai Shirin (The Hosei University Economic Review) 50(3/4):1–59
Statistisches Bundesamt (2005) Bevölkerung und Erwerbstätigkeit: Bevölkerung mit Migrationshintergrund. Wiesbaden, Germany
Groß M, Kreutzmann AK, Rendtel U, Schmid T, Tzavidis N (2020) Switching Between Different Non-Hierachical Administrative Areas via Simulated Geo-Coordinates: A Case Study for Student Residents in Berlin. J Off Stat 36(2)
Rendtel U, Ruhanen M (2018) Die Konstruktion von Dienstleistungskarten mit Open Data am Beispiel des lokalen Bedarfs an Kinderbetreuung in Berlin. AStA Wirtsch Sozialstat Arch 12(3):271–284
Erfurth K, Groß M, Rendtel U, Schmid T (2022) Kernel density smoothing of composite spatial data on administrative area level. AStA Wirtsch Sozialstat Arch 16(1):25–49
Erfurth K, Groß M, Rendtel U, Schmid T (2022) Kernel density smoothing of composite spatial data on administrative area level: a case study of voting data in Berlin. AStA Wirtsch Sozialstat Arch 16(1):25–49
Geofabrik GmbH (2020) OpenStreetMap Daten für Berlin. Available from http://download.geofabrik.de/europe/germany/berlin-latest-free.shp.zip
Crespo F, Weber R (2005) A methodology for dynamic data mining based on fuzzy clustering. Fuzzy Sets Syst 150(2):267–284
Li RP, Mukaidono M (1995) A maximum-entropy approach to fuzzy clustering. In: Proceedings of IEEE international conference on fuzzy systems, vol 4. IEEE, pp 2227–2232
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
Kirk A (2016) Bump chart. SAGE Publications, Newbury Park
Martyr K (2020) China asylum claims to Germany more than double. DW News. Available from https://www.dw.com/en/china-asylum-claims-to-germany-more-than-double/a-52396720
Nakagawa S (1990) Changing segregation patterns by age group in the Tokyo metropolitan area - from the viewpoint of migration with cohort analysis. Geogr Rev Jpn, Ser B 63(1):34–47
Dietrich E, Köller R, Koopmans R, Höhne J (2011) Zweiter Integrationsindikatorenbericht. erstellt für die Beauftragte der Bundesregierung für Migration, Flüchtlinge und Integration (Stand Dezember 2011). Köln/ Berlin: Die Beauftragte der Bundesregierung für Migration, Flüchtlinge und Integration
Seuberlich M (2021) Statistisches Bundesamt/Statistische Landesämter. In: Andersen U, Bogumil J, Marschall S, Woyke W (eds) Handwörterbuch des politischen Systems der Bundesrepublik Deutschland. Springer, Wiesbaden, pp 880–884
OECD (2021) International Migration Outlook 2021. OECD. Available from https://doi.org/10.1787/29f23e9d-en
Acknowledgements
We would like to thank the Department Migration, Integration, Transnationalization at the Social Science Research Center Berlin (WZB) for their academic support. Fernando Crespo acknowledges the support of Fondecyt Chile 1221562. We acknowledge the four anonymous reviewers. Any remaining errors are our own.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
The collaborative project included contributions from Víctor H. Masías H., who worked on the conceptualization, methodology, formal analysis, investigation, interpretation of results, writing - original draft, visualization and project management. Julia Stier contributed to the conceptualization, investigation, writing - original draft and project management. Pilar Navarro R. focused on the methodology and writing - original draft. Mauricio A. Valle, Augusto Vargas and Sigifredo Laengle worked on researching, writing - original draft and writing - proofreading and editing. Lastly, Fernando Crespo worked on the writing - original draft, methodology and formal analysis. The authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: Cluster size over time
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Masías H., V.H., Stier, J., Navarro R., P. et al. Evolving demographics: a dynamic clustering approach to analyze residential segregation in Berlin. EPJ Data Sci. 13, 21 (2024). https://doi.org/10.1140/epjds/s13688-024-00455-4
DOI: https://doi.org/10.1140/epjds/s13688-024-00455-4