Evolving demographics: a dynamic clustering approach to analyze residential segregation in Berlin

This paper examines residential segregation in Berlin over time using a dynamic clustering approach. Previous research has examined residential segregation in Berlin statically, i.e. not over time, and at a high level of spatial and temporal aggregation. We propose a methodology to investigate the existence of clusters of residential areas according to migration background, age group, gender, and socio-economic dimension over time. To this end, we have developed a sequential mixed methods approach that includes a multivariate kernel density estimation technique to estimate the density of subpopulations and a dynamic cluster analysis to discover spatial patterns of residential segregation over time (2009-2020). The dynamic analysis shows the emergence of clusters on the dimensions of migration background, age group, gender and socio-economic variables. We also identified a structural change in 2015, resulting in a new cluster in Berlin that reflects the changing distribution of subpopulations with a particular migratory background. Finally, we discuss the findings of this study in relation to previous research and suggest possibilities for policy applications and future research using a dynamic clustering approach for analyzing changes in residential segregation at the city level.


Introduction
This manuscript examines the phenomenon of residential segregation in Berlin from a dynamic perspective, using data science to identify patterns in its human geography.
Previous research on residential segregation in Berlin has analyzed its different dimensions. For example, research has focused on residential segregation driven by ethnicity; residential segregation driven by age group [1,6,7,12,20,21,23,24,29]; residential segregation driven by gender [4,8,15]; segregation driven by socio-economic factors [2, 3, 10, 12-15, 17, 18, 20, 23, 27, 29]; as well as residential segregation driven by digital segregation [24]. Demography, economics, sociology, geography, and ethnographic studies, among other disciplines, have explored all these dimensions. They all support the notion that there is an uneven, clustered, or patchy spatial distribution of subpopulations in residential areas of Berlin. We focus on the case of Berlin, a city in which historical events have changed both the city and its society. First, we are motivated by the Berlin case because a large body of research examines changes in residential segregation before and during the fall of the Berlin Wall. In fact, before the fall of the Wall, studies were based exclusively on data from West Berlin, as no unified public statistics were available. Then, after the fall of the Wall, data from both parts of the city became available for the first time, and research focused on understanding the structural changes that occurred as a result of the city's reunification. Finally, Berlin has served as a reference point for comparative studies of residential segregation in other German cities in the context of immigration policy. Further details on the historical and research development of the phenomenon of residential segregation in Berlin can be found in [3,29,30].
Second, from a conceptual point of view, residential segregation occurs over time and space. However, to the best of our knowledge, no research on Berlin has carried out an analysis that includes both components. For example, in Helbig's research on residential segregation [23], the author conducts a time series analysis but does not include the geographical dimension of residential areas. Another example is the work of Marcińczak and Bernt [28], who use hierarchical clustering on temporal data, which does not allow the emergence of new clusters or the disappearance of clusters over time to be identified. New methodological approaches are therefore needed to study the dynamics of residential segregation.
Third, we are motivated to explore the possible impact of the 2015 European migrant crisis on residential segregation in Berlin. Because Berlin received the highest number of refugees in Germany during the crisis, we are interested in examining whether the migration process led to the creation or elimination of clusters that shape the demographic composition of residential areas. The types of change may include, for example, changes in the number of clusters over time (i.e. macrodynamics) and internal changes in subpopulations over time and space (i.e. microdynamics). Past studies looking at this period have not reported structural or internal changes within clusters; one possible explanation is the use of aggregate data or static clustering methods.
To reveal both structural and internal changes within clusters, we propose an exploratory analysis based on data science using a dynamic clustering algorithm. The objectives of this paper are:
• Estimate population density according to dimensions such as migration background, age group, gender, and socio-economics.
• Dynamically determine the number of clusters over time according to dimensions such as migration background, age group, gender, and those reflecting the socio-economic conditions.
• Identify structural as well as intra-cluster changes over time.
• Determine the variables that are over- or under-represented for each cluster in a given year.
As a result, our approach has led to discoveries about residential segregation. First, at a macro or structural level, a new cluster emerged in 2015, which we interpret as a result of the so-called European migrant crisis. Second, at a micro or intra-cluster level, we report which subpopulations are under- or over-represented in each cluster over time, revealing a rich dynamic of change in the city. By applying data science principles, it is possible to explore the phenomenon of residential segregation in an unsupervised and dynamic way. The contribution of this research is to present, for the first time, a dynamic analysis of the existing clusters in Berlin. In other words, it allows us to dynamically determine the number of clusters and the attributes that are most important over time in this analytical context. In this sense, an approach based on data science offers a huge field of application for the identification of changes in the city.
The remainder of the article is organized as follows. Section 2 provides an overview of previous research. Section 3 presents the methodological approach used to estimate population densities in residential areas and perform dynamic clustering based on calculating different subpopulations in Berlin. Section 4 presents the results, the clusters found, the criteria used to validate them, and the interpretation of the results. Finally, Section 5 presents a discussion and conclusions based on the research objectives.

Literature review
Data science is used to address complex problems related to sociological, economic and demographic factors. In particular, it is used to study residential segregation using unsupervised approaches. Multivariate and unsupervised methods are often preferred because there is no single view or way of quantifying residential segregation, and there is no baseline or ground truth for conducting supervised analyses.
A range of methodological approaches from the field of data science have been employed to study residential segregation. Spatial concentration patterns have been studied for a long time using a factorial approach [31,32], which is now better known as factorial ecology [33]. Modern urban data science approaches also use this method. For instance, Benassi et al. [34] developed a composite index using multiple principal component analyses, reviving this approach. Recently, unsupervised machine learning methods have been employed to recognize patterns of residential segregation. For example, Olteanu-Raimond et al. [35] used traditional self-organizing maps, a type of neural network, to identify emerging patterns. Other researchers (see for example [36]) have used data science to improve the visualisation of changes in segregation and diversity in 61 major US cities between 1990 and 2020. Finally, Masías et al. [29] used unsupervised algorithms commonly used in image processing and remote sensing to generate visualizations and human-understandable information, based on concepts of cognitive psychology.
Approaching the study of residential segregation from a data science perspective, taking into account its spatial and temporal dimensions, is a multidimensional problem. In the case of Berlin, for example, several dimensions of residential segregation have been studied. These include the ethnic dimension, which has been studied in Germany under the concept of migration background; age or age-group segregation [1,6,7,12,20,21,23,24,29], which corresponds to the fact that different age groups are clustered in different parts of the city; gender segregation [4,8,15], which is a phenomenon associated with unbalanced gender ratios across space; social or socio-economic segregation [2, 3, 10, 12-15, 17, 18, 20, 23, 27, 29], where people are grouped with others with similar socio-economic characteristics, reflecting their economic opportunities; and the digital segregation dimension [24], which attempts to map access to social media and other digital technologies. The different emphases of some previous publications are summarised below (see Table 1).
As seen in Table 1, most previous studies focus on the ethnic aspect of residential segregation. The study of segregation by age group is the second most common. The third most studied aspect is social segregation. Finally, the least researched aspects are gender segregation and digital segregation. However, as can be seen, previous research has also studied more than one dimension at a time. Among the works cited, we would like to highlight the following:
• Kemper [6,7], Arin [3], and Yamamoto [5,37] contribute to a conceptual, empirical and historical understanding of the emergence of ethnic residential segregation in Berlin from a geographical and economic perspective.
• Nakagawa [4,8] and Kröhnert [15] understand gender segregation as a consequence of ongoing migration processes within Germany from a socio-demographic perspective.
• Although Helbig's [23] contribution does not consider the demographic spatial dimension of residential areas (i.e. estimates of population density across residential areas), his work emphasizes the temporal dimension of residential segregation.
• Marcińczak [28] conducted a cluster analysis of residential segregation in Berlin, using hierarchical cluster analysis to examine several years of demographic data. However, no dynamic analysis of cluster formation in Berlin was carried out, because the clustering method used is static. Furthermore, this work is guided by a predefined interpretation of the clusters.
• Kurtenbach's key study [24] explored digital segregation in Berlin using data from a social media service designed to organise community life in neighbourhoods.
• Finally, innovative techniques for estimating population densities in residential areas in Berlin have been developed. For example, Groß [21] has developed methods for estimating anonymized spatial densities at a higher resolution. Building on this work, Masías et al. [29] have used non-negative matrix factorization to study different facets of residential segregation.
Previous studies on the city of Berlin have adopted a non-dynamic approach. The lack of dynamic clustering methods has led other researchers to use, for example, hierarchical cluster analysis, which does not allow the temporal aspect to be taken into account.
In this context, we aim to perform a data analysis that has the advantage of not being a black box. It allows for a more direct interpretation of the clusters and takes the existing dynamics into account, which is not the case with factor analysis methods or black-box machine learning methods.

Methodological approach
The proposed methodological approach is three-fold: first, the spatial density of diverse subpopulations over Berlin, excluding non-residential areas, is estimated using a Multivariate Kernel Density Estimation; second, the spatial densities estimated in the first step are analyzed via a Dynamic Fuzzy C-Means clustering method; finally, human-readable information about the composition of the clusters and their interpretation is generated. The flow chart in Fig. 1 summarizes the methodological approach followed throughout this work.

Data source
The register of residents (Einwohnerregister) from 2009 to 2020, available at the Statistical Office of Berlin-Brandenburg (see www.statistik-berlin-brandenburg.de), was used as the input for this step. We only used the information regarding the migration background, age group, gender, and socio-economic demographics for each LOR spatial planning area (LOR stands for lebensweltlich orientierte Räume).

Figure 1 Methodological steps for the dynamic analysis of residential segregation phenomena
As an indirect measure of the ethnic dimension of residential segregation, the category of migration background has often been used in German sociology. It was first defined in 2005, when it was used in the microcensuses. The official definition used in 2005 is as follows: an individual with a migrant background is defined as "all migrants who entered the current territory of the Federal Republic of Germany after 1949, and all foreigners born in Germany and all those born in Germany as Germans with at least one parent who immigrated to Germany or who was born as a foreigner in Germany" [38, p.6]. In this context, the migrant background is instead referred to as a statistical category based on citizenship and an indirect record of the place of birth of the individual's parents.
Information on the demographic distribution by age group and gender in each LOR area for each year was also used. To estimate the density of males and females in the city of Berlin, we used data on the sex of individuals and information on the number of people in a given age group living in a given LOR. Sex ratios vary from country to country, and international or internal migration processes may produce skewed age groups that differ from the destination population. As this phenomenon has previously been reported in Berlin (see [15]), we explore the possibility of residential segregation by gender.
Finally, people experiencing economic hardship in Germany are entitled to receive social benefits as defined in the Second and Third Book of the Social Code (SGB II and SGB III). In principle, any EU or non-EU citizen with a valid residence permit is entitled to SGB II and SGB III benefits after working in Germany for at least one year.
The SGB II (Sozialgesetzbuch Zweites Buch) and the SGB III (Sozialgesetzbuch Drittes Buch) are the two most fundamental laws of the German social security system. SGB II, or "Hartz IV", deals with social benefits for unemployed or low-income persons. SGB II also regulates the payment of unemployment benefits and social assistance. SGB III deals with employment promotion, vocational training, and education. It is aimed at helping people to find and keep a job and to improve their vocational skills. SGB III provides various measures to support job seekers, such as job counselling, placement services, and vocational training programs. Funding provisions and support for companies to create jobs and train their employees are also included.
In summary, while SGB II focuses on providing financial assistance to those in need, SGB III aims to promote employment and vocational training. For this paper, the number of people who obtained benefits under SGB II and SGB III in a given year and city location is considered a proxy measure of social or socio-economic segregation.

Multivariate kernel density estimation in the presence of measurement error
The spatial density of inhabitants is estimated following the work of [39], where a method is proposed to estimate the population density over areas with arbitrary shapes. That method is, in turn, based on a previous publication by the same author in which demographic estimates from rectangular spatial grids of different sizes are computed while introducing measurement error and data anonymization [21]. Other noteworthy areas of application of the present method are the estimation of ethnic minority settlement areas [21], regional childcare demand estimates [40], regional election analyses [41], and estimates of the incidence of Coronavirus infections over time and space [42].
In the present study, Berlin is divided into spatial units (Planungsräume) whose centroids provide the spatial coordinates (measured in degrees). The technique of [39] is then applied using the LOR (Lebensweltlich-Orientierte Räume) areas on the aggregated number of inhabitants with distinct migratory origins, age, gender, and socio-economic conditions living in each of those spatial units. To obtain corrected density estimates, the non-residential areas were excluded from the analysis (see [43]).
The model used in this work to estimate the corrected spatial density from heaped data (i.e., the arbitrary aggregation of data in a spatial area) in polygons of an arbitrary shape is based on a non-parametric estimation method: the multivariate kernel density estimation technique. This approach estimates, from a finite sample, the joint probability density function of two or more continuous random variables. In simpler words, it is used to estimate the distribution or spread of the data across more than one dimension when only a finite number of data points are available.
Let $X = \{X_1, X_2, \ldots, X_n\}$ be a sample from a multivariate random variable whose probability distribution is described by the unknown density function $f(x)$ to be estimated. Each observation is two-dimensional in our case, i.e., $X_i = (X_{i1}, X_{i2})$, $i = 1, \ldots, n$, where $X_{i1}$ and $X_{i2}$ are the longitude and latitude coordinates, respectively, and $X$ is the set containing all the available spatial coordinates. Then, the multivariate kernel density estimate at the two-dimensional point $x$ is defined to be:

$$\hat{f}_H(x) = \frac{1}{n\,|H|^{1/2}} \sum_{i=1}^{n} K\left(H^{-1/2}(x - X_i)\right), \tag{1}$$

where:
• $|\cdot|$ denotes the determinant.
• $K(\cdot)$ is the kernel, a symmetric multivariate density function. This function assigns weights to the observed data points based on their distance from the point where we want to estimate the density. We use the standard multivariate normal kernel, i.e., $K(u) = (2\pi)^{-d/2} e^{-\frac{1}{2} u^T u}$, so that $K(H^{-1/2}(x - X_i)) = (2\pi)^{-d/2} e^{-\frac{1}{2} (x - X_i)^T H^{-1} (x - X_i)}$.
• $H$ is the $d \times d$ bandwidth matrix ($d = 2$ in our study), which is symmetric and positive definite. It controls the window size in each dimension over which the kernel function operates. A small bandwidth will result in a density estimate that is very sensitive to the data (potentially too sensitive, resulting in over-fitting). In contrast, a large bandwidth may smooth out important features of the data (under-fitting). Therefore, the choice of $H$ is critically important for the accuracy of the kernel density estimates. The selection of the bandwidth matrix is widely discussed in the literature. Here, we use the approach of Wand and Jones, as is done in [21].
In short, a function that returns high values for points close to the data point and low values for points far away, the multivariate kernel, is created at each data point. The final density estimate at point $x$ is the average of the contributions from all these kernel functions centered at each data point $X_i$. In this way, the density is high where many data points are close together and low where the data points are spread out.
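As an illustration of the estimator described above, the following is a minimal sketch of a bivariate Gaussian kernel density estimate with a full bandwidth matrix H, assuming the usual normalisation by the square root of the determinant of H. The function name `gaussian_kde_2d`, the toy coordinates and the bandwidth values are hypothetical choices made for this example; in particular, H is fixed by hand here rather than selected with the Wand and Jones approach.

```python
import math

def gaussian_kde_2d(x, samples, H):
    """Bivariate Gaussian kernel density estimate at point x.

    x       -- (longitude, latitude) where the density is evaluated
    samples -- list of (longitude, latitude) observations X_1, ..., X_n
    H       -- symmetric positive-definite 2x2 bandwidth matrix ((a, b), (b, c))
    """
    (a, b), (_, c) = H
    det = a * c - b * b
    if a <= 0 or det <= 0:
        raise ValueError("H must be symmetric positive definite")
    # Entries of H^{-1} for the 2x2 case.
    ia, ib, ic = c / det, -b / det, a / det
    # Normalising constant (2*pi)^(-d/2) * |H|^(-1/2) with d = 2.
    norm = 1.0 / (2.0 * math.pi * math.sqrt(det))
    total = 0.0
    for sx, sy in samples:
        dx, dy = x[0] - sx, x[1] - sy
        # Quadratic form (x - X_i)^T H^{-1} (x - X_i).
        q = ia * dx * dx + 2.0 * ib * dx * dy + ic * dy * dy
        total += math.exp(-0.5 * q)
    return norm * total / len(samples)

# Hypothetical toy coordinates (not taken from the register data).
points = [(13.40, 52.52), (13.41, 52.52), (13.45, 52.50)]
H = ((1e-4, 0.0), (0.0, 1e-4))  # assumed bandwidth, chosen by hand
dense = gaussian_kde_2d((13.405, 52.52), points, H)
sparse = gaussian_kde_2d((13.60, 52.40), points, H)
```

In this toy setting, `dense` is evaluated between two nearby observations and `sparse` far from all of them, so the first value is the larger one, mirroring the behaviour described in the text.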
Since the data are spatially aggregated for each area of the city, and exact coordinates are not available, we use the approach of Groß et al. [21], which introduces measurement error to produce estimates of population density while anonymising the sensitive data. Formally, the actual values $X = \{X_1, X_2, \ldots, X_n\}$ are unknown, and only the aggregated values over each area can be utilized, which are denoted by $W = \{W_1, W_2, \ldots, W_n\}$. Each $W_i$ can be seen as a measurement, with an introduced error, of the actual coordinates of individual $i$, where $i = 1, \ldots, n$. The objective is to estimate the density $f(x)$, from which $X$ is drawn, using only the values $W_i$.
A naive kernel density estimator, which would use the aggregated values as the real coordinates in Equation (1), may lead to a spiky density far from the actual density of the true data. This effect becomes more noticeable as the sample size increases. Therefore, a model that accounts for the measurement error must be used. Under the assumption that the anonymization process is known, a measurement error model for $W$ can be defined as

$$\pi(W \mid X) = \prod_{i=1}^{n} \pi(W_i \mid X_i), \qquad \pi(W_i \mid X_i) = \begin{cases} 1 & \text{if } X_i \in \mathrm{area}(W_i), \\ 0 & \text{otherwise,} \end{cases} \tag{2}$$

where $\pi(W \mid X)$ refers to the conditional distribution of $W$ given $X$, and $\mathrm{area}(W_i)$ is the set of coordinates that lie within the area to which $W_i$ belongs. Using Bayes' theorem, $\pi(X_i \mid W_i) \propto \pi(W_i \mid X_i)\,\pi(X_i)$ (i.e., the probability of $X_i$ given $W_i$ is proportional to the probability of $W_i$ given $X_i$ times the probability of $X_i$), pseudo-samples of $X_i$ can be drawn from $\pi(X_i \mid W_i)$, which are used to estimate the density function $f(x)$. In particular, following an iterative procedure, $X_i$ is drawn from the known conditional distribution $\pi(W_i \mid X_i)$ using $\pi(X_i)$ as a weight. Since $f(X_i)$ is unknown, and thus $\pi(X_i)$ as well, the multivariate kernel density estimator $\hat{f}_H(x)$ defined in Equation (1) is used instead. At the beginning of the procedure, an estimate $\hat{f}_H^{(0)}(x)$ is calculated according to Equation (1) from the artificial geo-coordinates $W_i$. After drawing the pseudo-samples, the multivariate kernel density estimator is applied to these samples to estimate the density function $\hat{f}_H^{(1)}(x)$. In the following iterations, the density estimate $\hat{f}_H^{(N+1)}(x)$ is recalculated using the pseudo-samples drawn in the previous iteration $N$. In this way, the pseudo-samples provide a way to fill in the information lost due to data aggregation, and the density estimate is refined in each iteration. For more details on the steps of the algorithm, see [21].
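The iterative procedure can be sketched as follows. This is only an illustration under strong simplifying assumptions that are ours and not part of the method in [21]: areas are axis-aligned rectangles instead of arbitrary polygons, the bandwidth is a fixed isotropic value `h` instead of being re-estimated, and pseudo-samples are obtained by weighted resampling of uniformly drawn candidate points. All function and parameter names are hypothetical.

```python
import math
import random

def kde(x, pts, h):
    """Isotropic bivariate Gaussian KDE (H = h^2 * I) evaluated at point x."""
    s = sum(math.exp(-((x[0] - px) ** 2 + (x[1] - py) ** 2) / (2.0 * h * h))
            for px, py in pts)
    return s / (len(pts) * 2.0 * math.pi * h * h)

def draw_pseudo_samples(areas, counts, current_pts, h, rng, n_candidates=50):
    """One iteration: redraw the location of every individual inside its area.

    Candidates are uniform inside the area, so pi(W_i | X_i) = 1 for each of
    them; they are then resampled with weights given by the current density
    estimate, which plays the role of pi(X_i).
    """
    new_pts = []
    for (xmin, ymin, xmax, ymax), n in zip(areas, counts):
        cands = [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
                 for _ in range(n_candidates)]
        weights = [kde(c, current_pts, h) for c in cands]
        new_pts.extend(rng.choices(cands, weights=weights, k=n))
    return new_pts

def refine_density(areas, counts, h, n_iter=5, seed=0):
    """Return pseudo-samples after n_iter refinement iterations."""
    rng = random.Random(seed)
    # Iteration 0: every individual is placed at the centroid of its area.
    pts = []
    for (xmin, ymin, xmax, ymax), n in zip(areas, counts):
        pts.extend([((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)] * n)
    for _ in range(n_iter):
        pts = draw_pseudo_samples(areas, counts, pts, h, rng)
    return pts

# Hypothetical example: two adjacent unit squares with 30 and 5 inhabitants.
areas = [(0.0, 0.0, 1.0, 1.0), (1.0, 0.0, 2.0, 1.0)]
counts = [30, 5]
pseudo = refine_density(areas, counts, h=0.5, n_iter=3)
```

The returned pseudo-samples stay inside their own areas, while their positions within each area are pulled towards neighbouring regions of high estimated density; a final call to `kde` on `pseudo` would give the refined density estimate.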

Dynamic fuzzy c-means
This dynamic clustering algorithm, presented by Crespo and Weber [44] in 2005, relies on updating the structure of the current clusters based on relevant changes in the dynamic data. The period between the creation of a cluster structure and its update is called a cycle, and its definition makes it possible to adapt the methodology to any probabilistic clustering algorithm, i.e., any clustering algorithm that determines degrees of membership. The degree of membership of an item to the clusters is used to identify changes in the structure of the clusters.
Changes in the structure of the clusters can be the creation of new clusters, the elimination of clusters, or the movement of the centers of the clusters. The following are the basic steps of the Dynamic Fuzzy C-Means:
• 1. Run the fuzzy c-means algorithm using the initial data set.
• 2. Receive new data and merge it with the current data.
In what follows, a detailed description of the mathematical aspects of the algorithm used here is provided. Let $X_0$ be the initial data set and $X_1, X_2, \ldots, X_t$ be the new datasets the algorithm receives in each cycle $t > 0$. In the beginning, the traditional fuzzy c-means algorithm is run on the first data set, $X_0$, with $c \geq 2$ clusters and fuzzifier $m > 1$, so that it produces $c$ clusters with their respective centers $v_j$, for $j = 1, 2, \ldots, c$, and the membership matrix $M^0_{n \times c}$, where $n$ is the number of data points in $X_0$. The components of this membership matrix are the membership degrees, i.e., its component at position $(i, j)$, $i = 1, \ldots, n$, $j = 1, 2, \ldots, c$, is the membership degree $\mu_{i,j}$ of the data point $x_i \in X_0$ to cluster $j$.
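For reference, the initial step can be illustrated with a minimal sketch of the standard fuzzy c-means iteration, which alternates between the membership update and the center update. The function names, the default fuzzifier of 2, the stopping tolerance and the random initialisation from the data points are assumptions of this example, not settings reported in this work.

```python
import math
import random

def memberships(points, centers, m=2.0):
    """Standard fuzzy c-means membership degrees mu[i][j] of point i in cluster j."""
    power = 2.0 / (m - 1.0)
    mu = []
    for p in points:
        d = [math.dist(p, v) for v in centers]
        if 0.0 in d:
            # The point coincides with at least one center: share full
            # membership among those centers.
            hits = [1.0 if dj == 0.0 else 0.0 for dj in d]
            mu.append([h / sum(hits) for h in hits])
        else:
            mu.append([1.0 / sum((dj / dk) ** power for dk in d) for dj in d])
    return mu

def fuzzy_c_means(points, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Return (centers, membership matrix) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, c)  # assumed initialisation: c distinct data points
    dim = len(points[0])
    for _ in range(n_iter):
        mu = memberships(points, centers, m)
        new_centers = []
        for j in range(c):
            w = [row[j] ** m for row in mu]
            total = sum(w)
            new_centers.append(tuple(
                sum(wi * p[k] for wi, p in zip(w, points)) / total
                for k in range(dim)))
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)

# Hypothetical toy data: two well-separated groups of three points each.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_centers, toy_mu = fuzzy_c_means(toy, c=2)
```

Each row of `toy_mu` sums to one, and a point is assigned to the cluster in which its membership degree is largest.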
Let $X_t$ be the new data chunk arriving into the dataset at cycle $t > 0$, which could produce changes in the current structure of the clusters because it contains data points that are not well classified by the current clusters. Let $c_t$ be the number of clusters at cycle $t$, $n_t$ the number of objects in the dataset $X_t$, and $i = 1, \ldots, n_t$ the index of the new objects. To identify the data points producing changes, the following must be calculated:
• the pair-wise distance $d(v_j, v_k)$ between each pair of the current centers $v_j$ and $v_k$, for all $j, k = 1, 2, \ldots, c_t$;
• the distance $d(x_i, v_j)$ between the new data point $x_i \in X_t$ and the current centers $v_j$, for all $i = 1, \ldots, n_t$ and $j = 1, 2, \ldots, c_t$;
• the membership degree $\mu_{i,j}$ of the new object $x_i \in X_t$ to the cluster $j$, for all $i = 1, \ldots, n_t$ and $j = 1, 2, \ldots, c_t$.
Then, the conditions shown in Equation (3) and Equation (4) must be evaluated on the new data to detect objects of $X_t$ that are incorrectly assigned to the current clusters, i.e., those objects that would involve a change in the current structure. These conditions depend on a threshold parameter $\alpha > 0$, which is fixed beforehand by the decision maker or dynamically determined by the algorithm.
The two conditions above are used to define the indicator function in Equation (5), which is equal to one if, and only if, the data point $x_i \in X_t$ is correctly classified by the current structure. If at least one new data object is not well classified, the criterion defined in Equation (6) is applied to decide whether new clusters should be created or whether, conversely, moving the current centers is sufficient. In this criterion, $\beta \in [0, 1]$ is another threshold parameter that can be fixed beforehand or adjusted dynamically, and $|\cdot|$ represents the number of elements of a set. Whenever the condition defined in Equation (6) is fulfilled, new clusters must be created; otherwise, it is enough to update the centers of the current clusters.
If many new objects cannot be correctly assigned to the current clusters, i.e., new clusters are to be created, the optimum number of new clusters has to be determined. To do so, we select the number of clusters that maximizes the structure strength [45], as is done in the original paper [44]. Nevertheless, any other procedure could be used to find the new number of clusters. Once the optimum number is determined, the fuzzy c-means algorithm is run from scratch using that number on the total dataset.
In other cases, when it is sufficient to move the current centers of the clusters, the current centers are combined with those representing the new data. The cluster centers representing only the new data are calculated using Equation (7) and combined with the previous centers as defined in Equation (8), where $\lambda_j$ is determined by Equation (9) (see [44]). Note that a data point $x_i$ is assigned to a cluster $C_j$ if and only if $j = \arg\max_{k = 1, 2, \ldots, c_t} \{\mu_{i,k}\}$, where $C_j$ is the set of data points that belong to cluster $j$, for all $j = 1, 2, \ldots, c_t$.
As a last step of the algorithm, a cluster is deleted if it has gone a predefined number of cycles, $T$, without receiving new objects. For this purpose, each cluster has a counter that records the number of cycles it has been active without any update. When the counter reaches the value $T$, the cluster is deleted by removing its center and all the data belonging to it from the data set.
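The decision logic of one cycle can be illustrated with the sketch below. The explicit conditions and update formulas are those of [44] and are not restated here, so the tests used in the code are assumed stand-ins chosen only to match the verbal description: a new object is flagged when all its membership degrees lie within α of 1/c and it is farther from every center than half the smallest distance between centers; new clusters are requested when the flagged share of the new data reaches β; otherwise each center is moved by a convex combination whose weight is the share of membership mass contributed by the new data. All names are hypothetical.

```python
import math

def memberships(p, centers, m=2.0):
    """Fuzzy c-means membership degrees of a single point p."""
    d = [max(math.dist(p, v), 1e-12) for v in centers]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in d) for dj in d]

def poorly_represented(p, centers, alpha, m=2.0):
    """Assumed check: memberships close to 1/c AND p far from every center."""
    c = len(centers)  # requires c >= 2
    mu = memberships(p, centers, m)
    min_gap = min(math.dist(centers[j], centers[k])
                  for j in range(c) for k in range(j + 1, c))
    fuzzy = all(abs(u - 1.0 / c) <= alpha for u in mu)
    far = all(math.dist(p, v) > 0.5 * min_gap for v in centers)
    return fuzzy and far

def update_step(old_points, new_points, centers, alpha, beta, m=2.0):
    """Return ("new_clusters", centers) or ("move_centers", updated_centers)."""
    flagged = [poorly_represented(p, centers, alpha, m) for p in new_points]
    if sum(flagged) / len(new_points) >= beta:
        # The caller would now search for a new number of clusters and
        # re-run fuzzy c-means on the merged data.
        return "new_clusters", centers
    kept = [p for p, bad in zip(new_points, flagged) if not bad]
    updated = []
    for j, v in enumerate(centers):
        w_old = sum(memberships(p, centers, m)[j] ** m for p in old_points)
        w_new = [memberships(p, centers, m)[j] ** m for p in kept]
        mass = sum(w_new)
        if mass == 0.0:
            updated.append(tuple(v))
            continue
        # Center of the new data only, then convex combination with the old center.
        v_new = tuple(sum(w * p[k] for w, p in zip(w_new, kept)) / mass
                      for k in range(len(v)))
        lam = mass / (w_old + mass)  # assumed weighting of the new data
        updated.append(tuple((1.0 - lam) * a + lam * b for a, b in zip(v, v_new)))
    return "move_centers", updated

# Hypothetical example with two existing clusters on a line.
old = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (9.0, 0.0)]
cs = [(0.0, 0.0), (10.0, 0.0)]
near = update_step(old, [(0.5, 0.0), (0.0, 0.5)], cs, alpha=0.1, beta=0.5)
far = update_step(old, [(5.0, 19.0), (5.0, 20.0), (5.0, 21.0)], cs, alpha=0.1, beta=0.5)
```

In this toy case, the new objects close to the first center only shift that center slightly, whereas the three objects equidistant from and far away from both centers trigger the request for new clusters.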

Cluster interpretation
To characterize a cluster with numerical variables, value-tests (v-tests) are computed for each variable $X$ using the following statistic:

$$v = \frac{\bar{X}_C - \bar{X}}{\sqrt{s^2 / n_C}}, \tag{10}$$

where $\bar{X}$ is the mean of the variable $X$ in the entire dataset, $\bar{X}_C$ is the mean of $X$ within the cluster $C$, $n_C$ is the number of objects in $C$, and $s^2$ is the global variance of $X$. The statistic follows a Student's t-distribution with $n_C - 1$ degrees of freedom, denoted by $t_{n_C - 1}$.
The v-test indicates which variables characterize the clusters. If the absolute value of the statistic in Equation (10) for a variable X in a cluster C is larger than 1.96, then the variable is interpreted as characterizing the cluster. Additionally, the larger the absolute value of the statistic, the better that variable characterizes the cluster, and the sign of the test indicates whether the variable is under-represented (i.e., a negative sign) or over-represented (i.e., a positive sign) in the given cluster, in comparison with all the data available for a given year. This statistic is very intuitive, as specific subpopulations may be over- or under-represented when all are compared at the city level.
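A minimal sketch of this computation is given below. It assumes the simple form of the statistic built only from the quantities named in the text (cluster mean, global mean, global variance and cluster size), with the global variance taken in its population form; the function names and the toy values are hypothetical.

```python
import math
from statistics import mean, pvariance

def v_test(values, in_cluster):
    """v-test of one variable for one cluster.

    values     -- the variable over all spatial units in a given year
    in_cluster -- booleans marking the units that belong to the cluster
    """
    cluster_vals = [v for v, flag in zip(values, in_cluster) if flag]
    n_c = len(cluster_vals)
    s2 = pvariance(values)  # global variance (population form is an assumption)
    return (mean(cluster_vals) - mean(values)) / math.sqrt(s2 / n_c)

def label(v, critical=1.96):
    """Interpret a v-test value against the two-sided 5% critical value."""
    if v > critical:
        return "over-represented"
    if v < -critical:
        return "under-represented"
    return "not significant"

# Hypothetical densities: ten low-density units and ten high-density units.
values = [0.0] * 10 + [10.0] * 10
flags = [False] * 10 + [True] * 10
v_high = v_test(values, flags)
v_low = v_test(values, [not f for f in flags])
```

Here the cluster made of the high-density units yields a positive value above 1.96 and the complementary cluster the same value with a negative sign, so `label` returns "over-represented" and "under-represented", respectively.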

Results of the multivariate kernel density analysis
The results of the multivariate application of the kernel density estimation method are presented in Table 2, which shows in aggregate form, over the years measured, the mean values of the variables studied (i.e., migration background, age group, gender, and socioeconomic factors), the standard deviation, the minimum, the maximum, and selected percentiles.
As seen in Table 2, the residential densities of the German subpopulations have higher mean values, while those of the Chinese subpopulations have the lowest mean values. It can also be observed that the population densities of individuals with a migrant background from Turkey have a higher standard deviation over the years, meaning there have been changes in the residential densities over time. Subpopulations with a migration background from Vietnam reach the maximum residential density, which can be interpreted as these communities concentrating in specific residential areas, while those from Ukraine reach the minimum residential density. These statistics also show that, over time, Berlin has an average population of young adults aged between 30 and 35, and the subpopulation of elderly people aged over ninety has lower average values.
In addition, the highest values are found in the population aged 50-55 years, which is the population living in common areas of the city. Finally, there are only marginal differences in the distribution between men and women in all the descriptive statistics. However, it should be noted that in some areas of the city, the female population peaks almost twice as high as the male population. Similarly, SGB II and SGB III show similar population densities reflecting socio-economic problems, although SGB II shows slightly higher values in some descriptive statistics.

Cluster validation
The Bezdek partition coefficient, an indicator defined by James Bezdek, was used to validate and quantify the quality of clustering solutions on our time-varying data sets. The Bezdek partition coefficient of a fuzzy c-partition of $n$ data points is defined as [46]:

$$PC = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{2}, \tag{11}$$

where $u_{ik}$ is the membership of object $i$ to cluster $k$, and $c$ is the number of clusters. This index takes the value 1 when the clusters are perfectly differentiated and each object belongs to a single cluster only, and the value $1/c$ when each object belongs equally to every cluster, so that the distinction between the different clusters is undetermined. Therefore, the extreme values of the Bezdek partition coefficient allow the quality of the clustering solution to be evaluated. The partition coefficient also depends on the number of clusters: the more clusters there are, the lower the value of the Bezdek index tends to be, and a value close to $1/c$ means that the clustering is fuzzy. Figure 2 plots the evolution of the Bezdek partition coefficient over the years 2009 to 2020 for the dimensions considered.
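The two extreme values of the coefficient can be checked with a short sketch. The function name and the toy membership matrices are hypothetical; the computation is the mean, over objects, of the sum of squared membership degrees.

```python
def partition_coefficient(U):
    """Bezdek partition coefficient of a membership matrix U.

    U is a list of n rows, one per object, each holding c membership
    degrees that sum to one.
    """
    n = len(U)
    return sum(u * u for row in U for u in row) / n

# Hypothetical membership matrices for n = 3 objects and c = 4 clusters.
crisp = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
pc_crisp = partition_coefficient(crisp)      # perfectly separated clusters
pc_uniform = partition_coefficient(uniform)  # completely fuzzy partition
```

A perfectly crisp partition gives the upper bound of 1, and the completely fuzzy partition gives the lower bound 1/c, here 0.25.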
The dynamic clustering algorithm tries to make decisions that do not worsen the Bezdek partition coefficient too much, so that the partition continues to have a good level of quality. The initial number of clusters was chosen considering the best Bezdek partition index obtained. The period of cycle T was set to 20.
The Bezdek partition coefficient indicates that the dynamic clustering solutions for the age group dimensions and the socio-economic variables improve over time. However, the coefficient behaves differently in the case of dynamic clustering based on migration background and clustering based on gender variables. In the case of dynamic clustering based on variables describing migration origin, the coefficient decreases until 2015, when the dynamic clustering algorithm detects the emergence of a new cluster, which reflects a new cluster structure. After this year, the coefficient improved and remained relatively stable but declined after 2018.
In the case of dynamic clustering based on gender variables, the Bezdek partition coefficient remains relatively stable over the years. It was only in 2017 that the coefficient values started to fall, but this reflects only a certain instability of the clusters. As we will see below, those clusters arise more from changing population densities across the city than from differences in the density of men and women in Berlin's residential areas, which are often remarkably similar.
The Bezdek partition coefficient generally shows that the dynamic clustering solutions obtained improve over time. It also shows that a decrease in the coefficient can signal the emergence of a new cluster structure, as happened here for the migration background variables. Therefore, the solutions obtained from the data represent valid cluster solutions.

Clustering results
To characterize how the clusters change with respect to migration background, age group, gender, or socio-economic variables, we present the clustering results for the years 2009 and 2020 in terms of the Mean in Cluster (MIC, the mean of a given variable in a given cluster) and the v-test. Together they indicate whether a variable is under- or over-represented in a given cluster and year, and whether that difference is statistically significant. Finally, the sizes of the clusters are presented in normalized form (see Fig. 3) and in absolute terms (see the tables in the Appendix).
To visualize the micro changes and trends in the clusters over time, we generated a series of bump charts that show which categories of variables were significant over time in each cluster. A bump chart "shows how quantitative category rankings have changed over time. They are typically structured around a temporal x-axis with equal intervals from the earliest to the latest. Quantitative rankings are plotted using joined-up lines that effectively connect consecutive points positioned along a y-axis (typically top = first)" [47]. After evaluation, each v-test value is assigned a rank, and each variable's ranks for a given year are plotted in descending order. The charts also group the values using a threshold that determines whether a variable is over-represented, under-represented, or not significant in a given cluster. To do this, the charts use the critical values 1.96 and -1.96 (a two-tailed test at the 5% significance level). Therefore:
• If the v-test value is greater than 1.96, the variable is considered to be over-represented in a given cluster.
• If the v-test value is less than -1.96, the variable is considered to be under-represented.
• If the v-test value is between -1.96 and 1.96, the variable is not considered significant.
The bump chart is used here to visualize the micro-dynamics of residential segregation. On the y-axis, the names of the variables are listed according to the value obtained for each year and cluster. This gives an informative view of the composition of the clusters: the y-axis represents the ranking of the variables, the x-axis represents the years, and the connecting lines show how the ranking of the different categories changes over time.
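The thresholding and ranking rules described above can be sketched in a few lines of Python. The function names `classify_v_test` and `rank_variables` are hypothetical, and the sketch covers only the preparation of the values for a bump chart, not the plotting itself. The v-test values in the example are those reported for Cluster 0 of the socio-economic dimension in 2009.

```python
def classify_v_test(v, critical=1.96):
    """Label a v-test value using the two-tailed 5% critical values."""
    if v > critical:
        return "over-represented"
    if v < -critical:
        return "under-represented"
    return "not significant"


def rank_variables(v_tests):
    """Rank variables of one cluster and year by v-test, descending (1 = top)."""
    ordered = sorted(v_tests, key=v_tests.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


# v-test values for Cluster 0 (socio-economic dimension) in 2009.
cluster0_2009 = {"SGB II": 6.638, "SGB III": 15.813}

print(rank_variables(cluster0_2009))            # {'SGB III': 1, 'SGB II': 2}
print(classify_v_test(cluster0_2009["SGB II"]))  # over-represented
```

Repeating the ranking for every year and joining each variable's ranks with a line gives the series that a bump chart displays.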
Results based on the migratory background. In terms of migrant background and clustered residential areas, the city of Berlin has a diverse and mixed population. The dynamic cluster analysis shows seven clusters from 2009 to 2014 and a total of eight clusters from 2015 to 2020. In 2015 a change in the clusters' structure was detected. The change corresponds to the emergence of Cluster 7. From a qualitative point of view, it can be seen that the change in the cluster structure occurred in the same year as the so-called European migration crisis.
Results based on the age group. The dynamic cluster analysis revealed that the population of Berlin is grouped in residential areas in a structure of four different clusters of age groups. Qualitatively, Cluster 3 is located in the city centre, Cluster 2 is located
Cluster 2 has all variables overrepresented during 2009 (see Fig. 6c). For the year 2020, the most overrepresented age group is the 80 to 85-year-olds (MIC = 22.479; v-test = 325.474; p = 0.000), and the least overrepresented group is the 30 to 35-year-olds (MIC = 23.894; v-test = 161.203; p = 0.000). The visualization of the clusters on the map of Berlin is shown in Fig. 7, and the normalized size of clusters over time is shown in Fig. 3b. Interestingly, there is no change in the number of clusters over time, but there is an increase in the overall population density. This means that the cluster structure based on the age group dimension remains stable over the period observed.
Results based on gender. The results of the cluster analysis allowed the identification of 3 clusters.
In general, the cluster analysis shows that the clusters represent the population density in residential areas.In other words, the clusters divided into male and female population densities correspond to Berlin's more or less densified areas.In the case of the clusters, the marginal differences over time are reported below.
The visualization of the clusters on the map of Berlin is shown in Fig. 9, and the normalized size of clusters over time can be seen in Fig. 3c.
Results based on socio-economics. For the socio-economic dimension, the cluster analysis resulted in the identification of four clusters.
From a qualitative point of view, Cluster 3 represents the places with the most significant socio-economic problems. It can be seen that the areas corresponding to clusters
In Cluster 0, both variables are overrepresented and statistically significant (see Fig. 10a). In 2009, SGB III was the most overrepresented (MIC = 13.564; v-test = 15.813; p = 0.000), and SGB II was the least overrepresented (MIC = 12.404; v-test = 6.638; p = 0.000). For the year 2020, SGB III is the most overrepresented (MIC = 12.727; v-test = 68.251; p = 0.000), and SGB II is the least overrepresented (MIC = 12.423; v-test = 54.825; p = 0.000), with the MIC of the former decreasing and that of the latter increasing since 2009.
On the contrary, in Cluster 1, the variables are underrepresented and statistically significant (see Fig. 10b). For 2009, Cluster 1 had SGB II as the least underrepresented socio-economic variable (MIC = 2.388; v-test = -123.634; p = 0.000), and SGB III as the most underrepresented (MIC = 3.107; v-test = -130.744; p = 0.000). Similarly, for 2020, SGB II was the least underrepresented (MIC = 3.012; v-test = -411.076; p = 0.000), and SGB III was the most underrepresented (MIC = 3.227; v-test = -424.48; p = 0.000).
Finally, in Cluster 3, both variables are statistically significant and overrepresented (see Fig. 10d). In 2009, SGB II was the most overrepresented (MIC = 55.375; v-test = 125.056; p = 0.000) and SGB III was the least overrepresented (MIC = 46.651; v-test = 113.809; p = 0.000). The same situation occurred in 2020, where SGB II was the most overrepresented (MIC = 54.557; v-test = 419.17; p = 0.000), and SGB III was the least
In summary, the maps show that residential segregation in Berlin is a phenomenon that can be visualized on a geographical level. The analysis also detected the emergence of a cluster when analysing the migration background of Berlin's populations. Finally, the results show that the clusters have small movements because the composition of the clusters changes over time and space.

Discussion and conclusion
This study aimed to examine the phenomenon of residential segregation from a dynamic point of view. According to our approach, residential segregation can be explored from
To open the discussion, we would like to recall that several studies have been carried out on the spatial distribution of Berlin's subpopulations. In particular, we believe that the reporting of spatial densities excluding non-residential areas, the separate analysis of dimensions that have already been documented by several researchers, and the use of dynamic rather than static cluster analysis are aspects that can help different disciplines, especially those that are looking for new methodological ways to identify changes in population structure from cohort data. In this context, we briefly discuss some of the findings and then summarise the research undertaken.

Comparison with previous research
To provide a more comprehensive overview of the results, we will compare our findings with those of other researchers who have independently addressed the issue of residential segregation. First, we would like to stress that the proposed methodology allows us to identify changes that we can label macro and micro. By macro changes, we refer to the possibility of clusters appearing, moving, or disappearing over time. By micro changes, we refer to the internal changes that can occur in the composition of each cluster, which we have operationalized and visualized using bump charts. Table 3 summarizes the main results of our approach, together with selected previous studies.
• The clusters can identify areas where there is a higher density of people applying for unemployment benefits. Qualitatively, it can also be observed that there was a change in the distribution of these residential densities across the city in 2009 and 2020.

Macro dynamics
The results allowed us to establish that there is evidence of a structural change over the period analyzed. Within this structural change, a new cluster emerged in 2015, coinciding with the peak of the migration wave in the context of the European migration crisis. Given the nature of the dynamic clustering algorithm we use, which uses all past data to assess whether a change in cluster structure is taking place, identifying the emergence of a new cluster structure requires an event at the demographic level that makes it possible. The migration crisis in Europe and Germany's unprecedented refugee policy make the structural change we detect in Berlin a plausible interpretation of the data analysis results. The ability to detect the presence of residential segregation is the most salient finding of this study, as it demonstrates that the methodology can help to identify new patterns of residential segregation.

Micro dynamics
At the micro level, the bump charts show that some clusters have developed overrepresented subpopulations over time, others only underrepresented subpopulations, and others a combination of both. The main trends identified can be summarized as follows:
• Concerning ethnic residential segregation: In terms of micro-dynamic changes, the proposed method allows us to study the changes within each cluster. The richness of the results allows us to observe the overrepresented subpopulations in each cluster and the changes in the classification of each cluster, and thus the dynamics over time.
Through the application of dynamic analysis, our study confirms the existence of the phenomenon of age group segregation, which materialises in the four clusters we have found. The maps we present do not show idealized concentric zones, as suggested by earlier studies such as Nakagawa's, but more complex-shaped clusters that can be observed visually. We find that older people are concentrated in the peripheral areas of Berlin, spatially surrounding the other groups within the city, as seen in the maps provided. We also identified areas where young adults are found and clusters where children are over-represented. We also observe that the standardized size of the clusters does not change significantly over time, which can be interpreted to mean that the spatial areas these clusters occupy remain relatively stable. This is highly consistent with the observation of Nakagawa, who stated that "residential segregation by age group is a very real phenomenon" [49, p. 134]. Our results show in greater detail that the phenomenon of residential age segregation is present in Berlin.
• On socio-economic residential segregation: We observed that the ranking of the variables remained stable, i.e. in the same ranking position in all clusters during the years studied. Compared to previous research, some similarities can be observed in the locations with the highest rate of people claiming state subsidies (see the maps published by Blokland [27], Fig. 13.3 on p. 257). Finally, we would like to report that we have observed a qualitative change that can be seen in the 2020 map, where cluster areas take on new shapes. The results of the method show that, at least visually, there are socio-economically disadvantaged areas that only expanded in the years 2009 and 2020, which is reflected in the size of the clusters. We should bear in mind that 2009 fell within the subprime financial crisis and that in 2020 the economy was under the stress of the COVID-19 outbreak. We believe that the change in cluster shapes may be related to the global COVID-19 pandemic, when many individuals in Berlin started to apply for social welfare. However, more research is needed to link this qualitative observation to a cause-effect relationship.
• On residential segregation by gender: We found changes in the variables describing population densities by gender. The data analysis shows 3 clusters representing different densities of male and female individuals. However, we observe cluster densities that reflect a slight imbalance between females and males. Finally, the normalized cluster size does not vary significantly over time, which means that the spatial areas of the clusters have neither shrunk nor expanded throughout the observation period. Kröhnert and Vollmer [15] have argued that women from rural areas in Germany migrate to large cities such as Berlin more than men, who remain in rural areas. Under this hypothesis, one might expect the emergence of clusters formed by groups of female internal migrants, a phenomenon that is still unknown to us. However, the variation in high, medium and low population density described by the clusters seems to reflect the variation in population density as a whole. Some changes are numerically small, but qualitatively significant for monitoring the expansion of gender residential segregation observed in other geographical regions (e.g. for examining population sex ratios in China and Saudi Arabia over time and space). Perhaps because the sex ratios in Germany are mostly balanced, the phenomenon can be observed when comparing rural areas with urban areas or between eastern and western Germany, i.e. when looking at data at the country level.
In this study, an analysis was conducted using a dynamic approach to describe the phenomenon of residential segregation in Berlin. As described in this paper, residential segregation is a complex dynamic phenomenon in which different facets of Berlin's subpopulations are over- or under-represented in clusters across the city. We believe that the use of dynamic cluster analysis may be of particular interest to researchers who would like to find patterns that emerge from the data rather than trying to explain or predict a variable from a survey or a multivariate index. Both of those cases can be understood as supervised analysis problems, which by definition involve the creation of a variable or index that directly represents residential segregation. In our methodological and theoretical approach, patterns emerge from data based on multivariate and non-black-box analysis methods.
Thus, at a high conceptual level, the analysis shows that there is no such thing as a subpopulation that isolates itself in residential areas. Instead, residential segregation can be represented as a multivariate phenomenon where clusters can be observed on the dimension of migration background, age groups, socio-economic groups, or gender. These dimensions may have causal relationships with each other. However, in this study, we have taken a more focused approach to represent the phenomenon of residential segregation, which has been extensively documented in Berlin, rather than generating an explanatory model, as is commonly attempted.

Future research
Future research would aim to apply this approach to data from other cities in Germany and worldwide. In particular, it would be exciting to study demographic changes in migration crises or birth shortages and case diffusion processes in times of pandemics. It would also be interesting to include other variables that represent the neighbourhood, the quality of life of people, or the transport systems. In this way, the representation of residential segregation would also have associated elements of the city's infrastructure. Another idea is to analyze the clusters by looking at several dimensions together, for example, age and migration background, rather than studying them independently. In this study, however, we explored these dimensions separately to illustrate the general thrust of our approach and to contribute to protecting geo-privacy.

Practical applications
In this context, we believe that the approach followed in this study has multiple practical applications. Several tools describe the demography of Berlin, and some of them focus on measuring the integration of migratory subpopulations, such as the so-called "Integration Indicator Report", which is based on data provided by the Mikrozensus (microcensus), the Sozioökonomische Panel (Socio-Economic Panel), and the Programme for International Student Assessment [50]. Also, the annual indicators published by the Federal Statistical Office (Destatis) [51], the German Expert Council on Integration and Migration (https://www.svr-migration.de/jahresgutachten/) and the annual reports of the Organisation for Economic Co-operation and Development Integration Monitoring [52] provide an overview of the migration situation of the diverse communities in the different countries.
The distinguishing aspect of our approach is that it allows us to observe demographic changes in residential areas over time from a global perspective. In addition, different sets of variables can be analyzed separately or together. For example, as shown in Table 1, little is known about digital segregation. The possibility of analyzing the dynamics of clusters allows for a better understanding of the impacts of territorial policies and social interventions. We believe that the future availability of spatial databases describing the information and communication technologies used will make it possible to generate new representations of the relationship between the different aspects describing the phenomenon of residential segregation.

Limitations
The clustering algorithm makes it possible to detect structural changes, but it does not provide direct knowledge of the exact size of the clusters in each period. Instead, it only gives an idea of the size of the clusters in terms of proportions, as the algorithm aggregates data from previous periods in its updating process at each new re-evaluation, which can be seen as a limitation of the analytical approach. However, a more robust assessment of dynamic changes can be obtained with this strategy. The dynamic fuzzy clustering algorithm updates the clusters by incorporating the data evaluated in previous cluster updates. This means that it treats all previously analyzed data at each stage as entirely new. Using this strategy, the algorithm monitors structural changes rather than assessing year-on-year changes in the clusters; the important thing is to assess how different the distribution of the data being aggregated is at each point in time. In this way, it detects changes in cluster composition when new data that differ from previously observed classes appear. A detected change therefore does not correspond to a local change but to a significant one, because the algorithm detects it considering all the previously available and evaluated data.
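To make the cumulative strategy more concrete, the following Python sketch accumulates yearly batches, flags a structural change when a large share of the newly arrived points lies far from every existing cluster centre, and reports cluster sizes as proportions of all data seen so far. It is an illustration under stated assumptions rather than the dynamic fuzzy clustering algorithm used in this study: the function names, the hard nearest-centre assignment, the distance rule, the parameters `radius` and `min_share`, and the toy data are all hypothetical.

```python
import math


def nearest(point, centers):
    """Return (index, distance) of the centre closest to point."""
    dists = [math.dist(point, c) for c in centers]
    k = min(range(len(centers)), key=dists.__getitem__)
    return k, dists[k]


def monitor(batches, centers, radius=3.0, min_share=0.3):
    """Accumulate yearly batches and flag structural changes.

    Hypothetical rule: if more than min_share of a new batch lies farther
    than radius from every centre, a new centre is added at the mean of
    those points. Sizes are proportions over all data seen so far.
    """
    centers = [tuple(c) for c in centers]
    seen, report = [], []
    for year, batch in batches:
        seen.extend(batch)  # past data are kept, not discarded
        far = [p for p in batch if nearest(p, centers)[1] > radius]
        changed = bool(batch) and len(far) / len(batch) > min_share
        if changed:
            dims = range(len(far[0]))
            centers.append(tuple(sum(p[d] for p in far) / len(far) for d in dims))
        counts = [0] * len(centers)
        for p in seen:
            counts[nearest(p, centers)[0]] += 1
        report.append((year, changed, [n / len(seen) for n in counts]))
    return report


# Toy two-dimensional data (hypothetical).
batches = [
    (2014, [(0, 1), (1, 0), (10, 9), (9, 10)]),
    (2015, [(0, 0), (10, 10), (20, 0), (21, 1)]),
]
for year, changed, sizes in monitor(batches, [(0, 0), (10, 10)]):
    print(year, changed, sizes)
# 2014 False [0.5, 0.5]
# 2015 True [0.375, 0.375, 0.25]
```

Because the sizes are computed over all accumulated data, they are proportions rather than counts for a single year, which mirrors the limitation discussed in this section.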

Conclusion
In this paper, we have proposed a methodology to explore and describe the demography of Berlin in residential areas. The proposed methodology allows us to make new observations on how different subpopulations are distributed in residential areas. In addition, as the analysis is carried out over time, new insights were gained into the changing internal composition of clusters, a rich diversity, and structural changes. We conclude that this novel approach, based on data science principles, allows us to identify patterns of residential segregation in Berlin in a more unified way. We encourage other researchers to develop new hypotheses about the demographic changes observed in residential areas and the factors that might explain them.

• 3. Look for relevant changes in the structure of clusters.
• 4. If relevant changes exist, update the structure of clusters.
• 5. Repeat until no new data arrive.

Figure 2 Cluster validation using Bezdek's partition coefficient for the dynamic clustering based on migration background, age group, gender, and socio-economic variables

Figure 3 Normalized cluster sizes over time

Figure 4 Ranking of migration background variables by cluster and year

Figure 5 Continued

Figure 6 Ranking of age group variables by cluster and year

Figure 7 Dynamic clustering results visualized according to age-group variables

Figure 8 Ranking of gender variables by cluster and year

Figure 9 Dynamic clustering results visualized according to gender variables

Figure 10 Ranking of socio-economic variables by cluster and year

Figure 11 Dynamic clustering results visualized according to socio-economic variables

Table 1 Previous research on the dimensions of residential segregation in Berlin

Table 2 Descriptive statistics of the estimated population densities for the dimension of migration background, age group, gender, and socio-economic variables between 2009 and 2020. The values in blue color indicate the maximum and the values in red indicate the minimum for each dimension

Table 3 Summary of our results in comparison with selected previous studies

Table 3 (Continued)
The results are consistent. They continue to show the results of the now long-past migration waves of "temporary" guest workers (i.e. the so-called Gastarbeiter) from Turkey and Lebanon. However, it is only in the present work that we can observe the positioning of the Syrian and Chinese migrant subpopulations as the most over-represented subpopulations as part of Cluster 7. The arrival of both Syrian refugees and asylum seekers from China is a known reality of recent immigration to Berlin. For example, Kate Martyr [48], an editor and video producer at DW's Asia desk, reports on the surge in asylum applications from China to Germany, particularly from the oppressed Uighur minority. Finally, we observe the increase or decrease of the spatial areas occupied by the clusters in Berlin as the normalized cluster size changes, which was noticeable in 2015 due to the structural change of the clusters.
• Concerning age-group segregation: In general, the bump charts show slight changes in the ranking of the categories of variables describing the phenomenon of age segregation. Age segregation is a demographic phenomenon characterized in detail by Yamamoto, Kemper, and Nakagawa, who used data available before and after the fall of the Berlin Wall. Nakagawa found two clusters in West Berlin, characterized by higher adult densities in outer Berlin compared to populations in inner Berlin. Kemper compared East and West Berlin before and after reunification and found different degrees of segregation in these two areas. Finally, Masías et al. [29] find four clusters with different age group distributions in the city.

Table 4 Cluster size over time based on migration background variables

Table 7 Cluster size over time based on socio-economic variables