Network of families in a contemporary population: regional and cultural assortativity

Using a large dataset with individual-level demographic information on 60,000 families in contemporary Finland, we analyse the variation and cultural assortativity in a network of families. Families are considered as vertices, and unions between males and females who have a common child and belong to different families are considered as edges in this network. The sampled network is a collection of many disjoint components, with the largest connected component being dominated by families rooted in one specific region. We characterize the network in terms of its basic structural properties and then explore the network transitivity and assortativity with regard to regions of origin and linguistic identity. Transitivity is seen to result from linguistic homophily in the network. Overall, our results demonstrate that geographic proximity and language strongly influence the structuring of the network.


I. INTRODUCTION
Human families include parents, children, grandchildren and lasting pair bonds between usually unrelated spouses, so that families typically encompass at least three family generations and two kin lineages [1]. This complexity of familial ties allows for various kinds of associations between different extended families, for example, through marriage and intermarriage within a kin group. Family members usually help each other by providing emotional, practical and financial support [2]. In addition, a recent study showed that parents, children, grandchildren and siblings tend to stay geographically close to each other even in contemporary wealthy and globalised societies [3], which can lead to genetic homogeneity in certain regions. At the same time, fertility differences and migration patterns can affect which families contribute most to the overall population. However, there are few studies in which the structural properties of networks of extended families have been investigated.
Here we investigate the properties of a network of families, using a unique and nationally representative register dataset from contemporary Finland. We investigate the overall network characteristics of the connected components as well as the roles played by the following factors: (i) spatial proximity, as measured at a regional level; (ii) language preferences, as indicated by language (N.B. Finland has two national languages: Finnish and Swedish); and (iii) genetic relatedness, as measured through assumed biological relatedness between horizontal layers of the network.
We have the following research questions:
1. How is the network of families structured? We identify which regions most of the observed network stems from, as well as the connected components and their geographical origins.
2. What does "clustering" in the network imply? We explore transitivity within the largest connected components by type of language spoken.
3. How is the network of kin structured? We investigate the influence of the structure of the network of families on the patterns of biological relatedness between individuals within the same generation.

people constituting 11-16 per cent of the total cohort. This dataset, consisting altogether of 677,409 individuals, includes the index persons' parents and parents' other children, i.e., siblings and half-siblings, as well as the index persons' and their (half-)siblings' children and children's children. In the case of half-siblings, the data includes the half-sibling's other parent, either mother or father (randomly selected), to avoid including two half-siblings that are not genetically related. Thus the data comprise extended families of four generations: the zeroth generation comprising mothers and fathers; the first generation comprising index persons and their siblings and half-siblings; the second generation comprising the children; and the third generation comprising the grandchildren. We could thus also distinguish between cousins and second cousins within the same horizontal family generation.
For each individual, the data has demographic information including time of birth, place of birth (administrative regions called "maakunta" in Finnish), time of death, time of marriage (and divorce), and yearly information on the place of residence (region). The currently demarcated administrative regions in Finland exhibit a substantial degree of cultural and economic similarity, including recognizable regional dialects, symbols and local food traditions [6]. For our analysis we consider 18 out of the 19 regions, excluding the Ahvenanmaa (the Åland Islands) region owing to its small population and its separation from the mainland of Finland. A few regions stand out historically and culturally. Uusimaa is the region of Finland's capital Helsinki and the largest urban settlement in the country: 30 per cent of the total population lives in Uusimaa. The former capital Turku is now the third largest city and is located in the Varsinais-Suomi region, while the second largest city, Tampere, is situated in the region of Pirkanmaa. The area in middle and Western Finland, covering the various Pohjanmaa regions, is known for a history of agriculture and entrepreneurship, and also for its comparatively high fertility. A religious sect within the Protestant church, the Laestadians, lives in this area, especially in the Pohjois-Pohjanmaa region. Laestadians do not approve of modern contraception, which has contributed to a larger proportion of large families with four or more children both among the members of this sect and among their non-Laestadian neighbours [7]. Many regions, especially the Northern and Eastern regions of Finland, have also witnessed emigration to the Southern regions, especially to the Uusimaa region [8]. Finally, Finland has a national minority of Swedish-speaking Finns, comprising around 6 per cent of the total population. Swedish-speaking Finns typically live in the Western and coastal regions of the country.

III. METHODS
We construct the network using the data of the extended families of the index persons. A node in the network is a 'family' comprising an index individual and his or her parents, full siblings and half-siblings, children and grandchildren, reflecting the four family generations in our data as described above. A link between two families is a 'parental union', defined as a male and a female who are married and who have at least one common child. We identify the links by searching for individuals that belong to multiple families. Note that the presence of such a person in a given family also means the presence of one of the parents (mother or father), to whom the person is genetically related. Identifying such persons in turn allowed us to identify links between families: parental unions consisting of a male and a female, each from a different family, whose union has resulted in one or more offspring. Once such a parental union was found, we attributed the year of birth of the first offspring to this union. Together, the set of parental unions (links) and the set of families (nodes) constitute the network of families.
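The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used on the register data. The names `build_family_network`, `membership` and `unions` are hypothetical. The input is assumed to be already reduced to a map from persons to the families they belong to, plus a list of parental unions with the birth year of the first child. Collapsing several unions between the same two families into one link is also an assumption.

```python
from itertools import product


def build_family_network(membership, unions):
    """Return {frozenset({family_a, family_b}): year} for inter-family links.

    membership: dict mapping a person id to the set of family ids the
                person belongs to (assumed input format).
    unions:     iterable of (father_id, mother_id, first_child_birth_year).
    """
    links = {}
    for father, mother, year in unions:
        father_families = membership.get(father, set())
        mother_families = membership.get(mother, set())
        for fam_a, fam_b in product(father_families, mother_families):
            if fam_a == fam_b:
                continue  # both partners in the same family: not a link
            key = frozenset((fam_a, fam_b))
            # Assumption: repeated unions between the same pair of families
            # form a single link, dated by the earliest first child.
            links[key] = min(year, links.get(key, year))
    return links


# Toy data (invented for illustration only).
membership = {
    "p1": {"F1"}, "p2": {"F2"},   # partners from different families
    "p3": {"F3"}, "p4": {"F3"},   # generation-0 parents of the same family
    "p5": {"F2"}, "p6": {"F4"},
}
unions = [("p1", "p2", 1972), ("p3", "p4", 1950), ("p5", "p6", 1990)]
network = build_family_network(membership, unions)
```

In this toy input the union of `p3` and `p4` produces no link because both belong to family `F3`, mirroring the exclusion of the generation 0 parents.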
Additionally, we assign to each family a reference year and a region of origin. The reference year, taken here as the year of birth of the index person, allows for a rough comparison between the generations of individuals belonging to different families. Our aim is to study the parental unions that link different families. Therefore, for a given family we focus on the birth regions of those individuals who have children. (We exclude the generation 0 mothers and fathers, as by definition they belong to the same family.) However, not all individuals in a given family have the same birth region. When the birth regions of the reproducing adults in a single family are extremely diverse, the assumptions with regard to regional influences become weak. In contrast, the number of families where all the reproducing adults are born in the same region is expected to be smaller. Therefore we calculate the number of families in which at least a fraction θ of the reproducing adults are born in the same region (see the Appendices). We choose θ = 0.6, which allows us to include 81% of the original number of families while also fulfilling the criterion that a large majority of the reproducing adults in a given family are born in the same region, which is then assigned as the region of origin for the family. We assume that the region assigned to a family has over time influenced the different generations at multiple levels, including social, cultural and genetic inheritance, so that transitivity and assortativity may further intensify the cultural and genetic density.
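The θ criterion can be illustrated with a short sketch. The function name `region_of_origin` and the list-of-region-codes input are assumptions made for illustration. Only the threshold θ = 0.6 and the rule "at least a fraction θ born in the same region" come from the text.

```python
from collections import Counter


def region_of_origin(birth_regions, theta=0.6):
    """Return the region shared by at least a fraction `theta` of the
    reproducing adults of one family, or None if no region qualifies.

    birth_regions: list of region codes, one per reproducing adult
                   (hypothetical input format).
    """
    if not birth_regions:
        return None
    region, count = Counter(birth_regions).most_common(1)[0]
    return region if count / len(birth_regions) >= theta else None


# Four of five reproducing adults born in region "17": the family is kept.
kept = region_of_origin(["17", "17", "17", "17", "01"])
# No region reaches 60 per cent: the family is left without a region.
dropped = region_of_origin(["01", "17", "15", "17", "06"])
```

For any θ above 0.5 at most one region can qualify, so the most frequent region is the only candidate that needs to be checked.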
The index persons were chosen randomly from the whole population of Finland. In this sense, the network that we construct is a sample of the real underlying network. However, the features that emerge from the sampling seem to result from the influences of diverse regional factors. In this sense the sampled network provides us with "lower limits" on the structural characteristics present in the actual network. Thus the LCC is dominated by families from a particular region. Overall, for 13% of the families that have links, the region of origin is Pohjois-Pohjanmaa. This is a large contribution, since the largest share for any region of origin is 14% of all families, coming from the Uusimaa region around the capital Helsinki. However, the dominance of the Uusimaa region is visible on the extreme left, in the case of the smaller clusters (including families that could not be linked), where the largest contribution is from this region. This comparison is shown in Fig. 3(a). Indeed, the total number of families (with or without links) that

B. Transitivity
For the second research question we investigated transitivity by considering the triangles that may reflect transitivity with regard to family relations [9]. Expecting that the presence of triangulations would lead to an increase in the number of linkages in the neighbourhood of the corresponding nodes, we probe the strongly connected regions of the network.
We extract the components by performing a k-core decomposition [10]. For the LCC we find that $k_{\max}$, the maximum value of the degree (k) for which a core exists, is 3 (i.e. a family belonging to the core is connected to three or more families within the core). Therefore, the full LCC with 957 nodes can be partitioned into 3 shells. The outermost shell (a family has at least one connection) has 600 nodes, the shell in the middle (a family has at least two connections) has 310 nodes, and the central core has 47 nodes. While the LCC as a whole has an average degree of 2.5, the value at the core is 3.7. In addition, the concentration of families belonging to region 17 (Pohjois-Pohjanmaa) increases from the outermost shell (44%) to the core (58%).
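A k-core decomposition of this kind can be sketched with a standard peeling procedure: repeatedly remove the nodes whose remaining degree is at most k, then increase k. The code below is an illustrative pure-Python version run on a toy graph, not the implementation used for the analysis. The names `core_numbers` and `adj` are hypothetical.

```python
from collections import Counter


def core_numbers(adj):
    """Return {node: core number} for an undirected graph.

    adj: dict mapping each node to the set of its neighbours.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1  # every remaining node has degree > k
            continue
        while peel:
            v = peel.pop()
            if v not in remaining:
                continue  # already removed via a duplicate entry
            remaining.discard(v)
            core[v] = k
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        peel.append(u)
    return core


# Toy graph: a 4-clique (a, b, c, d), a node e attached to a and b,
# and a pendant node f attached to e.
adj = {
    "a": {"b", "c", "d", "e"}, "b": {"a", "c", "d", "e"},
    "c": {"a", "b", "d"}, "d": {"a", "b", "c"},
    "e": {"a", "b", "f"}, "f": {"e"},
}
core = core_numbers(adj)
shell_sizes = Counter(core.values())  # number of nodes in each k-shell
```

Here the clique forms the 3-core, `e` falls in the 2-shell and `f` in the 1-shell, which is the same kind of partition as the three shells reported for the LCC.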
We obtain the cores of the four largest components of the network, as depicted in Fig. 4.
Whereas the core of the LCC (Core-LCC) is dominated by families from region 17 (Pohjois-Pohjanmaa), as described above, the core of one of the two second largest clusters

First we extract the kin graphs shown in Fig. 5 from the four cores shown in Fig. 4.
The kin graphs corresponding to the cores of the family graphs reveal compositions very similar to the family graphs themselves in terms of the birth regions of the individuals. The dense linking found between the families in the Core-SLCC-a is converted into a clustering between kin in the Core-SLCC-a-kin. The different parameters characterizing the structures of these kin graphs are summarized in Table I. For the average shortest path length (d) and the clustering coefficient (CC), we provide values corresponding to an Erdős-Rényi model for random linkages with similar values for the number of nodes and the edge density. We also use the information on the types of kinship to measure the following quantities. We calculate the average coefficient of relationship $\langle r \rangle$ by summing the genetic relatedness over all the links in a given subgraph and then dividing by the total number of links in the subgraph.
We also provide $r_{\mathrm{sum}} = \langle k \rangle \cdot \langle r \rangle$, which is the average of the aggregated genetic relatedness at the nodes, where $\langle k \rangle$ is the average degree.
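The two relatedness measures can be made concrete with a small sketch. It assumes a kin graph given as a list of `(u, v, r)` links, where `r` is the coefficient of relationship of the pair. The function name and the toy values are illustrative. The sketch also checks numerically that averaging the summed relatedness over nodes gives the product of the average degree and the average relatedness per link.

```python
from collections import defaultdict


def relatedness_summary(kin_links):
    """Return (mean degree, mean r per link, mean aggregated r per node).

    kin_links: list of (u, v, r) with r the coefficient of relationship.
    Only nodes that appear in at least one link are counted.
    """
    aggregated = defaultdict(float)  # sum of r over the links of each node
    total_r = 0.0
    for u, v, r in kin_links:
        total_r += r
        aggregated[u] += r
        aggregated[v] += r
    n_nodes = len(aggregated)
    n_links = len(kin_links)
    mean_k = 2 * n_links / n_nodes
    mean_r = total_r / n_links
    mean_r_sum = sum(aggregated.values()) / n_nodes
    return mean_k, mean_r, mean_r_sum


# Toy kin graph (invented): 0.25 half-siblings, 0.125 first cousins,
# 0.03125 second cousins.
kin_links = [
    ("a", "b", 0.125),
    ("b", "c", 0.125),
    ("a", "c", 0.25),
    ("c", "d", 0.03125),
]
mean_k, mean_r, mean_r_sum = relatedness_summary(kin_links)
```

The identity holds because each link contributes its relatedness to exactly two nodes, just as it contributes to two degrees.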
Among the four kin graphs corresponding to the cores in the network of families, the CC appears to be the highest in the Core-SLCC-a-kin, and as such results from the transitive triangulations observed in the Core-SLCC-a composed of Swedish-speaking families. In the Core-SLCC-kin, in contrast to the other three, a random individual is linked to the highest number of close kin, as evidenced by the high values of the average degree $\langle k \rangle$ and the average aggregated genetic relatedness $r_{\mathrm{sum}}$. Interestingly, the average relatedness in the network appears to be high in Core-TLCC-kin, which is due to the presence of half-sibling relationships. Under the criterion $d/d_{\mathrm{random}} \gtrsim 1$ and $CC/CC_{\mathrm{random}} \gg 1$ [11], all four graphs appear to be small worlds in terms of structure.
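The small-world criterion can be sketched as follows. Several choices here are assumptions rather than details given in the text. The clustering coefficient is taken as the average local clustering. The random baselines use the common Erdős-Rényi approximations $CC_{\mathrm{random}} \approx \langle k \rangle/(N-1)$ and $d_{\mathrm{random}} \approx \ln N / \ln \langle k \rangle$ instead of simulated random graphs. The graph is assumed to be connected with $\langle k \rangle > 1$. All function names are hypothetical.

```python
import math
from collections import deque


def average_clustering(adj):
    """Mean local clustering; nodes of degree < 2 contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # each link between two neighbours is seen twice in this sum
        links = sum(1 for u in nbrs for w in adj[u] if w in nbrs) // 2
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def small_world_ratios(adj):
    """Return (d / d_random, CC / CC_random) using analytic ER baselines."""
    n = len(adj)
    mean_k = sum(len(nbrs) for nbrs in adj.values()) / n
    cc_random = mean_k / (n - 1)                 # ER edge density
    d_random = math.log(n) / math.log(mean_k)    # ER approximation
    return (average_path_length(adj) / d_random,
            average_clustering(adj) / cc_random)


# Toy graph: ring of 20 nodes, each linked to its two nearest
# neighbours on either side.
n = 20
ring = {i: {(i + o) % n for o in (-2, -1, 1, 2)} for i in range(n)}
d_ratio, cc_ratio = small_world_ratios(ring)
```

For graphs as small as the cores considered here, the logarithmic approximation for the random path length is rough, so comparing against simulated random graphs of the same size and density would be the more careful choice.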
Additionally, we include in the analysis the kin graphs directly derived from the four largest clusters in the network of families (without being restricted to their cores). Remarkably, for the LCC-kin, the kin graph corresponding to the LCC in the network of families, which is far larger than the Core-LCC-kin (N = 1052 versus N = 58), the small-world character appears to be preserved, if not enhanced, as observed from the amplification of the ratio $CC/CC_{\mathrm{random}}$ with only a marginal increase in the value of $d/d_{\mathrm{random}}$.
In Fig. 6 we show the frequencies of the different types of kinship that are found in the entire kin network. The relationships that are most abundant (around 30% in each case) are first cousin, first cousin once removed and second cousin. Relationships that mainly originate from family ties formed through multiple marriages of individuals are present in smaller numbers. For each kind of relationship we also provide the fraction of cases where

D. Assortativity
Finally, we characterize the network of families as well as the network of kin in terms of the assortativity coefficient. In general, this coefficient is employed to characterize the nature of ties in a network [12]. For example, in a large social network where individuals are characterized by their age, a positive assortativity would indicate that people of comparable age prefer to associate with each other, while a negative assortativity would indicate the opposite. The assortativity coefficient (a) is defined such that it lies between −1 and 1.
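For a categorical attribute such as region of origin or language, one common form of the assortativity coefficient is $a = (\sum_i e_{ii} - \sum_i a_i b_i)/(1 - \sum_i a_i b_i)$, where $e_{ij}$ is the fraction of link ends joining category $i$ to category $j$ and $a_i$, $b_i$ are its row and column sums. The sketch below assumes that this standard form is the one intended here. The function name and the toy data are illustrative.

```python
import math
from collections import defaultdict


def attribute_assortativity(edges, attribute):
    """Assortativity coefficient for a categorical node attribute.

    edges:     list of (u, v) undirected links.
    attribute: dict mapping each node to its category.
    """
    m = len(edges)
    e = defaultdict(float)  # symmetric mixing matrix, entries sum to 1
    for u, v in edges:
        x, y = attribute[u], attribute[v]
        e[(x, y)] += 0.5 / m
        e[(y, x)] += 0.5 / m
    marginal = defaultdict(float)  # a_i (equal to b_i by symmetry)
    for (x, _), weight in e.items():
        marginal[x] += weight
    trace = sum(e.get((c, c), 0.0) for c in marginal)
    expected = sum(w * w for w in marginal.values())
    if math.isclose(expected, 1.0):
        return float("nan")  # a single category: coefficient undefined
    return (trace - expected) / (1.0 - expected)


# Toy example with two language groups (invented data).
language = {1: "fi", 2: "fi", 3: "sv", 4: "sv"}
within_groups = attribute_assortativity([(1, 2), (3, 4)], language)
across_groups = attribute_assortativity([(1, 3), (2, 4)], language)
```

In the toy data, links only within language groups give the maximum value and links only across the two equally sized groups give the minimum.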
When a = 1, the network is perfectly assortative, and when a = −1, the network is called

showing that the region of origin remains important for the sociality of Europeans today [3]. Intensities of internal migration are known to be higher in Finland and Scandinavia than in Southern European countries [13], yet although a large proportion of Finns migrate to another region during young adulthood, many eventually move back or closer to their region of origin once they have children themselves or after retirement [14]. Interestingly, the first and second largest connected components were predominantly populated by families rooted in a few specific regions. Furthermore, the kin graphs corresponding to the cores of the four largest connected components were all dominated by one of the two national languages, Finnish or Swedish, the latter spoken by 6% of the total population but represented much more strongly in some regions. The fact that cultural homophily, in terms of religion and language, plays a major role becomes evident in our investigation of the presence of transitive relations between families. We found that the concentrated presence of a minority group with Swedish as their mother tongue is reflected in the proliferation of triangles. Thus the majority of members of the families in the transitive core of one of the two second largest connected components came from region 15 (Pohjanmaa) and were Swedish speaking. It is known that 40% of the Swedish-speaking population of Finland resides in this particular region. Furthermore, the kin core of this particular connected component has the highest average degree and clustering, as well as a higher estimate of the assumed genetic relatedness than that of the largest connected component. The patterns revealed through the structure of the network are consistent with the genetic clustering found in the Swedish-speaking population of Pohjanmaa [15].
In the network of families, the ties are constituted by two individuals of opposite sex who jointly parent one or more children (the first born is included in the kin network).
In general, each family has a number of reproducing individuals, and some become part of the linkages in the sampled network when the family of the opposite-sex partner is also present in the data. Therefore, under a simplistic description, the larger the family (and hence the larger the number of reproducing adults), the greater the chance that this family has a link. The plot of the number of links per family against the number of reproducing adults for the different regions is shown in Fig. 8(a). Here we have taken into account all families, even those that could not be linked. This approach is expected to reduce the sampling bias. The linear correlation is r = 0.85 and a fit suggests a linear relationship.
As discussed, the region 17 (Pohjois-Pohjanmaa) and its neighbours ( can not be solely judged by the aspect of regional assortativity and large families resulting from a higher fertility rate. There appears to be a tendency for large families to be connected to each other. This is shown in Fig. 8(b). For a family of a given size (measured in terms of the number of reproducing adults) we calculate the average size of the connected families. This is similar to the nearest-neighbour average connectivity, which is used to quantify degree correlations in networks [16]. A positive slope for region 17 (Pohjois-Pohjanmaa) indicates the presence of such correlations. A similar correlation is also present in region 15 (Pohjanmaa). For the rest of the regions we did not find any significant correlation. The case of region 01 (Uusimaa) is illustrated, where the slope is not different from zero. This kind of "degree assortativity", likely originating from religious reasons in addition to the regional assortativity, could be the reason for these regions dominating the largest connected component [17]. In fact it was demonstrated in [17] that when such assortativity is high, a "core group" is formed by high-degree nodes on which a largest connected component grows but, contrary to expectations, does not grow steadily and does not extend into the rest of the network. The scenario is very similar to our case, and such high assortativity and the resulting impedance in the growth of the largest component could additionally imply that the true underlying network (from which the data is sampled) is not a small world [18,19]. This may be surprising to a certain extent, as the fragments listed in Table I are small worlds.
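The size correlation described above can be computed along the following lines. This is a sketch under assumptions: family size is given as a number of reproducing adults per family, and the average is taken first over the neighbours of each family and then over all families of the same size, in analogy with the nearest-neighbour average connectivity. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_neighbour_size_by_size(sizes, links):
    """Return {family size: average size of the connected families}.

    sizes: dict mapping a family id to its number of reproducing adults.
    links: list of (family_a, family_b) links between families.
    Families without links are ignored.
    """
    neighbour_sizes = defaultdict(list)
    for fam_a, fam_b in links:
        neighbour_sizes[fam_a].append(sizes[fam_b])
        neighbour_sizes[fam_b].append(sizes[fam_a])
    by_size = defaultdict(list)
    for family, nbr_sizes in neighbour_sizes.items():
        by_size[sizes[family]].append(mean(nbr_sizes))
    return {size: mean(values) for size, values in sorted(by_size.items())}


# Toy chain of four families (invented data).
sizes = {"A": 2, "B": 4, "C": 4, "D": 2}
links = [("A", "B"), ("B", "C"), ("C", "D")]
curve = mean_neighbour_size_by_size(sizes, links)
```

A curve that rises with family size would correspond to the positive slope reported for regions 17 and 15, while a flat curve would correspond to the case of region 01.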
In sum, the general patterns of linkages found within this representative sample of a national population are indicative of a high assortativity in the network of families. Both the region of birth and the language appear to function as cultural attractors in the network and increase the clustering and transitivity. We can distinguish between two patterns of regional effects in this network, either showing "metropolitan" family linkages or the "cultural" family

Here, the regional and linguistic identity seems to result in a strong regional connectivity in terms of family ties.

1966-1975, 1976-1985, 1986-1995, 1996-2005

where M is the total number of edges in the network.

(b) Links at multiple generations: This kind of triangulation occurs when an offspring becomes a parent. For example, A (male) and B (female) get married and C is born. By definition, C belongs to both $F_1$ (the paternal family) and $F_2$ (the maternal family). C gets married to D from family $F_3$.
In this case, both links $(F_1, F_3)$ and $(F_2, F_3)$ result from the link between C and D.
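The worked example can be checked with a few lines of code. The sketch encodes the memberships of A, B, C and D exactly as described, derives the links between families from the two unions, and lists the triangles. The helper names are hypothetical. Treating the union of A and B as a link between the two families is taken from the definition of a link used in the Methods.

```python
from itertools import combinations

# Family memberships from the example: C belongs to both parental families.
membership = {"A": {"F1"}, "B": {"F2"}, "C": {"F1", "F2"}, "D": {"F3"}}
unions = [("A", "B"), ("C", "D")]


def family_links(membership, unions):
    """Links between distinct families implied by the parental unions."""
    return {
        frozenset((fam_a, fam_b))
        for partner_a, partner_b in unions
        for fam_a in membership[partner_a]
        for fam_b in membership[partner_b]
        if fam_a != fam_b
    }


def triangles(links):
    """All triples of families that are pairwise linked."""
    families = sorted(set().union(*links))
    return [
        triple
        for triple in combinations(families, 3)
        if all(frozenset(pair) in links for pair in combinations(triple, 2))
    ]


links = family_links(membership, unions)
found = triangles(links)
```

The single union of C and D adds two links at once and closes the triangle between the three families, which is why this mechanism raises transitivity without any additional marriages.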