Can co-location be used as a proxy for face-to-face contacts?

Technological advances have led to a strong increase in the number of data collection efforts aimed at measuring co-presence of individuals at different spatial resolutions. It is however unclear how much co-presence data can inform us on actual face-to-face contacts, of particular interest to study the structure of a population in social groups or for use in data-driven models of information or epidemic spreading processes. Here, we address this issue by leveraging data sets containing high resolution face-to-face contacts as well as a coarser spatial localisation of individuals, both temporally resolved, in various contexts. The co-presence and the face-to-face contact temporal networks share a number of structural and statistical features, but the former is (by definition) much denser than the latter. We thus consider several down-sampling methods that generate surrogate contact networks from the co-presence signal and compare them with the real face-to-face data. We show that these surrogate networks reproduce some features of the real data but are only partially able to identify the most central nodes of the face-to-face network. We then address the issue of using such down-sampled co-presence data in data-driven simulations of epidemic processes, and in identifying efficient containment strategies. We show that the performance of the various sampling methods strongly varies depending on context. We discuss the consequences of our results with respect to data collection strategies and methodologies.


Introduction
In recent years, several methods have been developed to gather quantitative data on human interactions using wearable sensors and to complement more traditional methods based on surveys [1,2,3]. Current data collection methods range from the use of Bluetooth or WiFi signals in mobile phones [4,5,6,7,8,9] to the specific development of dedicated sociometric sensors [10,11,12,13,14,15,16,17,18,19] and enable researchers to record and measure physical proximity events between individuals in various social contexts. Depending on the specific technology considered, however, the spatial resolution varies and the resulting "contacts" detected can range from co-presence in a room or a part of a building to close face-to-face encounters. The resulting data is often temporally resolved and has been increasingly used in various contexts including the study of human behaviour, the validation of models of human interactions and data-driven models of epidemic spreading [20,21,3].
Despite the increasing availability of techniques to measure even high-resolution temporal contact networks, a number of limitations remain. In particular, measures cannot be carried out for arbitrarily large population sizes. It is thus of crucial interest to infer contacts or build contact proxies from data with lower spatial resolution or coming from other sources. In this spirit, several studies have considered the issue of inferring social ties from email exchanges [22], mobile phone data [23], or co-location at geographic scale [24]. Other works try to infer close proximity in specific settings from individual attributes [25] or from a very precise localisation of individuals [17], or, at geographical scale, from the similarity of the WiFi signals received from a large enough number of WiFi routers [26].
Here instead, we do not try to infer specific contacts between pairs of individuals but rather investigate whether coarse co-location information on individuals allows us to reach an overall picture of the contact patterns in the population of interest. To this aim, we leverage several data sets collected by the SocioPatterns collaboration [27,13] in various contexts: these data include both detailed information about close, face-to-face encounters between individuals and a location tracking of individuals with low spatial resolution. It is thus possible to build two temporal networks where nodes represent individuals and links correspond respectively to a face-to-face contact or to a co-presence event, where co-presence is defined with respect to the localisation of two individuals within the same spatial area. We first compare the structural and statistical properties of these two temporal networks and show that they share some important properties, although the co-presence network is much denser, due to the lower spatial resolution involved in its definition. We thus investigate several methods of down-sampling the co-presence signal in order to create surrogate contact networks, in the spirit of [28,29], and compare these surrogate data to the actual networks of face-to-face contacts. We focus first on several statistical characteristics of temporal and aggregated networks, and quantify the ability to identify central nodes in the contact network from the surrogate data. We then consider the possibility of using the surrogate data in numerical simulations of data-driven models for epidemic spread. In particular, we compare the outcome of simulations of a standard model of epidemic propagation when surrogate or actual contact data are used, and we explore the possibility of identifying the most efficient containment strategies from this limited information [30].
Our results turn out to depend strongly on the data collection context, highlighting the limitations of coarse co-presence networks with respect to detailed face-to-face data.
2 The co-presence network

2.1 Data sets
We use data collected by the SocioPatterns collaboration in various contexts. These data were gathered using wearable sensors able to detect face-to-face close range proximity (1.5 m) of participants wearing the sensors on their chests. In addition, the sensors broadcast a signal that can be received by RFID readers located in the environment. In open space, each reader can receive signals from sensors situated within a range of ∼30 m, while the actual reception range in a building depends on its specific structure and on the nature of its walls, floors and ceilings. Each reader thus defines a coarse spatial area and the sensors' signals can be followed when the individuals carrying them change area. For each sensor, we define its "spatial location" at each time as the set of readers receiving its broadcast signal at this time, and we define two individuals to be in co-presence if they share the same spatial location, i.e., if exactly the same set of readers has received signals from both individuals.
We use data sets from various social contexts: a workplace, with data collected in two different years (InVS13, InVS15), a hospital (LH10), a primary school (LyonSchool), a scientific conference (SFHH) and a high school (Thiers13), see Table 1. In each case, we thus consider a temporal network of face-to-face contacts and a temporal network of co-presence between individuals, both at the temporal resolution of 20 s. A contact (resp. co-presence) event between two individuals is then defined as a maximal set of successive 20 s time windows during which the individuals are detected in contact (resp. co-presence): they are in contact (resp. co-presence) neither in the preceding nor in the following 20 s time window.
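The two definitions above (co-presence as sharing exactly the same set of readers, and events as maximal runs of successive 20 s windows) can be illustrated with a minimal Python sketch. The function names and the input layout are hypothetical, chosen for illustration only, and the choice to ignore sensors that are seen by no reader is an assumption that the text does not state.

```python
from collections import defaultdict
from itertools import combinations

STEP = 20  # temporal resolution in seconds, as in the text


def copresence_pairs(readers_at_t):
    """Pairs of individuals in co-presence at one time step.

    readers_at_t: dict mapping a sensor id to the set of reader ids that
    received its signal at this time step (hypothetical input layout).
    Two individuals are co-present if exactly the same set of readers saw both.
    """
    by_location = defaultdict(list)
    for sensor, readers in readers_at_t.items():
        if readers:  # assumption: a sensor seen by no reader has no location
            by_location[frozenset(readers)].append(sensor)
    pairs = set()
    for sensors in by_location.values():
        pairs.update(combinations(sorted(sensors), 2))
    return pairs


def to_events(times, step=STEP):
    """Merge the detection times of one pair into events (start, duration)."""
    events = []
    for t in sorted(set(times)):
        if events and t == events[-1][0] + events[-1][1]:
            # t directly follows the last event: extend it by one window
            start, duration = events[-1]
            events[-1] = (start, duration + step)
        else:
            events.append((t, step))
    return events
```

Applying `copresence_pairs` at every time step and then `to_events` to the detection times of each pair yields the list of co-presence events; the same merging applies to face-to-face contact detections.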
While the conference data does not include any other information on the participants and does not exhibit any particular group structure [36], the other populations under study can each be divided into groups: departments for the workplace, classes for the school and the high school, and roles (patients, doctors, nurses) in the hospital. In these cases, the overall structure of networks aggregated over a certain time window can be summarised, in addition to usual quantities such as the density, the clustering coefficient or the degree distribution, by contact (resp. co-presence) matrices that give the fraction of pairs of individuals of different groups who have been in contact (resp. in co-presence). Moreover, temporal features of interest include the distributions of durations of contact or co-presence events, of the time elapsed between successive events, of the numbers and aggregated durations of such events between pairs of individuals (the latter quantity yields a natural definition of the weight of a link between individuals in the aggregated network).
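As an illustration of the contact (resp. co-presence) matrices described above, the following sketch computes, for each pair of groups, the fraction of pairs of individuals that are linked in an aggregated network. The function name and the dictionary-based output are hypothetical choices, not taken from the original analysis.

```python
from itertools import combinations_with_replacement


def group_density_matrix(edges, group_of):
    """Fraction of linked pairs of individuals for each pair of groups.

    edges: iterable of undirected links (i, j) of an aggregated network.
    group_of: dict mapping each individual to its group label.
    Returns a dict {(group_a, group_b): density} with group_a <= group_b.
    """
    groups = sorted(set(group_of.values()))
    size = {g: sum(1 for v in group_of.values() if v == g) for g in groups}
    # Count each undirected link once, whatever its orientation in the input.
    unique_links = {tuple(sorted(e)) for e in edges}
    linked = {}
    for i, j in unique_links:
        key = tuple(sorted((group_of[i], group_of[j])))
        linked[key] = linked.get(key, 0) + 1
    matrix = {}
    for a, b in combinations_with_replacement(groups, 2):
        if a == b:
            possible = size[a] * (size[a] - 1) // 2  # pairs within one group
        else:
            possible = size[a] * size[b]  # pairs across two groups
        matrix[(a, b)] = linked.get((a, b), 0) / possible if possible else 0.0
    return matrix
```

Computing this matrix once for the contact network and once for the co-presence network gives the two objects whose similarity is discussed in the text.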
We show in the main text the results corresponding to the InVS15 data set, and we refer to the Supplementary Information for the results obtained with the other data sets. We also make the temporally resolved contact and co-presence networks available as Supplementary Files.

Co-presence and contact networks
We first compare some features of the co-presence and contact networks, both temporal and for networks aggregated either on the whole data gathering period or over daily temporal windows. We show in Fig. 1 the distributions of event and inter-event durations, as well as the distributions of the number and cumulative duration of events for individual pairs. The co-presence events show broad distributions of these quantities, similarly to the contact events and with similar slopes: using only co-presence data yields approximate information on the functional shape of the contact duration distributions. The distributions of durations and numbers of events are however typically broader for co-presence, with heavier tails, and the distribution of inter-event durations tends to be less broad (see also SI). This is not surprising as the criterion for being in co-presence is less strict than for being in contact. We observe the strongest differences between co-presence and contact distribution functional shapes for the primary school data. This could be related to the fact that the spatial resolution is in that case quite low, with the whole schoolyard being covered by one single reader, and some readers covering more than one classroom. Overall, using only co-presence data would lead to over-estimations of the contact durations and aggregate durations. We moreover compare in Figs. 2-3 and Tables 2-3 the overall structures of the contact and co-presence networks, aggregated over daily time windows.
The co-presence aggregated networks are much denser than the contact network, with a larger average degree, a larger average clustering coefficient and larger cliques, as expected once again given the lower spatial resolution required for co-presence events. In some cases (school, conference), the aggregated networks are even close to being fully connected (see for illustration Fig. 3). Despite this strong difference in the overall density of links, the contact and co-presence matrices giving the density of links between and within each group, averaged across days, are very similar (Table 2). The similarity is particularly high for the hospital data and, even for the lower value obtained for the high school data, the matrices displayed in the SI show that the overall structure in classes and groups of classes can be inferred from the co-presence data alone. Given the simultaneous discrepancies in density values and similarities in the networks' group structures, we investigate whether the data exhibits a scaling law between the number of individuals present in an area and their contact activity, as found at geographical scale in phone communication [37] and Twitter data [38]. Figure 4 and the similar figures shown in the SI show the results obtained in the various contexts.

We compare the average degree (k), network density (ρ), clique number (ω) and average clustering (c) of daily aggregated networks, for the contact network (c subscript), the co-presence network ( subscript), and the sampled co-presence networks (subscripts 1 to 3 according to the sampling method). Values are averaged over all the days of the study. In the case of SFHH, since on the second day there was activity only during the morning, only the values of the first day are reported. *The network is too large and too dense for the clique number to be determined in reasonable time via the usual algorithm.
Apart from the office cases (InVS13 and InVS15), we indeed observe a correlation between the median of the number of contacts and the number of individuals present. This correlation exhibits a power law shape, with an exponent around 1.5 (see figures in SI). However, huge, context-dependent fluctuations are observed. For instance, in the InVS15 case, the trend is strongly influenced by the numerous instances of an absence of contacts despite potentially large values of the number of individuals present in the area. This is a consequence of the fact that a given reader can receive signals from the sensors of individuals located in different offices. In other areas such as a cafeteria, many more contacts occur with potentially a similar or even smaller number of individuals. Overall, very large fluctuations of the number of contacts, at a given number of individuals present, are thus observed. They stem on the one hand from the low spatial resolution of the co-presence data, and on the other hand from the variety of contexts corresponding to the areas covered by different RFID readers. The strongest correlation is observed for the SFHH conference data, probably because the various areas covered by the readers corresponded to similar contexts, namely different areas of the exhibition and poster rooms.
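One simple way to estimate such a power law exponent is an ordinary least squares fit in log-log scale of the median number of contacts against the number of individuals present. The sketch below is only an illustration under that assumption: the fitting procedure actually used for the figures is not specified in the text, the function names are hypothetical, and occupancy values with a zero median are dropped before taking logarithms, which can bias the estimate.

```python
import math
from collections import defaultdict
from statistics import median


def median_contacts_by_occupancy(samples):
    """samples: iterable of (n_present, n_contacts), one per area and time step.
    Returns a dict mapping n_present to the median number of contacts."""
    grouped = defaultdict(list)
    for n_present, n_contacts in samples:
        grouped[n_present].append(n_contacts)
    return {n: median(values) for n, values in grouped.items()}


def loglog_slope(points):
    """Least squares slope of log(y) versus log(x) for a dict x -> y.
    Points with x <= 0 or y <= 0 are dropped (assumption, see lead-in)."""
    logs = [(math.log(x), math.log(y)) for x, y in points.items() if x > 0 and y > 0]
    xs = [p[0] for p in logs]
    ys = [p[1] for p in logs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

The slope returned by `loglog_slope(median_contacts_by_occupancy(samples))` plays the role of the exponent; the large fluctuations discussed above are not captured by this single number.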
3 Sampling co-presence data

Sampling methods
As the temporal network of co-presence bears some similarities with the actual contact data, but contains many more events and leads to much denser aggregated networks, we consider the possibility of down-sampling the co-presence data: for each pair of individuals, each contact event is indeed included in a co-presence event of the same individuals. Each co-presence event might thus correspond to one or more contact events. As we cannot determine exactly the correct down-sampling to be performed if we have access only to co-presence data, we study here three simple sampling methods. We recall that we do not try to infer the real contacts but rather to obtain a down-sampled version of the co-presence network that is statistically similar to the real contact data. Moreover, as the total number and duration of actual contacts cannot be easily guessed from the co-presence data alone either, we consider the actual total contact time T_c as the (only) parameter of the sampling, and we fix it to its empirical value. The sampling methods we consider are the following:
• Sampling 1: Sampling of co-presence times. We define a co-presence list as a list of individuals present at the same time t in the same area. Each co-presence list is thus stamped with its time of occurrence t. We create n copies of each co-presence list, where n is the number of distinct individuals in the list, and create in this way a global pool of co-presence lists. We then sample T_c lists uniformly at random from the pool without replacement. Each list thus has a probability of being chosen proportional to the number of individuals it contains. From each chosen list, we choose at random a pair i, j of individuals, obtaining a triplet (t, i, j) where t is the time-stamp of the list (we take care to avoid repetitions: if (t, i, j) has already been obtained in a previous random draw, we repeat the random selection). The sampled temporal co-presence network (i.e., the surrogate contact network) is formed by the union of these triplets.
• Sampling 2: Sampling of co-presence times with completion. We constitute a pool of lists exactly as in the previous method. We then sample a triplet (t, i, j) as in the previous method, and add all the other triplets (t', i, j) that belong to the same co-presence event to create the surrogate contact event. We iterate this until we reach a cumulative contact time T_c, while discarding repetitions.
• Sampling 3: Sampling of co-presence events. We consider directly the list of co-presence events between individuals, (t, i, j, τ) (co-presence event between individuals i and j, starting at time t and with duration τ), and sample events from this list, without replacement, adding them to the list of surrogate contact events until we reach a cumulative contact time T_c.
For each data set, we create 100 instances of surrogate contact networks for each sampling method. We compare in the following the properties of these surrogate contact networks with the real face-to-face contact data.
Figures 5-6 and Tables 2-3 provide elements of comparison between the surrogate contact networks and the empirical data (see also SI). The first observation is that the contact activity timelines are in general broadly recovered, except for the primary school (see SI), while the detailed intra-day activity variations are not always properly reconstructed in the surrogate data (except for the hospital data, see SI). The strongest deviations are observed for the second sampling method for the conference and high school data.

Figure 5 Properties of the sampled co-presence networks - InVS15. We compare several properties of the contact network from the original data set with the surrogate contacts obtained by sampling of the co-presence data (curves: data, sampling 1, sampling 2, sampling 3): overall timeline of contact activity, distributions of degree, weight w and number of contacts per link n in the network aggregated over the whole data collection period, and distributions of the contact duration τ_c and inter-contact duration τ_i.
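The three sampling methods listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used to produce the results: the input layout (a list of `(t, members)` co-presence lists), all function names, and the choice to measure T_c in numbers of 20 s pair-time steps are hypothetical. When a drawn triplet has already been obtained, the sketch simply moves on to the next list of the shuffled pool, which is one possible reading of "we repeat the random selection".

```python
import random
from collections import defaultdict
from itertools import combinations

STEP = 20  # seconds


def pair_times(lists):
    """Map each pair (i, j), i < j, to the set of times at which it is co-present."""
    times = defaultdict(set)
    for t, members in lists:
        for i, j in combinations(sorted(set(members)), 2):
            times[(i, j)].add(t)
    return times


def event_containing(t, times, step=STEP):
    """Time steps of the maximal run of consecutive steps of `times` around t."""
    run = {t}
    for direction in (-step, step):
        u = t + direction
        while u in times:
            run.add(u)
            u += direction
    return run


def draw_triplets(lists, rng):
    """Yield triplets (t, i, j) from a shuffled pool with n copies of each list."""
    pool = []
    for t, members in lists:
        distinct = tuple(sorted(set(members)))
        if len(distinct) >= 2:  # a list with one individual contains no pair
            pool.extend([(t, distinct)] * len(distinct))
    rng.shuffle(pool)  # iterating a shuffled pool samples without replacement
    for t, distinct in pool:
        i, j = sorted(rng.sample(distinct, 2))
        yield (t, i, j)


def sampling_1(lists, T_c, rng):
    chosen = set()
    for triplet in draw_triplets(lists, rng):
        if len(chosen) >= T_c:
            break
        chosen.add(triplet)  # a repeated triplet leaves the set unchanged
    return chosen


def sampling_2(lists, T_c, rng):
    times = pair_times(lists)
    chosen = set()
    for t, i, j in draw_triplets(lists, rng):
        if len(chosen) >= T_c:
            break
        # complete the triplet with the whole co-presence event it belongs to
        chosen.update((u, i, j) for u in event_containing(t, times[(i, j)]))
    return chosen


def sampling_3(lists, T_c, rng):
    events = []
    for (i, j), times in pair_times(lists).items():
        remaining = set(times)
        while remaining:
            run = event_containing(min(remaining), times)
            events.append((i, j, run))
            remaining -= run
    rng.shuffle(events)
    chosen = set()
    for i, j, run in events:
        if len(chosen) >= T_c:
            break
        chosen.update((u, i, j) for u in run)
    return chosen
```

Each function returns a set of triplets `(t, i, j)` forming one surrogate contact network; calling it repeatedly with differently seeded `random.Random` instances gives independent realisations. In this sketch, methods 2 and 3 can overshoot T_c by at most the duration of the last event added.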
The first sampling method, since it samples co-presence times independently of one another, yields an exponential distribution of surrogate contact durations, in contrast with the actual data and the other sampling methods, for which broad distributions are observed. Broad distributions of inter-contact durations and of the numbers of contacts between individuals are also obtained, with slopes that however depend on the context. For instance, the second sampling method systematically leads to a distribution of contact durations that is broader than for the real contacts. The third method yields a distribution of contact durations similar to the real one for the InVS13, LH10, and SFHH cases, but gives results similar to the second method in the other cases.
We now turn to the properties of networks aggregated over daily periods or over the whole data collection. At the daily level, we show in Table 2 that the similarity of the contact matrices obtained from the surrogate data with the empirical one is very high, and most often larger than the similarity of the original co-presence matrix. For networks aggregated over the whole data collection, Fig. 5 shows the distributions of degrees and of weights (see also SI). The first sampling method leads to an overestimation of degree values (resulting in a shift of the distribution), the second method tends to shift the distribution to lower degree values (except for the conference case), and the third method yields context-dependent over- or under-estimations of degree values. Note that the distributions of degrees of the co-presence networks are not shown in the figure as the degree values are very strongly overestimated. Distributions of weights (aggregated contact durations) recover well the ones of the data for all sampling methods, and are closer than the ones of the co-presence networks.

For each data set we compute the cosine similarity between the neighbourhoods of each node from each daily network, averaged for all nodes and all pairs of daily networks. The neighbourhood of a node n is defined as the vector of the link weights between n and every other node (if the link does not exist the weight is set to zero). We compare the values obtained for the contact data, the co-presence data, and for the networks generated by each sampling method of the co-presence data, averaged over 100 realisations for each sampling method. For reference, we also compute as null model the average similarity when links in the contact data are shuffled randomly within each daily network.
To investigate intermediate timescales of aggregation, Table 4 quantifies the similarity between networks aggregated on different days. The measure is defined as the average cosine similarity between all pairs of instances of a node's neighbourhood, averaged over all nodes. We see that the similarity is higher for the co-presence networks, as expected since the networks are denser. Sampling method 1 generates networks that are more similar across days than the data, and the other two methods generate networks that are less similar (with the exception of the LH10 case, and of method 2 in the InVS13 case). In the cases of method 2 for the LyonSchool data, and of methods 2 and 3 for the Thiers13 data, the sampled networks are even almost as different as they would be after a random shuffling of the links.
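The similarity measure defined above can be written compactly. In the following sketch, each daily network is a dictionary of undirected link weights and a node's neighbourhood is stored as a sparse vector; the function names are hypothetical, and the convention that a node with no link on a given day has similarity zero with its other daily neighbourhoods is an assumption that the text does not specify.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts key -> weight."""
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    # assumption: an empty neighbourhood has similarity 0 with any other
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def neighbourhood(day_weights, node):
    """Vector of link weights between `node` and the other nodes on one day.

    day_weights: dict {(i, j): weight} describing one daily aggregated network.
    Missing links are simply absent, which is equivalent to a zero weight.
    """
    vec = {}
    for (i, j), w in day_weights.items():
        if i == node:
            vec[j] = w
        elif j == node:
            vec[i] = w
    return vec


def mean_daily_similarity(days, nodes):
    """Average cosine similarity over all nodes and all pairs of daily networks."""
    sims = []
    for n in nodes:
        vecs = [neighbourhood(day, n) for day in days]
        sims.extend(cosine(a, b) for a, b in combinations(vecs, 2))
    return sum(sims) / len(sims)
```

Since every node contributes the same number of pairs of days, pooling all pairs as done here gives the same value as averaging first over pairs of days and then over nodes.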
In addition, Fig. 6 gives the evolution of the average degree and strength for networks aggregated in increasingly long time windows. First, the evolution of the real average aggregated strength is usually better recovered by the various sampling methods than the evolution of the degree. Second, which sampling method best recovers the evolution of the degree is again context dependent. However, in all cases the sampled data are much closer to the contact data than the co-presence network, which overestimates these quantities very strongly.

Figure 7 Node ranking similarity. InVS15 data set. We plot for each co-presence sampling method the Jaccard similarity between the top N % nodes in the real and surrogate contact data, when ranked according to their degree k, their strength s or their betweenness centrality b vs. N. The plot shows the median similarity and the shaded areas give the 90 % confidence interval.

Node centralities
In a network, more "central" nodes are usually considered as important, as they might play an important role for instance in spreading processes (or other dynamical phenomena) occurring in the network. It is thus of interest to understand whether the most central nodes in the contact network can be identified either in the raw co-presence data or in the surrogate contact data built from the co-presence information. As there are several ways of determining central nodes in a network, we consider here three of the most well-known centrality measures and apply them to the networks aggregated over the whole data collection: degree k, strength s and betweenness b of nodes in the aggregated networks. For each instance of each sampling method, we thus build the resulting surrogate aggregated contact network and rank nodes according to each centrality measure. We then compute the Jaccard similarity index between the top N % nodes in the real contact network and in the surrogate one. We plot in Fig. 7 the median similarity with the 90 % confidence interval, as a function of N, for the InVS15 case (see SI for the other cases).

28.4 (0.360) 42.6 (0.559) 15.8 (0.117)
For each data set we compute the maximum coreness, and report in parentheses the Jaccard index between the k-core of the contact network and the k-core in the original and sampled co-presence data (results are averaged over 100 realisations for each sampling method).
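The comparison of the top N % nodes described above can be sketched as follows. The function names are hypothetical, and both the rounding of the number of selected nodes and the tie-breaking rule (by node label) are assumptions that the text does not specify.

```python
import math


def top_fraction(scores, percent):
    """Set of the top `percent` % nodes of a dict node -> centrality value.

    Assumptions: the number of selected nodes is rounded up (at least one
    node), and ties are broken by node label to keep the result deterministic.
    """
    k = max(1, math.ceil(len(scores) * percent / 100))
    ranked = sorted(scores, key=lambda n: (-scores[n], str(n)))
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy usage with made-up degree values for four nodes
real = {"a": 5, "b": 4, "c": 3, "d": 2}
surrogate = {"a": 5, "b": 1, "c": 4, "d": 2}
similarity = jaccard(top_fraction(real, 50), top_fraction(surrogate, 50))
```

In this toy example the two top-50 % sets are {a, b} and {a, c}, so they share one node out of three distinct ones. Repeating the computation over the surrogate realisations and taking the median gives one point of the curves of Fig. 7.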
In general, no sampling method recovers correctly the most central nodes for low values of N . The best results are obtained for the conference data with similarities around 0.2 − 0.4. The similarity values increase as N increases but reach most often only values of ∼ 0.5 when considering the top 50 % nodes, meaning that only 25 % of the most central nodes are identified when using the surrogate data. The best results are obtained for the first sampling method for the LyonSchool case and for the LH10 case, with similarities reaching 0.6 − 0.7. Results are typically better than the random baseline but do not outperform the detection of most central nodes based on the whole co-presence network. In terms of the most central nodes as defined by the k-core decomposition (we recall that the k-core of a network is the maximal subgraph such that all nodes in the subgraph have at least degree k, and k is called the coreness), the overestimation of degrees in the co-presence network leads to an overestimation of the maximum coreness, while sampling leads to values closer to the ones of the contact data, but once again in a context-dependent way. The maximum core itself is only partially recovered in the whole and in the sampled co-presence networks (see Table 5).

Using surrogate contact data in epidemic simulations
We have seen in the previous section that none of the three sampling methods yields a perfectly accurate description of all the relevant features of the true contact network: each sampling method yields surrogate data with both interesting similarities and potentially important discrepancies with respect to the original contact data. We now consider the issue of using such surrogate data in simulations of spreading processes: as precise data on face-to-face contacts is not always available, it is important to understand if co-presence information can allow us to obtain on the one hand an accurate prediction of the outcome of an epidemic process, and on the other hand a reliable estimation of the impact of containment measures. In particular, it is important to be able to classify potential containment strategies to determine which one(s) are most adequate.
To this aim, we consider the paradigmatic Susceptible-Infectious-Recovered (SIR) model for epidemic spreading. In this model, susceptible (S) individuals can become infectious (I) at rate β when in contact with an infectious individual. Infectious individuals recover spontaneously at rate µ and enter an immune recovered (R) state. Simulations start with a single infectious individual chosen at random and are carried out until there are no infectious individuals left in the population, i.e., until individuals are either still susceptible or have been infectious and have then recovered. The impact of the epidemic is then quantified by the final fraction n_i of individuals in the R state.
We set β = 0.0004 and vary µ by tuning the reproductive number R = β/µ. For each value of R, we measure the fraction P(n_i > 20 %) of "large" outbreaks, in which the fraction n_i of the population that was reached by the outbreak is at least 20 %, and the distribution of the sizes n_i of these large outbreaks. We average the results over 10 000 simulations performed on the empirical contact network. For each sampling method, we build 100 different instances of the surrogate contact network, and perform 100 simulations on each surrogate network.
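The simulation procedure can be illustrated with a minimal discrete-time sketch on a temporal network given as a sequence of snapshots, one per time step. Several choices here are assumptions rather than details taken from the text: the rates are replaced by per-time-step probabilities `p_inf` and `p_rec`, the snapshots are replayed in a loop if the outbreak outlasts the data, and all names are hypothetical.

```python
import random


def sir_final_size(snapshots, nodes, p_inf, p_rec, rng, max_loops=1000):
    """Final fraction of recovered individuals for one SIR run.

    snapshots: list of edge lists [(i, j), ...], one per time step.
    nodes: list of individuals; p_inf and p_rec are per-step probabilities
    (assumed discretisation of the rates beta and mu).
    """
    state = {n: "S" for n in nodes}
    state[rng.choice(sorted(nodes))] = "I"  # single random seed
    for _ in range(max_loops):  # assumption: replay the data if needed
        for edges in snapshots:
            newly_infected = set()
            for i, j in edges:
                for src, dst in ((i, j), (j, i)):
                    if state[src] == "I" and state[dst] == "S" and rng.random() < p_inf:
                        newly_infected.add(dst)
            for n in nodes:
                if state[n] == "I" and rng.random() < p_rec:
                    state[n] = "R"
            for n in newly_infected:  # synchronous update of new infections
                state[n] = "I"
            if "I" not in state.values():
                return sum(s == "R" for s in state.values()) / len(state)
    # the run was cut by max_loops: count everyone who was ever infected
    return sum(s != "S" for s in state.values()) / len(state)


def large_outbreak_fraction(sizes, threshold=0.2):
    """Fraction of runs whose final size is at least `threshold`."""
    return sum(s >= threshold for s in sizes) / len(sizes)
```

Running `sir_final_size` many times on the empirical snapshots and on each surrogate network, then applying `large_outbreak_fraction` to the resulting sizes, gives the kind of comparison discussed in the text. Vaccination can be mimicked in this sketch by removing the vaccinated individuals from `nodes` and from all edges before the run.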
We also consider several simple methods to mitigate the spread, namely the vaccination of a number of individuals in the population, under the assumption of a perfect vaccine efficiency: vaccinated individuals cannot become infectious nor transmit the disease and thus slow down and hinder the propagation. We consider the vaccination of (i) 5, 10 or 20 individuals chosen at random; (ii) the most central 5, 10 or 20 individuals, where centrality is measured according to either degree, strength or betweenness in either the real or surrogate contact networks; (iii) when the population is structured in groups, all individuals in one group.
Figures 8 and 9 summarize our results for the InVS15 data set (see SI for the figures obtained with the other data sets). In terms of the evaluation of the impact of a spreading process, results are context dependent. The simulations performed on the surrogate data obtained with the first method generally lead to an overestimation of the epidemic risk, except for the hospital data. When using the second sampling method, we obtain a good estimation of the risk for the conference, school and high school data but an underestimation for the offices and hospital data. The third method on the other hand leads to a correct estimation for the offices and hospital data but to an underestimation for the school and high school data and to an overestimation for the conference.
We show in Fig. 9 the impact of the various vaccination strategies, quantified through the ratio of the probabilities of large outbreaks with and without vaccination, as well as the ratio between the median sizes of these large outbreaks. We rank the strategies according to their efficiency in the real contact network, in order to visualize easily whether the surrogate networks lead to the same classification of the strategies: indeed, even when the impact of each specific strategy is not accurately quantified, it would be interesting at least to understand which methods are most efficient. Results are once again uneven and context dependent (see also Table 6). In several cases such as SFHH the ranking of strategies obtained from the sampled co-presence is overall respected (Kendall's tau of 0.818 for the sampling method 1 on the size of outbreaks), while it can be strongly reshuffled in other cases (for instance in the Thiers13 case).

Figure 9 Vaccination strategies. InVS15 data set. We plot the ratio between the vaccination and no vaccination cases of the fraction of the total number of outbreaks that reach at least 20 % of the population (top), and of the median size of these outbreaks (bottom) for different vaccination strategies, for the original data and the reconstructed networks. The vaccination strategies are ordered by decreasing efficiency, based on the effect on the real contact data. The group * strategies consist in vaccinating one or several groups entirely; the group rand strategy vaccinates ng random nodes, where ng is the average group size; the rand n strategies randomly vaccinate a specified fraction n of nodes; the b n, k n, s n strategies vaccinate the top n % nodes according to, respectively, betweenness centrality, degree and strength ranking.
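The agreement between two rankings of strategies, quantified in the text by Kendall's tau, can be computed as in the sketch below. It implements the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; which variant was used for Table 6 is not stated, so this choice, like the function name, is an assumption.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two score dicts strategy -> efficiency.

    Only strategies present in both dicts are compared. A pair of strategies
    is concordant if both dicts order it the same way, discordant otherwise;
    ties in either dict contribute zero (tau-a, assumed variant).
    """
    keys = sorted(set(x) & set(y))
    concordant = discordant = 0
    for a, b in combinations(keys, 2):
        product = (x[a] - x[b]) * (y[a] - y[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(keys) * (len(keys) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Here `x` would hold the efficiency of each strategy measured on the real contact data and `y` the efficiency measured on one surrogate network; a value of 1 means identical rankings and a value of -1 means fully reversed rankings.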

Discussion and conclusion
In this paper, we have investigated whether low resolution co-presence information can be used as a substitute for detailed face-to-face proximity data, both from the point of view of extracting large-scale structural and statistical features of the temporal contact network in a population and in data-driven models of epidemic processes in a population. We have considered several data sets collected in various contexts that contain both high-resolution data on face-to-face contacts between individuals and coarser location data, both with temporal resolution. The location data can thus be transformed into a co-presence temporal network between individuals. Given its lower spatial resolution, this co-presence data contains many more events than the contact data, leading to much denser aggregated networks: indeed, all individuals in a given area are considered as co-present, while only some of them are typically engaged in a face-to-face contact. Despite this expected issue, a number of properties related to group structure and statistical distributions of temporal properties are similar in contact and co-presence data, with similar matrices of densities of links between groups and broad distributions of (aggregate) contact durations.
We have thus examined several methods to down-sample the co-presence networks to create surrogate contact networks with overall the same amount of contact time as the real contact data. The surrogate data statistics are in general closer to the real contact data than the raw co-presence, in particular regarding the distribution of node degrees and link weights (and their evolution in networks aggregated over increasing time windows). These results mean in particular that the distribution of aggregate contact durations, a very important property that has a strong impact on the unfolding of processes on networks such as epidemic processes, could be approximately retrieved from simple sampling processes of the co-presence data and thus fed into data-driven models of populations. Several other properties, such as the precise value of the average degree, the average clustering or the size of the largest cliques and cores, turn out however to be strongly context-dependent. Moreover, the most central nodes of the contact network are not better identified from the surrogate data than from the bare co-presence information.
We have moreover investigated the use of such surrogate contact data in numerical simulations of spreading processes in a population. Overall, simulations performed on surrogate data obtained with one of the sampling methods yield results close to the ones obtained with the real data, while the other methods over- or underestimate these results, but the best method turns out to depend on context. Note however that all these methods give results much closer to those of the real contact network than if raw co-presence is used, given that co-presence strongly overestimates the contacts and thus yields a strongly overestimated epidemic risk. We moreover investigated the possibility of ranking containment strategies according to their efficiency, and found that this ranking is once again context dependent: in some cases, simulations on sampled co-presence networks allow us to uncover the most efficient vaccination strategies for containing a spread on the real contact data, while in other cases the rankings differ quite strongly.
In conclusion, we showed that co-presence data, while yielding interesting insights into some of the large scale properties of the contact network, is not easily usable to build in a reliable and systematic fashion surrogate contact data that reproduces detailed features of the real contacts and could be used in numerical simulations to predict the outcome of spreading processes and the impact of containment strategies, at least for processes involving contagion at short distances [39]. While more sophisticated sampling procedures might be devised, they would most probably involve more parameters and/or additional information not present in the raw co-presence data, and would also most probably still give context-dependent results. We note however that even coarse location information has been shown to be useful additional information whenever the precise contact data is incomplete [29]. Optimally, data collection with wearable sensors should thus contain both high resolution data about relative positions of individuals, in order to detect face-to-face proximity, and coarser co-presence information to inform for instance on mobility patterns within buildings or to compensate for potential data losses.