Temporal social network reconstruction using wireless proximity sensors: model selection and consequences

The emerging technologies of wearable wireless devices open entirely new ways to record various aspects of human social interactions in a broad range of settings. Such technologies make it possible to log the temporal dynamics of face-to-face interactions by detecting the physical proximity of participants. However, despite the wide usage of this technology and the collected datasets, precise reconstruction methods transforming the raw recorded communication data packets to social interactions are still missing. In this study we analyse a proximity dataset collected during a longitudinal social experiment aiming to understand the co-evolution of children’s language development and social network. Physical proximity and verbal communication of hundreds of pre-school children and their teachers are recorded over three years using autonomous wearable low power wireless devices. The dataset is accompanied by three annotated ground truth datasets, which record the time, distance, relative orientation, and interaction state of interacting children for validation purposes. We use this dataset to explore several pipelines of dynamical event reconstruction including earlier applied naïve approaches, methods based on Hidden Markov Models, or on Long Short-Term Memory models, some of them combined with supervised pre-classification of interaction packets. We find that while naïve models provide the worst reconstruction, Long Short-Term Memory models provide the most precise way to reconstruct real interactions, with up to ∼90% accuracy. Finally, we simulate information spreading on the reconstructed networks obtained by the different methods. 
Results indicate that a small improvement of network reconstruction accuracy may lead to significantly different spreading dynamics, while sometimes large differences in accuracy have no obvious effects on the dynamics. This demonstrates not only the importance of precise network reconstruction but also that of a careful choice of the reconstruction method in relation to the data collected. Missing this initial step in any study may seriously mislead conclusions made about the emerging properties of the observed network or any dynamical process simulated on it.


Introduction
The precise observation of the dynamics of face-to-face interactions between people has been a major challenge in social studies [1]. Such observations were commonly limited to small-scale settings over short periods of time [2]. Meanwhile, such studies promise keys to better understand the formation of social ties, the emergence of social groups, psychological well-being [3], or how information or influence diffuses via personal interactions. Recently developed technologies of wearable wireless devices made possible a giant leap in this direction, as they allowed for large-scale experiments to observe offline interactions in multiple settings such as schools [4,5], museums or conferences [6], hospitals [7,8], or even between animals [9,10]. These experiments highlighted novel behavioural patterns [11,12] and their consequences on ongoing dynamical processes like epidemic spreading [13][14][15][16] or the adoption of different behavioural forms [12]. However, all these studies share some methodological similarities. First, although they have been implemented in rather different ways using centralised radio-frequency identification (RFID) [11,16][17][18][19][20][21] or autonomous low power wireless device (LPWD) [8,22,23] technologies, they all provide the same output as sequences of radio signal pairs. Second and more importantly, using the collected data streams, the reconstruction methods of temporal interactions were commonly based on naïve assumptions [18,24], which may seem convenient at first, but have indisputable consequences on the reconstructed event structure and any observed process taking place on it. To address this shortcoming, in this paper, starting from the recorded raw communication data, we explore multiple temporal network reconstruction methods to find the best way to rebuild the original sequence of interactions. 
In addition, we will show that even a small improvement in the reconstruction accuracy may have dramatic effects on the dynamics of ongoing processes simulated on the temporal network.
The dynamics of human actions and interactions are conventional topics in social and behavioural sciences, while their modern study has established a new, distinguishable area in the field of computational social science [25] called human dynamics. This transition has been fuelled by some radically new digital data collection technologies [26], which make it possible to track human behaviour at the individual level within large populations. On one hand, they provide automatised data collection methods for in vivo observations of millions of people without the usual observational biases but with certain limits in control and reproducibility. Popular methods rely on mobile communication devices [27] or online communication platforms [28], just to give a few examples. On the other hand, online experimental settings, crowd-sourcing services, and new behavioural tracking technologies provided by Internet-of-Things (IoT) solutions like RFIDs [17], IoT-LPWD [8,22,23], in addition to reality mining/personal logs [29,30], real-time surveillance [31], or smart/GPS enabled devices [32,33] provide the opportunity to precisely observe the behaviour of a selected group of people in more controlled settings.
However, these new opportunities open new challenges as the recorded raw data do not translate easily to knowledge. To bridge this gap, careful methodologies need to be developed to obtain unbiased and precise proxy descriptions of human behaviour. For example, in the case of RFID and LPWD tracking, information about the physical proximity of two people comes as pairs of data sequences recording mutually shared packets with certain frequencies and varying signal strength. However, these devices are not flawless: they are influenced by external noise and may lose packets for several reasons, e.g., due to interference with other devices, communication overload, physical obstacles, humidity level, or unfavourable orientation of interacting people. The reconstruction of the real temporal interactions from such noisy and incomplete data sequences translates to a multidisciplinary challenge involving signal processing, communication protocols, social behavioural studies and advanced techniques of statistical data analysis. Supervised learning methods can be especially useful for this task, as they are able to recognise recurrent dynamical patterns associated with ongoing or missing interactions. As a result, we may obtain a sequence of proxy interactions between people, which can be best represented as a temporal network [34] coding simultaneously the dynamics and structure of the social fabric. The precise estimation of these interactions has important consequences. Misplaced or wrongly identified interactions may lead to completely different network dynamics and in turn to falsely claimed time-respecting paths. Such paths determine the possible ways in which information or influence can spread in the observed temporal structure, thus even a minor change in their structure can lead to radically different outcomes of co-evolving processes like information diffusion [35,36], opinion dynamics [37][38][39] or epidemic spreading [40].
In this paper, we focus on human proxy data collected in a large-scale social study called DyLNet [22]. This experiment, explained in detail in Sect. 2, aims to understand the language development of children at an early age, by observing the verbal and social interactions of pre-school children and their teachers over three years. Physical proximity and verbal interactions of the participants are recorded using voice enabled decentralised Low Power Wireless tags. By focusing on the proximity data, and relying on ground truth data recorded simultaneously in controlled settings, we explore several supervised reconstruction methods of the temporal social interactions. As demonstrated in Fig. 1, first we build a binary interaction sequence from the raw data of packet exchange between the LPW badges of a pair of individuals, and then use it to reconstruct the time and duration of the mutual interactions among the participants in order to obtain a temporal network representation of the social interaction dynamics. As we explain in Sect. 3.1, we translate the first level problem to a classification task, while in Sect. 3.2 we explore multiple naïve and advanced supervised learning methods to solve the final reconstruction problem of the dynamical interaction sequences. Further, in Sect. 4, via data-driven simulation of spreading processes, we demonstrate that while commonly used naïve reconstruction methods consistently overestimate the number of interactions, using advanced supervised learning methods, even a minor improvement in the reconstruction performance can have radical effects on the dynamics of an ongoing process.

Figure 1 Temporal network reconstruction pipeline. Starting (a) from mutually observed packets of pairs of LPWD tags, (b) we train models using the raw and annotated data to (c) reconstruct interaction and non-interaction periods between individuals to (d) reconstruct events and ultimately (e, f) build a temporal network

Experimental setting and data collection
Our primary dataset was collected during a large-scale social experiment, which employed IoT-LPWD technology to track human interactions. More precisely, it involved, in each year of the project, about 170 pre-school children (aged between 2½ and 6½) in 7 classes, together with about 30 teachers and assistants. During this experiment, their physical proximity and verbal interactions were recorded using voice enabled LPWD badges. The experiment ran in a French pre-school over three years, for one week in each month during school periods (i.e. 30 weeks in total). The original goal of this study was to analyse the relations between child socialisation and language development in pre-school, by describing the co-evolution between social network dynamics (changes in the social relationships) and the language dynamics in the networks (interindividual influences and changes in language skills).

Data collection
For the purpose of our analysis, we concentrate on the proximity data recorded by LPWD badges worn by each of the participants. In our architecture, developed by SEQUANTA [23], each badge is associated with a unique ID and uses the IEEE 802.15.4 low-rate wireless standard to communicate. Badges broadcast a 'hello' packet with 0 dBm transmission power for 384 μs every 5 seconds. For communication, they use the carrier-sense multiple access (CSMA) protocol: they first listen to the dedicated channel, and only transmit packets if it is clear, to avoid collisions. When a badge is not transmitting, it is listening to capture incoming packets from other devices in its vicinity. Due to the CSMA protocol, a random offset can appear in the timing of the packet transmission, which can accumulate and cause considerable delay as compared to a global clock signal. For this reason, the time of each badge is synchronised with a central device, connected to a computer, which propagates the global time to all badges. Badges that fall out of time synchronisation stop transmitting packets. This way, each sent packet is time-stamped. Beyond the broadcast global time signal, our architecture is completely distributed, as it is built from autonomous badges that record data locally on their own flash memory card. Every day, badges are plugged into a USB board to re-charge their batteries overnight, and at the end of each measurement period (typically after a week), we gather data from their flash memory cards to a computer, and re-initialise them ahead of the next recording session. Note that there are other wireless architectures developed for similar purposes, such as OpenBeacon [17] or Open badges [30], but they typically use RFID technology with centralised communication protocols and a different philosophy for data collection.

Ground truth datasets
Beyond the experiments, we collected multiple annotated ground truth datasets for training and validation. The first one (GT1) was recorded by a researcher sitting in a classroom and logging the state of interaction/no-interaction between a pair of children and their relative orientation using the scan sampling method (7 pairs of children were observed for Σ = 3 hours 35 minutes, in ranges of 17-54 minutes). The advantage of this setting was to gather annotated data in-situ with all noise and interference present (the class containing 20-25 individuals wearing a badge). To obtain annotated data in more controlled settings, a second dataset (GT2) was recorded by statically positioning a pair of children, at a given distance and orientation from each other for ∼10 minute periods, in a separate room (4 pairs of children positioned in two different settings each, observation time in a given position range was 2-10 minutes, summing up to ∼56 minutes of observation). These data provided stationary observations of packet exchange; however, they failed to capture the dynamic nature of social interactions. Finally, a third dataset (GT3) was recorded by a researcher sitting in a classroom and logging the relative distance and orientation of the individuals present, both children and adults, during regular class activities (28 individuals wearing a badge observed for 30 min). The advantage of this dataset was to make possible a direct comparison of the RSSI values collected by the badges with the actual distance between social partners. Both the GT1 and GT3 datasets were collected using the scan sampling method [41] (data recorded at intervals of 10 seconds for GT1, and at intervals of 2 minutes for GT3) with the Animal Observer application for iPad [42]. For a more detailed description of how ground truth data were collected see Appendix A.

Environmental dependencies and parameters
During the experiments, each badge recorded a time-stamped sequence of packets, which were broadcast by other badges in its vicinity. More precisely, a sequence recorded by a given badge consists of (t, ID, RSSI) tuples, where t is the time of observation, ID is the unique identifier of the observed badge, and RSSI is the received signal strength indicator of the observed packet transmitted as a radio signal. In Fig. 2a we show the distribution of RSSI values recorded over a week (24 hours, 196 badges). Values were observed between -24 and -94 dBm with a bimodal distribution. One peak, above -45 dBm, corresponds to the situation when badges are stored in a box close to each other, thus communicating with strong radio signals. The other peak, below -75 dBm, corresponds to any other observations, including ones capturing real social interactions, with a -94 dBm ceiling value hardwired in the badge configuration. Observed RSSI values can depend on external factors like distance, body orientation, battery status, or humidity conditions. While battery status should not be an issue here as badges are charged overnight, and we can control for distance and orientation (as explained below), we cannot account for changing weather conditions, which can cause some fluctuations in our measurements. In addition, the potential conflict of signals within 1 μs may induce accidental loss of observed packets in case of interactions within large social groups.
Social as well as verbal interactions depend upon the relative distance and orientation of the participants, which should be reflected by the RSSI values of captured packets in our experiments. The strength of transmitted radio signals depends on distance and orientation as they are effectively absorbed by the water of the human body. These dependencies are demonstrated by the measurements depicted in Fig. 2b and c. There, in panel Fig. 2b, the density plot of RSSI values (y-axis) of captured packets shows a non-linear negative correlation with the distance between participants (x-axis). This measure suggests, for a realistic distance of maximum 1.5-2 meters for verbal interactions, a corresponding RSSI range between -70 and -75 dBm, while high intensity regions for lower RSSI values are to each other. However, only visually inspecting these results, it is very difficult to determine a precise RSSI threshold separating real and false social interactions. To better solve this task, next we frame this question as a classification problem to distinguish between packets indicating real and false social interactions that we can use then to reconstruct the temporal network.

Temporal network reconstruction
In our pipeline, we are going to reconstruct the temporal network from raw data in five main steps, as demonstrated in Fig. 1. First, we discuss how to arrive from the recorded data at a handshake pair sequence, whose items indicate mutual handshakes between interacting badges. Then we perform a binary state classification process, turning the handshake pair sequence into a binary sequence where each item indicates the mutual social interaction state. Finally, we propose several methods to reconstruct the real dyadic temporal interactions with duration, which in turn provides us with a temporal network capturing time-varying interactions between a larger group of individuals.
We separate the binary state classification step from event reconstruction as our first goal was to create a binary sequence of interactions from the raw data, to which we can apply earlier defined methods. In addition, we found this approach necessary as we identified two potential sources of error in the reconstruction process. One is due to the fluctuations of recorded signal strengths of transmitted packets, while the other is caused by packet loss or interference, which induces uncertainties in present or absent handshake pairs. This second type of error makes it difficult to reconstruct events with longer duration, a problem for which earlier studies provided only overly simplified heuristic solutions with limited precision. We will explore various methods relying on this two-step approach, but also show a method which solves the problem at once by taking packets with signal strength values as input and directly reconstructing temporal interactions with duration.

Binary signal reconstruction
Physical proximity between two participants, A and B, within the right distance range and orientation should appear as a sequence of consecutive mutual 'handshakes' of badges for the duration of their interaction. To obtain the sequence of these handshakes, we take the sequences of packets observed by the badges of A and B and match those packets which correspond to the mutual observation of the two participants. In other words, since packets are transmitted every 5 seconds, we match two packets into a single handshake event (see Fig. 1b) if they appear within ±2.5 seconds of each other and they refer to the opposite ID (for A from B and for B from A). Missing packets are also recorded in the handshake sequence (see Fig. 1b empty red arrows) and are assigned a default RSSI value, -95 dBm, outside the possible RSSI range, so that we can easily distinguish them from observed packets. To clearly distinguish 'fake signals' from 'real signals', we append one item called 'pair_state' to each handshake RSSI pair to indicate the number of 'real signals' in the pair. This variable can take values 0, 1 or 2, and it makes it possible to code for the presence of 'fake signals' while keeping the normalisation of RSSI values possible for the coming reconstruction methods. Thus the encoding of each handshake pair becomes a vector (RSSI_A, RSSI_B, pair_state), forming a sequence of handshake events recording all information for the reconstruction task.
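As an illustration only, the handshake encoding described above could be sketched in Python as follows. The function name `build_handshakes`, its arguments, and the alignment of packets to a shared 5-second grid are assumptions made for this sketch; the text only states that packets are matched when they lie within ±2.5 seconds of each other, and that missing packets receive -95 dBm.

```python
MISSING_RSSI = -95   # default value for a lost packet, as stated in the text
STEP = 5.0           # broadcast period in seconds

def build_handshakes(seen_by_a, seen_by_b, t_start, t_end):
    """Pair the packets two badges received from each other.

    seen_by_a: (time, rssi) packets that badge A received from badge B.
    seen_by_b: (time, rssi) packets that badge B received from badge A.
    Returns one (rssi_a, rssi_b, pair_state) vector per 5-second slot.
    Assumption: packets are binned on a shared grid starting at t_start.
    """
    n_slots = int((t_end - t_start) // STEP)

    def to_slots(packets):
        slots = [None] * n_slots
        for t, rssi in packets:
            k = int((t - t_start) // STEP)
            # Keep only the first packet that falls into each slot.
            if 0 <= k < n_slots and slots[k] is None:
                slots[k] = rssi
        return slots

    handshakes = []
    for ra, rb in zip(to_slots(seen_by_a), to_slots(seen_by_b)):
        # pair_state counts the real (observed) signals in the pair: 0, 1 or 2.
        pair_state = (ra is not None) + (rb is not None)
        handshakes.append((MISSING_RSSI if ra is None else ra,
                           MISSING_RSSI if rb is None else rb,
                           pair_state))
    return handshakes
```

Binning on a fixed grid is a simplification of pairwise matching; it keeps the sequence regularly sampled, which is what the later reconstruction steps operate on.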
In order to determine if a handshake pair should be considered as a state of social interaction, we use GT1, where we recorded the start and end time of each social interaction so that we could mark each handshake pair as an interaction or non-interaction event. This is shown in Fig. 3a, where we plot handshake pairs using their RSSI values as coordinates. Colours code a handshake being an interaction (green) or non-interaction (red). Since different handshake pairs could appear with the same RSSI values but different interaction states, in Fig. 3a we represent their fraction at a given location with a small pie chart. The strong diagonal component indicates that the RSSI values of mutual observations are very similar to each other, as expected, while the interactions seem to separate from non-interactions around ∼-70 dBm, which corresponds well to the earlier estimated threshold range. To solve this classification problem in a more systematic way, we trained a logistic regression model on the annotated GT1 dataset. As input we gave vectors of handshake pairs and we used their annotated labels for the training task. As output we received a probability for each state to be a real social interaction and we thresholded this probability at 0.5 to assign 0/1 states to each handshake pair. The obtained decision boundary is shown as a grey line in Fig. 3a. It appears to be linear, except close to the boundaries, where saw-teeth appear due to the two dimensional projection of a three dimensional decision surface. With this method we reached a 77.28% accuracy with 10-fold cross validation (for further details see Table 1) in classifying a handshake pair as a real social interaction or not. This way we can turn our sequence of handshakes into a binary signal (demonstrated in Fig. 1c), by assigning 1/0 to interaction and non-interaction events every 5 seconds.
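The classification step can be illustrated with a minimal, dependency-free sketch of logistic regression on (RSSI_A, RSSI_B, pair_state) vectors, thresholded at 0.5 as in the text. The feature scaling (centring at -70 dBm, dividing by 10), the learning rate, the number of iterations and all function names are assumptions of this sketch, not details given in the paper; in practice a library implementation would normally be used.

```python
import math

def scale(h):
    # Hypothetical normalisation: centre RSSI near the visual threshold (-70 dBm).
    ra, rb, pair_state = h
    return ((ra + 70.0) / 10.0, (rb + 70.0) / 10.0, pair_state / 2.0)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(handshakes, labels, lr=0.1, n_iter=2000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    X = [scale(h) for h in handshakes]
    w, b, n = [0.0, 0.0, 0.0], 0.0, len(X)
    for _ in range(n_iter):
        gw, gb = [0.0, 0.0, 0.0], 0.0
        for x, y in zip(X, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict_proba(model, h):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, scale(h))) + b)

def to_binary(model, handshakes, threshold=0.5):
    # 1 = interaction, 0 = non-interaction, one state per 5-second handshake.
    return [1 if predict_proba(model, h) >= threshold else 0 for h in handshakes]
```

The un-thresholded output of `predict_proba` corresponds to the probability sequence that the text later feeds to one of the recurrent models, while `to_binary` gives the binary signal.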

Interaction state reconstruction methods
Using the obtained binary sequences, which from now on we call un-reconstructed sequences, our next task is to reconstruct the real interactions which appeared between pairs of participants. The general problem here is to identify false interaction events, which were induced by interference and thus should appear as actual non-interactions, and to reconstruct true ones, which were missed due to packet loss. As this is the most challenging task in our methodological pipeline, we are going to follow three different methodological tracks. We will start with a naïve approach commonly used in the literature, then we will explore variants of the Hidden Markov model (HMM) and the Long Short-Term Memory (LSTM) model to find the best solution for this dynamical reconstruction task. Note that while the naïve method only reconstructs interaction periods, the two learning methods naturally adapt to the inverse problem and also reconstruct non-interacting periods with falsely observed interactions in the middle.

Naïve reconstruction model
Consecutive binary signals in a sequence (following each other in 5 seconds here) can be merged into long interaction periods (we call them events) with duration equal to the length of the continuous interaction. These events are separated by non-interaction gaps.
If such a non-interaction gap is induced by an accidental packet loss, it is presumably very short. On the other hand, if it is due to a real break of social engagement, it may occupy a longer period. Based on this assumption we can design a very simple reconstruction method, where we merge two interaction periods if they are separated only by a sequence of non-interaction events shorter than a given gap threshold value. This naïve reconstruction method is demonstrated in Fig. 3c, where we use gap = 1 to reconstruct the sequence observed in Fig. 3b. This method has been used conventionally in most of the RFID social experiments so far, typically choosing the threshold to be gap = 0, thus merging only directly consecutive interaction packets (what we call the non-reconstructed method here), or gap = 1, corresponding to a gap smaller than 40 seconds. This choice has been challenged recently by Elmer et al. [24], who identified the optimal threshold as 75 seconds for the best reconstruction accuracy.
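A minimal sketch of this naïve procedure is given below. It assumes that `gap` counts the maximum number of consecutive non-interaction slots that may be filled (so gap = 0 leaves the sequence unchanged); the function names and the (start time, duration) event format are illustrative choices, not taken from the paper.

```python
def fill_gaps(states, gap):
    """Set to 1 every run of at most `gap` zeros enclosed between two 1s."""
    out = list(states)
    last_one = None
    for i, s in enumerate(states):
        if s == 1:
            if last_one is not None and 0 < i - last_one - 1 <= gap:
                for j in range(last_one + 1, i):
                    out[j] = 1
            last_one = i
    return out

def to_events(states, step=5):
    """Merge consecutive 1s into (start_time, duration) events, in seconds."""
    events, start = [], None
    for i, s in enumerate(list(states) + [0]):   # trailing 0 closes an open event
        if s == 1 and start is None:
            start = i
        elif s != 1 and start is not None:
            events.append((start * step, (i - start) * step))
            start = None
    return events
```

For example, with the binary sequence `[1, 1, 0, 1, 0, 0, 0, 1, 0]` and gap = 1, the single missing slot is filled while the three-slot break is kept, giving two events.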

Hidden Markov model
The second reconstruction method we chose is the Hidden Markov Model. To set the parameters of the HMM with a supervised learning method, we used the GT1 dataset with the annotated states of handshake sequences as the state sequence, and the sequence of binary states (explained in Sect. 3.1) as the observations. After training on the annotated data (GT1), we determined the values of the conditional transition probability of hidden states as the transition matrix, the conditional emission probability from hidden states to observation states as the emission matrix, and the initial state probability as the start matrix. In turn, we used these as parameters for the Viterbi algorithm to solve the most likely sequence problem and used the output as the reconstructed sequence.
To enrich the information coded in the input sequence, instead of providing a sequence of binary values for each time step, we define a backward window, which contains some short term information before the actual state being reconstructed. More precisely, as demonstrated in Fig. 3d, we define a tuple of win items (there win = 3), where the last one is the state to reconstruct while the others are the previous states in the sequence. Applying this envelope definition to a binary state sequence, we create an envelope with backward signals for each signal, thus transforming a binary state sequence into an embedded envelope sequence. Subsequently, we use these transformed envelopes instead of binary signals to define the hidden states and observation states, as well as to determine all the matrices. Finally, as the output of the Viterbi algorithm we obtain a sequence of envelopes, with the last item of each envelope being the predicted interaction/non-interaction state of each time step. Note that we tried multiple other envelope methods (not reported here) coding different distance information between actual and last interaction packets but obtained worse performance than in the actual case.
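The supervised HMM with backward envelopes can be sketched as follows, in plain Python. This is a hedged illustration: the additive smoothing (`alpha`), the zero-padding at the start of a sequence, the probability floor for envelopes never seen in training, and all function names are assumptions introduced here and are not specified in the text.

```python
import math
from collections import Counter

def to_envelopes(seq, win):
    # Each item becomes a tuple of its win-1 predecessors plus itself
    # (assumption: the start of the sequence is padded with zeros).
    padded = [0] * (win - 1) + list(seq)
    return [tuple(padded[i:i + win]) for i in range(len(seq))]

def train_hmm(hidden_seqs, observed_seqs, win, alpha=1.0):
    """Estimate start, transition and emission log-probabilities by counting."""
    start, trans, emit = Counter(), Counter(), Counter()
    states, symbols = set(), set()
    for hid, obs in zip(hidden_seqs, observed_seqs):
        hid, obs = to_envelopes(hid, win), to_envelopes(obs, win)
        start[hid[0]] += 1
        for i, (h, o) in enumerate(zip(hid, obs)):
            states.add(h)
            symbols.add(o)
            emit[h, o] += 1
            if i > 0:
                trans[hid[i - 1], h] += 1
    states, symbols = sorted(states), sorted(symbols)

    def log_table(counts, rows, cols):
        table = {}
        for r in rows:
            total = sum(counts[r, c] for c in cols) + alpha * len(cols)
            for c in cols:
                table[r, c] = math.log((counts[r, c] + alpha) / total)
        return table

    n_start = sum(start.values()) + alpha * len(states)
    log_start = {s: math.log((start[s] + alpha) / n_start) for s in states}
    return states, log_start, log_table(trans, states, states), log_table(emit, states, symbols)

def viterbi(model, obs):
    """Most likely hidden state sequence for the observed envelopes."""
    states, log_start, log_trans, log_emit = model
    floor = math.log(1e-9)   # assumed penalty for unseen observation envelopes
    score = {s: log_start[s] + log_emit.get((s, obs[0]), floor) for s in states}
    back = []
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log_trans[p, s])
            score[s] = prev[best] + log_trans[best, s] + log_emit.get((s, o), floor)
            ptr[s] = best
        back.append(ptr)
    path = [max(states, key=score.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def reconstruct(model, binary_seq, win):
    # The last item of each decoded envelope is the reconstructed state.
    return [env[-1] for env in viterbi(model, to_envelopes(binary_seq, win))]
```

With win = 1 the sketch reduces to a standard two-state HMM on the binary signal; larger windows enlarge the state space to the envelopes actually seen in the annotated data.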

Bi-directional LSTM model
The Hidden Markov Model has two limitations in terms of reconstructing the real interaction signal. First, it can only consider states from the past, while states in the future may be also important for the actual state to predict. Second, it is a Markov model thus it can consider only short-term temporal correlations between the actual and previous states. We tried to overcome this shortcoming by introducing longer observation windows for each state, which helped to learn longer temporal correlations yet they were very limited to the actual window size.
Bidirectional recurrent neural networks propose simultaneous solutions to these two problems as they can be trained using input information in the past and future of a specific time frame [43] (for demonstration see Fig. 3e). In particular, the Bidirectional Long Short-Term Memory (BiLSTM) model has been shown to perform well on dynamical signal reconstruction. This model was initially adopted in speech recognition and was shown to improve model performance on sequence classification problems. In practice, it trains two LSTM models on a complete input sequence from opposite directions, one on the input sequence as it is, and another on a reversed copy. The outputs of each time step from the two LSTM models are merged and passed to the next layer, this way providing some additional context to perform the learning task better.
We applied this model in three different settings to find the best performing one. In one case, which we call BiLSTM-bin, the input of the model was a sequence of binary states we obtained from the classifier's binary output as explained in Sect. 3.1. In the second setting, which we call BiLSTM-logi, the input sequence was also generated from the classifier but, instead of binary states, it was a sequence of probabilities obtained as the direct output of the logistic regression before thresholding it. Finally, the third case, which we call BiLSTM-RSSI, does not rely on the sequence of classified states, but instead directly takes sequences of encoded handshake pair vectors (RSSI_A, RSSI_B, pair_state). This solution has the advantage of skipping one step of the reconstruction pipeline and of using a more complex set of information, but it needs to solve the same problem using noisy RSSI signals without pre-processing.

Event reconstruction
To train all these models we used GT1 since it was recorded in the most realistic setting. These data were built from 7 observation clips of 1290, 3060, 3200, 1230, 1740, 1350 and 1030 sec, covering 3 hours 35 minutes combined. For training and validation purposes, we evenly divided observations longer than 3000 seconds (2 clips) into 3 shorter periods, and retained 10 observation clips, all with length between 1000 and 1740 seconds (for a total of 3 hours and 18 minutes). To determine the best hyper-parameter set for each method, we applied a nested cross validation strategy. In the outer loop, we selected one clip (each of them once) for testing purposes, thus we held it out of the training of the model in this round. On the remaining 9 clips we performed the traditional 9-fold cross validation. Considering all combinations, we could compute the average accuracy over the 9 possible divisions of training-validation sets in order to screen hyper-parameter dependencies. Subsequently, we could repeat it 10 times to obtain the average test accuracy with the selected best hyper-parameters. Note that while computing averages we took into account the variance in length of the actual clips used for validation or testing.
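The nested cross validation scheme described above can be sketched generically as follows. The callables `fit` and `score`, the clip format (an observed and an annotated sequence per clip), and the use of clip length as averaging weight are assumptions of this sketch; the naïve gap model is used only as a small stand-in for any of the reconstruction methods.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def nested_cv(clips, candidates, fit, score):
    """clips: list of (observed, truth) pairs; candidates: hyper-parameter values.

    Outer loop: hold out each clip once for testing.
    Inner loop: leave-one-clip-out validation on the remaining clips.
    Returns the length-weighted test accuracy and the parameter chosen per fold.
    """
    test_scores, test_lengths, chosen = [], [], []
    for i, test_clip in enumerate(clips):
        rest = clips[:i] + clips[i + 1:]

        def validation_score(param):
            scores, lengths = [], []
            for j, val_clip in enumerate(rest):
                model = fit(rest[:j] + rest[j + 1:], param)
                scores.append(score(model, val_clip))
                lengths.append(len(val_clip[0]))
            return weighted_mean(scores, lengths)

        best = max(candidates, key=validation_score)
        model = fit(rest, best)
        test_scores.append(score(model, test_clip))
        test_lengths.append(len(test_clip[0]))
        chosen.append(best)
    return weighted_mean(test_scores, test_lengths), chosen

# Stand-in model for demonstration: the naive gap filler has nothing to train.
def fill_gaps(states, gap):
    out, last_one = list(states), None
    for i, s in enumerate(states):
        if s == 1:
            if last_one is not None and 0 < i - last_one - 1 <= gap:
                out[last_one + 1:i] = [1] * (i - last_one - 1)
            last_one = i
    return out

def fit_naive(train_clips, gap):
    return gap

def score_naive(gap, clip):
    observed, truth = clip
    pred = fill_gaps(observed, gap)
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)
```

Calling `nested_cv(clips, [0, 1, 2], fit_naive, score_naive)` on a list of annotated clips would return the averaged test accuracy together with the gap value selected in each outer fold.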

Hyper-parameters
All BiLSTM models have two hyper-parameters to define their architecture, the number of hidden neurons and the number of hidden layers. In our computations, we decided to use architectures with a single hidden layer, which was a sufficient choice for the relatively small training data we have. At the same time, we explored the dependency of the models on the number of hidden neurons with a grid search. The results summarised in Appendix B suggested that the performance of the models depended only weakly on this hyper-parameter, but indicated different optimal values for their best performance, as summarised in Table 2.
The most important hyper-parameter controlling the performance of all of the methods was the window size, which determined the length of temporal correlation a given method could consider. For the naïve model, this window can be associated with the gap parameter (see Fig. 3c). In case of the HMM model, as shown in Fig. 3d, it is the size of the window that the model considers from the past to infer the actual state. Finally for the BiLSTM models, this window was defined as an envelope of equal number of states before (past) and after (future) relative to the actual state to reconstruct (see Fig. 3e).

(Figure caption fragment) ... Table 2. Horizontal white bar inside box is median and white star is average
To choose the best window size, we took it as a parameter to compute the dependency of average accuracy values over the validation sets. As the results in Fig. 4a depict, the reconstruction accuracy of each of the models shows a strong dependency on the selected window size. First, in the case of the naïve method, by increasing the filled non-interaction gap size the accuracy reaches a maximum at gap = 6. This corresponds to a gap length of 35 seconds, which is somewhat smaller than the gap window size of 75 seconds reported by Elmer et al. [24] on another RFID dataset. In the case of the HMM model, the best performance corresponds to the same window size win = 6. For the BiLSTM models, the accuracy increases with the window size but reaches a plateau at window size win = 27 for the BiLSTM-bin, win = 25 for the BiLSTM-logi, and win = 27 for the BiLSTM-RSSI, after which the reconstruction accuracy decreases.
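For illustration, the symmetric envelope used as input for the bidirectional models can be built as in the short sketch below. Whether the reported win values count states on one side or on both sides of the actual state is not specified in this section, so the parameter `half` (states per side) and the zero-padding at the sequence boundaries are assumptions of the sketch.

```python
def symmetric_windows(seq, half, pad=0):
    """For each item, return the `half` states before it, the item itself,
    and the `half` states after it, padding beyond the sequence ends."""
    padded = [pad] * half + list(seq) + [pad] * half
    return [tuple(padded[i:i + 2 * half + 1]) for i in range(len(seq))]
```

The same helper works for binary states, classifier probabilities, or (RSSI_A, RSSI_B, pair_state) vectors, provided a suitable `pad` value is passed for the chosen input type.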

Performance of network reconstruction
After computing the average accuracy values over the test sets, we found that, surprisingly, all methods performed the reconstruction task relatively well (see Table 2 and Appendix B for the confusion matrices). Even the non-reconstructed sequence reaches a surprisingly high accuracy of 77.28%. The naïve method, commonly used in other studies, performs significantly better with 83.36%, closely matching the performance (84.25%) of the considerably more complicated HMM model. It is evident, however, that of all tested models the BiLSTM methods solve the temporal network reconstruction task best. They all provide accuracy at least 4 percentage points higher than any other method, reaching 88.34% for the BiLSTM-bin method, and 89.02% and 90.03% for the BiLSTM-logi and BiLSTM-RSSI methods respectively. More importantly, the best performing BiLSTM methods are also the ones providing accuracy values with the smallest fluctuations over the different test cases. This is reflected by the standard deviation values reported in Fig. 4b and Table 2, where all performance measures are summarised. In summary, these results suggest that the pipelines with binary classification and logistic regression are among the best performing, but the BiLSTM-RSSI model, trained directly on the RSSI values of interaction pairs, provides an equally good but simpler solution.

The reconstructed temporal network
The different models we introduced may reconstruct temporal networks with different characteristics. First of all, differences may arise because some models label a given event as an interaction while others label it as a non-interaction. This can be easily demonstrated by looking at the rates of interactions reconstructed by each model, as shown in Fig. 5a for a single morning period (2.5 hours). There, evidently, the highest event rate appears for the unreconstructed signal (naïve method with gap = 0), where we only merge consecutive packets labelled as interactions by the binary classifier. Relative to the unreconstructed sequence, each reconstruction method considerably reduces the rate of identified interaction events. The naïve method, still a very simple model that merges events at most 35 seconds apart, has the second highest event rate. The HMM method provides a lower event rate, while the BiLSTM-logi, BiLSTM-bin and BiLSTM-RSSI methods are closely grouped with the lowest rates, reconstructing about four times fewer events than in the unreconstructed case.
Despite these large differences in the reconstructed volume, the P(τ) inter-event time distributions between interactions on single links (shown in Fig. 5b) and the P(dur) distributions of interaction durations (Fig. 5c) have very similar shapes. These distributions all show broad tails ranging over several orders of magnitude and can be approximated well by power-law functions with exponents of α = 1.8 and γ = 2.1 respectively. Interestingly, this scaling is very similar to earlier observations in independent RFID studies [18,44]. On the one hand, this match verifies our experimental setting and observations; on the other hand, it suggests that the heterogeneities present in the dynamics of face-to-face interactions may be universal, with similar characteristics in independent systems.
To demonstrate the structure of the reconstructed network, we chose the BiLSTM-RSSI method as it was one of the best performing models with the smallest variance in accuracy.
Using this method, we reconstructed the events recorded in five consecutive mornings (15 hours of observation combined) for 165 children and 25 adults (teachers, assistants and interns), and aggregated the obtained interaction sequences into a static network structure. Link weights in this representation were defined as per-hour interaction rates between participants. This network is visualised in Fig. 5d, where we draw links with width proportional to the time the connected nodes interacted during periods when they were both present. The size of the nodes reflects their degrees, while their colours indicate the class they belong to (with darker colours indicating teachers, assistants and interns). This network structure has several interesting characteristics. First, the network is heterogeneous in degree, which is a common characteristic of social networks. Second, it recovers well the expected community structure, where children of the same class connect densely together with the teaching staff in charge of that group.
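The aggregation step can be illustrated with a minimal sketch. It assumes that each reconstructed event is a tuple (time, u, v) and that the number of hours each pair was jointly present is already known; the function name `aggregate_network` and the `copresence_hours` input are hypothetical and are not taken from the original pipeline.

```python
from collections import Counter


def aggregate_network(events, copresence_hours):
    """Collapse timestamped contacts into per-hour link weights.

    events: iterable of (time, u, v) reconstructed interaction events.
    copresence_hours: dict mapping frozenset({u, v}) to the number of hours
        during which both participants were present (hypothetical input).
    """
    # Count events per unordered pair; frozenset makes (u, v) and (v, u) identical.
    counts = Counter(frozenset((u, v)) for _, u, v in events)
    weights = {}
    for pair, n_events in counts.items():
        hours = copresence_hours.get(pair, 0)
        if hours > 0:  # skip pairs with no recorded co-presence
            weights[pair] = n_events / hours
    return weights
```

Normalising by co-presence time rather than by the total observation time keeps the weights comparable between pairs who attended on different days.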

Spreading processes on reconstructed networks
Temporal social interactions are far from random; they are highly correlated in time and structure. They are characterised by heterogeneous bursty dynamics [45], which potentially appear due to causal correlations between events. Such causally related adjacent events, sharing at least one person in common, build long time-respecting paths [34,36], which are extremely important as they determine how information, epidemics or influence can flow in the temporal network.
Consequently, the precise inference of temporal interactions in a network is extremely important for studying not only the emergent structure but also any ongoing process, such as language evolution, information spreading or epidemics. To demonstrate this issue, here we take all the different event reconstruction methods we explored, and study how the temporal networks reconstructed in different ways influence the dynamics of a simple information spreading process. More precisely, we use the susceptible-infected (SI) model [46] as one of the simplest prototypical models of information spreading, which in turn can be used to simulate the fastest possible spreading under certain conditions. This model, defined on temporal networks, assumes that each node in the network is initially in the susceptible state except for a single randomly selected node, which is set to be infected at a randomly selected time. Infection can be transferred with rate β from an infected to a susceptible node (i.e. $S \xrightarrow{\beta} I$) only at the time and in the direction of their temporal interactions. In case β = 1 the model is equivalent to a breadth-first-search process realising the fastest possible information spreading scenario with given initial conditions in the actual temporal network. However, if β < 1, the process would arguably be less sensitive to local fluctuations in the temporal networks, as it could take routes other than the shortest paths to reach nodes, and thus would spread more slowly on the same network. Note that due to the finite observation period of temporal interactions, in our simulation we divided a 150-minute observation period into a 30-minute and a 120-minute time window. We selected 800 random seeds from the first window and simulated the SI process for 120 minutes in each case. This way we obtained simulated spreading curves of the same length that we could easily average.
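A minimal sketch of such an SI process on a temporal network is given below. It makes several assumptions that are not stated in the text: contacts are instantaneous tuples (time, u, v) sorted by time, each contact can transmit in both directions, and a node infected at time t may already transmit through later-listed contacts carrying the same timestamp. The function name `simulate_si` is hypothetical.

```python
import random


def simulate_si(events, seed_node, start_time, beta=1.0, rng=None):
    """Run one SI realisation and return {node: infection_time}.

    events: iterable of (time, u, v) contacts sorted by time (assumed format).
    seed_node: the single initially infected node.
    start_time: time at which the seed becomes infected.
    beta: per-contact transmission probability.
    """
    rng = rng or random.Random()
    infected = {seed_node: start_time}
    for t, u, v in events:
        if t < start_time:
            continue  # contacts before the seeding time cannot transmit
        # Assumption: contacts are undirected, so try both directions.
        for src, dst in ((u, v), (v, u)):
            # rng.random() lies in [0, 1), so beta = 1 always transmits.
            if src in infected and dst not in infected and rng.random() < beta:
                infected[dst] = t
    return infected
```

With beta = 1 every usable contact transmits, so each infection time is the earliest arrival time from the seed, which corresponds to the deterministic fastest-spreading case described above.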
To depict our simulation results, in Fig. 6a we show the average spreading curves for each model for the β = 1 case, while the inset shows, for lower β values, the average time the process took to reach 90% infection on each reconstructed network. Figure 6b shows the corresponding distributions of the time to reach 90% infection in each case, again for β = 1. All these results indeed demonstrate large differences between spreading dynamics simulated on temporal networks reconstructed with the different methods, even though they all relied on the same raw observation sequences. Not surprisingly, the speed of spreading is in general largely determined by the overall number of events that the different models reconstructed, as already shown in Fig. 5a. A larger number of interactions means a larger number of possible transitions between the same set of nodes over the same period.
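The time to reach a given infected fraction can be read off the infection times of a single realisation, as in the following sketch. The function name `time_to_fraction` and the input format (a dictionary from node to infection time, including the seed) are assumptions made for illustration.

```python
import math


def time_to_fraction(infection_times, n_nodes, fraction=0.9):
    """Return the elapsed time until `fraction` of `n_nodes` is infected.

    infection_times: dict {node: infection_time}, including the seed.
    Returns None if the realisation never reaches the requested fraction.
    """
    # Number of infected nodes required, at least one (the seed itself).
    needed = max(1, math.ceil(fraction * n_nodes))
    times = sorted(infection_times.values())
    if len(times) < needed:
        return None
    # The earliest infection time belongs to the seed; measure relative to it.
    return times[needed - 1] - times[0]
```

Realisations that return None within the simulated window need to be handled explicitly when averaging, for example by reporting them separately.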
Following this logic, it is not surprising that the unreconstructed network spread the infection the fastest, while the BiLSTM models were the slowest. However, there is an important exception, which reflects our main conclusion here. The naïve method reduced the event rates by more than ∼90% as compared to the unreconstructed sequence, but when it comes to disseminating information, this seems to make no difference. This is suggested by the corresponding spreading curves in Fig. 6a, which are almost indistinguishable, and by the distributions of 90% infection time, which have almost the same average and standard deviation (see Fig. 6b). At the same time, these results seem to be consistent over a range of β values. From the inset of Fig. 6a it is evident that at small β values the spreading is strongly stochastic, fluctuations are very large, and the process takes a long time to spread. However, as we increase β the spreading becomes faster on each network. More importantly, after an initial β regime, the spreading processes evolve with relative speeds on the different structures similar to those observed in the deterministic β = 1 case. This indicates that even in stochastic settings (β < 1) the dynamical process is sensitive to the precise reconstruction of the underlying temporal network.
In conclusion, when using wireless proximity sensors to capture temporal interactions, (a) it is very important to carefully reconstruct events from the raw data and not only rely on simplistic intuitive conditions, otherwise the constructed temporal network will be biased by noise and overestimated event rates and will lead to unreliable outcomes of simulated dynamical processes; and (b) it is not enough to choose the best reconstruction method by its final accuracy, but it is crucial to choose carefully the reconstruction pipeline, which balances between good reconstruction performance and matching the purpose of the actual system under study.

Discussion
The goal of this work was manifold. First, we developed a filtering and temporal network reconstruction pipeline to obtain the best approximation of temporal social interaction sequences from proximity data recorded via wearable wireless devices. We used ground truth data recorded in various settings and explored different reconstruction strategies involving supervised methods of classification and sequence reconstruction. We found that, while all tested methods provide reasonable performance, naïve methods commonly used in the literature show the worst performance. At the same time, bi-directional LSTM methods, which take into account information from the past and future of the state being predicted, solve the reconstruction task best, with accuracy up to ∼90%.
Furthermore, we wanted to highlight the importance of precise reconstruction of temporal interactions from raw data. Over the last few years, experiments using wearable wireless devices provided an ideal way to study collective social phenomena through the precise recording of temporal social interactions of people or animals in various settings. At the same time, these datasets became inductive resources to study ongoing dynamical processes such as epidemics [13,47], opinion dynamics [48], etc. evolving on the temporal social fabric. However, without the careful reconstruction of social interactions, any study addressing the dynamics or structure of the evolving networks or any ongoing collective dynamics would risk drawing wrong or inaccurate conclusions. We demonstrated the sensitivity of this issue by simulating susceptible-infected processes on the reconstructed networks, which in turn follow significantly different scenarios depending on the actual method used for event reconstruction, even for those with comparable accuracy.
Finally, we wanted to showcase a large-scale longitudinal social experiment, which records the proxy social and verbal interactions of hundreds of pre-school children and their teachers and assistants with high temporal resolution. Our experiment is ongoing, but in the end it will provide observations about the dynamics of language development and social network of children over three years. Using these data in our upcoming research, we plan to study linguistic and social similarities at different levels of organisation of the social network: the collective level (the whole school, classes, or children groups with similar socio-cultural background considered as a community), the intermediate level (friendship groups), the dyadic level (connected pairs of children) and the individual level (each child with their specific characteristics).
First, we plan to detect the temporal relationships between social network dynamics and changes in children's language. We shall adopt two approaches to disentangle the mutual influences between socialisation and language. To evaluate social influence on language skills, we will test whether the change of social distance between individuals predicts the linguistic distance between them as well. Conversely, to assess the effect of language on social relationships (homophily), we will also investigate whether the linguistic distance between individuals predicts the social distance between them. An interesting future research direction would be to uncover the fine-grained differences between the reconstructed networks to understand which of their features are important to reconstruct and which are insignificant for dynamical processes.
Second, we aim at measuring, quantifying and modelling the processes that have long-term influence on social and linguistic development. Our three-year longitudinal follow-up design makes it possible to analyse the processes underlying the dynamics between changes in the social network and language skills. In particular, we will examine the effect of integration within a new community: does a community always have a homogenising role, by absorbing linguistic change, or, by contrast, can it accommodate the linguistic usages brought by new members and augment these by disseminating them through the community?
Like any data-driven study intending to predict or infer human behaviour, our study also has limitations. First of all, the collected data contain a certain amount of noise, which cannot be corrected by any existing method. Noise is also inevitably present in the ground truth data, which at the same time code only a finite set of configurations used for training, while rare and exceptional scenarios may remain unobserved. These limitations, together with the stochastic nature of human behaviour, mean that any reconstruction of human actions or interactions always remains perfectible. Finally, although we paid special attention to denoising, pre-filtering, model selection, and the exploration of the hyper-parameter space of each model, the optimal inference pipeline we identified is surely not universal and may be different for data from other wireless proximity sensors.
Beyond scientific merit, our results highlight the importance of the careful design of event reconstruction in studies using wireless sensors. We demonstrated this in the case of LPWD based experiments recording social interactions, but it is important more generally in any study relying on similar data collection methods. This way, we hope that our study contributes not only to the better design of coming scientific studies but also to future emerging technologies.

Appendix A: Experimental design of ground truth data collection
While designing the ground truth data collection, special attention was paid to the feasibility and reliability of our observation method, to decrease human errors during observations while obtaining a meaningful ground truth dataset. To meet all these requirements we followed the following logic and conditions:
• To record the ground truth data for GT1 and GT3 we had a researcher in the classroom who was monitoring the behaviour of children to record their interaction state, their relative orientation or position at regular intervals. To quantify interactions, distance and orientation, we chose a scan sampling strategy (i.e. observations of states at predetermined time intervals) [41]. Another option was focal sampling (i.e. continuous recording of interaction events) [41]; however, social interactions (and the children's positions and distances too) were sometimes too short and fluctuating for continuous observation, thus it was impossible to record the beginning and end of each interaction event without using video recording. In addition, for scan sampling, we ran trials to find the appropriate time interval between two observations (scans) that allowed us to record data without loss (i.e. the shortest step possible for recording the data of interest without risking missing an observation point). In the end, we chose 10 second steps for pair observations (GT1) and 2 minute steps for group observations (GT3).
• For GT1 observations, we worked during free play time to be able to observe spontaneous play interactions. To decrease the possible noise due to fleeting behaviours and random moves, observations were carried out on older children from the middle (4-5 years old) or the grand class (5-6 years old). Importantly, we focused only on one pair of kids at a time to record their state (interacting or not) and relative position every 10 seconds. This way we reduced to the minimum the possible human observation errors in this setting.
Regarding the scoring of the state of interaction/no-interaction, we set the criteria of interactions based on the literature [49] and the field expertise of the participating researchers from earlier similar experiments. Specifically, we considered two children "interacting" if they were within arm's reach (i.e. less than 1 meter from each other), either playing together (e.g. cooperatively manipulating construction blocks, kitchen toys, puzzles...) or playing alongside (e.g. making a drawing next to each other). These situations typically involved talking to each other at times.
• Regarding GT3, we selected the most appropriate and stable conditions to observe distance and orientation of as many children as possible. In practice, we performed observations only with older children (grand class) when their movements inside the classroom were limited for an extended period of time. This was possible during specific activities that involved staying still (like collectively sitting on a bench to listen to the teacher reading a book, sitting around tables in small groups to do written work...) rather than freely moving around (like during free play time).
• Measuring distances between children is very much subject to inter- (and even intra-) individual variations. To minimise such fluctuations in recording the distances in GT3, we used a customised behavioural observation app originally developed to record the positions and distances of animals in a fixed environment (Animal Observer application for iPad [42]). This app projects a scaled map of the classroom with indicated reference objects (furniture, doors, windows, etc.) and allows an observer to record the positions of individuals on this map as a function of time.
Using this temporal location dataset inter-individual distances of children have been computed afterward. Relative to the objects on the classroom map the location of children could be estimated precisely with a very small error margin (around 10 cm), something that would have been impossible to accurately assess through "naked eye" estimations and "hand" recording.
To further check the impact of potential errors due to human encoding, we conducted a randomisation experiment where we induced noise in the already collected data. More precisely, we randomly selected 5%, 10%, 15%, and 20% of the observation points from the ground truth data sequences and flipped the annotated flags from interaction to non-interaction or vice versa to add noise to our original observations. By remeasuring the accuracy of the BiLSTM-RSSI method on these randomised data, we found that the average and variance of the accuracy are rather robust against such small induced noise, with the average decreasing only slightly, as summarised in Table 3.
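The label-flipping step of this randomisation experiment can be sketched as follows. It assumes binary annotations (1 for interaction, 0 for non-interaction) and flips exactly the requested fraction of points, chosen without replacement; the function name `flip_labels` is hypothetical.

```python
import random


def flip_labels(labels, fraction, rng=None):
    """Return a copy of `labels` with a given fraction of entries flipped.

    labels: sequence of 0/1 ground truth annotations.
    fraction: share of observation points to flip, e.g. 0.05 for 5%.
    """
    rng = rng or random.Random()
    noisy = list(labels)
    n_flip = round(fraction * len(noisy))
    # Sample distinct positions so that no point is flipped twice.
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy
```

Passing a seeded `random.Random` instance makes the noisy copies reproducible across the 5%, 10%, 15% and 20% settings.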

B.1 Naïve method
The naïve method has a single parameter, gap, which determines the maximum length of non-interaction gaps between two interaction events to be filled automatically with interaction states. The value gap = 0 is a special case, as it corresponds to the non-reconstructed signal where no state has been filled. We explored gap values from 0 to 9, corresponding to non-interaction gaps of 0 to 45 seconds in incremental steps of 5 seconds. We found that in our setting the best reconstruction is reached with gap = 6, with accuracy reaching 0.834, as summarised in the confusion matrix in Table 4. Consequently, if a gap longer than 35 seconds appears in the interaction sequence of two individuals, the two participants most probably broke off their actual social interaction, thus events before and after the gap should be considered separately.
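The gap-filling rule can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementation: the input is a binary sequence with one state per time step, gap is counted in time steps, and only gaps bounded by interaction states on both sides are filled. The function name `fill_gaps` is hypothetical.

```python
def fill_gaps(states, gap):
    """Fill short non-interaction gaps between two interaction states.

    states: sequence of 0/1 values, one per time step.
    gap: maximum number of consecutive 0s to overwrite with 1s.
    """
    filled = list(states)
    last_one = None  # index of the most recent interaction state
    for i, state in enumerate(states):
        if state == 1:
            if last_one is not None:
                run = i - last_one - 1  # zeros between the two interactions
                if 0 < run <= gap:
                    for k in range(last_one + 1, i):
                        filled[k] = 1
            last_one = i
    return filled
```

With gap = 0 the function returns the input unchanged, which matches the non-reconstructed special case.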

B.2 Hidden Markov model
When parametrizing the HMM with annotated data, we used maximum likelihood estimation to compute three matrices. Take transition matrix for example, if the frequency of hidden state i at t transiting to hidden state j at t + 1 is $A_{ij}$, then the estimated transition probability $\hat{a}_{ij}$ is computed as follows: $\hat{a}_{ij} = A_{ij} / \sum_{j=0}^{N-1} A_{ij}$, where N is the number of hidden states. Same method is applied for computing emission matrix and initial matrix.
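The maximum likelihood estimate of the transition matrix amounts to counting transitions and normalising each row, as in the sketch below. It assumes hidden states are encoded as integers from 0 to N − 1; the function name `estimate_transition_matrix` and the uniform fallback for states that never appear as a source are assumptions, not part of the original method.

```python
def estimate_transition_matrix(sequences, n_states):
    """Estimate HMM transition probabilities from annotated state sequences.

    sequences: iterable of lists of integer hidden states in [0, n_states).
    Returns an n_states x n_states list of row-normalised probabilities.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in sequences:
        # Count each observed transition from the state at t to the state at t + 1.
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total:
            matrix.append([c / total for c in row])
        else:
            # Assumption: a state with no outgoing transitions gets a uniform row.
            matrix.append([1.0 / n_states] * n_states)
    return matrix
```

The emission and initial matrices can be estimated with the same counting-and-normalising pattern applied to (hidden state, observation) pairs and to the first state of each sequence.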
With embedded envelope sequences, we first pad 0s (indicating non-interaction states) at the beginning of each sequence, with a padding size of windowsize − 1. We then use the transformed envelope signals instead of binary signals to define the hidden states and observation states, and to determine all the matrices. Finally, as the output of the Viterbi algorithm we obtain a sequence of envelopes, the last item of each envelope being the predicted interaction/non-interaction state of each time step.
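The embedding of a binary sequence into past-looking envelopes can be sketched as follows. How the envelopes are then mapped to hidden and observation state indices is not specified in the text, so the sketch stops at the tuple representation; the function name `embed_envelopes` is hypothetical.

```python
def embed_envelopes(states, win):
    """Turn a binary sequence into one envelope of length `win` per time step.

    Each envelope holds the state at time t and the win - 1 states before it;
    the beginning of the sequence is padded with 0 (non-interaction).
    """
    padded = [0] * (win - 1) + list(states)
    # The envelope for step t ends at position t of the original sequence.
    return [tuple(padded[t:t + win]) for t in range(len(states))]
```

Taking the last item of each envelope recovers one state per time step, which is how a decoded envelope sequence would be read back as a predicted interaction sequence.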
The reconstruction accuracy and the confusion matrix of the HMM method are shown in Table 5 for window size win = 6.

B.3 Bi-directional LSTM methods
For each BiLSTM method we used an envelope of size winsize centred on the state which we wanted to reconstruct (as demonstrated in Fig. 3e). We pad void signals at the beginning and at the end of each sequence, with size of winsize/2 on each side. More precisely, the padded void signal for BiLSTM-RSSI is a vector (-95, -95, 0), for BiLSTM-logi it is a vector (1, 0) and for BiLSTM-bin it is a single number 0. We merged the outputs of the two LSTMs using concatenation, which provided an output of double size to the next layer. For the training, we split our labelled data into 10 clips of around 25 minutes each, then used nested cross-validation to select the best hyper-parameters and examine the performance of each reconstruction method. The confusion matrices of the three BiLSTM reconstruction tasks are shown in Table 6, Table 7 and Table 8. The accuracy of the BiLSTM-RSSI reached 0.9003, which is the best among all the tested methods.
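The symmetric windowing and padding can be sketched as below. It assumes that winsize/2 is rounded down, so that for the odd window sizes used here (25 and 27) each window contains exactly winsize states; the function name `symmetric_windows` is hypothetical, and the network itself is not reproduced.

```python
def symmetric_windows(signal, winsize, pad_value):
    """Build one window per time step, centred on the state to reconstruct.

    signal: list of per-step inputs (numbers or feature tuples).
    winsize: window length; assumed odd so the window has a unique centre.
    pad_value: void signal used before the start and after the end.
    """
    half = winsize // 2  # assumption: winsize/2 is rounded down
    padded = [pad_value] * half + list(signal) + [pad_value] * half
    # The window for step t covers `half` states before and after it.
    return [padded[t:t + 2 * half + 1] for t in range(len(signal))]
```

For the BiLSTM-RSSI input, for example, `pad_value` would be the tuple (-95, -95, 0) given above, and each window would be passed to the network as one training sample.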