Competition-driven modeling of temporal networks

We study the problem of modeling temporal networks constrained by the size of a concurrent set, a characteristic of temporal networks shown to be important in many application areas, e.g., in transportation, social, process, and other networks. We propose a competition-driven model for the generation of such constrained networks. Our method carries out turns of competitions along the timeline where each node in a network is labeled with a probability to gain outgoing edges in competitions. We present a thorough theoretical analysis to investigate the cardinality and degree distributions of the generated networks. Our experimental results demonstrate that our model simulates real-world networks well and generates networks efficiently and at scale.


Introduction
Problem. Synthetic temporal networks are widely used to understand and verify the behavior of real networks [9,22]. In recent years, various models have been proposed to generate synthetic networks satisfying certain characteristics. However, the distribution of the concurrent set size (CSS) [25] has not been considered in previous studies. CSS distribution refers to the mapping from a timestamp in a network domain to the number of edges active at this timestamp. In many real-world networks, the CSS follows one or more distributions. Consider an example shown in Fig. 1. The CSS of transport trips by free hired vehicles [3] in New York City follows a Poisson distribution in the morning and then a normal distribution for the rest of the day. Another example is the CSS of the raw dataset used in BPI challenge 2014 [2], which follows a linear-transformed power-law distribution in most of the time domain. These examples show that studying the behavior of the CSS can provide a better understanding of real-world networks. Further, modeling of the CSS can aid in the generation of realistic synthetic temporal networks. Finally, a deeper understanding of CSS can provide guidance towards efficient network processing approaches [24].
Contributions. In this paper, we directly address the generation issues in CSS-constrained temporal networks, particularly:
• We propose a new model, namely a competition-driven model (CDM), to generate CSS-constrained networks.
• We present a theoretical analysis of the CDM and show how CDM affects several important characteristics of generated networks.
• We carry out an in-depth experimental study which demonstrates that CDM can simulate real-world networks well and that its generation process is scalable.
Organization. The rest of this paper is organized as follows. Section 2 presents related work. Section 3 presents our proposed model. Section 4 gives a theoretical analysis of our model and demonstrates its characteristics. Section 5 presents experiments that evaluate the performance of our model. Section 6 concludes the paper with a summary of our findings.

Related work
The activity-driven network (ADN) model, proposed by Perra et al. [17], is the most well-studied network model in the state of the art. This model initializes each node v with a firing rate $a_v$ drawn from a given probability distribution F(x). At each timestamp t and with probability $a_v$, node v becomes active and generates m instant outgoing edges linked to other nodes chosen at random. Several studies have extended the model in both structural and temporal directions. For structural extensions, prior studies concentrate on the selection of the edge destination [6,13,14,21]. Alessandretti et al. [6] extend each node with an attractiveness value representing its probability to be selected as the destination of edges. Other works extend the model with a reinforcement mechanism, which captures the preference of nodes to connect to previously contacted nodes [13,14,21]. Another collection of works concentrates on the incorporation of community structure [14,16]. Laurent et al. [14] introduce focal closure and cyclic closure, which give rise to community structure in the network. Nadin et al. [16] assign each node to a community. In each turn, a node connects to other nodes within (or outside) the same community with probability μ (or 1 − μ). For temporal extensions, Sunny et al. [20] introduce durations for edges so that edges are lasting entities rather than instant ones.
Besides ADN, there are also other categories of temporal network generation models. The renewal process model extends the Gillespie algorithm [11] to model the network generation where each node is modeled as a Poisson process and the superposed nodes are regarded as the inter-event time distribution [8,15]. Holme [12] and Speidel et al. [18] model the generation as a labeling of static links with temporal aspects. Starnini et al. [19] and Zhang et al. [23] model the generation as a process involving agents performing a random walk in the unit square. Each agent interacts with its neighbors every time a random walk is performed. To deal with situations where information about entities is missing, Cho et al. [10] propose a self-exciting process model where the event rate between each pair of entities is modeled as a Hawkes process.
To the best of our knowledge, there is no research on the modeling and generation of CSS-constrained networks. In the next section, we give the preliminaries and the details of our model.

Model
We next present the basic definitions and state the network generation problem studied in this paper. Then, we propose our competition-driven model which serves as a solution to the stated problem.

Preliminaries
Temporal network. A temporal network is modeled as a graph G = (V, E, T), where V and E are respectively the set of nodes and the set of temporal edges in G, and T is the temporal domain of the network. Each edge e ∈ E is represented by a tuple $(u, v, t_s, t_e)$ where (1) u, v ∈ V represent a directed link from node u to v, denoted (u, v); (2) $t_s, t_e$ is a pair of timestamps s.t. $t_s \le t_e$, representing the active lifespan of link (u, v), denoted $[t_s, t_e]$.
Activity behavior. We start by presenting the definition of nodes' activity behavior. Given a node v ∈ V and time t ∈ [1, T], we say v is active at t if there exists e ∈ E such that e.u = v and $e.t_s = t$. Otherwise, we say v is inactive at t. We also define edges' activity behavior. Given an edge e ∈ E and time t ∈ [1, T], we say e is active at t (or is an active edge at t) if $t \in [e.t_s, e.t_e]$, and that e ends at time $e.t_e + 1$. Note that multiple active edges are allowed between the same pair of nodes at an arbitrary time t.
Snapshot. In order to obtain an instant status of a temporal graph, G = (V, E, T) can also be viewed as a sequence of static graphs G = {G(1), G(2), . . . , G(T)}, where G(1), G(2), . . . , G(T) respectively represent the instant status of G at t = 1, 2, . . . , T. We call G(t) a snapshot at time t. Formally, given a temporal network G = (V, E, T) and a timestamp t ∈ [1, T], the corresponding snapshot is represented as
$$G(t) = (V, E(t)), \qquad E(t) = \{\, e \in E \mid e.t_s \le t \le e.t_e \,\}.$$
Besides the values for C(t) and V, we consider additional characteristics of generated networks, as the structural and temporal characteristics of real networks are heterogeneous. For example, the inter-event time (IET) in some real networks follows a power-law distribution with a notable heavy tail [15,18,21]. This makes the IET distribution an important and necessary parameter in network generation. Similar results can also be found for the real degree of nodes [16]. We summarize the important characteristics of our generated networks in Table 1.
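As an illustration, the notions of edge activity, snapshot, and CSS can be sketched in Python. This is a minimal sketch, assuming edges are stored as tuples with inclusive lifespans; the names TemporalEdge, is_active, snapshot_edges, and css are hypothetical helpers introduced here for illustration only.

```python
from typing import List, NamedTuple


class TemporalEdge(NamedTuple):
    u: int    # source node
    v: int    # destination node
    t_s: int  # first timestamp at which the edge is active
    t_e: int  # last timestamp at which the edge is active (t_s <= t_e)


def is_active(e: TemporalEdge, t: int) -> bool:
    """An edge is active at t when t lies in its lifespan [t_s, t_e]."""
    return e.t_s <= t <= e.t_e


def snapshot_edges(edges: List[TemporalEdge], t: int) -> List[TemporalEdge]:
    """Edges of the snapshot at time t: all edges active at t."""
    return [e for e in edges if is_active(e, t)]


def css(edges: List[TemporalEdge], T: int) -> List[int]:
    """Concurrent set size for t = 1..T (index 0 of the result is t = 1)."""
    return [len(snapshot_edges(edges, t)) for t in range(1, T + 1)]


edges = [TemporalEdge(0, 1, 1, 3), TemporalEdge(1, 2, 2, 2), TemporalEdge(0, 2, 3, 4)]
profile = css(edges, 4)
print(profile)  # [1, 2, 2, 1]
```

The quadratic scan is chosen for readability; a sweep over sorted start and end times would be preferable for large networks.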
Relative degree. For convenience of comparison and analysis, the degree of a node needs to be stable in its distribution across networks of different sizes. For this purpose, we define the relative degree of a node as follows. Given a temporal graph G = (V, E, T) and ∀v ∈ V, we call
$$A(v) = \frac{\delta(v)}{|E|}$$
the relative degree of v, where δ(v) is the number of edges outgoing from v. As defined, A(v) denotes the proportion of edges starting from a given node v.
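The relative degree can be computed directly from an edge list. The following is a minimal sketch, assuming edges are plain (u, v, t_s, t_e) tuples; the function name relative_degree is hypothetical, and nodes without outgoing edges are simply omitted from the result (their relative degree is 0).

```python
from collections import Counter


def relative_degree(edges):
    """Map each source node to (number of outgoing edges) / |E|."""
    total = len(edges)
    if total == 0:
        return {}
    out_degree = Counter(u for (u, _v, _t_s, _t_e) in edges)
    return {node: d / total for node, d in out_degree.items()}


edges = [(0, 1, 1, 3), (0, 2, 2, 2), (1, 2, 3, 4), (0, 1, 4, 4)]
A = relative_degree(edges)
print(A)  # {0: 0.75, 1: 0.25}
```

Because every edge has exactly one source, the values sum to 1 regardless of network size, which is what makes the measure comparable across networks.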
Inter-event time (IET) distribution. This distribution captures the activity behavior of nodes. Given a temporal graph G = (V, E, T) and ∀v ∈ V, we collect the distinct start times of edges outgoing from v in ascending order; the gap τ between two consecutive start times is called an inter-event time (IET). We assume that τ follows a probability distribution $I(f, \tau_{\min}, \tau_{\max})$, where f is a parameter distribution function, and $\tau_{\min}$ and $\tau_{\max}$ are the minimal and maximal IETs respectively.
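For a concrete reading of this definition, the IETs of a node can be extracted as the gaps between its consecutive distinct activation times. The sketch below assumes edges are (u, v, t_s, t_e) tuples; the function name inter_event_times is hypothetical.

```python
def inter_event_times(edges, node):
    """Gaps between consecutive distinct start times of edges leaving `node`."""
    starts = sorted({t_s for (u, _v, t_s, _t_e) in edges if u == node})
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


edges = [(0, 1, 1, 2), (0, 2, 1, 1), (0, 3, 4, 6), (1, 0, 2, 2), (0, 1, 9, 9)]
iets = inter_event_times(edges, 0)
print(iets)  # [3, 5]
```

Node 0 starts edges at timestamps 1, 1, 4, and 9; the duplicate start at 1 counts once, which gives the two IETs 3 and 5.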

Competition-driven model
In this work, we propose the competition-driven model (CDM) to generate CSS-constrained networks accurately and efficiently. Table 2 presents the input parameters used in our model. In CDM, each node v is associated with a power value and a next active time nat(v). The former determines v's strength in edge generation while the latter determines the next time at which v becomes active. That is, a node with a higher power value has a higher chance to become the source of newly generated edges at time nat(v). Power values are drawn from a parameter probability distribution f, the power value distribution, and values for nat(v) are drawn from the IET distribution I.
Network generation. Using the above model, the procedure of network generation is shown in Algorithm 1. A dedicated active-list structure named Active is maintained to store the set of currently active edges. The basic idea of network generation is to traverse the CSS distribution C(t) in time and adjust the size of Active (denoted |Active|) according to C(t) at each timestamp t. To this end, we define the following two basic operations to maintain Active.
• InsActive(Active, e): insert the edge e into Active. Return 1 for success and 0 for failure.
• DelActive: delete from Active the edges that end at the current timestamp.

Algorithm 1: The network generation using CDM
    ...
    while n > 0 do
        Draw a participant p from $S_t$ as source.
        if it is the first time for p to be drawn in this turn then
            Draw an IET τ from I
    ...

Algorithm 2: PruneActive
Input: active list Active, timestamp t, number of pruned edges n
Output: the set of pruned edges D

For ease of maintenance, edges in Active are sorted by their end-time in ascending order. In this way, the complexity of InsActive and DelActive is logarithmic in |Active|. With this structure, the operations to be carried out at any given time t can be determined. When C(t) is smaller than |Active|, some of the existing edges must be forcibly deactivated and removed from Active in order to satisfy the |Active| = C(t) constraint. We define one additional operation on Active to deal with this situation:
• PruneActive(Active, t, k): select k edges, set their end-time to t, and delete them from Active. Return the collection of deleted edges.
The procedure of PruneActive is shown in Algorithm 2. Here, we apply the end-time-first pruning strategy. That is, we select the k edges with minimal end-time from Active, reduce their end-time to t, and delete them from Active. Various pruning strategies can be used in PruneActive. We choose end-time-first pruning for the following reasons. First, this strategy provides the best efficiency because edges in Active are sorted by their end-time. Second, end-time-first pruning also helps to preserve the duration distribution in the generated network, which is a desirable network characteristic.
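The three operations on Active can be sketched with a binary min-heap keyed by end-time. This is only one possible realization: the heap is swapped in here for the sorted list described above, the class and method names are hypothetical, and "set their end-time to t" is read as "the edge ends at t", so that the last active timestamp of a pruned edge becomes t − 1.

```python
import heapq
import itertools


class ActiveList:
    """Min-heap of active edges (u, v, t_s, t_e), keyed by inclusive end time t_e."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # tie-breaker so edge tuples are never compared

    def __len__(self):
        return len(self._heap)

    def ins_active(self, edge):
        """InsActive: O(log |Active|) insertion; returns 1 for success."""
        heapq.heappush(self._heap, (edge[3], next(self._tie), edge))
        return 1

    def del_active(self, t):
        """DelActive: remove and return edges whose lifespan is over at t (t_e < t)."""
        ended = []
        while self._heap and self._heap[0][0] < t:
            ended.append(heapq.heappop(self._heap)[2])
        return ended

    def prune_active(self, t, k):
        """PruneActive, end-time-first: cut the k earliest-ending edges at t."""
        pruned = []
        for _ in range(min(k, len(self._heap))):
            u, v, t_s, _old_t_e = heapq.heappop(self._heap)[2]
            pruned.append((u, v, t_s, t - 1))  # assumption: last active timestamp is t - 1
        return pruned


active = ActiveList()
for e in [(0, 1, 1, 5), (1, 2, 1, 9), (2, 0, 2, 3)]:
    active.ins_active(e)
ended = active.del_active(4)        # (2, 0, 2, 3) is already over at t = 4
pruned = active.prune_active(4, 1)  # (0, 1, 1, 5) is shortened to (0, 1, 1, 3)
```

With a heap, pruning the edge with minimal end-time is a single pop, which mirrors the efficiency argument given for end-time-first pruning.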
When C(t) is larger than |Active|, a total of n = C(t) − |Active| edges should be generated and inserted into Active. The algorithm first collects the set of nodes $\{p^t_1, p^t_2, \ldots, p^t_k\}$ with nat(v) ≤ t. We call the nodes in this collection the participants at the current time t. Next, the algorithm constructs a probability distribution $S_t(p)$ by normalizing the power values of the participants $p^t_i$ for each i ∈ [1, k]. We call $S_t(p)$ the competition distribution; it gives the probability for each participant to "win" each turn of the coming competition at time t. With the constructed $S_t(p)$, the algorithm carries out an n-turn competition to generate new edges. In each turn, a participant p is first selected as the source of the link according to $S_t(p)$. Next, a duration d is drawn from the duration distribution D, and another node v is selected uniformly from the remaining nodes as the destination. In this way, a new temporal edge (p, v, t, t + d − 1) is created and inserted into Active. If it is the first time for p to win in this competition, the algorithm updates nat(p) to t + τ, where τ is drawn from I, to determine its next active time. Turns are repeated until n turns have been carried out, which means that all n new edges have been created in this competition. Note that if p does not win any turn in the competition, nat(p) is not updated and p remains a participant in the competition at the next timestamp. In this way, p's IET is prolonged until it wins at least one turn in a competition.
The competitions are carried out iteratively until C(t) is completely traversed in time. The complexity of the generation algorithm is O(T · |V| + |E| · log |E|), where |E| is the total number of generated edges. Note that |E| is not an input parameter to the CDM and its exact value can only be known when the network is completely generated. Similarly, for each node v ∈ V, the relative degree A(v) is known only after the generation since it depends on |E|. In the following section, we present the theoretical analysis of how the values of |E| and A(v) in the produced networks are influenced by the generation algorithm.
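The overall generation loop can be sketched as follows. This is a minimal sketch under several assumptions that are not fixed by the description above: every node may compete from t = 1, the destination is drawn uniformly among all nodes other than the source, pruned edges keep t − 1 as their last active timestamp, a heap stands in for the sorted active list, and no edges are generated at a timestamp without participants. The function name generate_cdm and its parameters are hypothetical.

```python
import heapq
import random


def generate_cdm(C, num_nodes, power, draw_iet, draw_duration, seed=0):
    """Return edges (u, v, t_s, t_e) whose concurrent set size at t = 1..len(C)
    follows C, where C[t - 1] is the target size at timestamp t."""
    rng = random.Random(seed)
    nat = [1] * num_nodes  # assumption: every node may compete from t = 1
    active = []            # min-heap of (t_e, edge_id)
    edges = []             # edge_id -> [u, v, t_s, t_e]
    for t, target in enumerate(C, start=1):
        # DelActive: edges whose lifespan ended before t leave the active list.
        while active and active[0][0] < t:
            heapq.heappop(active)
        # PruneActive (end-time-first): cut the earliest-ending surplus edges.
        while len(active) > target:
            _, edge_id = heapq.heappop(active)
            edges[edge_id][3] = t - 1
        # Competition: generate the missing n = C(t) - |Active| edges.
        n = target - len(active)
        participants = [v for v in range(num_nodes) if nat[v] <= t]
        if n <= 0 or not participants:
            continue
        weights = [power[p] for p in participants]  # normalized by random.choices
        winners = set()
        for p in rng.choices(participants, weights=weights, k=n):
            dest = rng.randrange(num_nodes - 1)
            if dest >= p:
                dest += 1  # skip the source itself
            t_e = t + draw_duration(rng) - 1
            edges.append([p, dest, t, t_e])
            heapq.heappush(active, (t_e, len(edges) - 1))
            if p not in winners:  # first win in this competition
                winners.add(p)
                nat[p] = t + draw_iet(rng)
    return [tuple(e) for e in edges]


C = [1, 1, 3, 3, 4, 3, 0, 2]
out = generate_cdm(C, 5, [5, 4, 3, 2, 1],
                   draw_iet=lambda r: 1,
                   draw_duration=lambda r: r.randint(1, 4),
                   seed=7)
```

In this usage example the IET is fixed to 1 so that participants exist at every timestamp, and the CSS of the generated edges then matches C exactly. Edges generated near the end of the timeline may extend beyond the last timestamp.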

Analysis
Two natural questions about the CDM are: (1) As the number of edges |E| is not an input parameter to the algorithm, what is the expected cardinality of the generated network? (2) Similarly, what does the relative degree A(v) look like? Answers to these questions respectively help to evaluate the necessary storage cost for generation and to investigate the structural characteristics of the generated network. In this section, we provide an analysis to answer these two questions. For ease of analysis, we assume that the activity behavior of both v ∈ V and e ∈ E follows a Poisson process and that I and D follow exponential distributions with parameters $\lambda_1$ and $\lambda_2$, respectively. In addition, we assume that each participant in a competition wins at least one turn, so that IETs are not prolonged and follow I strictly.
Cardinality. For t ∈ [1, T], let O(t) denote the number of edges that should be generated at timestamp t. The relation between the network cardinality |E| and O(t) can be written as follows:
$$|E| = \sum_{t=1}^{T} O(t). \tag{1}$$
Let R(t) denote the number of remaining edges at time t after DelActive is invoked. The equation to describe O(t) is as follows:
$$O(t) = \begin{cases} C(t) - R(t), & C(t) > R(t), \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$
That is, given timestamp t, O(t) contributes to the cardinality only when C(t) > R(t). For example, Fig. 2 presents a collection of edges generated using the CDM with C(t) = {0 : 1, . . . , 3 : 1, . . . , 5 : 3, 6 : 3, 7 : 4, 8 : 3}. Each edge is represented by its interval. Suppose the edge generation at t = 7 is completed and we are about to generate the edges at t = 8. Active at t = 7 contains the edges $e_1, e_2, e_3, e_4$. Then DelActive deletes $e_1, e_2$ from Active since they both end at t = 8. Thus only two edges $e_3, e_4$ survive in Active after the edge deletion, hence R(8) = 2. Since C(8) = 3, Equation (2) gives O(8) = C(8) − R(8) = 1, which means that a single edge needs to be generated at t = 8.
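The bookkeeping in this example can be checked with a few lines of Python. The sketch below assumes that no edges are generated when the surviving edges already reach C(t); the function names edges_to_generate and cardinality are hypothetical.

```python
def edges_to_generate(c_t: int, r_t: int) -> int:
    """O(t): edges created at t; zero when the survivors already reach C(t)."""
    return max(c_t - r_t, 0)


def cardinality(C, R):
    """|E| as the sum of O(t) over aligned sequences C(t) and R(t)."""
    return sum(edges_to_generate(c, r) for c, r in zip(C, R))


o_8 = edges_to_generate(3, 2)  # the example above: C(8) = 3, R(8) = 2
print(o_8)  # 1
```

For instance, cardinality([1, 3, 3], [0, 1, 3]) evaluates to 1 + 2 + 0 = 3, since the last timestamp needs no new edges.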
As cardinality analysis is generally used to estimate network storage and construction time, we use a worst-case method to estimate the output of Equation (1). In this worst case, we assume that R(t) is always smaller than C(t). Then, the worst-case equation for the maximum number of generated edges $|E|_m$ is as follows:
$$|E|_m = \sum_{t=1}^{T} \bigl( C(t) - R(t) \bigr). \tag{3}$$
From Algorithm 1, we know that R(t) consists of the edges active at both t − 1 and t. Then, R(t) can be computed as follows:
$$R(t) = C(t-1) \cdot \bigl( 1 - P_d(t) \bigr), \tag{4}$$
where $P_d(t)$ represents the probability for each e ∈ E(t − 1) to end at time t. In our example, we can estimate that $P_d(8)$ is approximately 0.5 since C(7) = 4 and R(8) = 2.
Here, we use the theory of stochastic processes to make a further deduction on $P_d(t)$. For a Poisson process with rate λ, the probability that an event happens k times in the duration [t, t + τ] is:
$$P\{N(t+\tau) - N(t) = k\} = \frac{e^{-\lambda\tau}(\lambda\tau)^k}{k!}, \tag{5}$$
where N(t) counts the occurrences of the event up to time t. Note that $P_d(t)$ can also be described as the probability that an edge active at t − 1 ends at time t. Based on our assumed Poisson process for e and exponential distribution for D, $P_d(t)$ can be transformed into the following:
By substituting Equations (6) and (4) into (3), we obtain the following equation describing the expected cardinality of the synthetic network.
With this equation, the complexity of CDM becomes more intuitive. Also, maximal memory cost in network generation can be evaluated.
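To make the worst-case estimate concrete, the sum of C(t) − R(t) with R(t) = C(t − 1) · (1 − P_d(t)) can be evaluated numerically once a value for the ending probability is fixed. The sketch below assumes a constant ending probability p_d for all timestamps and no surviving edges before the first timestamp; both are simplifications introduced for illustration. Under the Poisson assumption, one candidate value is λ_2 · exp(−λ_2), by analogy with the probability p = λ_1 · exp(−λ_1) used for node activity, but this choice is an assumption of the sketch rather than a formula taken from the text. The function name worst_case_cardinality is hypothetical.

```python
import math


def worst_case_cardinality(C, p_d):
    """Sum of C(t) - R(t) with R(t) = C(t - 1) * (1 - p_d) and R = 0 at the first t."""
    total = float(C[0]) if C else 0.0
    for previous, current in zip(C, C[1:]):
        total += current - previous * (1.0 - p_d)
    return total


# Two timestamps from the running example: C(7) = 4, C(8) = 3, P_d(8) = 0.5.
bound = worst_case_cardinality([4, 3], 0.5)
print(bound)  # 5.0

# Assumed candidate for p_d under an exponential duration with rate lambda_2.
lambda_2 = 0.7
candidate = worst_case_cardinality([4, 3], lambda_2 * math.exp(-lambda_2))
```

The estimate is only meaningful in the worst-case regime where R(t) stays below C(t); otherwise individual terms of the sum can become negative.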
Relative degree. Next, we give the derivation of the relative degree A(v). The equation to describe A(v) is as follows:
$$A(v) = \frac{\sum_{t=1}^{T} o(v, t)}{|E|}, \tag{8}$$
where o(v, t) is the number of generated outgoing edges starting from v at time t. The value of o(v, t) relies on whether v is active at t. Based on our assumption and letting $P_a(t)$ be the probability for v ∈ V to be active at t, the equation is as follows:
In this way, the equation to describe o(v, t) is as follows:
According to Algorithm 1, $S_t(v)$ could be computed as follows:
By substituting Equation (11) into (10), we could obtain:
with $p = \lambda_1 e^{-\lambda_1}$.
The combination of Equations (8) and (12) gives Equation (13). Aligning Equation (13) with (12), we obtain Equation (14), which illustrates the factors impacting A(v). Equation (14) reveals that the relative degree of node v is influenced by the following factors:
• |B(v)|, the number of timestamps when v is active (i.e., the number of competitions in which v participated). A larger |B(v)| provides more opportunities for v to earn outgoing edges.
• The number of participants in the competition at time t. A larger number of participants tends to weaken $S_t(v)$, which in turn leads to fewer outgoing edges from v.
• The power value of v. A larger power value tends to enhance $S_t(v)$, which in turn leads to more outgoing edges from v.
• O(t), the number of generated edges at time t. A larger O(t) leads to more outgoing edges from v when $S_t(v)$ is fixed.
In order to uncover more underlying factors on A(v), we introduce the mean-field method to simplify the variables in the model and regard the inferred result as the benchmark. Let A(v) be the mean static degree of node v. The mean-field equation uses three quantities: B, the mean number of competitions in which v participated; the mean number of participants at time t; and O, the mean number of edges that should be generated at time t.
Corresponding equations describe these mean-field parameters; in particular, the mean number of participants at time t is $\lambda_1 e^{-\lambda_1} |V|$. By substituting Equations (16), (17), and (18) into (15), the mean degree equation can be transformed into Equation (19). Equation (19) reveals two characteristics of the relative degree in our model. First, as the number of nodes |V| increases, the relative degree of each node drops because |V| determines the number of terms accumulated in the denominator. Second, given the number of nodes |V|, more involved participants bring the distribution of A(v) much closer to that of the power values, because the corresponding factor approaches 1. That is, in the most ideal situation the relative degree A(v) exactly reflects the power value of v. A larger node set brings A(v) closer to the power value of v.

Experimental evaluation
In this section, we present our experimental investigation of the CDM. We aim to answer the following questions. First, can CDM simulate real networks with both structural and temporal characteristics preserved? Second, to what extent do various graph configuration parameters influence the synthetic networks generated by the CDM?
The real networks used are listed in Table 3 and their CSS distributions are shown in Fig. 3.
Experiments. We run two categories of experiments to investigate the performance of CDM. The first category of experiments deals with the quality of network simulation by the CDM. We measure the relative degree, IET, duration, and CSS distributions of real networks to obtain a graph configuration to be used in network generation. Specifically, we obtain graph configurations in two ways. The first method is called the frequency configuration, in which statistical estimations of real measures are directly used as the input schema. The second method is called the fitted configuration, in which we use the fitted result for the I and D distributions instead of the direct estimates. For fitting, we use the power-law cut-off model $y = k \cdot \tau^{\alpha} \cdot e^{-\tau/\tau_c} + h$. The values used in the frequency and fitted configurations for each network are shown in Fig. 4. The parameter k is the CSS coefficient, i.e., the factor by which the basic CSS value is scaled.
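The power-law cut-off model used for fitting can be written as a small function. This is a minimal sketch, assuming the exponent of the cut-off term is −τ/τ_c; the function name power_law_cutoff and the sample parameter values are hypothetical and are not fitted values from the experiments.

```python
import math


def power_law_cutoff(tau, k, alpha, tau_c, h):
    """Evaluate y = k * tau**alpha * exp(-tau / tau_c) + h for tau > 0."""
    return k * tau ** alpha * math.exp(-tau / tau_c) + h


# Illustrative parameters only: a decaying curve with a cut-off scale of 10.
values = [power_law_cutoff(tau, k=2.0, alpha=-1.5, tau_c=10.0, h=0.0)
          for tau in (1.0, 10.0, 100.0)]
```

With a negative exponent alpha the curve decays as a power law for τ well below τ_c and exponentially beyond it, which is the shape used to capture heavy-tailed IET and duration distributions.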
The second category of experiments investigates the scalability of the CDM. We use instances with two types of CSS: (1) the linear C(t) ∼ t and (2) the Gaussian $C(t) \sim N(702, 180.0^2)$. The former case investigates the CDM performance under a monotonically increasing CSS and the latter case investigates the CDM performance under a non-monotonic CSS. The default setup for the remainder of the configuration is shown in Table 4. These configuration parameters are either widely used in existing benchmarks for network modeling and generation [7] or expected to impact the underlying structures in networks. To investigate this impact in networks generated by CDM, we vary the configuration parameters as follows. (1) We set |V| in [500, 750, 1000, 1250, 1500] to investigate the impact of node cardinality. By varying |V|, we obtain networks involving more or fewer entities. (2) We set the CSS coefficient k in [1, 5, 50, 500, 5000] × $10^7$ to investigate the impact of the CSS coefficient. By varying k, we obtain networks with a higher or lower CSS value at each timestamp. (3) We set |E| in [20, 40, 60, 80, 100] million to investigate the impact of edge cardinality. By varying |E|, we generate smaller or larger networks. (4) We set τ in [1, 10, 100, 1000, 10,000] to investigate the IET impact. By varying τ, the intensity of the IET heavy tail (i.e., the existence of long IETs) in the generated graph can be controlled. (5) We set d in [1, 10, 100, 1000, 10,000] to investigate the duration impact. By varying d, the existence of lasting edges can be controlled.
We use six measures to evaluate the result: the distribution of (1) network generation time, (2) relative degree, (3) closeness, (4) IET, (5) duration, and (6) stability [22]. Given a node v ∈ V, the closeness measures how close v is to the other nodes in the network. It is computed as the inverse of the average distance from v to the other nodes in the network:
$$\mathrm{closeness}(v) = \frac{|V| - 1}{\sum_{u \in V \setminus \{v\}} \mathrm{dist}(v, u)},$$
where dist(v, u) represents the minimal distance (i.e., number of hops) from node v to u.
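The closeness measure can be computed with a breadth-first search over a static view of the network. The sketch below assumes an adjacency-list input and averages only over nodes reachable from v, returning 0 when nothing is reachable; how unreachable nodes are treated is not specified above, so this handling is an assumption. The function name closeness is hypothetical.

```python
from collections import deque


def closeness(adj, v):
    """Inverse of the average hop distance from v to the nodes reachable from v.

    adj maps each node to an iterable of its out-neighbours.
    """
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    total = sum(dist.values())
    reachable = len(dist) - 1
    return reachable / total if total else 0.0


adj = {0: [1], 1: [2], 2: []}  # directed path 0 -> 1 -> 2
c0 = closeness(adj, 0)         # distances 1 and 2, so 2 / 3
```

On the directed path, node 1 reaches only node 2 at distance 1 and therefore has closeness 1, while node 2 reaches no other node and has closeness 0.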
The stability is a summary of v's evolving degree structure in time, which is measured based on the notion of degree rank. Given the set of snapshots {G(1) . . . G(T)}, the degree rank of v in snapshot G(t) is computed as follows.
where $\delta_t(v)$ is the number of outgoing edges from v in G(t). With these notions, given the set of snapshots {G(1) . . . G(T)}, the stability of v is computed as follows.
where $\sigma_v$ is the standard deviation of v's degree rank over all snapshots. As a trade-off between measurement accuracy and efficiency, for each network in this experiment, we compute the stability based on 1000 uniformly selected snapshots.

Results and analysis
Quality of network simulation. Table 5 reports the generation time for the two types of simulations (frequency and fitted configurations) for different networks. We note that even the largest network, with around 800K edges, could be constructed in less than 15 seconds. This indicates that CDM is highly efficient in network generation. Figures 5, 6, 7, 8, 9 report the measures in the real, frequency-simulation, and fitting-simulation networks. We note that in each subplot, the trends of the different curves are similar and the differences are minimal. This indicates that CDM could simulate real networks well.

Figure 5
Relative degree of simulation result

Scalability of network generation. Next, we investigate the performance of CDM in different categories of networks. Table 6 reports the construction time of various networks in CDM, and the following results can be drawn from it. First, by varying k, the construction time in the monotonic networks increases steadily. This is expected because a higher k leads to a larger |Active| in each competition, so that the insertion of a newly generated edge becomes more costly. In the non-monotonic networks, however, the construction time decreases sharply at the very beginning and then increases steadily. This is because a low k in the non-monotonic networks significantly increases the number of PruneActive invocations, which do not contribute to the generation of new edges. Second, by varying |V|, the construction time keeps increasing. This is expected because a higher |V| generally leads to more participants in a competition, which makes each turn more costly. Third, by varying the desired |E|, the construction time steadily increases. In addition, with a higher d, the construction time increases because more lasting edges increase the maintenance cost of Active. Overall, the results demonstrate that CDM can generate both small and large networks efficiently.
Next, we concentrate on the remaining structural and temporal measures. Figure 10 reports the degree distribution in various networks. Several observations can be made here. First, we note that a higher |V| pulls down the degree proportion of each node, as expected from Equation (19). Second, a higher k lifts the front of the distribution curve while the tail remains stable. This is because a higher coefficient value provides more opportunities for higher-power nodes to obtain outgoing edges in each competition, so that the high-power nodes can fully exploit their advantage in the competitions they take part in. Third, a higher τ pulls down the front and lifts the rest of the curve. This is because the appearance of longer inactive periods allows lower-power nodes to participate in more competitions without competing with higher-power nodes. Finally, a higher d pulls down the front part because lasting edges limit the number of edges to be generated at each timestamp. This restricts the degree advantage of high-power nodes.
Figure 11 reports the IET distribution in various networks. The dashed line illustrates the "ends" of the lines, which cannot otherwise be seen because of the significant overlap between the lines that correspond to different studied parameters. We start by drawing three general conclusions about the IET results. First, the configured I is well modeled in networks generated by the CDM, since we can observe the heavy tails of the power-law distribution. Second, networks with a higher proportion of small IETs tend to have a smaller maximal IET. For convenience, we call this the IET aggregation nature of the generated networks. Third, the maximal IET in a generated network can be larger than the configured τ. This is because nodes with lower power values may never win in a competition, so their IET is continuously prolonged.

Figure 11
The IET distribution in various networks. We note that (1) higher k tends to enhance the IETs' aggregation on small values; and (2) higher τ, d tend to weaken the aggregation

The rest of the results observed in Fig. 11 include the following. First, the IET aggregation of the non-monotonic networks tends to be weaker than that of the monotonic networks. This is because the generation of non-monotonic networks involves invocations of PruneActive, which introduce periods with no competitions and extend nodes' IETs. Second, a higher k enhances the IET aggregation because it provides more opportunities for lower-power nodes to win in competitions. Third, a higher τ weakens the IET aggregation, as expected. Finally, a higher d also weakens the IET aggregation because lasting edges reduce the opportunities for lower-power nodes to win in competitions. Figure 12 reports the duration distribution in various networks. We first note that the configured D is also modeled well, as shown by the observed heavy tails. Second, a higher d pulls down the durations' aggregation on small values, for a reason similar to that for the IET. Besides, the duration proportion in networks generated by the CDM tends to be much more stable since it is hardly impacted by other factors (in comparison to the IET).
Finally, we present the results and analysis of node stability in various networks. A general observation from the resulting statistics is that low-power nodes are generally more stable than higher-power nodes. This is expected because the temporal degree of a low-power node is generally small or even negligible compared to the global maximal degree. More specifically, for a node v ∈ V, the global maximal temporal degree may vary during v's inactive period. Since a small power value generally leads to a small degree, lower-power nodes tend to be less sensitive to the variation of the global maximal degree. Conversely, a higher power value generally leads to a non-negligible degree, which makes high-power nodes much more sensitive to the variation. Figure 13 reports the node stability in various networks, and several results can be drawn from it. First, a higher |V| lifts the stability curve in both monotonic and non-monotonic networks. This is because a higher |V| leads to a lower degree distribution, so that a batch of higher-power nodes become less sensitive and more stable. Second, a higher k lifts the stability curve in both categories because a higher CSS coefficient leads to an increase of the maximal degree and a higher proportion of stable nodes. Third, a higher τ pulls down the curve in both categories because nodes' longer inactive periods can intensify the variation of the maximal degree. Fourth, a higher d pulls down the curve in both categories because lasting edges can increase nodes' temporal degree and intensify the variation of the maximal degree. Finally, we note that nodes in the non-monotonic networks are less stable than those in the monotonic networks when the rest of the configuration is the same. This demonstrates that PruneActive, which in this paper is designed mainly for generation efficiency and preservation of the duration distribution, weakens the stability of nodes in generated networks. In future work, we will therefore consider alternative pruning methods for PruneActive and investigate their influence on stability.

Conclusion
We proposed the CDM to solve the generation problem of CSS-constrained temporal networks. CDM is designed to take the CSS distribution as an input, which is a vital characteristic of many real networks, as guidance to generate synthetic temporal networks constrained by the CSS. We present theoretical analysis which shows that the cardinality and nodes' relative degrees of a network generated by the CDM could be predicted. Our experimental results also demonstrate that CDM can simulate real networks well and generate networks efficiently.

Figure 12
The duration distribution in various networks. We note that higher d tends to pull down the durations' aggregation on small values

Figure 13
The node stability in various networks. We note that (1) higher |V|, k tend to lift the curve; and (2) higher τ, d tend to pull down the curve

In future work, we plan to study various Active-pruning and destination-selecting strategies and their impact on the networks generated by the CDM.