Fast filtering and animation of large dynamic networks

Detecting and visualizing the most relevant changes in an evolving network is an open challenge in several domains. We present a fast algorithm that filters subsets of the strongest nodes and edges representing an evolving weighted graph and visualizes them either by creating a movie or by streaming them to an interactive network visualization tool. The algorithm is an approximation of an exponential sliding time-window that scales linearly with the number of interactions. We compare the algorithm against rectangular and exponential sliding time-window methods. Our network filtering algorithm: (i) captures persistent trends in the structure of dynamic weighted networks, (ii) smooths transitions between the snapshots of dynamic networks, and (iii) uses limited memory and processor time. The algorithm is publicly available as open-source software.


INTRODUCTION
Network visualization is widely adopted to make sense of, and gain insight from, complex and large interaction data. These visualizations are typically static and incapable of dealing with quickly changing networks. Dynamic graphs, where nodes and edges churn and change over time, can be an effective means of visualizing evolving networked systems such as social media, similarity graphs, or interaction networks between real-world entities. The recent availability of live data streams from online social media motivated the development of interfaces to process and visualize evolving graphs. Dynamic visualization is supported by several tools [18,13,5,16]. In particular, Gephi [5] supports graph streaming with a dedicated API based on JSON events and enables the association of timestamps with each graph component.
While there is some literature on dynamic layout of graphs [8,9,10], little work has been done so far on information selection techniques for dynamic visualization of large and quickly changing networks. Yet, for large networks in which the rate of structural change can be very high, determining the nodes and edges that represent and convey the salient structural properties of the network at a given time is crucial to producing meaningful visualizations of the graph evolution.
In this paper, we contribute to filling this gap by presenting a new graph animation tool that:
• processes a chronological sequence of interactions between the graph nodes;
• dynamically selects the most relevant parts of the network to visualize, based on a scoring function that weights nodes and edges, removing portions of the network that are no longer relevant and emphasizing old nodes and links that show fresh activity;
• produces a file representing the network evolution or, alternatively, connects to the Gephi graph visualization tool interface for live visualization of the evolving graph; and
• is fast enough to be applied to large live data streams and to visualize them in the form of a network.
In Appendix A we present the source code of the visualization tools with the related documentation, four dynamic graph datasets, and instructions to recreate the respective visualizations introduced later in this manuscript.

RELATED WORK
Graph drawing [30,44] is a branch of information visualization that has acquired great importance in complex systems analysis. The rapid development of computer-aided visualization tools and the refinement of graph layout algorithms [31,23,24,29] have allowed increasingly higher-quality visualizations of large graphs [28]. As a result, many open tools for static graph analysis and visualization have been developed in the last decade. Among the best known are Walrus [36], Pajek [6,14], Visone [11], GUESS [1], Networkbench, and NodeXL [42].
Visualization of dynamic graphs has received considerable attention recently. Although static visualizations based on sliding time windows [2], alluvial diagrams [41], or matrices [45,43,25] have been explored as solutions to capture the graph evolution, dynamic graph drawing remains the technique that has attracted the most interest in the research community so far.
Much work has been done to define efficient update operations on graphs [39,27] and to define criteria and guidelines for a good visualization of graph evolution with animations [21], with the goal of preserving the mental map [35,20] that the observer has of the graph structure. Methods that preserve the stability of nodes and the consistency of the network structure by leveraging a hierarchical organization of nodes have been proposed [37,38,3]. More generally, some work has been done to adapt spectral and force-directed graph layouts [7] to incremental layouts that recompute the positions of nodes at time t based on the previous positions at time t-1 while minimizing the displacement of vertices [12,15,8,22], or to propose new "stress-minimization" strategies to map the changes in the graph [10].
Recently, the interest in depicting the shape of online social networks [26,19] and the availability of live data streams from online social media motivated the development of interfaces to process and visualize the evolution of graphs in time. Dynamic visualization is supported by tools like GraphAEL [18], GleamViz [13], Gephi [5], and GraphStream [16].
Exploratory work on information selection techniques for dynamic graph visualization has been done in the past, including solutions based on temporal decay of nodes and edges [17], node clustering [40], and centrality indices [32,4]. However, none of the modern visualization tools provides features for detecting the most relevant components of a graph at a given time.

ALGORITHM
First, we introduce an algorithm that takes as input a chronological stream of interactions between nodes (i.e., network edges) and converts it into a set of graph updates that account only for the most relevant part of the network. Then we illustrate how to convert the sequence of updates into image frames that can be combined into a movie depicting the network evolution. Alternatively, the updates can be fed directly to the Gephi Streaming API to produce an interactive visualization of the evolving network.

Input data format
The data required as input is a chronologically ordered sequence of interactions between nodes. The interactions can be either pairwise or cliques of interacting nodes. For instance, an input consisting of the two entries t1, n1, n2 and t2, n1, n3, n4 represents the occurrence of an interaction between nodes n1 and n2 at epoch time t1 and an interaction between n1, n3, and n4 at epoch time t2. Entries with more than two nodes are interpreted as interactions happening between each pair of members of the clique. Repetition of the same entry with the same time encodes the intensity of interaction.
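As an illustration of this input format, the following minimal Python sketch reads one entry and expands a clique into its pairwise interactions. The whitespace-separated layout, the integer epoch time, and the function names `parse_interaction` and `pairwise_interactions` are assumptions made for the example; the actual tool may use a different delimiter.

```python
from itertools import combinations


def parse_interaction(line):
    """Split one input line into (epoch_time, [node, ...]).

    Assumed layout (hypothetical): an integer epoch time followed by two
    or more node identifiers, all separated by whitespace.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"need a timestamp and at least two nodes: {line!r}")
    return int(fields[0]), fields[1:]


def pairwise_interactions(nodes):
    """Expand a clique into its unordered node pairs, skipping self-pairs."""
    unique = sorted(set(nodes))
    return list(combinations(unique, 2))


# Illustrative entry: a clique of three nodes at a single epoch time.
timestamp, nodes = parse_interaction("1359936000 n1 n3 n4")
pairs = pairwise_interactions(nodes)
```

A clique of k distinct nodes yields k(k-1)/2 pairs, so the three-node entry above produces three pairwise interactions.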

Differential network updates
The input data is processed by an algorithm that assigns scores to nodes and edges. The score is initialized at 0 for new nodes and edges, and it is updated for each line of the input. When processing an input line $t, n_1, \ldots, n_k$, the score of each node $n_i$, $i \in \{1, \ldots, k\}$, is incremented by a value $\Delta_i$ (Equation 1). The scores of the edges $(n_i, n_j)$, $i, j \in \{1, \ldots, k\} \wedge i \neq j$, connecting the nodes involved in the interaction are likewise incremented by $\delta_{i,j}$ (Equation 2). In general, the score increments can be adapted to the task; for example, Equations 1 and 2 give less importance to interactions happening in large cliques. Alternatively, for one of our case studies we use different increments (Equation 3).
To emphasize the most recent events and penalize stale ones, a forgetting mechanism is run every $F_{\text{forget}}$ frames: it decreases the scores of all nodes and edges by multiplying them by a forgetting constant $C_{\text{forget}}$. For the purpose of the visualization, the algorithm outputs the $N_v$ nodes with the highest scores that are not singletons, together with the edges whose scores exceed a threshold $S^{\min}_{\text{edge}}$.
The algorithm has two phases: buffering and generation of differential updates (see Figure 1). In the first stage, at most $N_b$ nodes with the highest scores are saved in a buffer together with the interactions among them. Whenever a node that does not yet appear in the buffer is read from the input, it replaces the node in the buffer with the lowest score. If an incoming input line involves a node that is already in the buffer, then its score and the scores of its edges are increased by $\Delta_i$ and $\delta_{i,j}$, respectively.
In the second stage of the algorithm, the differential updates to the visualized part of the network are created. To this end, the $N_v$ nodes in the buffer with the highest scores are selected. The subgraph induced by these $N_v$ nodes is compared with the subgraph in the previous frame and a differential update is created. Each differential update corresponds to a frame of the final visualization. One update is created per time interval, whose length is determined at the beginning of the algorithm by the time contraction parameter $T_{\text{contr}}$. Setting this parameter to 10 means that time flows 10 times faster in the visualization than in the input data.
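The scoring, forgetting, and selection steps can be sketched in a few lines of Python. This is a simplified illustration rather than the published implementation: the class `ScoreFilter` and its method names are hypothetical, the bounded buffer of the first stage is omitted, and the increments are passed in as functions of the clique size k, with constant defaults that are placeholders and not the increments defined in the text. Singletons are handled under one possible reading, namely that a selected node is dropped when none of its edges to other selected nodes exceeds the threshold.

```python
from itertools import combinations


class ScoreFilter:
    """Toy score-forget-select loop (no bounded buffer, placeholder increments)."""

    def __init__(self, n_visible, min_edge_score, forget_factor,
                 node_increment=lambda k: 1.0, edge_increment=lambda k: 1.0):
        self.n_visible = n_visible            # plays the role of N_v
        self.min_edge_score = min_edge_score  # plays the role of S_edge^min
        self.forget_factor = forget_factor    # plays the role of C_forget
        self.node_increment = node_increment  # placeholder for Delta_i
        self.edge_increment = edge_increment  # placeholder for delta_ij
        self.node_score = {}
        self.edge_score = {}
        self.shown_nodes = set()
        self.shown_edges = set()

    def add_interaction(self, nodes):
        """Raise the scores of all nodes and node pairs in one clique."""
        members = sorted(set(nodes))
        k = len(members)
        for node in members:
            self.node_score[node] = (self.node_score.get(node, 0.0)
                                     + self.node_increment(k))
        for edge in combinations(members, 2):
            self.edge_score[edge] = (self.edge_score.get(edge, 0.0)
                                     + self.edge_increment(k))

    def forget(self):
        """Multiply every score by the forgetting constant."""
        for scores in (self.node_score, self.edge_score):
            for key in scores:
                scores[key] *= self.forget_factor

    def next_frame(self):
        """Select the visible subgraph and diff it against the previous frame."""
        ranked = sorted(self.node_score,
                        key=lambda n: (-self.node_score[n], n))
        top = set(ranked[:self.n_visible])
        edges = {e for e, s in self.edge_score.items()
                 if s > self.min_edge_score and e[0] in top and e[1] in top}
        nodes = {n for e in edges for n in e}  # drops singletons
        update = {
            "add_nodes": sorted(nodes - self.shown_nodes),
            "del_nodes": sorted(self.shown_nodes - nodes),
            "add_edges": sorted(edges - self.shown_edges),
            "del_edges": sorted(self.shown_edges - edges),
        }
        self.shown_nodes, self.shown_edges = nodes, edges
        return update


flt = ScoreFilter(n_visible=2, min_edge_score=0.5, forget_factor=0.5)
flt.add_interaction(["a", "b"])
flt.add_interaction(["a", "b"])
flt.add_interaction(["b", "c"])
first = flt.next_frame()   # a and b are the two strongest nodes
flt.forget()               # old activity is halved
flt.add_interaction(["c", "d"])
flt.add_interaction(["c", "d"])
second = flt.next_frame()  # fresh activity between c and d takes over
```

In the example, the second update removes nodes a and b and adds c and d, because one forgetting step is enough for the fresh interactions to outweigh the older ones.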
The differential updates are written to output as a JSON file formatted according to the Gephi Streaming API (see bit.ly/16uGJKm). In short, each line of the JSON file corresponds to one update of the graph structure and contains a sequence of JSON objects that specify the addition, deletion, or attribute change of nodes and edges. We also introduced a new type of object to deal with on-screen labels, for example to display the date and time.

Computational complexity
We denote the numbers of buffered and visualized nodes by $N_b$ and $N_v$, respectively. The computational complexity of the buffering stage of the algorithm is $O(E N_b)$, where $E$ is the total number of pairwise interactions read (cliques are made of multiple pairwise interactions). The memory usage scales as $O(N_b^2)$. The second, frame-generating stage has computational complexity $O(F N_v \log N_v)$, where $F$ is the total number of frames, which is a fraction of $E$ and commonly many times smaller than $E$. The memory footprint of this stage is very low and scales as $O(N_v)$. Since $N_b$ and $N_v$ are fixed parameters and $F$ does not exceed $E$, the overall computational complexity of our method scales linearly with the number of interactions. The method is therefore fast enough to deal efficiently with extremely large dynamic networks.

Visualization
The JSON stream produced by the algorithm is fed to a Python module that builds a representation of a dynamic graph, namely an object that handles each of the updates and reflects the changes in its current structure. The transition between the structural states of the graph determined by a received update can be depicted by a sequence of image frames. In its initial state, the nodes in the network are arranged according to the Fruchterman-Reingold graph layout algorithm [23]. For each new incoming event, a new layout is computed by running N iterations of the layout algorithm, using the previous layout as a seed. Intermediate layouts are produced at each iteration of the algorithm. Every intermediate layout is converted to a PNG frame, and the frames are combined with the mencoder tool (bit.ly/zBryy) to produce a movie that shows a smooth transition between different states. To prevent nodes and edges from appearing or disappearing abruptly in the movie, we use animations that smoothly collapse dying nodes and expand new ones. A configuration file allows the user to modify the default movie appearance (e.g., resolution, colors) and some layout parameters.
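The seeded layout step can be illustrated with a small, dependency-free Python sketch of a Fruchterman-Reingold iteration. It is a simplified stand-in for the layout code of the tool: the function names `fr_step` and `transition_layouts`, the starting temperature, the cooling factor, and the random placement of new nodes are all assumptions made for the example.

```python
import math
import random


def fr_step(pos, edges, temperature, area=1.0):
    """Run one Fruchterman-Reingold iteration and return new positions."""
    nodes = list(pos)
    k = math.sqrt(area / max(len(nodes), 1))  # ideal edge length
    disp = {n: [0.0, 0.0] for n in nodes}
    # Repulsive force between every pair of nodes.
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = k * k / dist
            disp[u][0] += dx / dist * force
            disp[u][1] += dy / dist * force
            disp[v][0] -= dx / dist * force
            disp[v][1] -= dy / dist * force
    # Attractive force along every edge.
    for u, v in edges:
        dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
        dist = max(math.hypot(dx, dy), 1e-9)
        force = dist * dist / k
        disp[u][0] -= dx / dist * force
        disp[u][1] -= dy / dist * force
        disp[v][0] += dx / dist * force
        disp[v][1] += dy / dist * force
    # Move each node, capping the step length at the current temperature.
    new_pos = {}
    for n in nodes:
        length = max(math.hypot(disp[n][0], disp[n][1]), 1e-9)
        step = min(length, temperature)
        new_pos[n] = (pos[n][0] + disp[n][0] / length * step,
                      pos[n][1] + disp[n][1] / length * step)
    return new_pos


def transition_layouts(prev_pos, nodes, edges, n_iterations, seed=0):
    """Yield one intermediate layout per iteration, seeded by prev_pos."""
    rng = random.Random(seed)
    # Surviving nodes keep their previous position; new nodes start at random.
    pos = {n: prev_pos.get(n, (rng.random(), rng.random())) for n in nodes}
    temperature = 0.1  # assumed starting temperature
    for _ in range(n_iterations):
        pos = fr_step(pos, edges, temperature)
        temperature *= 0.9  # assumed cooling schedule
        yield dict(pos)


previous = {"a": (0.2, 0.2), "b": (0.8, 0.8)}
frames = list(transition_layouts(previous, ["a", "b", "c"],
                                 [("a", "b"), ("b", "c")], n_iterations=5))
```

Each yielded layout would correspond to one rendered frame. Because every step is capped by the temperature, nodes that persist between updates drift gradually from their previous positions instead of jumping.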

CASE STUDIES
We test our method on datasets that differ widely in nature, size, and time span. The datasets and the movies produced from each of them are described next. Additionally, we publish them online (see Appendix A and Additional Materials).

Twitter
We use data obtained through the Twitter gardenhose streaming API, which covers around 10% of the tweet volume. We focus on two events: the announcement of Osama Bin Laden's death and the 2013 Super Bowl. We consider user mentions and hashtags as entities and their co-occurrence in the same tweet as interactions between them.
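As a rough illustration of how a tweet can be turned into one interaction entry, the following Python sketch extracts hashtags and user mentions with a regular expression. The pattern, the lowercasing of entities, and the function name `tweet_to_interaction` are simplifying assumptions and do not reproduce the preprocessing used for the datasets.

```python
import re

# Simplified pattern: '#' or '@' followed by word characters.
ENTITY_PATTERN = re.compile(r"[#@]\w+")


def tweet_to_interaction(epoch_time, text):
    """Return (epoch_time, entities) or None if fewer than two entities occur."""
    entities = sorted({match.lower() for match in ENTITY_PATTERN.findall(text)})
    if len(entities) < 2:
        return None  # a single entity cannot form an interaction
    return epoch_time, entities


# Illustrative tweet text and timestamp, not taken from the dataset.
interaction = tweet_to_interaction(
    1359936000, "Lights out at the #SuperBowl! #blackout @Beyonce")
```

A tweet with three entities, as in the example, is interpreted as a clique of three interacting nodes.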
The first video (Figure 2A) shows how the anticipation for the Super Bowl grows steadily through early Sunday morning and afternoon, and how it explodes when the game is about to start. Hashtags related to #commercials and concerts (e.g., #beyonce) are evident. Later, the impact of the #blackout is clearly visible. Interest in the event drops rapidly after the game is over and stays low during the next day.
The video about the announcement of Bin Laden's death (Figure 2B) shows the initial burst caused by @keithurbahn and how the breaking news was spread by users @brianstelter and @jacksonjk. The video shows that the news appears later in #cnn and is announced by @obama. The breaking of this event on Twitter is described in detail by Lotan [34].

IMDB movies
We use a dataset from IMDB of all movies, their year of release, and all the keywords assigned to them (from imdb.to/11SZD). We create a network of keywords that are assigned to the same movies. For this dataset we use the score increments defined by Equation 3, because the most popular movies have many keywords attached to them. Our video (Figure 2C) shows an interesting evolution of the keywords from "character-name-in-title" and "based-on-novel" (first half of the 20th century), through "martial-arts" (70s and 80s), to "independent-film" (90s and later), "anime", and "surrealism" (2000s).

Patents
We use a set of US patents that were issued between 1976 and 2010 [33]. We analyze the appearance of words in their titles. Whenever two or more words appear in the title of a patent, we create a link between them at the moment when the patent was issued. To improve readability we filter out stopwords and the generic frequent words "method", "device", and "apparatus". Our video (Figure 2D) demonstrates that at the beginning of the period techniques related to "engine" and "combustion" were popular, and that they later start to cluster together with "motor" and "vehicle". Another cluster is sparked by patents about "magnetic" "recording" and "image" "processing". It merges with a cluster of words related to "semiconductor" and "liquid" "crystal" to form the largest cluster of connected keywords at the end of the period.

Discussion
The datasets in our case studies are fairly diverse in topicality, time span, and size, as shown in Table 1. Nevertheless, our method is able to narrow down the visualization to meaningful small subgraphs with fewer than 300 nodes in all cases. The high performance of the algorithm makes it viable for real-time visualizations of large live data streams. On a desktop machine, the algorithm producing differential updates of the network in the form of JSON files took several minutes to finish for the US patents and less than two minutes for the other datasets. Given such performance, it is possible to visualize in real time highly popular events such as the Super Bowl, which produced up to 4500 tweets per second. For explanatory purposes, we provide and describe the parameter values of the algorithm for each case study in Appendix B. Beyond these experimental datasets, on-demand animations of Twitter hashtag co-occurrence and diffusion (retweet and mention) networks can be generated with our tool via the Truthy service (truthy.indiana.edu/movies). Hundreds of videos have already been generated by the users of the platform and are available to view on YouTube (youtube.com/user/truthyatindiana/videos).
As already noted, our algorithm can also stream its results directly to Gephi, and the user can interact with the dynamic network that it produces. In Appendix A we explain how to test the Gephi visualization.

CONCLUSIONS AND FUTURE WORK
Tools for dynamic graph visualization developed so far do not provide any way to dynamically select the most important portions of large evolving graphs. We contribute to filling this gap by proposing an algorithm to select the nodes and edges that best represent the network structure at a given time. We implemented our approach in an open-source tool that takes as input a stream of interaction data and outputs a movie of the network evolution or a live Gephi animation.
As future work, we wish to improve our algorithm through further optimization, to enhance the tool by providing a standalone module for live visualization of the graph evolution, and to generalize our method to weighted networks.

Figure 1: Simple diagram of the main components of the algorithm.

Table 1: Statistics on the experimental datasets.