Uncovering nodes that spread information between communities in social networks
© Mantzaris; licensee Springer 2014
Received: 13 January 2014
Accepted: 17 September 2014
Published: 22 October 2014
Well defined community structures have been observed in many datasets gathered from online social networks. A large number of users participate in these networks, and the size of the resulting graphs poses computational challenges. There is a particular demand for identifying the nodes responsible for information flow between communities; for example, in temporal Twitter networks, edges between communities play a key role in propagating spikes of activity when the connectivity between communities is sparse and few edges exist between different clusters of nodes. The new algorithm proposed here is aimed at revealing these key connections by measuring a node’s vicinity to nodes of another community. We look at the nodes which have edges in more than one community and at the locality of nodes around them, which influences the information they receive and broadcast. The method relies on independent random walks of a chosen fixed number of steps, originating from nodes with edges in more than one community. For the large networks that we have in mind, existing measures such as betweenness centrality are difficult to compute, even with recent methods that approximate the large number of operations required. We therefore design an algorithm that scales up to the demand of current big data requirements and has the ability to harness parallel processing capabilities. The new algorithm is illustrated on synthetic data, where results can be judged carefully, and also on real, large scale Twitter activity data, where new insights can be gained.
Online social networks (OSNs) such as Facebook, LinkedIn and Twitter have inspired a great amount of research, whether regarding their uses in different aspects of our daily lives [Skeels and Grudin] or how an important scientific breakthrough can spread around the world [De Domenico et al.]. These networks can be very large; for example, Facebook currently holds around 1 billion user accounts. Despite the obvious computational challenges, analysis of these large datasets provides the opportunity to test hypotheses about human social behavior on an unprecedented scale, and hence to reveal a deeper understanding of that behavior. Furthermore, commercial, government and charitable enterprises can utilize the networks to inform campaigning, advertising and promotion. Hence, improvements in the analytical tools designed for analysing social networks have great potential impact.
Within the OSNs generated by users, community structures form naturally, and research into their detection is very active [Leskovec et al.; Fortunato]. These developments in community detection have produced a diverse set of methods which are at our disposal. Run times of the algorithms are a major concern, and current datasets can be too large for many of the algorithms available. One approach to dealing with the size is to use network samples; for example, Ferrara analyzes community structure in a subset of millions of nodes taken from Facebook. However, for the types of effects that span the entirety of a network, we wish to avoid sampling and deal with complete networks.
Communities in OSNs can emerge for many reasons. A key driver can be homophily [McPherson et al.], where some underlying similarity between users in a community leads to a higher number of edges between these users than with users in a different community. Matsuo and Yamamoto investigate homophily formation and evolution in an online social buyers setting. There, a community builds trust and supports the activity of online purchases, which motivates more in-depth research into the nature of the inter-community connections. Companies have an interest in their brand identity within OSN communities, as users now have the ability to broadcast brand information to many other users within their social reach. Although not based on data from OSNs, McAlexander et al. discuss the attributes that users exhibit to utilize their business associations, and how companies should work to cultivate their brand presence with customers. The authors also raise many interesting questions concerning the dynamic elements of brand presence which are relevant to this work.
It is also interesting to elaborate on how distinct communities are brought together to create large connected graphs. Without the connectivity between dense communities, isolated components would not support many of the fascinating phenomena that have been observed, notably the hugely influential small world effect [Travers and Milgram], where there exist surprisingly short paths between members of the network located in different communities. By definition, the density of the edges connecting communities is less than the density of edges within communities. The sparsity of the between-community connectivity is the basis for community separation quality measures such as the modularity index [Newman and Girvan]. The relatively low number of edges connecting communities gives them special importance, as they are critical for the graph’s connectivity. A recent study explores this network feature using examples from brain connectivity [Clune et al.], concluding that connection costs can explain these modular networks. For applications in OSNs, where companies seek to harness the power of Internet advertising, nodes which offer community traversal connections are critical targets [Subramani and Rajagopalan]. The aim of this work is therefore to give a simple and scalable methodology for defining and discovering this type of key structural component. For the remainder of this paper an edge connecting two different communities will be referred to as a boundary edge and the nodes on either side of these edges as boundary nodes.
It is important to keep in mind that the edges facilitate an information exchange, but when a node receives content, it decides independently whether to repeat this information to its set of follower nodes. In future time steps this set can include nodes that were previously not included in the sharing of this content, for whatever reasons or constraints might exist. For content to spread throughout the network, the decision to repeat the content must be made consistently and independently by many nodes. The number of times this must occur increases when there is a large number of distinct communities and only a few boundary nodes act as regulators for the content to cross communities and become viral. The term viral usually assumes that a large portion of the nodes in the network are aware of a piece of information or content, regardless of the specific niche community they may belong to. Viral activity can be identified through conversation volume spikes, or cascades, as users share a common piece of content in a short amount of time.
In general, content that users would classify as news provides many examples of viral spreading. Twitter is sometimes considered to be a news source, with Kwak et al. counting at least 85% of Tweets as being related to headline news, much of which is of commercial interest and opens possibilities for real time engagement. Real time monitoring of the events being discussed is therefore essential for automated engagement. With spikes in topics lasting on the order of minutes, the run time of an algorithm should be reduced as much as possible, and the ability of the algorithm to utilize multiple processors is highly desirable. The works of Weng and Lee and of Nichols et al. discuss this real time monitoring of events and give a number of case studies, comparing techniques for spike detection. Our work has a slightly different emphasis, since we aim to detect nodes and edges that facilitate propagation of information, and hence would be natural candidates for monitoring and intervention.
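A standard measure of a node’s control over information flow is the betweenness centrality of Freeman, which for a node v can be written as

$$ b(v) = \sum_{i \neq v \neq j} \frac{\sigma_{ij}(v)}{\sigma_{ij}}, $$

where $\sigma_{ij}$ counts the total number of shortest paths between i and j, and $\sigma_{ij}(v)$ counts how many of these pass through node v. Hence, $b(v)$ gives an indication of the amount of potential control or influence node v has on the information flow between all other nodes in the network. Computing this measure straightforwardly for each node requires a large number of operations, $O(N^3)$ for a network of N nodes, leading to a run time that is impractical for large networks. Using Brandes’ algorithm a complexity of $O(NM)$ is possible for an unweighted network with M edges, which is still time consuming for networks with millions of nodes.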
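To make the cost of shortest path betweenness concrete, the following is a minimal sketch of Brandes' approach for a small unweighted, undirected graph. The function name `betweenness`, the adjacency-dictionary input, and the convention of counting each unordered pair of end points once are illustrative assumptions, not taken from the paper.

```python
from collections import deque

def betweenness(adj):
    """Brandes-style betweenness for an unweighted, undirected graph.

    adj maps each node to a list of its neighbours. The result is
    unnormalized and counts each unordered pair (i, j) once.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:            # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was seen from both of its end points.
    return {v: b / 2.0 for v, b in score.items()}

# Path graph 0 - 1 - 2 - 3: only the interior nodes lie between others.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = betweenness(path)
```

Each source node triggers one full breadth-first search, which is where the product of node and edge counts in the run time comes from.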
The strict assumption that information flows along shortest paths (geodesics) is not always appropriate, as discussed by Newman, who proposes a random walk betweenness measure computed using matrix methods. An important criticism of the geodesic viewpoint, which also motivates the random walk alternative, is that when passing messages to target nodes, typical users do not have global network information and hence may not know the shortest paths between pairs of nodes along which to route a message. The run time for this random walk betweenness measure is $O((M+N)N^2)$ and the algorithm requires matrix inversions. We also note that the two betweenness measures above are designed for static networks, and changes in the size of communities over time can affect the distribution of the betweenness values amongst the nodes.
In networks arising from online social media, there are many cases where rich community structure is observed. The nodes connecting these separately clustered groups of nodes are referred to as boundary nodes here, and the edges connecting them are boundary edges. In this section our algorithm for measuring boundary node proximity is described. The goal is to be able to rank nodes in a network according to their ability to influence nodes across different communities through the information (content) they exchange. This will reveal the boundary nodes, which play a key role in exchanging information between different communities, and the nodes surrounding them in their local vicinity. The algorithm is based on the premise that information travels via a random walk rather than along a shortest path route.
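The set of boundary edges is

$$ W = \{(i,j) \in E : \ell_i = c_a,\ \ell_j = c_b,\ c_a \neq c_b\}, $$

where E is the edge set of the graph and $\ell_i$ is the community label of node i. Here $c_a$ and $c_b$ are two communities belonging to the list of community labels, C, in the graph. We assume that the number of community labels will be much less than the number of nodes, $|C| \ll N$. From the boundary edge set W, the boundary nodes B can be found. Due to the typical sparsity of the community connectivity, the number of boundary nodes will be much less than the total number of nodes, $|B| \ll N$.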
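As an illustration of how the boundary sets can be read off once community labels are available, here is a minimal sketch. The function name `boundary_sets`, the edge-list input and the `label` dictionary are hypothetical choices made for this example.

```python
def boundary_sets(edges, label):
    """Return (W, B): the boundary edges and the boundary nodes.

    edges is an iterable of undirected (u, v) pairs and label maps
    each node to its community label.
    """
    # A boundary edge joins two nodes carrying different labels.
    W = [(u, v) for u, v in edges if label[u] != label[v]]
    # Boundary nodes are the end points of the boundary edges.
    B = sorted({node for edge in W for node in edge})
    return W, B

# Two triangles joined by the single bridge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
label = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
W, B = boundary_sets(edges, label)
```

A single pass over the edge list is enough, so this step costs time proportional to the number of edges.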
Table 1 Outline of the boundary node vicinity algorithm
1. Extract the set of connected graphs from the original graph
2. For each connected component obtain the community labels for the graph
3. Obtain the set of boundary nodes
4. Measure the local vicinity of each boundary node using the fixed length random walk method and aggregate all of the values in the graph into a normalized score
The final step in Table 1 runs a number of i.i.d. random walkers from each boundary node until a convergence criterion is satisfied based on the number of visits to the nodes in the network. The number of visits to each node is counted and is a measure of ability to disseminate information across boundary edges and influence different communities. Steps 3 and 4 are described in more detail in the next subsection. This algorithm can be referred to as the boundary vicinity algorithm (BVA).
2.1 Boundary node analysis
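For each community label $c \in C$, a community adjacency matrix is defined from the adjacency matrix A of the graph as

$$ A^{(c)}_{ij} = A_{ij}\,\delta(\ell_i, c)\,\delta(\ell_j, c), $$

where $\ell_i$ is the community label of node i. Here $\delta$ is the Kronecker delta, which takes the value one when both inputs are equal and zero otherwise. To obtain these matrices it is not a requirement to iterate through each element. This will be clarified below. With the community adjacency matrices for each community label and the boundary nodes B belonging to the network, we can iterate through the nodes of B and run the series of random walkers localized on each boundary node and confined to each $A^{(c)}$.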
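One way to build a community-restricted adjacency structure without visiting every matrix entry is to filter a sparse adjacency list by label. The sketch below assumes an adjacency dictionary and a `label` dictionary; the name `community_adjacency` is hypothetical.

```python
def community_adjacency(adj, label, c):
    """Adjacency lists restricted to the nodes labelled c.

    Only edges incident to members of community c are inspected, so
    the dense N x N matrix is never scanned entry by entry.
    """
    members = {v for v in adj if label[v] == c}
    return {v: [w for w in adj[v] if w in members] for v in members}

# Two triangles joined by the bridge 2 - 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
label = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
adj_a = community_adjacency(adj, label, "a")
```

In the result for community "a", node 2 keeps its neighbours 0 and 1 but loses the bridge to node 3, so a walker started at node 2 cannot leave its community.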
The random walks used to measure the ability of nodes to influence and affect the boundary nodes have a fixed number of steps. For the walks to represent the localized region of the boundary nodes, B, they cannot be given an excessively large length, as this would dilute the importance of nodes closer to the boundary nodes. The Barabási-Albert model [Barabási and Albert; Albert and Barabási] uses the mechanism of preferential attachment to reproduce the growth characteristics of many networks. The average path length for these networks is $\ln N / \ln \ln N$, where we assume that the ceiling of the value is taken. We use this value as a baseline in deciding the number of random walk steps that can be taken before a piece of information loses the consistency and relevance of the original content. Various other values can be used for the number of steps taken in the random walks. The method is not sensitive to this value as long as the chains do not reach the stationary distribution, at which point the initial state of the chain no longer affects the final results; we are interested in the locality of nodes surrounding the initial state, which is one of the boundary nodes. Alternatives which worked well are the average path length, and the longest path length from the boundary node to another node in the same community.
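The fixed length walks can be sketched as follows. Several details here are assumptions made for illustration rather than the paper's specification: the baseline length is taken as the ceiling of ln N / ln ln N, a fixed number of walkers per boundary node replaces the convergence test, the starting node counts as a visit, and scores are normalized to sum to one. The names `walk_length` and `bva_scores` are hypothetical.

```python
import math
import random

def walk_length(n):
    """Baseline number of steps, ceil(ln n / ln ln n), for n >= 3 nodes."""
    if n < 3:
        raise ValueError("need at least 3 nodes so that ln ln n > 0")
    return math.ceil(math.log(n) / math.log(math.log(n)))

def bva_scores(adj, label, boundary, steps, walkers, seed=0):
    """Normalized visit counts of walks started at each boundary node.

    Each walk stays inside the community of its starting node.
    """
    rng = random.Random(seed)
    visits = {v: 0 for v in adj}
    for b in boundary:
        for _ in range(walkers):
            v = b
            visits[v] += 1  # assumption: the start counts as a visit
            for _ in range(steps):
                nbrs = [w for w in adj[v] if label[w] == label[b]]
                if not nbrs:  # isolated within its community
                    break
                v = rng.choice(nbrs)
                visits[v] += 1
    total = sum(visits.values())
    return {v: count / total for v, count in visits.items()}

# Two triangles joined by the bridge 2 - 3; nodes 2 and 3 are the boundary.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
label = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
scores = bva_scores(adj, label, [2, 3], walk_length(len(adj)), walkers=200)
```

Because the walks are short, nodes near a boundary node collect more visits than distant ones, which is the locality effect the fixed length is meant to preserve.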
The run time of this algorithm is dominated by the community detection phase. Due to the boundary node set being much smaller in size than the number of nodes, the loops required to iterate through them and perform the random walks will typically cost less than N.
Each component of the BVA (Table 1) can be made more efficient using parallelization methods. The first step, which requires breadth first search (BFS), can be parallelized using shared or distributed memory, following the works of Beamer et al., where the number of edges visited is significantly reduced. Another approach to BFS is that of Harish and Narayanan, which utilizes Nvidia GPUs, but the authors note the memory restrictions for large graphs that take up more than the graphics card memory of around 1 GB. When the network grows beyond a few million nodes and tens of millions of edges, memory becomes a concern, and the method of Reingold shows that the step of acquiring the set of connected components can be performed in log space. The community detection component can also be parallelized by using the method of Le Martelot and Hankin, resulting in a completely parallelizable algorithm. The last steps of the algorithm can naturally be parallelized by running the i.i.d. random walkers on separate processors at the same time. After they have completed their walks, the trajectories can then be monitored for convergence.
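Because the walkers are independent, they can be dispatched to a pool of workers and merged afterwards. The sketch below uses a thread pool from the Python standard library purely to show the structure; for CPU-bound walks a process pool or a distributed scheduler would be the more realistic choice. The function names and the one-seed-per-walker scheme are assumptions for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import random

def one_walk(adj, start, steps, seed):
    """One fixed-length random walk driven by its own generator."""
    rng = random.Random(seed)
    node, path = start, [start]
    for _ in range(steps):
        if not adj[node]:  # dead end: stop early
            break
        node = rng.choice(adj[node])
        path.append(node)
    return path

def parallel_visits(adj, boundary, steps, walkers, max_workers=4):
    """Run all walkers concurrently and merge their visit counts."""
    starts = [b for b in boundary for _ in range(walkers)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The walker index doubles as its seed, so results do not
        # depend on how the pool schedules the jobs.
        paths = pool.map(
            lambda job: one_walk(adj, job[1], steps, seed=job[0]),
            enumerate(starts),
        )
        counts = Counter()
        for path in paths:
            counts.update(path)
    return counts

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
counts = parallel_visits(adj, boundary=[2, 3], steps=3, walkers=50)
```

Giving every walker its own seeded generator keeps the walks independent and makes the merged counts reproducible across runs.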
Figure 2 shows the results of using the boundary vicinity algorithm (BVA) and calculating betweenness on a synthetically produced network. Three communities were generated independently with the ER model, and then a set of random nodes (26 here) was selected from these communities to be connected to a different community. These selected nodes become the boundary nodes in the network. There are 167 nodes, and the three communities have 87, 47 and 33 nodes with a total of 13 bridges between them. The chains of random walkers were run until the potential scale reduction factor (PSRF) convergence diagnostic was below 1.2. There are six subfigures labelled (a)-(f), where (a) and (b) show the normalized values from the algorithms (y-axis) given to each node in the network (x-axis). Subfigure (a), for the boundary vicinity algorithm, has a more evenly spread distribution across the nodes than betweenness produces in subfigure (b). We can see that betweenness gives almost absolute importance to the nodes on the boundary, with little emphasis on the nodes in the vicinity of those boundary nodes. Subfigures (c) and (d) display the networks with the vertices scaled according to the boundary vicinity measure and betweenness respectively. In (c) we can see the neighboring nodes of the boundary scaled as well. Subfigure (e) shows the proportion of overlap in the ranking between BVA and betweenness for an increasing number of nodes. We can see that both algorithms have almost complete overlap in choosing the top 26 nodes but differ in the order of the subsequent nodes. Subfigure (f) shows a scatter plot of the values for all the nodes with both algorithms. We can see how the top ranking nodes are clearly distinct from the bulk of the network and how BVA produces a greater variance for nodes not in the boundary set. These results are consistent across multiple runs and across alternative networks which varied the number of bridges connecting communities and the density of edges between nodes in a community.
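The PSRF threshold mentioned above refers to the Gelman-Rubin diagnostic, which compares the variance between chains with the variance within chains. The sketch below computes it for one scalar quantity recorded along several equal-length chains. Which quantity is monitored (for instance, the running visit frequency of a node) is an assumption left to the reader, and the name `psrf` is hypothetical.

```python
import statistics

def psrf(chains):
    """Gelman-Rubin potential scale reduction factor.

    chains is a list of equal-length sequences of one scalar quantity.
    Values close to 1 suggest the chains agree; the within-chain
    variance must be non-zero.
    """
    n = len(chains[0])
    means = [statistics.mean(chain) for chain in chains]
    within = statistics.mean([statistics.variance(chain) for chain in chains])
    between_over_n = statistics.variance(means)  # equals B / n
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5

agreeing = [[0, 1, 0, 1], [0, 1, 0, 1]]
disagreeing = [[0, 1, 0, 1], [10, 11, 10, 11]]
converged = psrf(agreeing) < 1.2
still_running = psrf(disagreeing) >= 1.2
```

With a threshold of 1.2, the first pair of chains would be accepted and the second would need more walkers or longer monitoring.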
In Figure 3, three communities are produced using the Barabási-Albert model of preferential attachment [Barabási and Albert; Albert and Barabási], and these communities are then connected by choosing nodes uniformly from each group. There are 360 nodes in 3 communities of 60, 120 and 180 nodes, with a total of 13 bridges. When BVA is run, the chains of random walkers that begin from the boundary nodes are run until the PSRF convergence diagnostic is below 1.2. The same format as in the previous figure is used. In the first row of subplots, (a) and (b), we can see again that BVA gives a wider distribution of scores to non-boundary nodes. In subfigures (c) and (d) we visualize the networks with the nodes scaled according to BVA and betweenness respectively. We can see that the highest degree nodes, which are central to the community they belong to, are scaled and highlighted in both cases. A critical difference is that the boundary nodes at the top receive a large score with BVA but are given minimal importance with betweenness. With betweenness the role of these nodes is redundant given alternative routes through nodes with higher degree and direct connections to many nodes in the community. In the effort to inspire cross pollination of communities with promoted content, the ability to saturate a user with fewer connections may be advantageous, and worth considering because such users may be influenced more easily. In (e) we look at the proportion of overlap between the rankings of the two algorithms as the number of nodes considered increases. We can see a local peak in the overlap when more nodes are included than the number of boundary nodes. This is because the structure of the network includes nodes in the vicinity of the boundary which lie on the shortest paths to other nodes in the community. In the last subfigure we can see the scatter plot of the BVA values and betweenness. The rankings of the two algorithms may be more similar to each other than for the connected ER communities, but the distribution is much narrower for betweenness in this case, highlighting the few boundary nodes that are also core to the communities.
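The overlap curves in subfigure (e) can be reproduced in spirit with a simple comparison of the top k nodes under two scores. The sketch below assumes both scores are dictionaries keyed by node; the name `top_k_overlap` and the toy values are hypothetical.

```python
def top_k_overlap(score_a, score_b, k):
    """Fraction of nodes shared by the top-k lists of two scores."""
    top_a = set(sorted(score_a, key=score_a.get, reverse=True)[:k])
    top_b = set(sorted(score_b, key=score_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k

# Toy scores for three nodes under two different measures.
bva = {"x": 3.0, "y": 2.0, "z": 1.0}
btw = {"x": 1.0, "y": 3.0, "z": 2.0}
curve = [top_k_overlap(bva, btw, k) for k in range(1, 4)]
```

Plotting `curve` against k gives the kind of overlap profile discussed above, with a value of one meaning that both measures select the same set of nodes.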
Subfigure (b) in Figure 5 shows the results of simulating an S-I epidemic on the network of three connected ER communities. Each simulation begins with a single node placed in the infected state, and each infected node can infect the nodes in its locality of a single edge according to the adjacency matrix. Three hundred independent simulations are run for each configuration of the transmission probability, and the average percentage of the network that is infected at each iteration is shown. The curve of the infected percentage stops where the simulations no longer infected new nodes on average. The maximum permitted iteration number for each simulation was set at 60. In the first plot the black line shows the results where the transmission probability is uniform across all nodes, and is 0.2 in this case. The blue and red lines show the cases where the top ten BVA and betweenness scoring nodes are removed/immunized from the spread of the infection. The rate of network infection is reduced in both cases, showing that both scores provide useful targets for limiting the spread. Since the communities had very sparse interconnectivity, the removal restricted the between-community spread, limiting the number of infected nodes. The second plot shows a slightly different strategy where, for the blue and red lines, instead of removing the top ten nodes based on BVA and betweenness, their probability of moving from the susceptible to the infected state is 0.01, compared with 0.2 for the rest of the nodes. The black line here is still the case of the uniform 0.2 probability used across the network. The rate of transmission is significantly reduced from the uniform case, and even more than in the results on the ER graph in the first subfigure, since the community connectivity relies on few edges.
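The S-I experiments described above can be sketched as follows. The update rule used here is an assumption: at each iteration every infected node tries once to infect each susceptible neighbour, and the success probability depends on the target node, so that removed nodes never become infected and selected nodes can be given a lower probability. All function and parameter names are hypothetical.

```python
import random

def si_epidemic(adj, start, p, iterations, rng, p_node=None, removed=()):
    """Fraction of all nodes infected after each iteration of an S-I process."""
    p_node = p_node or {}     # per-node overrides of the probability p
    removed = set(removed)    # immunized nodes can never be infected
    infected = {start}
    history = []
    for _ in range(iterations):
        newly = set()
        for v in infected:
            for w in adj[v]:
                if w in infected or w in removed:
                    continue
                if rng.random() < p_node.get(w, p):
                    newly.add(w)
        infected |= newly
        history.append(len(infected) / len(adj))
    return history

def mean_curve(adj, p, iterations, runs, seed=0, p_node=None, removed=()):
    """Average infection curve over independent runs with random seeds nodes."""
    rng = random.Random(seed)
    blocked = set(removed)
    candidates = [v for v in adj if v not in blocked]
    total = [0.0] * iterations
    for _ in range(runs):
        curve = si_epidemic(adj, rng.choice(candidates), p, iterations,
                            rng, p_node, removed)
        total = [t + c for t, c in zip(total, curve)]
    return [t / runs for t in total]

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
certain = si_epidemic(path, 0, 1.0, 4, random.Random(0))
blocked_run = si_epidemic(path, 0, 1.0, 4, random.Random(0), removed=[2])
```

Passing the top scoring nodes as `removed`, or mapping them to 0.01 in `p_node`, corresponds to the two intervention strategies compared in the text.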
Figure 4 shows the results of using BVA on the Zachary Karate club dataset. In subplot (a) the network is visualized and the vertices are scaled according to the normalized scores given by the BVA algorithm. The central members of the communities are given large values, as are the boundary nodes, since they are within the vicinity of the boundary. In subfigure (b) the overlap of the rankings from BVA and betweenness is shown against the number of nodes included; as with the previous two figures, the overlap peaks when the number of top ranked nodes included equals the number of boundary nodes. In Figure 5, subfigure (c) shows the results of simulating an S-I epidemic on this network. The case of removing the top scoring BVA and betweenness nodes is not presented due to the size of the network. Here only the top 3 scoring nodes for BVA and betweenness are given the reduced probability of infection of 0.01. As with the simulations on the other networks, BVA and betweenness target nodes which reduce the rate of infection.
When analyzing the Enron email dataset, the subset of nodes for which the position in the company is known is included. BVA and betweenness scores are calculated for each of the nodes in the network, and the top ten nodes whose roles are known are compared. BVA selects 3 vice presidents, 1 CEO, 2 managers, 2 traders, and 2 employees to be in the top ten. Betweenness selects 1 vice president, 1 managing director, 2 managers, 1 director of trading, 2 traders, 1 secretary and 2 employees. The list provided by BVA contains more company members with higher positions than the list provided by betweenness. This may not always be the case, but it does show that the features of the network extracted by BVA capture importance in the node placements.
The work presented here gives an efficient algorithm for ranking the ability of nodes in a network with community structure to spread information between clusters. Previously proposed methods impose large computational difficulties or are not based on principles which realistically model how information can spread across the communities. Focusing attention on the boundary nodes in a network can be critical for monitoring whether content may reach the point of becoming viral. In practice not all of the nodes in the network may be directly influenceable. An alternative approach can be to indirectly influence a chosen node by targeting the local vicinity of the node in the network. The boundary vicinity algorithm (BVA) acknowledges nodes that may be placed in such a position as to have more or less influence on content leaving or entering a community of nodes in the network.
A strength of this boundary vicinity algorithm is that it combines the power of community detection algorithms with the use of random walkers to assist in the process of investigating the range of influence of the boundary nodes. The results show that this algorithm is comparable with betweenness centrality without the requirement for the full maturity of a network to be visible. In situations where the observed connectivity is changing, analysing the network in sections based on a community structure is an approach that provides more consistent results over time. The algorithm has a single tuning parameter which determines the number of steps a random walker takes from the boundary nodes. Using a fraction of the average path length for networks constructed with the Barabási-Albert model has given stable results in our experiments.
Measures such as betweenness can provide a set of optimal targets for spreading content along shortest path routes throughout the complete connected network. This task ignores the challenges that might be faced when attempting to promote activity in the critical set of nodes which lie on the boundary of the communities making up the complete network. A list of the nodes which are best positioned to quickly spread a piece of content does not address many of the practical challenges in inspiring activity as a non-invasive influencer. Assessing the vicinity of the influencers of the boundary nodes gives a reasonable subset to which attention must be given to ensure that cross pollination between clustered sets of nodes can occur.
Overall, the proposed algorithm has the potential to quickly handle the task of analysing an online stream of large datasets. This is especially important for real time event monitoring in environments such as Twitter, where topic discussions can grow and decay rapidly. With the goal of spreading content as far as possible, the boundary nodes of a community, and the nodes in their close vicinity, must be targeted, which is at the core of the method proposed here.
Thanks are given to Peter Laflin for providing feedback over the course of this work, and to Bloom Agency, Leeds, for supplying anonymised Twitter data. This work was performed as part of the Mathematics of Large Technological Evolving Networks (MOLTEN) project, which is supported by the Engineering and Physical Sciences Research Council and the Research Councils UK Digital Economy programme, with grant EP/I016058/1. Follow-on support was provided by the University of Strathclyde with Bloom Agency through the Impact Acceleration Account.
- Skeels MM, Grudin J: When social networks cross boundaries: a case study of workplace use of Facebook and LinkedIn. In Proceedings of the ACM 2009 international conference on supporting group work. ACM, New York; 2009:95–104. doi:10.1145/1531674.1531689
- De Domenico M, Lima A, Mougel P, Musolesi M: The anatomy of a scientific gossip. 2013. [http://arxiv.org/abs/arXiv:1301.2952]
- McAlexander JH, Schouten JW, Koenig HF: Building brand community. J Mark 2002, 66:38–54. doi:10.1509/jmkg.66.1.38.18451
- Leskovec J, Lang KJ, Dasgupta A, Mahoney MW: Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math 2009, 6(1):29–123. doi:10.1080/15427951.2009.10129177
- Fortunato S: Community detection in graphs. Phys Rep 2010, 486(3):75–174. doi:10.1016/j.physrep.2009.11.002
- Ferrara E: A large-scale community structure analysis in Facebook. EPJ Data Sci 2012, 1(1):1–30. doi:10.1140/epjds9
- McPherson M, Smith-Lovin L, Cook JM: Birds of a feather: homophily in social networks. Annu Rev Sociol 2001, 27:415–444. doi:10.1146/annurev.soc.27.1.415
- Matsuo Y, Yamamoto H: Community gravity: measuring bidirectional effects by trust and rating on online social networks. In Proceedings of the 18th international conference on World Wide Web. ACM, New York; 2009:751–760. doi:10.1145/1526709.1526810
- Travers J, Milgram S: An experimental study of the small world problem. Sociometry 1969, 32:425–443. doi:10.2307/2786545
- Newman ME, Girvan M: Finding and evaluating community structure in networks. Phys Rev E 2004, 69(2). doi:10.1103/PhysRevE.69.026113
- Clune J, Mouret J-B, Lipson H: The evolutionary origins of modularity. Proc R Soc Lond B, Biol Sci 2013, 280(1755). doi:10.1098/rspb.2012.2863
- Subramani MR, Rajagopalan B: Knowledge-sharing and influence in online social networks via viral marketing. Commun ACM 2003, 46(12):300–307. doi:10.1145/953460.953514
- Kwak H, Lee C, Park H, Moon S: What is Twitter, a social network or a news media? In Proceedings of the 19th international conference on World Wide Web. ACM, New York; 2010:591–600. doi:10.1145/1772690.1772751
- Weng J, Lee B-S: Event detection in Twitter. ICWSM 2011.
- Nichols J, Mahmud J, Drews C: Summarizing sporting events using Twitter. In Proceedings of the 2012 ACM international conference on intelligent user interfaces. ACM, New York; 2012:189–198. doi:10.1145/2166966.2166999
- Freeman LC: A set of measures of centrality based on betweenness. Sociometry 1977, 40:35–41. doi:10.2307/3033543
- Brandes U: A faster algorithm for betweenness centrality. J Math Sociol 2001, 25(2):163–177. doi:10.1080/0022250X.2001.9990249
- Newman ME: A measure of betweenness centrality based on random walks. Soc Netw 2005, 27(1):39–54. doi:10.1016/j.socnet.2004.11.009
- Barabási A-L, Albert R: Emergence of scaling in random networks. Science 1999, 286(5439):509–512. doi:10.1126/science.286.5439.509
- Lancichinetti A, Fortunato S: Community detection algorithms: a comparative analysis. Phys Rev E 2009, 80(5). doi:10.1103/PhysRevE.80.056117
- Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E: Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008, 2008(10). doi:10.1088/1742-5468/2008/10/P10008
- Clauset A, Newman ME, Moore C: Finding community structure in very large networks. Phys Rev E 2004, 70(6). doi:10.1103/PhysRevE.70.066111
- Albert R, Barabási A-L: Statistical mechanics of complex networks. Rev Mod Phys 2002, 74(1):47. doi:10.1103/RevModPhys.74.47
- Gelman A, Rubin DB: Inference from iterative simulation using multiple sequences. Stat Sci 1992, 7:457–472. doi:10.1214/ss/1177011136
- Beamer S, Asanović K, Patterson D: Direction-optimizing breadth-first search. Sci Program 2013, 21(3):137–148.
- Beamer S, Buluç A, Asanović K, Patterson DA: Distributed memory breadth-first search revisited: enabling bottom-up search. Technical report, DTIC document; 2013.
- Harish P, Narayanan PJ: Accelerating large graph algorithms on the GPU using CUDA. In High performance computing - HiPC 2007. Springer, Berlin; 2007:197–208. doi:10.1007/978-3-540-77220-0_21
- Reingold O: Undirected connectivity in log-space. J ACM 2008, 55(4). doi:10.1145/1391289.1391291
- Le Martelot E, Hankin C: Fast multi-scale community detection based on local criteria within a multi-threaded algorithm. 2013. [http://arxiv.org/abs/arXiv:1301.0955]
- Zachary WW: An information flow model for conflict and fission in small groups. J Anthropol Res 1977, 33:452–473.
- Chapanond A, Krishnamoorthy MS, Yener B: Graph theoretic and spectral analysis of Enron email data. Comput Math Organ Theory 2005, 11(3):265–281. doi:10.1007/s10588-005-5381-4
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.