Fake news propagates differently from real news even at early stages of spreading

Social media can be a double-edged sword for society, serving either as a convenient channel for exchanging ideas or as an unexpected conduit for circulating fake news through a large population. While existing studies of fake news focus on theoretical modeling of propagation or on identification methods based on machine learning, it is important to understand the realistic propagation mechanisms that lie between theoretical models and black-box methods. Here we track large databases of fake news and real news, including their traces of re-postings, on two platforms from different cultures: Weibo in China and Twitter in Japan. We find in both online social networks that fake news spreads distinctively from real news even at early stages of propagation, e.g. five hours after the first re-postings. Our finding demonstrates collective structural signals that help to understand the different propagation evolution of fake news and real news. Different from earlier studies, identifying the topological properties of the information propagation at early stages may offer novel features for early detection of fake news in social media.


Main text
Social networks such as Twitter or Weibo, involving billions of users around the world, have tremendously accelerated the exchange of information and have thereby led to fast polarization of public opinion [1]. For example, a large amount of fake news circulated about the 3.11 earthquake in Japan, and about 80 thousand people were involved in both its diffusion and correction [2]. Such fake news, which can be fabricated stories or unconfirmed statements, circulates pervasively through the conduit offered by online social networks. Without proper debunking and verification, the fast circulation of fake news can largely reshape public opinion and undermine modern society [3]. Even worse, fake news can be intentionally fabricated, leading to diverse threats to modern society including turmoil or riot. Because fake news propagates fast, the later it is identified and corrected, the greater the damage it can cause. Thus, detecting fake news at its early stages, in order to effectively avoid further risks and damages, is crucial.
Unlike in the age of word of mouth, identification of fake news in online social networks by experts is generally labor-intensive and inefficient [4], which has attracted much research attention to alternative solutions. One intuitive idea for understanding the spreading of fake news is inspired by epidemic models. In the 1960s, Daley and Kendall proposed the so-called DK model [5], in which agents are divided into ignorants, spreaders and stiflers. Its later extensions are based on known epidemic spreading models such as the SIS model [6,7], the SIR model [8,9], the SI model [10,11] and the SIRS model [12]. While these studies focus on theoretical modeling of fake news propagation, the availability of real data from online social platforms, as we show here, provides an opportunity to deepen our understanding of realistic information cascades. Different kinds of observations have been made in empirical studies of fake news, including linguistic features [13], temporal features of re-postings [14][15][16] and user profiles [17][18][19]. However, information cascades in online social networks are collective propagation networks whose critical topological features remain unknown. This motivates our present study, which empirically analyzes and compares the propagation networks of fake and real news, especially in their early stages, so as to identify the propagation differences and the mechanisms behind them. These topological features could help to design machine learning approaches that boost the accuracy of fake news targeting [20][21][22].
Very recently, based on empirical datasets, it has been found that the propagation network of fake news is different from that of real news [23]. That study found that falsehood propagates significantly farther, faster, deeper, and more broadly than the truth in many categories of information. While it provides the possibility to differentiate fake news from real news based on the propagation network, it remains unclear how this difference between fake news and real news emerges and how soon one can separate the two types. Thus, a systematic study of the dynamic evolution of propagation topology is still missing. This motivated us to explore how the propagation evolves topologically in different scenarios. With collected real data, we identified early signals of fake news, at five hours from the first re-posting, without other information on contents or users. Note that, unlike the analysis in [23], which considers all the cascade components, our finding is valid even when following only the largest cascade component.
Based on realistic traces of real and fake news propagation in both Weibo (from China) and Twitter (from Japan), we use the re-posting relationships between different users to establish propagation networks (see Methods for details). Given similar popularity scales, we find that fake news shows significantly different topological features from real news. These novel topological features will enable us to design an efficient algorithm to distinguish between fake news and real news even shortly after their birth.

Results
To construct the propagation networks of fake and real news, we utilize the re-posting relation between different users participating in circulating the same message (see Methods and Table 1). A schematic description of such propagation networks is shown in Fig. 1A. Typical propagation networks of fake news and real news in Weibo and Twitter are demonstrated in Fig. 1B-E. The topology of the propagation network of fake news can be seen to differ from that of real news. For example, the number of layers in fake news (Fig. 1B and 1D) is typically larger than that in real news (Fig. 1C and 1E). Additionally, from looking at various examples of fake news propagation networks, it is somewhat surprising that for widely distributed fake news, the creator does not usually have the largest degree in the propagation network (Figs. S1 and S2). In the following, our analysis also considers real news created by non-official sources, to avoid artificial differences due to different types of information creators (official or non-official accounts).
Table 1 (fragment). Number of Twitter propagation networks (larger than 200 re-postings): 27, 28. The Weibo networks are from the first Weibo dataset.
Figure caption (fragment). It is seen that the layer sizes of real news and fake news are significantly different in both Weibo and Twitter. Real news networks tend to have a relatively larger first layer, while fake news networks are relatively uniformly distributed in different layers.
Layer ratio. The layer number is defined as the number of hops from the creator to a given node for a given propagation network. The cumulative numbers of nodes at different layers as a function of time are shown for four typical networks of fake news (Fig. 2A for Weibo and 2C for Twitter) and real news (Fig. 2B for Weibo and 2D for Twitter).
The fraction of re-postings in the first layer of a fake news network is found to be significantly smaller than that of real news, while the fraction in other layers is significantly larger for fake news than for real news. Early adopters, who re-post the message shortly after the creator, play a comparatively dominant role in circulating real news. These different roles lead to distinctive landscapes of propagation networks.
The investigation of layer sizes in propagation networks demonstrated in Fig. 2 is systematically extended to all the available messages. As shown in Fig. 3A and 3B, fake news networks tend to possess a relatively smaller first layer, while other layers are comparatively larger. Therefore, we define the ratio of layer sizes as the ratio between the sizes of the second and the first layer. As shown in the ratio distributions (Fig. 3C and 3D), the ratio for fake news is significantly larger than that for real news. The distribution of the ratio of layer sizes separates fake and real news well, with only a small overlapping area. Furthermore, it is seen in Fig. 3C that this difference is already significant at only five hours after the first re-posting. In Fig. 3D, it is seen that, for the whole lifespan, the separation of fake and real news is also significant. In the circulation of fake news, the success of the propagation depends highly on the branching process creating different layers, which shows different evolution paths for fake and real news. We further investigate the probability difference between fake and real news based on distributions of the layer ratio from the time of the first re-posting (Figs. S3 and S4). Note that the layer size distribution has a peak around layer four on Twitter in Fig. 3B, probably due to secondary outbreaks.
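The layer ratio described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the propagation network is given as a hypothetical mapping `children` from each user to the set of users who re-posted from them, and it counts hops from the creator with a breadth-first search.

```python
from collections import deque


def layer_sizes(children, creator):
    """Return {layer: number of nodes}, where layer = hops from the creator."""
    depth = {creator: 0}
    queue = deque([creator])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in depth:  # first visit gives the hop count
                depth[child] = depth[node] + 1
                queue.append(child)
    sizes = {}
    for d in depth.values():
        if d > 0:  # the creator itself (layer 0) is not counted
            sizes[d] = sizes.get(d, 0) + 1
    return sizes


def layer_ratio(children, creator):
    """Ratio of second-layer size to first-layer size (None if no first layer)."""
    sizes = layer_sizes(children, creator)
    n1 = sizes.get(1, 0)
    if n1 == 0:
        return None
    return sizes.get(2, 0) / n1
```

To obtain the early-stage value, one would build `children` only from re-postings that fall inside the chosen time window (for example the first five hours) before calling `layer_ratio`.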
It should be noted that real news is more likely to be created by official accounts such as government agencies or mass media agencies. In order to eliminate the possible effects of official creators, we also investigate the distribution of the ratio of layer sizes in real news from only non-official creators. While official news and non-official news have different sample sizes here, we found that both have propagation patterns different from those of fake news. For example, in Fig. 3C and 3D, the non-official real news and the fake news are found to have different distributions of the layer size ratio. To verify our results, we also analyze data of 2000 more real news from another dataset shown in Fig. S5.
Characteristic distance. While the ratio of layer sizes can be regarded as a local feature of the network structure, we further inspect a global feature in terms of the characteristic distance in a propagation network. As seen in Fig. 4A, distances between pairs of nodes in fake news are longer than those in real news, implying that later adopters foster the penetration of fake news in social networks. In order to quantify this finding for all the networks, we propose a second measure called the characteristic distance (a), shown in Fig. 4B (see Methods). Considering the distances of all the networks as in Fig. 4B, fake news possesses a significantly longer characteristic distance (4.26) than real news (2.59). Similar results can also be observed in Twitter propagations (Fig. 4C). The distributions of characteristic distances for all networks are shown in Fig. 4D, where the two curves of fake and real news are well separated. Different from the results in [23], we show that the size distributions of fake and real news are similar (Fig. S7). This suggests that, at similar levels of popularity, the characteristic distance of fake news differs significantly from that of real news. We also verified that the propagation size has little correlation with the characteristic distance (Fig. S8).
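One possible way to estimate a characteristic distance from a single propagation network is sketched below. It rests on two assumptions that are ours rather than stated in the text: links are treated as undirected when measuring distances, and the distance distribution decays as exp(-d/a), so that a is the negative inverse slope of a least-squares line through the logarithm of the pair counts for distances above one. The function names are hypothetical.

```python
import math
from collections import Counter, deque


def distance_counts(edges):
    """Count unordered node pairs at each shortest-path distance.

    Links are treated as undirected (an assumption of this sketch).
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = Counter()
    for start in adj:  # one breadth-first search per node
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        counts.update(d for d in dist.values() if d > 0)
    # every pair was reached once from each of its two endpoints
    return {d: c // 2 for d, c in counts.items()}


def characteristic_distance(counts, d_min=2):
    """Fit ln(count) against d for d >= d_min and return a = -1 / slope."""
    pts = [(d, math.log(c)) for d, c in counts.items() if d >= d_min and c > 0]
    if len(pts) < 2:
        raise ValueError("need at least two distances >= d_min")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("distance distribution does not decay")
    return -1.0 / slope
```

The all-pairs search costs on the order of N times the number of links, which is manageable for networks of a few thousand nodes; larger cascades may call for sampling source nodes instead.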
Structural heterogeneity. Network topology describes the geometry of connections and embeds more information than the scale statistics in [23]. Here we measure the heterogeneity (see Methods) of propagation networks of fake and real news. The parameter h reflects the difference between a given propagation network and its star-network counterpart of the same size; a network with smaller h is more similar to a star network. Although the out-degree distribution demonstrates only a minor difference between fake news and real news (Fig. S9), we find that the topological heterogeneity is significantly distinguishable. Note that the relationship between heterogeneity and N for star networks is a power law, as seen in Fig. 5A. The parameter h is the difference between the logarithm of the heterogeneity value of a real network, $H_r$, and the logarithm of the heterogeneity value of the same-size star network, $H_s$. The parameter h of fake news is significantly larger than that of real news. Consistent findings can also be observed on Twitter (Fig. 5B). In order to quantify the heterogeneity systematically, two distributions of h considering different time intervals are calculated. Fig. 5C shows a significant difference at five hours from the first re-posting. For the whole propagation lifespan in Fig. 5D, h of fake news is also significantly larger than that of real news. Fake news networks typically have lower heterogeneity (larger h) since their propagation involves few dominant broadcasters. On the contrary, real news demonstrates higher heterogeneity (smaller h) and a more star-like layout. The ability to distinguish fake news from real news also holds for real news posted by non-official users (Fig. S10). This implies that the indicator based on structural heterogeneity is independent of the creator type. Additionally, another measure, the Herfindahl-Hirschman Index (HHI [24]), also shows a distinction between fake news and real news (Fig. S11).
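The comparison with a same-size star network can be illustrated as follows. This sketch makes two loud assumptions: as a stand-in for the heterogeneity of [35], whose exact definition may differ, it uses the coefficient of variation of the degree sequence; and it takes h as log H_s minus log H_r, a sign convention chosen so that a star has h = 0 and larger h means a less star-like network, in line with the description above. All function names are hypothetical.

```python
import math


def degree_cv(degrees):
    """Coefficient of variation of a degree sequence.

    Stand-in heterogeneity measure (an assumption of this sketch).
    """
    n = len(degrees)
    mean = sum(degrees) / n
    var = sum((k - mean) ** 2 for k in degrees) / n
    return math.sqrt(var) / mean


def star_degrees(n):
    """Degree sequence of a star with n nodes: one hub and n - 1 leaves."""
    return [n - 1] + [1] * (n - 1)


def h_parameter(degrees):
    """h = log(H_s) - log(H_r), with H_s taken from the same-size star."""
    h_real = degree_cv(degrees)
    h_star = degree_cv(star_degrees(len(degrees)))
    if h_real <= 0 or h_star <= 0:
        raise ValueError("heterogeneity must be positive (needs N >= 3 and non-uniform degrees)")
    return math.log(h_star) - math.log(h_real)
```

With this stand-in, the star value grows roughly as the square root of N, which is compatible with the power-law relationship for star networks mentioned above, but it should not be read as a reproduction of Fig. 5A.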
The distinction between fake and real news is the highest for the heterogeneity measure among the above three indicators, as seen in Fig. 6 and Table 2. For a given Weibo network, measuring its h provides a clear difference between fake news and real news, even when considering only re-postings at five hours from the first re-posting (Fig. 6A). This identification becomes even sharper in Fig. 6B, when we consider all re-postings. We show in Fig. 6C the difference significance (see Methods) between fake news and real news for different h. The differences are about 76% and 79%, respectively, for re-postings at a relatively short time (five hours) and for all re-postings. Note that the probability of being fake news at five hours is already very similar to that for the whole propagation lifespan. The verification analysis (shown in Figs. S5 and S6) also demonstrates the difference significance between fake news and real news from another dataset, which is fully published by non-official accounts. Our results suggest that even without sophisticated features like texts or user profiles, direct and understandable topological features can offer high significance for developing early detection.
Classifier. The three features mentioned above, namely the ratio of layer sizes, the characteristic distance, and the heterogeneity parameter, can be used to create a Support Vector Machine (SVM) classifier. We randomly divide the dataset into a training set (60%) and a test set (40%) ten times. We find that the average accuracy of this classifier is 79.5% when applying the RBF kernel.
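The evaluation protocol in this paragraph, ten random 60/40 splits with the accuracy averaged over the splits, can be sketched with the standard library alone. To keep the example self-contained, a nearest-centroid rule is swapped in for the SVM; it is explicitly not the RBF-kernel SVM used in the text, and an SVM implementation from a machine learning library would take its place through the `fit` and `predict` arguments. The function names and the default seed are hypothetical.

```python
import random


def fit_centroids(X, y):
    """Stand-in classifier (not an SVM): mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict_centroids(model, row):
    """Return the label whose centroid is closest in squared distance."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(model, key=lambda lab: sqdist(model[lab]))


def repeated_holdout_accuracy(X, y, fit, predict,
                              train_frac=0.6, repeats=10, seed=0):
    """Average test accuracy over `repeats` random train/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    n_train = int(round(train_frac * len(X)))
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

Each row of `X` would hold the three features of one propagation network (layer ratio, characteristic distance, h). Because these features live on different scales, standardizing them before training is usually advisable for distance-based classifiers, including RBF-kernel SVMs.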

Discussion
Being the most vital and popular form of new media, online social networks fundamentally enhance the creation and dissemination of fake news [25,26]. Though existing solutions, especially machine learning approaches, perform impressively in targeting fake news, their black-box style prevents a solid understanding, and the corresponding development of methods, for debunking or blocking false information. On the other hand, the labor-intensive human approach is time-consuming and expensive. For example, verification usually takes at least three days [4] and therefore misses the optimal prevention window before massive spreading. In this sense, novel approaches that could help to identify fake news at early stages are urgently needed to prevent the negative impact of false information propagation on modern society.
We show here that fake news spreads with a very different network topology from authentic messages, even at early stages. In this manuscript, we focus on the differences in the evolution of the propagation topology of the two types of information at early stages, rather than on providing a comprehensive prediction approach [22]. Even taking only one feature, the difference between fake news and real news is significant. The propagation mechanism, which essentially couples information dynamics and collective cognition in social networks, results in distinctive landscapes of circulation for fake and real news. In this way, several early signals can be derived, including the layer ratio, the characteristic distance and the heterogeneity. Varol et al. study early detection of promoted campaigns using supervised machine learning, with features about diffusion patterns, content information, sentiment information, temporal signals, and user data [27]. Moreover, Vicario et al. study fake news by identifying polarizing content, using structural features, semantic features, user-based features and sentiment-based features [28]. In contrast, our suggested measures focus on structural features, which are simple, require no text analysis, and are time efficient. For example, the weak heterogeneity of fake news might be the result of opinion competition from weak ties between social communities. As "bad" is usually more influential than "good" [29], an unconscious "negative bias" might result in a late burst of fake news, which essentially differs from the spread of real news. Disclosing the intelligence factors that generate the specific topological features we found here can be a promising research direction in the future. Moreover, once we identify fake news, it is possible to study the nodes that participated in many networks. These nodes are much more active in the permeation of fake news and, as a result, are more likely to be bots. The study of these vital nodes in fake news propagation will play an important role in identifying and analyzing bots.
Note that our study has several major differences from Vosoughi et al. [23]. We focus more on the topological features (the shape of a network) than on scale measures of propagation networks (depth or width). Furthermore, we focus on the largest cascade component of the propagation network, while all the cascade components are considered in [23]. While both manuscripts confirm the difference between fake news and real news in different aspects, we find, surprisingly, that this difference can be very significant even at the early stages of propagation.

Methods
Weibo data preprocessing. We analyze 1701 Weibo propagation networks of fake news (with 973,391 users) and 492 Weibo propagation networks of real news (with 347,401 users) that spread on Weibo from 2011 to 2016. We choose large networks with more than 200 re-postings. More details are given in Table 1. The topics of these Weibo propagation networks include political, economic, fraudulent, tidbit and pseudoscience fake news (Fig. S12).
Fake news is officially investigated and confirmed by the Weibo platform [30]. Real news is collected directly from reliable Weibo accounts. Creators of the real news can be official accounts, for example, government accounts and online newspaper accounts. All these real news accounts have been officially verified by the Weibo platform. In addition, we manually select 51 out of the 492 real news networks whose creators are not official accounts. To verify our results, we also analyze another dataset (2000 more recent real news) from Weibo in Figs. S5 and S6. These 2000 real news networks come from more recent records that have been collected in the same way as above, and from non-official accounts.
In order to create the network, in which nodes are Weibo users and links are re-postings, we first mine the following data for both fake and real news: (a) Users: the unique serial number of each user who participates in the same network. We also mark the node of the network creator. (b) Re-postings: the unique serial number of each directed re-posting activity, together with the serial numbers of the source user and the re-posting user.
Twitter data preprocessing. Twitter data was collected from Japanese tweets posted between March 11th and March 17th, 2011, the period of the Great East Japan earthquake. During this period, a lot of fake news propagated on Japanese Twitter.
After gathering fake and real news tweets on a keyword basis, we focused on those with more than 200 tweets to create a retweet network. Here we define as nodes the screen names that appear in the tweet context, and as links the mention signs "@" between the author of the tweet and the screen names after the sign. This is because many users who retweeted fake news have already deleted their tweets or their accounts, and so do not appear in the database. Such deletions make the network more segregated and make it more challenging to capture the real structure of the networks. To avoid network segregation, we use the above-mentioned context-based method to create retweet networks. Furthermore, as of March 2011, many Japanese Twitter users did not clearly distinguish between the mention symbol "@" and the explicit retweet symbol "RT @". Note that if there are multiple "@" in one tweet, according to the above rules, we extract multiple screen names as nodes and link them in order from the beginning of the sentence to create the networks. We compared the two types of networks defined by the mention symbol and the retweet symbol in Fig. S13, and found that our major results still hold.
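The context-based link extraction can be illustrated with a short sketch. It reflects one reading of the rule above, not a confirmed implementation: the author and the screen names following each "@" are chained in their order of appearance, and each link is assumed to point from the mentioned (earlier) account to the account that mentions it. The regular expression and the function name are hypothetical.

```python
import re

# Screen names are assumed to consist of letters, digits and underscores.
MENTION = re.compile(r"@([A-Za-z0-9_]+)")


def mention_chain_links(author, text):
    """Chain the author and the mentioned screen names in order of appearance.

    Returns directed links (source, target). The assumed direction runs from
    the mentioned account to the account that mentions it.
    """
    names = [author] + MENTION.findall(text)
    links = []
    for later, earlier in zip(names, names[1:]):
        if later != earlier:  # skip self-mentions
            links.append((earlier, later))
    return links
```

For a tweet by `alice` with the text `RT @bob: RT @carol: ...`, this reading yields a chain from `carol` to `bob` and from `bob` to `alice`, so that accounts whose own tweets were deleted still appear as nodes.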
After creating networks, we extract the largest connected component (LCC) without consideration of link directions and analyze only those with an LCC size above roughly 200 nodes. The node with the oldest tweet time in the LCC is treated as the creator. All the fake and real news that we determined are shown in Additional file 1.
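The two steps in this paragraph, taking the largest connected component while ignoring link directions and then choosing the node with the oldest tweet time as creator, can be sketched as follows. The input layout (a list of link pairs and a hypothetical mapping `first_time` from user to earliest tweet time, in hours) is an assumption of this example.

```python
import math
from collections import deque


def largest_connected_component(edges):
    """Return the node set of the largest component, ignoring link directions."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in component:
                    component.add(nb)
                    queue.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def pick_creator(component, first_time):
    """Node with the oldest tweet time; users without a known time rank last."""
    return min(component, key=lambda user: first_time.get(user, math.inf))
```

A size filter such as `len(component) > 200` would then decide whether the network enters the analysis; the exact cut-off is described in the text only as roughly 200 nodes.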
Our method of creating a retweet network differs from that of previous literature [20,23], which used follower graphs and tweet data simultaneously to create a retweet network. Since we do not have a follower graph as of 2011, we applied this approximate method, extracting as much information as possible from the tweet context. In principle, because retweet information remains in the tweet context, the topology of the network should be equivalent to that in the previous literature, but the time information at a resolution of seconds is not accurate in our case. Therefore, we only use time information in hours in the Twitter analysis.
Definition of fake news and real news. In a recent paper by Lazer et al. [31], "fake news" is defined as fabricated information that mimics news media content in form, but without the news media's editorial norms and processes for ensuring the accuracy and credibility of the information. In our manuscript, for the Weibo data, fake news is false information fact-checked by the platform and verified as having been fabricated. Real news is collected directly from reliable Weibo accounts, all of which have been officially verified by the Weibo platform.
For Twitter data, the fake news is also false information, fact-checked against reliable evidence [32][33][34]. This is similar to the true/false news defined in the paper by Vosoughi et al. [23], whose rumor cascades are checked independently by six fact-checking organizations. However, since there was no official anti-rumor website in Japan as of 2011, we first gathered 57 topics listed on websites [32,33] and in a book [34]. These contents include tweets based on no evidence and malicious tweets, such as claims about the starvation of babies and elderly people, someone trapped under a server rack needing help, and the Japanese prime minister having a luxurious supper during the disaster. When collecting tweets, we combine a few keywords related to the contents of each fake news. These keywords were proper nouns, such as place names and personal names. After that, we excluded correction tweets, whose contents contradict the fake news and include keywords such as "false" and "mistake". Our typical procedure to gather fake news tweets is explained in a previous work [2]. To validate the fake news tweets, three graduate students at the University of Tsukuba checked independently whether these topics are fake and whether the gathered tweets are properly classified as fake news.
For real news on Twitter, we gathered 71 topics by combining keywords (proper nouns, such as place names and personal names), as with the fake news. Most of the collected tweets originated from official accounts with verified Twitter badges, such as government agencies, major newspapers and famous people. The contents included tweets about earthquake information, traffic information, donation information and so on. In addition, we also collected five topics that originated from civilians without badges and were widely retweeted. These tweet contents were related to small but correct tips during the disaster.
Establishing a network model. Based on the data described above, we establish a directed network as demonstrated in Fig. 1A. The users are the nodes in the network, and the re-postings are the edges. We mark the network creator in green. Each edge has a direction that is either from the creator to a re-poster or from an earlier re-poster to a later re-poster. We plot typical networks for both fake and real news on Weibo and Twitter in Fig. 1B to 1E.
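A minimal sketch of this construction is given below. It assumes the re-posting records are available as pairs of (source user, re-posting user), which is a hypothetical layout chosen for illustration, and it stores the directed network as a mapping from each user to the set of users who re-posted from them.

```python
def build_propagation_network(reposts):
    """Build a directed propagation network from re-posting records.

    `reposts` is an iterable of (source_user, reposting_user) pairs.
    Returns {user: set of users who re-posted from that user}; users who
    were never re-posted appear with an empty set.
    """
    children = {}
    for source, reposter in reposts:
        if source == reposter:
            continue  # ignore self re-postings
        children.setdefault(source, set()).add(reposter)
        children.setdefault(reposter, set())  # register leaf nodes too
    return children


def out_degrees(children):
    """Number of direct re-posters for every user in the network."""
    return {user: len(followers) for user, followers in children.items()}
```

Keeping leaf users in the mapping means that the number of keys equals the number of nodes, which is convenient when network size is used as a selection criterion.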
Ratio of layer sizes. The layer number is defined as the number of hops from the creator to a given re-poster. The ratio of layer sizes is a measure for each network defined as
$$\text{ratio of layer sizes} = \frac{n_2}{n_1},$$
where $n_1$ and $n_2$ are the sizes (numbers of nodes) of the first and second layers of a given network, respectively.
Characteristic distances. In order to measure the distances, for each network we first calculate the distances between all pairs of nodes in the network and plot their distribution on a logarithmic scale (y axis). It can be seen from Fig. 4 that the distribution can be approximated by an exponential function. We consider the linear part of the curves, where the x value (distance) is above one, and calculate the characteristic distance (a) accordingly.
Heterogeneity measure. The heterogeneity [35] is defined in terms of $N$, the number of nodes in the network, and $k_i$, the degree of node $i$. We show a scatter plot (Fig. 5A) for both fake and real news of Weibo. The black line is the theoretical line for a star network.