Fake news propagates differently from real news even at early stages of spreading

Social media can be a double-edged sword for modern communications, either a convenient channel for exchanging ideas or an unexpected conduit for circulating fake news through a large population. Existing studies of fake news focus on theoretical modelling of propagation or on identification methods based on black-box machine learning, neglecting the possibility of identifying fake news using only the structural features of its propagation compared with those of real news, and in particular the ability to identify fake news at early stages of propagation. Here we track large databases of fake news and real news on both Twitter in Japan and its counterpart Weibo in China, and accumulate their complete traces of re-posting. Both media consistently reveal that fake news spreads distinctively, even at early stages of spreading, in a structure that resembles multiple broadcasters, while real news circulates with a dominant source. A novel predictability feature emerges from this difference in their propagation networks, offering new paths for early detection of fake news in social media. Instead of commonly used features such as texts or users for fake news identification, our finding demonstrates collective structural signals that could be useful for filtering out fake news at early stages of its propagation.


Introduction
Social networks such as Twitter or Weibo, with billions of users around the world, have tremendously accelerated the exchange of information and thereafter led to fast polarization of public opinion [1]. For example, one widely circulated fake news tweet involved about 80 thousand people in both its diffusion and its correction [2]. In the absence of authentic information, fake news, which can be stories or statements without confirmation, circulates pervasively through the conduit offered by online social networks. As the saying goes, three people are enough to make you believe there is a tiger around: without proper control, the fast circulation of fake news can eventually confound right and wrong and profoundly reshape public opinion.
For example, according to research by the Pew Research Center, 44% of Americans get their news from Facebook. Moreover, in the US elections of 2016, BuzzFeed found that 38% of posts shared from three large right-wing politics pages on Facebook included "false or misleading information". Even worse, fake news can be intentionally fabricated to mislead, and its broad spread can thus result in diverse threats to modern society, even leading to turmoil or riot. Therefore, detecting fake news at early stages, in order to effectively block further propagation, is essential yet challenging. Existing approaches are generally labor-intensive and rely on the efforts of experts and volunteers. Furthermore, the efficiency of manual filtering is extremely low; e.g., fact-checkers at Facebook take more than three days to reach a human judgment, by which time prevention is usually too late [3]. Thus, a fast and reliable way of identifying fake news at early stages in social networks is urgently needed but still missing.
The first intuitive idea related to structural features is to untangle the relation between the social network and the circulation of fake news. Inspired by epidemic models, in the 1960s Daley and Kendall proposed the so-called DK model, in which agents are divided into ignorants, spreaders and stiflers [4]. Its later extensions draw on the known epidemic spreading models, such as the SIS model [5,6], the SIR model [7,8], the SI model [9,10] and the SIRS model [11]. Existing studies from the structural perspective usually devote significant effort to theoretical modelling and rather less to empirical investigation, thus inspiring only a few realistic applications. Moreover, these studies focus mainly on the structure of social networks and not on propagation networks, which are composed of re-postings and might provide realistic traces of how information flows. This motivates the present study to find out empirically which structural features characterize the propagation networks of fake and real news. We find that the differences between these networks offer a novel global distinction that leads to important insight.
Interesting observations on the empirical propagation of fake news, ranging from contents [12] and re-postings [13][14][15] to user profiles [16][17][18], have been extensively demonstrated. Even more inspiring, machine learning approaches that combine structural features with temporal and linguistic features substantially boost the accuracy of fake news targeting [19,20]. More specifically, it has been indicated that on both Twitter and Weibo, the layer of re-postings of fake news seems larger than that of real news [21,22]. However, while the underlying social network and the information content have been explored in combination, the time evolution of the propagation networks emerging from the interaction and feedback during information spreading has rarely been studied. Thus, whether and how the network structure of fake news evolves differently from that of real news remains unclear. We show here that early signals of fake news (in this paper, within five hours) can be detected effectively without information on contents or users, which can be useful for fake news debunking.
Based on realistic traces of real and fake news propagation on both Weibo and Twitter, we consider here the evolution of re-posting relationships between different users in order to establish propagation networks. We identify structural characteristics that clearly distinguish fake from real news in terms of only the structure of propagation networks. More specifically, even a short time (about five hours) after the first re-posting, the ratio of layer sizes (a mesoscopic property), the characteristic distance (a global property) and the heterogeneity measure (a global property considering the different importance of users) are significantly larger for fake news. These measures can be effective and efficient signals for targeting and debunking fake news at early stages of circulation. Importantly, our analysis also considers real news created by non-official sources, which suggests that the differences are due to different types of propagation networks (real news or fake news) rather than different types of creators (official or non-official accounts). Our results could help to understand the mechanism of fake news spread and inspire efficient detection methods of high predictability.

Results
For analyzing the propagation of fake and real news, we construct a network from the re-postings between the different users participating in circulating the message (see Methods). A schematic description of such propagation networks is shown in Fig. 1A.
Two typical propagation networks of fake news and real news in Weibo and Twitter are also demonstrated in Fig. 1B-D. The layer value in this paper is a property of edges, defined as the number of re-posting hops from the creator. Additionally, from looking at various examples of fake news propagation networks, it is somewhat surprising that for widely distributed fake news, the creator does not usually have the largest degree in the propagation network (Fig. S1). For example, in Fig. 1D, node a is the creator, while node b, which re-posted the message later, is the node with the largest degree and acts as a broadcaster producing popularity spikes. On the contrary, in real news networks the creator is far more likely to be the node of the largest degree (Fig. S2), implying the dominance of the source in the circulation of real news. The above observations suggest that further investigation of propagation networks, especially a comprehensive comparison of fake news and real news, will be insightful and useful.
Layer sizes and their ratio. The cumulative number of nodes as a function of time is shown for four typical networks: fake news (Fig. 2A) and real news (Fig. 2B) in Weibo, as well as fake news (Fig. 2C) and real news (Fig. 2D) in Twitter. As can be seen, the plots suggest that the fraction of re-postings in the first layer of fake news is significantly smaller than that of real news. In later layers, however, the fraction of re-postings of fake news is significantly larger than that of real news. This demonstrates that early adopters, who re-post the message shortly after the source, play different roles in circulating fake news and real news; for real news in particular, the large number of early adopters essentially determines the whole propagation. These different roles lead to distinctive landscapes of propagation networks.
The analysis of layer sizes in propagation networks demonstrated above is systematically extended to all the messages. As Fig. 3A and 3B demonstrate, fake news networks tend to possess a relatively smaller first layer, while their other layers are larger compared to real news. The ratio of layer sizes in a network is simply defined as the ratio of the size of the second layer to that of the first layer. As shown in Fig. 3C and 3D, this ratio in fake news is significantly larger than that of real news, and the distributions of the ratio separate the two types well, with only a small overlapping area. This novel finding shows that the ratio is a good indicator for distinguishing fake news from real news. In the circulation of fake news, early re-posters (in the first layer) play only a minor role in broadcasting the information, and the success of the propagation depends heavily on later re-posters, for example those in the second layer. For real news, in contrast, the role of the first layer is dominant and these early re-posters basically determine how far the information will spread. We further investigate the distributions of ratios at different propagation durations from the time of the first re-posting (Fig. S3). In both Fig. 3C (within five hours) and Fig. 3D (for the whole lifetime), surprisingly, the separation of fake and real news is significant, which makes targeting fake news at an early stage possible (Fig. S4).
It should be noted that while fake news is usually posted by ordinary users, real news is more likely to be created by official accounts such as government agencies or mass media. In order to eliminate the possible impact of official creators, we also investigate the distribution of the ratio of layer sizes for real news when including only non-official creators. The purpose of this investigation is to verify that the differences revealed above are between fake news and real news, and not just between official and non-official creators. As seen in Figs. 3C and 3D, the ratio of fake news is significantly larger than that of real news also for non-official creators, and the two distributions again separate fake and real news well, with only a small overlapping area. Thus, the difference between fake news and real news in terms of layer ratio is stable and can be useful as an early indicator for targeting fake news efficiently.

Distance between all pairs of nodes.
The ratio of layer sizes is a mesoscopic feature of network structure. In order to corroborate and understand the differences between fake and real news in a more comprehensive view, we further inspect a global feature, namely the distance between all pairs of nodes. As can be seen in Fig. 4A, distances between pairs of nodes in fake news are relatively longer than those of real news, implying that later adopters help deepen the penetration of fake news in social networks. In order to quantify this finding for all the networks, we propose a second measure called the characteristic distance, denoted a, which is the inverse of the slope in Fig. 4B (see Methods). Considering all the Weibo networks together, as in Fig. 4B, fake news possesses a significantly longer characteristic distance (4.26) than real news (2.59). Similar results can also be observed in Twitter propagations (Fig. 4C). The distributions of characteristic distances for all networks are shown in Fig. 4D. The distances in fake news are relatively large compared to real news, and the two curves of fake and real news are also well separated. In particular, the distinction in terms of characteristic distance also holds for real news of only non-official creators (Fig. 4D), implying that the indicator based on distance is independent of the type of creators. The distance can also identify fake news with high accuracy even at an early stage (Fig. S5).
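As a rough illustration of how a characteristic distance of this kind could be estimated, the sketch below collects all pairwise shortest-path distances of a network and fits a straight line to the logarithm of P(d) for distances above one. The function names, the use of undirected distances, the natural logarithm and the plain least-squares fit are assumptions made for this example; they are not necessarily the procedure used to produce Fig. 4.

```python
import math
from collections import Counter, deque


def pair_distances(edges):
    """Shortest-path lengths between all connected node pairs (links treated as undirected)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    dists = []
    for start in adj:
        seen = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen[w] = seen[u] + 1
                    queue.append(w)
        dists.extend(d for d in seen.values() if d > 0)
    # Every pair is counted twice (once from each end), which leaves P(d) unchanged.
    return dists


def characteristic_distance(dists, d_min=2):
    """Fit ln P(d) = c - d / a over d >= d_min by least squares and return a."""
    counts = Counter(dists)
    total = sum(counts.values())
    xs = sorted(d for d in counts if d >= d_min)
    if len(xs) < 2:
        raise ValueError("need at least two distinct distances to fit a slope")
    ys = [math.log(counts[d] / total) for d in xs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return -1.0 / slope
```

For a real propagation network one would call `characteristic_distance(pair_distances(edges))`, where `edges` is the list of re-posting links; restricting the fit to the visibly linear range of the semi-log plot, as in the caption of Fig. 4, would require an additional upper cut-off.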
Structural Heterogeneity. The normalized characteristic distance is also a global feature of networks. Note that the distance measure treats important users and ordinary users equally and ignores the differences across users. Considering degree as one of the most basic measures of node importance, which can be easily obtained without global information, we analyze the degree distribution and the heterogeneity measure (see Methods). The quantity h reflects the difference between a propagation network and its star-network counterpart (the most heterogeneous topology) of the same size. Note that the relationship between heterogeneity and N for the star network is linear in Fig. 5A. A network with smaller h is more similar to a star network and therefore has higher structural heterogeneity. Though the degree distribution demonstrates only a slight difference between fake news and real news (Fig. S6), it is interesting to find that the heterogeneity is significantly distinguishable. As can be seen in Fig. 5A, h of fake news is relatively larger compared to h of real news. Consistent findings can also be observed in Twitter, as seen in Fig. 5B. In order to quantify the heterogeneity more systematically, two distributions of h considering different time intervals are further given in Fig. 5C (within five hours after the first re-posting) and Fig. 5D (after the whole lifetime). We find that h of fake news is significantly larger than that of real news and the two distributions are well separated. It is revealed that in both media, Weibo and Twitter, fake news networks have lower heterogeneity (larger h), indicating that their propagation involves a few important broadcasters, while real news demonstrates higher heterogeneity (smaller h) and a more stable layout, suggesting that a single important source dominates the entire propagation. The ability to distinguish fake news from real news also holds for real news posted by non-official users (Fig. S7).
This implies that the indicator based on degree heterogeneity is independent of the type of creators. Additionally, another measure based on degree, named the Herfindahl-Hirschman Index (HHI), shows similar results (Fig. S8).
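For readers unfamiliar with the HHI, the following minimal sketch applies its standard form (the sum of squared shares) to node degrees. Treating each node's share as its degree divided by the total degree is an assumption made here for illustration, and `degree_hhi` is a hypothetical helper name; the exact variant behind Fig. S8 may differ.

```python
from collections import Counter


def degree_hhi(edges):
    """Sum of squared degree shares; larger values mean degree is more concentrated."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    total = sum(degree.values())
    return sum((k / total) ** 2 for k in degree.values())


# Two toy networks with five nodes each: a star (single dominant source)
# and a chain (re-postings passed along one by one).
star = [("s", leaf) for leaf in "abcd"]
chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
```

In this toy comparison the star yields a higher index than the chain, in line with the idea that a single dominant source concentrates the degree.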
The heterogeneity measure demonstrates the highest predictive accuracy among the three measures (Table 2), as shown in Fig. 6. For a Weibo network of unknown type, the parameter h provides a high probability of identifying fake news, as seen in Fig. 6A (considering re-postings within five hours after the first re-posting) and Fig. 6B (considering all re-postings). We show in Fig. 6C the accuracy (see Methods) of predicting fake news for different h. The weighted accuracy is about 76.4% for re-postings within a relatively short time (five hours) and about 78.7% for all re-postings. The former accuracy is just a little below the latter, indicating that the predictability is high even after short times. Our results suggest that even without sophisticated features like texts or user profiles, explicit and understandable structural features can offer early detection with good accuracy.

Discussion
New media means new conduits for communication, of either misinformation or correct information. Being among the most vital and popular forms of new media, social networks fundamentally enhance the creation and dissemination of fake news, which could turn a massive population into victims of false messages [23,24]. Though existing solutions, especially machine learning approaches, perform impressively at targeting fake news, their black-box nature essentially prevents them from offering a solid understanding of false information or methods for debunking and blocking it.
A labor-intensive human approach is reliable and can be an option; e.g., Facebook trains and employs experts and volunteers to manually filter out fake news. However, this usually takes at least three days [3] and misses the optimal prevention window. Only signals that efficiently and accurately identify fake news at early stages can be useful in preventing the negative impact of false information on modern society.
Inspired by previous explorations, we claim here that fake news spreads differently in network structure compared to authentic messages, even at short times. Empirical observations from comprehensive traces on both Twitter and Weibo solidly support this idea and, as we show here, the structural differences appear already after a short time and can be useful as early detection signals (Fig. 6A). Specifically, we find that late adopters who broadcast fake news greatly boost the circulation and penetration of false information in social networks. On the contrary, for real news few late adopters re-ignite the spread, and the source dominates the entire propagation.
The mechanism, which essentially couples information and structure in social networks, results in distinctive landscapes of circulation for fake and real news. For example, fake news spreads in a manner of multiple stars, while real news usually explodes around a single source. More importantly, several early signals can be confidently derived from the mechanism we revealed, including the layer ratio, the average distance and the degree heterogeneity. Among them, the longer distance of fake news might be the result of more re-postings through strong ties between acquaintances: users are likely to re-post only from their acquaintances rather than from the source, which results in longer distances. On the contrary, shorter distances in real news are more likely due to re-posting of authentic messages through weak ties [25] established between 'strangers' who have never met, especially considering the existence of Dunbar's number [26]. Different from the sophisticated and time-consuming features that have been considered in contemporary solutions, our suggested measures are simple, explainable, and explicit. Moreover, they can be precisely probed within five hours after the propagation starts, implying the possibility that fake news can be undermined in time.
Like rumors, fake news might emerge under conditions marked by a combination of uncertainty, involvement, anxiety, and credulity [23]. The difference in circulation between false and real news may originate in human beings and their interactions rather than in the information itself. As it has been stated that bad is always more influential than good [27], an unconscious 'negativity bias' might result in late broadcasting of fake news and thereby essentially differentiate its spread from that of real news. Disclosing the factors that drive ordinary people to re-post fake news, especially through weak ties, is thus a promising research direction. Additionally, it is also valuable to study the most important or the weakest points in the propagation network considering bursts of re-postings, and whether attacking these weak points could stop the propagation.

Methods
Data preprocessing. We analyze 1701 fake news networks of Weibo (with 973,391 users) and 492 real news networks of Weibo (with 347,401 users), as well as 27 fake news networks of Twitter (with 105,335 users) and 28 real news networks of Twitter (with 133,109 users).

Weibo data
The re-posting traces of fake news and real news are thoroughly collected through Weibo's open APIs [28]. The fake news comes from an official Weibo web page for fake news reporting. All messages announced by this web page are confirmed as fake news [29].
The real news is obtained from reliable sources. Most but not all the creators of the real news are official accounts, for example, government accounts and online newspaper accounts. We also manually select 51 real news networks whose creators are not official accounts.
In order to create the network, in which nodes are Weibo users and links are re-postings, we first mine the following data for both fake and real news:
(a) Users: the unique serial numbers of the users who participate in the same network. We also mark the node of the network creator.
(b) Re-postings: the unique serial number of each directed re-posting activity, and the serial numbers of the source user and the re-posting user of this re-posting.
In this paper, only networks with more than 200 users are studied, because larger networks are more influential. More details are shown in Table 1.
Due to privacy issues, we agreed not to distribute the Twitter and Weibo data publicly, in accordance with the limitations of the collection licenses. As a result, we cannot share the data used in this paper publicly.

Twitter data
Twitter data were collected from Japanese tweets posted from 2011/3/11 to 2011/3/17, the period of the Great East Japan Earthquake. During this period, many fake news items propagated on Japanese Twitter. For fake news, we first gathered 57 topics listed on a website [30] that summarizes fake news during the earthquake. The contents included tweets without evidence and malicious tweets, such as claims about the starvation of babies and elderly people, that someone trapped under a server rack needed help, and that the Japanese prime minister was having a luxurious supper during the disaster. When collecting tweets, we combined a few keywords related to the contents of each fake news item. These keywords were proper nouns, such as place names and personal names. After that, we excluded correction tweets, whose contents contradict the fake news and include keywords such as "false" and "mistake".
For real news, we gathered 71 topics by combining keywords (proper nouns, such as place names and personal names), as with the fake news. Most of the tweets we collected originated from official accounts with verified Twitter badges, such as government agencies, major newspapers and famous people. The contents included tweets about earthquake information, traffic information, donation information and so on. In addition, we also collected five topics that originated from civilians without badges and were widely retweeted. The contents of these tweets were small tips for coping with the disaster.
After gathering fake and real news tweets on a keyword basis, we focused on topics with more than 200 tweets to create retweet networks. We created retweet networks by using the mention symbol "@" in the tweets as a trigger. Because a Twitter account name (username) cannot be longer than 15 characters and can only contain alphanumeric characters (letters A-Z, numbers 0-9) and underscores [31], we extracted the strings of 2 to 12 characters following "@" as account names.
Note that we skipped account names of one character because they rarely appear in the Japanese Twitter space, and a single character after "@" often appears as part of an emoticon such as "@ _ @" (a surprised facial expression). Furthermore, since account names are not case-sensitive, all account names were converted to lowercase before processing.
If there are multiple "@" symbols in one tweet, we extracted as many account names as possible according to the above rules and linked them in order from the beginning of the sentence to create the networks. In principle, one tweet refers to only one account that retweeted previously. However, tweets were often deleted afterwards, in particular for fake news, and a network built from only one account per tweet would be fragmented. Therefore, by going back to clues in the remaining tweets, we created the network by extracting as many accounts as possible from one tweet.
We created networks with account names as nodes and connections at the mention symbol "@" as links, following the above rules. We extracted the largest connected component (LCC), ignoring link directions, and analyzed only networks whose LCC contains more than 200 nodes. The account with the oldest tweet time in the LCC was treated as the creator.
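The extraction rules above can be sketched as follows. The regular expression, the function name `mention_links`, the choice to skip (rather than truncate) names longer than 12 characters, and the direction assigned to each link are all assumptions made for this illustration; the text only specifies that names of 2 to 12 characters after "@" are extracted, lowercased, and linked in order of appearance.

```python
import re

# Assumed pattern: 2 to 12 letters, digits or underscores directly after "@".
# The negative lookahead skips longer names instead of truncating them.
MENTION = re.compile(r"@([A-Za-z0-9_]{2,12})(?![A-Za-z0-9_])")


def mention_links(author, text):
    """Return consecutive (re-poster, source) pairs implied by one tweet.

    Assumption: the author re-posted from the first mentioned account,
    which in turn re-posted from the second one, and so on.
    """
    chain = [author.lower()] + [name.lower() for name in MENTION.findall(text)]
    return list(zip(chain, chain[1:]))


links = mention_links("Carol", "RT @Bob_1: RT @alice: help needed @_@")
```

Here the emoticon "@_@" produces no account name because only one valid character follows its first "@", and a tweet mentioning only a one-character name yields no links at all.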

Network model establishing.
Based on the information described above, we establish a directed network as shown in Fig. 1A. The users are the nodes of the network and the re-postings are the edges. We mark the network creator in particular. Each edge has a direction, which is either from the creator to a re-poster or from an earlier re-poster to a later re-poster. After modeling, we plot typical networks for both fake and real news of Weibo and Twitter, as shown in Fig. 1B to 1E, using the software Origin.
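A minimal sketch of such a directed propagation network, assuming the re-posting records are available as (source user, re-posting user) pairs, is given below. The function names and the toy data are hypothetical; the second helper checks whether the creator is also the node of largest degree, the property discussed for Fig. 1.

```python
from collections import defaultdict


def build_propagation_network(reposts):
    """Build successor lists and total degrees from (source, re-poster) pairs."""
    successors = defaultdict(list)
    degree = defaultdict(int)
    for source, reposter in reposts:
        successors[source].append(reposter)  # edge points from source to re-poster
        degree[source] += 1
        degree[reposter] += 1
    return dict(successors), dict(degree)


def creator_is_main_hub(degree, creator):
    """True if no other node has a larger degree than the creator."""
    return degree.get(creator, 0) == max(degree.values())


# Toy example: "a" creates the message, but "b" attracts most re-postings.
toy = [("a", "b"), ("a", "c"), ("b", "d"), ("b", "e"), ("b", "f"), ("b", "g")]
successors, degree = build_propagation_network(toy)
```

In the toy data the later re-poster "b" acts as the broadcaster, so `creator_is_main_hub(degree, "a")` is false.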
Ratio of layer sizes. The layer is a property of edges that gives the number of re-posting steps (for example, one, two, three and so on) from the creator. The ratio of layer sizes is a measure calculated for each network separately and it is defined as:

$$ r = \frac{n_2}{n_1} $$

where $n_1$ and $n_2$ are the sizes of the first and second layers of a given network, respectively.
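The layer sizes and their ratio can be computed with a breadth-first traversal from the creator, as in the sketch below. It assumes re-postings are given as (source user, re-posting user) pairs and counts each re-posting edge in the layer equal to its number of hops from the creator; the function names and toy data are hypothetical.

```python
from collections import Counter, deque


def layer_sizes(reposts, creator):
    """Count re-posting edges per layer (hops from the creator)."""
    successors = {}
    for source, reposter in reposts:
        successors.setdefault(source, []).append(reposter)
    sizes = Counter()
    seen = {creator}
    queue = deque([(creator, 0)])
    while queue:
        user, depth = queue.popleft()
        for child in successors.get(user, []):
            sizes[depth + 1] += 1  # this edge belongs to layer depth + 1
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return sizes


def layer_ratio(reposts, creator):
    """Size of the second layer divided by the size of the first layer."""
    sizes = layer_sizes(reposts, creator)
    if sizes[1] == 0:
        raise ValueError("the creator has no direct re-postings")
    return sizes[2] / sizes[1]


toy = [("a", "b"), ("a", "c"), ("b", "d"), ("b", "e"), ("b", "f"), ("d", "g")]
```

For the toy network the first layer has two re-postings and the second has three, so the ratio exceeds one, the pattern reported above for fake news.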

Distances.
In order to measure the distances, for each network we first calculate the distances between all pairs of nodes in the network, and next we plot their distribution on a logarithmic scale (y axis). The distribution can be approximated by an exponential function. We consider the linear part of the curves, where the x value (distance) is above one and the logarithm of the y value depends approximately linearly on the x value. We calculate the characteristic distance according to the following formula:

$$ P(d) \propto e^{-d/a} $$

$a$: The characteristic distance.
$d$: The distance between pairs of nodes.
$P(d)$: The probability of a certain distance.
The distance is influenced by the number of nodes in a network. Since the typical distance in a network increases approximately as the logarithm of N in many typical networks, for example random networks, we calculate the normalized characteristic distance as:

$$ \tilde{a} = \frac{a}{\log N} $$

$\tilde{a}$: The normalized characteristic distance.
$a$: The characteristic distance.
$N$: The number of nodes in the network.

Heterogeneity measure. The heterogeneity is a measure that was calculated for each network separately, following the definition in [32], where N is the number of nodes in the network.

Probability of fake news. Here we use the ratio of layer sizes as an example to explain this. We divide the range of the ratio of layer sizes into portions. In portion i, the probability of fake news is:

$$ P_i = \frac{p^{\mathrm{fake}}_i}{p^{\mathrm{fake}}_i + p^{\mathrm{real}}_i} $$

$p^{\mathrm{fake}}_i$: The probability of fake news in portion i (the number of fake news in this portion divided by the total number of fake news).
$p^{\mathrm{real}}_i$: The probability of real news in portion i.
Given a certain ratio of layer sizes, we thus know the probability that a network is fake news from the ratios known for different networks.
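The per-portion probability can be illustrated with a short sketch. It assumes that the probability of fake news in a portion is the share of fake networks falling in that portion divided by the sum of the fake and real shares, and that portions are half-open intervals given by explicit boundaries; the function name, the boundaries and the toy values are hypothetical.

```python
def fake_probability_by_portion(fake_values, real_values, edges):
    """Probability of fake news in each portion [edges[i], edges[i+1])."""

    def shares(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(edges) - 1):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        # Share of this news type that falls into each portion.
        return [c / len(values) for c in counts]

    fake_share = shares(fake_values)
    real_share = shares(real_values)
    # Portions containing no network at all get None instead of a probability.
    return [
        f / (f + r) if f + r > 0 else None
        for f, r in zip(fake_share, real_share)
    ]


# Toy ratios of layer sizes for four fake and four real networks.
fake_ratios = [0.1, 0.2, 1.5, 2.0]
real_ratios = [0.1, 0.15, 0.2, 1.2]
probs = fake_probability_by_portion(fake_ratios, real_ratios, [0, 1, 3])
```

Because each type is normalized by its own total, the result does not depend on how many fake and real networks were collected.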

Accuracy of predictability.
Considering the small number of propagation networks in Twitter, the study of predictability is performed only on the Weibo networks. When we distinguish fake news from real news using different network measures, such as the ratio of layer sizes or the characteristic distance, it is important to know the prediction accuracy. Here we use the ratio of layer sizes as an example to explain the weighted accuracy of predictability.
First, we rank the Weibo networks by their ratio of layer sizes, ignoring their types (fake or real). Second, we randomly split these propagation networks into n portions that contain the same number of networks. Finally, we calculate the weighted accuracy over these portions, where n is the number of portions into which the networks are divided.

As for the long time limit, for example, in Fig. 2A, the fraction of nodes located in the first layer is around 45% of all the nodes at the end of propagation. However, in Fig. 2B, the nodes in the first layer account for about 78% of all the nodes.

Figure 2. Different layer sizes grow with time in typical networks. The y axis is the cumulative number of re-postings at different layers of typical networks in Weibo and Twitter. The x axis is the time (in hours) since the news was created, and the different colors stand for different layers. Shown examples are (A) fake news and (B) real news in Weibo, as well as (C) fake news and (D) real news in Twitter. In Fig. 2A, if the total number of nodes does not change much after 20 hours, we ignore the re-postings after 20 hours in order to clearly see the layers in the figure. These four typical networks are the same networks shown in Fig. 1. It is seen that the layer sizes of real news and fake news are significantly different in both Weibo and Twitter already after a few hours. Real news networks tend to have a relatively larger first layer, while fake news networks are relatively uniformly distributed across different layers.

Figure 3. Ratio of layer sizes differentiates fake news from real news. The distribution of the ratio of layer sizes and its development within a period of time can differentiate fake news from real news. These differences appear already after several hours. (A) The PDF of all re-postings in the first five layers considering all of the Weibo networks. One can see that more real news re-postings appear in the first layer, and more fake news re-postings appear in later layers. The Welch Two Sample t-test

Figure 4. Distances between all pairs of nodes differentiate fake news from real news. Real news has relatively smaller characteristic distances in both Weibo and Twitter. (A) The PDF of distances of three typical examples of networks of similar sizes for both fake and real news of Weibo. We plot six curves of distance distribution for six typical networks in Weibo. These six curves separate into two groups. The three curves of real news are on the left, while the three curves of fake news are on the right. (B) The PDF of distances between pairs of nodes for all real and fake news networks of Weibo. The characteristic distance (a, the inverse of the slope in the semi-log plot) of fake news is 4.26 (from distance two to twenty-five), and that of real news is 2.59 (from distance two to ten). (C) The PDF of pair distances of all real and fake news networks of Twitter. Here also, real news has relatively shorter characteristic distances. The characteristic distance of fake news is 3.66 (from distance two to twenty), and that of real news is 1.61 (from

Figure 5. Comparing the heterogeneity measure for fake and real news in Weibo. (A) The x axis is the number of users in a Weibo network, and the y axis is the heterogeneity measure. The black line is the theoretical line of the star layout. The h is the difference in heterogeneity value between a real network and the theoretical value of the star layout. (B) The scatter plot of Twitter. (C) Distribution of h within five hours from the first re-posting. The p-value here is

Figure 6. The heterogeneity measure shows predictability of fake news long before it is widely spread. Among the different structural properties that we analyzed, we find that the heterogeneity measure is able to identify fake news with an accuracy of about 78.7%. (A) Predicting Weibo networks' type according to h after five hours. The heterogeneity measure of the networks enables a high predictability of fake news. The black line in the boundary is the probability that a Weibo network is fake news. The three vertical lines divide the figure into four parts with an equal number of networks. For example, the area on the left of the "25%" line has 25% of the Weibo networks, both fake and real news. (B) Predicting Weibo networks' type using all re-postings in different networks by h. (C) The weighted average accuracy is 0.764 within five hours and 0.787 for the whole lifetime. Additionally, in some situations, the accuracy is very high although we only consider one structural property. For example, for the re-postings within 5 hours, when we only consider networks whose h is more than 0.2, the accuracy is 0.863 for 30% of the networks; and for all re-postings, when we only consider networks whose h is more than 0.2, the accuracy is 0.850 for 46% of the networks.