A multi-layer approach to disinformation detection in US and Italian news spreading on Twitter

We tackle the problem of classifying news articles pertaining to disinformation vs mainstream news by solely inspecting their diffusion mechanisms on Twitter. This approach is inherently simple compared to existing text-based approaches, as it allows us to bypass the multiple levels of complexity found in news content (e.g. grammar, syntax, style). We employ a multi-layer representation of Twitter diffusion networks in which each layer describes one single type of interaction (tweet, retweet, mention, etc.), and we quantify the advantage of separating the layers with respect to an aggregated approach and assess the impact of each layer on the classification. Experimental results with two large-scale datasets, corresponding to diffusion cascades of news shared respectively in the United States and Italy, show that a simple Logistic Regression model is able to classify disinformation vs mainstream networks with high accuracy (AUROC up to 94%). We also highlight differences in the sharing patterns of the two news domains which appear to be common to the two countries. We believe that our network-based approach provides useful insights which pave the way to the future development of a system to detect misleading and harmful information spreading on social media.


Introduction
In recent years there has been increasing interest in the issue of disinformation spreading on online social media. Global concern over false (or "fake") news as a threat to modern democracies has been raised frequently, ever since the 2016 US Presidential elections, in connection with events of political relevance, where the proliferation of manipulated and low-credibility content attempts to drive and influence people's opinions [1][2][3][4].
Researchers have highlighted several drivers for the diffusion of this malicious phenomenon, including human factors (confirmation bias [5], naive realism [6]), algorithmic biases (the filter bubble effect [1]), the presence of deceptive agents on social platforms (bots and trolls [7]) and, lastly, the formation of echo chambers [8], where people polarize their opinions as they are insulated from contrary perspectives.
The problem of automatically detecting online disinformation news has been typically formulated as a binary classification task (i.e. credible vs non-credible articles), and tackled with a variety of different techniques, based on traditional machine learning and/or deep learning, which mainly differ in the dataset and the features they employ to perform the classification. We may distinguish three approaches: those built on content-based features, those based on features extracted from the social context, and those which combine both aspects. A few main challenges hinder the task, namely the impossibility to manually verify all news items, the lack of gold-standard datasets and the adversarial setting in which malicious content is created [4,7,9].
In this work we follow the direction pointed out in a few recent contributions on the diffusion of disinformation compared to traditional and objective information. These contributions have shown that false news spread faster and deeper than true news [10], and that social bots and echo chambers play an important role in the diffusion of malicious content [7,8]. Therefore, we focus on the analysis of spreading patterns which naturally arise on social platforms as a consequence of multiple interactions between users, due to the increasing trend in online sharing of news [1].
We propose a classification framework based on a multi-layer formulation of Twitter diffusion networks, extending the results of our previous work [11]. For each article we disentangle different social interactions on Twitter, namely tweets, retweets, mentions, replies and quotes, to accordingly build a diffusion network composed of multiple layers (one for each type of interaction), and we compute structural features separately for each layer. We pick a set of global network properties from the network science toolbox which can be qualitatively explained in terms of social dimensions and allow us to encode different networks with a tuple of features. These include traditional indicators, e.g. network density, number of strong/weak connected components, and diameter, and more elaborate ones such as the main K-core number [12] and structural virality [13]. Our work is driven by the following research questions:
• RQ1: Does a multi-layer representation of Twitter diffusion networks yield a significant advantage in terms of classification accuracy over a conventional single-layer diffusion network?
• RQ2: Which of the above features, and which layer, are the most effective in the classification task?
We perform classification experiments with an off-the-shelf Logistic Regression model on two different datasets of mainstream and disinformation news shared on Twitter, respectively, in the United States and in Italy during 2019. In the former case we also perform multiple disaggregated tests to control for political biases inherent to different news sources, referring to the procedure proposed in [3] to label different outlets. Overall we show that we are able to classify credible vs non-credible diffusion networks (and consequently news articles) with high accuracy (AUROC up to 94%), also when controlling for the political bias of sources (and training only on left-biased or right-biased articles).
We observe that the layer of mentions alone conveys useful information for the classification, denoting a different usage of this functionality when sharing news belonging to the two news domains. We also show that the most discriminative features, which are relative to the breadth and depth of the largest cascades in different layers, are the same across the two countries.
As our datasets are collected in different countries, we also investigate whether disinformation can be detected independently from the country where it originates. Cross-country experiments show that our methodology is not able to distinguish reliable vs non-reliable news independently of the country of origin. We argue that this might be due either to the high imbalance of the data or to class discrepancies which are country specific. It emerges that a classifier based on our methodology should be trained in a country-wise fashion.
The outline of this paper is the following: we first provide a description of related literature; next, we describe our methodology acknowledging its intrinsic limitations; then, we provide experimental results and finally we draw conclusions and future directions.

Related literature
A deep learning framework for detecting fake news cascades is proposed in [14], where the authors refer to [10] in order to collect Twitter cascades pertaining to verified false and true rumors. They employ geometric deep learning, a novel paradigm for graph-based structures, to classify cascades based on four categories of features, such as user profile, user activity, network and spreading, and content. They also observe that a few hours of propagation are sufficient to distinguish false news from true news with high accuracy.
Diffusion cascades on Weibo and Twitter are analyzed in [15], where the authors focus on highlighting different topological properties, such as the number of hops from the source or the heterogeneity of the network, to show that fake news shape diffusion networks which are highly different from credible news, even at early stages of propagation.
In this work, we consider the results of [11] as our baseline. In that paper, we classified US news articles leveraging Twitter diffusion networks. Here we further build on that approach by exploiting a disaggregated multi-layer representation of Twitter social interactions, in order to take advantage of a larger and more specific set of features. The methodology used in [11] has several analogies with [16], where authors successfully detect Twitter astroturfing content, i.e. political campaigns disguised as spontaneous grassroots, with a machine learning framework based on network features.

Disinformation and mainstream news
In this work we formulate our classification problem as follows: given two classes of news articles, respectively D (disinformation) and M (mainstream), a set of news articles A_i with associated class labels C_i ∈ {D, M}, and, for each article A_i, a set of tweets T_i = {T_i^1, T_i^2, . . .}, each containing a Uniform Resource Locator (URL) pointing explicitly to article A_i, predict the class C_i of each article A_i.
There is huge debate and controversy on a proper taxonomy of malicious and deceptive information [2-4, 11, 17-19]. In this work we prefer the term disinformation to the more specific fake news to refer to a variety of misleading and harmful information. Therefore, we follow a source-based approach, a consolidated strategy also adopted by [2,3,7,18], in order to obtain relevant data for our analysis. We collected:
1. Disinformation articles, published by websites which are known for producing low-credibility content, false and misleading news reports as well as extreme propaganda and hoaxes, and flagged as such by reputable journalists and fact-checkers;
2. Mainstream articles, published by traditional news outlets which are widely regarded as trusted sources of factual and credible information.

US dataset
We collected tweets associated with a dozen US mainstream news websites, i.e. the most trusted sources described in [20], using the Streaming API, and we referred to the Hoaxy API [18] for tweets containing links to 100+ US disinformation outlets. We filtered out articles associated with fewer than 50 tweets to reduce noisy observations. The resulting dataset contains overall ∼1.7 million tweets for mainstream news, collected in a period of three weeks (February 25th, 2019-March 18th, 2019) and associated with 6978 news articles, and ∼1.6 million tweets for disinformation, collected in a period of three months (January 1st, 2019-March 18th, 2019) for the sake of balance between the two classes, and associated with 5775 distinct articles. Diffusion censoring effects [13] were correctly taken into account in both collection procedures. We provide in Fig. 1 the distribution of articles by source and political bias for both news domains.
As it is reported that conservatives and liberals exhibit different behaviors on online social platforms [21-23], we further assigned a political bias label to different US outlets (and therefore news articles) following the procedure described in [3]. In order to assess the robustness of our method, we performed classification experiments by training only on left-biased (or right-biased) outlets of both disinformation and mainstream domains and testing on the entire set of sources. As additional test, we excluded particular sources that outweigh the others in terms of samples to avoid over-fitting.

Italian dataset
For the Italian scenario, we first collected tweets with the Streaming API in a 3-week period (April 19th, 2019-May 5th, 2019), filtering those containing URLs pointing to Italian official newspaper websites as described in [24,25]; these correspond to the list provided by the association for the verification of newspaper circulation in Italy (Accertamenti Diffusione Stampa). We instead referred to the dataset provided by [26] to obtain a set of tweets, collected continuously since January 2019 using the same Twitter endpoint, which contain URLs to 60+ Italian disinformation websites (the list is available in Additional file 1); these were obtained by using black-lists from Italian fact-checking websites and agencies (PagellaPolitica.it, Bufale.net and Butac.it). In order to get balanced classes, we retained data collected in a longer period w.r.t. mainstream news (April 5th, 2019-May 5th, 2019). In both cases we filtered out articles with fewer than 50 tweets; overall this dataset contains ∼160k mainstream tweets, corresponding to 227 news articles, and ∼100k disinformation tweets, corresponding to 237 news articles. We provide in Fig. 2 the distribution of articles according to distinct sources for both news domains. As in the US dataset, we took censoring effects [13] into account by excluding tweets published before (left-censoring) or more than two weeks after (right-censoring) the beginning of the collection process.
The different volumes of news shared on Twitter in the two countries are due both to the different population sizes of the US and Italy (320 vs 60 million) and to the different usage of the Twitter platform (and social media in general) for news consumption [27]. Both datasets analyzed in this work are available from the authors on request. We provide in Table 1 the breakdown of our datasets for what concerns the cardinalities of different Twitter interactions across news domains. We notice that news sharing mostly involves retweets and tweets in both countries and for both classes of news articles.

Breakdown of Twitter interactions
Among the different Twitter actions, users primarily interact with each other using retweets and mentions [22]. The former are the main engagement activity and act as a form of endorsement, allowing users to rebroadcast content generated by other users [28]. Besides, when node B retweets node A we have an implicit confirmation that information from A appeared in B's Twitter feed [16]. Quotes are simply a special case of retweets with comments. Mentions usually include personal conversations as they allow someone to address a specific user or to refer to an individual in the third person; in the first case they are located at the beginning of a tweet and are known as replies, otherwise they are placed in the body of a tweet [22]. The network of mentions is usually seen as a stronger version of interactions between Twitter users, compared to the traditional graph of follower/following relationships [29].

Building diffusion networks
Using the notation described in [30] we employ a multi-layer representation for Twitter diffusion networks. Sociologists have indeed recognized decades ago that it is crucial to study social systems by constructing multiple social networks where different types of ties among the same individuals are used [31]. Therefore, for each news article we build a multi-layer diffusion network composed of four different layers, one for each type of social interaction on Twitter platform, namely retweet (RT), reply (R), quote (Q) and mention (M), as shown in Fig. 3. These networks are not necessarily node-aligned, i.e. users might be missing in some layers. We do not insert "dummy" nodes to represent users not active in a given layer as it would have severe impact on the global network properties (e.g. number of weakly connected components). Alternatively one may look at each multi-layer diffusion network as an ensemble of individual graphs [30]; since global network properties are computed separately for each layer, they are not affected by the presence of inter-layer edges, which nonetheless allow the diffusion of information across layers.
In our multi-layer representation, each layer is a directed graph where we add edges and nodes for each tweet of the layer type. While the direction of information flow (and thus the edge direction) is unambiguous for some layers, e.g. RT, the same is not true for others. Here we follow the conventional approach described e.g. in [7,16,19,22] to define the direction of edges. For the RT layer: whenever user a retweets account b we first add nodes a and b if not already present in the RT layer, then we build an edge that goes from b to a if it does not exist. Similarly for the other layers: for the R layer edges go from user a (who replies) to user b, for the Q layer edges go from user b (who is quoted by) to user a and for the M layer edges go from user a (who mentions) to user b. Note that, by construction, our layers do not include isolated nodes; these would correspond to "isolated tweets", i.e. tweets which have not originated any interaction with other users. Such tweets are nevertheless present in our dataset, and their number is exploited for classification, as described below.
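The edge-direction rules above can be sketched as follows with networkx; this is a minimal illustration, not the authors' actual implementation, and the tweet-record format and field names used here are hypothetical simplifications of the real Twitter API payloads.

```python
import networkx as nx

LAYERS = ("RT", "R", "Q", "M")

def build_layers(tweets):
    """Build one directed graph per interaction type for a single article."""
    layers = {name: nx.DiGraph() for name in LAYERS}
    for t in tweets:
        a = t["author"]
        if t["type"] == "retweet":       # edge from retweeted user b to retweeter a
            layers["RT"].add_edge(t["retweeted_user"], a)
        elif t["type"] == "reply":       # edge from replier a to addressee b
            layers["R"].add_edge(a, t["replied_user"])
        elif t["type"] == "quote":       # edge from quoted user b to quoter a
            layers["Q"].add_edge(t["quoted_user"], a)
        for b in t.get("mentions", []):  # edge from mentioner a to mentioned b
            layers["M"].add_edge(a, b)
    return layers
```

Since `add_edge` also adds missing endpoints and ignores duplicate edges, each layer ends up with no isolated nodes and at most one edge per ordered user pair, matching the construction described above.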

Global network properties
We used a set of global network indicators which encode each network layer by a tuple of features. We then simply concatenated the tuples so as to represent each multi-layer network with a single feature vector. We used the following global network properties:
1. Number of Strongly Connected Components (SCC): a strongly connected component of a directed graph is a maximal sub-graph in which a directed path exists between every pair of nodes.
2. Size of the Largest Strongly Connected Component (LSCC): the number of nodes in the largest strongly connected component of the graph.
3. Number of Weakly Connected Components (WCC): a weakly connected component is a maximal sub-graph in which a path exists between every pair of nodes when edge directions are ignored.
4. Size of the Largest Weakly Connected Component (LWCC): the number of nodes in the largest weakly connected component of the graph.
5. Diameter of the Largest Weakly Connected Component (DWCC): the length of the longest shortest path in the undirected equivalent of the largest weakly connected component.
6. Average Clustering Coefficient (CC): the mean of the local clustering coefficients of all nodes in a graph; the local clustering coefficient of a node quantifies how close its neighbourhood is to being a complete graph (or a clique). It is computed according to [32].
7. Main K-core Number (KC): a K-core [12] of a graph is a maximal sub-graph that contains nodes of internal degree K or more; the main K-core number is the highest value of K (in directed graphs the total degree is considered).
8. Density (d): the density of a directed graph is d = |E| / (|V|(|V|-1)), where |E| is the number of edges and |V| is the number of vertices in the graph; the density equals 0 for a graph without edges and 1 for a complete graph.
9. Structural Virality of the largest weakly connected component (SV): this measure is defined in [13] as the average distance between all pairs of nodes in a cascade tree or, equivalently, as the average depth of nodes, averaged over all nodes in turn acting as a root; for |V| > 1 vertices, SV = (1/(|V|(|V|-1))) Σ_{i≠j} d_ij, where d_ij denotes the length of the shortest path between nodes i and j. This is equivalent to computing the Wiener index [33] of the graph and multiplying it by a factor 1/(|V|(|V|-1)). In our case we computed it for the undirected equivalent graph of the largest weakly connected component, setting it to 0 whenever |V| = 1.
We used the networkx Python package [34] to compute all features. Whenever a layer is empty, we simply set all its features to 0. In addition to computing the above nine features for each layer, we added two indicators encoding information about isolated tweets, namely the number T of isolated tweets (containing URLs to a given news article) and the number U of unique users authoring those tweets.
Therefore, a diffusion network for a given article is represented by a vector with 9 · 4 + 2 = 38 entries.
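The nine per-layer properties can be computed along these lines with networkx; this is a hedged sketch rather than the authors' exact code, and it assumes each layer is given as a networkx DiGraph. The clustering coefficient is computed here on the undirected equivalent graph, and SV is obtained as the average shortest-path length of the largest weakly connected component.

```python
import networkx as nx

def layer_features(g):
    """Return the nine global properties of one layer (a directed graph)."""
    if g.number_of_nodes() == 0:               # empty layer: all features set to 0
        return [0.0] * 9
    n = g.number_of_nodes()
    scc = nx.number_strongly_connected_components(g)
    lscc = max(len(c) for c in nx.strongly_connected_components(g))
    wcc = nx.number_weakly_connected_components(g)
    largest = g.subgraph(max(nx.weakly_connected_components(g), key=len))
    lwcc = largest.number_of_nodes()
    und = largest.to_undirected()              # undirected equivalent of largest WCC
    dwcc = nx.diameter(und)                    # depth of the largest cascade
    cc = nx.average_clustering(g.to_undirected())
    kc = max(nx.core_number(g).values())       # main K-core number
    d = g.number_of_edges() / (n * (n - 1)) if n > 1 else 0.0
    sv = nx.average_shortest_path_length(und) if lwcc > 1 else 0.0
    return [scc, lscc, wcc, lwcc, dwcc, cc, kc, d, sv]
```

Concatenating the four layer tuples and appending the isolated-tweet counts T and U yields the 38-entry vector described above.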

Interpretation of network features and layers
The aforementioned network properties can be qualitatively explained in terms of social footprints as follows (see the illustrative examples in Fig. 4): in this specific class of networks, SCC correlates with the size (i.e. number of nodes) of the diffusion layer, as the propagation of news occurs in a broadcast manner in most cases, i.e. re-tweets dominate over other interactions, while LSCC allows us to distinguish cases where such mono-directionality is somehow broken. WCC equals (approximately) the number of distinct diffusion cascades pertaining to each news article, with exceptions corresponding to those cases where some cascades merge together via Twitter interactions such as mentions, quotes and replies; accordingly, LWCC and DWCC equal the size and the depth of the largest cascade. CC corresponds to the level of connectedness of neighboring users in a given diffusion network, whereas KC identifies the set of most influential users in a network [19]. Finally, d describes the proportion of potential connections between users which are actually activated, and SV indicates whether a news item has gained popularity with a single large broadcast or in a more viral fashion through multiple generations [13].

Limitations
As mentioned beforehand, we use a coarse approach to label articles at the source level relying on a huge corpus of literature on the subject. We believe that this is currently the most reliable classification approach, although it entails obvious limitations, as disinformation outlets may also publish true stories and likewise misinformation is sometimes reported on mainstream media [4]. Also, given the choice of news sources, we cannot test whether our methodology is able to classify disinformation vs factual but not mainstream news which are published on niche, non-disinformation outlets [11].
Another crucial aspect of our approach is the capability to fully capture the sharing cascades on Twitter associated with news articles. It has been reported [35] that the Twitter streaming endpoint filters out tweets matching a given query if they exceed 1% of the global daily volume of shared tweets, which nowadays is approximately 5·10^8; however, as we always collected fewer than 10^6 tweets per day, we did not incur this issue and thus gathered 100% of the tweets matching our query.
We built Twitter diffusion networks using an approach widely adopted in the literature [3,7,19]. We remark that there is an unavoidable limitation of the Twitter Streaming API, which does not allow us to retrieve true re-tweeting cascades because re-tweets always point to the original source and not to intermediate re-tweeting users [10,13]; thus we adopt the only viable approach based on Twitter's publicly available data. However, by disentangling different interactions with multiple layers we potentially reduce the impact of this limitation on the global network properties compared to the approach used in our baseline.
Finally, a limitation of the present work is the lack of a direct comparison of our methodology with other techniques, an exercise which boils down to assessing several classification metrics on the same dataset(s). As thoroughly discussed in [36], the problem of reliably comparing fake-news classifiers is open and faces many challenges that go beyond the scope of this work. We just mention that the performance of our classification framework is quantitatively comparable (in terms of AUROC value) to that of state-of-the-art deep learning models for fake news detection [14,37]. However, this result is only indicative, because it was obtained on different datasets and, in one case [37], with a different focus of the classification task.

Setup
We performed classification experiments using a basic off-the-shelf classifier, namely Logistic Regression (LR) with L2 penalty; this also allows us to compare results with our baseline [11]. We applied a standardization of the features and used the default parameter configuration as described in the scikit-learn package [38]. We also tested other classifiers (such as K-Nearest Neighbors, Support Vector Machines and Random Forest) but we omit their results as they give comparable performance. We remark that our goal is to show that a very simple machine learning framework, with no parameter tuning or optimization, allows for accurate results with our network-based approach. To assess the performance of different classifiers we used standard evaluation metrics, chiefly the Area Under the Receiver Operating Characteristic curve (AUROC); the ROC curve [39], which plots the TP rate versus the FP rate, shows the ability of a classifier to discriminate positive samples from negative ones as its threshold is varied. The AUROC value lies in the range [0, 1], with the random baseline classifier holding AUROC = 0.5 and the ideal perfect classifier AUROC = 1; thus larger AUROC values (and steeper ROCs) correspond to better classifiers. In particular, we computed the so-called macro average (the simple unweighted mean) of these metrics, evaluated considering both labels (disinformation and mainstream). We employed stratified shuffle split cross validation (with 10 folds) to evaluate performance.
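The evaluation setup above can be sketched as follows: a default scikit-learn Logistic Regression with L2 penalty, feature standardization, and 10-fold stratified shuffle-split cross-validation scored with AUROC. The feature matrix and labels here are random placeholders standing in for the 38-entry vectors and the D/M labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 38))         # placeholder 38-entry feature vectors
y = rng.integers(0, 2, size=200)       # placeholder labels (0 = M, 1 = D)

# Standardize features, then fit an L2-penalized LR with default parameters.
clf = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2"))
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
```

On random placeholder data the mean AUROC hovers around 0.5; on the real feature vectors the paper reports values up to 94%.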
Finally, we partitioned networks according to the total number of unique users involved in the sharing, i.e. the number of nodes in the aggregated network represented with a single-layer representation considering together all layers and also isolated tweets. A breakdown of both datasets according to size class (and political biases for the US scenario) is provided in Table 2 and Table 3.

Classification performance
In Table 4 we first provide classification performance on the US dataset for the LR classifier evaluated on the size classes described in Table 2. We can observe that in all instances our methodology performs much better than a random classifier (50% AUROC), with AUROC values above 85% in all cases.
For what concerns political biases, as the classes of mainstream and disinformation networks are not balanced (e.g., 1292 mainstream and 4149 disinformation networks with right bias), we employ a Balanced Random Forest with default parameters (as provided in the imblearn Python package [40]). In order to test the robustness of our methodology, we trained only on left-biased networks or right-biased networks and tested on the entire set of sources (relative to the US dataset); we provide a comparison of AUROC values for both biases in Fig. 5. We can notice that our multi-layer approach still yields significant results, thus showing that it can accurately distinguish mainstream news from disinformation regardless of the political bias. We further corroborated this result with additional classification experiments, which yield similar performance, in which we excluded from the training/test set two specific sources (one at a time and both at the same time) that outweigh the others in terms of data samples, respectively "breitbart.com" for right-biased sources and "politicususa.com" for left-biased ones [36].
We also performed classification experiments on the Italian dataset using the LR classifier and different size classes (notice that [1000, +∞) is empty for the Italian dataset); we show results for different evaluation metrics in Table 4. We can see that, despite the limited number of samples (one order of magnitude smaller than the US dataset), the performance is overall in accordance with the US scenario.
Table 4 Performance of the LR classifier (using a multi-layer approach) evaluated on different size classes on both the US (top rows) and the Italian (bottom rows) dataset
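The paper handles the left/right class imbalance with imblearn's Balanced Random Forest; a close stand-in using only scikit-learn is a random forest with balanced class weights, sketched here on placeholder data (the shapes and labels are illustrative, not the actual dataset).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 38))      # placeholder feature vectors
y_train = rng.integers(0, 2, size=300)    # imbalanced in the real setting
X_test = rng.normal(size=(100, 38))
y_test = rng.integers(0, 2, size=100)

# class_weight="balanced" reweights samples inversely to class frequency,
# mitigating the imbalance much like imblearn's BalancedRandomForestClassifier.
rf = RandomForestClassifier(class_weight="balanced", random_state=0)
rf.fit(X_train, y_train)
auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
```

The two estimators differ in mechanism (reweighting vs per-tree undersampling), so results may diverge slightly from those reported with imblearn.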
As shown in Table 5, we obtain results which are much better than our baseline in all size classes:
• In the US dataset our multi-layer methodology performs much better in all size classes except for large networks (the [1000, +∞) size class), reaching up to a 13% improvement on smaller networks (the [0, 100) size class);
Table 5 Comparison of performance of our multi-layer approach vs the baseline (single-layer). We show AUROC values for the LR classifier evaluated on different size classes of both US and IT datasets

Layer importance analysis
In order to understand the impact of each layer on the performance of classifiers, we performed additional experiments considering each layer separately (we ignored the T and U features relative to isolated tweets). In Table 6 we show metrics for each layer and all size classes, computed with a 10-fold stratified shuffle split cross validation, evaluated on the US dataset; in Fig. 6 we show AUROC values for each layer compared with the general multi-layer approach. We can notice that both the Q and M layers alone adequately capture most of the discrepancies between the two distinct news domains in the United States, as they obtain good results with AUROC values in the range 75%-86%; these are comparable with those of the multi-layer approach which, nevertheless, outperforms them across all size classes.
In the Italian dataset we observe that the M layer obtains comparable performance w.r.t. the multi-layer approach for small networks and for the dataset as a whole, whereas the RT layer performs better on large networks (see Table 7 and Fig. 7). We also notice higher standard deviations of the performance metrics, which are likely due to the limited size of the training/test data.

Feature importance analysis and cross-country experiments
We further investigated the importance of each feature by performing a χ² test, with 10-fold stratified shuffle split cross validation, considering the entire range of network sizes [0, +∞). We show the Top-5 most discriminative features for each country in Table 8. We find exactly the same set of features (with different rankings in the Top-3) in both countries; these correspond to two global network properties, LWCC, which indicates the size of the largest cascade in the layer, and SCC, which correlates with the size of the layer (ρ ≈ 0.99, with p ≈ 0 in all cases), associated to the same set of layers (Q, RT and M).
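The χ² feature ranking can be sketched as follows with scikit-learn: each (layer, property) feature is scored against the class labels and the Top-5 scores are kept. The data and the feature names below are illustrative placeholders, not the paper's actual values.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 8))      # chi2 requires non-negative features
y = rng.integers(0, 2, size=200)    # class labels (0 = M, 1 = D)
names = ["Q_LWCC", "RT_LWCC", "M_LWCC", "Q_SCC",
         "RT_SCC", "M_SCC", "M_DWCC", "M_CC"]

# chi2 returns one score (and p-value) per feature; rank descending.
scores, p_values = chi2(X, y)
top5 = [names[i] for i in np.argsort(scores)[::-1][:5]]
```

Note that the network features used in the paper (component counts, sizes, density, virality) are all non-negative, which is a prerequisite of the χ² test.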
We further performed a χ² test to highlight the most discriminative features in the M layer of both countries, which performed equally well in the classification task as previously highlighted; also in this case we focused on the entire range of network sizes [0, +∞). Interestingly, we discovered exactly the same set of Top-3 features in both countries, namely LWCC, SCC and DWCC (which indicates the depth of the largest cascade in the layer). We performed a Kolmogorov-Smirnov two-sample test to assess whether the distributions of these features are statistically equivalent across the two news domains; the hypothesis was rejected in all cases at α = 0.05.
The similarities evidenced so far in both countries, i.e. the classification performance of single layers and the feature importance, might suggest that the two news domains exhibit discrepancies which are geography-independent. We further investigated this hypothesis by testing the performance of both the LR and Balanced Random Forest classifiers in several cross-country settings, e.g. training on the US dataset and testing on the Italian one (and vice versa), performing feature normalization either over the entire data or separately for training and test sets, to investigate whether we can classify disinformation vs mainstream news regardless of the country where they originate. Interestingly, performance is in all cases no better than that of a random classifier (AUROC = 50%); this might be due to the high imbalance of data across the two countries or, more likely, it suggests that the sharing patterns of the two news domains exhibit coupled dissimilarities which are very country specific.
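The Kolmogorov-Smirnov two-sample test used above is available in scipy; a minimal sketch on synthetic samples follows, where the two arrays stand in for the values of one feature (e.g. LWCC) across the two news domains.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
disinfo = rng.exponential(scale=2.0, size=500)     # placeholder feature values
mainstream = rng.exponential(scale=1.0, size=500)  # placeholder feature values

# Null hypothesis: the two samples are drawn from the same distribution.
stat, p_value = ks_2samp(disinfo, mainstream)
reject = p_value < 0.05    # reject equivalence at alpha = 0.05
```

With clearly different underlying distributions, as in this synthetic example, the test rejects the null hypothesis, mirroring the outcome the paper reports for the Top-3 M-layer features.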

Conclusions
In this work we tackled the problem of the automatic classification of news articles in two domains, namely mainstream and disinformation news, with a language-independent approach based solely on the diffusion of news items on the Twitter social platform. We disentangled different types of interactions on Twitter to accordingly build a multi-layer representation of news diffusion networks, and we computed a set of global network properties, separately for each layer, in order to encode each network with a tuple of features. Our goal was to investigate whether a multi-layer representation performs better than an aggregated one [11], and to understand which of the features, observed at given layers, are most effective in the classification task.
Experiments with an off-the-shelf classifier such as Logistic Regression on datasets pertaining to two different media landscapes (US and Italy) yield very accurate classification results (AUROC up to 94%), also when controlling for the different political bias of news sources, which are far better than our baseline [11] with improvements up to 20%. Classification performance using individual layers shows that the layer of mentions alone entails better performance w.r.t. other layers in both countries, pointing in both cases to a peculiar usage of this type of Twitter interaction across the two domains.
We also highlighted the most discriminative features across different layers in both countries; we noticed the exact same set of features, suggesting, at first glance, that differences between the two news domains might be country-independent and rather due only to the typology of content shared. However, the two news domains exhibit coupled dissimilarities in sharing patterns which appear to be very country specific, and our methodology fails to detect disinformation regardless of where it originates.
Overall, our results show that the topological features of multi-layer diffusion networks can be effectively exploited to detect online disinformation. Notice that we do not deny the presence of deceptive efforts to orchestrate the regular spread of information on social media via content amplification and manipulation [41,42]. On the contrary, we postulate that such hidden forces might accentuate the discrepancies between the diffusion patterns of disinformation and mainstream news (and thus make our methodology effective).
In the future we aim to further investigate two main directions: (1) employ temporal networks to represent news diffusion and apply classification techniques (e.g. recurrent neural networks) that take into account the sequential aspect of the data; (2) leverage our network-based features in addition to state-of-the-art text-based approaches for "fake news" detection in order to deliver a real-world system to detect misleading and harmful information spreading on social media.