Early Detection of Promoted Campaigns on Social Media

Social media expose millions of users every day to information campaigns, some emerging organically from grassroots activity, others sustained by advertising or other coordinated efforts. These campaigns contribute to the shaping of collective opinions. While most information campaigns are benign, some may be deployed for nefarious purposes. It is therefore important to be able to detect whether a meme is being artificially promoted at the very moment it becomes wildly popular. This problem has important social implications and poses numerous technical challenges. As a first step, here we focus on discriminating between trending memes that are either organic or promoted by means of advertisement. The classification is not trivial: ads cause bursts of attention that can be easily mistaken for those of organic trends. We designed a machine learning framework to classify memes that have been labeled as trending on Twitter. After trending, we can rely on a large volume of activity data. Early detection, occurring immediately at trending time, is a more challenging problem due to the minimal volume of activity data that is available prior to trending. Our supervised learning framework exploits hundreds of time-varying features to capture changing network and diffusion patterns, content and sentiment information, timing signals, and user meta-data. We explore different methods for encoding feature time series. Using millions of tweets containing trending hashtags, we achieve 75% AUC score for early detection, increasing to above 95% after trending. We evaluate the robustness of the algorithms by introducing random temporal shifts on the trend time series. Feature selection analysis reveals that content cues provide consistently useful signals; user features are more informative for early detection, while network and timing features are more helpful once more data is available.


Introduction
An increasing number of people rely, at least in part, on information shared on social media to form opinions and make choices on issues related to lifestyle, politics, health, and product purchases [4,11,60]. Such reliance provides a variety of entities (from single users to corporations, interest groups, and governments) with motivation to influence collective opinions through active participation in online conversations. There are also obvious incentives for the adoption of covert methods that enhance both perceived and actual popularity of promoted information. Recently reported examples of abuse abound: astroturf in political campaigns, or attempts to spread fake news through social bots under the pretense of grassroots conversations [64,30,9]; pervasive spreading of unsubstantiated rumors and conspiracy theories [8]; orchestrated boosting of perceived consensus on relevant social issues performed by governments [67]; propaganda and recruitment by terrorist organizations, like ISIS [6,32]; and actions involving social media and stock market manipulation [73].
The situation is ripe with dangers as people are rarely equipped to recognize propaganda or promotional campaigns as such. It can be difficult to establish the origin of a piece of news, the reputation of its source, and the entity behind its promotion on social media, due both to the intrinsic mechanisms of sharing and to the high volume of information that competes for our attention. Even when the intentions of the promoter are benign, we easily interpret large (but possibly artificially enhanced) popularity as widespread endorsement of, or trust in, the promoted information.
There are at least three questions about information campaigns that present scientific challenges: what, how, and who. The first concerns the subtle notion of trustworthiness of information, ranging from verified facts [18], to rumors and exaggerated, biased, unverified or fabricated news [64,85,8]. The second considers the tools employed for the propaganda. Again, the spectrum is wide: from a known brand that openly promotes its products by targeting users that have shown interest, to the adoption of social bots, trolls and fake or manipulated accounts that pose as humans [74,30,26,20,37]. The third question relates to the (possibly concealed) entities behind the promotion efforts and the transparency of their goals. Even before these questions can be explored, one would need to be able to identify an information campaign in social media. But discriminating such campaigns from grassroots conversations poses both theoretical and practical challenges. Even the very definition of "campaign" is conceptually difficult, as it entangles the nature of the content (e.g., product or news), purpose of the source (e.g., deception, recruiting), strategies of dissemination (e.g., promotion or orchestration), different dynamics of user engagement (e.g., the aforementioned social bots), and so on. This paper takes a first step toward the development of computational methods for the early detection of information campaigns. In particular, we focus on trending memes and on a special case of promotion, namely advertisement, because they provide convenient operational definitions of social media campaigns. We formally define the task of discriminating between organic and promoted trending memes. Future efforts will aim at extending this framework to other types of information campaigns.

The challenge of identifying promoted content
On Twitter, it is common to observe hashtags (keywords preceded by the # sign that identify messages about a specific topic) enjoying sudden bursts in activity volume due to intense posting by many users with an interest in the topic [45,84,58]. Such hashtags are labeled as trending and are highlighted on the Twitter platform. Twitter algorithmically identifies trending topics in a predetermined set of geographical locations. Although Twitter recently included personalized and clustered trends, the ones in the collection analyzed here correspond to single hashtags selected on the basis of their popularity. Unfortunately, detailed knowledge about the algorithm and criteria used to identify organic trends is not publicly available [72]. Other hashtags are exposed prominently after the payment of a fee by parties that have an interest in enhancing their popularity. Such hashtags are called promoted and often enjoy subsequent bursts of popularity similar to those of trending hashtags, therefore being listed among trending topics. Of course, once Twitter labels a hashtag as trending, it is not necessary to detect whether or not it is promoted: this information is disclosed by Twitter. However, since it is difficult to manually annotate a sufficiently large dataset of campaigns, we use organic and promoted trending topics as a proxy for a broader set of campaigns, where promotion mechanisms may be hidden. Our data collection methodology provides us with a large source of reliable "ground truth" labels about promotion, which represent an ideal testbed to evaluate detection algorithms. These algorithms have to determine whether or not a hashtag is promoted based on information that would be available even in cases where the nature of a trend is unknown. We stress that our goal of distinguishing mechanisms for promoting popular content is different from that of predicting viral topics, an interesting area of research in its own right [79,15,16].
Discriminating between promoted and organically trending topics is not trivial, as Table 1 illustrates: promoted and organic trending hashtags often have similar characteristics. One might assume that promoted trends display volume patterns characteristic of exogenous influence, with sudden bursts of activity, whereas organic trends would conform to more gradual volume growth patterns typical of endogenous processes [68,59,45]. However, Fig. 1 shows that promoted and organic trends exhibit similar volume patterns over time. Furthermore, promoted hashtags may preexist the moment in which they are given the promoted status and may have originated in an entirely grassroots fashion. It is therefore conceivable for such hashtags to display features that are largely indistinguishable from those of other grassroots hashtags about the same topic, at least until the moment of promotion.
The analysis in this paper is motivated by the goal of identifying promoted campaigns at the earliest possible time. The early detection task addresses the difficulty of judging the nature of a hashtag using only the limited data available immediately before trending. Fig. 2 illustrates the shortage of information available for early detection. It is also conceivable that once the promotion has triggered interest in a hashtag, the conversation is sustained by the same mechanisms that characterize organic diffusion. Such noise around popular conversations may present an added difficulty for the early detection task.

Contributions and outline
The major contribution of this paper, beyond formulating the problem of detection of campaigns in social media, is the development and validation of a supervised machine learning framework that takes into consideration the temporal sequence of messages associated with a trending hashtag on Twitter and successfully classifies it as either "promoted" (advertised) or "organic" (grassroots). The proposed framework adopts time-varying features built from network structure and diffusion patterns, language, content and sentiment information, timing signals, and user meta-data. In the following sections we discuss the data we collected and employed, the procedure for feature extraction and selection, the implementation of the learning framework, and the evaluation of our system.

Dataset description
The dataset adopted in this study consists of Twitter posts (tweets) that contain a trending hashtag and appeared during a defined observation period. Twitter provides an interface that lists trending topics, with clearly labeled promoted trends at the top (Fig. 3). We crawled the Twitter webpage at regular intervals of 10 minutes to collect all organic and promoted hashtags trending in the United States between January and April 2013, for a total of N = 927 hashtags. This constitutes our ground-truth dataset of promoted and organic trends.
We extracted a sample of organic trends observed during the first two weeks of March 2013 for our analysis. While Twitter allows for at most one promoted hashtag per day, dozens of organic trends appear in the same period. As a result, our dataset is highly imbalanced, with the promoted class more than ten times smaller than the organic class (Table 1). Such an imbalance, however, reflects our expectation to observe in the Twitter stream a minority of promoted conversations blended in a majority of organic content. Therefore, to study the campaign detection problem under realistic conditions, we did not balance the classes by resampling.
Hashtags may trend multiple times on Twitter. However, those in our collection only trended once during our observation period. For each trend, we retrieved all tweets containing the trending hashtag from an archive containing a 10% random sample of the public Twitter stream. The collection period was hashtag-specific: for each hashtag we obtained all tweets produced in a four-day interval, starting two days before its trending point and extending to two days after that. This procedure provides an extensive coverage of the temporal history of each trending hashtag in our dataset and its related tweets, allowing us to study the characteristics of each trend before, during, and after the trending point.
Given that each trend is described by a collection of tweets over time, we can aggregate data in sliding time windows [t, t + ℓ) of duration ℓ and compute features on the subsets of tweets produced in these windows. A window can slide by time intervals of duration δ. The next window therefore contains tweets produced in the interval [t + δ, t + ℓ + δ). We experimented with various time window lengths and sliding parameters, and the optimal performance is often obtained with windows of duration ℓ = 6 hours sliding by δ = 20 minutes.
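The windowing scheme can be illustrated with a minimal Python sketch. The function names (`sliding_windows`, `volume_series`) and the choice to drop a trailing window that does not fit entirely before the end of the observation period are assumptions made for illustration, not details taken from our implementation.

```python
from datetime import datetime, timedelta


def sliding_windows(timestamps, start, end, length, step):
    """Yield (window_start, tweets_in_window) for each window [t, t + length).

    Windows start at `start`, advance by `step`, and stop once a full
    window no longer fits before `end` (an assumption of this sketch).
    """
    ordered = sorted(timestamps)
    t = start
    while t + length <= end:
        yield t, [ts for ts in ordered if t <= ts < t + length]
        t += step


def volume_series(timestamps, start, end,
                  length=timedelta(hours=6), step=timedelta(minutes=20)):
    """Number of tweets per window, using the 6-hour / 20-minute defaults."""
    return [len(batch)
            for _, batch in sliding_windows(timestamps, start, end, length, step)]
```

Any other feature can be computed in the same way by replacing `len(batch)` with a function of the tweets that fall in each window.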
We have made the IDs of all tweets involved in the trending hashtags analyzed in this paper available in a public dataset. 1

Features
Our framework computes features from a collection of tweets in some time interval. The system generates 487 features in five different classes: network structure and information diffusion patterns, content and language, sentiment, timing, and user meta-data. The classes and types of features are reported in Table 2 and discussed next. All of the feature time series in this study are available in our public dataset.

Table 2 (excerpt): ratio of tweets that contain emoticons (1 feature).
Table 2 notes. † We consider three types of network: retweet, mention, and hashtag co-occurrence networks. The hashtag co-occurrence network is undirected. * Distribution types. For each distribution, the following eight statistics are computed and used as individual features: min, max, median, mean, std. deviation, skewness, kurtosis, and entropy. ** Part-of-Speech (POS) tag. There are eight POS tags: verbs, nouns, adjectives, modal auxiliaries, predeterminers, interjections, adverbs, and pronouns. *** For each feature we compute mean and std. deviation.
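As an illustration of how a distribution is turned into eight individual features, the following sketch computes the statistics listed in the notes to Table 2. The specific estimators are assumptions of this sketch: population standard deviation, moment-based skewness, excess kurtosis, and Shannon entropy (in bits) of the empirical value frequencies.

```python
import math
from collections import Counter
from statistics import mean, median, pstdev


def distribution_features(values):
    """Summarize a non-empty list of numbers with eight statistics."""
    n = len(values)
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma > 0:
        skewness = sum((v - mu) ** 3 for v in values) / (n * sigma ** 3)
        kurtosis = sum((v - mu) ** 4 for v in values) / (n * sigma ** 4) - 3.0
    else:
        # Degenerate distribution: higher moments are set to zero by convention.
        skewness = kurtosis = 0.0
    # Shannon entropy of the empirical distribution of observed values.
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(values).values())
    return {
        "min": min(values), "max": max(values), "median": median(values),
        "mean": mu, "std": sigma, "skewness": skewness,
        "kurtosis": kurtosis, "entropy": entropy,
    }
```

Applied to, say, the follower counts of the users active in one time window, this yields eight values for that window; repeating it over consecutive windows gives eight feature time series.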

Network and diffusion features
Twitter actively fosters interconnectivity. Users are linked by means of follower/followee relations. Content travels from person to person via retweets. Tweets themselves can be addressed to specific users via mentions. The network structure carries crucial information for the characterization of different types of communication. In fact, the usage of network features significantly helps in tasks like astroturf detection [64]. Our system reconstructs three types of networks: retweet, mention, and hashtag co-occurrence networks. Retweet and mention networks have users as nodes, with a directed link between a pair of users that follows the direction of information spreading, toward the user retweeting or being mentioned. Hashtag co-occurrence networks have undirected links between hashtag nodes when two hashtags have occurred together in a tweet. All networks are weighted according to the number of interactions and co-occurrences. For each network, a set of features is computed, including in- and out-strength (weighted degree) distribution, density, shortest-path distribution, and so on (cf. Table 2).
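A minimal sketch of the retweet network construction is shown below. The input format (pairs of original author and retweeting user) and the function names are hypothetical; the density formula assumes a directed graph without self-loops.

```python
from collections import defaultdict


def build_retweet_network(retweets):
    """Build a weighted directed network from (original_author, retweeter) pairs.

    Links point in the direction of information spreading, i.e. toward
    the retweeting user; weights count repeated interactions.
    """
    weights = defaultdict(int)
    for source, target in retweets:
        weights[(source, target)] += 1
    return dict(weights)


def strengths(weights):
    """Return (in_strength, out_strength) dictionaries (weighted degrees)."""
    in_s, out_s = defaultdict(int), defaultdict(int)
    for (source, target), w in weights.items():
        out_s[source] += w
        in_s[target] += w
    return dict(in_s), dict(out_s)


def density(weights):
    """Fraction of possible directed links that are present."""
    nodes = {user for edge in weights for user in edge}
    n = len(nodes)
    return len(weights) / (n * (n - 1)) if n > 1 else 0.0
```

The in- and out-strength values can then be summarized with distribution statistics to obtain per-window features.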

User-based features
User meta-data is crucial to classify communication patterns in social media [55,30]. We extract user-based features from the details provided by the Twitter API about the author of each tweet and the originator of each retweet. Such features include the distribution of follower and followee numbers, and the number of tweets produced by the users (cf. Table 2).

Timing features
The temporal dimension associated with the production and consumption of content may reveal important information about campaigns and their evolution [35]. The most basic time-related feature we considered is the number of tweets produced in a given time interval. Other timing features describe the distributions of the intervals between two consecutive events, like two tweets or retweets (cf. Table 2).

Content and language features
Many recent papers have demonstrated the importance of content and language features in revealing the nature of social media conversations [24,52,57,12,48]. For example, deceiving messages generally exhibit informal language and short sentences [13]. Our system extracts language features by applying a Part-of-Speech (POS) tagging technique, which identifies different types of natural language components, or POS tags. The following POS tags are extracted: verbs, nouns, adjectives, modal auxiliaries, predeterminers, interjections, adverbs, pronouns, and wh-pronouns. 2 Tweets can therefore be analyzed to study how such POS tags are distributed. Other content features include the length and entropy of the tweet content (cf. Table 2).

Sentiment features
Sentiment analysis is a powerful tool to describe the attitude or mood of an online conversation. Sentiment extracted from social media conversations has been used to forecast offline events, including elections and financial market fluctuations [71,10], and is known to affect information spreading [56,33]. Our framework leverages several sentiment extraction techniques to generate various sentiment features, including happiness score [42], arousal, valence and dominance scores [77], polarization and strength [82], and emotion score [1] (cf. Table 2).

Feature selection
Our system generates a set I of |I| = 487 features (cf. Table 2) designed to extract signals from a collection of tweets and distinguish promoted trends from organic ones. Some features are more predictive than others; some are by definition correlated with each other due to temporal dependencies. Most of the correlations are related to the volume of data. For instance, the two most correlated features immediately prior to the trending point are the size of the hashtag co-occurrence network and the size of its largest connected component (Pearson's ρ = 0.75). This is why it is important to perform feature selection to eliminate redundant features and identify a combination of features that yields good classification performance.
There are several methods to select the most predictive features in a classification task [36]. We implemented a simple greedy forward feature selection method, summarized as follows: (i) initialize the set of selected features S = ∅; (ii) for each feature i ∈ I − S, consider the union set U = S ∪ {i}; (iii) train the classifier using the features in U ; (iv) test the average performance of the classifier trained on this set; (v) add to S the feature that provides the best performance; (vi) repeat (ii)-(v). We terminate the feature selection procedure if the AUC (cf. Sec. 2.5) increases by less than 0.05 between two consecutive steps. Most of the experiments terminate after selecting fewer than 10 features. The time series for the selected features are passed as input to the learning algorithms. In the next subsections we provide details about our experimental setting and learning models.
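The greedy procedure can be sketched as follows. Here `evaluate` is a hypothetical callback that trains the classifier on a candidate feature subset and returns its average AUC; the sketch also assumes that the first feature is always added and that a feature whose gain falls below the stopping threshold is not added.

```python
def greedy_forward_selection(features, evaluate, min_gain=0.05):
    """Greedy forward selection with an AUC-gain stopping rule.

    features: iterable of candidate feature names (the set I).
    evaluate: callable mapping a list of features to an average AUC.
    Returns the selected features S (in order) and the final score.
    """
    selected, best_score = [], None
    remaining = list(features)
    while remaining:
        # Steps (ii)-(iv): score every union set U = S + [i].
        score, feature = max(
            ((evaluate(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        # Stop when the improvement over the previous step is too small.
        if best_score is not None and score - best_score < min_gain:
            break
        # Step (v): keep the feature with the best performance.
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score
```

Each step evaluates every remaining feature, so the number of classifier trainings grows with |I| times the number of selected features; the stopping rule keeps the latter small.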

Experimental setting
Our experimental setting follows a pipeline of feature selection, model building, and performance evaluation. We apply the wrapper approach to select features and evaluate performance iteratively [40]. During each iteration (Fig. 4), we train and evaluate models using candidate subsets of features and expand the set of selected features using the greedy approach described in Sec. 2.3. Once we identify the set of features that performs best, we report results of experiments using only this set of features.
In each experiment and for each feature, an algorithm receives as input a time series with L = 35 data points to carry out its detection. The length of the time series and its delay D with respect to the trending point are discussed in Sec. 3; different experiments will consider different delays.
A set of feature time series is used to either train a learning model or evaluate its accuracy. The learning algorithms are discussed in the next subsection. For evaluation, we compute a Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (TPR) versus the false positive rate (FPR) at various thresholds. Accuracy is evaluated by measuring the Area Under the ROC Curve (AUC) [28] with 10-fold cross validation, and averaging AUC scores across the folds. A random-guess classifier produces the diagonal line where TPR equals FPR, corresponding to a 50% AUC score. Classifiers with higher AUC scores perform better and the perfect classifier in this setting achieves a 100% AUC score. We adopt AUC to measure accuracy because it is not biased by the imbalance in our classes (75 promoted trends versus 852 organic ones, as discussed earlier).
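For concreteness, the AUC can be computed directly from class scores without drawing the ROC curve, because it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties counting one half). The sketch below uses this pairwise formulation; the function name is ours, and labels are assumed to be 1 for promoted and 0 for organic.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for promoted (positive), 0 for organic (negative).
    scores: classifier scores, higher meaning more likely promoted.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one instance of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

Because the statistic is normalized by the number of positive-negative pairs, adding more organic trends does not by itself change the score, which is why the class imbalance does not bias it.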

Learning algorithms
Let us describe the learning systems for online campaign detection based on multidimensional time-series data from social media. We identified an algorithm, called K-Nearest Neighbor with Dynamic Time Warping (KNN-DTW), that is capable of dealing with multidimensional time series classification. For evaluation purposes, we compare the classification results against two baselines: SAX-VSM and KNN. These three methods are described next.

KNN-DTW classifier
KNN-DTW is a state-of-the-art algorithm to classify multidimensional time series, illustrated in Fig. 4. During learning, we provide our model with training and testing sets generated by 10-fold cross validation. Time series for each feature are processed in parallel using dynamic time warping (DTW), which measures the similarity between two time series after finding an optimal match between them by "warping" the time axis [7]. This allows the method to absorb some non-linear variations in the time series, for example different speed or resolution of the data.
For efficiency, we initially apply a time series coarsening strategy called piece-wise aggregation. We split each original time series into equally long sections of $p$ points each and replace the time-series values by the section averages, reducing the dimensionality from $L$ to $L' = L/p$. For trend $i$ and feature $k$, we thus obtain a coarsened time series $f^i_k = \{f^i_{k,1}, f^i_{k,2}, \dots, f^i_{k,L'}\}$. Then, DTW computes the distance between all pairs of points of two given trend time series $f^i_k$ and $f^j_k$, stored in a distance matrix $M$. Points closer to each other are more likely to be matched. To create a mapping between the two time series, an optimal path is computed over the time-series distance matrix. A path must start from the beginning of each time series and terminate at its end. The path between first and last points is then computed by minimizing the cumulative distance ($\gamma$) over alternative paths. This problem can be solved via dynamic programming [7] using the following recurrence:

$$\gamma(t, t') = M(t, t') + \min\{\gamma(t-1, t'-1),\ \gamma(t-1, t'),\ \gamma(t, t'-1)\}$$

(indices $i$, $j$, $k$ dropped for readability). The distance $\gamma^{ij}_k$ is used as the $ij$-th element of the $N \times N$ trend similarity matrix $\Gamma_k$.

Figure 4: Wrapper method description for KNN-DTW. We present the pipeline of our complete system, including feature selection and model evaluation steps. Input data feed into the system for training (green arrow) and testing (blue arrow) steps.
The computation of similarity between time series using DTW requires $O(L^2)$ operations. Some heuristic strategies use lower-bounding techniques to reduce the computational complexity [41]. Another technique is to re-sample the data before adopting DTW. Our coarsening approach reduces the computational costs by a factor of $p^2$. We achieved a significant increase in efficiency with marginal classification accuracy deterioration by setting $p = 5$ ($L' = 7$).
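The coarsening and DTW steps can be sketched in a few lines of Python. This is an illustrative sketch rather than our implementation: the point-wise distance is assumed to be the absolute difference between values, and the series length is assumed to be a multiple of the section length p.

```python
def piecewise_aggregate(series, p):
    """Replace each block of p consecutive points by its average."""
    if len(series) % p:
        raise ValueError("series length must be a multiple of p")
    return [sum(series[i:i + p]) / p for i in range(0, len(series), p)]


def dtw_distance(a, b):
    """Cumulative DTW distance between two non-empty series.

    Implements gamma(t, t') = M(t, t') + min(gamma(t-1, t'-1),
    gamma(t-1, t'), gamma(t, t'-1)), with M(t, t') = |a_t - b_t'|
    (the point-wise distance is an assumption of this sketch).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # gamma[0][0] = 0 anchors the path at the start of both series.
    gamma = [[inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for t in range(1, n + 1):
        for u in range(1, m + 1):
            cost = abs(a[t - 1] - b[u - 1])
            gamma[t][u] = cost + min(
                gamma[t - 1][u - 1], gamma[t - 1][u], gamma[t][u - 1]
            )
    return gamma[n][m]
```

With L = 35 and p = 5, each coarsened series has 7 points, so the inner table has 49 cells instead of 1225, which is the factor of p squared mentioned above.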
In the evaluation step, we use the K-Nearest Neighbor (KNN) algorithm [23] to assign a class score to a test trend $q$. We compare $q$ with each training trend $i$ to obtain a DTW distance $\gamma^{iq}_k$ for each feature $k$. We then find the $K = 5$ labeled trends with smallest DTW distance from $q$, and compute the fraction of promoted trends $s^q_k$ among these nearest neighbors. We finally average across features to obtain the class score $\bar{s}^q$. Higher values of $\bar{s}^q$ indicate a high probability that $q$ is a promoted trend. Class scores, together with ground-truth labels, allow us to compute the AUC of a model, which is then averaged across folds according to cross validation.
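The scoring step can be sketched as follows. The data layout (each trend as a list of per-feature time series), the function names, and the absolute-difference point distance inside DTW are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Cumulative DTW distance with absolute-difference point costs."""
    inf = float("inf")
    gamma = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    gamma[0][0] = 0.0
    for t in range(1, len(a) + 1):
        for u in range(1, len(b) + 1):
            gamma[t][u] = abs(a[t - 1] - b[u - 1]) + min(
                gamma[t - 1][u - 1], gamma[t - 1][u], gamma[t][u - 1]
            )
    return gamma[len(a)][len(b)]


def knn_dtw_score(train, labels, query, k=5):
    """Class score of a test trend, averaged across features.

    train:  list of trends, each a list of per-feature time series.
    labels: 1 for promoted, 0 for organic, aligned with `train`.
    query:  the test trend, in the same per-feature layout.
    """
    per_feature = []
    for f, query_series in enumerate(query):
        ranked = sorted(
            ((dtw_distance(trend[f], query_series), y)
             for trend, y in zip(train, labels)),
            key=lambda pair: pair[0],
        )
        neighbors = ranked[:k]
        # Fraction of promoted trends among the k nearest neighbors.
        per_feature.append(sum(y for _, y in neighbors) / len(neighbors))
    return sum(per_feature) / len(per_feature)
```

Since each feature contributes a fraction with denominator K, the averaged score takes a limited set of values between 0 and 1, which is sufficient for ranking trends when computing the AUC.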

SAX-VSM classifier
Our first baseline, called SAX-VSM, blends symbolic dimensionality reduction and vector space models [66]. Time series are encoded via Symbolic Aggregate approXimation (SAX), yielding a compact symbolic representation that has been used for time series anomaly and motif detection, time series clustering, indexing, and more [49,50]. A symbolic representation encodes numerical features as words. A vector space model is then applied to treat time series as documents for classification purposes, similarly to what is done in information retrieval. In our implementation, we first apply piece-wise aggregation and then use SAX to represent the data points in input as a single word of L letters from an alphabet ℵ. This choice and the parameters |ℵ| = 5 and L = 4 are based on prior optimization [66], and variations to these settings only marginally affect performance. Each time-series value is mapped into a letter by dividing the range of the feature values into |ℵ| regions in such a way as to obtain equiprobable intervals under an assumption of normality [50]. In the training phase, for each feature, we build two sets of words corresponding to organic and promoted trends, respectively. In the test phase, a new instance is assigned to the class with the majority of word matches across features. In case of a tie we assign a random class. For further details about this baseline and its implementation, we refer the reader to the SAX-VSM project website. 3
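To make the encoding step concrete, the sketch below maps a time series to a SAX word using equiprobable intervals of the standard normal distribution. It is a simplified illustration, not the SAX-VSM implementation: it assumes that each series is z-normalized on its own before aggregation, that the series length is a multiple of the word length, and that the alphabet is the five letters a to e.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax_word(series, word_length=4, alphabet="abcde"):
    """Encode a numeric series as a word of `word_length` letters."""
    if len(series) % word_length:
        raise ValueError("series length must be a multiple of word_length")
    mu, sigma = mean(series), pstdev(series)
    # z-normalize; a constant series maps to all zeros.
    z = [(v - mu) / sigma if sigma > 0 else 0.0 for v in series]
    # Piece-wise aggregation down to `word_length` section averages.
    size = len(z) // word_length
    sections = [mean(z[i * size:(i + 1) * size]) for i in range(word_length)]
    # Breakpoints that cut N(0, 1) into len(alphabet) equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / a) for j in range(1, a)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in sections)
```

With five letters the breakpoints fall near -0.84, -0.25, 0.25, and 0.84, so values close to the series mean map to the middle letter.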

K-Nearest Neighbors classifier
Our second baseline is an off-the-shelf implementation of the traditional K-Nearest Neighbors algorithm [23] for time-series classification. We used the Python scikit-learn package [61]. We selected KNN because it can capture and learn time-series patterns without requiring any pre-processing of the raw time-series data. We created the feature vectors for each trend by concatenating into a single vector the continuous-valued time series representing each feature. The nearest neighbor classifier computes the Euclidean distance between pairs of single-vector time series. For a test trend, the class score is given by the fraction of promoted trends among the K = 5 nearest neighbors.

Results
In this section, we present results of experiments designed to evaluate the ability of our machine learning framework to discriminate between organic and promoted trends. For all experiments, each feature time series consists of 120 real-valued data points equally divided before and after the trending point. Although in principle we could use the entire time series for classification, ex-post information would not serve our goal of early detection of social media campaigns in a streaming scenario that resembles a real setting, where information about the future evolution of a trend is obviously unavailable. For this reason, we consider only a subset consisting of L data points ending with delay D since the trending point: D ≤ 0 for early detection, D > 0 for classification after trending. We evaluate the performance of our detection framework as a function of the delay parameter D. The case D = 0 involves detection immediately at trending time. However, we also consider D < 0 to examine the performance of our algorithms based on data preceding the trending point; of course the detection would not occur until D = 0, when one would become aware of the trending hashtag. Time series are encoded using the settings described above (L = 35 windows of length ℓ = 6 hours sliding every δ = 20 minutes).

Method comparison
We carried out an extensive benchmark of several configurations of our system for campaign detection. The performance of the algorithms as a function of varying delays D is plotted in Fig. 5.
In addition, we introduce random temporal shifts for each trend time series to test the robustness of the algorithms. In real-world scenarios, one would ideally expect to detect a promoted trend without knowing the trending point. To simulate such scenarios, we designed an experiment that introduces variations that randomly shift each time series around its trending point. The temporal shifts are sampled from Gaussian distributions with different variances. We present the results of this experiment in Fig. 6.
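The shift experiment can be sketched as follows, assuming a 120-point feature series with the trending point at index 60 and an input window of L = 35 points ending at delay D. The rounding of the Gaussian shift to an integer number of points and the clamping of the window to the boundaries of the series are assumptions of this sketch.

```python
import random


def window_at_delay(series, delay, length=35, trend_index=60, shift=0):
    """Return `length` points ending at trend_index + delay + shift."""
    end = trend_index + delay + shift
    # Keep the window inside the series (assumption of this sketch).
    end = max(length, min(len(series), end))
    return series[end - length:end]


def randomly_shifted_window(series, delay, sigma, rng,
                            length=35, trend_index=60):
    """Window whose end point is displaced by a Gaussian random shift."""
    shift = round(rng.gauss(0.0, sigma))
    return window_at_delay(series, delay, length, trend_index, shift)
```

For example, `randomly_shifted_window(series, 0, 10.0, random.Random(1))` returns 35 points ending near, but not necessarily at, the trending point; repeating the classification for increasing values of `sigma` traces a robustness curve.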
KNN-DTW and KNN display the best detection accuracy (measured by AUC) in general. Their performance is comparable (Fig. 5). The AUC score is on average around 95% for detecting promoted trends after trending. In the early detection task, we observe scores above 70%. This is quite remarkable given the small amount of data available before the trending point. KNN-DTW also displays a strong robustness to temporal shifts, pointing to the advantage of time warping (Fig. 6). The KNN algorithm is less robust because it computes point-wise similarities between time series without any temporal alignment; as the variance of the temporal shifts increases, we observe a significant drop in accuracy. SAX-VSM benefits from the time series encoding and provides good detection performance (on average around 80% AUC) but early detection accuracy is poor, close to random for D < 0. A strong feature of SAX-VSM is its robustness to temporal shifts, similar to KNN-DTW. Our experiments suggest that temporal encoding is a crucial ingredient for successful classification of time-series data. Encoding reduces the dimensionality of the signal. More importantly, encoding preserves (most) information about temporal trends and makes an algorithm robust to random shifts, which is an important advantage in real-world scenarios. SAX-VSM ignores long-term temporal ordering. KNN-DTW, on the other hand, computes similarities using a time series representation that preserves the long-term temporal order, even as time warping may alter short-term trends. This turns out to be a crucial advantage to achieve both high accuracy and robustness.
Figure 6: Temporal robustness. AUC of different learning algorithms with random temporal shifts versus the standard deviation of the shifts. We repeated the experiment for various delay values D. Significance levels of differences in consecutive experiments are marked as (*) p < 0.05 and (**) p < 0.01.

Using AUC as an evaluation metric has the advantage of not requiring discretization of scores into binary class labels. However, detection of promoted trends in real scenarios requires binary classification by a threshold. In this way we can measure accuracy, precision, recall, and identify misclassified trends. Fig. 7 illustrates the distribution of probabilistic scores produced by the KNN-DTW classifier as a function of the delay for the two classes of trends, organic and promoted. The scores are computed for held-out test instances, across folds. An ideal classifier would separate these distributions completely, achieving perfect accuracy. Test instances in the intersection between two distributions either are misclassified or have low-confidence scores. Examples of misclassified instances are discussed in Sec. 3.3. For D < 0, KNN-DTW generates more conservative scores, and the separation between the organic and promoted class distributions is smaller. For D > 0, KNN-DTW scores separate the two classes well. To convert continuous scores into binary labels, we calculated the threshold values that maximize the F1 score of each experiment; this score combines precision and recall. Trends with scores above the threshold are labeled as promoted. The best accuracy and F1 score are obtained shortly after trending, at D = 20.
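The threshold search can be sketched as a scan over the observed score values. In this sketch, a trend is labeled as promoted when its score is greater than or equal to the threshold, candidate thresholds are restricted to the observed scores, and ties in F1 are resolved in favor of the lowest threshold; these conventions are assumptions made for illustration.

```python
def best_f1_threshold(labels, scores):
    """Return (threshold, f1) maximizing F1 for the promoted class.

    labels: 1 for promoted, 0 for organic.
    scores: classifier scores; score >= threshold predicts promoted.
    """
    best_threshold, best_f1 = None, 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1
```

Lowering the returned threshold trades precision for recall, which is the adjustment one would make when false negatives are considered more costly than false positives.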

Feature analysis
Let us explore the roles and importance of different features for trend detection. To this end, we identify the significant features using the greedy selection algorithm described in Sec. 2.3, and group them by the five classes (user meta-data, content, network, sentiment, and timing) previously defined. We focus on KNN-DTW, our best performing method. After selecting the top 10 features for different delays D, we compute the fractions of top features in each class, as illustrated by Fig. 8. We list the top features for experiments D = 0 (early detection) and D = 40 (classification) in Table 3.
The usefulness of content features does not appear to change significantly between early and late detection. In the early detection task, user features seem to contribute significantly more than any other class, possibly because early adopters reveal strong signals about the nature of trends. As we move past the trending point, signals from early adopters are flooded by increasing numbers of participants. Timing and network features become increasingly important as the involvement of more users allows us to analyze group activity and network structure patterns.

Figure 7: Distributions of KNN-DTW classifier scores. We use Kernel Density Estimation (KDE), a non-parametric smoothing method, to estimate the probability densities based on finite data samples. We also show the threshold values that separate the two classes yielding an optimal F1 score.

Analysis of misclassifications
We conclude our analysis by discussing when our system fails. In Fig. 9, we illustrate how some key features of misclassified trends diverge from the majority of the trends that are correctly classified. We observe that some misclassified trends follow the temporal characteristics of the other class. This is best illustrated in the case of volume (number of tweets). An advantage of continuous class scores is that we can tune the classification threshold to achieve a desired balance between precision and recall, or between false positives and false negatives. False negative errors are the most costly for a detection system: a promoted trend mistakenly labeled as organic would easily go unchecked among the larger number of correctly labeled organic trends. By focusing on a few specific instances of false negatives generated by our system, we gained some insight into the reasons behind the mistakes. First of all, it is conceivable that promoted trends are sustained by organic activity before promotion, and are therefore essentially indistinguishable from organic ones until the promotion triggers the trending behavior. It is also reasonable to expect a decline in performance for long delays: as more users join the conversation, promoted trends become harder to distinguish from organic ones. This may explain the dip in accuracy observed for the longest delay (cf. Fig. 5).
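The threshold tuning mentioned above can be sketched as follows. Since false negatives are the costliest errors, one option is to fix a minimum recall on promoted trends and accept whatever precision results. This is only an illustrative sketch: the function name `threshold_for_recall`, the at-or-above labeling convention, and the toy scores are assumptions, not part of our evaluation pipeline.

```python
import numpy as np

def threshold_for_recall(scores, labels, min_recall):
    """Highest threshold whose recall on promoted trends is >= min_recall.

    Hypothetical helper. Convention (an assumption): instances with
    score >= threshold are labeled promoted (label 1).
    Returns (threshold, precision, recall).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Scores of the truly promoted instances, highest first.
    promoted = np.sort(scores[labels == 1])[::-1]
    if promoted.size == 0:
        raise ValueError("need at least one promoted instance")

    # Number of promoted instances that must be recovered.
    k = max(int(np.ceil(min_recall * promoted.size)), 1)
    threshold = float(promoted[k - 1])

    pred = scores >= threshold
    tp = int(np.sum(pred & (labels == 1)))
    precision = tp / int(np.sum(pred))
    recall = tp / promoted.size
    return threshold, precision, recall

# Toy example with made-up scores (1 = promoted, 0 = organic).
toy_scores = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1, 1]
strict = threshold_for_recall(toy_scores, toy_labels, min_recall=1.0)
relaxed = threshold_for_recall(toy_scores, toy_labels, min_recall=0.6)
```

On the toy data, requiring full recall lowers the threshold and admits one organic trend as a false positive, whereas the relaxed setting misses one promoted trend but produces no false positives.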
False positives (organic trends mistakenly labeled as promoted) can be manually filtered out in post-processing and are therefore less costly. However, analysis of false positives provides some insight as well. Some trends in our dataset, such as #watchsuitstonight and #madmen, were promoted via alternative communication channels (television and radio), rather than via Twitter. This has become a common practice in recent years, as more and more Twitter campaigns are mentioned or advertised externally to trigger organic-looking responses in the audience. Our system recognized such instances as promoted, whereas their ground-truth labels mark them as organic. Those campaigns were therefore counted as false positives, penalizing our algorithms in the evaluation. We find it remarkable that in these cases our system is capable of learning the signature of promoted trends, even though the promotion occurs outside of the social media itself.

Related work
Recent work on social media provides a better understanding of human communication dynamics such as collective attention and information diffusion [78], the emergence of trends [47,31], and social influence and political mobilization [11,21,22,75].
or entertainment) are continuously discussed and sometimes a particular conversation can accrue lots of attention and generate trending memes. The promotional campaigns studied here can be seen as a type of exogenous factor affecting the visibility of memes.
The present work, to the best of our knowledge, is the first to investigate the early detection of promoted content on social media. We focus our attention on advertisement, which can play an important role in information campaigns. Trending memes are considered an indicator of collective attention in social media [83,45], and as such they have been used to predict real-world events, like the winner of a popular reality TV show [19]. Although emerging from collective attention, communication on social media can be manipulated, for example for political gain, as in the case of astroturf [54,64].
Recent work analyzes emerging topics, memes, and conversations triggered by real world events [2,5,14]. Studies of information dissemination reveal mechanisms governing content production and consumption [17] as well as prediction of future content popularity. Cheng et al. study the prediction of photo-sharing cascade size [15] and recurrence [16] on Facebook. Machine learning models can predict future popularity of emerging hashtags and content on social media [70,51]. Features extracted from content [39], sentiment [33,43], community structure [80,81], and temporal signatures [62,34,76] are commonly used to train such models. In this paper we leverage similar features, but for the novel task of campaign detection. Furthermore, our task is more challenging because we deal with dynamic features whose changes over time are captured in high-dimensional time series.
Another topic related to our research is rumor detection. Rumors may emerge organically as genuine conversation and spread out of control. They are characterized and sustained by ambiguous contexts, where the correctness and completeness of information or the meaning of a situation is not apparent [27]. Examples are situations of crisis or topics of public debate [53]. Existing systems to identify rumors are based mostly on content analysis [63,44] and clustering techniques [29,38]. An open question is whether rumor detection might benefit from the wide set of feature classes we propose here.
The proposed framework is based on a mixture of features common in social media data, including emotional and sentiment information. The literature has reported extensively on the use of social media content to describe emotional and demographic characteristics of users [55,33,56]. The use of language in online communities is the focus of two recent papers [24,52]: the authors observe that the language of social media users evolves, and common patterns emerge over time. The language style of users adapts to achieve better fitness in the conversation [25]. These findings suggest that language contains strong signals, in particular if studied in conjunction with other dimensions of the data. Our study confirms the importance of content for campaign detection. Finally, our system builds on network features and diffusion patterns of social media messages. Network structure and information diffusion in social media have been studied extensively [3,46]. Network features are highly predictive of certain types of social media abuse, like astroturf, that attempt to simulate grassroots online conversations [64,65,69,30,74]. Such artificial campaigns produce peculiar patterns of information diffusion: the topology of retweet or mention networks is often a stronger signal than content or language. The present findings are consistent with this body of work, as network features are helpful in detecting promoted content after trending.

Conclusions
As we increasingly rely on social media to satisfy our information needs, it is important to recognize the dynamics behind online campaigns. In this paper, we posed the problem of early detection of promoted trends on social media, discussed the challenges that this problem presents, and proposed a supervised computational framework to attack it. The proposed system leverages time series representing the evolution of different features characterizing trending campaigns. The list includes features related to network structure and diffusion patterns, sentiment, language and content, timing, and user meta-data. We demonstrated the crucial advantages of encoding temporal sequences.
We achieved good accuracy in campaign detection. Our early detection performance is remarkable when one considers the challenging nature of the problem and the low volume of data available in the early stage of a campaign. We also studied the robustness of the proposed algorithms by introducing random temporal shifts around the trending point, simulating realistic scenarios in which the trending point can only be estimated with limited accuracy.
One of the advantages of our framework is that it provides interpretable feature classes. We explored how content, network, and user features affect detection performance. Extensive feature analysis revealed that signatures of campaigns can be detected early, especially by leveraging content and user features. After the trending point, network and temporal features become more useful.
The availability of data about organic and promoted trends is subject to Twitter's recipe for selecting trending hashtags. There is no certain way to know if and when social media platforms make any changes to such recipes. However, nothing in our approach assumes any knowledge of a particular platform's trending recipe. If the recipe changes, our system could be retrained accordingly.
This work represents an important step toward the automatic detection of campaigns. The problem is of paramount importance, since social media shape the opinions of millions of users in everyday life. Further work is needed to study whether different classes of campaigns (say, legitimate advertising vs. terrorist propaganda) may exhibit characteristics captured by distinct features. Many of the features leveraged in our model, such as those related to network structure and temporal attributes, capture activity patterns that could provide useful signals to detect astroturf [64]. Therefore, our framework could in principle be applied to astroturf detection, if longitudinal training data about astroturf campaigns were available.