Open Access

Modeling dynamics of attention in social media with user efficiency

EPJ Data Science 2014, 3:5

DOI: 10.1140/epjds30

Received: 19 December 2013

Accepted: 13 February 2014

Published: 4 March 2014

Abstract

Evolution of online social networks is driven by the need of their members to share and consume content, resulting in a complex interplay between individual activity and attention received from others. In a context of increasing information overload and limited resources, discovering which are the most successful behavioral patterns to attract attention is very important. To shed light on the matter, we look into the patterns of activity and popularity of users in the Yahoo Meme microblogging service. We observe that a combination of different types of social and content-producing activity is necessary to attract attention and that the efficiency of users, namely the average attention received per piece of content published, has a defined trend in its temporal footprint for many users. The analysis of the user time series of efficiency shows different classes of users whose different activity patterns give insights on the type of behavior that pays off best in terms of attention gathering. In particular, sharing content with high spreading potential and then supporting the attention raised by it with social activity emerges as a frequent pattern for users gaining efficiency over time.

Keywords

online attention; microblogging; social networks; time series

1 Introduction

Understanding users’ activities in social media platforms, in terms of the actions they take and how those actions affect the attention they receive (e.g., comments, replies, re-posts of messages they post, etc.), is crucial for understanding the dynamics of social media systems as well as for designing incentives that lead to growth in terms of user activity and number of users. As expected, given the nature of such platforms, users who receive attention from their peers tend to be more engaged with the service and are less likely to churn out [1]. Insights on the kinds of actions that users take to gain more attention and become “popular” are therefore important because they can help explain how social media platforms evolve. In spite of the importance of analyzing such behavior at a large scale, the dynamics of attention are not well understood. This is largely due to two reasons: on one hand, there are few datasets that show the evolution of a network from its very beginnings, and on the other hand, most work has focused on the popularity of content rather than on analyzing the effects of users’ behaviors on how other users react to them. For example, there have been many studies to establish the reasons behind user or item popularity in social networks (e.g., [2, 3]), but the effects that the patterns of attention received have on the activity and the engagement of the “average” users have not been thoroughly explored so far.

In this paper, we address questions that focus on social media users’ behavior at different stages of their participation in social media platforms. In particular, we introduce a new way to examine attention dynamics, and from this perspective perform a deep analysis of the evolution of user activity and attention in a social network from its beginning until the service ceased to exist. Analyzing the weekly efficiency, i.e., the amount of attention received in the platform normalized by the amount of content produced, we observe that 56% of the users in the dataset exhibit a footprint of their efficiency with a clearly defined trend (i.e., sharply increasing/decreasing or peaking). We are able to extract patterns of user behavior from these temporal footprints that reveal differences in the activity behavior of users of different classes. We focus our analysis on Yahoo Meme, a microblogging service that was launched by Yahoo in 2009 and discontinued in 2012. While the mechanisms of interaction in Yahoo Meme were similar to those found in other social media platforms, to the best of our knowledge, this is the first study that examines in detail the questions we are addressing from the perspective of user efficiency, using data from a service from its initial launch.

The main contributions of this work include:

  • Study of the attention dynamics in social networks from the angle of efficiency, namely the ratio between attention received and activity performed. The notion of efficiency in time makes it possible to detect patterns that could not emerge using other raw popularity or activity indicators.

  • Definition of a method to classify noisy time series of user-generated events. The method is successfully used to find classes of users based on the time series of their efficiency scores, with an accuracy ranging from 0.85 to 0.93, depending on the class.

  • Extraction of insights useful to detect and prevent user churn. For instance, exploration of the efficiency time series reveals that an increase in efficiency is determined by the creation of high-quality content, but the acquired attention has to be sustained with additional social activity to keep the efficiency high. If such social exchange is missing, attention received drops very quickly.

2 Related work

Much effort has been spent lately in measuring the effect that the activity of content production and sharing has in influencing the actions of social media participants. Depending on whether the investigation adopts the perspective of the user who is sharing or of the content being shared, emphasis has been given to the characterization of either the influential users or the process of information spreading along social connections.

Different methods to identify influentials, namely individuals who seed viral information cascades, have been proposed recently [4], and it has been observed that simple measures such as the raw number of social connections are not good predictors of influence potential [5–7]. Instead, the ease of propagation of a piece of content is correlated with many other features, including the position of the content creator in the social network [8], demographic factors [9, 10], and the sentiment conveyed in the message [11].

As for content-centered analysis, much attention has been devoted to the study of the structure and diffusion speed of information cascades in social and news media [12–14], including Yahoo Meme [14, 15]. Weng et al. [14], for instance, have shown that triadic closure helps to explain link formation in the early stages of a user’s lifetime, but that later in time information flow becomes the driver of new connections. Despite the difficulty of determining whether observed cascades are generated by a real influence effect [16] (unless performing controlled experiments [17]), the role of influence in social network dynamics is widely recognized, albeit not fully understood. Factors related to influence include geolocation, visibility of the content, or exogenous factors like major geopolitical or news events for news media [18–20].

Patterns of temporal variation of popularity have been investigated previously, mostly focusing on the attention given to pieces of user-generated content. Previous work includes characterization of the peakedness and saturation of video popularity on YouTube in relation to content visibility [18], crowd productivity dependence on the attention gathered by videos [1], the classification of bursty Twitter hashtags in relation to topic detection tasks [21], and the clustering of hashtag popularity histograms based on their shape [22]. Time series have been used to predict popularity in blogs, where the early reactions of the crowd to a piece of content are strongly correlated with the expected overall popularity [3, 23].

In this work we focus on users as opposed to content and we analyze time series of a metric combining the user activity and the attention received. We do not focus on the popularity gained at a global scale, but instead we characterize temporal patterns of activity and attention of each individual. We show that time series of individual user activity cannot be clustered accurately based on their shapes by state-of-the-art methods, so we propose an algorithm to address this. Finally, except in rare cases (e.g., [24]), previous work on network analysis has relied mostly on limited temporal snapshots. In contrast, we use the temporal data of the entire life-span of Meme, from its release date until its shutdown.

3 Dataset description

Meme was a microblogging service launched by Yahoo in April 2009 and discontinued in May 2012. Users could post messages, receive notifications of posts published by people they follow (follower ties are directed social connections), and repost messages of other users or comment on those messages. The overall number of registered users grew at a constant pace, up to almost 700K. When neglecting uninvolved users (i.e., users who were registered, but stopped explicit activity), we observe a growing trend up to a maximum of 60K users around the end of the first year, and then a slow but steady decline. In Table 1 we report general statistics on the follower network in the last week of the service. The final network contains a well-connected core of users resulting in a greatest connected component covering almost the full network, with a high clustering coefficient. As already observed for other online social networks, the average path length is proportional to $\log\log(N)$, and similarly to other news media the level of social link reciprocity is very low [25].
Table 1

Followers network statistics

| Nodes | Edges | Density | $k$ | $k_{in}$ | GWCC% | Reciprocity | $d$ | $d_{max}$ | $C$ |
|---|---|---|---|---|---|---|---|---|---|
| 568K | 20M | $6.2 \cdot 10^{-5}$ | 71 | 35 | 0.996 | 0.096 | 2.59 | 11 | 0.433 |

Δ = density, GWCC% = relative size of the greatest weakly connected component, $d$ = geodesic distance, $C$ = clustering coefficient.
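The density and degree columns of Table 1 can be cross-checked from the node and edge counts alone. The sketch below is only a sanity check under stated assumptions: it uses the rounded counts from the table (568K nodes, 20M edges), the usual directed-graph density $E / (N(N-1))$, and it reads $k$ as in-degree plus out-degree averaged over nodes. The function and variable names are hypothetical, and the exact definitions used by the authors are not given in the text.

```python
def directed_density(n_nodes: int, n_edges: int) -> float:
    """Density of a directed graph without self-loops: E / (N * (N - 1))."""
    return n_edges / (n_nodes * (n_nodes - 1))


def mean_in_degree(n_nodes: int, n_edges: int) -> float:
    """Every directed edge contributes exactly one incoming link."""
    return n_edges / n_nodes


# Rounded values from Table 1 (assumption: the exact counts are not reported).
N_NODES = 568_000
N_EDGES = 20_000_000

density = directed_density(N_NODES, N_EDGES)
mean_in = mean_in_degree(N_NODES, N_EDGES)
# Assumption: k counts incoming plus outgoing links of a node.
mean_total = 2 * mean_in

print(f"density       ~ {density:.2e}")
print(f"mean in-degree ~ {mean_in:.1f}")
print(f"mean degree    ~ {mean_total:.1f}")
```

With the rounded inputs, the results land close to the reported density, $k_{in}$ and $k$; small differences are expected because the table values come from the exact counts.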

4 Activity vs. attention

Activity and attention are the two dimensions we aim to examine with our study. After defining the features, we look at their relationship in terms of correlations of their raw indicators and then we study them from a novel perspective by defining a metric of user efficiency. We find that very efficient users tend to write fewer posts per week but are heavily involved in social activities such as commenting.

4.1 Activity and attention metrics

We define activity and attention indicators that are computed for every user. Activity indicators are measured by the number of posts (pd), reposts (rd), and comments done (cd), or by the number of new followees added (fwee), while attention is determined by the number of reposts (rr) or comments received (cr) from others, and by the number of new incoming follower links (fw). Reposts received can be direct or indirect (i.e., reposting a repost). To measure attention we consider direct reposts.

The possibility of indirect reposting gives rise to repost cascades that can be modeled as trees rooted in the original post and whose descendants are the direct (depth 1) and indirect (depth 2 to the leaves) reposts. Besides being another attention indicator, the cascade size (cs) is a good proxy for the perceived interestingness of the content because, intuitively, sharing a piece of content originated by someone who is not directly linked through a social tie, and therefore is likely to be unknown to the reposter, implies a higher likelihood that the reposter was interested in that piece of content. Therefore, we consider the cascade size as a measure of content interestingness.
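To make the distinction between direct reposts and cascade size concrete, the following minimal sketch walks a repost tree stored as a child-to-parent mapping. The data layout, the function name `cascade_stats`, and the choice to count only reposts (not the root post itself) in the cascade size are assumptions made for illustration; the text does not specify them.

```python
from collections import defaultdict


def cascade_stats(root, parent_of):
    """Return (direct_reposts, cascade_size) for the post `root`.

    parent_of maps each repost id to the id of the item it reposted.
    Assumption: the cascade size counts all reposts in the tree,
    excluding the original post.
    """
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    direct = len(children[root])      # depth-1 reposts only
    size = 0
    stack = list(children[root])      # depth-first walk over all descendants
    while stack:
        node = stack.pop()
        size += 1
        stack.extend(children[node])
    return direct, size


# Hypothetical cascade: r1 and r2 repost p, r3 reposts r1, r4 reposts r3.
reposts = {"r1": "p", "r2": "p", "r3": "r1", "r4": "r3"}
direct, size = cascade_stats("p", reposts)
print(direct, size)
```

In this toy cascade the post has two direct reposts but a cascade of four, which is the gap the cs indicator is meant to capture.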

Even though several measures of influence, authoritativeness, or more generally importance of a user in a networked system have been developed in the past (see for instance the work by Romero et al. [7]), here we adopt the perspective of a single user, rather than of the whole community. Therefore, we interpret the system as a black box that receives input from a user (activity) and returns some output (attention), without considering the actual effect that the input causes inside the system. Although this is a simplification, it allows us to better focus on the user dimension and to cluster users with respect to the perception they get from the interaction with the system (i.e., attention in exchange for activity).

4.2 Correlations

When dealing with multidimensional behavioral data, detecting causation between events can be difficult [16], but potential mechanisms driving the interactions between the different dimensions at play can be spotted through the investigation of correlations [26]. In this case, the correlations between activity and attention metrics give a first hint about the potential payoff of some user actions in terms of attention received.

In Figure 1, visual clues of the relationship between different metrics of activity and attention are shown in the form of heatmaps. The four plots on the left display the average values of attention indicators for users whose number of posts and comments resides in given ranges. To make sure that the trends emerging from the heatmaps are significant, we count the number of users falling in each of the range buckets. In Table 2 we report the average and the median number of users in each bucket of the heatmaps. As expected from the broad distributions of the activity and attention indicators, few actors have very high values for some pairs of indicators. For instance, in the heatmap in Figure 1(E), just 10 users are in the upper-right bucket (users with >625 posts and >625 followers). However, in general the number of users per bucket is sufficiently high to consider the trend statistically significant, as shown by Table 2.
Figure 1

Correlations between activity and attention. Users were grouped according to their x and y values (plotted on a log scale) and, for each group, the average z-value was calculated and mapped to a color intensity.
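The bucketing behind the heatmaps can be sketched as follows. This is an illustration under assumptions: the logarithmic base of 5 is inferred only from the ">625" bucket boundary mentioned in the text (625 = 5^4), and the tuple layout and function names are hypothetical. Each cell stores both the average z-value (the color) and the number of users (the quantity summarized in Table 2).

```python
from collections import defaultdict


def log_bucket(value: int, base: int = 5) -> int:
    """Index of the logarithmic bucket of a non-negative integer.

    Integer division avoids floating-point issues at exact powers of
    the base: 0..4 -> 0, 5..24 -> 1, 25..124 -> 2, 125..624 -> 3, ...
    """
    bucket = 0
    while value >= base:
        value //= base
        bucket += 1
    return bucket


def heatmap_cells(users, base: int = 5):
    """Map (x_bucket, y_bucket) -> (mean z, number of users).

    users is an iterable of (x, y, z) tuples, one per user.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, z in users:
        cell = (log_bucket(x, base), log_bucket(y, base))
        sums[cell] += z
        counts[cell] += 1
    return {cell: (sums[cell] / counts[cell], counts[cell]) for cell in counts}


# Hypothetical users: (posts, followers, reposts received).
sample = [(700, 650, 10), (900, 1000, 30), (3, 2, 1)]
cells = heatmap_cells(sample)
print(cells)
```

Keeping the per-cell count next to the mean makes it easy to flag sparsely populated cells before reading a trend off the color gradient.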

Table 2

Statistics for the number of users considered in each bucket of the heatmaps depicting the correlations between activity and popularity metrics (Figure 1)

| x-axis | y-axis | Average | Median |
|---|---|---|---|
| Posts | Reposts received | 73.1 | 39 |
| Posts | Comments done | 77.9 | 32 |
| Posts | Followers | 74.3 | 45 |

The average and median number of users per bucket in each combination of metrics is shown.

First, we observe that attention in terms of followers and comments (Figures 1(A)-(B)) is correlated with both the number of posts and comments done, resulting in a color gradient becoming brighter when transitioning from the lower-left corner to the upper-right one. Users who gained more followers were heavier content producers, and an even more evident correlation is found when considering comments received (Figure 1(B)), likely due to a comment reciprocity tendency (we calculated the comment reciprocity to be around 24%, much higher than reciprocity in the follower network). We observe a partially similar effect when looking at content-centered indicators, namely the reposts received and the cascade size (Figures 1(C)-(D)). In these cases we find a positive correlation with the number of posts, but not with the number of comments, suggesting that social interaction, such as commenting on other people’s posts, does not strongly characterize content propagation.

The two plots on the right of Figure 1 show the relation between pairs of attention metrics and the number of posts. From Figure 1(E) we learn that social exposure (i.e., being followed) and productivity (i.e., number of posts) are both heavily correlated with the number of reposts. However, people with moderate or heavy posting activity can reach a high level of attention even with a relatively small audience (as shown by the bright colors extending down along the right side of the map). This intuition is confirmed by the fact that, when swapping the axes of the two attention measures, the correlation is disrupted (Figure 1(F)), meaning that people with a high number of posts and reposts do not necessarily have a large number of followers.

4.3 User efficiency

The above findings support, on one hand, the intuitive principle that “the more you give, the more you get” and, on the other hand, they reinforce the hypothesis that visibility is not enough to grant a wide diffusion of content (similarly to the “million follower fallacy” in the context of Twitter [6]). However, the user perception of the interaction with peers through an online system does not depend only on the raw number of feedback actions received, but also on the amount of attention in relation to the effort spent to gain it. Given this perspective, we define the efficiency $\eta$ of a user $u$ in a given time frame $[t_i, t_j]$ as the amount of attention received over the amount of activity performed between $t_i$ and $t_j$, for any pair of activity (Act) and attention (Att) metrics:
$$\eta_u^{\mathrm{Act},\mathrm{Att}}(t_i, t_j) = \frac{\sum_{t_i}^{t_j} \mathrm{Att}_u}{\sum_{t_i}^{t_j} \mathrm{Act}_u}. \tag{1}$$

Analogous definitions have been used in different disciplines such as physics and economics [27], and in most cases the efficiency is upper bounded by 1, i.e., the outcome from the system cannot exceed the energy given in input. On the contrary, in a social media setting the efficiency is unbounded and it constitutes an objective function to maximize in order to increase the engagement of the user base. Even if comments can be strong indicators of involved user participation, the main focus of the online service under study is posting and reposting, similarly to Twitter. Therefore we always consider the number of posts as the metric of activity in the efficiency formula. In the above definition (Formula (1)) we assume that the attention taken into account should be the one that is directly triggered by the activity considered. We therefore use either the number of reposts ($\eta_u^{\mathrm{Post},\mathrm{Repost}}$) or the number of comments ($\eta_u^{\mathrm{Post},\mathrm{Comm}}$) as proxies for attention received, since other metrics such as number of followers are not necessarily responses to the posting activity.
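As a small worked instance of Formula (1), the sketch below computes the efficiency of one user from per-period counts of attention and activity. The function name, the list-based input, and the choice to return `None` when the user performed no activity in the time frame are assumptions made for illustration; the formula itself leaves that case undefined.

```python
from typing import Optional, Sequence


def efficiency(attention: Sequence[int], activity: Sequence[int]) -> Optional[float]:
    """Ratio of total attention to total activity over one time frame.

    attention[k] and activity[k] are the counts observed in period k
    (e.g., reposts received and posts published).
    Assumption: the ratio is undefined (None) when there is no activity.
    """
    total_activity = sum(activity)
    if total_activity == 0:
        return None
    return sum(attention) / total_activity


# Hypothetical user: 6 posts over three periods, 9 reposts received.
reposts_received = [3, 0, 6]
posts_published = [2, 1, 3]
eta = efficiency(reposts_received, posts_published)
print(eta)
```

Here the ratio is 9/6 = 1.5, which illustrates the remark that, unlike physical efficiency, this quantity is not bounded by 1.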

The distribution of $\eta_u^{\mathrm{Post},\mathrm{Repost}}$ and $\eta_u^{\mathrm{Post},\mathrm{Comm}}$ for all the users during the complete lifetime of the network is drawn in Figure 2. Even if the maximum efficiency scores span up to several hundreds, the majority of users have an efficiency lower than 1, and most of them have values close to zero. The average over the $\eta_u$ values of all users is higher than 1 for reposts and much lower for comments. This is justified by the fact that Meme especially emphasized the repost feature. For this reason, next we consider only the efficiency of posts in relation to reposts, and we refer to it as $\eta_u$, for simplicity.
Figure 2

Efficiency scores. Distribution of efficiency scores, bucketed in 0.25-wide bins. Average scores are 0.38 for comments and 1.55 for reposts.

High activity is usually indicative of poor efficiency or, in other words, activity alone is not indicative of high potential of attention gain. To study more in depth the traits of efficient and inefficient users, we describe users with different $\eta_u$ values according to several activity and status features, as shown in Figure 3.
Figure 3

Activity and efficiency. Average values of activity and status indicators at fixed values of $\eta_u$.

Insightful patterns emerge. First, the higher the $\eta_u$, the lower the activity in terms of number of posts, except in the range $0 \le \eta_u \le 5$ (containing most of the users), in which the number of posts grows with $\eta_u$. However, when looking at the average number of posts submitted per week instead, the trend becomes monotonic, confirming the theory about the limited attention of the audience being a barrier for attention gathering [20]. Second, the higher the $\eta_u$, the higher the number of comments: the more efficient users are the ones who comment the most. Finally, the longevity of the profile and the prestige on the follower network (computed with standard PageRank) are also distinctive features of efficient users.

5 Evolution of efficiency in time

Attention attracted by users, and by consequence their efficiency, is not constant in time. It depends on the amount of activity, the position in the network and other factors. However, next we show that, even if many users exhibit oscillating but globally stable values of efficiency in time, more than half of the users show sharp variations in their efficiency time series that tell more about their activity behavior in different periods of the user lifetime. First, we give the definition of efficiency time series. Then, we explain the algorithm used to classify users’ efficiency traces according to the shape of their trend and discuss the properties of the four classes we found. We (i) find that state-of-the-art algorithms for clustering of time series do not perform well on noisy traces such as the ones generated by human activity; therefore, based on the observed shapes, (ii) we propose a new classification method and evaluate it against a human-curated ground truth, and (iii) we analyze the differences between user behaviors in the four main user efficiency classes around the main changepoint of the efficiency curve.

5.1 Efficiency time series definition

By adapting the efficiency formula for a discrete-time scenario, we model the temporal efficiency evolution using weekly time series for each user $u$, measuring the efficiency $\eta_u$ after each week. The elements of the series are generated as follows:
$$\eta_u(t_i) = \frac{\mathrm{rr}(p_{t_i})}{|p_{t_i}|}, \qquad t_i \in T_u = \{t_1, \ldots, t_n\},$$

where $p_{t_i}$ represents the set of posts published by user $u$ on week $t_i$, $\mathrm{rr}(p_{t_i})$ is the total number of direct reposts received in the user’s lifetime for the set of posts $p_{t_i}$, and $T_u$ is the sorted list of weeks in which the user $u$ published at least one post.
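The weekly series defined above can be sketched in a few lines. The input layout (one `(week, direct_reposts)` pair per post, where the repost count is the lifetime total for that post) and the function name are assumptions made for illustration. Weeks without posts are simply absent from the output, matching the definition of $T_u$.

```python
from collections import defaultdict


def weekly_efficiency(posts):
    """Build the efficiency time series of one user.

    posts is an iterable of (week, direct_reposts) pairs, one per post;
    direct_reposts is the lifetime number of direct reposts of that post.
    Returns a list of (week, efficiency) pairs sorted by week, containing
    only the weeks in which the user published at least one post.
    """
    per_week = defaultdict(list)
    for week, direct_reposts in posts:
        per_week[week].append(direct_reposts)
    return [(week, sum(r) / len(r)) for week, r in sorted(per_week.items())]


# Hypothetical user: two posts in week 1, none in week 2, three in week 3.
user_posts = [(3, 4), (1, 0), (1, 2), (3, 0), (3, 2)]
series = weekly_efficiency(user_posts)
print(series)
```

Because each post carries its lifetime repost count, the value for a week reflects how well that week's content eventually performed, not the reposts that happened to arrive during that week.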

5.2 Time series type detection

Characterizing users based on the exhibited temporal behavior of their efficiency requires automatically extracting patterns from the generated time series. There are two main families of state-of-the-art methods for this task. The first one includes feature-based approaches that cluster series based on their kurtosis, skewness, trend, and chaos [28]. The second one includes area-under-the-curve methods [29–31] that consist of dividing the time series into equally sized fragments, measuring the area under the curve in each fragment, representing the time series as a vector of such quantities, and then applying a clustering algorithm over them (specifically, we used k-means). We first tried those state-of-the-art methods to cluster the efficiency time series. We do not report the results extensively for the sake of brevity, but both feature-based approaches and area-under-the-curve methods produce clusters containing extremely heterogeneous curves, as we assessed by manual inspection. In addition, we also tried a separate approach, proposed a few years ago, that transforms the curves through Piecewise Aggregate Approximation and Symbolic Aggregate Approximation and then clusters the resulting representations with k-means [32]. This method also led to very imbalanced clusters, with 99% of the curves placed in a single cluster. The main issue with those approaches is that they have been tested in the past mainly on synthetic time series. When time series represent the activity of single actors they may have an extremely broad variety of lengths, shapes, and oscillations of the curve that the mentioned methods are not able to handle properly.
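For readers unfamiliar with the area-under-the-curve representation, the sketch below shows one possible way to turn a series into a fixed-length vector before clustering. It is not the implementation used in the cited work: the number of fragments, the rectangle-rule area with unit spacing, the index-based fragment boundaries for lengths that do not divide evenly, and the normalization by the total area are all assumptions. The clustering step (k-means) is omitted.

```python
def auc_vector(series, n_fragments=4):
    """Represent a time series by the share of area in each fragment.

    The series is cut into n_fragments contiguous pieces of (nearly)
    equal length; the area of a piece is the sum of its values
    (rectangle rule, unit spacing). Areas are divided by the total area
    so that series of different magnitude become comparable.
    """
    n = len(series)
    if n < n_fragments:
        raise ValueError("series is shorter than the number of fragments")
    bounds = [i * n // n_fragments for i in range(n_fragments + 1)]
    areas = [sum(series[bounds[i]:bounds[i + 1]]) for i in range(n_fragments)]
    total = sum(areas)
    if total == 0:
        return [0.0] * n_fragments
    return [area / total for area in areas]


# Hypothetical efficiency series with most of its mass in the third quarter.
vector = auc_vector([0, 0, 1, 1, 4, 4, 2, 2], n_fragments=4)
print(vector)
```

A vector like this discards everything below the fragment scale, which gives some intuition for why short, noisy per-user traces of very different lengths can end up in heterogeneous clusters.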

Even though the produced clusters were very noisy, the area-under-the-curve method tended to group curves into four main clusters, with a predominance of well-recognizable shapes: increasing, decreasing, peaky and steady. Some examples of time series for each class are depicted in Figure 4 (top). Driven by the qualitative insights that the clustering produced, we developed a tailored classification algorithm to obtain cleaner groups, based on a qualitative, discrete representation of the temporal data, inspired by the representation of financial time series presented by Lee et al. [33]. Our algorithm executes the following steps:
  1. Smoothing. Apply the kernel regression estimator of Nadaraya and Watson [34] to the user temporal data to obtain a smoothed time series t. The smoothing process removes very sharp and punctual fluctuations, which are very frequent in human activity time series. Examples of raw curves compared to their smoothed versions are shown in Figure 4 (bottom).

     
Figure 4

Time series examples for each class. Examples of efficiency time series for users of each class indicated by the clustering of time series. Top: raw time series, bottom: smoothed time series. Thresholds used to detect changepoints for the first three types are reported with dashed lines.

  2. Linguistic transform. Generate a qualitative representation of the time series t for a user u using three states: High, Medium, Low (H, M, L). We empirically set the threshold for high values to 0.6 and for medium values to 0.3 (e.g., values greater than 60% of the maximum efficiency reached by the user are considered High). The idea of using threshold values is supported by previous work in time-series segmentation [35].

     
  3. Fluctuation reduction. Search for contiguous subsequences of a given state and drop the subsequences whose length is less than 10% of the total length. Similarly to the smoothing procedure, this step helps to eliminate noisy fluctuations in the time series. For example, in the series HHHMHHMMMLLL, the fourth element, M, is dropped.

     
  4. String collapsing. Collapse the string representation of t by replacing subsequences of the same state with a single symbol of the same type. For instance, the resulting series from the previous example, HHHHHMMMLLL, is transformed to HML.

     
  5. Detection of Increasing/Decreasing classes. Look for collapsed sequences with just two groups of symbols and classify as “Increasing” a sequence transitioning from L or M to the state H and as “Decreasing” those transitioning from H to L or M. The second and third columns in Figure 4 show the threshold for High values as a dotted red line.

     
  6. Detection of Peaky class. For the unclassified series, find those exhibiting a peaky shape by looking at outliers in the series whose value is higher than x times the average value. This method has been successfully used before in the context of Twitter, with $x = 5$ [21]. Other methods for peak detection we tested [36] find just local peaks, which are very frequent in noisy time series.

     
  7. Detection of changepoint. Accurately locating the point at which a curve transitions between different levels is important to study the behavior of users in their single activity and popularity metrics around the point in time when these changes occur [37]. For the peak type curves, the changepoint is intuitively defined by the highest peak, whereas for the increasing and decreasing types the point is identified by the time at which the linguistic representation of the series transitions from H to M or L status (decreasing) or from L or M to H status (increasing). For the sake of comparison, we match our simple technique with the statistical change point analysis recently proposed by Chen et al. [38]. We find that, although for most time series the values from the two methods were very close (at most 1 or 2 weeks difference in around 80% of the cases), the statistical changepoint detection often identifies points right before or right after a change of efficiency.

     
  8. Detection of Steady class. The remaining time series are classified as steady.
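The steps above can be sketched end to end as follows (changepoint detection, step 7, is left out). This is a minimal illustration, not the authors' code. Several details are assumptions because the text does not fix them: a Gaussian kernel with a bandwidth of 2 weeks for the Nadaraya-Watson estimator, thresholds taken relative to the maximum of the smoothed series, and the peak test (x = 5) applied to the raw series. All function names are hypothetical.

```python
import math
from itertools import groupby


def nadaraya_watson(values, bandwidth=2.0):
    """Step 1: Gaussian-kernel Nadaraya-Watson estimate at every index."""
    n = len(values)
    smoothed = []
    for i in range(n):
        weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(n)]
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / sum(weights))
    return smoothed


def linguistic(values, high=0.6, medium=0.3):
    """Step 2: map each value to H, M or L relative to the series maximum."""
    peak = max(values)
    if peak <= 0:
        return "L" * len(values)
    return "".join(
        "H" if v > high * peak else "M" if v > medium * peak else "L"
        for v in values
    )


def drop_short_runs(states, min_fraction=0.1):
    """Step 3: drop runs shorter than min_fraction of the total length."""
    min_len = min_fraction * len(states)
    runs = ["".join(group) for _, group in groupby(states)]
    return "".join(run for run in runs if len(run) >= min_len)


def collapse(states):
    """Step 4: replace each run of equal symbols with a single symbol."""
    return "".join(symbol for symbol, _ in groupby(states))


def classify(raw, bandwidth=2.0, peak_factor=5.0):
    """Steps 5, 6 and 8: label a non-empty raw efficiency series."""
    shape = collapse(drop_short_runs(linguistic(nadaraya_watson(raw, bandwidth))))
    if shape in ("LH", "MH"):
        return "increasing"
    if shape in ("HL", "HM"):
        return "decreasing"
    mean = sum(raw) / len(raw)
    if mean > 0 and max(raw) > peak_factor * mean:
        return "peaky"
    return "steady"


print(drop_short_runs("HHHMHHMMMLLL"), collapse("HHHHHMMMLLL"))
print(classify([5] * 10 + [10] * 10), classify([1] * 10 + [100] + [1] * 9))
```

The first print reproduces the worked example from steps 3 and 4. Note that, following step 5 literally, a series whose cleaned representation passes through all three states (e.g., LMH) is not labeled as increasing; whether such cases should be merged into the two-group classes is a design choice the text leaves open.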

     

As in most previous work [39], in the absence of an automatic way to compute the quality of the classes, two of the authors annotated a random sample of 1,000 time series per class to assess the goodness of our algorithm. Since the expected shapes of the curves for each class are very clear (see examples in Figure 4), a human evaluator can decide with certainty whether the instances from the sample match the expected template. The outcome of the labeling is very encouraging, with 93% correct instances in the Decreasing class, 86% in the Increasing class, and 85% in the Peak class, and almost perfect agreement between evaluators (Fleiss $\kappa = 0.80$). For the Steady class, where shapes can vary considerably, we labeled as misclassified any curve belonging to one of the other classes. We found a low portion of misclassified instances (12%). We observe that the users in the steady class are around 44%, meaning that 56% of the users exhibit a temporal footprint of the efficiency curve that has a clearly defined trend. This finding has important practical implications, meaning that the majority of users could be accurately profiled as having consistently increasing or decreasing efficiency patterns.
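The agreement statistic mentioned above can be computed with the standard Fleiss' kappa formula, sketched here in plain Python. The input layout (one row per annotated series, one column per label, each cell holding the number of raters who chose that label) and the toy data are assumptions for illustration; they are not the annotations collected in the study.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[i][j] = raters giving item i label j.

    Every item must be rated by the same number of raters (at least two).
    The statistic is undefined when all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_labels = len(counts[0])
    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    # Agreement expected by chance from the overall label proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_labels)
    ]
    expected = sum(p * p for p in proportions)
    return (observed - expected) / (1 - expected)


# Toy example: 2 raters, labels (matches template, does not match).
ratings = [[2, 0], [2, 0], [1, 1], [0, 2]]
kappa = fleiss_kappa(ratings)
print(round(kappa, 4))
```

In the toy table the raters agree on three of four items, yet kappa is well below 0.75 because part of that agreement is expected by chance given how often each label is used.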

5.3 Changepoint detection

Accurately locating the point at which a curve transitions between different levels is important to characterize user behavior when efficiency significantly increases or drops, and thus to study how single activity and popularity metrics vary when these changes occur [37]. Changepoint detection refers to the problem of finding time instants where abrupt changes occur [37]. Except for the Steady time series, which denote a user behavior that is quite constant in time (or for which transitions to higher or lower efficiency levels are much slower), the other three types have a changepoint at which the efficiency trend changes radically in a relatively short period of time compared to the total length of the user lifetime. For the Peak type curves, the changepoint is intuitively defined by the highest peak, whereas for the Increasing and Decreasing types the point is identified by the time at which the linguistic representation of the series transitions from H or M to L status (decreasing) or from L or M to H status (increasing). More general methods to identify changepoints, relying on changes in mean and variance, have been proposed in the past. For the sake of comparison, we match our simple technique against the statistical change point analysis proposed by Chen et al. [38]. We find that, although for most time series the values from the two methods are very close (at most 1 or 2 weeks difference in around 80% of the cases), the statistical changepoint detection sometimes identifies points right before or right after a change of efficiency. In fact, the generality of statistical methods is not an advantage when the set of input curves is quite homogeneous, and ad-hoc methods prove more reliable in such cases. For this reason, we use our definition of changepoint.
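The changepoint rule described in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `changepoint`, the representation of the linguistic series as one H, M or L label per week, and the choice of the first qualifying transition when several exist are all hypothetical, since the text does not specify them.

```python
def changepoint(values, labels, kind):
    """Return the week index of the changepoint, or None if there is none.

    values: weekly efficiency values
    labels: linguistic level per week, each one of "H", "M", "L" (assumed encoding)
    kind:   "peak", "increasing", "decreasing" or "steady"
    """
    if kind == "peak":
        # Changepoint is the week of the highest peak
        return max(range(len(values)), key=lambda w: values[w])
    if kind == "increasing":
        src, dst = {"L", "M"}, {"H"}
    elif kind == "decreasing":
        src, dst = {"H", "M"}, {"L"}
    else:
        return None  # steady series have no changepoint
    # Assumption: take the first week whose level completes the transition
    for w in range(1, len(labels)):
        if labels[w - 1] in src and labels[w] in dst:
            return w
    return None


# Hypothetical example: efficiency rises to the high level in week 3
print(changepoint([1, 1, 2, 5, 5], ["L", "L", "M", "H", "H"], "increasing"))
```

Because the rule only inspects the class-specific level transition, it cannot drift to a week just before or after the change, which is the behavior the comparison with the statistical method highlights.
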

Once users with similar profiles in their temporal efficiency evolution have been grouped, time series are analyzed to identify meaningful changepoints.

6 User efficiency classes

For each detected class, we first perform an aggregate analysis over all the users and then characterize the evolution of the same metrics in time. We find that (i) publishing interesting content helps to boost the efficiency of the subsequent posts through attention gathering and that (ii) the efficiency gained in that way should be sustained by intense social activity to keep it from dropping.

6.1 Static analysis of user classes

We aggregate different activity and attention indicator scores over users and weeks, for each of the four user classes. For all the indicators, we compute their average value per week for every user and then we compute the median of all the results obtained for users of the same class. The median is used instead of the average to account for the broad distribution of values. In addition, to get a measure of the adhesion of users to the service, we measure the median number of weeks of activity and the median number of days of duration of the user account. Values for all the metrics are shown in Table 3 and they give a first picture of the levels of activity performed and attention attracted by users of different classes. Users in the Increasing class have the highest values for almost all the metrics compared to other groups. They are able to attract high levels of attention (fw, rr), and they combine the production of content of high interestingness for the community (high cs) with social activity (high cd and fwee values). As we will show later, the production of comments and the addition of followees is a characteristic of this class through time. On the contrary, users belonging to the Peak class are the least active in terms of social activity (low cd and fwee values) but, surprisingly, they are relatively active content publishers and tend to remain active for a long time, exhibiting a high number of active weeks and the highest account duration. They are quite involved in posting but are not much engaged in the social interactions that complement the content production and consumption process. As we will observe next, these users do some commenting activity at the beginning of their lifetime but they quickly and significantly reduce the number of followees or comments. Users in the Decreasing and Steady classes both receive a good amount of attention and establish a high number of social links, backed up by high content-production activity in the Steady case. Given the shorter time of involvement and knowing about their sharp efficiency drop, the users in the Decreasing class are likely people with a good level of participation who, differently from the users in Steady, significantly reduced their involvement in the service at some point.
Table 3

Activity, popularity and longevity indicators for the four user classes

| Type | %users | pd | cd | fwee | cr | fw | rr | cs | days | weeks |
|---|---|---|---|---|---|---|---|---|---|---|
| Decreasing | 15% | 6.11 | 2.78 | 10.7 | 4.90 | 3.57 | 25.3 | 34 | 491 | 53 |
| Increasing | 16% | 10.3 | 4.74 | 9.69 | 6.14 | 4.82 | 43.4 | 51 | 690 | 92 |
| Peak | 25% | 8.10 | 2.74 | 6.82 | 4.07 | 3.18 | 9.11 | 32 | 703 | 85 |
| Steady | 44% | 8.22 | 3.75 | 10.3 | 5.50 | 4.35 | 29.1 | 40 | 610 | 72 |

Values are the median of the average weekly values. Columns pd, cd and fwee measure Activity; cr, fw, rr and cs measure Attention; days and weeks measure Time. Abbreviations used are pd = posts, cd = commentsDone, fwee = followees, fw = followers, rr = repostsReceived, cs = cascadeSize, days = userLifetime, weeks = activeWeeks.
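The aggregation behind Table 3 (per-user weekly average, then the median across the users of a class) can be sketched as follows. The function name `class_medians`, the dictionary-based input layout and the toy numbers are assumptions for illustration; in particular, whether inactive weeks enter the per-user average is not stated in the text, and the sketch simply averages whatever weekly values it is given.

```python
from statistics import mean, median


def class_medians(weekly_values, user_class):
    """Median over users of each class of the per-user weekly average.

    weekly_values: {user_id: [metric value for each week]}
    user_class:    {user_id: class name}
    """
    per_class = {}
    for user, weeks in weekly_values.items():
        # Step 1: average weekly value for this user
        per_class.setdefault(user_class[user], []).append(mean(weeks))
    # Step 2: median across users of the same class, robust to broad distributions
    return {cls: median(avgs) for cls, avgs in per_class.items()}


# Hypothetical toy data for a single indicator (e.g. posts per week)
weekly_posts = {"u1": [2, 4], "u2": [10, 10], "u3": [1, 1, 1], "u4": [100, 0]}
classes = {"u1": "Increasing", "u2": "Increasing", "u3": "Increasing", "u4": "Peak"}
print(class_medians(weekly_posts, classes))
```
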

6.2 Variation around the changepoint

Here we investigate in more depth how users in each class distribute their activity over time. We perform an analysis around the changepoint of the efficiency curve to see whether the different temporal patterns can explain why the efficiency level changed over time. We decompose the time series into different phases and study the relations between them in terms of the activity and attention indicators. We do so for all the users belonging to the classes where the changepoint is defined (i.e., all but the Steady class).

Let us define three user-dependent time steps: the week in which the user activity started, $w_{\mathrm{start}}$, the week of the changepoint of the efficiency curve, $w_{\mathrm{cp}}$, and the week of the end of the activity, $w_{\mathrm{end}}$, after which no other action is performed by the user. Accordingly, we define three phases of the user lifespan, referred to as Before, CP, and After, which represent, respectively: the weeks in the $[w_{\mathrm{start}}, w_{\mathrm{cp}})$ interval, the changepoint week $w_{\mathrm{cp}}$, and the weeks in the $(w_{\mathrm{cp}}, w_{\mathrm{end}}]$ interval. We calculate the average weekly amount of activity and attention metrics during these three macro-aggregates of weeks. The three values obtained for each indicator capture the variation of activity and attention when approaching the critical point at which a consistent change of efficiency is detected.
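The split into the Before, CP and After phases can be sketched as below. The function name `phase_averages` and the convention that the weekly series is a list whose index 0 corresponds to the first week of activity are assumptions for illustration, not the authors' implementation; empty phases are returned as `None`.

```python
def phase_averages(weekly, w_cp):
    """Average weekly value of one metric in the Before, CP and After phases.

    weekly: metric values for each week from the first to the last active week
            (index 0 is the start week, the last index is the end week)
    w_cp:   index of the changepoint week within that list
    """
    def avg(xs):
        return sum(xs) / len(xs) if xs else None

    before = weekly[:w_cp]       # weeks in [start, cp)
    after = weekly[w_cp + 1:]    # weeks in (cp, end]
    return {"Before": avg(before), "CP": weekly[w_cp], "After": avg(after)}


# Hypothetical series with the changepoint in the third week (index 2)
print(phase_averages([2, 4, 9, 3, 5], 2))
```
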

To detect the variation of the values in the three phases we compute two ratios for each user: (a) RatioCP = activity-or-attention metric measured in $w_{\mathrm{cp}}$ divided by the same metric computed in $[w_{\mathrm{start}}, w_{\mathrm{cp}})$, and (b) RatioAfter = activity-or-attention metric measured in $(w_{\mathrm{cp}}, w_{\mathrm{end}}]$ divided by the same metric during $[w_{\mathrm{start}}, w_{\mathrm{cp}})$. Ratios are then averaged over all the users of each class. Comparison of ratios between different user classes reveals the key differences between them: values above 1 mean that the value of the indicator grew in the CP or in the After phase compared to the Before phase. Final results for different values of activity and attention are reported in Figure 5. For instance, in Figure 5(a), we observe that RatioAfter is above 1 only for users in the Peak class. It means that the users in that class have published more posts after the changepoint than they did before it. We can summarize our findings as follows:

  • Activity and attention at CP. Users of all classes maintain a similar trend in the number of posts done in CP, with a slight increase in the case of the Increasing class (Figure 5(a)). For the Peak and Increasing classes, the number of reposts received, cascade size and followers increase significantly in CP compared to Before (Figure 5(d), (e), (f) respectively). Since reposts received and cascade size are proxies for content interestingness, this indicates the production of content that attracts the attention of a much higher number of users. For both classes, this is the most likely cause of the rise of their efficiency at CP. For the Decreasing class the attention values start dropping instead. Finally, differently from other classes, users in the Increasing class produce a higher number of comments in CP (Figure 5(b)).

Figure 5

Ratio of activity and attention. Ratio of activity and attention metrics between the Before phase and later phases (Change Point and After), for the 3 user classes.

  • Social activity after CP. In the After phase, social interactions such as the number of comments and the addition of new followees considerably increase compared to Before for the Increasing class (Figures 5(b), (c)), while they remain stable or slightly decrease for the Peak class. Decreasing class values drop also in this case.

  • Content production activity after CP. The reverse scenario is found when looking at the posting activity. In the After phase, Peak users post messages at a higher rate than Before (Figure 5(a)), while the posting activity of Increasing users drops in favor of a higher attention to social interaction.
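The two ratios defined before this list of findings, RatioCP and RatioAfter, can be sketched as below. The function names `user_ratios` and `class_ratios` are hypothetical, and skipping users whose Before value is zero (for whom the ratio is undefined) is an assumption of this sketch; the text does not say how such users are handled.

```python
def user_ratios(before, cp, after):
    """RatioCP and RatioAfter for one user and one metric."""
    return cp / before, after / before


def class_ratios(users):
    """Average RatioCP and RatioAfter over the users of one class.

    users: list of (before, cp, after) average weekly values, one tuple per user.
    Assumption: users with a zero Before value are skipped.
    """
    valid = [user_ratios(b, c, a) for b, c, a in users if b]
    n = len(valid)
    ratio_cp = sum(r[0] for r in valid) / n
    ratio_after = sum(r[1] for r in valid) / n
    return ratio_cp, ratio_after


# Hypothetical class of three users; the last one is skipped (Before = 0)
ratio_cp, ratio_after = class_ratios([(2, 4, 1), (4, 4, 6), (0, 1, 1)])
print(ratio_cp, ratio_after)
```

A class-level value above 1 then reads exactly as in the text: the indicator grew in that phase relative to the Before phase.
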

The main lesson learned from the above findings is that the submission of pieces of “interesting” content, namely posts that attract the attention of a wider audience than usual, is the trigger to transition to higher efficiency levels. However, efficiency cannot be maintained without cost. Increasing engagement in social activity and expanding the potential audience turns out to be an effective strategy not to lose efficiency. Conversely, producing more content without reinforcing the social relationships with the potential consumers of the content results in a rapid drop of efficiency to the original levels. The difference between the Increasing and Peak classes is particularly striking: Increasing-type users fully exploit social activity, with 17% more followees, 23% more comments and 61% more reposts after their changepoint, while Peak-type users keep their activity approximately stable (except for an increase of reposts done). Moreover, as expected, when a status of equilibrium between attention received and activity is disrupted by an arbitrary reduction of productivity and social interactions, the efficiency is destined to fade quickly.

7 Conclusions

We explored the interplay between activity and attention in Yahoo Meme by defining the notion of user efficiency, namely the amount of attention received in relation to the content produced. We find that, unlike the raw attention measures, efficiency has a strong negative correlation with the amount of user activity, and that users who are involved in social activities such as commenting have higher centrality in the social network than average, but are not necessarily heavy content producers.

However, if we consider commenting as a form of content creation, we observe that commenting takes less effort than creating a post but, frequently, it can be more effective. This is because reciprocity plays a role, and the comments network exhibits a higher reciprocity than the follower network. Users can, thus, benefit from the visibility of a post whenever they comment on it.

We classify the time series of user efficiency into four main classes (sharp increasing/decreasing steps, peaks or stable trend) with a novel algorithm that overcomes limitations of previous approaches. By analyzing the variation of activity and attention around the changepoints of the time series, we find evidence that user efficiency is boosted by a particular combination of production of interesting content and constant social interactions (e.g., comments). In these cases, users gather the attention of a wider audience by publishing content with higher spreading potential and then they manage to keep the attention high through regular and intensified social activity. These insights find direct application in the detection and prevention of user churn: users who increase their efficiency but are frustrated by not being able to keep it high can be detected and helped, either by recommending social activities to them or by prompting their contacts to interact with them. The task of churn prediction is a natural continuation of the present work that we plan to address in the future.

Declarations

Acknowledgements

This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975. Carmen Vaca's research work has been funded by ESPOL and the Ecuadorian agency SENESCYT. We would like to thank Amin Mantrach, Neil O’Hare, Daniele Quercia, and Rossano Schifanella for the useful discussions.

Authors’ Affiliations

(1)
Politecnico di Milano
(2)
Yahoo Labs
(3)
FIEC, Escuela Superior Politecnica del Litoral

References

  1. Huberman BA, Romero DM, Wu F: Crowdsourcing, attention and productivity. J Inf Sci 2009, 35(6):758–765. 10.1177/0165551509346786
  2. Ratkiewicz J, Fortunato S, Flammini A, Menczer F, Vespignani A: Characterizing and modeling the dynamics of online popularity. Phys Rev Lett 2010, 105(15): Article ID 158701.
  3. Szabo G, Huberman BA: Predicting the popularity of online content. Commun ACM 2010, 53(8):80–88. 10.1145/1787234.1787254
  4. Pal A, Counts S: Identifying topical authorities in microblogs. In Proceedings of the fourth ACM international conference on web search and data mining (WSDM). ACM, New York; 2011:45–54.
  5. Asur S, Huberman BA, Szabo G, Wang C: Trends in social media: persistence and decay. Proceedings of the 5th AAAI conference on weblogs and social media (ICWSM) 2011.
  6. Cha M, Haddadi H, Benevenuto F, Gummadi PK: Measuring user influence in Twitter: the million follower fallacy. 10. AAAI conference on weblogs and social media (ICWSM) 2010, 10–17.
  7. Romero DM, Galuba W, Asur S, Huberman BA: Influence and passivity in social media. In WWW’11: proceedings of the 20th international conference companion on world wide web. ACM, New York; 2011:113–114.
  8. Hong L, Dan O, Davison BD: Predicting popular messages in Twitter. In WWW. ACM, New York; 2011.
  9. Strufe T: Profile popularity in a business-oriented online social network. In Proceedings of the 3rd workshop on social network systems (SNS). ACM, New York; 2010:2.
  10. Suh B, Hong L, Pirolli P, Chi EH: Want to be retweeted? Large scale analytics on factors impacting retweet in Twitter network. In 2010 IEEE second international conference on social computing (SocialCom). IEEE Press, New York; 2010:177–184.
  11. Quercia D, Ellis J, Capra L, Crowcroft J: In the mood for being influential on Twitter. In 2011 IEEE third international conference on privacy, security, risk and trust (PASSAT) and 2011 IEEE third international conference on social computing (SocialCom). IEEE Press, New York; 2011:307–314.
  12. Cha M, Mislove A, Gummadi KP: A measurement-driven analysis of information propagation in the Flickr social network. In Proceedings of the 18th international conference on world wide web (WWW). ACM, New York; 2009:721–730.
  13. Bakshy E, Hofman JM, Mason WA, Watts DJ: Everyone’s an influencer: quantifying influence on Twitter. In Proceedings of the fourth ACM international conference on web search and data mining (WSDM). ACM, New York; 2011:65–74.
  14. Weng L, Ratkiewicz J, Perra N, Gonçalves B, Castillo C, Bonchi F, Schifanella R, Menczer F, Flammini A: The role of information diffusion in the evolution of social networks. Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining. KDD’13 2013, 356–364.
  15. Ienco D, Bonchi F, Castillo C: The meme ranking problem: maximizing microblogging virality. In 2010 IEEE international conference on data mining workshops (ICDMW). IEEE Press, New York; 2010:328–335.
  16. Shalizi CR, Thomas AC: Homophily and contagion are generically confounded in observational social network studies. Sociol Methods Res 2011, 40(2):211–239. 10.1177/0049124111404820
  17. Bakshy E, Rosenn I, Marlow C, Adamic L: The role of social networks in information diffusion. In Proceedings of the 21st international conference on world wide web (WWW). ACM, New York; 2012:519–528.
  18. Figueiredo F, Benevenuto F, Almeida JM: The tube over time: characterizing popularity growth of YouTube videos. In Proceedings of the fourth ACM international conference on web search and data mining (WSDM). ACM, New York; 2011:745–754.
  19. Brodersen A, Scellato S, Wattenhofer M: YouTube around the world: geographic popularity of videos. In Proceedings of the 21st conference on world wide web (WWW). ACM, New York; 2012:241–250.
  20. Weng L, Flammini A, Vespignani A, Menczer F: Competition among memes in a world with limited attention. Sci Rep 2012, 2: Article ID 335.
  21. Lehmann J, Gonçalves B, Ramasco JJ, Cattuto C: Dynamical classes of collective attention in Twitter. In Proceedings of the 21st international conference on world wide web (WWW). ACM, New York; 2012:251–260.
  22. Yang J, Leskovec J: Patterns of temporal variation in online media. In Proceedings of the fourth ACM international conference on web search and data mining (WSDM). ACM, New York; 2011:177–186.
  23. Mathioudakis M, Koudas N, Marbach P: Early online identification of attention gathering items in social media. In Proceedings of the third ACM international conference on web search and data mining (WSDM). ACM, New York; 2010:301–310.
  24. Kooti F, Yang H, Cha M, Gummadi KP, Mason WA: The emergence of conventions in online social networks. AAAI conference on weblogs and social media (ICWSM) 2012.
  25. Kwak H, Lee C, Park H, Moon S: What is Twitter, a social network or a news media? In Proceedings of the 19th international conference on world wide web (WWW). ACM, New York; 2010:591–600.
  26. Schifanella R, Barrat A, Cattuto C, Markines B, Menczer F: Folks in folksonomies: social link prediction from shared metadata. In Proceedings of the third ACM international conference on web search and data mining. ACM, New York; 2010:271–280.
  27. Arthur S, Sheffrin SM: Economics: principles in action. Prentice Hall, New York; 2003.
  28. Wang X, Smith K, Hyndman R: Characteristic-based clustering for time series data. Data Min Knowl Discov 2006, 13(3):335–364. 10.1007/s10618-005-0039-x
  29. Fu T-C: A review on time series data mining. Eng Appl Artif Intell 2011, 24(1):164–181. 10.1016/j.engappai.2010.09.007
  30. Geurts P: Pattern extraction for time series classification. In Principles of data mining and knowledge discovery. Springer, Berlin; 2001:115–127.
  31. Warren Liao T: Clustering of time series data - a survey. Pattern Recognit 2005, 38(11):1857–1874. 10.1016/j.patcog.2005.01.025
  32. Lin J, Keogh E, Wei L, Lonardi S: Experiencing SAX: a novel symbolic representation of time series. Data Min Knowl Discov 2007, 15(2):107–144. 10.1007/s10618-007-0064-z
  33. Lee CHL, Liu A, Chen WS: Pattern discovery of fuzzy time series for financial prediction. IEEE Trans Knowl Data Eng 2006, 18(5):613–625.
  34. Härdle W, Vieu P: Kernel regression smoothing of time series. J Time Ser Anal 1992, 13(3):209–232. 10.1111/j.1467-9892.1992.tb00103.x
  35. Assfalg J, Kriegel HP, Kroger P, Kunath P, Pryakhin A, Renz M: Similarity search on time series based on threshold queries. Advances in database technology - EDBT 2006, 276–294.
  36. Palshikar G: Simple algorithms for peak detection in time-series. Proceedings of the 1st international conference on advanced data analysis, business analytics and intelligence (ADABAI) 2009.
  37. Basseville M, Nikiforov IV: Detection of abrupt changes: theory and applications. Prentice Hall, New York; 1993.
  38. Chen J, Gupta AK: Parametric statistical change point analysis: with applications to genetics, medicine, and finance. Birkhäuser, Basel; 2011.
  39. Lin J, Li Y: Finding structural similarity in time series data using bag-of-patterns representation. In Scientific and statistical database management. Springer, Berlin; 2009:461–477.

Copyright

© Vaca Ruiz et al.; licensee Springer. 2014

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.