Modeling dynamics of attention in social media with user efficiency
© Vaca Ruiz et al.; licensee Springer. 2014
Received: 19 December 2013
Accepted: 13 February 2014
Published: 4 March 2014
The evolution of online social networks is driven by the need of their members to share and consume content, resulting in a complex interplay between individual activity and attention received from others. In a context of increasing information overload and limited resources, discovering the most successful behavioral patterns for attracting attention is very important. To shed light on the matter, we look into the patterns of activity and popularity of users in the Yahoo Meme microblogging service. We observe that a combination of different types of social and content-producing activity is necessary to attract attention, and that the efficiency of users, namely the average attention received per piece of content published, has a defined trend in its temporal footprint for many users. The analysis of the user time series of efficiency shows different classes of users whose different activity patterns give insights into the type of behavior that pays off best in terms of attention gathering. In particular, sharing content with high spreading potential and then supporting the attention raised by it with social activity emerges as a frequent pattern for users gaining efficiency over time.
In this paper, we address questions that focus on social media users’ behavior at different stages of their participation in social media platforms. In particular, we introduce a new way to examine attention dynamics, and from this perspective perform a deep analysis of the evolution of user activity and attention in a social network from its beginning until the service ceased to exist. Analyzing the weekly efficiency, i.e., the amount of attention received in the platform normalized by the amount of content produced, we observe that 56% of the users in the dataset exhibit a footprint of their efficiency with a clearly defined trend (i.e., sharply increasing/decreasing or peaking). We are able to extract patterns of user behavior from these temporal footprints that reveal differences in the activity behavior of users of different classes. We focus our analysis on Yahoo Meme, a microblogging service that was launched by Yahoo in 2009 and discontinued in 2012. While the mechanisms of interaction in Yahoo Meme were similar to those found in other social media platforms, to the best of our knowledge, this is the first study that examines in detail the questions we are addressing from the perspective of user efficiency, using data from a service from its initial launch.
The main contributions of this work include:
Study of the attention dynamics in social networks from the angle of efficiency, namely the ratio between attention received and activity performed. The notion of efficiency in time allows us to detect patterns that could not emerge using other raw popularity or activity indicators.
Definition of a method to classify noisy time series of user-generated events. The method is successfully used to find classes of users based on the time series of their efficiency scores, with an accuracy ranging from 0.85 to 0.93, depending on the class.
Extraction of insights useful to detect and prevent user churn. For instance, exploration of the efficiency time series reveals that an increase in efficiency is determined by the creation of high-quality content, but the acquired attention has to be sustained with additional social activity to keep the efficiency high. If such social exchange is missing, the attention received drops very quickly.
Much effort has been spent lately in measuring the effect that the activity of content production and sharing has in influencing the actions of social media participants. Depending on whether the investigation adopts the perspective of the user who is sharing or of the content being shared, emphasis has been given to the characterization of either the influential users or the process of information spreading along social connections.
Different methods to identify influentials, namely individuals who seed viral information cascades, have been proposed recently, and it has been observed that simple measures such as the raw number of social connections are not good predictors of influence potential [5–7]. Instead, the ease of propagation of a piece of content is correlated with many other features, including the position of the content creator in the social network, demographic factors [9, 10], and the sentiment conveyed in the message.
As for content-centered analysis, much attention has been devoted to the study of the structure and diffusion speed of information cascades in social and news media [12–14], including Yahoo Meme [14, 15]. Weng et al., for instance, have shown that triadic closure helps to explain link formation in the early stages of a user’s lifetime, but that later in time the information flow becomes the driver for new connections. Despite the difficulty of determining whether observed cascades are generated by a real influence effect (unless performing controlled experiments), the role of influence in social network dynamics is widely recognized, albeit not fully understood. Factors related to influence include geolocation, visibility of the content, and exogenous factors such as major geopolitical or news events for news media [18–20].
Patterns of temporal variation of popularity have been investigated previously, mostly focusing on the attention given to pieces of user-generated content. Previous work includes the characterization of the peakiness and saturation of video popularity on YouTube in relation to content visibility, the dependence of crowd productivity on the attention gathered by videos, the classification of bursty Twitter hashtags in relation to topic detection tasks, and the clustering of hashtag popularity histograms based on their shape. Time series have been used to predict popularity in blogs, where the early reactions of the crowd to a piece of content are strongly correlated with the expected overall popularity [3, 23].
In this work we focus on users as opposed to content, and we analyze time series of a metric combining user activity and attention received. We do not focus on the popularity gained at a global scale; instead, we characterize temporal patterns of activity and attention of each individual. We show that time series of individual user activity cannot be clustered accurately based on their shapes by state-of-the-art methods, so we propose an algorithm that addresses this limitation. Finally, except in rare cases, previous work on network analysis has relied mostly on limited temporal snapshots. In contrast, we use the temporal data of the entire life-span of Meme, from its release date until its shutdown.
Followers network statistics
Activity and attention are the two dimensions we aim to examine with our study. After defining the features, we look at their relationship in terms of correlations of their raw indicators and then we study them from a novel perspective by defining a metric of user efficiency. We find that very efficient users tend to write fewer posts per week but are heavily involved in social activities such as commenting.
We define activity and attention indicators that are computed for every user. Activity indicators are measured by the number of posts (pd), reposts (rd), and comments done (cd), or by the number of new followees added (fwee), while attention is determined by the number of reposts (rr) or comments received (cr) from others, and by the number of new incoming follower links (fw). Reposts received can be direct or indirect (i.e., reposting a repost). To measure attention we consider direct reposts.
The possibility of indirect reposting gives rise to repost cascades that can be modeled as trees rooted in the original post and whose descendants are the direct (depth 1) and indirect (depth 2 to the leaves) reposts. Besides being another attention indicator, the cascade size (cs) is a good proxy for the perceived interestingness of the content because, intuitively, sharing a piece of content originated by someone who is not directly linked through a social tie, and therefore is likely to be unknown to the reposter, implies a higher likelihood that the reposter was interested in that piece of content. Therefore, we consider the cascade size as a measure of content interestingness.
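To make the distinction between direct reposts (rr) and cascade size (cs) concrete, the following minimal sketch walks a repost tree. The input layout (a list of `(repost_id, parent_id)` pairs) and the function name are hypothetical, and the sketch assumes that the cascade size counts all reposts below the root without counting the original post itself, which the text does not specify.

```python
from collections import defaultdict


def cascade_stats(root_id, reposts):
    """Return (direct_reposts, cascade_size) for the post `root_id`.

    `reposts` is a hypothetical list of (repost_id, parent_id) pairs, where
    parent_id is the post or repost that was reposted.
    Assumption: the cascade size excludes the root post itself.
    """
    children = defaultdict(list)
    for repost_id, parent_id in reposts:
        children[parent_id].append(repost_id)

    # Direct reposts are the depth-1 nodes of the tree.
    direct = len(children.get(root_id, []))

    # Cascade size counts every descendant, at any depth.
    size = 0
    stack = [root_id]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            size += 1
            stack.append(child)
    return direct, size


example = [("r1", "p0"), ("r2", "p0"), ("r3", "r1"), ("r4", "r3")]
direct, size = cascade_stats("p0", example)
```

In this toy cascade the post `p0` has two direct reposts, while the two deeper reposts only contribute to the cascade size.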
Even though several measures of influence, authoritativeness, or more generally importance of a user in a networked system have been developed in the past (see for instance the work by Romero et al.), here we adopt the perspective of a single user, rather than of the whole community. Therefore, we interpret the system as a black box that receives input from a user (activity) and returns some output (attention), without considering the actual effect that the input causes inside the system. Although this is a simplification, it allows us to better focus on the user dimension and to cluster users with respect to the perception they get from the interaction with the system (i.e., attention in exchange for activity).
When dealing with multidimensional behavioral data, detecting causation between events can be difficult, but potential mechanisms driving the interactions between the different dimensions at play can be spotted through the investigation of correlations. In this case, the correlations between activity and attention metrics give a first hint about the potential payoff of some user actions in terms of attention received.
Statistics for the number of users considered in each bucket of the heatmaps depicting the correlations between activity and popularity metrics (Figure 1)
First, we observe that attention in terms of followers and comments (Figures 1(A)-(B)) is correlated with both the number of posts and the number of comments done, resulting in a color gradient becoming brighter when transitioning from the lower-left corner to the upper-right one. Users who gained more followers were heavier content producers, and an even more evident correlation is found when considering comments received (Figure 1(B)), likely due to a comment reciprocity tendency (we measured a comment reciprocity of around 24%, much higher than the reciprocity in the follower network). We observe a partially similar effect when looking at content-centered indicators, namely the reposts received and the cascade size (Figures 1(C)-(D)). In these cases we find a positive correlation with the number of posts, but not with the number of comments, suggesting that social interaction, such as commenting on other people’s posts, does not strongly characterize content propagation.
The two plots on the right of Figure 1 show the relation between pairs of attention metrics and the number of posts. From Figure 1(E) we learn that social exposure (i.e., being followed) and productivity (i.e., number of posts) are both heavily correlated with the number of reposts. However, people with moderate or heavy posting activity can reach a high level of attention even with a relatively small audience (as shown by the bright colors extending down along the right side of the map). This intuition is confirmed by the fact that, when swapping the axes of the two attention measures, the correlation is disrupted (Figure 1(F)), meaning that people with a high number of posts and reposts do not necessarily have a large number of followers.
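The comment reciprocity figure quoted above can be illustrated with a short sketch. The text does not state which definition of reciprocity was used, so the code below assumes a common one: the fraction of directed links (u, v) for which the reverse link (v, u) also exists. The edge-list format and the function name are hypothetical.

```python
def reciprocity(edges):
    """Fraction of directed edges whose reverse edge is also present.

    `edges` is a hypothetical iterable of (source, target) pairs, e.g.
    (commenter, author of the commented post). Self-loops and duplicate
    edges are ignored. The definition itself is an assumption.
    """
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


comment_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
score = reciprocity(comment_edges)
```

Computing the same quantity on the follower edge list would allow the two networks to be compared on equal terms.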
Analogous definitions have been used in different disciplines such as physics and economics, and in most cases the efficiency is upper bounded by 1, i.e., the outcome from the system cannot exceed the energy given in input. In contrast, in a social media setting the efficiency is unbounded and it constitutes an objective function to maximize in order to increase the engagement of the user base. Even if comments can be strong indicators of involved user participation, the main focus of the online service under study is posting and reposting, similarly to Twitter. Therefore we always consider the number of posts as the metric of activity in the efficiency formula. In the above definition (Formula (1)) we assume that the attention taken into account should be the one that is directly triggered by the activity considered. We therefore use either the number of reposts (rr) or the number of comments (cr) received as proxies for attention, since other metrics such as the number of followers are not necessarily responses to the posting activity.
Insightful patterns emerge. First, the higher the efficiency, the lower the activity in terms of number of posts, except in the efficiency range containing most of the users, in which the number of posts grows with the efficiency. However, when looking at the average number of posts submitted per week instead, the trend becomes monotonic, supporting the theory that the limited attention of the audience is a barrier to attention gathering. Second, the higher the efficiency, the higher the number of comments: the more efficient users are the ones who comment the most. Finally, the longevity of the profile and the prestige in the follower network (computed with standard PageRank) are also distinctive features of efficient users.
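As a minimal sketch of the efficiency notion discussed here, the function below divides an attention count by the number of posts, following the description of efficiency as the average attention received per piece of content published. The function name and the choice to return `None` for users without posts are assumptions made for illustration.

```python
def efficiency(attention_received, posts_done):
    """Attention received per post published.

    `attention_received` can be the number of direct reposts (rr) or of
    comments received (cr); `posts_done` is the number of posts (pd).
    Returning None when there are no posts is an assumption of this sketch.
    """
    if posts_done == 0:
        return None
    return attention_received / posts_done


repost_efficiency = efficiency(30, 10)   # 30 reposts received over 10 posts
comment_efficiency = efficiency(4, 10)   # 4 comments received over 10 posts
```

Unlike physical efficiency, the repost-based value in this toy example is larger than 1, which is the unbounded behavior the paragraph describes.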
The attention attracted by users, and consequently their efficiency, is not constant in time. It depends on the amount of activity, the position in the network, and other factors. However, we show next that, even if many users exhibit oscillating but globally stable values of efficiency in time, more than half of the users show sharp variations in their efficiency time series, which tell more about the activity behavior in different periods of the user lifetime. First, we give the definition of efficiency time series. Then, we explain the algorithm used to classify users’ efficiency traces according to the shape of their trend and discuss the properties of the four classes we found. We (i) find that state-of-the-art algorithms for clustering time series do not perform well on noisy traces such as the ones generated by human activity; therefore, based on the observed shapes, (ii) we propose a new classification method and evaluate it against a human-curated ground truth, and (iii) we analyze the differences between user behaviors in the four main user efficiency classes around the main changepoint of the efficiency curve.
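The efficiency time series of a user u is the sequence of weekly efficiency values

\[
\eta_u(w_i) = \frac{rr(P_u^{w_i})}{|P_u^{w_i}|}, \qquad w_i \in W_u,
\]

where $P_u^{w_i}$ represents the set of posts published by user u on week $w_i$, $rr(P_u^{w_i})$ is the total number of direct reposts received in the user’s lifetime for the set of posts $P_u^{w_i}$, and $W_u$ is the sorted list of weeks in which the user u published at least one post.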
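The weekly efficiency series described here can be sketched as follows: posts are grouped by publication week, and for each active week the direct reposts eventually received by that week's posts are divided by the number of posts. The input format (a list of `(week, direct_reposts)` pairs, one per post) and the function name are hypothetical.

```python
from collections import defaultdict


def efficiency_time_series(posts):
    """Build the weekly efficiency series of one user.

    `posts` is a hypothetical list of (week, direct_reposts) pairs, one per
    post, where direct_reposts is counted over the whole user lifetime.
    Only weeks with at least one post appear in the result.
    """
    per_week = defaultdict(lambda: [0, 0])  # week -> [reposts, post count]
    for week, direct_reposts in posts:
        per_week[week][0] += direct_reposts
        per_week[week][1] += 1
    return [(week, per_week[week][0] / per_week[week][1])
            for week in sorted(per_week)]


user_posts = [(3, 4), (1, 2), (1, 0), (3, 6), (7, 5)]
series = efficiency_time_series(user_posts)
```

Weeks without posts (here weeks 2, 4, 5 and 6) are simply absent from the series, in line with the definition of the sorted list of active weeks.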
Characterizing users based on the temporal behavior of their efficiency requires automatically extracting patterns from the generated time series. There are two main families of state-of-the-art methods for this task. The first one includes feature-based approaches that cluster series based on their kurtosis, skewness, trend, and chaos. The second one includes area-under-the-curve methods [29–31] that consist of dividing the time series into equally sized fragments, measuring the area under the curve in each fragment, representing the time series as a vector of such quantities, and then applying a clustering algorithm over them (specifically, we used k-means). We first tried those state-of-the-art methods to cluster the efficiency time series. For the sake of brevity we do not report the results extensively, but both feature-based approaches and area-under-the-curve methods produce clusters containing extremely heterogeneous curves, as we assessed by manual inspection. In addition, we also tried a separate approach, proposed a few years ago, that transforms the curves through Piecewise Aggregate Approximation and Symbolic Aggregate Approximation and then clusters the resulting representations with k-means. This method also led to very imbalanced clusters, with 99% of the curves put in a single cluster. The main issue with those approaches is that they have been tested in the past mainly on synthetic time series. When time series represent the activity of single actors, they may have an extremely broad variety of lengths, shapes, and oscillations that the mentioned methods are not able to handle properly.
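For readers unfamiliar with the fixed-length representations mentioned above, here is a minimal sketch of a Piecewise Aggregate Approximation step: the series is cut into a fixed number of contiguous segments and each segment is replaced by its mean. This is a simplified variant written for illustration (segment boundaries are assigned by integer arithmetic), not the exact procedure used in the cited work, and the function name is hypothetical.

```python
def paa(series, n_segments):
    """Simplified Piecewise Aggregate Approximation.

    Maps a series of arbitrary length to `n_segments` segment means, so that
    series of different lengths become comparable fixed-length vectors.
    Assumption: point i is assigned to segment floor(i * n_segments / len).
    """
    n = len(series)
    if n_segments <= 0 or n < n_segments:
        raise ValueError("need at least one point per segment")
    sums = [0.0] * n_segments
    counts = [0] * n_segments
    for i, value in enumerate(series):
        segment = i * n_segments // n
        sums[segment] += value
        counts[segment] += 1
    return [s / c for s, c in zip(sums, counts)]


vector = paa([1, 2, 3, 4, 5, 6], 3)
```

The resulting vectors could then be handed to a clustering algorithm such as k-means, which is the setting in which the text reports heterogeneous or imbalanced clusters.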
Smoothing. Apply the kernel regression estimator of Nadaraya and Watson to the user temporal data to obtain a smoothed time series t. The smoothing process gets rid of very sharp and punctual fluctuations, which are very frequent in human activity time series. Examples of raw curves compared to their smoothed versions are shown in Figure 4 (bottom).
Linguistic transform. Generate a qualitative representation of the time series t for a user u using three states: High, Medium, Low (H, M, L). We empirically set the threshold for high values to 0.6 and for medium values to 0.3 (i.e., values greater than 60% of the maximum efficiency reached by the user are considered High). The idea of using threshold values is supported by previous work in time-series segmentation.
Fluctuation reduction. Search for contiguous subsequences of a given state and drop the subsequences whose length is less than 10% of the total length. Similarly to the smoothing procedure, this step helps to eliminate noisy fluctuations in the time series. For example, in the series HHHMHHMMMLLL, the fourth element, M, is dropped.
String collapsing. Collapse the string representation of t by replacing subsequences of the same state with a single symbol of the same type. For instance, the resulting series from the previous example, HHHHHMMMLLL, is transformed to HML.
Detection of Increasing/Decreasing classes. Look for collapsed sequences with just two groups of symbols and classify as “Increasing” a sequence transitioning from L or M to the state H and as “Decreasing” those transitioning from H to L or M. The second and third columns in Figure 4 show the threshold for High values as a dotted red line.
Detection of Peaky class. For the unclassified series, find those exhibiting a peaky shape by looking at outliers in the series whose value is higher than x times the average value. This method has been successfully used before in the context of Twitter. Other methods for peak detection that we tested find just local peaks, which are very frequent in noisy time series.
Detection of changepoint. Accurately locating the point at which a curve transitions between different levels is important to study the behavior of users in their single activity and popularity metrics around the point in time when these changes occur. For the peak type curves, the changepoint is intuitively defined by the highest peak, whereas for the increasing and decreasing types the point is identified by the time at which the linguistic representation of the series transitions from H to M or L status (decreasing) or from L or M to H status (increasing). For the sake of comparison, we match our simple technique with the statistical change point analysis recently proposed by Chen et al. We find that, although for most time series the values from the two methods were very close (at most 1 or 2 weeks difference in around 80% of the cases), the statistical changepoint detection often identifies points right before or right after a change of efficiency.
Detection of Steady class. The remaining time series are classified as steady.
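The steps listed above can be sketched end to end as follows. This is an illustrative reading of the procedure, not the authors' implementation. Several choices are assumptions that the text leaves open: a Gaussian kernel with `bandwidth=1.0` for the Nadaraya-Watson smoother, thresholds applied to the smoothed value relative to its maximum, peak detection performed on the raw series, and a hypothetical default `peak_factor=3.0` for the multiplier x.

```python
import math
from itertools import groupby


def nw_smooth(values, bandwidth=1.0):
    """Nadaraya-Watson estimate at each index (Gaussian kernel: assumption)."""
    smoothed = []
    for i in range(len(values)):
        weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2)
                   for j in range(len(values))]
        total = sum(w * v for w, v in zip(weights, values))
        smoothed.append(total / sum(weights))
    return smoothed


def to_states(series, high=0.6, medium=0.3):
    """Linguistic transform: H/M/L relative to the maximum of the series."""
    peak = max(series)
    if peak <= 0:
        return "L" * len(series)
    return "".join("H" if v / peak > high else "M" if v / peak > medium else "L"
                   for v in series)


def reduce_fluctuations(states, min_fraction=0.1):
    """Drop runs of one state shorter than min_fraction of the total length."""
    min_len = min_fraction * len(states)
    kept = []
    for _, run in groupby(states):
        run = list(run)
        if len(run) >= min_len:
            kept.extend(run)
    return "".join(kept)


def collapse(states):
    """Replace each run of identical symbols with a single symbol."""
    return "".join(symbol for symbol, _ in groupby(states))


def classify(raw, peak_factor=3.0):
    """Return 'Increasing', 'Decreasing', 'Peaky' or 'Steady'."""
    if not raw:
        raise ValueError("empty series")
    collapsed = collapse(reduce_fluctuations(to_states(nw_smooth(raw))))
    if len(collapsed) == 2:
        if collapsed[1] == "H":   # L or M followed by H
            return "Increasing"
        if collapsed[0] == "H":   # H followed by L or M
            return "Decreasing"
    mean = sum(raw) / len(raw)
    # Peak test on the raw series (assumption; peak_factor is hypothetical).
    if mean > 0 and max(raw) > peak_factor * mean:
        return "Peaky"
    return "Steady"


label = classify([1] * 10 + [10] * 10)  # a sharp upward step
```

The intermediate helpers reproduce the worked example from the list: `reduce_fluctuations("HHHMHHMMMLLL")` yields `HHHHHMMMLLL`, which `collapse` turns into `HML`.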
As in most previous work, in the absence of an automatic way to compute the quality of the classes, two of the authors annotated a random sample of time series per class to assess the quality of our algorithm. Since the expected shapes of the curves for each class are very clear (see examples in Figure 4), a human evaluator can decide with certainty whether the instances from the sample match the expected template. The outcome of the labeling is very encouraging, with 93% correct instances in the Decreasing class, 86% in the Increasing class, and 85% in the Peak class, and almost perfect agreement between evaluators (measured with Fleiss’ kappa). For the Steady class, where shapes can vary considerably, we labeled as misclassified any curve belonging to one of the other classes. We found a low portion of misclassified instances (12%). We observe that the users in the Steady class are around 44%, meaning that 56% of the users exhibit a temporal footprint of the efficiency curve that has a clearly defined trend. This finding has important practical implications, since it means that the majority of users could be accurately profiled as having consistently increasing or decreasing efficiency patterns.
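Inter-annotator agreement of the kind reported here is commonly computed with Fleiss' kappa. The sketch below implements the standard formula for a table of category counts per item; the data layout, the function name, and the toy ratings are hypothetical and are not the annotations used in the study.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of category counts.

    `ratings` is a list of rows, one per rated item; row[j] is the number of
    raters who assigned the item to category j. Every item must be rated by
    the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Observed agreement: average pairwise agreement per item.
    per_item = [(sum(c * c for c in row) - n_raters)
                / (n_raters * (n_raters - 1)) for row in ratings]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall category proportions.
    proportions = [sum(row[j] for row in ratings) / (n_items * n_raters)
                   for j in range(n_categories)]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        return 1.0  # all ratings fall in one category
    return (p_observed - p_expected) / (1 - p_expected)


# Two raters, two categories (matches template / does not match).
toy_ratings = [[2, 0], [2, 0], [0, 2], [1, 1]]
kappa = fleiss_kappa(toy_ratings)
```

With two raters the statistic reduces to a chance-corrected version of the simple agreement rate, which is why a value close to 1 is read as almost perfect agreement.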
Accurately locating the point at which a curve transitions between different levels is important to characterize the user behavior when the user’s efficiency significantly increases or drops, thus allowing us to study how single activity and popularity metrics vary when these changes occur. Changepoint detection refers to the problem of finding time instants where abrupt changes occur. Except for the steady time series, which denote a user behavior that is quite constant in time (or for which transitions to higher or lower efficiency levels are much slower), all the other three types have a changepoint at which the efficiency trend changes radically in a relatively short period of time compared to the total length of the user lifetime. For the peak type curves, the changepoint is intuitively defined by the highest peak, whereas for the increasing and decreasing types the point is identified by the time at which the linguistic representation of the series transitions from H to M or L status (decreasing) or from L or M to H status (increasing). More general methods to identify changepoints relying on changes in mean and variance have been proposed in the past. For the sake of comparison, we match our simple technique with the statistical change point analysis recently proposed by Chen et al. We find that, although for most time series the values from the two methods were very close (at most 1 or 2 weeks difference in around 80% of the cases), the statistical changepoint detection sometimes identifies points right before or right after a change of efficiency. In fact, the generality of statistical methods is not a plus in cases in which the set of input curves is quite homogeneous and for which ad hoc methods prove more reliable. For this reason, we use our definition of changepoint.
Once users with similar profiles in their temporal efficiency evolution have been grouped, time series are analyzed to identify meaningful changepoints.
For each detected class, we first perform an analysis in aggregate over all the users and then we characterize the evolution of the same metrics in time. We find that (i) publishing interesting content helps to boost the efficiency of the subsequent posts through attention gathering and that (ii) the efficiency gained in that way should be sustained by intense social activity to prevent it from dropping.
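The changepoint rule described in this paragraph can be sketched as a small lookup over the class label. The code assumes that the H/M/L string is aligned index by index with the series, and, following the step list of the classification method, that a decreasing curve is one that leaves the H state. The function name and argument layout are hypothetical.

```python
def changepoint(label, series, states):
    """Index of the changepoint for one user, or None for steady curves.

    `series` is the (smoothed) efficiency series and `states` its H/M/L
    string of the same length (alignment is an assumption of this sketch).
    """
    if label == "Peaky":
        # The highest peak defines the changepoint.
        return max(range(len(series)), key=series.__getitem__)
    if label == "Increasing":
        # First position where the series enters the H state.
        for i in range(1, len(states)):
            if states[i] == "H" and states[i - 1] != "H":
                return i
    if label == "Decreasing":
        # First position where the series leaves the H state.
        for i in range(1, len(states)):
            if states[i] != "H" and states[i - 1] == "H":
                return i
    return None


cp = changepoint("Increasing", [1, 1, 2, 8, 9], "LLLHH")
```

A statistical detector based on changes in mean and variance could be run on the same series to reproduce the comparison mentioned in the text.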
Activity, popularity and longevity indicators for the four user classes
Here we investigate more deeply how users in each class distribute their activity in time. We perform an analysis around the changepoint of the efficiency curve to see whether the different temporal patterns can explain why their efficiency level changed over time. We decompose the time series into different phases and study the relations between them in terms of the activity and attention indicators. Specifically, we do so for all the users belonging to the classes where the changepoint is defined (i.e., all but the Steady class).
Let us define three user-dependent time steps: the week in which the user activity started, the week of the changepoint of the efficiency curve, and the week of the end of the activity, after which no other action is performed by the user. Accordingly, we define three phases of the user lifespan, referred to as Before, CP, and After, which represent, respectively: the weeks between the start of the activity and the changepoint, the changepoint week, and the weeks between the changepoint and the end of the activity. We calculate the average weekly amount of activity and attention metrics during these three macro-aggregates of weeks. The three values obtained for each indicator capture the variation of activity and attention when approaching the critical point at which a consistent change of efficiency is detected.
To detect the variation of the values in the three phases we compute two ratios for each user: (a) RatioCP = activity-or-attention metric measured in CP divided by the same metric computed in Before, and (b) RatioAfter = activity-or-attention metric measured in After divided by the same metric during Before. Ratios are then averaged over all the users of each class. Comparison of ratios between different user classes reveals the key differences between them: values above 1 mean that the value of the indicator grew in the CP or in the After phase compared to the Before phase. Final results for different activity and attention indicators are reported in Figure 5. For instance, in Figure 5(a), we observe that RatioAfter is above 1 only for users in the Peaky class. It means that the users in that class have published more posts after the changepoint than they did before it. We can summarize our findings as follows:
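A minimal sketch of the three-phase aggregation follows. It assumes that the Before phase runs from the first active week up to, but not including, the changepoint week, that the After phase starts the week after the changepoint, and that weeks with no recorded value count as zero. These boundary conventions, the function name, and the input format (a dictionary from week number to the weekly value of one metric) are assumptions of this illustration.

```python
def phase_averages(weekly, start_week, cp_week, end_week):
    """Average weekly value of one metric in the Before, CP and After phases.

    `weekly` maps week number -> metric value (e.g. posts done that week);
    missing weeks are treated as 0. Phase boundaries are assumptions:
    Before = [start_week, cp_week - 1], After = [cp_week + 1, end_week].
    """
    def average(first, last):
        n_weeks = last - first + 1
        if n_weeks <= 0:
            return None  # empty phase
        total = sum(weekly.get(week, 0) for week in range(first, last + 1))
        return total / n_weeks

    return {
        "Before": average(start_week, cp_week - 1),
        "CP": average(cp_week, cp_week),
        "After": average(cp_week + 1, end_week),
    }


posts_per_week = {1: 2, 2: 4, 4: 10, 5: 3, 6: 3}
phases = phase_averages(posts_per_week, start_week=1, cp_week=4, end_week=6)
```

The same function can be applied to each activity and attention indicator in turn, giving three values per indicator and per user.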
Activity and attention at CP. Users of all classes maintain a similar trend in the number of posts done in CP, with a slight increase in the case of the Increasing class (Figure 5(a)). For the Peak and Increasing classes, the number of reposts received, the cascade size, and the number of followers increase significantly in CP compared to Before (Figure 5(d), (e), (f) respectively). Since reposts received and cascade size are proxies for content interestingness, this indicates the production of content that attracts the attention of a much higher number of users. For both classes, this is the most likely cause of the rise of their efficiency at CP. For the Decreasing class the attention values start dropping instead. Finally, unlike the other classes, users in the Increasing class produce a higher number of comments in CP (Figure 5(b)).
Social activity after CP. In the After phase, social interactions such as the number of comments and the addition of new followees increase considerably compared to Before for the Increasing class (Figures 5(b), (c)), while they remain stable or decrease slightly for the Peak class. The values of the Decreasing class drop in this case as well.
Content production activity after CP. The reverse scenario is found when looking at the posting activity. In the After phase, users in the Peak class post messages at a higher rate than Before (Figure 5(a)), while the posting activity of the Increasing class drops in favor of a higher attention to social interaction.
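The ratios behind these findings compare the CP and After phases with the Before phase, and are then averaged over the users of a class. The sketch below takes, for each user, the three phase averages of one indicator. Skipping users whose Before value is zero or missing is an assumption made here to avoid division by zero; the text does not say how such cases were handled. Function names are hypothetical.

```python
def phase_ratios(before, cp, after):
    """Return (RatioCP, RatioAfter) for one user and one indicator.

    Returns None when the Before value is zero or missing (assumption).
    """
    if not before:
        return None
    return cp / before, after / before


def class_average_ratios(users):
    """Average RatioCP and RatioAfter over the users of one class.

    `users` is a list of (before, cp, after) triples, one per user.
    """
    ratios = [r for r in (phase_ratios(*u) for u in users) if r is not None]
    if not ratios:
        return None
    n = len(ratios)
    return (sum(r[0] for r in ratios) / n, sum(r[1] for r in ratios) / n)


toy_class = [(2.0, 10.0, 3.0), (4.0, 4.0, 8.0), (0.0, 1.0, 1.0)]
ratio_cp, ratio_after = class_average_ratios(toy_class)
```

Values above 1 indicate that the indicator grew with respect to the Before phase, which is how the class comparisons above are read.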
The main lesson learned from the above findings is that the submission of pieces of “interesting” content, namely posts that attract the attention of a wider audience than usual, is the trigger for the transition to higher efficiency levels. However, efficiency cannot be maintained without cost. Increasing engagement in social activity and expanding the potential audience turns out to be an effective strategy not to lose efficiency. Conversely, producing more content without reinforcing the social relationships with the potential consumers of the content results in a rapid drop of efficiency to the original levels. The difference between the Increasing and Peaky classes is particularly striking: Increasing-type users fully exploit social activity, with 17% more followees, 23% more comments, and 61% more reposts after their changepoint, while Peaky-type users keep their activity approximately stable (except for an increase of reposts done). Moreover, as expected, when a status of equilibrium between attention received and activity is disrupted by an arbitrary reduction of productivity and social interactions, the efficiency is destined to fade quickly.
We explored the interplay between activity and attention in Yahoo Meme by defining the notion of user efficiency, namely the amount of attention received in relation to the content produced. We find that, unlike the raw attention measures, efficiency has a strong negative correlation with the amount of user activity, and that users who are involved in social activities such as commenting have higher centrality in the social network than average but are not necessarily heavy content producers.
However, if we consider commenting as a form of content creation, we observe that commenting takes less effort than creating a post but can frequently be more effective. This is because reciprocity plays a role, and the comment network exhibits a higher reciprocity than the follower network. Users can thus benefit from the visibility of a post whenever they comment on it.
We classify the time series of user efficiency into four main classes (sharp increasing/decreasing steps, peaks, or stable trend) with a novel algorithm that overcomes the limitations of previous approaches. By analyzing the variation of activity and attention around the changepoints of the time series, we find evidence that user efficiency is boosted by a particular combination of production of interesting content and constant social interactions (e.g., comments). In these cases, users gather the attention of a wider audience by publishing content with higher spreading potential and then manage to keep the attention high through regular and intensified social activity. These insights find direct application in the detection and prevention of user churn: users who increase their efficiency but are frustrated by not being able to keep it high can be detected and helped, either by recommending them social activities or by pushing their contacts to interact with them. The task of churn prediction is a natural continuation of the present work that we plan to address in the future.
This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975. Carmen Vaca’s research work has been funded by ESPOL and the Ecuadorian agency SENESCYT. We would like to thank Amin Mantrach, Neil O’Hare, Daniele Quercia, and Rossano Schifanella for the useful discussions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.