Positive words carry less information than negative words
© Garcia et al.; licensee Springer. 2012
Received: 27 January 2012
Accepted: 18 May 2012
Published: 18 May 2012
We show that the frequency of word use is determined not only by word length and average information content, but also by emotional content. We have analyzed three established lexica of affective word usage in English, German, and Spanish, and verified that these lexica have a neutral, unbiased emotional content. Taking the frequency of word usage into account, however, we find that words with a positive emotional content are used more frequently. This lends support to the Pollyanna hypothesis, which posits a positive bias in human expression. We also find that negative words carry more information than positive words, as the informativeness of a word increases uniformly as its valence decreases. Our findings support earlier conjectures about (i) the relation between word frequency and information content, and (ii) the impact of positive emotions on communication and social links.
PACS Codes: 89.65.-s, 89.70.Cf, 89.90.+n
One could argue that human languages, in order to facilitate social relations, should be biased towards positive emotions. This question is particularly relevant for sentiment classification, as many tools assume as a null hypothesis that human expression has neutral emotional content [4, 5], or reweight positive and negative emotions without quantifying the positive bias of emotional expression. We have tested and measured this bias in the context of online written communication by analyzing three established lexica of affective word usage. These lexica cover three of the most widely used languages on the Internet, namely English, German, and Spanish. The emotional content averaged over all the words in each of them is neutral. Considering, however, the everyday usage frequency of these words, we find that the overall emotion of the three languages is strongly biased towards positive values, because words associated with a positive emotion are used more frequently than those associated with a negative emotion.
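The core of this measurement can be sketched in a few lines: a lexicon whose plain average valence is neutral can still yield a positive frequency-weighted average. The mini-lexicon and counts below are invented for illustration (valence on the 1-9 SAM scale used by the affective lexica, with 5 as the neutral midpoint):

```python
# Hypothetical mini-lexicon: valence on the 1-9 SAM scale (5 = neutral)
valence = {"love": 8.7, "good": 7.5, "bad": 2.5, "death": 1.5}
freq = {"love": 120, "good": 300, "bad": 60, "death": 20}  # invented usage counts

unweighted = sum(valence.values()) / len(valence)  # plain lexicon mean
weighted = sum(valence[w] * freq[w] for w in valence) / sum(freq.values())

print(unweighted)  # close to the neutral midpoint
print(weighted)    # shifted upward once usage frequency is taken into account
```

Because the positive words are the frequent ones, the frequency-weighted mean sits above the unweighted lexicon mean, which is the bias measured in the article.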
Historically, word frequency was first analyzed by Zipf [1, 10], who showed that frequency predicts the length of a word as the result of a principle of least effort. Zipf’s law highlighted fundamental principles of organization in human language, and called for an interdisciplinary approach to understand its origin [12–14] and its relation to word meaning. Recently, Piantadosi et al. extended Zipf’s approach by showing that, in order to achieve efficient communication, word length increases with information content. Further discussions [15–17] highlighted the relevance of meaning as part of the communication process; for example, more abstract ideas are expressed through longer words. Our work focuses on one particular aspect of meaning, namely the emotion expressed in a word, and how it relates to word frequency and information content. This approach requires additional data beyond word length and frequency, which became available thanks to large datasets of human behaviour on the Internet. Millions of individuals write text online, and a quantitative analysis of this text can provide new insights into the structure of human language and even a validation of social theories. Sentiment analysis techniques allow us to quantify the emotions expressed in posts and messages [5, 6]. Recent studies have provided statistical analyses [20–23] and modelling approaches [24, 25] of individual and collective emotions on the Internet.
An emotional bias in written expression, however, would have a strong impact, as it shifts the balance between positive and negative expressions. Thus, for all researchers dealing with emotions in written text, it is of particular importance to know about such a bias, how it can be quantified, and how it affects the baseline, or reference point, of expressed emotions. Our investigation addresses this problem by combining two analyses: (i) quantifying the emotional content of words in terms of valence, and (ii) quantifying the frequency of word usage across the whole indexable web. We provide a study of the baseline of written emotional expression on the Internet in three languages that together span more than 67.7% of all websites: English (56.6%), German (6.5%), and Spanish (4.6%). These languages are used every day by more than 805 million users, who create the majority of the content available on the Internet.
In order to link the emotionality of each word with the information it carries, we build on the recent work of Piantadosi et al. In this way, we reveal the importance of emotional content in human communication and its influence on the information carried by words. While the rational process that optimizes communication determines word lengths by the information they carry, we find that emotional content affects word frequency such that positive words appear more often. This points towards an emotional bias in language use and supports the Pollyanna hypothesis, which asserts that there is a bias towards the usage of positive words. Furthermore, we extend the analysis of information content by taking word context into account rather than just word frequency. This leads to the conclusion that positive words carry less information than negative ones. In other words, the informativeness of a word strongly depends on its emotional polarity.
We wish to emphasize that our work distinguishes itself, both in methodology and in findings, from a recent article . There, the authors claim a bias in the amount of positive versus negative words in English, while no relation between emotionality and frequency of use was found. A critical examination of the conditions of that study shows that the quantification of emotions was done in an uncontrolled setup through the Amazon Mechanical Turk. Participants were shown a scale similar to the ones used in previous works [7–9], as explained in . Thanks to the popularity of the Mechanical Turk, the authors evaluated more than 10,000 terms from the higher frequency range in four different corpora of English expression. However, the authors did not report any selection criterion for the participant reports, as opposed to the methodology presented in , where up to 50% of the participants had to be discarded in some experiments.
Because of this lack of control in their experimental setup, the positive bias found in  could easily be explained as an acquiescence bias [30, 31], a result of the human tendency to agree in the absence of further knowledge or relevance. This bias has repeatedly been shown to exist in self-assessments of emotions [32, 33], requiring careful response formats, scales, and analyses to control for it. Additionally, the wording used to quantify word emotions in  (happiness) could introduce two further methodological biases. The first is a possible social desirability bias , as participants tend to shift their answers towards what is socially acceptable; the positive social perception of displaying happiness can influence the answers given by the participants of the study. Second, the choice of the word happiness differs from the standard psychological term valence: valence is interpreted as a static property of the word, while happiness is understood as a dynamic property of the surveyed person when exposed to the word. This kind of framing effect has been shown to have a very large influence on survey results. For example, a recent study  showed a large change in answers simply by changing “voting” to “being a voter” in a voter turnout survey.
Hence, there is a strong sensitivity to such influences, which are not controlled for in . Because of these limitations, our analysis uses the current standard lexica of word valence. These lexica, albeit limited to 1,000 to 3,000 words each, were produced in three controlled, independent setups, and provide the most reliable estimation of word emotionality for our analysis. Our results on these lexica are consistent with recent work on the relation between emotion and word frequency [37, 38] for English in corpora of limited size.
In detail, we have analyzed three lexica of affective word usage, which contain 1,034 English words, 2,902 German words, and 1,034 Spanish words, together with their emotional scores obtained from extensive human ratings. These lexica have effectively established the standard for emotion analyses of human texts . Each word in these lexica is assigned a set of values measuring different aspects of word emotionality. The three independent studies that generated the lexica for English , German , and Spanish  used the Self-Assessment Manikin (SAM) method to ask participants about the different emotion values associated with each word in the lexicon. One of these values, a scalar variable v called valence, represents the degree of pleasure induced by the emotion associated with the word, and is known to explain most of the variance in emotional meaning . In this article, we use v to quantify word emotionality.
Frequency-based information content metrics like self-information are commonly used in computational linguistics to systematically analyze communication processes. Information content is a better predictor of word length than word frequency [2, 41], and the relation between information content and meaning, including emotional content, is claimed to be crucial for the way humans communicate [15–17]. We use the self-information of a word as an estimation of its information content for a context size of 1, later building up to larger context sizes. In this way, we frame our analysis within the larger framework of N-gram information measures, aiming at an extensible approach that can be incorporated into computational linguistics and sentiment analysis.
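As a concrete reference, self-information can be computed directly from unigram frequencies. The counts below are hypothetical, not taken from the lexica or the Google dataset:

```python
import math

def self_information(count, total_tokens):
    """Self-information in bits: I(w) = -log2(f(w) / N)."""
    return -math.log2(count / total_tokens)

# Invented unigram counts, for illustration only
N = 1_000_000_000                          # total tokens in a hypothetical corpus
i_frequent = self_information(120_000, N)  # a common word
i_rare = self_information(9_000, N)        # a rare word

print(i_frequent < i_rare)  # the rarer word carries more bits
```

Because self-information is a monotonically decreasing function of frequency, any bias that makes positive words more frequent automatically makes them less informative by this measure.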
Correlations between word valence and information measurements.
Finally, we performed a control analysis using alternative frequency datasets, to account for possible anomalies in the Google dataset due to its online origin. We used the word frequencies estimated from traditional written corpora, i.e. books, as reported in the original datasets for English , German , and Spanish . Calculating self-information from these frequencies and relating it to the given valences, we obtained similar, but slightly lower, Pearson’s correlation coefficients (see Table 1). We conclude that our results are robust across different types of written communication for the three languages analyzed.
The self-information of a word w is estimated as I(w) = -log2(f(w)/N), where f(w) is the frequency of the word and N is the total amount of words in the corpus used for the estimation. The context-based values were calculated as approximations of the information content given the words surrounding w, up to context size 4.
We analyzed how word valence is related to the information content up to context size 4, using the original calculations provided by Piantadosi et al. This estimation is based on the frequency of sequences of N words, called N-grams, from the Google dataset , for context sizes up to 4. This dataset contains frequencies for single words and N-grams, calculated from an online corpus of more than a trillion tokens. The source of this dataset is the whole Google crawl, which aimed at spanning a large subset of the web, providing a wide view of how humans write on the Internet. For each context size N, we have a different estimation of the information carried by the studied words, with self-information representing the estimation for a context of size 1.
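A minimal sketch in the spirit of this measure (the average surprisal of a word given its preceding N-1 words) is shown below on a toy corpus; the corpus and function are illustrative, not the actual Google N-gram computation:

```python
import math
from collections import Counter

def context_information(tokens, word, n):
    """Average -log2 P(word | preceding n-1 words), estimated from N-gram
    counts over the token list. For n == 1 this reduces to self-information."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # (n-1)-gram counts, restricted to positions that start an n-gram
    prefix = Counter(g[:-1] for g in ngrams.elements())
    bits, occurrences = 0.0, 0
    for gram, count in ngrams.items():
        if gram[-1] == word:
            bits += -math.log2(count / prefix[gram[:-1]]) * count
            occurrences += count
    return bits / occurrences if occurrences else float("nan")

corpus = "the sun is bright the sun is warm the storm is fierce".split()
print(context_information(corpus, "sun", 1))  # unigram surprisal
print(context_information(corpus, "sun", 2))  # lower: "sun" is predictable after "the"
```

Enlarging the context generally lowers the estimate for predictable words, which is why each context size yields a different information estimate for the same word.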
The left column of Figure 3 shows how valence decreases with the estimated information content for each context size. Each bar represents the same number of words within a language and has an area proportional to the rescaled average information content carried by these words. The color of each bar represents the average valence of the binned words. The decrease of average valence with information content is similar for estimations using 2-grams and 3-grams. For 4-grams it also holds for English and Spanish, but the trend is less clear for German. These trends are quantified by Pearson’s correlation coefficients between valence and information content for each context size (Table 1). Each correlation coefficient becomes smaller for larger context sizes, as the estimation incorporates a wider context but becomes less accurate.
Additional correlations between valence, self-information and length.
Subsequently, we found that the correlation coefficient between word length and self-information is positive, showing that word length increases with self-information. These values are consistent with previous results [1, 2]. Pearson’s and Spearman’s correlation coefficients between valence and length are very low or not significant. To test the combined influence of valence and length on self-information, we calculated the partial correlation coefficients of self-information with valence controlling for length, and with length controlling for valence. The results, shown in Table 2, are within the 95% confidence intervals of the original correlation coefficients. This supports the existence of an additional dimension in the communication process, one closely related to emotional content rather than communication efficiency. This is consistent with the known result that word lengths adapt to information content , and we identify valence as an independent semantic feature: valence is related to information content but not to the symbolic representation of the word through its length.
Partial correlation coefficients between valence and information content.
Our analysis provides strong evidence that words with a positive emotional content are used more often, lending support to the Pollyanna hypothesis  for all three languages studied. Our conclusions are consistent across, and independent of, the different corpora used to obtain the word frequencies: they hold for traditional corpora of formal written text as well as for the Google dataset, and therefore cannot be attributed to artifacts of Internet communication.
Furthermore, we have pointed out the relation between the emotional and informational content of words. Words with negative emotions are used less often, but because of their rarity they carry more information, measured in terms of self-information, than positive words. This relation remains valid even when considering contexts composed of sequences of up to four words (N-grams). Controlling for word length, we find that the correlation between information and valence does not depend on length, i.e. it is indeed the usage frequency that matters.
In our analysis, we did not explore the role of syntactic rules and grammatical classes such as verbs, adjectives, etc. However, previous studies have shown the existence of a similar bias when studying adjectives and their negations . The question of how syntax influences emotional expression is beyond the scope of the present work. Note that the lexica we use are composed mainly of nouns, verbs, and adjectives, due to their emotional relevance. Function words such as “a” or “the” are not considered to have any emotional content and were therefore excluded from the original studies. In isolation, these function words carry no explicit valence, but their presence in text can modify the meaning of neighboring words and thus the emotional content of a sentence as a whole. Our analysis of partial correlations shows that there is a correlation between the structure of a sentence and its emotional content beyond the simple appearance of individual words. This result suggests an important role of syntax in the process of emotional communication. Future studies can extend our analysis by incorporating valence scores for word sequences, exploring how syntactic rules represent the link between context and emotional content.
The findings reported in this paper suggest that the process of communication between humans, which is known to optimize information transfer , also creates a bias towards positive emotional content. A possible explanation is the basic impact of positive emotions on the formation of social links between humans. Human communication should reinforce such links, which it both shapes and depends on. Thus, it makes sense that human languages on average have a strong bias towards positive emotions, as we have shown (see Figure 2). Negative expressions, on the other hand, mostly serve a different purpose, namely that of transmitting highly informative and relevant events. They are used less, but carry more information.
Our findings are consistent with emotion research in social psychology. According to , the expression of positive emotions increases the level of communication and strengthens social links. This would lead to stronger pro-social behaviour and cooperation, giving an evolutionary advantage to societies whose communication shows a positive bias. As a consequence, positive sentences would become more frequent and even advance to a social norm (cf. “Have a nice day”), but they would provide less information when expressed. Our analysis provides insights into the asymmetry of evaluative processes, as frequent positive expression is consistent with the concept of positivity offset introduced in  and recently reviewed in . In addition, Miller’s negativity bias (the stronger influence of proximal negative stimuli), found in experiments , provides an explanation for the higher information content of negative expression. When writing, people could have a tendency to avoid certain negative topics and bring up positive ones simply because it feels better to talk about nice things. That would lower the frequency of negative words and lower the amount of information carried by positive expression, as negative expression would be necessary to transmit information about urgent threats and dangerous events.
Finally, we emphasize that the positive emotional “charge” of human communication has a further impact on the quantitative analysis of communication on the Internet, for example in chatrooms, forums, blogs, and other online communities. Our analysis provides an estimation of the emotional baseline of human written expression, and automatic tools and further analyses will need to take this into account. In addition, the relation between information content and word valence might be useful for detecting anomalies in human emotional expression. Fake texts purportedly written by humans could be detected, as they might not reproduce this spontaneous balance between information content and positive expression.
All authors designed and performed research, analyzed data, and wrote the article.
The lexica focus on single words rather than on phrases or longer expressions.
This research has received funding from the European Community’s Seventh Framework Programme FP7-ICT-2008-3 under grant agreement no 231323 (CYBEREMOTIONS).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.