SentiBench - a benchmark comparison of state-of-the-practice sentiment analysis methods

Ribeiro, Filipe N; Araújo, Matheus; Gonçalves, Pollyanna; André Gonçalves, Marcos; Benevenuto, Fabrício

doi:10.1140/epjds/s13688-016-0085-1

Published: 07 July 2016


EPJ Data Science volume 5, Article number: 23 (2016)


Abstract

In the last few years thousands of scientific papers have investigated sentiment analysis, several startups that measure opinions on real data have emerged, and a number of innovative products related to this theme have been developed. There are multiple methods for measuring sentiments, including lexical-based and supervised machine learning methods. Despite the vast interest in the theme and the wide popularity of some methods, it is unclear which one is better for identifying the polarity (i.e., positive or negative) of a message. Accordingly, there is a strong need to conduct a thorough apples-to-apples comparison of sentiment analysis methods, as they are used in practice, across multiple datasets originating from different data sources. Such a comparison is key for understanding the potential limitations, advantages, and disadvantages of popular methods. This article aims at filling this gap by presenting a benchmark comparison of twenty-four popular sentiment analysis methods (which we call the state-of-the-practice methods). Our evaluation is based on a benchmark of eighteen labeled datasets, covering messages posted on social networks, movie and product reviews, as well as opinions and comments in news articles. Our results highlight the extent to which the prediction performance of these methods varies across datasets. Aiming at boosting the development of this research area, we release the code of the methods and the datasets used in this article, deploying them in a benchmark system that provides an open API for accessing and comparing sentence-level sentiment analysis methods.

1 Introduction

Sentiment analysis has become an extremely popular tool, applied in several analytical domains, especially on the Web and social media. To illustrate the growth of interest in the field, Figure 1 shows the steady growth in the number of searches on the topic, according to Google Trends,^{Footnote 1} mainly after the popularization of online social networks (OSNs). More than 7,000 articles have been written about sentiment analysis and various startups are developing tools and strategies to extract sentiments from text [1].

The number of possible applications of such a technique is also considerable. Many of them are focused on monitoring the reputation or opinion of a company or a brand with the analysis of reviews of consumer products or services [2]. Sentiment analysis can also provide analytical perspectives for financial investors who want to discover and respond to market opinions [3, 4]. Another important set of applications is in politics, where marketing campaigns are interested in tracking sentiments expressed by voters associated with candidates [5].

Due to the enormous interest and applicability, there has been a corresponding increase in the number of proposed sentiment analysis methods in recent years. The proposed methods rely on many different techniques from different computer science fields. Some of them employ machine learning methods that often rely on supervised classification approaches, requiring labeled data to train classifiers [6]. Others are lexical-based methods that make use of predefined lists of words, in which each word is associated with a specific sentiment. The lexical methods vary according to the context in which they were created. For instance, LIWC [7] was originally proposed to analyze sentiment patterns in formally written English texts, whereas PANAS-t [8] and POMS-ex [9] were proposed as psychometric scales adapted to the Web context.

Overall, the above techniques are accepted by the research community, and it is common to see concurrent important papers, sometimes published in the same computer science conference, using completely different methods. For example, the famous Facebook experiment [10], which manipulated users’ feeds to study emotional contagion, used LIWC [7]. Concurrently, Reis et al. used SentiStrength [11] to measure the negativeness or positiveness of online news headlines [12, 13], whereas Tamersoy [14] explored VADER’s lexicon [15] to study patterns of smoking and drinking abstinence in social media.

As the state-of-the-art has not been clearly established, researchers tend to accept any popular method as a valid methodology to measure sentiments. However, little is known about the relative performance of the several existing sentiment analysis methods. In fact, newly proposed methods are rarely compared with all pre-existing ones using a large number of existing datasets. This is a very unusual situation from a scientific perspective, in which benchmark comparisons are the rule. Moreover, most applications and experiments reported in the literature use previously developed methods exactly as they were released, with no changes or adaptations and with little or no parameter tuning. In other words, the methods have been used as black boxes, without a deeper investigation of their suitability to a particular context or application.

To sum up, existing methods have been widely deployed for developing applications without a deeper understanding of their applicability in different contexts or of their advantages, disadvantages, and limitations in comparison with one another. Thus, there is a strong need to conduct a thorough apples-to-apples comparison of sentiment analysis methods, as they are used in practice, across multiple datasets originating from different data sources.

This state-of-the-practice situation is what we propose to investigate in this article. We do this by providing a thorough benchmark comparison of twenty-four state-of-the-practice methods using eighteen labeled datasets. In particular, given the recent popularity of online social networks and of short texts on the Web, many methods are focused on detecting sentiments at the sentence level, and are usually used to measure the sentiment of small sets of sentences in which the topic is known a priori. We focus on this context; thus, our datasets cover messages posted on social networks, movie and product reviews, and opinions and comments in news articles, TED talks, and blogs. We surveyed the extensive literature on sentiment analysis to identify existing sentence-level methods covering several different techniques. We contacted authors asking for their code when available, or we implemented existing methods when the code was unavailable but the method could be reproduced based on its description in the original published paper. We should emphasize that our work focuses on off-the-shelf methods as they are used in practice. This excludes most of the supervised methods, which require labeled sets for training, as these are usually not available to practitioners. Moreover, most of the supervised solutions do not share the source code or a trained model that could be used without supervision.

Our experimental results unveil a number of important findings. First, we show that no single method always achieves the best prediction performance across all datasets, a result consistent with the ‘no free lunch’ theorem [16]. We also show that existing methods vary widely regarding their agreement, even across similar datasets. This suggests that the same content could be interpreted very differently depending on the choice of a sentiment method. We noted that most methods are more accurate in classifying positive than negative text, suggesting that current approaches tend to be biased towards positivity. Finally, we quantify the relative prediction performance of existing efforts in the field across different types of datasets, identifying those with higher prediction performance across different datasets.
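To make the notions of agreement between methods and of positive versus negative accuracy concrete, the following is a minimal Python sketch. The function names, the toy labels, and the exact definitions (simple percentage agreement and per-class recall) are illustrative assumptions and are not necessarily the metrics adopted in this article.

```python
def agreement(preds_a, preds_b):
    """Fraction of messages on which two methods output the same label."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        return 0.0
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)


def recall_for_label(gold, predicted, label):
    """Fraction of messages with gold label `label` that were predicted as `label`."""
    relevant = [p for g, p in zip(gold, predicted) if g == label]
    if not relevant:
        return 0.0
    return sum(p == label for p in relevant) / len(relevant)


# Hypothetical toy data: 1 = positive, -1 = negative.
gold = [1, 1, 1, -1, -1, -1]
method_a = [1, 1, 1, 1, -1, -1]
method_b = [1, 1, -1, 1, 1, -1]

pairwise_agreement = agreement(method_a, method_b)
positive_recall_a = recall_for_label(gold, method_a, 1)
negative_recall_a = recall_for_label(gold, method_a, -1)
```

In this toy example the hypothetical method A recovers every positive message but misses one negative message, which is the kind of asymmetry that a bias towards positivity would produce.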

Based on these observations, our final contribution consists of releasing our gold standard dataset and the code of the compared methods.^{Footnote 2} We also created a Web system through which other researchers can easily use our data and code to compare results with the existing methods.^{Footnote 3} More importantly, by using our system one could easily test which method would be the most suitable for a particular dataset and/or application. We hope that our tool will not only help researchers and practitioners in accessing and comparing a wide range of sentiment analysis techniques, but will also contribute to the development of this research field as a whole.

The remainder of this paper is organized as follows. In Section 2, we briefly describe related efforts. Then, in Section 3 we describe the sentiment analysis methods we compare. Section 4 presents the gold standard data used for comparison. Section 5 summarizes our results and findings. Finally, Section 6 concludes the article and discusses directions for future work.

2 Background and related work

Next we discuss important definitions and justify the focus of our benchmark comparison. We also briefly survey existing related efforts that compare sentiment analysis methods.

2.1 Focus on sentence-level sentiment analysis

Since sentiment analysis can be applied to different tasks, we restrict our focus to comparing efforts that detect the polarity (i.e. positivity or negativity) of a given short text (i.e. sentence-level). Polarity detection is a common function across all sentiment methods considered in our work, providing valuable information to a number of different applications, especially those that explore short messages that are commonly available in social media [1].

Sentence-level sentiment analysis can be performed with supervision (i.e. requiring labeled training data) or without it. An advantage of supervised methods is their ability to adapt and create trained models for specific purposes and contexts. A drawback is the need for labeled data, which might be highly costly, or even prohibitive, for some tasks. On the other hand, the lexical-based methods make use of a pre-defined list of words, where each word is associated with a specific sentiment. The lexical methods vary according to the context in which they were created. For instance, LIWC [7] was originally proposed to analyze sentiment patterns in English texts, whereas PANAS-t [8] and POMS-ex [9] are psychometric scales adapted to the Web context. Although lexical-based methods do not rely on labeled data, it is hard to create a single lexical dictionary that can be used in all contexts.
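As an illustration of the lexical-based idea described above, the following minimal Python sketch scores a short text by summing the sentiment values of the words found in a word list. The lexicon shown is a hypothetical toy example, and the function name and scoring rule are assumptions made for illustration; none of the methods evaluated in this article is claimed to work exactly this way.

```python
import re

# Hypothetical toy lexicon; real lexicons contain thousands of entries.
TOY_LEXICON = {
    "good": 1, "great": 1, "love": 1, "happy": 1,
    "bad": -1, "terrible": -1, "hate": -1, "sad": -1,
}


def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Return 1 (positive), -1 (negative), or 0 (neutral) for a short text."""
    # Lowercase the text and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Words missing from the lexicon contribute nothing to the score.
    score = sum(lexicon.get(token, 0) for token in tokens)
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0
```

Because the result depends entirely on which words the list contains, a lexicon built for one context (for example, formal text) may miss the vocabulary of another (for example, tweets), which is the limitation noted above.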

We focus our effort on evaluating unsupervised methods, as they can be easily deployed in Web services and applications without the need for human labeling or any other type of manual intervention. As described in Section 3, some of the methods we consider have used machine learning to build lexicon dictionaries or even to build models and tune specific parameters. We incorporate those methods in our study, since they have been released as black-box tools that can be used in an unsupervised manner.

2.2 Existing efforts on comparison of methods

Despite the large number of existing methods, only a limited number of studies have compared sentiment analysis methods, usually with restricted datasets. Overall, lexical methods and machine learning approaches have been evolving in parallel in recent years, and it comes as no surprise that studies have started to compare their performance on specific datasets and to use one or the other strategy as a baseline for comparison. A recent survey summarizes several of these efforts [17] and concludes that a systematic comparative study that implements and evaluates all relevant algorithms under the same framework is still missing in the literature. As new methods emerge and compare themselves against only one or at most two other methods, using different evaluation datasets and experimental methodologies, it is hard to conclude whether a single method outperforms the remaining ones, either in general or in specific scenarios. To the best of our knowledge, our effort is the first of its kind to create a benchmark that provides such a thorough comparison.

An important effort worth mentioning is an annual workshop - The International Workshop on Semantic Evaluation (SemEval). It consists of a series of exercises grouped in tracks, including sentiment analysis and text similarity, among others, that pit several competitors against each other. Some new methods such as Umigon [18] have been proposed after obtaining good results on some of these tracks. Although SemEval has been playing an important role in identifying relevant methods, it requires authors to register for the challenge, and many popular methods have not been evaluated in these exercises. Additionally, SemEval labeled datasets are usually focused on one specific type of data, such as tweets, and do not represent a wide range of social media data. In our evaluation effort, we consider one dataset from SemEval 2013 and two methods that participated in the competition in that same year.

Ahmadi et al. [19] performed a comparison of Twitter-based sentiment analysis tools. They selected twenty tools and tested them across five Twitter datasets. This benchmark is the work closest to ours, but it differs in some meaningful aspects. Firstly, we embraced distinct contexts such as reviews, comments and social networks, aiming at providing a broader evaluation of the tools. Secondly, the methods they selected included supervised and unsupervised approaches, which, in our view, could be unfair to the unsupervised ones. Although the results have been presented separately, the supervised methods, as mentioned by the authors, required extensive parameter tuning and validation in a training environment. Therefore, supervised approaches tend to adapt to the context they were applied to. As previously highlighted, our focus is on off-the-shelf tools as they have been extensively and recently used. Many researchers and practitioners have also used supervised approaches, but this is out of the scope of our work. Finally, most of the unsupervised methods selected in the Twitter benchmark are paid tools, except for two of them, both of which were developed as a result of published academic research. In contrast, we conducted an extensive literature review to include relevant academic outcomes without excluding the most used commercial options.

Finally, in a previous effort [20], we compared eight sentence-level sentiment analysis methods, based on one public dataset used to evaluate SentiStrength [11]. This article largely extends our previous work by comparing a much larger set of methods across many different datasets, providing a much deeper benchmark evaluation of current popular sentiment analysis methods. The methods used in this paper were also incorporated as part of an existing system, namely iFeel [21].

3 Sentiment analysis methods

This section provides a brief description of the twenty-four sentence-level sentiment analysis methods investigated in this article. Our effort to identify important sentence-level sentiment analysis methods consisted of systematically searching for them in the main conferences in the field and then checking the papers that cited them as well as their own references. Some of the methods are available for download on the Web; others were kindly shared by their authors upon request; and a small number were implemented by us based on their descriptions in the original paper. This usually happened when authors shared only the lexical dictionaries they created, leaving the implementation of the method that uses the lexical resource to us.

Table 1 and Table 2 present an overview of these methods, providing a description of each method as well as the techniques it employs (L for Lexicon Dictionary and ML for Machine Learning), its outputs (e.g. −1, 0, 1, meaning negative, neutral, and positive, respectively), the datasets used to validate it, the baseline methods used for comparison, and lexicon details, including a Lexicon size column giving the number of terms contained in the method’s lexicon. The methods are organized in chronological order to allow a better overview of the existing efforts over the years. We can note that the methods generate different output formats. We colored in blue the positive outputs, in black the neutral ones, and in red those that are negative. Note that we included LIWC and LIWC15 entries in Table 2, which represent the former version, launched in 2007, and the latest version, from 2015, respectively. We considered both versions because the first one was extensively used in the literature. This also allows us to compare the improvements between the two versions.
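Since the methods generate different output formats, comparing them requires mapping each output onto a common scale such as −1, 0, 1. The Python sketch below illustrates one possible mapping for two hypothetical output styles (a text label or a signed numeric score). The function name, the label strings, and the `neutral_band` threshold are assumptions made for illustration and do not reproduce the adaptations applied to any specific method in this article.

```python
def to_polarity(raw_output, neutral_band=0.0):
    """Map a raw method output to -1 (negative), 0 (neutral), or 1 (positive).

    Handles two hypothetical output styles: a text label or a signed score.
    Scores within [-neutral_band, neutral_band] are treated as neutral.
    """
    if isinstance(raw_output, str):
        labels = {"negative": -1, "neutral": 0, "positive": 1}
        key = raw_output.strip().lower()
        if key not in labels:
            raise ValueError(f"unknown label: {raw_output!r}")
        return labels[key]
    score = float(raw_output)
    if score > neutral_band:
        return 1
    if score < -neutral_band:
        return -1
    return 0
```

For example, `to_polarity("Positive")` and `to_polarity(0.8)` both map to 1, so a label-based and a score-based method can be evaluated against the same gold standard.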

Table 1 Overview of the sentence-level methods available in the literature
