In the previous post we discussed the structure of the tweet data. In this post we’ll address the process of selecting or building the right data dictionary for our purpose.
What constitutes a good dictionary?
A crucial data set for any kind of text mining is a dictionary. For sentiment analysis there are two big families of algorithms, and both leverage dictionaries. In the lexicon-based approach, the final score is calculated from the per-word scores in the dictionary. In the machine learning approach, dictionaries are used to reduce data dimensionality.
A good dictionary for our purpose contains words bearing a strong positive or negative connotation, but does not contain neutral words. It may include two- or three-word phrases to improve the handling of negations and other common cases, like “can’t stand.” Ideally, the dictionary should come from the same topic and writing style that will be analysed. (A lexicon for short texts like tweets written by teenagers will be quite different from a lexicon for diplomatic messages.)
So, the best dictionary for our case would be based on tweets about movies. If we want to get sophisticated, it will need to go beyond words and short phrases and include the role of a word in a sentence. For that level of model we would need a dictionary that includes features such as the part of speech for each word. But in our simple Social Movie Reviews demo, we will stick to simple lexicons that only require words. And, of course, we want them to be open source.
Constructing a dictionary suitable for Twitter stream sentiment analysis of movie reviews
There are many open source sentiment analysis projects, but most of them are based on just a few dictionaries: the root word dictionary, the MPQA lexicon, and Jeffrey Breen’s dictionary.
Because of the large number (and credibility) of projects relying on these dictionaries, we consider them solid and suitable for the purpose of the concept demonstration. So let’s take a closer look at these dictionaries to understand their peculiarities.
The root word dictionary contains about 1506 entries that look like this (selected examples):
:(,-1
:),1
^__^,1
^___^,2
abandon,-2
abducted,-2
affected,-1
affection,3
battle,-1
battles,-1
can't stand,-3
cool stuff,3
glamorous,3
glamourous,3
green wash,-3
wtf,-4
As you can see from the selected examples, the “root word dictionary” mostly contains lemmas rather than root words. In some cases it also contains several forms of the same word (for example, “battles” and “battle”). Another interesting observation is that reducing a word to its root may lose the original sentiment: “affected” is scored as -1, for example, while “affection” is scored as +3. The dictionary author tried to address at least some challenges related to variability in spelling (example: glamorous vs glamourous). And last but not least, the dictionary contains n-grams like “cool stuff” that directly affect how we will process the data at the data preparation phase.
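To make the effect of n-grams on data preparation concrete, here is a minimal sketch of a lexicon-based scorer over a few entries from the sample above. The function names, the whitespace tokenizer, and the greedy longest-match rule are our own assumptions for illustration, not part of the dictionary or of any particular project.

```python
# A few entries copied from the sample above, in "term,score" form.
SAMPLE = """\
:(,-1
:),1
abandon,-2
affection,3
can't stand,-3
cool stuff,3
wtf,-4
"""

def load_lexicon(text):
    """Parse 'term,score' lines into a dict, splitting on the last comma."""
    lexicon = {}
    for line in text.splitlines():
        if line.strip():
            term, score = line.rsplit(",", 1)
            lexicon[term.strip().lower()] = int(score)
    return lexicon

def tokenize(tweet):
    """Hypothetical tokenizer: lowercase, split on whitespace, and trim
    punctuation only from tokens that contain a letter or digit, so that
    emoticons such as ':(' survive intact."""
    tokens = []
    for raw in tweet.lower().replace("\u2019", "'").split():
        if any(ch.isalnum() for ch in raw):
            raw = raw.strip(".,!?;:\"")
        if raw:
            tokens.append(raw)
    return tokens

def score(tweet, lexicon, max_n=3):
    """Sum lexicon scores, trying the longest phrase first at each position."""
    tokens = tokenize(tweet)
    total, i = 0, 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                total += lexicon[phrase]
                i += n  # skip the tokens consumed by the matched phrase
                break
        else:
            i += 1  # no entry starts at this token
    return total

lexicon = load_lexicon(SAMPLE)
print(score("I can't stand this movie :(", lexicon))  # -3 + -1 = -4
print(score("Cool stuff! :)", lexicon))               # 3 + 1 = 4
```

Matching the longest phrase first matters here: if “can’t stand” were split into single words before lookup, neither token would be found in this sample and the -3 score would be lost.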
The MPQA lexicon dictionary contains 8222 entries that look like this (again, selected examples):
type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandonment pos1=noun stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative
type=weaksubj len=1 word1=affect pos1=verb stemmed1=y priorpolarity=neutral
type=weaksubj len=1 word1=affectation pos1=noun stemmed1=n priorpolarity=negative
type=strongsubj len=1 word1=affected pos1=adj stemmed1=n priorpolarity=neutral
type=strongsubj len=1 word1=affection pos1=adj stemmed1=n priorpolarity=positive
type=strongsubj len=1 word1=affection pos1=noun stemmed1=n priorpolarity=positive
type=strongsubj len=1 word1=wound pos1=adj stemmed1=n priorpolarity=negative
type=strongsubj len=1 word1=wound pos1=noun stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=wound pos1=verb stemmed1=y priorpolarity=negative
type=weaksubj len=1 word1=wounds pos1=adj stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=wounds pos1=noun stemmed1=n priorpolarity=negative
As you can see, this dictionary contains additional information, such as a part-of-speech label for every entry. That is useful, especially in combination with Natural Language Processing (NLP) techniques such as part-of-speech tagging. But if we use a simple technique that works at bag-of-words granularity, the part-of-speech information will be ignored. As a result, the effective dictionary will have far fewer than 8222 entries, since many of them will be duplicates. We can also see from the sample above that the dictionary uses three-level polarity: positive, negative and neutral. If we are going to classify tweets as either negative or positive, we need to decide how to deal with the “neutral” entries. Finally, such a granular separation of word forms has a side effect: if the form of a word in the text doesn’t match a dictionary entry, we can’t use that word in the analysis.
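The sketch below illustrates how the entries collapse once part of speech is ignored. It parses a few lines from the sample above and keeps one polarity per word. Dropping “neutral” labels and discarding words whose remaining labels disagree are assumptions made for this illustration; they are one possible policy, not a rule that comes with the lexicon.

```python
from collections import defaultdict

# A few entries copied from the sample above.
SAMPLE = """\
type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=affect pos1=verb stemmed1=y priorpolarity=neutral
type=strongsubj len=1 word1=affection pos1=adj stemmed1=n priorpolarity=positive
type=strongsubj len=1 word1=affection pos1=noun stemmed1=n priorpolarity=positive
type=strongsubj len=1 word1=wound pos1=noun stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=wound pos1=verb stemmed1=y priorpolarity=negative
"""

def parse_entry(line):
    """Turn 'key=value key=value ...' into a dict of strings."""
    return dict(field.split("=", 1) for field in line.split())

def collapse(text):
    """Merge entries that differ only by part of speech into one per word."""
    polarities = defaultdict(set)
    for line in text.splitlines():
        if line.strip():
            entry = parse_entry(line)
            polarities[entry["word1"]].add(entry["priorpolarity"])

    lexicon = {}
    for word, labels in polarities.items():
        labels.discard("neutral")      # assumption: neutral entries are dropped
        if len(labels) == 1:           # assumption: conflicting words are dropped
            lexicon[word] = labels.pop()
    return lexicon

flat = collapse(SAMPLE)
print(len(SAMPLE.splitlines()), "entries ->", len(flat), "words")
print(flat)
```

On this sample, six entries reduce to three words: the two “affection” rows and the two “wound” rows merge, and “affect” disappears because its only label is neutral.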
Jeffrey Breen’s dictionary consists of two files: positive words (2006 entries) and negative words (4783 entries) without any other attributes or labels. A small example from the “positive” file:
affectation
affection
affectionate
Again, the dictionary contains exact word forms. If we normalize the text using stemming techniques, the actual number of entries in the dictionary will be much smaller because many entries share the same stem.
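A minimal sketch of that deduplication effect follows. The `toy_stem` function is a deliberately crude, hypothetical suffix stripper written only for this example; a real pipeline would use an established stemmer (for instance a Porter stemmer), and the resulting counts would differ.

```python
def toy_stem(word):
    """Crude suffix stripper for illustration only; NOT a real stemming algorithm."""
    for suffix in ("ionate", "ation", "ion"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def dedupe_by_stem(words):
    """Return the set of distinct stems for a list of dictionary entries."""
    return {toy_stem(w) for w in words}

# The three entries from the "positive" file sample above.
words = ["affectation", "affection", "affectionate"]
stems = dedupe_by_stem(words)
print(len(words), "entries ->", len(stems), "stem(s):", sorted(stems))
# 3 entries -> 1 stem(s): ['affect']
```

The same collapse is also what makes stemming risky for scored lexicons: entries with different sentiment can end up sharing one stem.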
Having closely reviewed the dictionaries mentioned above, we will use the root word dictionary as our primary choice, with Jeffrey Breen’s dictionary as an alternative. The root word dictionary is based on microblogs and very likely represents the specific lexicon we need for Social Movie Reviews. As one sign of its suitability, it’s the only dictionary that contains emoticons. It also provides strong lexicon coverage (about 1500 entries), and it is the only dictionary available to us that reflects the strength of a word’s sentiment on an eight-level scale (-4 to +4).
Jeffrey Breen’s dictionary is a strong second choice. After stemming, it contains about 4000 entries. That means wide lexicon coverage, so it gives us a wide selection of words that reflect sentiments. But it only has two sentiment levels: positive and negative.
We will leave the MPQA lexicon alone for the moment. It has a more complex structure and more features than the others, which might be useful for a more sophisticated model that considers sentence syntax. However, after simplification via stemming and the removal of duplicates caused by part-of-speech labels and identical stems, it has about 4400 entries, which is similar to Jeffrey Breen’s dictionary.
In contrast to MPQA, Jeffrey Breen’s dictionary is built for opinion mining in social media, which is why we chose it over MPQA for Social Movie Reviews. It is still our second choice, though; the root word dictionary is number one.
Therefore, we will use the root word dictionary for both the lexicon-based model and a machine learning model. The lexicon-based model (the simplest possible) built on the root word dictionary will be our baseline. Additionally, we will create a machine learning model with Jeffrey Breen’s dictionary. Then we’ll compare the two machine learning models’ performance to each other and to the baseline, and use the best-performing model in the deployment phase. We’ll save further discussion of training and test data sets for the next blog post.
Victoria Livschitz, Anton Ovchinnikov, Joseph Gorelik