Sentiment Analysis: Movie Reviews
Changing the tokenizer for Naive Bayes Classifier to improve the prediction of movie review sentiments.
Principal Author: Samprity Kashyap
Collaborators: Junyuan Zheng
Sentiment analysis (also known as opinion mining) uses natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. A movie review website lets users submit reviews along with what they liked or disliked about a particular movie. Being able to study these reviews and generate metadata that describes their content provides an opportunity to understand the general sentiment around that movie in a democratized way. Using machine learning we can democratize subjectivity about anything in the world. We can make an objective analysis of subjective content, giving us the ability to better understand trends around products and services that we can use to make better decisions as consumers. On this page, we perform sentiment analysis for movie reviews. The primary goal is to modify the tokenizer of a Naive Bayes classifier so that it works well for sentiment analysis. The classifier can then be used for predicting sentiments for movie reviews. We have also performed some experiments to assess the prediction accuracy.
What is Sentiment Analysis?
Sentiments are feelings, which include emotions, attitudes and opinions. A sentiment is the feeling expressed in a review or a comment. Is someone against a trend, or do they support it? Do they think a service was good or bad? Do they like or dislike a product? Sentiments are subjective impressions, not facts. Generally, a binary opposition in opinions is assumed: for/against, like/dislike, good/bad, etc. However, it is often more complex. Some sentiments offer neither a good nor a bad opinion, and are often described as neutral.
We can use natural language processing, statistics, or machine learning methods to extract, recognize, or characterize the sentiment content of a text. This is also referred to as opinion mining, where the emphasis is on extraction.
Sentiment analysis aims to determine the attitude and opinion of the reviewer, commentator or author of a piece of text with respect to the topic in question. The text could be comments and reviews, and it can be viewed as positive, negative, neutral or having no sentiment at all. Figure 1 describes a positive and a negative movie review. Sentiment analysis is widely applicable to reviews and social media, and is used in areas such as marketing and customer service. It can be used to determine the overall contextual polarity of a text or document. The attitude may be a person's judgment or evaluation, affective state (the emotional state of the author when writing), or the intended emotional communication (the emotional effect the author wishes to have on the reader).
Questions that might be asked in Sentiment Analysis:
- Is the review of the product positive or negative?
- Does this email from the customer imply he is content or angry?
- How are people reacting to this ad campaign/product based on tweets?
- How have bloggers' attitudes about the president changed since the election?
- Is this movie review positive or negative?
Examples of Sentiment Analysis
Sentiment analysis is used in many fields, from online retail to all forms of blogging, and even in politics. It is very useful for social media monitoring, because it gives users an overview of public opinion on certain topics. Extracting insights from social data is a practice that is being widely adopted by organisations across the world. It has been shown that changes in sentiment on social media correlate with changes in the stock market. The Obama administration also used sentiment analysis to assess public opinion regarding policy announcements and campaign messages before the 2012 presidential election.
The ability to quickly perceive and infer consumer attitudes and act accordingly is something that Expedia Canada took advantage of. They applied it when they noticed a consistent increase in negative feedback on the music used in one of their television adverts. Sentiment analysis conducted by the brand showed that the music played in the commercial had become incredibly annoying after multiple airings, and consumers were turning to social media to vent their frustrations. A few weeks after the commercial first aired, over half of the online conversation regarding the campaign was negative. Rather than writing the commercial off as a failure, Expedia was able to address the negative sentiment in a positive and playful way by airing a new version of the commercial which featured the offending violin being smashed.
In recent years, we have witnessed a large number of websites that enable users to contribute, modify, and grade content. Users have an opportunity to express their personal opinions about specific topics. Examples of such websites include blogs, forums, product review sites, and social networks. Sentiment analysis is becoming one of the most significant research areas for classification and prediction, and movie review analysis is one of the popular fields for analyzing public sentiment. The focus of our page is the analysis of sentiment in movie review comments. For our experiment, we have taken an open source movie review dataset from Cornell University. We are trying to improve the sentiment prediction accuracy of a Naive Bayes (NB) classifier. NB helps in finding relationships in the movie review data, i.e. which words are more likely to be present in positive or negative reviews. We are not modifying the Bayes classification algorithm. Instead, we are modifying the tokenizer used in the training step. The trained model can then be a source of knowledge for predicting future sentiments. We have compared the average accuracy of the different tokenizers over 10 test runs. We have also manually tested the tool by copying and pasting reviews from IMDB pages. We are aiming for better than random results for sentiment predictions.
Our method of sentiment analysis is based on machine learning. The steps are:
- Label data: We obtained the data set and shuffled it. It consists of 10,000+ sentences from movie reviews. They are labeled "positive" and "negative" and are split evenly. 80% of the data was used for training and 20% was used for testing.
- Training: We performed training on the data set, which included preprocessing, tokenization, registering labels etc. We have modified the tokenizer part for our experiment.
- Learn the model: Bayes' theorem is applied to calculate individual word probabilities. The individual word probabilities are summed up to get the total negative and positive sentiment scores. We learn a function guess for determining this. The label with the higher score is chosen.
- Validation: The model is learnt on the training dataset. It is then applied to the remaining 20% of the data and the accuracy is calculated.
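The shuffle and 80/20 split from the steps above could be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function names `shuffle` and `trainTestSplit` and the toy `examples` array are hypothetical.

```javascript
// Fisher-Yates shuffle; returns a new array and leaves the input untouched.
function shuffle(items, random = Math.random) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Shuffle the labeled examples, then cut them into training and test sets.
function trainTestSplit(examples, trainFraction = 0.8) {
  const shuffled = shuffle(examples);
  const cut = Math.floor(shuffled.length * trainFraction);
  return { trainSet: shuffled.slice(0, cut), testSet: shuffled.slice(cut) };
}

// Hypothetical toy data standing in for the 10,000+ labeled sentences.
const examples = Array.from({ length: 10 }, (_, i) => ({
  text: `review ${i}`,
  label: i % 2 === 0 ? "positive" : "negative",
}));

const { trainSet, testSet } = trainTestSplit(examples);
console.log(trainSet.length, testSet.length); // 8 2
```

Because the split is random, each run trains on a different 80% of the data, which is why accuracy is best reported as an average over several runs.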
We are not going to talk much about Bayes' theorem as it has been extensively covered in the lecture classes. It is not enough to say that a word is a negative word or a positive word. We have to find the probability that a text is positive given that the word appears in it. Bayes' theorem does exactly that. We combine the individual probabilities to find a total probability that the sentiment is positive given that all these words are included. If that probability is high enough, we can act on it.
Naive Bayes Classifier
The Naive Bayes Classifier is so named because it assumes that each word in the document is independent of the next word. This is a naive assumption. It is nevertheless a great simplifying assumption, as studying words separately in this manner yields very good results. We could also consider bigrams or trigrams (two or three words at a time). This relaxes the strict word-by-word independence, but it needs a larger amount of training data and storage space.
The NB classifier works well for document classification. This is because it de-correlates the number of times a word is seen in a given language from its statistical importance. It makes the assumption that each word is statistically independent of every other word. In language, however, this assumption doesn't hold true. The word "England" following the words "Queen of" is much more likely than most other words ("Queen of chairs" isn't something that we see every day). It is clear that words are not independent of one another. Still, it has been shown that the naive assumption works well for the purposes of document classification.
Natural language is complicated, so we need to reduce its entropy. Entropy is a concept used across lots of fields. It is a measure of the number of possible "states" a system can be in. We can reduce entropy by converting upper case words to lower case. This means that "BENCH", "Bench" and "bench" can be considered to be the same. We convert these words to a simple "bench", and this helps us study the concept of the word and not the syntax. However, this might not be appropriate for spam detection.
The first step is tokenization, i.e. splitting a document up into discrete chunks so that we can study it. We have kept our tokenization simple by removing punctuation, making everything lowercase, and splitting the document up by spaces to get the tokens. We have also stemmed words as part of the tokenization process.
We did try using bigrams and trigrams, but this ended up increasing the entropy and also took a lot of time for training. For example, if we use 1,000 different words and build a unigram Bayes classifier, we need to store word counts for only 1,000 words. If we use bigrams, we could need to store up to 1,000,000 different word pairs. The occurrence of each word pair will also be relatively rare. This made the application slow as well.
Another way to reduce entropy is stemming. Stemming treats variations of a word like "great", "greatly", "greatest", and "greater" as the same. It effectively decreases the entropy and gives more data around the idea of "great". We have used Porter Stem for this experiment.
This section describes a way of dealing with negation without using bigrams. The concept is simple: whenever a negation (such as no, not, can't, never etc.) is seen, an exclamation mark is added to the beginning of every word following it. For example, the sentence "Movie was not good" converts to "Movie was not !good". "!good" is then recorded in the Bayes classifier as a token found in a negative review. The best part is that the only modification needed is in the tokenizer. The Bayes classification algorithm stays the same. We have used the following regular expression for negations:
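A tokenizer along the lines described above might look like the sketch below. The function name `tokenize` is hypothetical, stemming is left out to keep the example self-contained, and the punctuation filter assumes plain ASCII English text (it would also strip accented letters).

```javascript
// Lowercase the text, drop punctuation, and split on whitespace.
// Apostrophes are removed too, so "wasn't" becomes "wasnt".
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .split(/\s+/)
    .filter(Boolean); // discard empty strings from leading/trailing spaces
}

console.log(tokenize("The movie wasn't BAD, really!"));
// [ 'the', 'movie', 'wasnt', 'bad', 'really' ]
```

Stripping the apostrophe rather than splitting on it keeps contractions such as "wasnt" or "cant" as single tokens.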
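To make the storage argument concrete, here is a small sketch of how n-gram tokens can be built from unigram tokens. The helper name `ngrams` is hypothetical, and this is not necessarily how the bigram and trigram tokenizers in the experiments were written.

```javascript
// Build overlapping n-grams (joined with a space) from a token array.
function ngrams(tokens, n) {
  const out = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    out.push(tokens.slice(i, i + n).join(" "));
  }
  return out;
}

const tokens = ["movie", "was", "not", "good"];
console.log(ngrams(tokens, 2)); // [ 'movie was', 'was not', 'not good' ]
console.log(ngrams(tokens, 3)); // [ 'movie was not', 'was not good' ]

// Upper bound on distinct features for a vocabulary of 1,000 words.
const vocabularySize = 1000;
console.log(vocabularySize ** 2); // 1000000 possible bigrams
```

Each additional word in the n-gram multiplies the number of possible features by the vocabulary size, while the training data stays the same, so each individual feature is seen far less often.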
var negationList = new RegExp("^(never|none|not|couldnt|shouldnt|wont|havent|hasnt|no|nothing|nowhere|noone|isnt|arent|hadnt|cant|wouldnt|dont|doesnt|didnt|aint)$");
We have used bayes.js to train on the reviews. It has a train(text) method where labels are registered and tokenization takes place. For the unigram with negation tokenizer, we added the exclamation marks (negations) to the text before feeding it to the train method. Unigram with no negation, bigram and trigram tokenizers were also tested (details in the results section).
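The negation marking described above can be sketched as follows. The function name `markNegations` is hypothetical, and two details are assumptions because the text does not specify them: the input is taken to be already lowercased with punctuation removed, and marking continues to the end of the text once a negation is seen.

```javascript
// Negation words from the list above, anchored so only whole tokens match.
const negationList = new RegExp(
  "^(never|none|not|couldnt|shouldnt|wont|havent|hasnt|no|nothing|nowhere|" +
    "noone|isnt|arent|hadnt|cant|wouldnt|dont|doesnt|didnt|aint)$"
);

// Prefix every token that follows a negation word with "!".
// Negation words themselves are left unmarked (an assumption).
function markNegations(tokens) {
  let negated = false;
  return tokens.map((token) => {
    if (negationList.test(token)) {
      negated = true;
      return token;
    }
    return negated ? "!" + token : token;
  });
}

console.log(markNegations(["movie", "was", "not", "good"]).join(" "));
// movie was not !good
```

The marked string can then be passed to the training and guessing steps unchanged, so "good" and "!good" are counted as two unrelated tokens.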
Learn the Model
This is known as supervised classification (or supervised learning) in the machine learning world. Given a labelled dataset, the task is to learn a function that will predict the label given the input. In this case, we learn a function guess (provided by bayes.js, taking the review as input) that returns the probability scores, both negative and positive. Bayes' theorem is applied to calculate the probability that the sentiment of a text is positive or negative given that a particular word is in it. The model assigns 50% to any token that is not seen in the training dataset. The individual probabilities are combined to give an overall probability that the document is positive or negative, given that all these words are in it.
80% of the data was used for training and 20% for testing. bayes.js was used to extract the result, i.e. the label with the higher probability and its probability score. If the probability score was less than 70%, the prediction was skipped. Increasing the threshold for acceptance had a significant effect on accuracy: setting it to 70% or greater increased the classifier's accuracy from 77% to 84%, a difference of 7 percentage points.
Figure 2 shows the UI of the Movie Review tool developed. You can try out the Sentiment Analysis tool here. This implementation uses the unigram with negation tokenizer.
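The sketch below illustrates one common way to turn per-token Bayes probabilities into a document score. It is not taken from bayes.js: the counts, the function names `tokenProbability` and `documentProbability`, the clamping to [0.01, 0.99], and the log-odds combination are all assumptions made for illustration. Only the 50% default for unseen tokens comes from the description above.

```javascript
// Hypothetical training statistics: number of documents per label, and
// number of documents of each label that contain a given token.
const docCounts = { positive: 4, negative: 4 };
const tokenDocCounts = new Map([
  ["great", { positive: 3, negative: 1 }],
  ["boring", { positive: 1, negative: 3 }],
]);

// P(label | token) by Bayes' theorem; unseen tokens get 0.5.
function tokenProbability(token, label) {
  const counts = tokenDocCounts.get(token);
  if (!counts) return 0.5;
  const other = label === "positive" ? "negative" : "positive";
  const total = docCounts.positive + docCounts.negative;
  // P(token | label) * P(label), and the same for the other label.
  const joint = (counts[label] / docCounts[label]) * (docCounts[label] / total);
  const jointOther = (counts[other] / docCounts[other]) * (docCounts[other] / total);
  const p = joint / (joint + jointOther);
  // Clamp so a single token can never force a probability of exactly 0 or 1.
  return Math.min(0.99, Math.max(0.01, p));
}

// Combine token probabilities by summing log-odds, then map back to [0, 1].
function documentProbability(tokens, label) {
  let logOdds = 0;
  for (const token of tokens) {
    const p = tokenProbability(token, label);
    logOdds += Math.log(p) - Math.log(1 - p);
  }
  return 1 / (1 + Math.exp(-logOdds));
}

console.log(tokenProbability("great", "positive")); // 0.75
console.log(documentProbability(["great", "great", "unseen"], "positive"));
```

Summing log-odds avoids the numerical underflow that multiplying many small probabilities would cause, and an unseen token contributes zero log-odds, so it leaves the score unchanged.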
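The effect of the acceptance threshold can be sketched as follows. The function name `accuracyAboveThreshold` and the `results` array are hypothetical; each entry stands for one test review with the predicted label, the true label, and the probability of the predicted label.

```javascript
// Score only the predictions whose probability reaches the threshold;
// predictions below it are skipped rather than counted as wrong.
function accuracyAboveThreshold(results, threshold = 0.7) {
  const kept = results.filter((r) => r.probability >= threshold);
  const correct = kept.filter((r) => r.predicted === r.actual).length;
  return {
    kept: kept.length,
    skipped: results.length - kept.length,
    accuracy: kept.length === 0 ? NaN : correct / kept.length,
  };
}

// Hypothetical predictions on five test reviews.
const results = [
  { predicted: "positive", actual: "positive", probability: 0.95 },
  { predicted: "negative", actual: "negative", probability: 0.88 },
  { predicted: "positive", actual: "negative", probability: 0.55 },
  { predicted: "negative", actual: "positive", probability: 0.74 },
  { predicted: "positive", actual: "positive", probability: 0.62 },
];

console.log(accuracyAboveThreshold(results, 0));   // all 5 kept
console.log(accuracyAboveThreshold(results, 0.7)); // 3 kept, 2 skipped
```

Note that accuracy measured this way covers only the reviews the classifier was willing to label, so the number of skipped reviews should be reported alongside it.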
Evaluation and Results
The tests were run 10 times on each configuration using the remaining 20% of the dataset. The average accuracy of each of the five configurations is as follows:
|Tokenizer||Average Accuracy in Percentage|
|Unigram No Negation||82.56|
|Unigram with Negation||84.24|
|Bigram No Negation||79.08|
|Bigram With Negation||80.04|
|Trigram No Negation||73.72|
The experiments showed that the Unigram with Negation tokenizer had the best average accuracy. The training time for the bigram and trigram tokenizers was also higher than for the unigram tokenizer. Adding negation to the bigram tokenizer improved its accuracy marginally.
We performed some experiments with the Unigram with Negation tokenizer using movie reviews from IMDB. The higher the number of stars, the better the review. We picked the first movie review on each linked page and tested it manually by copying and pasting it into the sentiment analysis tool.
|Movie||Number of Stars in the review||Predicted Sentiment||Probability in Percentage|
|The Shawshank Redemption||10||Positive||100|
|The Dark Knight||10||Positive||93|
|Game of Thrones||9||Positive||89|
|The Lord of the Rings: The Fellowship of the Ring||9||Positive||99|
|The Black Dahlia||5||Negative||100|
|Star Wars: Episode VII - The Force Awakens||4||Negative||98|
|Superbabies: Baby Geniuses 2||1||Negative||95|
|Code Name: K.O.Z.||1||Negative||100|
Most of the predictions agree with the stars given by the reviewers. One thing we observed was that the probability percentages for the same review varied when the application was loaded again. This is because the input data is sorted randomly on each load. However, the positive or negative verdict remained more or less the same. One exception was Fight Club, where the same review was predicted positive (100% probability) once and negative (100% probability) the next time. The review for Furious Seven was always predicted negative even though the rating was 8 stars.
The review for Furious Seven is as follows:
By about 2009 when 'Fast & Furious' was released, the franchise had slowly began to veer away from its initial focus of street racing and instead began to turn its attention to the action genre and over-the-top big budget sequences. However along with this change of style, the franchise was actually getting better and better with each film. Today I saw the most recent instalment in the cinema... talk about action-packed. 'Fast & Furious 7' is a no-holds, over- the-top and mindless action film, but this aside, it is an extremely entertaining and fun film to watch. With an all-star cast and some brilliant action sequences, 'Fast & Furious 7' is proof that certain franchises can continually make great movies. The most notable moment however in the entire film is the emotional and respectful ending during the send-off of Paul Walker, the film finishing with a montage of Walker in the previous six films, finishing with just two words, 'For Paul', this is the first time a Fast and Furious film has affected me emotionally, and it is arguably the best in the franchise.
It is clearly a positive review, but the tool was not able to predict it as such.
Discussion and Future Work
From the above experiments, we can conclude that the unigram with negation tokenizer is more accurate than the other tokenizers considered in the experiments. We can also say that sentiment analysis for movie reviews using a Naive Bayes Classifier has its limitations. It is better than random approaches, but it is not a 100% accurate marker. There are better and more sophisticated algorithms than the one we have developed on this page. We can use Decision Trees, Maximum-Entropy, and K-Means clustering for sentiment analysis. But as with any automated process, it is prone to error. A human eye is often needed to watch over it. Beyond reliability, it’s important to understand that human expressions do not fit into just two buckets. Not all sentiments can simply be categorized as positive or negative.
Future direction for this work would be further cleaning the dataset to remove noise. We could also consider extending the labeling on a scale of five values:
- Negative
- Somewhat negative
- Neutral
- Somewhat positive
- Positive
Obstacles like sarcasm, terseness, language ambiguity and many others make this task very challenging. Some of the key challenges are:
- Named Entity Recognition - What is the person actually talking about, e.g. is 300 Spartans a group of Greeks or a movie?
- Anaphora Resolution - the problem of resolving what a pronoun or a noun phrase refers to. "We watched the movie and went to dinner; it was awful." What does "It" refer to?
- Parsing - What is the subject and object of the sentence, which one does the verb and/or adjective actually refer to?
- Sarcasm - If we don't know the author, we have no idea whether 'bad' means bad or good.
- Twitter - Abbreviations, lack of capitals, poor spelling, poor punctuation, poor grammar