With thousands of tweets posted around the time a movie is released, we can get a good sense of how the people watching it feel about it. So why not try to predict the movie’s rating through sentiment analysis of those tweets? Here I’m going to use the Naive Bayes classifier from Python’s NLTK library to classify tweets as positive (if people liked the movie) or negative (if they hated it).
To train the classifier I used the Stanford AI Lab’s “Large Movie Review Dataset”. If you open this dataset, you will see two directories, marked negative and positive, containing movie reviews, one review per file. I trained the classifier on this dataset and saved it using pickle. Once the classifier was trained, I loaded the pickled classifier in my main Python file to use it.
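Saving and reloading a trained classifier with pickle can be sketched roughly as follows. The function names and the file path are my own placeholders here, not necessarily what the full code uses:

```python
import pickle


def save_classifier(classifier, path):
    """Serialize a trained classifier to disk (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(classifier, f)


def load_classifier(path):
    """Load a previously pickled classifier (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Training on the full dataset takes a while, so pickling means the main script can skip that step on every run. Only unpickle files you created yourself, since pickle can execute arbitrary code on load.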
One of the major steps in training a classifier is feature extraction. I did a simple feature extraction here. First I converted every word in the document to lower case, then removed all stop words, using the English stop word list from NLTK’s corpus. The Naive Bayes classifier in NLTK expects features as input in the form of a dictionary, so I wrote a small function that returns one.
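A minimal sketch of such a function might look like this. The regex tokenization, the `{word: True}` shape of the dictionary and the small fallback stop word set (used only when the NLTK stopwords corpus is not available) are assumptions for illustration:

```python
import re

try:
    # English stop words from NLTK's corpus, as described above.
    from nltk.corpus import stopwords
    STOP_WORDS = set(stopwords.words("english"))
except (ImportError, LookupError):
    # Assumption: tiny illustrative fallback when NLTK or its
    # stopwords corpus is not installed.
    STOP_WORDS = {"a", "an", "and", "the", "is", "was", "it", "of", "to", "in"}


def extract_features(text):
    """Lower-case the text, drop stop words, return a feature dictionary."""
    # Assumption: words are runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return {word: True for word in words if word not in STOP_WORDS}
```

For example, `extract_features("The movie was GREAT")` keeps only `movie` and `great` as features.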
The classifier takes two parameters: the function which extracts features from a given text (as described above) and the training set. The training set should be labelled, so we take the reviews from the directory marked positive, extract features from each of them, make a list out of the features and then make a tuple from that list and the review label. We do the same for the directory marked negative.
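Building the labelled training set could be sketched like this. The helper names, the `"positive"`/`"negative"` label strings and the directory arguments are placeholders of mine. Note that NLTK’s own `NaiveBayesClassifier.train` takes the list of (features, label) pairs directly, so in this sketch the feature function is applied while building that list:

```python
import os


def labeled_featuresets(directory, label, extract):
    """Return one (features, label) tuple per review file in `directory`."""
    featuresets = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as f:
            featuresets.append((extract(f.read()), label))
    return featuresets


def train_classifier(pos_dir, neg_dir, extract):
    """Train NLTK's Naive Bayes classifier on both review directories."""
    # Imported here so the loader above can be used without NLTK installed.
    from nltk import NaiveBayesClassifier

    training_set = (
        labeled_featuresets(pos_dir, "positive", extract)
        + labeled_featuresets(neg_dir, "negative", extract)
    )
    return NaiveBayesClassifier.train(training_set)
```

Here `extract` is whatever feature extraction function you use, and `pos_dir` and `neg_dir` point at the two review directories of the dataset.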
Once we are done training the classifier, we grab the 100 latest public tweets matching the user’s query using Tweepy, extract features from them the same way as before and feed them to the trained classifier.
The classifier marks each tweet as positive or negative. We can then predict the movie rating from the ratio of positive tweets among the 100 tweets extracted. You can see the full code on my GitHub profile, which should be pretty self-explanatory now.
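The last step can be sketched as below, assuming the tweet texts have already been fetched into a list. The `"positive"` label and the mapping of the ratio onto a 0 to 10 scale are assumptions for illustration, and the function names are my own:

```python
def positive_ratio(classifier, tweets, extract):
    """Fraction of tweets that the classifier labels as positive."""
    if not tweets:
        return 0.0
    positive = sum(
        1 for tweet in tweets
        if classifier.classify(extract(tweet)) == "positive"
    )
    return positive / len(tweets)


def predict_rating(ratio, scale=10):
    """Assumption: scale the positive ratio linearly to a 0..scale rating."""
    return round(ratio * scale, 1)
```

`classifier` can be any object with a `classify(features)` method, which is what a trained NLTK classifier provides.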