Using Unicode Properly with MySQL in Python

I was facing a character-set issue while downloading Facebook posts’ data. I was getting the data using the Facebook Graph API and dumping it into a MySQL database. When I tried to insert a post’s content directly into the database, I was getting -

sql = sql.encode(self.encoding)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position

I tried encoding the content to Unicode explicitly, but then a few characters were malformed in the database. I had already changed the default character set of the MySQL database to 'utf-8'. When I tried to use encode/decode, I was getting errors like -

return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1'

To resolve this, I first changed the default encoding (ASCII) in the Python script to UTF-8 as follows -

import sys
reload(sys)  # Python 2 only: reload is needed because setdefaultencoding is removed after startup
sys.setdefaultencoding('utf-8')

Then, I made sure that while connecting to the database I set the parameters 'use_unicode' and 'charset' properly, as follows -

conn = pymysql.connect(host=xx, user=xx, passwd=xx, db=xx, use_unicode=True, charset='utf8')

After doing this, I could see the special characters properly in the database.

How to get the likes count on posts and comments from the Facebook Graph API

Though Facebook mentions in the Graph API documentation that we can get the like count on a post as follows -

GET /{object-id}/likes HTTP/1.1

with total_count as a field, it doesn’t work as expected. Instead, it returns the list of users (their names and IDs) in a page-wise fashion. So to get the total number of likes, you need to traverse the result page by page and count the users who liked the post, as follows -

def get_likes_count(id, graph):
    count = 0
    feed = graph.get_object('/' + id + '/likes', limit=99)
    while 'data' in feed and len(feed['data']) > 0:
        count = count + len(feed['data'])
        if 'paging' in feed and 'next' in feed['paging']:
            url = feed['paging']['next']
            id = url.split('graph.facebook.com')[1].split('/likes')[0]
            after_val = feed['paging']['cursors']['after']
            feed = graph.get_object(id + '/likes', limit=99, after=after_val)
        else:
            break
    return count

I have used this in a Python script which uses the Facebook Python SDK.
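
For illustration, a minimal usage sketch with the Facebook Python SDK might look like this; ACCESS_TOKEN and POST_ID are placeholders, not values from the original script.

import facebook  # the facebook-sdk package

ACCESS_TOKEN = 'your-app-access-token'  # placeholder: a valid Graph API token
POST_ID = 'pageid_postid'               # placeholder: id of the post to inspect

graph = facebook.GraphAPI(access_token=ACCESS_TOKEN)
print(get_likes_count(POST_ID, graph))  # total number of likes on the post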

Movie Rating Prediction Using Twitter’s Public Feed and Naive Bayes Classifier

With thousands of tweets being posted around the time a movie is released, we can certainly gauge from those tweets the sentiments of the people watching it. So why not try to predict the movie rating through sentiment analysis of those tweets? Here I’m going to use the Naive Bayes classifier from Python’s NLTK library to classify tweets as positive (if people liked the movie) or negative (if they hated it).

For training the classifier I used Stanford AI Lab’s “Large Movie Review Dataset”. If you open this dataset, you will see two directories, marked as negative and positive, containing movie reviews, one review per file. I trained the classifier on this dataset and saved it using pickle. Once the classifier was trained, I loaded the pickled classifier in my main Python file to use it.

One of the major steps in training a classifier is feature extraction. I did simple feature extraction here: first I converted every word in the document to lower case, then removed all stop words, which I grabbed from NLTK’s corpus of English stop words. The Naive Bayes classifier in NLTK expects features as input in the form of a dictionary, so I used the following function to get the desired result:

def get_features(tweet):
    global stop
    words = [w for w in tweet if w not in stop]  # drop stop words
    f = {}
    for word in words:
        f[word] = word  # NLTK expects a dictionary of features
    return f

The classifier takes two parameters: the function which extracts features from a given text (as described above), and the training set. The training set should be labelled, so we take the reviews from the directory marked positive, extract features from each one, and make a tuple of the feature set and the review label; the training set is the list of these tuples. We do the same for the directory marked negative.
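
As a rough sketch (not the original code), building such a labelled training set could look like the following, assuming the reviews sit in the dataset’s train/pos and train/neg directories and that get_features() is the extractor shown above; the 'pos'/'neg' labels and the load_reviews helper are my own placeholders.

import os

def load_reviews(directory, label):
    examples = []
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename)) as fh:
            words = fh.read().lower().split()              # lower-case and tokenise the review
            examples.append((get_features(words), label))  # (feature dict, label) tuple
    return examples

training_set = (load_reviews('aclImdb/train/pos', 'pos') +
                load_reviews('aclImdb/train/neg', 'neg'))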

Now, once we are done training the classifier, we grab the 100 latest public tweets with Tweepy, based on the user’s query, extract features the same way as before, and feed them to the trained classifier.

def get_tweets(movie):
    auth = tweepy.OAuthHandler(ckey, csecret)
    auth.set_access_token(atoken, asecret)
    api = tweepy.API(auth)
    tweets = []
    i = 0
    for tweet in tweepy.Cursor(api.search, q=movie.lower(), count=100,
                               result_type='recent', include_entities=True,
                               lang='en').items():
        tweets.append(tweet.text)
        i += 1
        if i == 100:
            break
    return tweets

The classifier marks each tweet as positive or negative. Now we can predict the movie rating based on the ratio of positive tweets among the 100 tweets extracted. You can see the full code on my GitHub profile; it should be pretty self-explanatory by now.
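
As a rough sketch of that last step, assuming the pickled NLTK classifier is loaded as classifier, that it was trained with 'pos'/'neg' labels, and that the rating is simply the positive ratio scaled to 10 (that scaling is my assumption, not from the original code):

def predict_rating(movie):
    tweets = get_tweets(movie)
    positive = 0
    for t in tweets:
        features = get_features(t.lower().split())   # same feature extraction as in training
        if classifier.classify(features) == 'pos':   # assumed label names
            positive += 1
    return 10.0 * positive / len(tweets)             # positive ratio scaled to a rating out of 10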

Saving a Trained Classifier in Python using Pickle

This is an issue I recently faced while working on a Naive Bayes classifier problem. Suppose we have huge training data and a classifier which trains slowly, so training takes a significant amount of time. If we train the classifier in the same script in which we use it on our test data, then every time we want to run it on the test data we need to train it first.

This is redundant when the training data stays the same, which is the case most of the time. To avoid it, we can train the classifier in a separate script, save it in some way, and then quickly load it in the script where we test it. This can be done using pickle.

Let me provide an example using an NLTK classifier. I just need to do the following in the script where I’m training my classifier -

classifier = nltk.NaiveBayesClassifier.train(training_set)
f = open('bayes.pickle', 'wb')
pickle.dump(classifier, f)
f.close()

Once I’m done with training, I can unpickle this classifier in another script and use the trained classifier straight away, as follows -

f = open('bayes.pickle', 'rb')  # open in binary mode, matching the 'wb' used when dumping
classifier = pickle.load(f)
f.close()

Extracting Text From a Web Page Using BeautifulSoup

Extracting text from a web page is useful for many applications, such as indexing web pages to implement a search engine, matching two documents for similarity, or finding keywords.

I would like to show you a simple and effective way to implement such a “text extractor” in Python using BeautifulSoup. You can see the code below, with explanations in the comments -

from BeautifulSoup import *
import urllib2

def gettextonly(soup):
    v = soup.string                       # Check if the main tag has any text associated with it
    if not v:
        c = soup.contents                 # Here, we get the list of tags in the web page
        resulttext = ''
        for t in c:
            subtext = gettextonly(t)      # Extract text for that tag
            resulttext += subtext + '\n'  # Accumulate the result
        return resulttext
    else:
        return v.strip()                  # If the main tag has text, return it after stripping whitespace

if __name__ == '__main__':
    url = urllib2.urlopen("http://www.amazon.co.uk/dp/B003IHVQTG/")
    soup = BeautifulSoup(url.read())
    print gettextonly(soup)

This is taken from the book “Programming Collective Intelligence” by Toby Segaran.

First Kaggle Submission – Titanic: Machine Learning from Disaster

This is my first attempt at a Kaggle competition. In this competition we are given data for a set of people, and we need to predict the probability of their survival in the Titanic tragedy. I’m discussing my approach to this problem here. I’ve used Python’s scikit-learn library (RandomForestClassifier) and NumPy to deal with it.

We are given the following attributes for a person in the training data (train.csv) -

PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked, which are pretty much self-explanatory.

My approach involved a bit of manual tweaking of the training data, followed by application of the Random Forest algorithm.

How did I tweak and pre-process the train/test data?

  • I skipped a few attributes which I thought couldn’t contribute to the prediction: PassengerId, Name, Ticket, Cabin and Embarked. The ticket information is already covered by Fare, hence I’m skipping Ticket; the rest seem to be of little or no use in determining the survival probability.
  • I summed up SibSp and Parch to form a new attribute, “Relatives”, which keeps a count of the total relatives a person has on the ship.
  • There is one missing value in test.csv for the attribute Fare, for PassengerId 1044. I took the average value of that attribute and filled the slot (with the value 60).
  • I also added a dummy Survived column (all zeroes) to test.csv just to make its dimensions compatible with train.csv. As I’ve written a single function to deal with both files, this is needed.
  • Sex is a categorical attribute, and to apply Random Forest to this data we need to convert it to a numerical one. I simply replaced Male with 0.1 and Female with 0.9, since a female had a better chance of survival than a male, all other attributes being the same.
  • Different attributes span different ranges, so to make them comparable we scale the data by subtracting the mean and dividing by the standard deviation. The values then follow a standard normal distribution. (A rough sketch of these pre-processing steps follows this list.)
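
For illustration, the pre-processing above could be sketched with pandas as follows; the fill for missing Age values and the exact function shape are my assumptions, while the 0.1/0.9 encoding, the Relatives sum, the dummy Survived column and the Fare fill value of 60 come from the list above.

import pandas as pd

def preprocess(path, is_test=False):
    df = pd.read_csv(path)
    if is_test:
        df['Survived'] = 0                                    # dummy column so test.csv matches train.csv
    df['Relatives'] = df['SibSp'] + df['Parch']               # total relatives on board
    df['Sex'] = df['Sex'].map({'male': 0.1, 'female': 0.9})   # numerical encoding of Sex
    df['Fare'] = df['Fare'].fillna(60)                        # the single missing Fare in test.csv
    features = df[['Pclass', 'Sex', 'Age', 'Relatives', 'Fare']].astype(float)
    features = features.fillna(features.mean())               # assumption: fill remaining gaps (e.g. Age) with the mean
    features = (features - features.mean()) / features.std()  # scale to zero mean, unit variance
    return features.values, df['Survived'].values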

How to get values for Random Forest algorithm parameters?

I’ve used cross-validation here to measure the performance of the algorithm for different parameter values, and chose the one which gave the best result, using 70% of the data for training and 30% for validation. This gave me a value of 100 for the n_estimators parameter.
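
A rough sketch of that parameter search, assuming scikit-learn's train_test_split and the arrays from the pre-processing sketch above; the candidate values for n_estimators are illustrative, not taken from the original code.

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
from sklearn.ensemble import RandomForestClassifier

X, y = preprocess('train.csv')
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)  # 70/30 split
for n in [10, 50, 100, 200]:                                            # illustrative candidates
    clf = RandomForestClassifier(n_estimators=n)
    clf.fit(X_train, y_train)
    print("n_estimators=%d accuracy=%.3f" % (n, clf.score(X_val, y_val)))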

How to apply the algorithm?

This is the easiest part of all. You just need to create a classifier object, fit it to the training data, and predict on the test data; each operation takes a single line of code.
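
Those three one-liners look roughly like this; X_train, y_train and X_test stand for the pre-processed arrays from the earlier sketches.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)  # value chosen via the cross-validation above
clf.fit(X_train, y_train)                       # fit on the training data
predictions = clf.predict(X_test)               # predict survival for the test data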

I’ve shared my code on GitHub - https://github.com/theharshest/Kaggle/blob/master/titanic.py

Give it a try, modify it as per your own approach, and let me know what improvements can be made.