Getting Started With Amazon Web Services – Part 2


In Part 1 of this series, I showed you how to create an IAM user and grant it the permissions it needs in AWS. Now we move ahead and create EC2 instances. To create an EC2 instance, we first need a "Key Pair", which will let us ssh into the instance. To create a Key Pair -

1) Go to AWS console -> EC2 -> Key Pairs -> Create Key Pair.

2) Provide a name, create the Key Pair and download the .pem file.

3) Change the permissions of the .pem file to 400 (chmod 400 <path_to_pem_file>).

Now, create a security group -

1) Go to AWS console -> EC2 -> Security Groups -> Create Security Group.

2) Provide a name and description.

3) Click on Add Rule. In Type, select SSH. In Source, select My IP if you have a static IP, else select Anywhere. Click Create.

Now, let’s create the EC2 instance -

1) Go to AWS console -> EC2 -> Instances -> Launch Instance.

2) Select “Ubuntu Server 14.04 LTS”. Keep the default settings till Step 4.

3) In Step 5, give the instance a name, so that you can identify it in the list of EC2 instances in the console.

4) In Step 6, choose “Select an existing security group” and select the security group created above. Click Review and Launch.

5) Now click Launch and select the Key Pair which we have created before.

Now go to AWS console -> EC2 -> Instances. Select the instance which you just created and copy its Public IP from the panel below.

Assuming that you have followed everything from Part 1 of this tutorial series, go to the command line and run -

ssh -i <path_to_pem_file> ubuntu@<public_ip_of_ec2_instance>

You should be logged into the EC2 instance.
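
If you prefer to script this instead of clicking through the console, the same launch can be done with boto, the Python SDK set up in Part 1. This is only a minimal sketch; the region, AMI ID, key pair name and security group name below are placeholders you would replace with your own values -

import boto.ec2

# Placeholders: use your own region, AMI ID, key pair and security group
conn = boto.ec2.connect_to_region('us-east-1')
reservation = conn.run_instances(
    'ami-xxxxxxxx',                      # an Ubuntu 14.04 AMI ID for your region
    key_name='my-key-pair',
    security_groups=['my-security-group'],
    instance_type='t2.micro')
instance = reservation.instances[0]
print instance.id, instance.state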

Getting Started With Amazon Web Services – Part 1


AWS offers plenty of free stuff to try, which makes it one of the best cloud service providers for personal use. Today I’m gonna show you how to get started with AWS quickly in a few easy steps -

1) Register for AWS as shown here.

2) Now, go to the AWS console and create an IAM user by going to Services -> IAM -> Users -> Create New Users.

3) Provide a username and create the user. Go ahead and click “Download Credentials” and save the file in a secure place. The user is now created, but it doesn’t have any permissions yet.

4) In IAM, click on Groups -> Create New Group. Provide a group name and in policy templates select “Administrator Access”. Go ahead and complete the group creation.

5) Go back to Users in IAM and select the user we created before. Click on User Actions -> Add User to Groups and select the group created in the previous step. This user now has administrator privileges.

6) Now, we'll use the Python SDK for AWS to manage AWS from the command line. To try it out, clone a sample project using -

git clone https://github.com/awslabs/aws-python-sample.git

7) Install the SDK -

pip install boto

8) Now create a file “~/.aws/credentials” and put the following content in it, replacing the values with the ones from the credentials file downloaded in step 3 -

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

9) Now go to the cloned directory (from step 6) and run the following -

python s3_sample.py

10) If everything goes fine, the above command will create an S3 bucket and an object, and you will see the results as output. If you get any error related to credentials, try setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
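
For reference, the core of what the sample script does can be written in a few lines of boto. This is only a rough sketch of the idea, not the sample's exact code; the bucket name is a placeholder and must be globally unique -

import boto

conn = boto.connect_s3()                       # recent boto versions pick up ~/.aws/credentials
bucket = conn.create_bucket('my-unique-bucket-name')   # placeholder name
key = bucket.new_key('hello.txt')
key.set_contents_from_string('Hello World!')
print key.get_contents_as_string()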

Move ahead with Part 2 of the AWS tutorial.

How to Install Python Package When You Don’t Have Root Access?

Many times we don’t have root access on a machine. This mostly happens with machines that are shared between several users, like in a university. Pip is the easiest way to install Python packages, but by default it installs packages to “/usr/local/”, which needs root privileges to write to. You can easily override this default installation directory by using the following -

pip install <package_name> --user

This installs the package in “/home/<user_name>/.local/lib/python2.7/site-packages” (2.7 is the Python version number and may be different in your case) instead of “/usr/local/”. Now, to use this package in a Python script, you can’t simply do “import <package_name>” unless Python knows the directory where the package is installed. To make sure this new path is included, use the following in the Python script -

import sys
sys.path.append('/home/<user_name>/.local/lib/python2.7/site-packages')
import <package_name>

Python should accept this package now.
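
Alternatively, instead of hard-coding the path, you can ask Python for the user site-packages directory. A small sketch (requires Python 2.7 or later):

import sys, site
sys.path.append(site.getusersitepackages())   # resolves to ~/.local/lib/pythonX.Y/site-packages
import <package_name>

On many systems this directory is already on sys.path by default, so the explicit append may not even be needed.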

Using Unicode properly while using mysql with Python

I was facing a character set issue while downloading Facebook posts’ data. I was getting the data using the Facebook Graph API and dumping it into a MySQL database. When I tried to insert a post’s content directly into the database, I was getting -

sql = sql.encode(self.encoding)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position

I tried encoding the content to Unicode explicitly, but then a few characters were malformed in the database. I had already changed the default character set of the MySQL database to 'utf-8'. When I tried to use encode/decode, I was getting errors like -

return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1'

To resolve this, I first changed the default encoding (ASCII) in the Python script to UTF-8 as follows -

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Then, I made sure that while connecting to the database, I set the parameters 'use_unicode' and 'charset' properly as follows -

conn = pymysql.connect(host='xx', user='xx', passwd='xx', db='xx', use_unicode=True, charset='utf8')

After doing this, I could see the special characters properly in the database.
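
With the connection set up like this, inserting Unicode content works with a plain parameterised query. A minimal sketch, assuming a hypothetical table posts with a single content column -

cur = conn.cursor()
cur.execute("INSERT INTO posts (content) VALUES (%s)", (post_content,))  # post_content is a unicode string
conn.commit()
cur.close()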

How to get likes count on posts and comments from Facebook Graph API

Facebook has mentioned in their Graph API documentation that we can get the count of likes on a post as follows -

GET /{object-id}/likes HTTP/1.1

with total_count as a field, but it doesn't work as expected. Instead, it returns the list of users (name and id) who liked it, in a page-wise fashion. So to get the total number of likes, you need to traverse the pages and count the users who liked it, as follows -

def get_likes_count(id, graph):
    count = 0
    feed = graph.get_object(id + '/likes', limit=99)
    while 'data' in feed and len(feed['data']) > 0:
        count = count + len(feed['data'])
        if 'paging' in feed and 'next' in feed['paging']:
            url = feed['paging']['next']
            id = url.split('graph.facebook.com/')[1].split('/likes')[0]
            after_val = feed['paging']['cursors']['after']
            feed = graph.get_object(id + '/likes', limit=99, after=after_val)
        else:
            break
    return count

I have used this in a Python script which uses the Facebook Python SDK.
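
For context, here is roughly how the function above would be called with the Facebook Python SDK; the access token and post id below are placeholders -

import facebook

graph = facebook.GraphAPI(access_token='YOUR_ACCESS_TOKEN')  # placeholder token
print get_likes_count('PAGEID_POSTID', graph)                # placeholder post id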

Movie Rating Prediction Using Twitter’s Public Feed and Naive Bayes Classifier

With thousands of tweets being posted around the time a movie is released, we can certainly gauge the sentiments of the people watching it from those tweets. So why not try to predict the movie's rating through sentiment analysis of those tweets? Here I'm going to use the Naive Bayes classifier from Python's NLTK library to classify tweets as positive (if people liked the movie) or negative (if they hated it).

For training the classifier I used Stanford AI Lab's "Large Movie Review Dataset". If you open this dataset, you will see two directories, marked as negative and positive, containing movie reviews, one review per file. I trained the classifier using this dataset and saved it using pickle. Once the classifier was trained, I loaded the pickled classifier in my main Python file to use it.

One of the major steps in training a classifier is feature extraction. I did a simple feature extraction here. First I converted every word in the document to lower case, then removed all stop words, which I grabbed from NLTK's stopwords corpus for English. The Naive Bayes classifier in NLTK expects features as input in the form of a dictionary, so I used the following function to get the desired result:

def get_features(tweet):
    global stop
    words = [w for w in tweet if w not in stop]
    f = {}
    for word in words:
        f[word] = word
    return f

Training the classifier needs two things: the function which extracts features from a given text (described above) and a labelled training set. To build the training set, we take the reviews from the directory marked positive, extract features from each one, and pair each feature dictionary with the review's label as a tuple. We do the same for the directory marked negative.
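
Put together, the training step looks roughly like this. This is a sketch rather than the exact code from my project; the directory paths and the 'pos'/'neg' labels are assumptions used for illustration -

import os
import nltk

def build_training_set(pos_dir, neg_dir):
    training_set = []
    for directory, label in [(pos_dir, 'pos'), (neg_dir, 'neg')]:
        for fname in os.listdir(directory):
            with open(os.path.join(directory, fname)) as f:
                words = f.read().lower().split()   # lower-case and tokenise the review
            training_set.append((get_features(words), label))
    return training_set

classifier = nltk.NaiveBayesClassifier.train(build_training_set('aclImdb/train/pos', 'aclImdb/train/neg'))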

Once we are done training the classifier, we grab the 100 latest public tweets matching the user's query using Tweepy, extract features the same way as before, and feed them to the trained classifier.

def get_tweets(movie):
    auth = tweepy.OAuthHandler(ckey, csecret)
    auth.set_access_token(atoken, asecret)
    api = tweepy.API(auth)
    tweets = []
    i = 0
    for tweet in tweepy.Cursor(api.search, q=movie.lower(), count=100,
                               result_type='recent', include_entities=True,
                               lang='en').items():
        tweets.append(tweet.text)
        i += 1
        if i == 100:
            break
    return tweets

The classifier marks each tweet as positive or negative. Now we can predict the movie rating based on the ratio of positive tweets among the 100 tweets extracted. You can see the full code on my Github profile, which should be pretty self-explanatory now.
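
As an illustration, converting the positive ratio into a score out of 10 could look like this. Again, a sketch only: the label names match the training sketch above and the query string is a placeholder -

tweets = get_tweets('movie name here')    # placeholder query
labels = [classifier.classify(get_features(t.lower().split())) for t in tweets]
rating = 10.0 * labels.count('pos') / len(labels)
print 'Predicted rating: %.1f / 10' % rating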

Saving a Trained Classifier in Python using Pickle

This is an issue I recently faced while working on a Naive Bayes classifier problem. Suppose we have huge training data and a classifier which trains slowly. If we train the classifier in the same script in which we use it on our test data, then each time we want to run it on the test data, we have to train it first.

This is redundant when our training data remains the same, which is the case most of the time. To overcome this, we can train the classifier in a separate script, save it, and then quickly load it in the script where we test the classifier. This can be done using pickle.

Let me provide an example using NLTK's Naive Bayes classifier. I just need to do the following in the script where I'm training my classifier -

classifier = nltk.NaiveBayesClassifier.train(training_set)
f = open('bayes.pickle', 'wb')
pickle.dump(classifier, f)
f.close()

Once I'm done with training, I can unpickle this classifier in another script and use it right away as follows -

f = open('bayes.pickle', 'rb')
classifier = pickle.load(f)
f.close()

Extracting Text From a Web Page Using BeautifulSoup

Extracting text from a web page is useful for many applications, like indexing web pages for a search engine, matching two documents for similarity, or finding keywords.

I would like to show you a simple and effective method to implement such a “Text Extractor” in Python using BeautifulSoup. You can see the code below with its explanation -

from BeautifulSoup import *
import urllib2

def gettextonly(soup):
    v = soup.string  # Check if the main tag has any text associated with it
    if not v:
        c = soup.contents  # Here, we get the list of tags in the webpage
        resulttext = ''
        for t in c:  # Picking each tag one by one
            subtext = gettextonly(t)  # Extracting text for that tag
            resulttext += subtext + '\n'  # Accumulating the result
        return resulttext
    else:
        return v.strip()  # If the main tag has any text, return it after stripping trailing whitespace

if __name__ == '__main__':
    url = urllib2.urlopen("http://www.amazon.co.uk/dp/B003IHVQTG/")
    soup = BeautifulSoup(url.read())
    print gettextonly(soup)

This is taken from the book "Programming Collective Intelligence" by Toby Segaran.

First Kaggle Submission – Titanic: Machine Learning from Disaster

This is my first attempt at a Kaggle competition. In this competition we are given data for a set of people and we need to predict whether they survived the Titanic tragedy. Here I'm discussing my approach to the problem. I've used Python's scikit-learn library (RandomForestClassifier) and numpy.

We are given the following attributes for a person in the training data (train.csv) -

PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked – which are pretty much self explanatory.

My approach involved a bit of manual tweaking with the train data and then application of Random Forest algorithm.

How did I tweak and pre-process the train/test data?

  • I skipped a few attributes which I thought wouldn't contribute to the prediction: PassengerId, Name, Ticket, Cabin and Embarked. Ticket information is already covered in Fare, hence I'm skipping Ticket. The rest seem to be of little or no use in determining the survival probability.
  • I summed up SibSp and Parch to form a new attribute “Relatives”, which keeps count of the total relatives a person has on the ship.
  • There is one missing value in test.csv for attribute Fare for PassengerId 1044. I took the average value on that attribute and filled that slot (with value 60).
  • I also added a dummy (with all zeroes) Survived column in test.csv just to make its dimensions compatible with train.csv. As I’ve written a single function to deal with both the files, this is needed.
  • Sex is a categorical attribute, and to apply Random Forest to this data we need to convert it to a numerical one. I replaced Male with 0.1 and Female with 0.9, since females had a higher chance of survival than males, all other attributes being the same.
  • Different attributes span different ranges. To make them comparable, we scale the data by subtracting the mean and dividing by the standard deviation, so the values roughly follow a standard normal distribution. A rough sketch of these pre-processing steps follows this list.
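
The pre-processing described above can be sketched roughly as follows. This is not the exact code from my script; in particular, treating a missing Age as 0 is a simplifying assumption made only for illustration -

import csv
import numpy as np

def preprocess(csv_path):
    X = []
    for r in csv.DictReader(open(csv_path)):
        sex = 0.1 if r['Sex'] == 'male' else 0.9            # categorical -> numeric
        relatives = float(r['SibSp']) + float(r['Parch'])   # new "Relatives" attribute
        fare = float(r['Fare']) if r['Fare'] else 60.0      # fill the one missing Fare
        age = float(r['Age']) if r['Age'] else 0.0          # simplifying assumption
        X.append([float(r['Pclass']), sex, age, relatives, fare])
    X = np.array(X)
    return (X - X.mean(axis=0)) / X.std(axis=0)             # scale each column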

How to get values for Random Forest algorithm parameters?

I used cross-validation here to measure the accuracy of the algorithm for different parameter values and chose the one which gave the best result, using 70% of the data for training and 30% for validation. This gave me a value of 100 for the n_estimators parameter.

How to apply the algorithm?

This is the easiest part of all. You just need to create an instance of the classifier, fit it on the training data, and predict on the test data; each operation takes a single line of code.
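
For reference, those three lines look roughly like this (a sketch; X_train, y_train and X_test come from the pre-processing step above) -

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)   # value chosen via cross-validation
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)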

I’ve shared my code at github - https://github.com/theharshest/Kaggle/blob/master/titanic.py

Give it a try, modify it as per your approach, and let me know the improvements which can be made.