Model Hyperparameter Tuning In Scikit Learn Using GridSearch

Choosing the right hyperparameters for a machine learning model is essential for getting the best generalization performance. Many hyperparameters control the amount of regularization applied to the model and hence help prevent it from overfitting. Grid search does an exhaustive search over the set of hyperparameter values you provide, finding the best combination using cross validation (or some other evaluation method) together with a scoring function.

Let us see how to do hyperparameter optimization in scikit-learn using GridSearchCV -

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # lives in sklearn.grid_search in older releases

clf_LR = Pipeline([('chi2', SelectKBest(chi2)), ('lr', LogisticRegression())])
params = {
          'chi2__k': [800, 1000, 1200, 1400, 1600, 1800, 2000],
          'lr__C': [0.0001, 0.001, 0.01, 0.5, 1, 10, 100, 1000],
          'lr__class_weight': [None, 'auto'],  # 'auto' was renamed 'balanced' in newer scikit-learn releases
          'lr__tol': [1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]
          }
gs = GridSearchCV(clf_LR, params, cv=5, scoring='f1')
gs.fit(X, y)  # X: feature matrix, y: labels (defined below)

Here, I’m using Logistic Regression as the classifier and chi2 as the feature selection method. clf_LR is a standard pipeline: my data (represented as a set of feature vectors) first goes through the feature selection step (chi2) and then through the classifier (LogisticRegression).

params is where I’ve defined all the parameter values I want to try out. Any parameter not listed here keeps its default value. It is a dictionary whose keys are hyperparameter names (written as <model_name>__<hyperparameter_name>) and whose values are the lists of values we want grid search to try. In each key, “model_name” is the name given to that step in the Pipeline, followed by two underscores and then the hyperparameter name. The hyperparameter name should be taken from the scikit-learn documentation, from the model’s page, under the “Parameters:” section, like here. Remember that some hyperparameter values depend on each other, so if you fix two hyperparameters and try to vary a third that depends on the first two, the third hyperparameter can only take a limited range of values. Trying values outside that range would raise an error, which is perfectly fine; in that case, just change the values appropriately.

Lastly, you see how I called GridSearchCV. cv defines the number of cross-validation folds; in this case I’m doing 5-fold cross-validation. scoring defines the scoring function to optimize. You can see the different scoring values possible here. Now, just fit the dataset with the labels: X is my sample feature matrix and y holds the corresponding labels.

Now, we can get the best combination of parameters, from the set provided above, based on the scoring and evaluation criteria -

print(gs.best_estimator_.get_params())

We can also get the best score (here, the F1 measure) for this combination of parameters -

print(gs.best_score_)
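
Since GridSearchCV refits the best estimator on the whole dataset by default (refit=True), the fitted gs object can be used directly for prediction. A minimal sketch, where X_new is a hypothetical matrix of new samples with the same features as X -

y_pred = gs.predict(X_new)  # X_new is hypothetical; predictions come from the best estimator found
print(gs.best_params_)      # just the winning parameter combination, without the other defaults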


Visualize Real Time Twitter Trends With Kibana

Kibana offers a simple way to visualize logs/data coming from Logstash. With Twitter as one of the standard inputs of Logstash, it is very easy to visualize Twitter trends in Kibana. In this post, I will demonstrate how you can get Twitter data for a few search terms, pass it through Elasticsearch for indexing, and customize a Kibana dashboard to follow the tweets in real time. I’m using Ubuntu 14.04 on AWS; if you haven’t used AWS before, you might want to look at my previous post on getting started with AWS. By the end of this tutorial you will have a live dashboard comparing the tweet volume for two search terms.

I assume that you have instantiated an EC2 instance (Ubuntu 14.04) and logged into it. The combination of Logstash, Elasticsearch and Kibana is so popular for tracking and managing logs that it is termed the ELK stack. As this tutorial is for learning purposes, we will set up everything on a single machine. Let’s get started -

    1. Install Java -
      sudo add-apt-repository -y ppa:webupd8team/java
      sudo apt-get update
      sudo apt-get -y install oracle-java7-installer
      
    2. Install Elasticsearch -
      wget -O - http://packages.elasticsearch.org/GPG-KEY-elasticsearch | sudo apt-key add -
      echo 'deb http://packages.elasticsearch.org/elasticsearch/1.1/debian stable main' | sudo tee /etc/apt/sources.list.d/elasticsearch.list
      sudo apt-get update
      sudo apt-get -y install elasticsearch=1.1.1
      
    3. Edit the Elasticsearch config file: open “/etc/elasticsearch/elasticsearch.yml”, add the line “script.disable_dynamic: true” at the end of the file, then search for the line with “network.host:” and change it to “network.host: localhost”. Now, start Elasticsearch using -
      sudo service elasticsearch restart
    4. Install Kibana -
      cd ~; wget https://download.elasticsearch.org/kibana/kibana/kibana-3.0.1.tar.gz
      tar xvf kibana-3.0.1.tar.gz
      
    5. Edit the Kibana config file “kibana-3.0.1/config.js” and change the line with the keyword “elasticsearch” to elasticsearch: "http://"+window.location.hostname+":80", (keeping the trailing comma). Now move the files to the proper location -
      sudo mkdir -p /var/www/kibana3
      sudo cp -R kibana-3.0.1/* /var/www/kibana3/
      
    6. Install nginx -
      sudo apt-get install nginx
      cd ~; wget https://gist.githubusercontent.com/thisismitch/2205786838a6a5d61f55/raw/f91e06198a7c455925f6e3099e3ea7c186d0b263/nginx.conf
      
    7. Now edit this config file (nginx.conf): change the “server_name” value to the Elastic IP of the node and the “root” line to “root /var/www/kibana3;”. Then copy the file to the right location. Provide a username in place of <username> below (and a proper password, when asked) -
      sudo cp nginx.conf /etc/nginx/sites-available/default
      sudo apt-get install apache2-utils
      sudo htpasswd -c /etc/nginx/conf.d/kibana.myhost.org.htpasswd <username>
      sudo service nginx restart
      
    8. Install Logstash -
      echo 'deb http://packages.elasticsearch.org/logstash/1.4/debian stable main' | sudo tee /etc/apt/sources.list.d/logstash.list
      sudo apt-get update
      sudo apt-get install logstash=1.4.2-1-2c0f5a1
      
    9. Let’s configure Logstash now. Create a file “logstash.conf” in the home directory and put the following content in it. “term1” (e.g. “modi”) is any term you want to search for in tweets; “term2” (e.g. “obama”) is any other term you want to compare the results with. “tweets1” (e.g. “moditweets”) and “tweets2” (e.g. “obamatweets”) can be anything; give them meaningful names, as we will refer to both kinds of tweets by these names in Kibana. The values of “consumer_key”, “consumer_secret”, “oauth_token” and “oauth_token_secret” should be taken from a Twitter app, which you need to create using your Twitter developer account -
      input {
        twitter {
          consumer_key => "<proper_value>"
          consumer_secret => "<proper_value>"
          keywords => ["<term1>"]
          oauth_token => "<proper_value>"
          oauth_token_secret => "<proper_value>"
          type => "tweets1"
        }

        twitter {
          consumer_key => "<proper_value>"
          consumer_secret => "<proper_value>"
          keywords => ["<term2>"]
          oauth_token => "<proper_value>"
          oauth_token_secret => "<proper_value>"
          type => "tweets2"
        }
      }

      output {
        elasticsearch { host => localhost }
        stdout { codec => rubydebug }
      }
      
    10. Once this is done, you should see the Kibana dashboard if you point your browser to the Elastic IP address of the EC2 node.
    11. Configuring Kibana: to visualize the tweets in real time, we need to make a few changes -
      1. Add two queries in the “QUERY” section. Click + to add another query. Enter “tweets1” in one query and “tweets2” in the other.
      2. In the top right corner, click on “Configure dashboard”. Click on Timepicker and change “Relative time options” to “1m” and “Auto-refresh options” to “1s”.
      3. Now go to “FILTERING” and remove all existing filters. In the top right section, there is a drop-down to select time filtering. Click it and select “Last 1m”. Click it again and select “Auto-Refresh -> Every 1s”.
      4. You can configure the main graph area the way you want. For example, I converted the bars to lines. There are many options you can change; select whatever best suits your needs.
    12. Now it’s time to see the magic. Let’s start Logstash and wait for some time; we will then see the trends for the two kinds of tweets in the Kibana dashboard (a quick way to verify that tweets are reaching Elasticsearch is sketched after this list). To start Logstash, run -
      sudo /opt/logstash/bin/logstash -f ~/logstash.conf
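
To check that tweets are actually being indexed, you can ask Elasticsearch for the document count of each tweet type. A minimal sketch, run on the EC2 node itself, assuming Python 2, the default Logstash index naming (logstash-*) and Elasticsearch listening on localhost:9200 as configured above -

import json
import urllib2  # Python 2 standard library

# Count the indexed documents for each tweet type via the Elasticsearch _count API
for tweet_type in ("tweets1", "tweets2"):
    url = "http://localhost:9200/logstash-*/_count?q=type:%s" % tweet_type
    resp = json.load(urllib2.urlopen(url))
    print("%s: %d" % (tweet_type, resp["count"]))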
      


Quick Guide to Using Python’s Virtualenv

This is a quick guide to understanding and using Python’s virtualenv, which is an awesome utility to manage multiple isolated Python environments.


Why is virtualenv needed?

Case 1: Suppose we have a Python application A which uses version 1.0 of package X but is incompatible with higher versions of it. On the other hand, we have an application B which uses version 1.1 of package X but is incompatible with lower versions. We want to use both applications on the same machine.

Case 2: A package Y was recently released in beta and we want to try it out, but we don’t want to install it in the global site-packages directory as it might break something.

In the above two cases, virtualenv is very helpful. In the first case, we can create two virtualenvs and install version 1.0 of package X in one and version 1.1 in the other. The two virtualenvs are isolated from each other, so the two installations don’t interfere with each other. In the second case, we can create a virtualenv and install package Y in it, which installs Y in the site-packages directory of that virtualenv instead of the global site-packages.

How to install virtualenv?

virtualenv can be installed directly using pip.

pip install virtualenv

How to use virtualenv?

First create a virtualenv using -

virtualenv env1

“env1” is your environment name; change it accordingly. This creates a virtualenv named env1 as a directory in whatever location you run the command. To activate and start using this newly created virtualenv, go to env1/bin and run -

source activate

Now you are inside the virtualenv. Any package that you install now gets installed to env1/lib/pythonx.x/site-packages instead of the global site-packages, so it doesn’t affect your global packages or other virtualenvs. To exit this virtualenv, run -

deactivate

To remove a virtualenv, simply delete the corresponding directory.
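
A quick way to confirm which environment is active is to inspect the interpreter from Python itself. A minimal sketch (the exact paths will depend on where you created env1) -

import sys

# Inside an activated virtualenv, sys.prefix points at the env directory (e.g. .../env1);
# outside a virtualenv it points at the system Python installation.
print(sys.prefix)
print(sys.executable)  # full path of the interpreter that is actually running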


Basic Git Commands for Open Source Contribution

In my last post I discussed how one can get started with contributing to open source. You need to know some basic Git commands to work smoothly on open source projects.


Let’s discuss them one by one, taking an example project, Titanic -

  1. Set up Git as given here. This is a pretty quick tutorial on setting up Git.
  2. You need to fork the repository you want to work on, in this case Titanic. Forking brings that project’s code into our own workspace and hence allows us to make changes to it. Log in to GitHub, go to the repository you want to fork and click “Fork” on the upper right hand side of the page. Once forked, you can see the repository in your GitHub profile, and below its name you would see “forked from <original_repository>”.
  3. Now we need to copy this code to our machine so that we can make changes to it. For that we “clone” the forked repository. To clone it to your machine, run the following in any directory -
    git clone https://github.com/theharshest/titanic
  4. We now have the whole source code of this project. This is the time to make changes, i.e. fix the bug in the concerned file. After you are done making changes to the file(s), we need to “stage” them. For that, run the following from the repository directory -
    git add --all
  5. At any point of time you can run git status to see the status of your work in git. Now, we need to “commit” the changes we just made using -
    git commit -m "Adding a bugfix"
  6. After committing the changes, we would push the changes to our forked Github repository. To do that, run -
    git push origin master
  7. Now our forked repository has the changes, but the main repository from which we forked doesn’t know anything about the changes we made. We, of course, can’t make changes to it directly, as we are not the owners of the repository. Instead, we request the owner of the repository to look at the changes we made; if they feel the bugfix is correct, they can approve and merge those changes into the main repository. To achieve this, we create what is called a “pull request” -
    1. Go to your forked repository and in the right sidebar, click on “Pull Requests”.
    2. Now click on “New Pull Request”, provide a description and create the pull request.
  8. After this, the author of the original repository would see your pull request and, if they think you’ve made the correct changes, go ahead and merge and close the pull request. Now the main repository has the changes that you’ve made.

This is a very high-level overview of Git which doesn’t go into the meaning of the commands. I would suggest you also watch a screencast covering Git basics to get a better hold of them.


Getting Started with Open Source Contribution, The Easy Way!

I recently made my first open source contribution and would like to share my experience here. This post should also serve as a tutorial for others who want to contribute to open source but haven’t found the right direction to start. I made a bug fix to the Mozilla project called Titanic; you can see my username (theharshest) in the contributors list of the project. Here is a step-by-step guideline on how I did it -


1) To start, you need to first create a Bugzilla account.

2) Now, let’s search for some easy bugs relevant to our interest. For that go to Bugs Ahoy.

3) Scroll down and, on the left side under “Display only”, select both options: “Bugs with no owner”, to make sure the bug we pick is not assigned to someone else, and “Simple bugs”, to make sure we pick bugs tagged with [good first bug], which are easy bugs.

4) Now, in the “Do you know” section, select the languages you are good at. Wait for some time and the list of bugs matching the selected filters will get populated.

5) Select a bug from the list and log in with the Bugzilla account created in Step 1. You will see a “Mentors” field in the bug details. This person will be your point of contact while you work on the bug.

6) Now, open an IRC client, like XChat, and connect to the Mozilla network as per the instructions given here. Go to the #ateam channel (or the channel corresponding to the team whose product’s bug you are working on). Find your mentor there and start talking to them about everything you need to fix the bug.

This is the basic workflow to follow. You also need some basic Git skills to get it done; my next post covers the basic Git commands you would need to accomplish this task.


Getting Started With Amazon Web Services – Part 2


In Part 1 of this series, I showed you how to create a user and give it privileges to do stuff in AWS. Now we move ahead and create EC2 instances. To create EC2 instances, first we need a “Key Pair”, which enables us to ssh into the EC2 instance. To create a Key Pair -

1) Go to AWS console -> EC2 -> Key Pairs -> Create Key Pair.

2) Provide a name, create Key Pair and download the pem file.

3) Change the permissions of the pem file to 400.

Now, create a security group -

1) Go to AWS console -> EC2 -> Security Groups -> Create Security Group.

2) Provide a name and description.

3) Click on Add Rule. For Type, select SSH. For Source, select My IP if you have a static IP, otherwise select Anywhere. Click Create.

Now, let’s create the EC2 instance -

1) Go to AWS console -> EC2 -> Instances -> Launch Instance.

2) Select “Ubuntu Server 14.04 LTS”. Keep the default settings till Step 4.

3) In Step 5, provide a name to the instance, so that we can identify this instance from the list of EC2 instances in the console.

4) In Step 6, choose “Select an existing security group” and select the security group created above. Click Review and Launch.

5) Now click Launch and select the Key Pair which we have created before.

Now go to AWS console -> EC2 -> Instances. Select the instance which you just created and copy its Public IP from the panel below.

Assuming that you have followed everything from Part 1 of this tutorial series, go to the command line and run -

ssh -i <path_to_pem_file> ubuntu@<public_ip_of_ec2_instance>

You should be logged into the EC2 instance.
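
If you also set up the Python SDK (boto) and credentials as described in Part 1, you can look up running instances and their public IPs programmatically instead of copying them from the console. A minimal sketch with boto 2, assuming your instances live in the us-east-1 region -

import boto.ec2

# Assumes the credentials file from Part 1 (~/.aws/credentials) and the us-east-1 region
conn = boto.ec2.connect_to_region("us-east-1")
for instance in conn.get_only_instances():
    print("%s  %s  %s" % (instance.id,
                          instance.tags.get("Name", "-"),
                          instance.ip_address))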


Getting Started With Amazon Web Services – Part 1


AWS offers enough free resources to try out, which makes it one of the best cloud service providers for personal use. Today I’m going to show you how to get started with AWS quickly, in a few easy steps -

1) Register for AWS as shown here.

2) Now, go to the AWS console and create an IAM user by going to Services -> IAM -> Users -> Create New Users.

3) Provide a username and create the user. Go ahead, click “Download Credentials” and save the file in a secure place. The user is created now, but it doesn’t have any permissions yet.

4) In IAM, click on Groups -> Create New Group. Provide a group name and in policy templates select “Administrator Access”. Go ahead and complete the group creation.

5) Now go back to Users in IAM and select the user we created before. Click on User Actions -> Add User to Groups and select the group we created in the previous step. This user now has administrator privileges.

6) Now, we will use the Python SDK for AWS (boto) to manage AWS from the command line. To try it out, clone a sample project using -

git clone https://github.com/awslabs/aws-python-sample.git

7) Install the SDK -

pip install boto

8) Now create a file “~/.aws/credentials” and put the following content in it, replacing the values with those from the credentials file downloaded in step 3 -

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

9) Now go to the cloned repository directory (from step 6) and run the following -

python s3_sample.py

10) If everything goes fine, the above command creates an S3 bucket and an object, and you will see the results as output. If you get any errors related to credentials, try setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
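
For reference, here is a minimal sketch of the same kind of flow written directly against boto 2; the bucket name is hypothetical (it must be globally unique), and the official s3_sample.py may differ in detail -

import boto

# Uses the credentials from ~/.aws/credentials set up in step 8
conn = boto.connect_s3()
bucket = conn.create_bucket("my-example-bucket-12345")  # hypothetical name, must be globally unique
key = bucket.new_key("hello.txt")
key.set_contents_from_string("Hello from boto!")
print([b.name for b in conn.get_all_buckets()])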

Move ahead with Part 2 of the AWS tutorial.


How to Install Python Package When You Don’t Have Root Access?

Many times we don’t have machines with root access. This mostly happens with machines that are shared between several users, as in a university.


Pip is the easiest way to install Python packages. By default it installs packages to “/usr/local/”, which needs root privileges. You can easily override this default installation directory by using the following -

pip install <package_name> --user

This installs the package in “/home/<user_name>/.local/lib/python2.7/site-packages” (2.7 is the version number and can be different in your case) instead of “/usr/local/”. In most installations this user site-packages directory is already on sys.path, so a plain “import <package_name>” just works; if Python doesn’t pick it up in your environment, tell it where the package is installed by adding the path in your Python script -

import sys
sys.path.append('/home/<user_name>/.local/lib/python2.7/site-packages')
import <package_name>

Python should be able to import the package now.
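
To avoid hard-coding the Python version in that path, you can also ask the site module for the user site-packages directory. A minimal sketch -

import site
import sys

# Resolves to something like /home/<user_name>/.local/lib/python2.7/site-packages
user_site = site.getusersitepackages()
if user_site not in sys.path:
    sys.path.append(user_site)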


Using Unicode Properly while using MySql with Python

I was facing an issue related to character sets while downloading Facebook posts’ data. I was getting the data using the Facebook Graph API and dumping it into a MySQL database.


When I tried to insert a post’s content directly into the database, I was getting -

sql = sql.encode(self.encoding)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position

I tried encoding the content explicitly, but then a few characters were malformed in the database. I had already changed the default character set of the MySQL database to ‘utf-8’. When I tried to use encode/decode, I was getting errors like -

return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1'

To resolve this, I first changed the default encoding (ASCII) in the Python script to UTF-8 as follows -

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Then, I made sure that while connecting to the database I set the parameters ‘use_unicode’ and ‘charset’ properly, as follows -

conn = pymysql.connect(host=xx, user=xx, passwd=xx, db=xx, use_unicode=True, charset='utf8')

After doing this, I could see the special characters properly in the database.
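
With that connection in place, inserting the post content through a parameterized query keeps the driver in charge of the encoding. A minimal sketch using the conn object from above (the table and column names here are hypothetical) -

cur = conn.cursor()
post_content = u"Espa\xf1a \u2713"  # sample text with non-ASCII characters (ñ and a check mark)
# Let the driver escape and encode the value instead of building the SQL string by hand
cur.execute("INSERT INTO posts (content) VALUES (%s)", (post_content,))
conn.commit()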


How to get likes count on posts and comments from Facebook Graph API


Though Facebook mentions in their Graph API documentation that we can get the count of likes on a post as follows -

GET /{object-id}/likes HTTP/1.1

with total_count as a field, it doesn’t work as expected. Instead it returns the list of users (name and id) in a page-wise fashion. So to get the total number of likes, you need to traverse the pages and count the users who liked the post, as follows -

def get_likes_count(id, graph):
    count = 0
    feed = graph.get_object("/" + id + "/likes", limit=99)
    while 'data' in feed and len(feed['data']) > 0:
        count = count + len(feed['data'])
        if 'paging' in feed and 'next' in feed['paging']:
            # Follow the cursor to the next page of likes
            url = feed['paging']['next']
            id = url.split("graph.facebook.com")[1].split("/likes")[0]
            after_val = feed['paging']['cursors']['after']
            feed = graph.get_object(id + "/likes", limit=99, after=after_val)
        else:
            break
    return count

I have used this in a Python script which uses the Facebook Python SDK.
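
For context, here is a minimal sketch of how the graph object passed in above could be constructed with the Facebook Python SDK; the access token and post id are placeholders you need to fill in -

import facebook  # the facebook-sdk package

graph = facebook.GraphAPI("<access_token>")  # hypothetical token
post_id = "<page_id>_<post_id>"              # hypothetical post id
print(get_likes_count(post_id, graph))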
