Whenever you are doing a classification task, it is a good idea to look at some of the most informative features after training the classifier. This gives you a sanity check that you are on the right track. For example, suppose you have a sentiment classification task and you see features like “shocked”, “unhappy”, and “depressed” among the most informative ones for the positive sentiment class; that is a strong indicator that you have made a blunder somewhere. Apart from this, the most informative features provide insight into the data you are running your classification task on.
Let’s get started and see how to get the most informative features in scikit-learn. Firstly, you need a mapping from feature indices to feature names. If you are using one of the standard vectorizers in scikit-learn (like sklearn.feature_extraction.text.CountVectorizer), getting the feature names is straightforward; just note that the vectorizer must be fitted first -
count_vect = CountVectorizer()
X_vect = count_vect.fit_transform(corpus)  # corpus is your list of documents
feature_names = count_vect.get_feature_names()
Once you have the feature names, look for the “coef_” attribute of the classifier you are using (remember that not all classifiers have this attribute). If you look at LinearSVC‘s documentation, you can see it in the attributes list. It is an array of shape (1, m) if there are two classes, and of shape (n, m) if there are more than two classes, where m is the total number of features and n is the number of classes. Now, if we want to see the top 20 most discriminative features of the third class (row index 2), we just sort the corresponding row and extract the feature names of the top 20 features as follows -
import numpy as np
from sklearn.svm import LinearSVC

# Initialize the classifier
clf_SVM = LinearSVC(C=1, tol=0.001)

# Fit the dataset, where X_vect comes from passing the dataset
# through a vectorizer/transformer like CountVectorizer()
clf = clf_SVM.fit(X_vect, y)

# Sort coef_ by feature weight and take the indices of the 20 largest;
# row index 2 selects the third class
inds = np.argsort(clf.coef_[2, :])[-20:]

# Iterate over these indices (in increasing order of weight) and print
# the corresponding feature names
for i in inds:
    print(feature_names[i])