I am very new to Python as well as machine learning. I am trying to work on sentiment analysis of Twitter data, and so far I have used sklearn directly, without any preprocessing in nltk.
# Reading data from a csv with one column of tweet text
# and another with the sentiment label (pos / neg)
tweets = []
for index, row in val.iterrows():
    statement = row['tweets'].strip()       # get the tweet text
    tweets.append((statement, row['emo']))  # append (tweet, label) pair
Then I used this classifier:
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', OneVsRestClassifier(LinearSVC())),
])
# Dividing data into training and testing halves
np.random.shuffle(tweets)
keys, values = [], []
for key, value in tweets:
    keys.append(key)
    values.append(value)
size = len(keys) // 2  # integer division (len(keys) * 1 / 2 is a float in Python 3)
X_train = np.array(keys[:size])
y_train = np.array(values[:size])
X_test = np.array(keys[size:])  # size, not size + 1, or one sample is silently dropped
y_test = np.array(values[size:])
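I believe scikit-learn's train_test_split does this shuffle-and-split in one call (the toy data below just stands in for my tweet list):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the tweet texts and their pos/neg labels
keys = ["good day", "bad day", "great movie", "awful movie"] * 5
values = ["pos", "neg", "pos", "neg"] * 5

# Shuffles and splits 50/50 in one call; random_state makes it repeatable
X_train, X_test, y_train, y_test = train_test_split(
    keys, values, test_size=0.5, random_state=42)
```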
classifier = classifier.fit(X_train, y_train)
K-fold accuracy test:
X_folds = np.array_split(X_test, 3)
y_folds = np.array_split(y_test, 3)
scores = list()
for k in range(3):
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    clsf = classifier.fit(X_train, y_train)
    scores.append(clsf.score(X_test, y_test))
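I believe the same k-fold loop can also be done with scikit-learn's cross_val_score, which refits the pipeline on each training split and scores the held-out fold; a self-contained sketch on toy pos/neg tweets of my own invention:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Toy pos/neg tweets standing in for the real dataset
X = np.array(["good day", "bad day", "great movie", "awful movie",
              "nice food", "terrible food"] * 3)
y = np.array(["pos", "neg", "pos", "neg", "pos", "neg"] * 3)

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', OneVsRestClassifier(LinearSVC())),
])

# cv=3 gives one accuracy score per fold, like the manual loop above
scores = cross_val_score(clf, X, y, cv=3)
```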
With the above I get accuracies of [0.92494226327944573, 0.91974595842956119, 0.93360277136258663] using k-fold with k = 3.
From what I can see in the code of TfidfTransformer, it is itself a kind of data preprocessing. So does that mean that if I work with sklearn, I do not need to preprocess the text the way it is done in nltk?
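To make the question concrete, this is the kind of extra preprocessing I mean, sketched here with plain re instead of nltk (the clean_tweet function and its rules are just my illustration); CountVectorizer's preprocessor argument seems to be where it would plug into the same Pipeline:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def clean_tweet(text):
    # Illustrative cleanup: lowercase, drop @mentions and URLs,
    # then strip everything that is not a letter
    text = text.lower()
    text = re.sub(r"@\w+|https?://\S+", " ", text)
    return re.sub(r"[^a-z\s]", " ", text)

# A custom preprocessor slots into CountVectorizer, so the cleaning
# would live inside the same Pipeline as before
vectorizer = CountVectorizer(preprocessor=clean_tweet)
vectorizer.fit(["Loved it! @some_user http://t.co/x", "Hated IT..."])
```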
My question is:
If I can run the dataset directly through the scikit library without any preprocessing and still get quite a good result, in what scenario would I have to preprocess the data (with nltk) before running it through scikit?