
I am very new to Python as well as machine learning. I am trying to work on sentiment analysis of Twitter data, and while working on it I use scikit-learn directly, without any preprocessing in NLTK.

# val is a pandas DataFrame read from the CSV: one column with the tweet text
# ('tweets') and one with the sentiment label ('emo': pos/neg)
tweets = []
for index, row in val.iterrows():
    statement = row['tweets'].strip()          # the tweet text
    tweets.append((statement, row['emo']))     # (tweet, label) pair

Then I used this classifier:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', OneVsRestClassifier(LinearSVC())),
])

import numpy as np

# dividing data into training and testing
keys, values = [], []
np.random.shuffle(tweets)
for key, value in tweets:
    keys.append(key)       # tweet text
    values.append(value)   # sentiment label

size = len(keys) // 2              # integer division; '* 1 / 2' gives a float on Python 3

X_train = np.array(keys[:size])
y_train = np.array(values[:size])

X_test = np.array(keys[size:])     # start at size, not size + 1, so no sample is skipped
y_test = np.array(values[size:])
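(As an aside, a sketch of the same 50/50 split using scikit-learn's train_test_split helper, assuming the keys/values lists built above; on 2014-era versions the import is from sklearn.cross_validation rather than sklearn.model_selection:)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    np.array(keys), np.array(values), test_size=0.5, random_state=0)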


classifier = classifier.fit(X_train, y_train)

Then I ran a k-fold accuracy test:

# 3-fold cross-validation over the held-out half
X_folds = np.array_split(X_test, 3)
y_folds = np.array_split(y_test, 3)

scores = []
for k in range(3):
    X_train = list(X_folds)
    X_test = X_train.pop(k)            # fold k is the validation set
    X_train = np.concatenate(X_train)  # the remaining folds form the training set
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    clsf = classifier.fit(X_train, y_train)
    scores.append(clsf.score(X_test, y_test))

With the above I get accuracies of [0.92494226327944573, 0.91974595842956119, 0.93360277136258663] using k-fold with k = 3.
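(Equivalently, a sketch of the same 3-fold estimate with scikit-learn's built-in helper, assuming the full keys/values lists built above; on older versions import it from sklearn.cross_validation instead:)

from sklearn.model_selection import cross_val_score

# 3-fold cross-validation of the whole pipeline over all of the data
scores = cross_val_score(classifier, np.array(keys), np.array(values), cv=3)
print(scores)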

From what I see in the code of TfidfTransformer, it appears to be a kind of data preprocessing itself. So does that mean that if I work with sklearn, I do not need to preprocess the way it is done in nltk?

My question is:

If I can run the dataset directly through the scikit-learn library without any preprocessing and get quite a good result, in what scenario would I have to use preprocessing (NLTK) before running the data through scikit-learn?

  • The title is slightly at odds with the multitude of questions presented here. Try to narrow it down to a single question. Also, please review your question and fix the code formatting. Commented Dec 18, 2014 at 15:42
  • You can use NLTK to tag your corpus; once it is tagged and arranged, you can do the classification with scikit-learn. I didn't understand your question; help me to help you. Commented Dec 19, 2014 at 1:21
  • Without doing any NLTK tagging of the corpus, I am getting good accuracy directly with scikit-learn, so why exactly do I need to tag the corpus? Commented Dec 19, 2014 at 9:49

2 Answers

5

You will likely find that topic covaries with sentiment (i.e. most articles about Mother Teresa are positive in sentiment, most articles about murder are negative). Your bag-of-words classifier is probably learning topic categories rather than sentiment ones. You can verify this by inspecting the weights on the terms in your classifier: my guess is that the highest-weighted terms are topic-specific.
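As a rough illustration (not part of the original answer), here is one way you might inspect those weights, assuming the fitted pipeline from the question with its 'vectorizer' and 'classifier' step names:

import numpy as np

vectorizer = classifier.named_steps['vectorizer']
svm = classifier.named_steps['classifier'].estimators_[0]   # the underlying LinearSVC

feature_names = np.array(vectorizer.get_feature_names())    # get_feature_names_out() on newer scikit-learn
weights = svm.coef_.ravel()
order = np.argsort(weights)

print("lowest-weighted terms: ", feature_names[order[:20]])
print("highest-weighted terms:", feature_names[order[-20:]])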

Why is this a problem? Because what you've learned won't generalise to topics that aren't in your training set. This will be a big problem on, for instance, Twitter, where topics shift rapidly. Try learning a model like this in month M, then predicting sentiment in tweets from month M + 6. I would imagine it won't work very well!


-1

Tf-idf is a way to find out how significant a word is in a document. To get meaningful results from tf-idf, good preprocessing in terms of stemming, n-grams, etc. is a must, and the NLTK library has good support for it.

The results of tf-idf are only as good as your preprocessing; otherwise it's GIGO (garbage in, garbage out). Since you are doing sentiment analysis, it is sometimes better to replace negations like "didn't" with "did not" in your preprocessing step.
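For illustration, a minimal preprocessing sketch along these lines, using NLTK's Porter stemmer plus a simple negation expansion; the preprocess helper and the negation table are hypothetical, and its output would be fed to the CountVectorizer from the question:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
NEGATIONS = {"didn't": "did not", "don't": "do not",
             "isn't": "is not", "won't": "will not"}

def preprocess(tweet):
    tweet = tweet.lower()
    for contraction, expanded in NEGATIONS.items():
        tweet = tweet.replace(contraction, expanded)     # expand negations
    tokens = word_tokenize(tweet)                        # requires nltk.download('punkt')
    return " ".join(stemmer.stem(token) for token in tokens)

X_train_clean = [preprocess(t) for t in X_train]         # then fit the same Pipeline on these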

