I am very new to Python as well as machine learning. I am trying to work on sentiment analysis of Twitter data, and so far I have used sklearn directly, without any preprocessing in nltk.
# Reading data from a csv with one column of tweet text
# and another with the sentiment label (pos / neg)
tweets = []
for index, row in val.iterrows():
    statement = row['tweets'].strip()       # get the tweet text
    tweets.append((statement, row['emo']))  # append (tweet, label) pair
Then I used this classifier:
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', OneVsRestClassifier(LinearSVC())),
])
# Dividing data into training and testing halves
np.random.shuffle(tweets)
keys, values = [], []
for key, value in tweets:
    keys.append(key)
    values.append(value)
size = len(keys) // 2  # integer division (len(keys) * 1 / 2 is a float in Python 3)
X_train = np.array(keys[:size])
y_train = np.array(values[:size])
X_test = np.array(keys[size:])  # size, not size + 1, or one sample is silently dropped
y_test = np.array(values[size:])
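I believe scikit-learn's train_test_split does this shuffle-and-split in one call (the toy data below just stands in for my tweet list):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the tweet texts and their pos/neg labels
keys = ["good day", "bad day", "great movie", "awful movie"] * 5
values = ["pos", "neg", "pos", "neg"] * 5

# Shuffles and splits 50/50 in one call; random_state makes it repeatable
X_train, X_test, y_train, y_test = train_test_split(
    keys, values, test_size=0.5, random_state=42)
```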
classifier = classifier.fit(X_train, y_train)
K-fold accuracy test:
X_folds = np.array_split(X_test, 3)
y_folds = np.array_split(y_test, 3)
scores = list()
for k in range(3):
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    clsf = classifier.fit(X_train, y_train)
    scores.append(clsf.score(X_test, y_test))
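I believe the same k-fold loop can also be done with scikit-learn's cross_val_score, which refits the pipeline on each training split and scores the held-out fold; a self-contained sketch on toy pos/neg tweets of my own invention:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Toy pos/neg tweets standing in for the real dataset
X = np.array(["good day", "bad day", "great movie", "awful movie",
              "nice food", "terrible food"] * 3)
y = np.array(["pos", "neg", "pos", "neg", "pos", "neg"] * 3)

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', OneVsRestClassifier(LinearSVC())),
])

# cv=3 gives one accuracy score per fold, like the manual loop above
scores = cross_val_score(clf, X, y, cv=3)
```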
With the above I get accuracies of [0.92494226327944573, 0.91974595842956119, 0.93360277136258663] using k-fold with k = 3.
From what I can see in the code of TfidfTransformer, it is itself a kind of data preprocessing. So does that mean that if I work with sklearn, I do not need to preprocess the text the way it is done in nltk?
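To make the question concrete, this is the kind of extra preprocessing I mean, sketched here with plain re instead of nltk (the clean_tweet function and its rules are just my illustration); CountVectorizer's preprocessor argument seems to be where it would plug into the same Pipeline:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def clean_tweet(text):
    # Illustrative cleanup: lowercase, drop @mentions and URLs,
    # then strip everything that is not a letter
    text = text.lower()
    text = re.sub(r"@\w+|https?://\S+", " ", text)
    return re.sub(r"[^a-z\s]", " ", text)

# A custom preprocessor slots into CountVectorizer, so the cleaning
# would live inside the same Pipeline as before
vectorizer = CountVectorizer(preprocessor=clean_tweet)
vectorizer.fit(["Loved it! @some_user http://t.co/x", "Hated IT..."])
```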
My question is:
If I can run the dataset directly through the scikit library without any preprocessing and still get quite a good result, in what scenario would I have to preprocess the data (with nltk) before running it through scikit?