
Linked: https://stackoverflow.com/questions/18154278/is-there-a-maximum-size-for-the-nltk-naive-bayes-classifer

I'm having trouble implementing a scikit-learn machine-learning algorithm in my code. One of the authors of scikit-learn kindly helped me in the question linked above, but I can't quite get it working, and since my original question was about a different matter, I thought it best to open a new one.

This code takes an input of tweets and reads their text and sentiment into a dictionary. It then parses each line of text and adds the text to one list and its sentiment to another (on the advice of the author in the linked question above).

However, despite using the code in the link and looking up the API as best I can, I think I'm missing something. Running the code below first gives me a bunch of sparse-matrix output like this:

  (0, 299)  0.270522159585
  (0, 271)  0.32340892262
  (0, 266)  0.361182814311
  : :
  (48, 123) 0.240644787937

followed by:

['negative', 'positive', 'negative', 'negative', 'positive', 'negative', 'negative', 'negative', etc]

and then:

ValueError: empty vocabulary; perhaps the documents only contain stop words

Am I assigning the classifier in the wrong way? This is my code:

test_file = 'RawTweetDataset/SmallSample.csv'
#test_file = 'RawTweetDataset/Dataset.csv'
sample_tweets = 'SampleTweets/FlumeData2.txt'
csv_file = csv.DictReader(open(test_file, 'rb'), delimiter=',', quotechar='"')

tweetsDict = {}

for line in csv_file:
    tweetsDict[line['SentimentText']] = line['Sentiment']

tweets = []
labels = []
shortenedText = ""
for (text, sentiment) in tweetsDict.items():
    text = HTMLParser.HTMLParser().unescape(text.decode("cp1252", "ignore"))
    exclude = set(string.punctuation)
    for punct in string.punctuation:
        text = text.replace(punct,"")
    cleanedText = [e.lower() for e in text.split() if not e.startswith(('http', '@'))]
    shortenedText = [e.strip() for e in cleanedText if e not in exclude]

    text = ' '.join(ch for ch in shortenedText if ch not in exclude)
    tweets.append(text.encode("utf-8", "ignore"))
    labels.append(sentiment)

vectorizer = TfidfVectorizer(input='content')
X = vectorizer.fit_transform(tweets)
y = labels
classifier = MultinomialNB().fit(X, y)

X_test = vectorizer.fit_transform(sample_tweets)
y_pred = classifier.predict(X_test)

Update: Current code:

all_files = glob.glob (tweet location)
for filename in all_files:
    with open(filename, 'r') as file:
        for line in file.readlines():
            X_test = vectorizer.transform([line])
            y_pred = classifier.predict(X_test)
            print line
            print y_pred

This always produces something like:

happy bday trish
['negative'] << Never changes, always negative
  • This is not related to the question, but maybe you want to store your data in MySQL for later use. Sorry to interrupt. Commented Aug 13, 2013 at 22:56
  • No worries, thanks for the input. The thing is, I'm not planning on doing anything other than getting the sentiment. I've no plans for future analysis or anything like that, this is just a one off project. Commented Aug 13, 2013 at 22:58
  • What do you get when you print tweets? Have you tried creating your own NumPy array for X rather than using vectorizer.fit_transform? Commented Aug 13, 2013 at 23:23
  • If I print tweets I get a list of tweets, in single quotes, separated by commas, with the entire list surrounded by square brackets. I hadn't considered creating my own NumPy array because I honestly don't know how - am usually a Java coder, not Python Commented Aug 13, 2013 at 23:35
  • @PeterFoti: Done. That gives me: ValueError: Found array with dim 50. Expected 1 Commented Aug 14, 2013 at 0:01

2 Answers


The problem is here:

X_test = vectorizer.fit_transform(sample_tweets)

fit_transform is intended to be called on the training set, not the test set. On the test set, call transform.

Also, sample_tweets is a filename. You should open it and read the tweets from it before passing it to a vectorizer. If you do that, then you should finally be able to do something like

for tweet, sentiment in zip(list_of_sample_tweets, y_pred):
    print("Tweet: %s" % tweet)
    print("Sentiment: %s" % sentiment)
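Putting both fixes together — transform instead of fit_transform, and reading the tweets out of the file first — a minimal end-to-end sketch (the tiny training set and file handling here are stand-ins, not the asker's real data):

```python
# Sketch of the corrected flow: fit the vocabulary once on the training
# tweets, then reuse it (via transform) for anything you want to classify.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_tweets = ["happy great day", "sad awful day"]   # stand-in training data
train_labels = ["positive", "negative"]

vectorizer = TfidfVectorizer(input='content')
X_train = vectorizer.fit_transform(train_tweets)      # fit: learns the vocabulary
classifier = MultinomialNB().fit(X_train, train_labels)

# For the sample file, read the lines out first, e.g.:
# with open('SampleTweets/FlumeData2.txt') as f:
#     sample = [line.strip() for line in f if line.strip()]
sample = ["what a great day"]
X_test = vectorizer.transform(sample)                 # transform: same vocabulary,
                                                      # same number of columns
print(classifier.predict(X_test))
```

Calling fit_transform on the test data instead would build a brand-new vocabulary from the test tweets alone, so its columns would no longer line up with what the classifier was trained on.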

Comments

Can I ask another thing? I was working late last night and realised the silly mistake about the filename. I'm now reading the files and, for each line in each file, running the "X_test = " and "y_pred = " lines. As a test, I put a print statement after this to see what was coming out. The tweet is read correctly and no error occurs, but rather than one label, a single line of tweet produces a list of some 31 "negative"s, all in single quotes separated by commas. Why is this?
@AndrewMartin: in Python, a string is iterable. scikit-learn probably starts iterating over the string and tries to classify the individual characters. Try wrapping the tweet in a list, so transform([tweet]) instead of transform(tweet). (This may seem cumbersome, but scikit-learn gets its speed from its batch-oriented API and implementation, so everything you feed it is treated as an iterable, i.e. a batch of samples, when possible.)
That worked perfectly, but there's one final thing I've noticed: everything is coming back negative. Every tweet (I've tried around 30 so far, some of which are definitely positive). My vectorizer.fit_transform(tweets) line produces solely positive output, I notice (not sure if some is meant to be negative?). I've updated the code in my question to show how I'm currently trying to do it.
@AndrewMartin: are you still training on the handful of tweets from your CSV file? If so, try with a larger set, since training a good classifier with a small input set is practically impossible. A few hundred training samples is the recommended minimum for this kind of task.
It certainly seems to be. It's processing tweets A LOT faster than the nltk classifier, for sure.
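The string-iterability point above is easy to demonstrate without scikit-learn at all (the tweet text is taken from the question's output):

```python
# A Python string is itself an iterable of characters, so scikit-learn's
# batch-oriented transform() treats a bare string as many one-character
# "documents" -- and produces one prediction per character.
tweet = "happy bday trish"
print(list(tweet)[:5])   # ['h', 'a', 'p', 'p', 'y'] -- iterating yields characters
print(len(tweet))        # 16 "documents" if the string is passed bare
print(len([tweet]))      # 1 document when wrapped in a list
```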

To do this in TextBlob (as alluded to in the comments), you would do

from textblob import TextBlob  # older TextBlob releases used: from text.blob import TextBlob

tweets = ['This is tweet one, and I am happy.', 'This is tweet two and I am sad']

for tweet in tweets:
    blob = TextBlob(tweet)
    print blob.sentiment  # prints (polarity, subjectivity)

Comments

I'll look into this (and will probably try to use it in addition to what I'm already trying). However, I do want to get what I've got working (if only because I've spent so long on it!)
Although I want to get my original code working, I did try this, and I got this error: AttributeError: 'module' object has no attribute 'compat'
I cloned your repo for the scikit problem, and am trying to figure it out. What version of Python are you using? Did you install TextBlob?
I installed TextBlob and am using Python 2.7.3
I can't quite decipher that error message; we're not calling a 'compat' attribute anywhere in that code. Also, I've been working on the other code. I turned y into a NumPy array as well, and now I'm seeing: Traceback (most recent call last): File "naive-BayesClassifier.py", line 44, in <module> classifier.fit(X, y) File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 308, in fit X = X.astype(np.float) ValueError: could not convert string to float: all time low shall be my motivation for the rest of the week