
Linked: https://stackoverflow.com/questions/18154278/is-there-a-maximum-size-for-the-nltk-naive-bayes-classifer

I'm having trouble implementing a scikit-learn machine-learning algorithm in my code. One of the authors of scikit-learn kindly helped me in the question linked above, but I can't quite get it working, and since my original question was about a different matter, I thought it best to open a new one.

This code takes an input of tweets and reads their text and sentiment into a dictionary. It then parses each line of text and adds the text to one list and its sentiment to another (on the advice of the author in the linked question above).

However, despite using the code in the link and looking up the API as best I can, I think I'm missing something. Running the code below first gives me a bunch of sparse-matrix output like this:

  (0, 299)  0.270522159585
  (0, 271)  0.32340892262
  (0, 266)  0.361182814311
  : :
  (48, 123) 0.240644787937

followed by:

['negative', 'positive', 'negative', 'negative', 'positive', 'negative', 'negative', 'negative', etc]

and then:

ValueError: empty vocabulary; perhaps the documents only contain stop words

Am I assigning the classifier in the wrong way? This is my code:

test_file = 'RawTweetDataset/SmallSample.csv'
#test_file = 'RawTweetDataset/Dataset.csv'
sample_tweets = 'SampleTweets/FlumeData2.txt'
csv_file = csv.DictReader(open(test_file, 'rb'), delimiter=',', quotechar='"')

tweetsDict = {}

for line in csv_file:
    tweetsDict[line['SentimentText']] = line['Sentiment']

tweets = []
labels = []
shortenedText = ""
for (text, sentiment) in tweetsDict.items():
    text = HTMLParser.HTMLParser().unescape(text.decode("cp1252", "ignore"))
    exclude = set(string.punctuation)
    for punct in string.punctuation:
        text = text.replace(punct,"")
    cleanedText = [e.lower() for e in text.split() if not e.startswith(('http', '@'))]
    shortenedText = [e.strip() for e in cleanedText if e not in exclude]

    text = ' '.join(ch for ch in shortenedText if ch not in exclude)
    tweets.append(text.encode("utf-8", "ignore"))
    labels.append(sentiment)

vectorizer = TfidfVectorizer(input='content')
X = vectorizer.fit_transform(tweets)
y = labels
classifier = MultinomialNB().fit(X, y)

X_test = vectorizer.fit_transform(sample_tweets)
y_pred = classifier.predict(X_test)

Update: Current code:

all_files = glob.glob (tweet location)
for filename in all_files:
    with open(filename, 'r') as file:
        for line in file.readlines():
            X_test = vectorizer.transform([line])
            y_pred = classifier.predict(X_test)
            print line
            print y_pred

This always produces something like:

happy bday trish
['negative'] << Never changes, always negative
  • This is not related to the question, but maybe you want to store your data in MySQL for later use. Sorry to interrupt. Commented Aug 13, 2013 at 22:56
  • No worries, thanks for the input. The thing is, I'm not planning on doing anything other than getting the sentiment. I've no plans for future analysis or anything like that, this is just a one off project. Commented Aug 13, 2013 at 22:58
  • What do you get when you print tweets? Have you tried creating your own NumPy array for X rather than using vectorizer.fit_transform? Commented Aug 13, 2013 at 23:23
  • If I print tweets I get a list of tweets, in single quotes, separated by commas, with the entire list surrounded by square brackets. I hadn't considered creating my own NumPy array because I honestly don't know how - am usually a Java coder, not Python Commented Aug 13, 2013 at 23:35
  • @PeterFoti: Done. That gives me: ValueError: Found array with dim 50. Expected 1 Commented Aug 14, 2013 at 0:01

2 Answers


The problem is here:

X_test = vectorizer.fit_transform(sample_tweets)

fit_transform is intended to be called on the training set, not the test set. On the test set, call transform.

Also, sample_tweets is a filename. You should open it and read the tweets from it before passing it to a vectorizer. If you do that, then you should finally be able to do something like

for tweet, sentiment in zip(list_of_sample_tweets, y_pred):
    print("Tweet: %s" % tweet)
    print("Sentiment: %s" % sentiment)
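Putting both fixes together — transform instead of fit_transform, and reading the tweets out of the file first — a minimal end-to-end sketch (the tiny training set and file handling here are stand-ins, not the asker's real data):

```python
# Sketch of the corrected flow: fit the vocabulary once on the training
# tweets, then reuse it (via transform) for anything you want to classify.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_tweets = ["happy great day", "sad awful day"]   # stand-in training data
train_labels = ["positive", "negative"]

vectorizer = TfidfVectorizer(input='content')
X_train = vectorizer.fit_transform(train_tweets)      # fit: learns the vocabulary
classifier = MultinomialNB().fit(X_train, train_labels)

# For the sample file, read the lines out first, e.g.:
# with open('SampleTweets/FlumeData2.txt') as f:
#     sample = [line.strip() for line in f if line.strip()]
sample = ["what a great day"]
X_test = vectorizer.transform(sample)                 # transform: same vocabulary,
                                                      # same number of columns
print(classifier.predict(X_test))
```

Calling fit_transform on the test data instead would build a brand-new vocabulary from the test tweets alone, so its columns would no longer line up with what the classifier was trained on.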

Comments

Can I ask another thing? I was working late last night and realised the silly mistake about the filename. I'm now reading the files and, for each line in each file, running the "X_test = " and "y_pred = " lines. As a test, I put a print statement after this to see what was coming out. The tweet is read correctly and no error occurs, but rather than one label, a single line of tweet produces a list of some 31 "negative"s, all in single quotes separated by commas. Why is this?
@AndrewMartin: in Python, a string is iterable. scikit-learn probably starts iterating over the string and tries to classify the individual characters. Try wrapping the tweet in a list, so transform([tweet]) instead of transform(tweet). (This may seem cumbersome, but scikit-learn gets its speed from its batch-oriented API and implementation, so everything you feed it is treated as an iterable, i.e. a batch of samples, when possible.)
That worked perfectly, but there's one final thing I've noticed: everything is coming back negative. Every tweet (I've tried around 30 so far, some of which are definitely positive). My vectorizer.fit_transform(tweets) line produces solely positive output, I notice (not sure if some is meant to be negative?). I've updated the code in my question to show how I'm currently trying to do it.
@AndrewMartin: are you still training on the handful of tweets from your CSV file? If so, try with a larger set, since training a good classifier with a small input set is practically impossible. A few hundred training samples is the recommended minimum for this kind of task.
It certainly seems to be. It's processing tweets A LOT faster than the nltk classifier, for sure.
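The string-iterability point above is easy to demonstrate without scikit-learn at all (the tweet text is taken from the question's output):

```python
# A Python string is itself an iterable of characters, so scikit-learn's
# batch-oriented transform() treats a bare string as many one-character
# "documents" -- and produces one prediction per character.
tweet = "happy bday trish"
print(list(tweet)[:5])   # ['h', 'a', 'p', 'p', 'y'] -- iterating yields characters
print(len(tweet))        # 16 "documents" if the string is passed bare
print(len([tweet]))      # 1 document when wrapped in a list
```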

To do this in TextBlob (as alluded to in the comments), you would do

from textblob import TextBlob  # older TextBlob releases used: from text.blob import TextBlob

tweets = ['This is tweet one, and I am happy.', 'This is tweet two and I am sad']

for tweet in tweets:
    blob = TextBlob(tweet)
    print blob.sentiment  # prints (polarity, subjectivity)

Comments

I'll look into this (and will probably try to use it in addition to what I'm already trying). However, I do want to get what I've got working (if only because I've spent so long on it!)
Although I want to get my original code working, I did try this, and I got this error: AttributeError: 'module' object has no attribute 'compat'
I cloned your repo for the scikit problem, and am trying to figure it out. What version of Python are you using? Did you install TextBlob?
I installed TextBlob and am using Python 2.7.3
I can't quite decipher that error message; we're not calling a 'compat' attribute anywhere in that code. Also, I've been working on the other code. I turned y into a NumPy array as well, and now I'm seeing: Traceback (most recent call last): File "naive-BayesClassifier.py", line 44, in <module> classifier.fit(X, y) File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 308, in fit X = X.astype(np.float) ValueError: could not convert string to float: all time low shall be my motivation for the rest of the week