Sentiment analysis on reviews using NLTK in Python

Question

I have a CSV data file with a 'notes' column containing satisfaction answers in Hebrew.

I would like to use sentiment analysis to assign a score to each word or bigram in the data, and to get a positive/negative probability using logistic regression.

My code so far:

PYTHONIOENCODING="UTF-8"  
df= pd.read_csv('keep.csv', encoding='utf-8' , usecols=['notes'])

txt = df.notes.str.lower().str.replace(r'\|', ' ').str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
tokens=[word.lower() for word in words if word.isalpha()]
bigrm = list(nltk.bigrams(tokens))

word_index = {}
current_index = 0
    for token in tokens:
    if token not in word_index:
        word_index[token] = current_index
        current_index += 1

def tokens_to_vector(tokens, label):
    x = np.zeros(len(word_index) + 1) 
    for t in tokens:
        i = word_index[t]
        x[i] += 1
    x = x / x.sum() 
    x[-1] = label
    return x

N= len(word_index)
data = np.zeros((N, len(word_index) + 1))
i = 0
for token in tokens:
xy = tokens_to_vector(tokens, 1)
data[i,:] = xy
i += 1

This loop isn't working. How can I generate the data and then get positive/negative probabilities for each bigram?

bugo99iot · Accepted Answer · 2019-08-06 13:20:07Z

Is your code snippet pasted correctly? The body of every for loop needs to be indented:

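import nltk
import numpy as np
import pandas as pd

df = pd.read_csv('keep.csv', encoding='utf-8', usecols=['notes'])

# Join all notes into one lowercase string, replacing '|' with spaces
txt = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
tokens = [word.lower() for word in words if word.isalpha()]
bigrm = list(nltk.bigrams(tokens))

# Map each distinct token to a column index
word_index = {}
current_index = 0
for token in tokens:
    if token not in word_index:
        word_index[token] = current_index
        current_index += 1

def tokens_to_vector(tokens, label):
    # Normalised token counts, with the label stored in the last slot
    x = np.zeros(len(word_index) + 1)
    for t in tokens:
        i = word_index[t]
        x[i] += 1
    x = x / x.sum()
    x[-1] = label
    return x

# One row per token, so the loop below cannot run past the end of data
N = len(tokens)
data = np.zeros((N, len(word_index) + 1))
i = 0
for token in tokens:
    # The full token list is passed each time, so every row is identical
    xy = tokens_to_vector(tokens, 1)
    data[i, :] = xy
    i += 1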
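
Indentation aside, the loop above passes the full token list on every iteration, so each row of data ends up as the same vector with the same label. A minimal sketch of building one row per review instead, assuming every review has its own 0/1 label. The `reviews` and `labels` lists are hypothetical placeholders (the question's CSV only has a 'notes' column), and a plain whitespace split stands in for `nltk.tokenize.word_tokenize`:

```python
import numpy as np

# Hypothetical placeholder data: in practice, one entry per row of df.notes
reviews = ["good service good price", "bad service", "good price"]
labels = [1, 0, 1]  # assumed labels: 1 = positive, 0 = negative

# Tokenize each review separately (whitespace split, for illustration only)
tokenized = [review.lower().split() for review in reviews]

# Map each distinct token to a column index
word_index = {}
for tokens in tokenized:
    for token in tokens:
        if token not in word_index:
            word_index[token] = len(word_index)

def tokens_to_vector(tokens, label):
    """Return normalised token counts with the label in the last slot."""
    x = np.zeros(len(word_index) + 1)
    for t in tokens:
        x[word_index[t]] += 1
    total = x.sum()
    if total > 0:  # avoid dividing by zero for an empty review
        x = x / total
    x[-1] = label
    return x

# One row per review, not one row per token
data = np.zeros((len(tokenized), len(word_index) + 1))
for i, (tokens, label) in enumerate(zip(tokenized, labels)):
    data[i, :] = tokens_to_vector(tokens, label)

# Split features from labels for a classifier
X, y = data[:, :-1], data[:, -1]
```

Here `X` holds the features and `y` the labels, which is the shape most classifiers expect. Bigrams could be added the same way by giving each pair from `nltk.bigrams(tokens)` its own column.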

edited Aug 6, 2019 at 13:20

answered Aug 5, 2019 at 13:37

bugo99iot


2 Comments

Lili Over a year ago

Not sure I understand. My loop isn't working, but I ran your answer and it worked. Also, does the "data" array produce the right outcome? How can I get probabilities for positive/negative words?

bugo99iot Over a year ago

Python is whitespace-sensitive: the body of each for loop must be indented consistently, typically by four spaces. Compare the lines below 'for token in tokens:' in your code and mine. Please consider accepting my answer if it resolved the issue.
