
When I execute the code below:

import networkx as nx
import numpy as np
from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)

    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)

    similarity_graph = normalized * normalized.T

    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

fp = open("QC")    
txt = fp.read()
sents = textrank(txt)
print sents

I get the following error:

Traceback (most recent call last):
  File "Textrank.py", line 44, in <module>
    sents = textrank(txt)
  File "Textrank.py", line 10, in textrank
    sentences = sentence_tokenizer.tokenize(document)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)

I am executing the code on Ubuntu. To get the text, I referred to this website: https://uwaterloo.ca/institute-for-quantum-computing/quantum-computing-101. I created a file named QC (not QC.txt) and copy-pasted the text into it paragraph by paragraph. Kindly help me resolve the error. Thank you.
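For context (an illustration of the error class, with sample bytes of my own rather than the question's actual file): the byte 0xe2 begins a UTF-8 multi-byte sequence, such as a curly apostrophe, and Python 2's default ASCII codec cannot decode it. A minimal sketch:

```python
# 0xe2 0x80 0x99 is a curly apostrophe (U+2019) encoded as UTF-8.
# The ASCII codec rejects any byte above 0x7f, which is exactly the
# failure the Punkt tokenizer hits when handed undecoded file bytes.
data = b"don\xe2\x80\x99t"

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii decode failed:", exc)

# Decoding with the correct codec succeeds and yields text.
text = data.decode("utf-8")
print(text)
```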

  • Welcome to Stack Overflow! Please look at stackoverflow.com/help/how-to-ask. Also, please search Google or elsewhere for the problem before posting a question. Commented Apr 7, 2017 at 7:03
  • Sorry, I was not able to understand the existing solutions. EDIT: I just rechecked the link and it now makes a bit more sense, though I could not figure out how the solution fits my case. I am new to Python and have dived straight into NLP, so I get overwhelmed easily. Please excuse me for that. Commented Apr 7, 2017 at 7:12

1 Answer


Please check whether the following works for you. It resets Python 2's interpreter-wide default string encoding from ASCII to UTF-8, so the implicit decode inside the Punkt tokenizer no longer fails on non-ASCII bytes.

import networkx as nx
import numpy as np
import sys

# Python 2 only: reload() re-exposes sys.setdefaultencoding, which site.py
# normally deletes at startup. Switching the default from ASCII to UTF-8
# makes implicit str/unicode conversions stop raising UnicodeDecodeError.
# It is a blunt workaround; decoding the file contents explicitly is cleaner.
reload(sys)
sys.setdefaultencoding('utf8')

from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)

    # Bag-of-words counts, then TF-IDF weighting.
    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)

    # Pairwise similarities between sentences (dot products of TF-IDF rows).
    similarity_graph = normalized * normalized.T

    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

fp = open("QC")
txt = fp.read()
fp.close()
sents = textrank(txt.encode('utf-8'))
print sents
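As an alternative sketch (an editor's suggestion, not the answerer's method): decode at the file boundary with io.open instead of changing the interpreter-wide default encoding. This works on both Python 2 and 3; the file name and sample sentence below are illustrative stand-ins for the question's QC file.

```python
import io
import os
import tempfile

# Write a small UTF-8 sample file standing in for the question's "QC" file.
path = os.path.join(tempfile.mkdtemp(), "QC")
sample = u"Qubits aren\u2019t classical bits. They can exist in superposition."
with io.open(path, "w", encoding="utf-8") as fp:
    fp.write(sample)

# io.open with an explicit encoding returns already-decoded text on both
# Python 2 (unicode) and Python 3 (str), so downstream code such as the
# Punkt tokenizer never falls back to the ASCII codec.
with io.open(path, encoding="utf-8") as fp:
    txt = fp.read()

print(txt)
```

Decoding once at the boundary keeps the rest of the pipeline working with text throughout, rather than mixing bytes and text.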

1 Comment

Thank you very much. Once I get the sentences, I print them with for s in sents: st = str(s[1]); print st. When I print sents as-is, a lot of unicode-type things are displayed, but when I convert the elements to strings they go away. Why does this happen?
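A likely explanation for the comment above (a sketch with made-up strings, not the question's data): printing a list calls repr() on each element, and on Python 2 the repr of a unicode string carries the u'...' prefix and escape sequences, while printing an element directly calls str().

```python
# Printing a container shows the repr() of each element; on Python 2 that
# means u'...' prefixes and \xNN escapes for unicode strings. Printing a
# single string uses str(), which renders the characters themselves.
items = [u"caf\xe9", u"don\u2019t"]
print(items)     # element reprs (u'...' style on Python 2)
print(items[0])  # the string itself
```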
