
This is related to the following questions -

I have a Python app doing the following tasks -

# -*- coding: utf-8 -*-

1. Read unicode text file (non-english) -

import codecs

def readfile(file, access, encoding):
    # decode the file with the given encoding while reading
    with codecs.open(file, access, encoding) as f:
        return f.read()

text = readfile('teststory.txt','r','utf-8-sig')

This returns the contents of the given text file as a Unicode string.
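For reference, a minimal round trip showing that the `utf-8-sig` codec strips the BOM on read (the file name is from the question; the sample word is a placeholder, since the real input file isn't shown):

```python
# -*- coding: utf-8 -*-
import codecs

def readfile(file, access, encoding):
    with codecs.open(file, access, encoding) as f:
        return f.read()

# write a small non-English sample with a BOM, then read it back
with codecs.open('teststory.txt', 'w', 'utf-8-sig') as f:
    f.write(u'\u0915\u093e\u0930')  # placeholder non-ASCII word

text = readfile('teststory.txt', 'r', 'utf-8-sig')
# 'utf-8-sig' strips the leading BOM, so the string starts with the text itself
print(text == u'\u0915\u093e\u0930')
```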

2. Split text into sentences.

3. Go through words in each sentence and identify verbs, nouns etc.

Refer to: Searching for Unicode characters in Python and Find word infront and behind of a Python list
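Steps 2-3 are not shown in the question; a naive sentence splitter (a rough sketch only - the real app may well use a proper tokenizer such as NLTK's) could look like:

```python
import re

text = u'First sentence. Second one? Third!'

# naive split: break after ., ? or ! followed by whitespace
sentences = re.split(r'(?<=[.?!])\s+', text.strip())
print(len(sentences))  # three sentences
```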

4. Add them into separate variables as below

nouns = u'"CAR" | "BUS"'

verbs = u'"DRIVES" | "HITS"'
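(For reference: `nouns` and `verbs` are plain strings, not Python `|` expressions. One way to build them from word lists, quoting each terminal so the grammar parser treats it as a literal token, is:)

```python
def alternation(words):
    # quote each terminal and join with ' | ' so the CFG parser sees literals
    return u' | '.join(u'"%s"' % w for w in words)

nouns = alternation([u'CAR', u'BUS'])
verbs = alternation([u'DRIVES', u'HITS'])
print(nouns)  # "CAR" | "BUS"
```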

5. Now I'm trying to pass them into an NLTK context-free grammar as below -

grammar = nltk.parse_cfg('''
    S -> NP VP
    NP -> N
    VP -> V | NP V

    N -> '''+nouns+'''
    V -> '''+verbs+'''
    ''')

It gives me the following error -

line 40, in V -> '''+verbs+''' UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 114: ordinal not in range(128)

How can I overcome this and pass the variables into the NLTK CFG?
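One likely cause (an assumption, since the complete code is only in the linked archive): the triple-quoted grammar template is a Python 2 byte string, and concatenating it with the Unicode `nouns`/`verbs` forces an implicit ASCII decode, which fails on non-ASCII bytes. Making the template itself Unicode avoids that:

```python
# hypothetical word strings, already decoded from the input file
nouns = u'"CAR" | "BUS"'
verbs = u'"DRIVES" | "HITS"'

# use u'''...''' so no byte string gets mixed with unicode during concatenation
grammar_text = (u'''
    S -> NP VP
    NP -> N
    VP -> V | NP V

    N -> ''' + nouns + u'''
    V -> ''' + verbs + u'''
    ''')

# grammar = nltk.parse_cfg(grammar_text)                   # unicode in, or
# grammar = nltk.parse_cfg(grammar_text.encode('utf-8'))   # utf-8 bytes in
```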

Complete Code - https://dl.dropboxusercontent.com/u/4959382/new.zip

  • Can you show the full traceback of the error? Commented Aug 18, 2013 at 7:38
  • I'm using Pycharm. How can i print full traceback ? print_stack() didn't work. Anyway can figure out issue with given exception ? Commented Aug 19, 2013 at 5:20
  • import logging; try: your-code; except: logging.exception("ouch") # for clarity, use newlines and indentation instead of ; Commented Aug 19, 2013 at 9:49
  • Please also paste the actual code that defines nouns and verbs. See, "CAR" | "BUS" (literally) is not possible in Python; I guess it's some string passed to the parser? Commented Aug 19, 2013 at 9:52
  • @qarma I will attach complete code for your reference. nouns and verbs are variables which holds some unicode text in format of "CAR" | "BUS" Commented Aug 19, 2013 at 10:21

1 Answer


Overall you have these strategies:

  • treat the input as a sequence of bytes; then both the input and the grammar are UTF-8-encoded data (bytes)
  • treat the input as a sequence of Unicode code points; then both the input and the grammar are unicode.
  • escape Unicode code points as ASCII, that is, use escape sequences.

The nltk installed with pip (2.0.4 in my case) doesn't accept unicode terminals directly, but it accepts quoted unicode constants; that is, all of the following appear to work:

In [26]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar')
Out[26]: <Grammar with 2 productions>

In [27]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("utf-8"))
Out[27]: <Grammar with 2 productions>

In [28]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("unicode_escape"))
Out[28]: <Grammar with 2 productions>

Note that I quoted the Unicode terminal ("€") but not the plain-ASCII nonterminal (bar).
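To make the three forms concrete (independent of nltk), here is what each encoding of such a grammar string actually contains:

```python
s = u'S -> "\N{EURO SIGN}" | bar'

print(s.encode('utf-8'))           # the euro sign becomes the bytes \xe2\x82\xac
print(s.encode('unicode_escape'))  # the euro sign becomes the ascii escape \u20ac
```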


1 Comment

Hmm, how to apply the above encoding to my code?

grammar = nltk.parse_cfg('''
    S -> NP VP
    NP -> N | D N | ADJ N | ADJ N P | D N P | D ADJ N P | ADJ N N N N N DET
    VP -> V | NP V | ADV V
    N -> '''+nouns+pronouns+'''
    D -> '''+determiners+'''
    ADJ -> '''+adjectives+'''
    ADV -> '''+adverbs+'''
    P -> '''+prepositions+'''
    V -> '''+verbs+'''
    ''')
