
This is related to the following questions -

I have a Python app doing the following tasks -

# -*- coding: utf-8 -*-

1. Read unicode text file (non-english) -

import codecs

def readfile(file, access, encoding):
    # decode the file with the given encoding while reading
    with codecs.open(file, access, encoding) as f:
        return f.read()

text = readfile('teststory.txt','r','utf-8-sig')

This returns the contents of the given text file as a Unicode string.
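For reference, a minimal round trip showing that the `utf-8-sig` codec strips the BOM on read (the file name is from the question; the sample word is a placeholder, since the real input file isn't shown):

```python
# -*- coding: utf-8 -*-
import codecs

def readfile(file, access, encoding):
    with codecs.open(file, access, encoding) as f:
        return f.read()

# write a small non-English sample with a BOM, then read it back
with codecs.open('teststory.txt', 'w', 'utf-8-sig') as f:
    f.write(u'\u0915\u093e\u0930')  # placeholder non-ASCII word

text = readfile('teststory.txt', 'r', 'utf-8-sig')
# 'utf-8-sig' strips the leading BOM, so the string starts with the text itself
print(text == u'\u0915\u093e\u0930')
```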

2. Split text into sentences.

3. Go through words in each sentence and identify verbs, nouns etc.

Refer to: Searching for Unicode characters in Python and Find word infront and behind of a Python list
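Steps 2-3 are not shown in the question; a naive sentence splitter (a rough sketch only - the real app may well use a proper tokenizer such as NLTK's) could look like:

```python
import re

text = u'First sentence. Second one? Third!'

# naive split: break after ., ? or ! followed by whitespace
sentences = re.split(r'(?<=[.?!])\s+', text.strip())
print(len(sentences))  # three sentences
```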

4. Add them into separate variables as below

nouns = u'"CAR" | "BUS"'

verbs = u'"DRIVES" | "HITS"'
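(For reference: `nouns` and `verbs` are plain strings, not Python `|` expressions. One way to build them from word lists, quoting each terminal so the grammar parser treats it as a literal token, is:)

```python
def alternation(words):
    # quote each terminal and join with ' | ' so the CFG parser sees literals
    return u' | '.join(u'"%s"' % w for w in words)

nouns = alternation([u'CAR', u'BUS'])
verbs = alternation([u'DRIVES', u'HITS'])
print(nouns)  # "CAR" | "BUS"
```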

5. Now I'm trying to pass them into an NLTK context-free grammar as below -

grammar = nltk.parse_cfg('''
    S -> NP VP
    NP -> N
    VP -> V | NP V

    N -> '''+nouns+'''
    V -> '''+verbs+'''
    ''')

It gives me the following error -

line 40, in V -> '''+verbs+''' UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 114: ordinal not in range(128)

How can I overcome this and pass the variables into the NLTK CFG?
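One likely cause (an assumption, since the complete code is only in the linked archive): the triple-quoted grammar template is a Python 2 byte string, and concatenating it with the Unicode `nouns`/`verbs` forces an implicit ASCII decode, which fails on non-ASCII bytes. Making the template itself Unicode avoids that:

```python
# hypothetical word strings, already decoded from the input file
nouns = u'"CAR" | "BUS"'
verbs = u'"DRIVES" | "HITS"'

# use u'''...''' so no byte string gets mixed with unicode during concatenation
grammar_text = (u'''
    S -> NP VP
    NP -> N
    VP -> V | NP V

    N -> ''' + nouns + u'''
    V -> ''' + verbs + u'''
    ''')

# grammar = nltk.parse_cfg(grammar_text)                   # unicode in, or
# grammar = nltk.parse_cfg(grammar_text.encode('utf-8'))   # utf-8 bytes in
```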

Complete Code - https://dl.dropboxusercontent.com/u/4959382/new.zip

  • Can you show the full traceback of the error? Commented Aug 18, 2013 at 7:38
  • I'm using Pycharm. How can i print full traceback ? print_stack() didn't work. Anyway can figure out issue with given exception ? Commented Aug 19, 2013 at 5:20
  • import logging; try: your-code; except: logging.exception("ouch") # for clarity, use newlines and indentation instead of ; Commented Aug 19, 2013 at 9:49
  • Please also paste the actual code that defines nouns and verbs. See, "CAR" | "BUS" (literally) is not possible in Python; I guess it's some string passed to the parser? Commented Aug 19, 2013 at 9:52
  • @qarma I will attach complete code for your reference. nouns and verbs are variables which holds some unicode text in format of "CAR" | "BUS" Commented Aug 19, 2013 at 10:21

1 Answer


Overall you have these strategies:

  • treat the input as a sequence of bytes; then both the input and the grammar are UTF-8-encoded data (bytes)
  • treat the input as a sequence of Unicode code points; then both the input and the grammar are unicode.
  • escape Unicode code points as ASCII, that is, use escape sequences.

The nltk installed with pip (2.0.4 in my case) doesn't accept unicode terminals directly, but it accepts quoted unicode constants; that is, all of the following appear to work:

In [26]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar')
Out[26]: <Grammar with 2 productions>

In [27]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("utf-8"))
Out[27]: <Grammar with 2 productions>

In [28]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("unicode_escape"))
Out[28]: <Grammar with 2 productions>

Note that I quoted the Unicode terminal ("€") but not the plain-ASCII nonterminal (bar).
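To make the three forms concrete (independent of nltk), here is what each encoding of such a grammar string actually contains:

```python
s = u'S -> "\N{EURO SIGN}" | bar'

print(s.encode('utf-8'))           # the euro sign becomes the bytes \xe2\x82\xac
print(s.encode('unicode_escape'))  # the euro sign becomes the ascii escape \u20ac
```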


1 Comment

Hmm, how to apply the above encoding to my code?

grammar = nltk.parse_cfg('''
    S -> NP VP
    NP -> N | D N | ADJ N | ADJ N P | D N P | D ADJ N P | ADJ N N N N N DET
    VP -> V | NP V | ADV V
    N -> '''+nouns+pronouns+'''
    D -> '''+determiners+'''
    ADJ -> '''+adjectives+'''
    ADV -> '''+adverbs+'''
    P -> '''+prepositions+'''
    V -> '''+verbs+'''
    ''')
