
I'm reading a text file that has unicode characters from many different countries. The data in the file is in JSON format.

I'm working on a CentOS machine. When I open the file in a terminal, the unicode characters display just fine (so my terminal is configured for unicode).

When I test my code in Eclipse, it works fine. When I run my code in the terminal, it throws an error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128)
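One thing that may be relevant is what Python itself reports for the encodings involved (a quick diagnostic; the printed values will vary by environment):

import sys
print sys.stdout.encoding       # usually 'UTF-8' in a real terminal, None when piped
print sys.getdefaultencoding()  # 'ascii' on Python 2 unless overridden

Here is the code: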

import json
import sys

delim = "|"  # assumed here; the real delimiter is defined elsewhere in my script

for line in open("data-01083"):
    try:
        tmp = line
        if tmp == "":
            break
        # each record is a 41-character fixed-width prefix followed by JSON
        theData = json.loads(tmp[41:])

        for loc in theData["locList"]:
            outLine = tmp[:40]
            outLine = outLine + delim + theData["names"][0]["name"]
            outLine = outLine + delim + str(theData.get("Flagvalue"))
            outLine = outLine + delim + str(loc.get("myType"))
            flatAdd = ""
            srcAddr = loc.get("Address")
            if srcAddr is not None:
                flatAdd = delim + str(srcAddr.get("houseNumber"))
                flatAdd = flatAdd + delim + str(srcAddr.get("streetName"))
                flatAdd = flatAdd + delim + str(srcAddr.get("postalCode"))
                flatAdd = flatAdd + delim + str(srcAddr.get("CountryCode"))
            else:
                flatAdd = delim + "None" + delim + "None" + delim + "None" + delim + "None"

            outLine = outLine + flatAdd

            sys.stdout.write(("%s\n" % outLine).encode('utf-8'))
    except:
        sys.stdout.write("Error Processing record\n")

So everything works until it gets to streetName, which is where the non-ascii characters start showing up, and that is where it crashes with the UnicodeDecodeError.

I can fix that instance by adding .encode('utf-8'):

 flatAdd = flatAdd + delim + str(srcAddr.get("streetName").encode('utf-8'))

but then it crashes with the UnicodeDecodeError on the next line:

outLine = outLine + flatAdd
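From poking at the interpreter, the same error seems to come from mixing the two string types: json.loads hands back unicode objects, .encode('utf-8') produces byte strings, and concatenating them makes Python 2 implicitly decode the bytes as ascii (0xc3 is the first byte of a two-byte UTF-8 sequence):

>>> u = u"caf"                  # unicode, like the values json.loads returns
>>> b = u"é".encode('utf-8')    # byte str containing 0xc3 0xa9
>>> u + b                       # Python 2 implicitly decodes b as ascii
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)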

I have been stumbling through these types of issues for a month. Any feedback would be greatly appreciated!!

  • How Do I Stop The Pain? (Commented Mar 26, 2013 at 20:25)
  • Robᵩ, thank you!!! I feel like Neo after he sees the bytes. (Commented Mar 28, 2013 at 14:29)

2 Answers


This might fix your problem. I'm saying might because encoding sometimes makes weird stuff happen ;)

#!/usr/bin/python
# -*- coding: utf-8 -*-

# text_file stands for a string you have already read in
text_file_utf8 = text_file.encode('utf8')

From this point on you should be rid of the errors. If not, please give some feedback on what kind of file you have and its language, maybe some of its header data.

text_file.decode("ISO-8859-1") might also be a solution if the file turns out not to be UTF-8.

If all else fails, look into the codecs module here: http://docs.python.org/2/library/codecs.html

import codecs

with codecs.open('your_file.extension', 'r', 'utf8') as indexKey:
    for line in indexKey:  # each line comes back as a unicode object
        pass               # your code here
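Applied to the file from the question, that might look like this (a sketch; the 41-character prefix split is taken from the question's code):

import codecs
import json

with codecs.open('data-01083', 'r', 'utf8') as inData:
    for line in inData:
        # line is already a unicode object; json.loads accepts unicode directly
        theData = json.loads(line[41:])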


The presentation from Robᵩ (http://nedbatchelder.com/text/unipain.html) REALLY helped with my understanding of Unicode. I HIGHLY recommend it to anyone with unicode issues.

My take-aways:

  • Convert everything to unicode as you ingest it into your app.
  • Use only unicode strings inside your code.
  • Specify the encoding as you output the data from your app.

In my case I was reading from stdin and from a file, and writing to stdout:

For stdin:

inData = codecs.getreader('utf-8')(sys.stdin)

For a file:

inData = codecs.open("myFile","r","utf-8")

For stdout (do this once, before writing anything to stdout):

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
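
Putting the three pieces together around the loop from the question (a sketch, not my full script; delim and the trimmed-down field handling are assumptions):

import codecs
import json
import sys

delim = u"|"  # assumed; the real delimiter is defined elsewhere

# Encode once, on the way out.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

# Decode once, on the way in.
for line in codecs.open("data-01083", "r", "utf-8"):
    theData = json.loads(line[41:])
    # Build the record from unicode pieces only: unicode() instead of str(),
    # and no .encode() calls in the middle of the pipeline.
    outLine = line[:40] + delim + theData["names"][0]["name"]
    outLine = outLine + delim + unicode(theData.get("Flagvalue"))
    sys.stdout.write(u"%s\n" % outLine)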
