
I'm reading a text file that has unicode characters from many different countries. The data in the file is in JSON format.

I'm working on a CentOS machine. When I open the file in a terminal, the unicode characters display just fine (so my terminal is configured for unicode).

When I test my code in Eclipse, it works fine. When I run my code in the terminal, it throws an error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128)
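One thing that may be relevant is what Python itself reports for the encodings involved (a quick diagnostic; the printed values will vary by environment):

import sys
print sys.stdout.encoding       # usually 'UTF-8' in a real terminal, None when piped
print sys.getdefaultencoding()  # 'ascii' on Python 2 unless overridden

Here is the code: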

import json
import sys

delim = "|"  # assumed here; the real delimiter is defined elsewhere in my script

for line in open("data-01083"):
    try:
        tmp = line
        if tmp == "":
            break
        # each record is a 41-character fixed-width prefix followed by JSON
        theData = json.loads(tmp[41:])

        for loc in theData["locList"]:
            outLine = tmp[:40]
            outLine = outLine + delim + theData["names"][0]["name"]
            outLine = outLine + delim + str(theData.get("Flagvalue"))
            outLine = outLine + delim + str(loc.get("myType"))
            flatAdd = ""
            srcAddr = loc.get("Address")
            if srcAddr is not None:
                flatAdd = delim + str(srcAddr.get("houseNumber"))
                flatAdd = flatAdd + delim + str(srcAddr.get("streetName"))
                flatAdd = flatAdd + delim + str(srcAddr.get("postalCode"))
                flatAdd = flatAdd + delim + str(srcAddr.get("CountryCode"))
            else:
                flatAdd = delim + "None" + delim + "None" + delim + "None" + delim + "None"

            outLine = outLine + flatAdd

            sys.stdout.write(("%s\n" % outLine).encode('utf-8'))
    except:
        sys.stdout.write("Error Processing record\n")

So everything works until it gets to streetName, which is where the non-ascii characters start showing up, and that is where it crashes with the UnicodeDecodeError.

I can fix that instance by adding .encode('utf-8'):

 flatAdd = flatAdd + delim + str(srcAddr.get("streetName").encode('utf-8'))

but then it crashes with the UnicodeDecodeError on the next line:

outLine = outLine + flatAdd
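From poking at the interpreter, the same error seems to come from mixing the two string types: json.loads hands back unicode objects, .encode('utf-8') produces byte strings, and concatenating them makes Python 2 implicitly decode the bytes as ascii (0xc3 is the first byte of a two-byte UTF-8 sequence):

>>> u = u"caf"                  # unicode, like the values json.loads returns
>>> b = u"é".encode('utf-8')    # byte str containing 0xc3 0xa9
>>> u + b                       # Python 2 implicitly decodes b as ascii
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)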

I have been stumbling through these types of issues for a month. Any feedback would be greatly appreciated!!

  • How Do I Stop The Pain? (Commented Mar 26, 2013 at 20:25)
  • Robᵩ, thank you!!! I feel like Neo after he sees the bytes. (Commented Mar 28, 2013 at 14:29)

2 Answers


This might fix your problem. I'm saying might because encoding sometimes makes weird stuff happen ;)

#!/usr/bin/python
# -*- coding: utf-8 -*-

# text_file stands for a string you have already read in
text_file_utf8 = text_file.encode('utf8')

From this point on you should be rid of the errors. If not, please give some feedback on what kind of file you have and its language, maybe some of its header data.

text_file.decode("ISO-8859-1") might also be a solution if the file turns out not to be UTF-8.

If all else fails, look into the codecs module here: http://docs.python.org/2/library/codecs.html

import codecs

with codecs.open('your_file.extension', 'r', 'utf8') as indexKey:
    for line in indexKey:  # each line comes back as a unicode object
        pass               # your code here
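Applied to the file from the question, that might look like this (a sketch; the 41-character prefix split is taken from the question's code):

import codecs
import json

with codecs.open('data-01083', 'r', 'utf8') as inData:
    for line in inData:
        # line is already a unicode object; json.loads accepts unicode directly
        theData = json.loads(line[41:])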


The presentation from Robᵩ (http://nedbatchelder.com/text/unipain.html) REALLY helped with my understanding of Unicode. I HIGHLY recommend it to anyone with unicode issues.

My take-aways:

  • Convert everything to unicode as you ingest it into your app.
  • Use only unicode strings inside your code.
  • Specify the encoding as you output the data from your app.

In my case I was reading from stdin and from a file, and writing to stdout:

For stdin:

inData = codecs.getreader('utf-8')(sys.stdin)

For a file:

inData = codecs.open("myFile","r","utf-8")

For stdout (do this once, before writing anything to stdout):

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
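
Putting the three pieces together around the loop from the question (a sketch, not my full script; delim and the trimmed-down field handling are assumptions):

import codecs
import json
import sys

delim = u"|"  # assumed; the real delimiter is defined elsewhere

# Encode once, on the way out.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

# Decode once, on the way in.
for line in codecs.open("data-01083", "r", "utf-8"):
    theData = json.loads(line[41:])
    # Build the record from unicode pieces only: unicode() instead of str(),
    # and no .encode() calls in the middle of the pipeline.
    outLine = line[:40] + delim + theData["names"][0]["name"]
    outLine = outLine + delim + unicode(theData.get("Flagvalue"))
    sys.stdout.write(u"%s\n" % outLine)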
