Python 2.7 decode error using UTF-8 header: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3

Question

Traceback:

Traceback (most recent call last):
  File "venues.py", line 22, in <module>
    main()
  File "venues.py", line 19, in main
    print_category(category, 0)
  File "venues.py", line 13, in print_category
    print_category(subcategory, ident+1)
  File "venues.py", line 10, in print_category
    print u'%s: %s' % (category['name'].encode('utf-8'), category['id'])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

Code:

# -*- coding: utf-8 -*-

# Using https://github.com/marcelcaraciolo/foursquare
import foursquare 

# Prints categories and subcategories
def print_category(category, ident):
    for i in range(0,ident):
        print u'\t',
    print u'%s: %s' % (category['name'].encode('utf-8'), category['id'])

    for subcategory in category.get('categories', []):
        print_category(subcategory, ident+1)

def main():
    client = foursquare.Foursquare(client_id='id',
                                   client_secret='secret')
    for category in client.venues.categories()['categories']:
        print_category(category, 0)

if __name__ == '__main__':
    main()

Did you mean decode instead of encode? What kind of string is category['name']?
– Cairnarvon, Commented Aug 18, 2013 at 3:21
It is <type 'unicode'>, removed the encode part. However, the error still occurs when I try to do: python venues.py > categories.txt, but not when output goes to the terminal: python venues.py
– blaze, Commented Aug 18, 2013 at 3:27

Mark Tolonen · Accepted Answer · 2013-08-18 19:55:58Z

The trick is to keep all string processing in the source completely Unicode. Decode to Unicode when reading input (files/pipes/console) and encode when writing output. If category['name'] is Unicode, keep it that way (remove the .encode('utf-8') call).
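As a minimal sketch of that decode-early, encode-late pattern: the sample name and ID below are made up, and format_category is a hypothetical helper, not part of the foursquare library. The bytes are decoded once on the way in, handled as Unicode, and encoded once on the way out. It is written to run on Python 2.7 as well as Python 3:

```python
import io

def format_category(name, category_id, indent=0):
    # Hypothetical helper: works purely on Unicode text.
    return u'\t' * indent + u'%s: %s' % (name, category_id)

# Input boundary: raw UTF-8 bytes (a made-up sample) are decoded once.
raw_name = b'Caf\xc3\xa9'
name = raw_name.decode('utf-8')

# Processing: only Unicode strings are combined here.
line = format_category(name, u'abc123', indent=1)

# Output boundary: encode once when the text leaves the program.
# io.BytesIO stands in for a file or pipe opened in binary mode.
sink = io.BytesIO()
sink.write(line.encode('utf-8') + b'\n')
```

Because no byte string is ever mixed with a Unicode string in the middle step, Python never has to guess an encoding.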

Also, per your comment:

However, the error still occurs when I try to do: python venues.py > categories.txt, but not when output goes to the terminal: python venues.py

Python can usually determine the terminal encoding and will automatically encode to that encoding, which is why writing to the terminal works. If you use shell redirection to output to a file, you need to tell Python the I/O encoding you want via an environment variable, for example:

set PYTHONIOENCODING=utf8
python venues.py > categories.txt
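If setting an environment variable is not convenient, one alternative (not part of the original answer, so treat it as an assumption about what fits your setup) is to choose the encoding in code by wrapping a byte stream in a codecs writer. The sketch below uses io.BytesIO as a stand-in for the redirected output so that it runs unchanged on Python 2.7 and Python 3; in a real Python 2 script the wrapped stream would be sys.stdout:

```python
import codecs
import io

# Stand-in for a redirected stdout; any binary stream works here.
byte_stream = io.BytesIO()

# codecs.getwriter returns a StreamWriter class for the named encoding;
# instantiating it wraps the byte stream so that it accepts Unicode text.
writer = codecs.getwriter('utf-8')(byte_stream)
writer.write(u'Caf\xe9: abc123\n')   # made-up sample line

written = byte_stream.getvalue()
```

The trade-off is that the encoding is then fixed by the script rather than by whoever runs it.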

Working example, using my US Windows console, which uses cp437 encoding. The source code is saved as "UTF-8 without BOM". It's worth pointing out that the source code bytes are UTF-8, but declaring the source encoding and using a Unicode string allows Python to decode the source correctly and to encode the print output automatically to the terminal using its default encoding:

#coding:utf8
import sys
print sys.stdout.encoding
print u'üéâäàåçêëèïîì'

Here Python uses the terminal's encoding, but when output is redirected it does not know what the encoding is, so it defaults to ascii:

C:\>python example.py
cp437
üéâäàåçêëèïîì

C:\>python example.py >out.txt
Traceback (most recent call last):
  File "example.py", line 4, in <module>
    print u'├╝├⌐├ó├ñ├á├Ñ├º├¬├½├¿├»├«├¼'
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-12: ordinal not in range(128)

C:\>type out.txt
None

Since we're using shell redirection, use a shell variable to tell Python what encoding to use:

C:\>set PYTHONIOENCODING=cp437

C:\>python example.py >out.txt

C:\>type out.txt
cp437
üéâäàåçêëèïîì

We can also force Python to use another encoding, but in this case the terminal doesn't know how to display UTF-8. The terminal is still decoding the bytes in the file using cp437:

C:\>set PYTHONIOENCODING=utf8

C:\>python example.py >out.txt

C:\>type out.txt
utf8
├╝├⌐├ó├ñ├á├Ñ├º├¬├½├¿├»├«├¼
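The garbled line is simply the UTF-8 bytes interpreted as cp437. A small sketch reproducing both outcomes for the first three characters of the example string (escape sequences are used so the snippet does not depend on a source encoding declaration):

```python
# u-umlaut, e-acute, a-circumflex: the first three characters of the example.
text = u'\xfc\xe9\xe2'

# Written as cp437 and read back as cp437: the text survives.
round_trip = text.encode('cp437').decode('cp437')

# Written as UTF-8 but read back as cp437: each character becomes two.
mojibake = text.encode('utf-8').decode('cp437')
```

Here mojibake holds six characters, matching the start of the garbled output above, while round_trip equals the original text.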

Armin Rigo · Accepted Answer · 2013-08-18 08:41:01Z

I'm not sure, but I think the culprit is the "u" prefix at the start of u"%s: %s": formatting a unicode string with a byte-string argument makes Python 2 decode that argument with the default ascii codec. This is assuming that what you want to print is a byte string and not a unicode string, which would be reasonable(*): you output bytes, suitably encoded. Modified like this:

print '%s: %s' % (category['name'].encode('utf-8'), category['id'])

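this would turn the unicode string category['name'] into a UTF-8 byte string, and then the rest of the processing is done with byte strings. (This assumes category['id'] is not itself a unicode string; if it is, it needs encoding too, because a single unicode argument promotes the whole result back to unicode.)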

(*) It is reasonable from one point of view; another point of view is to print unicode strings and let the environment decide how they should be encoded, but then you're at the mercy of several factors that you don't really control. That's why you see differences between the output going to the terminal and to a file. To avoid all these issues, just print byte strings.
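For reference, the original error can be reproduced without the foursquare library. In Python 2, interpolating a byte string into a u'...' format string makes the interpreter decode it with the ascii codec behind the scenes; the sketch below spells that decode out explicitly so that it behaves the same on Python 2.7 and Python 3. The sample name is made up:

```python
# UTF-8 bytes such as .encode('utf-8') returns: b'Caf\xc3\xa9'
encoded_name = u'Caf\xe9'.encode('utf-8')

# The decode that Python 2 attempts implicitly for u'%s' % encoded_name.
try:
    encoded_name.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc   # byte 0xc3 at position 3 is outside the ASCII range

# Staying in one type avoids the implicit decode altogether.
as_text = u'%s: %s' % (encoded_name.decode('utf-8'), u'abc123')
as_bytes = encoded_name + b': ' + b'abc123'
```

Either as_text (all Unicode) or as_bytes (all byte strings) is safe; the failure only appears when the two types meet.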
