UnicodeDecodeError: 'ascii' codec can't decode byte in Python

Question

I've got a very peculiar encoding problem. I've looked at plenty of questions about this error with no actual answers. I am aware of Unicode issues in Python, so I start every file with:

#  -*- coding: utf-8 -*-

However, I still get UnicodeDecodeError when I run my software. Moreover, the following code works:

#  -*- coding: utf-8 -*-
g = " "
s = "2 000€"
if g in s:
    print s

The error occurs at:

if gap not in tokenString:

The tokenString string contains Unicode. The funny thing is that if I try to print it just before that line, it prints without an error.

What could be the cause of that? I feel like I'm missing something and I don't understand what.

EDIT: gap is of type unicode and tokenString is of type str.

Please include the full traceback. Are you printing Unicode data to the Windows console or a Unix terminal? Then see wiki.python.org/moin/PrintFails.
– Martijn Pieters, Commented Jun 7, 2013 at 23:23

abarnert · Accepted Answer · 2013-06-08 00:49:33Z

You haven't given us enough information to solve your problem for sure, but I can make a guess:

If gap is a str, and tokenString is a unicode, this line:

if gap not in tokenString:

… will try to convert gap to unicode to do the search. But if gap has any non-ASCII characters—e.g., because it's a Unicode string encoded into UTF-8—this conversion will fail.

For example:

>>> if 'é' in u'a':
...     print 'Yes'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

You will get the same problem if gap is a unicode and tokenString is a str holding non-ASCII:

>>> if u'a' in 'é':
...     print 'Yes'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

And you'll also get the same problem, or similar ones, with various other mixed-type operator and method calls (e.g., u'a'.find('é')).

The solution is to use the same type on both sides of the in. For example:

>>> if 'é'.decode('utf-8') in u'a':
...     print 'Yes'

No error.
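A minimal sketch of the same idea for the case in the question, assuming the byte strings are UTF-8 encoded. The helper name to_text is hypothetical (it is not part of any library); it decodes byte strings and passes text through unchanged, so both sides of the in end up as the same type. It is written so that it also runs on Python 3.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings if necessary.

    to_text is a hypothetical helper; UTF-8 is an assumed default.
    """
    if isinstance(value, bytes):  # on Python 2, bytes is an alias for str
        return value.decode(encoding)
    return value


gap = u" "                                    # text, e.g. loaded from JSON
token_string = u"2 000\u20ac".encode("utf-8")  # UTF-8 bytes for "2 000€"

# Both operands are text now, so no implicit ASCII decode takes place.
found = to_text(gap) in to_text(token_string)
```

If the bytes are not actually UTF-8, the explicit decode still fails, but it fails at a line you chose and with the encoding you named.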

The larger solution is to always use one type or the other everywhere within your code. Of course, at the boundaries you can't do that (e.g., if you're using unicode everywhere, but then you want to write to an 8-bit file), so you need to explicitly call decode and encode at those boundaries. But even then, you can usually wrap that up (e.g., with codecs.open, or with a custom file-writing function, or whatever), so all of your visible code is Unicode, full stop.
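One way to wrap up the boundary, as a rough sketch: the answer mentions codecs.open, and this example swaps in io.open, which does the same job of encoding and decoding for you and exists in both Python 2.6+ and Python 3. The function names write_text and read_text are hypothetical, and UTF-8 is an assumed file encoding.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    """Encode text on its way out to an 8-bit file (hypothetical helper)."""
    with io.open(path, "w", encoding=encoding) as out:
        out.write(text)


def read_text(path, encoding="utf-8"):
    """Decode the file's bytes on the way in (hypothetical helper)."""
    with io.open(path, "r", encoding=encoding) as src:
        return src.read()


# Round trip through a temporary file; everything outside the two
# helpers only ever sees text.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    write_text(path, u"2 000\u20ac")
    round_tripped = read_text(path)
finally:
    os.remove(path)
```

The rest of the program never calls encode or decode itself, which is the point of pushing the conversion to the edges.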

Or, of course, you can use Python 3, which will immediately catch you trying to mix byte strings and Unicode strings in an operation like in and raise a TypeError, instead of trying to decode the bytes from ASCII and either misleadingly working or giving you a more confusing error.
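A small illustration of that Python 3 behavior, assuming it is run under Python 3 (under Python 2 the same calls would fall back to the implicit ASCII decode). The helper try_membership is hypothetical and only exists to capture the exception instead of stopping the script.

```python
def try_membership(needle, haystack):
    """Return `needle in haystack`, or the exception's name if it raises."""
    try:
        return needle in haystack
    except TypeError as exc:
        return type(exc).__name__


# Mixed types: Python 3 refuses both directions.
bytes_in_text = try_membership(b"a", u"\xe9")
text_in_bytes = try_membership(u"a", b"\xc3\xa9")

# Same type on both sides after an explicit decode: an ordinary bool.
same_type = try_membership(u"a", b"\xc3\xa9".decode("utf-8"))
```

The mismatch is reported for every input, not only for those that happen to contain non-ASCII bytes, so it shows up the first time the line runs.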

edited Jun 8, 2013 at 0:49

answered Jun 8, 2013 at 0:35

abarnert

1 Comment

Aleksandar Savkov Over a year ago

The problem was that gap was loaded using the json module, which created unicode objects, which I didn't know about. I thought I was using the same str type consistently.
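For anyone hitting the same surprise, a short sketch of how the mismatch arises and one way to check for it. The JSON document and its "gap" key are made up for illustration; the type check uses type(u"") so that it means unicode on Python 2.7 and str on Python 3.

```python
import json

# Hypothetical config; json.loads returns text strings, not byte strings.
config = json.loads('{"gap": " "}')
gap = config["gap"]

text_type = type(u"")                  # unicode on Python 2, str on Python 3
gap_is_text = isinstance(gap, text_type)

# A token read elsewhere as raw UTF-8 bytes ("2 000" plus a euro sign).
token_string = b"2 000\xe2\x82\xac"

# Decode the bytes explicitly so both operands of `in` are text.
gap_found = gap in token_string.decode("utf-8")
```

Checking type(gap) and type(tokenString) right before the failing line is usually the quickest way to confirm this kind of mix.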
