I've got a very peculiar encoding problem. I've looked at plenty of questions about this error with no actual answers. I am aware of Unicode issues in Python, so I start every file with:

#  -*- coding: utf-8 -*-

However, I still get UnicodeDecodeError when I run my software. Moreover, the following code works:

#  -*- coding: utf-8 -*-
g = " "
s = "2 000€"
if g in s:
    print s

The error occurs at:

if gap not in tokenString:

The tokenString string contains Unicode characters. The funny thing is that if I print it just before that line, it prints without an error.

What could be the cause of that? I feel like I'm missing something and I don't understand what.

EDIT: gap is of type unicode and tokenString is of type str.

  • Please include the full traceback. Are you printing Unicode data to the Windows console or a Unix terminal? Then see wiki.python.org/moin/PrintFails. Commented Jun 7, 2013 at 23:23
  • What type is gap? What type is tokenString? Commented Jun 8, 2013 at 0:33

1 Answer

You haven't given us enough information to solve your problem for sure, but I can make a guess:

If gap is a str, and tokenString is a unicode, this line:

if gap not in tokenString:

… will try to convert gap to unicode, using the default ASCII codec, to do the search. But if gap has any non-ASCII characters (e.g., because it's a Unicode string encoded into UTF-8), this conversion will fail.

For example:

>>> if 'é' in u'a':
...     print 'Yes'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

You will get the same problem if gap is a unicode and tokenString is a str holding non-ASCII:

>>> if u'a' in 'é':
...     print 'Yes'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

And you'll also get the same problem, or similar ones, with various other mixed-type operator and method calls (e.g., u'a'.find('é')).


The solution is to use the same type on both sides of the in. For example:

>>> if 'é'.decode('utf-8') in u'a':
...     print 'Yes'

No error.


The larger solution is to always use one type or the other everywhere within your code. Of course, you can't do that at the boundaries (e.g., if you're using unicode everywhere, but then you want to write to an 8-bit file), so you need to explicitly call decode and encode there. But even then, you can usually wrap that up (e.g., with codecs.open, or with a custom file-writing function, or whatever), so all of your visible code is Unicode, full stop.
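As a rough sketch of what "wrapping up the boundaries" might look like: the helper names to_text, write_text and read_text below are made up for illustration, and the sample bytes are just the UTF-8 encoding of the "2 000€" string from the question. It uses io.open, which takes an encoding argument on both Python 2.6+ and Python 3, as a stand-in for codecs.open.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return value as a Unicode text object."""
    if isinstance(value, bytes):
        # Byte strings are decoded exactly once, here at the boundary.
        return value.decode(encoding)
    return value


def write_text(path, text, encoding="utf-8"):
    """Hypothetical helper: encode on the way out to an 8-bit file."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)


def read_text(path, encoding="utf-8"):
    """Hypothetical helper: decode on the way in from an 8-bit file."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Bytes as they might arrive from outside the program (UTF-8 for "2 000€").
raw = b"2 000\xe2\x82\xac"
token_string = to_text(raw)
gap = u" "

# Both operands are Unicode text, so no implicit ASCII decode is attempted.
found = gap in token_string

# Round trip through a file: encoding and decoding stay inside the helpers.
path = os.path.join(tempfile.mkdtemp(), "tokens.txt")
write_text(path, token_string)
round_trip = read_text(path)
```

With something like this in place, the rest of the code never calls decode or encode directly; it only ever sees Unicode text.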


Or, of course, you can use Python 3, which will immediately catch you trying to mix byte strings and Unicode strings in an operation like in and raise a TypeError, instead of trying to decode the bytes from ASCII and either misleadingly working or giving you a more confusing error.
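To make the difference concrete, here is a small sketch that runs the same mixed-type test under either version and records which exception comes out. The variable names mirror the question; the byte value is an assumed UTF-8 encoding of "2 000€", not the asker's real data.

```python
gap = u" "                             # Unicode text
token_string = b"2 000\xe2\x82\xac"    # UTF-8 encoded bytes

try:
    gap in token_string
    outcome = "no error"
except TypeError:
    # Python 3: text and bytes cannot be mixed in a containment test.
    outcome = "TypeError"
except UnicodeDecodeError:
    # Python 2: the bytes are implicitly decoded as ASCII, which fails.
    outcome = "UnicodeDecodeError"

# Decoding explicitly makes the test valid on both versions.
fixed = gap in token_string.decode("utf-8")
```

Either way the fix is the same: decode the bytes explicitly before the test, so both operands are text.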

1 Comment

The problem was that gap was loaded using the json module, which created unicode objects, which I didn't know about. I thought I was using the same str type consistently.
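A quick way to see what the comment describes, as a sketch: the JSON document and its "gap" key are invented for illustration, and the byte string is again an assumed UTF-8 encoding of "2 000€". Strings coming out of json.loads are text objects (unicode on Python 2.7, str on Python 3), so the other operand has to be decoded to match.

```python
import json

# Hypothetical configuration; the real source of gap is not shown in the question.
config = json.loads('{"gap": " "}')
gap = config["gap"]

# type(u"") is unicode on Python 2 and str on Python 3.
text_type = type(u"")
is_text = isinstance(gap, text_type)

# A byte string read from elsewhere must be decoded before the test.
token_bytes = b"2 000\xe2\x82\xac"
token_string = token_bytes.decode("utf-8")
has_gap = gap in token_string
```

Checking type(gap) and type(tokenString) right before the failing line is usually the fastest way to spot this kind of mismatch.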
