Decoding UTF-8 strings in Python

Question

I'm writing a web crawler in Python, and it involves extracting headlines from websites.

One of the headlines should have read: And the Hip’s coming, too

But instead it said: And the Hipâ€™s coming, too

What's going wrong here?

It would be easier to help you if you included the relevant code and the particular website you're parsing.
– jbowes, Commented Oct 28, 2012 at 16:27

Zero Piraeus · Accepted Answer · 2012-10-28 16:44:02Z

66

It's an encoding error: the text is UTF-8 that has been decoded as Windows-1252. If it's a unicode string, this ought to fix it:

text.encode("windows-1252").decode("utf-8")

If it's a plain (byte) string, you'll need an extra step to decode it first:

text.decode("utf-8").encode("windows-1252").decode("utf-8")

Both of these will give you a unicode string.
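
On Python 3, where every str is already a unicode string, the first form can be wrapped in a small helper. This is only a sketch: fix_mojibake is a hypothetical name, and it assumes the text was UTF-8 that got decoded as Windows-1252. If the round trip fails, it hands back the input unchanged.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252.

    Hypothetical helper for illustration; returns the input unchanged
    when the round trip is not possible.
    """
    try:
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or not mojibake at all): leave it alone.
        return text


# "\u00e2\u20ac\u2122" is the mangled three-character sequence from the question.
mangled = "And the Hip\u00e2\u20ac\u2122s coming, too"
fixed = fix_mojibake(mangled)
print(fixed)
```

The try/except is a design choice rather than a requirement: it lets you run the helper over every headline without crashing on text that was never mangled.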

By the way, to discover how a piece of text like this has been mangled by encoding issues, you can use chardet:

>>> import chardet
>>> chardet.detect(u"And the Hipâ€™s coming, too")
{'confidence': 0.5, 'encoding': 'windows-1252'}

answered Oct 28, 2012 at 16:36


2 Comments

Small warning: chardet is LGPL-licensed, so that's a consideration if it's going in something that's distributed to end users.

In Python 3 a str can't be decoded, so the second line of code you posted must be updated.
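
A rough Python 3 translation of the two cases, for what it's worth: a str has no decode method, so the first snippet applies as-is, and the three-step version only makes sense when you start from bytes. A sketch, assuming raw (a name made up here) holds the already-mangled headline as UTF-8 bytes:

```python
# Assumption: `raw` is a bytes object holding the already-mangled text,
# encoded as UTF-8 (the Python 3 analogue of a Python 2 "plain string").
raw = "And the Hip\u00e2\u20ac\u2122s coming, too".encode("utf-8")

# bytes -> str (still mangled) -> bytes (the original UTF-8) -> str (repaired)
repaired = raw.decode("utf-8").encode("windows-1252").decode("utf-8")
print(repaired)

# Only bytes have .decode() in Python 3; a str does not.
print(hasattr("text", "decode"))
```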

Mikko Ohtamaa · Accepted Answer · 2012-10-28 16:26:34Z

15

You need to properly decode the source text. Most likely the source text is in UTF-8 format, not ASCII.
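
To illustrate what decoding properly means here: the bytes a crawler receives should be decoded with the encoding the page actually uses. A small Python 3 sketch, assuming the page is UTF-8; body is a made-up stand-in for the raw response bytes, not the asker's real data.

```python
# Assumption: `body` stands in for raw bytes fetched from a UTF-8 page.
body = "And the Hip\u2019s coming, too".encode("utf-8")

wrong = body.decode("windows-1252")  # wrong codec: produces the garbled headline
right = body.decode("utf-8")         # the codec the page is assumed to use

print(wrong)
print(right)
```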

Because you do not provide any context or code for your question it is not possible to give a direct answer.

I suggest you study how Unicode and character encodings are handled in Python.

answered Oct 28, 2012 at 16:26


Yes, it's UTF-8 treated like Windows 1252: u'\N{RIGHT SINGLE QUOTATION MARK}'.encode('utf-8').decode('cp1252').
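
The same expression on Python 3, with the resulting characters named so the three-character garble is easy to recognise. Just a sketch using the standard unicodedata module; the variable names are mine.

```python
import unicodedata

# One curly apostrophe becomes three UTF-8 bytes, and each byte is then
# misread as a separate Windows-1252 character.
mangled = "\N{RIGHT SINGLE QUOTATION MARK}".encode("utf-8").decode("cp1252")

names = [unicodedata.name(ch) for ch in mangled]
print(names)
```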