39

I'm writing a web crawler in Python, and it involves taking headlines from websites.

One of the headlines should've read: And the Hip's coming, too

But instead it said: And the Hipâ€™s coming, too

What's going wrong here?

1
  • 4
    It would be easier to help you if you included the relevant code, and the particular website you're parsing. Commented Oct 28, 2012 at 16:27

2 Answers

66

It's an encoding error: the page's UTF-8 bytes were decoded as Windows-1252. If the text is a unicode string, this ought to fix it:

text.encode("windows-1252").decode("utf-8")

If it's a plain (byte) string, as in Python 2, you'll need an extra step:

text.decode("utf-8").encode("windows-1252").decode("utf-8")

Both of these will give you a unicode string.
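For Python 3, where every str is already Unicode, the same round trip can be sketched as below. The helper name fix_mojibake is hypothetical and not part of any library. It assumes the text was produced by decoding UTF-8 bytes as Windows-1252, and it returns the input unchanged when that assumption does not hold.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Windows-1252.

    Hypothetical helper for illustration only.
    """
    try:
        # Recover the original bytes, then decode them with the right codec.
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not this particular kind of mojibake; leave it alone.
        return text


# U+00E2, U+20AC, U+2122 are the three characters a mangled U+2019 turns into.
broken = "And the Hip\u00e2\u20ac\u2122s coming, too"
fixed = fix_mojibake(broken)
print(fixed == "And the Hip\u2019s coming, too")
```

A few byte values are undefined in Python's windows-1252 codec, so not every mangled string survives this round trip. Decoding the page correctly in the first place is more reliable than repairing it afterwards.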

By the way - to discover how a piece of text like this has been mangled due to encoding issues, you can use chardet:

>>> import chardet
>>> chardet.detect(u"And the Hipâ€™s coming, too")
{'confidence': 0.5, 'encoding': 'windows-1252'}

2 Comments

Small warning: chardet is LGPL-licensed, so that's a consideration if it's going in something that's distributed to end users.
In Python 3, a str can't be decoded, so the second line of code you posted must be updated.
15

You need to properly decode the source text. Most likely the source text is in UTF-8 format, not ASCII.
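A minimal sketch of what decoding properly could look like in Python 3, assuming the crawler has the raw response bytes in hand. The function name decode_page and its fallback policy are illustrative assumptions, not taken from the question.

```python
def decode_page(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode raw page bytes, trying the declared encoding first.

    Hypothetical helper; the Windows-1252 fallback is an assumption.
    """
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # Invalid bytes or unknown codec name: fall back to Windows-1252,
        # replacing anything that still cannot be decoded.
        return raw.decode("windows-1252", errors="replace")


# Bytes as a UTF-8 page would send them (U+2019 is the curly apostrophe).
raw = "And the Hip\u2019s coming, too".encode("utf-8")

good = decode_page(raw)
bad = raw.decode("windows-1252")  # the mistake described in the question

print(good == "And the Hip\u2019s coming, too")
print(bad == "And the Hip\u00e2\u20ac\u2122s coming, too")
```

Both comparisons print True: the first shows the correct decode, and the second reproduces the three-character garbage from the question.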

Because you do not provide any context or code for your question it is not possible to give a direct answer.

I suggest you study how Unicode and character encodings are handled in Python:

http://docs.python.org/2/howto/unicode.html

1 Comment

Yes, it's UTF-8 treated like Windows 1252: u'\N{RIGHT SINGLE QUOTATION MARK}'.encode('utf-8').decode('cp1252').
