Using Python 2.7, I'm fetching HTML from a website as byte strings and immediately decoding it to unicode. Because I need to know later where any decoding errors occurred, I thought it would be best to use errors="replace" so that bytes that cannot be decoded do not raise exceptions:

linkname = curlinkname.decode("utf-8", errors="replace")

In most cases, this replaces the problem character with a placeholder. However, when I run the code I am still getting an exception from this line on one particular character (ū):

UnicodeEncodeError: 'charmap' codec can't encode character u'\u016b' in position 1: character maps to <undefined>
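
For reference, here is a minimal sketch of one way this message can arise. It assumes (the full traceback is not shown) that the decoded text is later encoded to a Windows code page such as cp1252, for example when it is printed to a console. The sample bytes and variable names are hypothetical.

```python
# -*- coding: utf-8 -*-
# Hypothetical input: "k", then U+016B (ū) in UTF-8, then one invalid byte.
raw = b"k\xc5\xab\xff"

# Decoding with errors="replace" does not raise; the invalid byte
# becomes the replacement character U+FFFD.
text = raw.decode("utf-8", errors="replace")  # u"k\u016b\ufffd"

# Encoding is a separate step. cp1252 (assumed here as an example of a
# 'charmap' codec) has no mapping for U+016B, so strict encoding raises
# UnicodeEncodeError even though decoding succeeded.
try:
    text.encode("cp1252")
    raised = False
except UnicodeEncodeError:
    raised = True

# Passing errors="replace" to encode() substitutes "?" instead of raising.
safe = text.encode("cp1252", errors="replace")  # b"k??"
```

The point of the sketch is that errors="replace" on decode() only affects decoding; any later encode() call, explicit or implicit, has its own error handling.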

What's going on?

  • Maybe the encoding is not UTF-8. Check it first; you can use this library for encoding detection: github.com/chardet/chardet Commented Jul 1, 2015 at 17:04
  • Can you please share the full traceback? Commented Jul 1, 2015 at 18:11
  • Are you reading from a text file? Commented Jul 1, 2015 at 19:11

1 Answer

You need to install the library first:

pip install chardet

Then use it to detect the encoding before decoding:

import chardet

# detect() returns a dict; its 'encoding' entry names the guessed codec (it can be None)
code = chardet.detect(curlinkname)
linkname = curlinkname.decode(code['encoding'], errors="replace")
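
Since detection can fail, a slightly more defensive variant may be useful. This is a minimal sketch, not part of chardet: the helper name decode_detected and the UTF-8 fallback are assumptions, and the helper takes the dict returned by chardet.detect() as an argument.

```python
def decode_detected(data, detection, fallback="utf-8"):
    """Decode bytes using a chardet-style result dict (hypothetical helper).

    `detection` is expected to look like the dict returned by
    chardet.detect(); its 'encoding' entry may be None when no guess
    could be made, in which case `fallback` is used.
    """
    encoding = detection.get("encoding") or fallback
    try:
        return data.decode(encoding, errors="replace")
    except LookupError:
        # The detected name is not a codec this Python knows about.
        return data.decode(fallback, errors="replace")


# Example with a hand-written detection result standing in for chardet.
sample = decode_detected(b"k\xc5\xab", {"encoding": None})  # u"k\u016b"
```

With chardet installed, the call would be linkname = decode_detected(curlinkname, chardet.detect(curlinkname)).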