Python Decoding with Errors=Replace

Question

Using Python 2.7, I'm grabbing some HTML from a website as strings and immediately decoding it into unicode. Because I need to know later where any decoding errors occurred, I thought it would be best to use errors="replace" to prevent exceptions from non-ASCII characters:

linkname = curlinkname.decode("utf-8", errors="replace")

In most cases, this replaces the problem character with a placeholder. However, when I run the code I am still getting an exception from this line on one particular character (ū):

UnicodeEncodeError: 'charmap' codec can't encode character u'\u016b' in position 1: character maps to <undefined>

What's going on?

Maybe the encoding is not utf-8, check it first you can use this lib for the encoding detection github.com/chardet/chardet — efirvida
– efirvida, Commented Jul 1, 2015 at 17:04

efirvida · Accepted Answer · 2015-07-01 19:23:59Z

1

you need to install the lib first

pip install chardet

then use it

import chardet
code = chardet.detect(curlinkname)
linkname = curlinkname.decode(code['encoding'], errors="replace")

edited Jul 1, 2015 at 19:23

answered Jul 1, 2015 at 19:16

efirvida

4,8855 gold badges50 silver badges76 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Python Decoding with Errors=Replace

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related