I'm new to Python 3 and it seems that I can't quite grasp Unicode and character encoding.

I'm working with the output of another tool that returns the content of an HTML page as a bytes object. Other tools we use need this output to stay as bytes, but I'd like to convert it to a string for some parsing and comparison with other strings. For the cases I'm interested in, printing the bytes object shows only characters and no \x or \u escapes. I'm a little confused about how best to do this, and about why the methods that produce the desired output actually work.

I've read elsewhere that .decode() should be used in this context and this does work, but I don't understand why I am decoding an object that is already characters. From what I understand, decoding is intended for binary numbers, for example:

>>> b'\x41'.decode('utf-8')
'A'

In my understanding, all I really want to do is tell Python that an object that's been labeled as a bytes object is actually a str object. Simply using the str() function on the bytes object also accomplishes this goal, but it adds the "b" prefix and puts quotation marks around the string.

Here are the two solutions I'm working with:

>>> str(b'htmltext')
"b'htmltext'"

>>> b'htmltext'.decode('utf-8')
'htmltext'

Essentially, either of these solutions appears to achieve what I'm looking for, but the decode() obviously seems cleaner and, from what I've read, the recommended method. I'm wondering why decode() works, given that, apparently, I'm not converting binary numbers to characters. Furthermore, is there any reason other than the unappealing "b" and quotation marks in the output that str() would not be a valid solution here?

  • Once you understand why Python3 separates strings and binary data into two different types, this will be a lot easier to answer. See eli.thegreenplace.net/2012/01/30/… Commented Jan 4, 2017 at 22:27
  • Everything is binary data. Commented Jan 4, 2017 at 22:34
  • It is so natural to think that every thing in computers has a binary representation, but in Python it's not that way - too bad! In particular, strings are unicode objects with no encoding and an encoding is a map from unicode objects to bytes objects. It's one way to see strings, bytes objects and their relationship, but I don't see what is gained. Commented Apr 18, 2017 at 7:25

1 Answer

Don't confuse the developer-friendly representation of the bytes object with the data that is contained in it. You have binary data either way.

The developer representation makes it easy to see what the object contains by showing any byte that happens to be a printable ASCII codepoint as that ASCII character, rather than as a \xhh escape code. Text encoded as ASCII is simply easier to read that way, and a lot of the world's text happens to be ASCII encoded.

You'll have a harder time when the data is not within the ASCII range, however:
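To see that the characters are only a display convenience, here is a minimal sketch. The name `data` is just an illustrative stand-in for whatever bytes your tool returns:

```python
# `data` is an illustrative name, standing in for the bytes your tool returns.
data = b'htmltext'

# A bytes object is a sequence of integers in the range 0-255.
byte_values = list(data)
print(byte_values)  # [104, 116, 109, 108, 116, 101, 120, 116]

# The literal b'A' and the escape b'\x41' describe the same single byte, 0x41.
same_byte = b'\x41' == b'A'
print(same_byte)  # True

# repr() only chooses how to display those integers: printable ASCII
# bytes are shown as characters, everything else as \xhh escapes.
mixed_repr = repr(bytes([0x41, 0xc3, 0x85]))
print(mixed_repr)  # b'A\xc3\x85'
```

So b'htmltext' and b'\x68\x74\x6d\x6c\x74\x65\x78\x74' are the same object contents; only the spelling differs.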

>>> 'Åæøéï'.encode('utf8')
b'\xc3\x85\xc3\xa6\xc3\xb8\xc3\xa9\xc3\xaf'

That's a UTF-8 byte sequence encoding text with accents. The above may be a little contrived, but most non-English text will include some non-ASCII characters. Even English text can contain em-dashes or fancy quotes, and the b'...' bytes version of that is not nearly as readable as the properly decoded text version:

>>> '“Kragerø” is a town in Norway – in the province of Vestfold'.encode('utf8')
b'\xe2\x80\x9cKrager\xc3\xb8\xe2\x80\x9d is a town in Norway \xe2\x80\x93 in the province of Vestfold'

Note that the b'....' output is the result of using the repr() function on a bytes object; that calls the object's __repr__() method, whose explicit purpose is to produce a developer-friendly string for you. A bytes object has no separate text form for str() to return, so calling str() on a bytes value gives you the same output as repr(). The proper way to convert a bytes value to a string is to decode it, using the correct codec for the data.
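A short sketch of the difference, using a made-up value rather than your real tool output (all variable names here are illustrative):

```python
# Illustrative data: UTF-8 encoded text containing one non-ASCII character.
data = 'Kragerø'.encode('utf8')    # b'Krager\xc3\xb8'

# str() with no encoding just returns the developer representation.
as_repr = str(data)                # same string as repr(data)

# Decoding interprets the bytes as text using the given codec.
decoded = data.decode('utf8')      # 'Kragerø'

# str() with an explicit encoding decodes as well.
also_decoded = str(data, 'utf8')   # 'Kragerø'

# The wrong codec does not necessarily fail; it can silently give wrong text.
wrong = data.decode('latin-1')     # 'KragerÃ¸'

print(as_repr == repr(data), decoded == also_decoded, wrong)
```

The last line is why "the correct codec" matters: for pure ASCII data most codecs agree, but as soon as a non-ASCII byte appears the result depends on which encoding the data was actually produced with.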

Of course, when you have binary data that represents something else, like, say, image data, then keep it as bytes. There is no text to decode there.
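As a rough illustration, the first eight bytes of a PNG file show a few printable ASCII characters in their representation, but they are not UTF-8 text and decoding them fails (the variable names are illustrative):

```python
# The 8-byte PNG file signature; its repr shows 'PNG' but it is not text.
png_signature = b'\x89PNG\r\n\x1a\n'

try:
    png_signature.decode('utf8')
    decodable = True
except UnicodeDecodeError:
    # 0x89 is not a valid first byte of any UTF-8 sequence.
    decodable = False

print(decodable)  # False
```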

1 Comment

This explanation is very helpful. This fills in a gap in my understanding that was haunting me elsewhere as well. Thanks!
