Converting Python 3 bytes object to string when bytes object apparently only contains characters

Question

I'm new to Python 3 and I can't quite grasp Unicode and character encoding.

I'm working with the output of another tool that returns the content of an HTML page as a bytes object. Other tools we use need this output as bytes, but I'd like to convert it to a string for some parsing and for comparison with other strings. For the cases I'm interested in, printing the bytes object shows only characters and no \x or \u escapes. I'm a little confused about how best to do this, and about why the methods that produce the desired output actually work.

I've read elsewhere that .decode() should be used in this context and this does work, but I don't understand why I am decoding an object that is already characters. From what I understand, decoding is intended for binary numbers, for example:

>>> b'\x41'.decode('utf-8')
'A'

In my understanding, all I really want to do is tell Python that an object that's been labeled as a bytes object is actually a str object. Simply using the str() function on the bytes object also accomplishes this goal, but adds the "b" prefix and quotation marks around the string.

Here are the two solutions I'm working with:

>>> str(b'htmltext')
"b'htmltext'"

>>> b'htmltext'.decode('utf-8')
'htmltext'

Essentially, either of these solutions appears to achieve what I'm looking for, but decode() obviously seems cleaner and, from what I've read, is the recommended method. I'm wondering why decode() works, given that, apparently, I'm not converting binary numbers to characters. Furthermore, is there any reason other than the unappealing "b" and quotation marks in the output that str() would not be a valid solution here?

Once you understand why Python 3 separates strings and binary data into two different types, this will be a lot easier to answer. See eli.thegreenplace.net/2012/01/30/…
– turbulencetoo, Commented Jan 4, 2017 at 22:27
It is so natural to think that everything in computers has a binary representation, but in Python it's not that way - too bad! In particular, strings are unicode objects with no encoding, and an encoding is a map from unicode objects to bytes objects. It's one way to see strings, bytes objects and their relationship, but I don't see what is gained.
– user6627712, Commented Apr 18, 2017 at 7:25

Martijn Pieters · Accepted Answer · 2017-01-04 22:34:37Z

Don't confuse the developer-friendly representation of the bytes object with the data that is contained in it. You have binary data either way.

The developer representation makes it easy for you to see what is contained by showing any byte that happens to be a printable ASCII codepoint as that ASCII character, rather than as a \xhh escape code. It's just easier to read text encoded as ASCII that way, and a lot of the world's text happens to be ASCII encoded.
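
As a minimal sketch of that point (the variable names here are only illustrative), iterating over a bytes object yields integers, and the same bytes written entirely as \xhh escapes are identical data; only the representation differs:

```python
data = b'htmltext'

# Iterating over (or indexing) a bytes object yields integers, not characters.
byte_values = list(data)
print(byte_values)        # [104, 116, 109, 108, 116, 101, 120, 116]

# The same bytes spelled with \xhh escapes compare equal;
# the repr simply chooses to display them as ASCII letters.
same_data = b'\x68\x74\x6d\x6c\x74\x65\x78\x74'
print(same_data == data)  # True
print(repr(same_data))    # b'htmltext'
```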

You'll have a harder time when the data is not within the ASCII range however:

>>> 'Åæøéï'.encode('utf8')
b'\xc3\x85\xc3\xa6\xc3\xb8\xc3\xa9\xc3\xaf'

That's a UTF-8 byte sequence encoding text with accents. The above may be a little bit contrived, but most non-English text will include some non-ASCII characters. Even English text can contain em-dashes or fancy quotes, and the b'...' bytes version of that is not nearly as readable as the properly decoded text version:

>>> '“Kragerø” is a town in Norway – in the province of Vestfold'.encode('utf8')
b'\xe2\x80\x9cKrager\xc3\xb8\xe2\x80\x9d is a town in Norway \xe2\x80\x93 in the province of Vestfold'
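
Going the other way, a short sketch (using a shortened version of the sentence above, with illustrative variable names) suggests that decoding with the same codec restores the readable text:

```python
text = '“Kragerø” is a town in Norway'
encoded = text.encode('utf8')

# Decoding with the codec that produced the bytes restores the original text.
decoded = encoded.decode('utf8')
print(decoded == text)    # True

# There are more bytes than characters, because each non-ASCII
# character takes more than one byte in UTF-8.
print(len(encoded) > len(text))   # True
```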

Note that the b'....' output is the result of using the repr() function on a bytes object; that calls the object's __repr__() method, which has the explicit function of producing a developer-friendly string for you. A bytes object has no separate text conversion for str() to use, so you get the __repr__ output instead, even when you use the str() function. The proper way to convert a bytes value to a string is to decode it (using the correct codec for the data).
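
A small sketch of the difference, assuming the data really is UTF-8 (the names `data` and `raw` are made up for the example):

```python
data = b'htmltext'

# Without an encoding, str() gives the same text as repr(),
# including the b prefix and the quotes.
print(str(data) == repr(data))                      # True

# With an encoding, str() decodes, just like bytes.decode().
print(str(data, 'utf-8') == data.decode('utf-8'))   # True

# The codec matters as soon as the data goes beyond ASCII.
raw = 'é'.encode('utf-8')       # b'\xc3\xa9'
right = raw.decode('utf-8')
wrong = raw.decode('latin-1')
print(right)                    # é
print(wrong)                    # Ã©
```

Decoding with the wrong codec does not necessarily raise an error; it can silently produce the wrong characters, which is why knowing the encoding of the source data matters.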

Of course, when you have binary data that represents something else, like, say, image data, then keep it as bytes. There is no text to decode there.
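
For illustration, here is a sketch using the 8-byte PNG file signature as a stand-in for image data. Its repr shows the letters PNG, but the bytes are not text and do not decode as UTF-8:

```python
# The first eight bytes of a PNG file (the PNG signature).
png_signature = b'\x89PNG\r\n\x1a\n'

try:
    png_signature.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    # 0x89 cannot start a UTF-8 sequence, so decoding fails.
    decoded_ok = False

print(decoded_ok)   # False
```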

This explanation is very helpful. This fills in a gap in my understanding that was haunting me elsewhere as well. Thanks!
