python2 to python3 migration issue with unicode and bytes

Question

I updated a Python2 package to support Python3, and am stuck on a single test case that fails under Python3 because of an encoding issue. The package deals with URL standardization and applies some custom transformations before or after handing off to a few libraries on PyPI.

In Python2 I might have two strings which are both encodings of the same URL as such:

url_a = u'http://➡.ws/♥'
url_b =  'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

for which the following are true:

url_a.encode('utf-8') == url_b
>>> True
type(url_a.encode('utf-8')) == str
>>> True

After passing through various code paths, both are standardized to the same Punycode form:

url_result = 'http://xn--hgi.ws/%E2%99%A5'
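For illustration only, the target form can be reproduced with the Python 3 standard library. This is a minimal sketch, not the package's actual code path: `to_ascii_url` is a hypothetical helper, and it assumes the netloc is a bare hostname with no port or credentials.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Hypothetical helper: IDNA-encode the host and percent-encode the path."""
    parts = urlsplit(url)
    # The built-in 'idna' codec converts each non-ASCII label to Punycode.
    host = parts.netloc.encode('idna').decode('ascii')
    # quote() encodes non-ASCII characters as UTF-8 before percent-escaping.
    path = quote(parts.path)
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(to_ascii_url('http://➡.ws/♥'))  # http://xn--hgi.ws/%E2%99%A5
```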

Under Python3 I am hitting a wall because url_a.encode('utf-8') returns a bytestring, and a literal written in the url_b format would also have to be declared as a bytestring (with a b prefix) to compare equal to it.

url_a.encode('utf-8')
>>> b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
url_a.encode('utf-8') == url_b
>>> False
type(url_a.encode('utf-8')) == str
>>> False
type(url_a.encode('utf-8')) == bytes
>>> True

I can not figure out a way to perform operations on url_b to have it encoded/decoded as I require it to be.

While I could just define my test case with a bytestring declaration, and everything would pass in both environments...

url_a = u'http://➡.ws/♥'
url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

there is still a possibility of something breaking in production because of data in messaging queues or databases that has not been processed yet.

Essentially, in Python3, I need to detect that a short string such as

url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

should have been declared as a bytestring

url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

and convert it properly, because it is interpreted as

url_b
>>> 'http://â\x9e¡.ws/â\x99¥'
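The display above follows from how Python 3 reads the literal: each \xNN escape in a str is one code point in the range U+0000 to U+00FF, not one byte. A short sketch (Python 3 only) that makes this visible:

```python
url_a = 'http://➡.ws/♥'
url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

# url_b holds three separate characters where url_a holds one.
print(len(url_a), len(url_b))  # 13 17

# The host of url_b starts with the code points U+00E2, U+009E, U+00A1.
print([hex(ord(c)) for c in url_b[7:10]])  # ['0xe2', '0x9e', '0xa1']

# Encoding url_b as UTF-8 therefore encodes those characters again,
# producing doubly encoded bytes rather than the original ones.
double_encoded = url_b.encode('utf-8')
print(double_encoded)
```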

edit: The closest I've come is url_b.decode('unicode-escape') which generates b'http://\\xe2\\x9e\\xa1.ws/\\xe2\\x99\\xa5'

jez · Accepted Answer · 2019-05-03 21:59:47Z

2

You want .encode(), not .decode(), and 'raw_unicode_escape':

#!/usr/bin/env python
# -*- coding: utf-8 -*-

url_a = u'http://➡.ws/♥'
url_b =  'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

encoded_a = url_a.encode('utf-8')
try:
    # Python 3
    encoded_b = url_b.encode('raw_unicode_escape')
except UnicodeDecodeError:
    # Python 2
    encoded_b = url_b

print(repr(encoded_a))
print(repr(encoded_b))

# Output is as follows (without the leading 'b' in Python 2):
#   b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
#   b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
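A note on why this works, as a Python 3 sketch: 'raw_unicode_escape' writes every code point below 256 as the byte with the same value, and writes anything higher as a literal \uXXXX escape instead of raising. The 'latin-1' codec (not part of the answer above, shown here only for comparison) produces the same bytes for url_b but rejects url_a.

```python
url_a = 'http://➡.ws/♥'
url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

# Code points below 256 map straight to single bytes.
escaped_b = url_b.encode('raw_unicode_escape')
print(escaped_b)

# Higher code points become backslash-u escapes, so url_a does NOT
# turn into its UTF-8 bytes under this codec.
escaped_a = url_a.encode('raw_unicode_escape')
print(escaped_a)

# latin-1 agrees for url_b but refuses real Unicode text outright.
latin1_b = url_b.encode('latin-1')
try:
    url_a.encode('latin-1')
    latin1_rejects_a = False
except UnicodeEncodeError:
    latin1_rejects_a = True
print(latin1_b == escaped_b, latin1_rejects_a)
```

Because url_a survives the codec in escaped form rather than failing, the try/except in the answer does not distinguish url_a from url_b on Python 3; that check has to be done separately.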

edited May 3, 2019 at 21:59

answered May 3, 2019 at 21:53


2 Comments

Jonathan Vanasco Over a year ago

raw_unicode_escape !!!!!!!!!!!! Thank you! That is exactly what I needed.

Jonathan Vanasco Over a year ago

Thanks. I've nearly got this solved; I can't believe I never saw that codec before. Now I just need to ensure I only operate on url_b and not url_a. So far I'm checking whether input.encode('raw_unicode_escape').decode() is all ASCII chars; if so, I use the original input. If not, I use the encoded/decoded value.
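One way to sketch the check described in this comment, for Python 3 only. It swaps in a 'latin-1' round trip for the raw_unicode_escape test, since latin-1 fails loudly on genuine Unicode text; `repair_mojibake` is a hypothetical name, not something from the package.

```python
def repair_mojibake(text):
    """Hypothetical helper: undo UTF-8 bytes that were stored as a str."""
    try:
        raw = text.encode('latin-1')
    except UnicodeEncodeError:
        # Code points above U+00FF: already real Unicode (the url_a case).
        return text
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 range but not valid UTF-8: leave the input untouched.
        return text


url_a = 'http://➡.ws/♥'
url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

print(repair_mojibake(url_b) == url_a)  # True
print(repair_mojibake(url_a) == url_a)  # True
```

The heuristic is inherently ambiguous: a string that legitimately contains a sequence such as 'Ã©' would be rewritten to 'é', so it is best applied only to data known to come from the old Python 2 path.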

Olvin Roght · Answer · 2019-05-03 21:53:53Z

0

Code:

url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
print(url_b.decode("utf-8"))

Output:

http://➡.ws/♥

answered May 3, 2019 at 21:53
