I updated a Python2 package to support Python3, and am stuck on a single test case that fails under Python3 because of an encoding issue. The package deals with URL standardization and applies some custom transformations before or after handing off to a few libraries on PyPI.
In Python2 I might have two strings which are both encodings of the same URL as such:
url_a = u'http://➡.ws/♥'
url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
for which the following are true:
url_a.encode('utf-8') == url_b
>>> True
type(url_a.encode('utf-8')) == str
>>> True
After passing through various code paths, both are standardized to the same form, with a Punycode host and a percent-encoded path:
url_result = 'http://xn--hgi.ws/%E2%99%A5'
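For context, the target form can be approximated with the standard library alone. This is only a rough sketch under Python3, not what the package actually does: standardize is a hypothetical name, it relies on the built-in idna codec (IDNA 2003) for the host and urllib.parse.quote for the path, and it ignores ports and userinfo.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def standardize(url):
    """Hypothetical helper: Punycode the host, percent-encode the path."""
    parts = urlsplit(url)
    # The built-in 'idna' codec converts each non-ASCII label to xn-- form.
    host = parts.hostname.encode("idna").decode("ascii")
    # quote() encodes the path as UTF-8 and leaves '/' untouched by default.
    path = quote(parts.path)
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


url_result = standardize(u'http://➡.ws/♥')
```

The real package goes through several libraries, so its behaviour on edge cases may well differ from this sketch.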
Under Python3 I am hitting a wall because url_a.encode('utf-8') returns a bytes object, and a literal written with those \x escapes only means the same thing if it is declared with a b prefix.
url_a.encode('utf-8')
>>> b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
url_a.encode('utf-8') == url_b
>>> False
type(url_a.encode('utf-8')) == str
>>> False
type(url_a.encode('utf-8')) == bytes
>>> True
I cannot figure out what operations to perform on url_b to get it encoded/decoded the way I need.
I could just define my test case with a bytestring declaration, and everything would pass in both environments:
url_a = u'http://➡.ws/♥'
url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
However, there is still a possibility of something breaking in production, because messaging queues or databases may hold data that has not been processed yet.
Essentially, in Python3, I need to detect that a short string such as
url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
should have been declared as a bytestring
url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
and convert it properly, because Python3 otherwise treats each \x escape as its own code point and interprets it as
url_b
>>> 'http://â\x9e¡.ws/â\x99¥'
edit: The closest I've come is url_b.encode('unicode-escape'), which generates b'http://\\xe2\\x9e\\xa1.ws/\\xe2\\x99\\xa5'. That has literal backslashes in it rather than the raw bytes I need.
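One approach that might work for the mis-declared url_b is a Latin-1 round trip. Latin-1 maps code points 0-255 one-to-one onto byte values, so encoding the str as Latin-1 should recover the original bytes, which can then be decoded as UTF-8. The sketch below targets Python3; repair_mojibake is a hypothetical name, and the detection is a heuristic: it assumes any string that fits in Latin-1 and is also valid UTF-8 was really meant to be bytes.

```python
def repair_mojibake(value):
    """Hypothetical helper: return text, undoing UTF-8 read as Latin-1."""
    if isinstance(value, bytes):
        # Properly declared bytestrings just need decoding.
        return value.decode("utf-8")
    try:
        # Code points 0-255 map straight back to the original byte values.
        raw = value.encode("latin-1")
    except UnicodeEncodeError:
        # Contains code points above 255, so it is already real text.
        return value
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, e.g. genuine Latin-1 text such as 'café'.
        return value


url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
fixed = repair_mojibake(url_b)
```

Plain ASCII passes through unchanged. The heuristic can misfire on text that legitimately contains a sequence like 'Ã©', so it is probably safest to apply it only at the boundary where legacy queue or database data enters the system.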