I updated a Python 2 package to support Python 3 and am stuck on a single test case that fails under Python 3 because of an encoding issue. The package standardizes URLs, applying some custom transformations before or after handing off to a few libraries from PyPI.

In Python 2 I might have two strings that both represent the same URL:

url_a = u'http://➡.ws/♥'
url_b =  'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

for which the following are true:

url_a.encode('utf-8') == url_b
>>> True
type(url_a.encode('utf-8')) == str
>>> True

After passing through various code paths, both are standardized to the same Punycode form:

url_result = 'http://xn--hgi.ws/%E2%99%A5'
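For reference, one way a result of this shape can be produced with only the Python 3 standard library is sketched below. This is an illustration, not the package's actual code path: the function name to_ascii_url is hypothetical, and it assumes the network location holds just a hostname (no port or credentials).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Hypothetical helper: Punycode the host, percent-encode the path."""
    parts = urlsplit(url)
    # The built-in 'idna' codec converts each non-ASCII label to its xn-- form.
    host = parts.netloc.encode('idna').decode('ascii')
    # quote() percent-encodes the UTF-8 bytes of any non-ASCII characters.
    path = quote(parts.path)
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(to_ascii_url('http://\u27a1.ws/\u2665'))
```

The input here is the text form (url_a); a string in the url_b form would need to be repaired into text before being passed in.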

Under Python 3 I am hitting a wall because url_a.encode('utf-8') returns a bytestring, and a literal written in this escaped format would likewise have to be declared as a bytestring (with a b prefix) to compare equal.

url_a.encode('utf-8')
>>> b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
url_a.encode('utf-8') == url_b
>>> False
type(url_a.encode('utf-8')) == str
>>> False
type(url_a.encode('utf-8')) == bytes
>>> True

I cannot figure out a way to perform operations on url_b to have it encoded/decoded as I require it to be.

While I could just define my test case with a bytestring declaration and have everything pass in both environments:

url_a = u'http://➡.ws/♥'
url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

there is still a possibility of something breaking in production, because of data in message queues or databases that has not been processed yet.

Essentially, in Python 3, I need to detect that a short string such as

url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

should have been declared as a bytestring

url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

and convert it properly, because it is interpreted as

url_b
>>> 'http://â\x9e¡.ws/â\x99¥'

Edit: The closest I've come is url_b.encode('unicode-escape'), which generates b'http://\\xe2\\x9e\\xa1.ws/\\xe2\\x99\\xa5'.

2 Answers

You want .encode(), not .decode(), and 'raw_unicode_escape':

#!/usr/bin/env python
# -*- coding: utf-8 -*-

url_a = u'http://➡.ws/♥'
url_b =  'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

encoded_a = url_a.encode('utf-8')
try:
    # Python 3
    encoded_b = url_b.encode('raw_unicode_escape')
except UnicodeDecodeError:
    # Python 2
    encoded_b = url_b

print(repr(encoded_a))
print(repr(encoded_b))

# Output is as follows (without the leading 'b' in Python 2):
#   b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
#   b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

2 Comments

raw_unicode_escape !!!!!!!!!!!! Thank you! That is exactly what I needed.
Thanks. I've nearly got this solved; I can't believe I never saw that codec before. Now I just need to ensure I only operate on url_b and not url_a. So far I'm checking whether input.encode('raw_unicode_escape').decode() is all ASCII characters; if so, I use the original input, and if not, I use the encoded/decoded value.
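The check described in that comment might look roughly like the Python 3 sketch below. The function name fix_escaped_utf8 is hypothetical, and the UnicodeDecodeError guard is an addition not mentioned in the comment. It relies on the fact that raw_unicode_escape turns code points below U+0100 into single bytes and anything higher into an ASCII \uXXXX escape.

```python
def fix_escaped_utf8(value):
    """Hypothetical helper: repair a str that holds UTF-8 bytes as code points."""
    # Code points below U+0100 become single bytes; anything higher becomes
    # an ASCII \uXXXX escape, so genuine Unicode text ends up pure ASCII.
    raw = value.encode('raw_unicode_escape')
    try:
        candidate = raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8 (for example plain Latin-1 text): leave it alone.
        return value
    if all(ord(ch) < 128 for ch in candidate):
        # Either ASCII input or escaped real Unicode: keep the original.
        return value
    return candidate


print(fix_escaped_utf8('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5') == 'http://\u27a1.ws/\u2665')
```

A string that mixes both forms would come back containing literal \uXXXX text, so this sketch assumes each input is entirely one form or the other.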

Code:

url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
print(url_b.decode("utf-8"))

Output:

http://➡.ws/♥
