I am converting some code from Python 2 to Python 3.

In Python 2, I can do the following:

>>> c = '\xe5\xb8\x90\xe6\x88\xb7'
>>> print c
帐户
>>> c.decode('utf8')
u'\u5e10\u6237'

How can I get that same output (u'\u5e10\u6237') in Python 3?


Edit

For anyone else with this problem: after looking at the responses, I realized that to make use of the result, each character needs to be treated as an individual element. An escaped Unicode representation like '\u5e10\u6237' is itself a string, so it does not naturally divide into parts that correspond to the original Chinese characters.

>>> c = '帐户'
>>> type(c.encode('unicode-escape').decode('ascii'))
<class 'str'>
>>> [l for l in c.encode('unicode-escape').decode('ascii')]
['\\', 'u', '5', 'e', '1', '0', '\\', 'u', '6', '2', '3', '7']

Unless you want to parse the escaped string again later in your program, you have to escape each character of the input string separately and collect the results in a list. My solution was thus:

>>> [l.encode('unicode-escape').decode('ascii') for l in c]
['\\u5e10', '\\u6237']

An alternative solution converts each character to its hex representation:

>>> [hex(ord(l)) for l in c]
['0x5e10', '0x6237']
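
As a minimal sketch of why the per-character form is convenient (the variable names escaped and restored are only illustrative), each escape in the list can be converted back to its original character on its own, with no need to parse one long escaped string:

```python
c = '帐户'

# One escape string per original character
escaped = [ch.encode('unicode-escape').decode('ascii') for ch in c]
print(escaped)

# Each escape can be decoded back to its character independently
restored = ''.join(e.encode('ascii').decode('unicode-escape') for e in escaped)
print(restored == c)
```

The decode step uses the same 'unicode-escape' codec in the opposite direction, so the round trip should give back the original string.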

Thanks for the help.

2 Answers

This is called "unicode-escape" encoding. Here is an example of how one would achieve this behavior in Python 3:

In [11]: c = b'\xe5\xb8\x90\xe6\x88\xb7'

In [12]: d = c.decode('utf8')

In [13]: print(d)
帐户

In [14]: print(d.encode('unicode-escape').decode('ascii'))
\u5e10\u6237

If you want it as bytes and not str, you can simply get rid of the .decode('ascii').
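
A short sketch of the two forms side by side, written as a plain script rather than an IPython session. The names as_bytes, as_text and quoted are only illustrative. The built-in ascii() is shown as well, since it is probably the closest Python 3 equivalent of the Python 2 repr, minus the u prefix:

```python
c = b'\xe5\xb8\x90\xe6\x88\xb7'
d = c.decode('utf8')

as_bytes = d.encode('unicode-escape')  # bytes object holding the escaped form
as_text = as_bytes.decode('ascii')     # the same 12 ASCII characters as a str
quoted = ascii(d)                      # like repr(), but non-ASCII characters are escaped

print(as_bytes)
print(as_text)
print(quoted)
```

Note that ascii() includes the surrounding quotes in its result, so it is better suited to display and debugging than to further processing.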

Getting exactly the same output as in Python 2 is not possible: Python 3 has no separate unicode type like Python 2 did (its str is already Unicode), so the u'...' representation does not appear. It is, however, possible to get the code point values of the characters.

To do this, you need to do several things:
- Create a bytes object with the value b'\xe5\xb8\x90\xe6\x88\xb7'
- Decode this bytes object into a string
- Get the Unicode code point of each character in the string

The first step is quite easy. To create a bytes object c with the same value as your c, just do:

c = b'\xe5\xb8\x90\xe6\x88\xb7'

Then, to decode it into a string:

c_string = c.decode() # default encoding is utf-8

Finally, I wrote a function that replaces each non-ASCII character in a string with its Unicode escape:

def get_unicode_code(text):
    result = ""
    for char in text:
        ord_value = ord(char)
        if ord_value < 128:
            result += char  # ASCII characters are kept unchanged
        else:
            hex_string = format(ord_value, "x") # turning the int into its hex value
            if len(hex_string) == 2:
                unicode_code = "\\x"+hex_string
            elif len(hex_string) == 3:
                unicode_code = "\\u0"+hex_string
            elif len(hex_string) == 4:
                unicode_code = "\\u"+hex_string
            else:
                # code points above 0xffff need the 8-digit \U form
                unicode_code = "\\U"+hex_string.zfill(8)
            result += unicode_code
    return result

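For the example string, get_unicode_code(c_string) returns the same as c_string.encode('unicode-escape').decode('ascii'), though it is most likely less efficient. The two can differ for ASCII input: the unicode-escape codec also escapes backslashes and control characters such as newlines, which this function leaves unchanged.

It takes a string and returns a string in which each non-ASCII character is replaced by its Unicode escape sequence.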
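
For comparison, here is a small sketch of how the built-in codec chooses the escape width, which is what the length checks in the function imitate. The helper name escape is hypothetical and is used only for this illustration:

```python
def escape(text):
    """Escape text with the built-in unicode-escape codec."""
    return text.encode('unicode-escape').decode('ascii')

print(escape('\xe9'))        # code point 0xe9 (below 0x100): \x plus 2 hex digits
print(escape('帐'))          # code point 0x5e10 (up to 0xffff): \u plus 4 hex digits
print(escape('\U0001f600'))  # code point 0x1f600 (above 0xffff): \U plus 8 hex digits
print(escape('a\\b'))        # an ASCII backslash is doubled by the codec
```

A hand-written replacement would need to cover all three widths, plus the backslash and control-character cases, to match the codec on arbitrary input.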

2 Comments

Personally I'd write that function as def get_unicode_code(text): result = ''.join( char if ord(char) < 128 else '\\u'+format(ord(char), 'x') for char in text )
@JonathanHartley Thank you for correcting my code and making it more pythonic. This function returns the same as Dean's last line, d.encode('unicode-escape').decode('ascii'). I corrected the parentheses error and added some more code to make the function produce the desired result. The format call is there to transform the int into its hex value, which is then used to build the Unicode escape manually.
