Converting string.decode('utf8') from python2 to python3

Question

I am converting some code from python2 to python3.

In python2, I can do the following things:

>>> c = '\xe5\xb8\x90\xe6\x88\xb7'
>>> print c
帐户
>>> c.decode('utf8')
u'\u5e10\u6237'

How can I get that same output (u'\u5e10\u6237') in python3?

Edit

For anyone else with this problem: I realized after looking at the responses that, to make use of the result, each character needs to be treated as an individual element. An escaped Unicode representation like '\u5e10\u6237' is an ordinary string, so it does not naturally divide into parts that correspond to the original Chinese characters.

>>> c = '帐户'
>>> type(c.encode('unicode-escape').decode('ascii'))
<class 'str'>
>>> [l for l in c.encode('unicode-escape').decode('ascii')]
['\\', 'u', '5', 'e', '1', '0', '\\', 'u', '6', '2', '3', '7']

Unless you want to parse the escaped string again later in your program, you have to escape each character of the input string separately and collect the results in a list. My solution was:

>>> [l.encode('unicode-escape').decode('ascii') for l in c]
['\\u5e10', '\\u6237']
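If you do need to parse the escaped string again later, the round trip can be sketched as follows. This is a minimal illustration, assuming the escaped text contains only ASCII (which is what the unicode-escape codec produces); the variable names are made up for the example.

```python
c = '帐户'

# Escape the whole string at once: the result is plain ASCII text.
escaped = c.encode('unicode-escape').decode('ascii')

# Parse it back: encode to ASCII bytes, then decode the escapes.
restored = escaped.encode('ascii').decode('unicode-escape')

print(escaped)
print(restored == c)
```

The escaped form loses nothing, but it has to go through the codec again before the characters can be handled individually.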

An alternative solution is to turn each character into its hex representation:

>>> [hex(ord(l)) for l in c]
['0x5e10', '0x6237']
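A small sketch of the same idea going both ways, assuming you want to keep the code points as integers and only format them for display. The names and the "U+XXXX" label style are illustrative, not part of the original code.

```python
c = '帐户'

# ord() gives the integer code point of each character.
code_points = [ord(ch) for ch in c]

# Two ways to display them: Python hex literals and U+XXXX labels.
hex_strings = [hex(n) for n in code_points]
labels = ['U+{:04X}'.format(n) for n in code_points]

# chr() is the inverse of ord(), so the hex strings can be turned back into text.
restored = ''.join(chr(int(h, 16)) for h in hex_strings)

print(hex_strings, labels, restored)
```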

Thanks for the help.

Dean Fenster · Accepted Answer · 2016-07-12 15:42:01Z

4

This is called "unicode-escape" encoding. Here is an example of how one would achieve this behavior in python3:

In [11]: c = b'\xe5\xb8\x90\xe6\x88\xb7'

In [12]: d = c.decode('utf8')

In [13]: print(d)
帐户

In [14]: print(d.encode('unicode-escape').decode('ascii'))
\u5e10\u6237

If you want it as bytes and not str, you can simply get rid of the .decode('ascii').
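To make the difference concrete, here is a short sketch comparing the two results. It also shows the built-in ascii(), which is probably the closest python3 equivalent of the python2 repr, quotes included. The variable names are illustrative.

```python
c = b'\xe5\xb8\x90\xe6\x88\xb7'
d = c.decode('utf8')

# Without the final .decode('ascii') the result is a bytes object.
escaped_bytes = d.encode('unicode-escape')

# With it, the result is a str containing the same ASCII characters.
escaped_str = escaped_bytes.decode('ascii')

print(type(escaped_bytes).__name__, escaped_bytes)
print(type(escaped_str).__name__, escaped_str)

# ascii() returns a repr-style str with non-ASCII characters escaped.
print(ascii(d))
```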

edited Jul 12, 2016 at 15:42

answered Jul 12, 2016 at 15:36

Dean Fenster


HolyDanna · Accepted Answer · 2016-07-13 09:06:02Z

1

Returning the same unicode object as in python2 is not possible: python3 has no separate unicode type, because str is already a Unicode string. It is still possible to get the escaped representation of its value.

To do this, you need to do several things:
- Create a bytes object with the value b'\xe5\xb8\x90\xe6\x88\xb7'
- Decode this bytes object into a string
- Get the Unicode escape codes from the string

The first step is quite easy. To create a bytes object c with the same value as your c, just do:

c = b'\xe5\xb8\x90\xe6\x88\xb7'

Then, to decode the bytes into a string:

c_string = c.decode() # default encoding is utf-8
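As a quick check of this step, the sketch below confirms that the default matches an explicit 'utf-8' and shows what happens when the bytes are not valid UTF-8. The truncated input is a made-up example.

```python
c = b'\xe5\xb8\x90\xe6\x88\xb7'

c_string = c.decode()
same = c_string == c.decode('utf-8')

# Dropping the last byte leaves an incomplete UTF-8 sequence,
# which the default strict error handling rejects.
try:
    c[:-1].decode()
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print(c_string, same, truncated_ok)
```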

Finally, I wrote a function that turns a string into its escaped representation, keeping ASCII characters as they are and replacing every other character with its Unicode escape:

def get_unicode_code(text):
    result = ""
    for char in text:
        ord_value = ord(char)
        if ord_value < 128:
            result += char  # ASCII characters are kept unchanged
        elif ord_value <= 0xFF:
            result += "\\x" + format(ord_value, "02x")  # two hex digits
        elif ord_value <= 0xFFFF:
            result += "\\u" + format(ord_value, "04x")  # four hex digits, zero-padded
        else:
            # code points above U+FFFF need the eight-digit \U form
            result += "\\U" + format(ord_value, "08x")
    return result

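For text like the example, get_unicode_code(c_string) returns the same as c_string.encode('unicode-escape').decode('ascii'), though it is most likely less efficient. The two differ for backslashes and control characters such as newlines, which the unicode-escape codec also escapes and this function leaves alone.

It takes a string as an argument and returns a string in which each non-ASCII character is replaced by its Unicode escape.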
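For reference, here is a sketch of what the built-in codec produces for a few kinds of characters, which is the behavior a hand-written function has to match. The wrapper name builtin_escape and the sample inputs are made up for this illustration.

```python
def builtin_escape(text):
    # Thin wrapper around the codec used in the other answer.
    return text.encode('unicode-escape').decode('ascii')

latin1 = builtin_escape('é')           # code point below 0x100
bmp = builtin_escape('帐')              # code point below 0x10000
astral = builtin_escape('\U0001f600')  # code point above 0xFFFF
backslash = builtin_escape('\\')       # a single backslash
newline = builtin_escape('\n')         # a control character

for name, value in [('latin1', latin1), ('bmp', bmp), ('astral', astral),
                    ('backslash', backslash), ('newline', newline)]:
    print(name, value)
```

The width of the escape depends on the size of the code point, and backslashes and control characters are escaped even though they are ASCII.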

edited Jul 13, 2016 at 9:06

answered Jul 12, 2016 at 15:32

HolyDanna


2 Comments

Jonathan Hartley Over a year ago

Personally I'd write that function as

def get_unicode_code(text):
    result = ''.join(
        char if ord(char) < 128 else '\\u' + format(ord(char), 'x')
        for char in text
    )
    return result

HolyDanna Over a year ago

@JonathanHartley Thank you for correcting my code and making it more pythonic. This function returns the same as Dean's last line, d.encode('unicode-escape').decode('ascii'). I corrected the parentheses error and added some more code to make the function produce the desired result. The format call is there to transform the int into its hex value, which is then used to build the escape sequence manually.
