I know this works:

a = u"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print(a)  # 方法，删除存储在

But suppose I have a string from a JSON file that does not start with "u" (a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"). I know how to handle it in Python 2 (print unicode(a, encoding='unicode_escape') # Prints 方法，删除存储在). How do I do the same in Python 3?

Similarly, if it's a byte string loaded from a file, how do I convert it?

print("好的".encode("utf-8"))  # b'\xe5\xa5\xbd\xe7\x9a\x84'
# how to convert this?
b = '\xe5\xa5\xbd\xe7\x9a\x84'  # 好的
  • Python 3 uses Unicode by default, so just print(a) (your console must support Unicode). To convert a byte string to str in Python 3, use str(b, 'utf-8'). To test your code, use IDLE (the Python shell), which supports Unicode. Commented Aug 12, 2016 at 2:13
  • @Lex: Are you saying the file itself contains the literal text \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728? Commented Aug 12, 2016 at 2:23
  • @ShadowRanger Thanks for pointing that out; I removed my comment after you corrected my answer. Again, I was unaware how large the changes are between Python 2 and Python 3. Commented Aug 12, 2016 at 2:23
  • @acw1668 print(str("\xe5\xa5\xbd\xe7\x9a\x84","utf-8")) raises an error: "TypeError: decoding str is not supported". Commented Aug 12, 2016 at 2:29
  • @ShadowRanger Yes, it's JSON-escaped Unicode text. I got it working with print(json.loads('"{}"'.format(b))), but it looks odd, and if the JSON string is very long or not quite well-formed, this method may not work. Commented Aug 12, 2016 at 2:34

1 Answer

If I understand correctly, the file contains the literal text \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728 (so it's plain ASCII, but with backslashes and all that describe the Unicode ordinals the same way you would in a Python str literal). If so, there are two ways to handle this:

  1. Read the file in binary mode, then call mystr = mybytes.decode('unicode-escape') to convert from bytes to str, interpreting the escapes.
  2. Read the file in text mode and use the codecs module for the text-to-text conversion, e.g. decodedstr = codecs.decode(encodedstr, 'unicode-escape'). In Python 3, bytes-to-bytes and text-to-text codecs are supported only through the codecs module functions; bytes.decode is purely for bytes to text and str.encode is purely for text to bytes. In Python 2, calling str.encode or unicode.decode was usually a mistake, and removing those methods makes it easier to see which direction a conversion is supposed to go.
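
Both options can be sketched as follows. This is a minimal example, assuming the input really is pure ASCII containing literal backslash escapes; the raw string below stands in for the file contents, and the variable names are only illustrative.

```python
import codecs

# Assumption: this raw string stands in for what the file literally contains
# (a backslash, the letter u, and four hex digits per character).
raw_text = r"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"

# Option 1: treat the data as bytes (as if read with open(path, "rb"))
# and decode the escapes directly into a str.
raw_bytes = raw_text.encode("ascii")
decoded_from_bytes = raw_bytes.decode("unicode-escape")

# Option 2: treat the data as text (as if read with open(path, "r"))
# and run the text-to-text conversion through the codecs module.
decoded_from_text = codecs.decode(raw_text, "unicode-escape")

print(decoded_from_bytes)
print(decoded_from_text)
```

Note that unicode-escape treats any non-ASCII input as Latin-1, so this approach is only safe when the data contains nothing but ASCII characters and escapes.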

2 Comments

Not the OP but I tried reading from a file in binary mode that had one line \xe5\xa5\xbd\xe7\x9a\x84. This gave me b'\\xe5\\xa5\\xbd\\xe7\\x9a\\x84' and printing that with .decode('unicode-escape') gives 好ç ... and not '好的' as expected by OP
The string \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728 in the file was loaded from an HTTP request; it's a JSON Unicode string. I tried print(codecs.decode("'\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728'", 'unicode-escape')), but it prints 'æ¹æ³ï¼å é¤å­å¨å¨', not '好的'
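
A possible explanation for both results, sketched under assumptions rather than confirmed by the posters: \xNN escapes describe individual UTF-8 bytes, not code points, so they need an extra Latin-1 round trip after unicode-escape; and a non-raw Python literal such as "\u65b9" is already decoded by the parser, so running unicode-escape over it again produces mojibake. The variable names below are illustrative.

```python
import codecs
import json

# Case 1: the data contains literal \xNN escapes for UTF-8 bytes.
# rb"..." keeps the backslashes, mimicking a file read in binary mode.
escaped_utf8 = rb"\xe5\xa5\xbd\xe7\x9a\x84"
# unicode-escape maps each \xNN to the code point U+00NN (Latin-1 range),
# so re-encode as Latin-1 to get the real bytes back, then decode as UTF-8.
as_latin1_text = escaped_utf8.decode("unicode-escape")
recovered = as_latin1_text.encode("latin-1").decode("utf-8")
print(recovered)  # 好的

# Case 2: a raw string keeps the literal escapes; a normal literal does not.
literal_escapes = r"\u65b9\u6cd5"   # 12 ASCII characters
already_decoded = "\u65b9\u6cd5"    # 2 characters, nothing left to decode
from_codecs = codecs.decode(literal_escapes, "unicode-escape")

# Since the text is JSON, the json module can also interpret \uXXXX escapes
# when the value is parsed as a JSON string.
from_json = json.loads('"' + literal_escapes + '"')
print(from_codecs == already_decoded, from_json == already_decoded)
```

If the whole HTTP response is a JSON document, parsing the full document with json.loads is likely more robust than wrapping fragments in quotes by hand.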
