12

I have a 16-bit big-endian Unicode string represented as u'\u4132'.

How can I split it into the integers 41 and 32 in Python?

1
  • 1
    The whole question needs explanation. There is no such thing as an "n-bit bigendian unicode string". What you have is a unicode object of length 1. Secondly, the representation is HEXAdecimal. What do you want done with u'\uabcd'? Thirdly, WHY do you want it split into bytes? Commented Nov 21, 2010 at 21:40

6 Answers

19

Here are a variety of different ways you may want it.

Python 2:

>>> chars = u'\u4132'.encode('utf-16be')
>>> chars
'A2'
>>> ord(chars[0])
65
>>> '%x' % ord(chars[0])
'41'
>>> hex(ord(chars[0]))
'0x41'
>>> ['%x' % ord(c) for c in chars]
['41', '32']
>>> [hex(ord(c)) for c in chars]
['0x41', '0x32']

Python 3:

>>> chars = '\u4132'.encode('utf-16be')
>>> chars
b'A2'
>>> chars = bytes('\u4132', 'utf-16be')
>>> chars  # Just the same.
b'A2'
>>> chars[0]
65
>>> '%x' % chars[0]
'41'
>>> hex(chars[0])
'0x41'
>>> ['%x' % c for c in chars]
['41', '32']
>>> [hex(c) for c in chars]
['0x41', '0x32']

2 Comments

OP seems to want to split the bytes of the code point, instead of the bytes of the encoded code point: '\u4132' -> 41, 32. If that is the case, then your solution would break for code points that require a surrogate pair to be encoded like '\N{PILE OF POO}' '\U0001f4a9': list(map(hex, '\N{PILE OF POO}'.encode('utf-16be'))) = ['0xd8', '0x3d', '0xdc', '0xa9'] instead of: list(map(hex, ord('\N{PILE OF POO}').to_bytes(4, 'big'))) = ['0x0', '0x1', '0xf4', '0xa9']
@DaniloSouzaMorães Whatever the OP wants is weird. You may well be correct; if it were UTF-16LE in question I’d give you a 5% chance at best, but given it’s UTF-16BE I’ll call it about even.
4
  • Java: "\u4132".getBytes("UTF-16BE")
  • Python 2: u'\u4132'.encode('utf-16be')
  • Python 3: '\u4132'.encode('utf-16be')

These methods return a byte array, which you can convert to an int array easily. But note that code points above U+FFFF will be encoded using two code units (so with UTF-16BE this means 32 bits or 4 bytes).
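
As a minimal sketch of the "convert to an int array" step in Python 3 (the helper name `utf16be_ints` is made up for illustration and is not part of any library): iterating over a `bytes` object already yields integers, so `list()` is enough. The second call shows the four-byte surrogate-pair case for a code point above U+FFFF.

```python
def utf16be_ints(text):
    """Return the UTF-16BE encoding of text as a list of ints in 0..255."""
    # In Python 3, iterating over bytes yields ints, so list() is sufficient.
    return list(text.encode('utf-16be'))

bmp = utf16be_ints('\u4132')          # one code unit, two bytes
astral = utf16be_ints('\U0001f4a9')   # surrogate pair, four bytes

print([hex(b) for b in bmp])
print([hex(b) for b in astral])
```

In Python 2 the encoded result is a `str`, so each element would need `ord()` as in the first answer.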

2 Comments

why did you use the 'u' before the string in python 3?
Thanks for the note, I fixed it in the answer.
2

"Those" aren't integers, it's a hexadecimal number which represents the code point.

If you want an integer representation of the code point, use ord(u'\u4132'). To convert that integer back to the Unicode character, use unichr() (chr() in Python 3), which returns a one-character unicode string. Note that unicode() applied to the integer would only give you its decimal digits as text.
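
A short sketch of that round trip, written for Python 3 (where `chr()` accepts any code point; in Python 2 the equivalent call is `unichr()`). The variable names are only illustrative.

```python
code_point = ord('\u4132')   # integer value of the code point
print(code_point, hex(code_point))

restored = chr(code_point)   # back to a one-character string
print(restored == '\u4132')
```

The integer printed here is the decimal value of hexadecimal 4132, not the number four thousand one hundred thirty-two.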

2 Comments

Yep, I know they are not integers, but at the end of the operation I need to get 41 and 32 as integers; that's why I mentioned integers. As a correction, "how can I convert u"\u4132" to the string or int 4132" might be better :)
@hinoglu: if you indeed want to get 41 and 32 as integers, your application is likely to be broken. What sensible behavior should it have if, instead of u'\u4132', you have, say, u'\u412F'? That is just as valid as the other. While there may be reasons to want the string '4132' from the above (or '412F'), I can't imagine any sensible reason for wanting the integer 4132, except maybe for converting it back to the string '4132'. Will you please explain your use case?
2
>>> c = u'\u4132'
>>> '%x' % ord(c)
'4132'


1

Dirty hack: repr(u'\u4132') will return "u'\\u4132'"

1 Comment

Eek. Please look at the other answers (what a coincidence - I wrote one ;-) ) - they achieve the same thing without that ghastly hack!
0

Pass the unicode character to ord() to get its code point and then break that code point into individual bytes with int.to_bytes() and then format the output however you want:

list(map(lambda b: hex(b)[2:], ord('\u4132').to_bytes(4, 'big')))

returns: ['0', '0', '41', '32']

list(map(lambda b: hex(b)[2:], ord('\N{PILE OF POO}').to_bytes(4, 'big')))

returns: ['0', '1', 'f4', 'a9']

As I mentioned in a comment on another answer, encoding the code point to UTF-16 will not work as expected for code points outside the BMP (Basic Multilingual Plane), since UTF-16 needs a surrogate pair to encode those code points.
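
If the leading zero bytes from the fixed width of 4 are unwanted, one possible variation is to size the output from the code point itself. This is only a sketch: the helper name `code_point_bytes` is hypothetical, and choosing the width via `int.bit_length()` is my assumption about what "split into bytes" should mean here.

```python
def code_point_bytes(ch):
    """Split the code point of a single character into big-endian byte values."""
    cp = ord(ch)
    # Smallest number of bytes that can hold cp (at least one, so U+0000 works).
    n = max(1, (cp.bit_length() + 7) // 8)
    return list(cp.to_bytes(n, 'big'))

high, low = code_point_bytes('\u4132')
print('%x %x' % (high, low))

# A code point outside the BMP yields three bytes instead of two.
print([format(b, '02x') for b in code_point_bytes('\U0001f4a9')])
```

The result then has a variable length (one to three bytes), so callers that expect exactly two values should check the length first.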
