I have a CSV file, apparently UTF-16, dumped from SQL Server. The file contains properly encoded Spanish accents, but some rows are encoded differently, like this:

0xd83d0xde1b0xd83d0xde1b0xd83d0xde1b

This seems to be a strange encoding for

\ud83d\ude1b\ud83d\ude1b\ud83d\ude1b

\ud83d\ude1b is a surrogate pair for an emoji
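
Treated as UTF-16, that pair combines into the single code point U+1F61B. A quick sanity check in Python, using the surrogatepass error handler to re-join the lone surrogates:

>>> '\ud83d\ude1b'.encode('utf-16', 'surrogatepass').decode('utf-16')
'😛'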

I need to convert everything to a nice, neat UTF-8 file. I tried endless combinations of bytearray(), encode(), decode(), and so on.

How can I convert this file of mixed UTF-16 and escaped UTF-16 into proper Python 3 strings, and finally save them to a new UTF-8 file?

1 Answer

You can convert the hex data like this:

>>> import binascii
>>> s = '0xd83d0xde1b0xd83d0xde1b0xd83d0xde1b'


>>> # Remove the '0x' prefixes
>>> hs = s.replace('0x', '')

>>> # Convert from hex to bytes
>>> bs = binascii.unhexlify(hs)
>>> bs
b'\xd8=\xde\x1b\xd8=\xde\x1b\xd8=\xde\x1b'

>>> # Decode to str
>>> bs.decode('utf-16be')
'😛😛😛'
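
If you need to apply this to the whole file, here is a rough sketch (assuming the escaped runs always match the 0x.... pattern above, that the rest of the file decodes cleanly as UTF-16, and with made-up file names): read the dump as UTF-16, fix each escaped run with a regex, and write the result out as UTF-8.

import binascii
import re

# Each escaped run is a sequence of 0x-prefixed UTF-16 code units, e.g. '0xd83d0xde1b'
ESCAPED = re.compile(r'(?:0x[0-9a-fA-F]{4})+')

def unescape(match):
    # Strip the '0x' prefixes, turn the hex digits into bytes, decode as UTF-16BE
    hex_digits = match.group(0).replace('0x', '')
    return binascii.unhexlify(hex_digits).decode('utf-16be')

# Read the SQL Server dump as UTF-16, repair the escaped runs, save as UTF-8
with open('dump.csv', encoding='utf-16') as src:
    text = src.read()

with open('dump_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(ESCAPED.sub(unescape, text))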