I have a CSV file, apparently UTF-16, dumped from SQL Server. The file contains properly encoded Spanish accents, but some rows are encoded differently, like this:

0xd83d0xde1b0xd83d0xde1b0xd83d0xde1b

This seems to be a strange encoding for

\ud83d\ude1b\ud83d\ude1b\ud83d\ude1b

\ud83d\ude1b is a surrogate pair for an emoji
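
Treated as UTF-16, that pair combines into the single code point U+1F61B. A quick sanity check in Python, using the surrogatepass error handler to re-join the lone surrogates:

>>> '\ud83d\ude1b'.encode('utf-16', 'surrogatepass').decode('utf-16')
'😛'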

I need to convert everything to a nice, neat UTF-8 file. I tried endless combinations of bytearray(), encode(), decode(), and so on.

How can I convert this file of mixed UTF-16 and escaped UTF-16 into proper Python 3 strings, and finally save them to a new UTF-8 file?

1 Answer

You can convert the hex data like this:

>>> import binascii
>>> s = '0xd83d0xde1b0xd83d0xde1b0xd83d0xde1b'


>>> # Remove the '0x' prefixes
>>> hs = s.replace('0x', '')

>>> # Convert from hex to bytes
>>> bs = binascii.unhexlify(hs)
>>> bs
b'\xd8=\xde\x1b\xd8=\xde\x1b\xd8=\xde\x1b'

>>> # Decode to str
>>> bs.decode('utf-16be')
'😛😛😛'
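
If you need to apply this to the whole file, here is a rough sketch (assuming the escaped runs always match the 0x.... pattern above, that the rest of the file decodes cleanly as UTF-16, and with made-up file names): read the dump as UTF-16, fix each escaped run with a regex, and write the result out as UTF-8.

import binascii
import re

# Each escaped run is a sequence of 0x-prefixed UTF-16 code units, e.g. '0xd83d0xde1b'
ESCAPED = re.compile(r'(?:0x[0-9a-fA-F]{4})+')

def unescape(match):
    # Strip the '0x' prefixes, turn the hex digits into bytes, decode as UTF-16BE
    hex_digits = match.group(0).replace('0x', '')
    return binascii.unhexlify(hex_digits).decode('utf-16be')

# Read the SQL Server dump as UTF-16, repair the escaped runs, save as UTF-8
with open('dump.csv', encoding='utf-16') as src:
    text = src.read()

with open('dump_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(ESCAPED.sub(unescape, text))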