SQLAlchemy UnicodeEncodeError on '😕' from SQL Server to Postgres

Question

Some text is being fetched from an nvarchar column on a SQL Server database using SQLAlchemy and then written to a Postgres database with column type character varying. Python throws the following exception when the source data contains the character 😕:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d'
in position 34: surrogates not allowed

The SQLAlchemy column type is set to String. I tried setting the column type to Unicode and separately set the column collation to SQL_Latin1_General_CP1_CI_AS, with no luck. The driver being used is FreeTDS.

Why can't Python encode this string? Does the problem lie with our use of SQLAlchemy, Python or Postgres? This error is making me 😕.

Daniel · Accepted Answer · 2016-05-10 12:05:38Z

1

The codepoint \U1f615 (😕) can be represented by the two surrogates \ud83d and \ude15. Somehow your SQL-Server, which uses internally UTF16 was decoded as UCS2, so that the surrogates are not properly decoded. So the problem is the SQL-Server. If you cannot correctly read the data, you have to manually correct the unicode strings, like so (python3):

def surrogate_to_unicode(sur):
    a, b = sur
    return chr(0x10000 + ((ord(a)-0xd800)<<10) + (ord(b)-0xdc00))

text = '\ud83d\ude15'
text = re.sub('[\ud800-\udbff][\udc00-\udfff]', lambda g:surrogate_to_unicode(g.group(0)), text)

answered May 10, 2016 at 12:05

Daniel

42.9k4 gold badges57 silver badges82 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

snakecharmerb Over a year ago

See also stackoverflow.com/questions/38147259/…

Collectives™ on Stack Overflow

SQLAlchemy UnicodeEncodeError on '😕' from SQL Server to Postgres

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related