Reading unicode characters from file/sqlite database and using it in Python

Question

I have a list of variables containing Unicode characters, some of them chemical formulas such as ozone: 'O\u2083'. They are all stored in a SQLite database, which my Python code reads in order to display O₃. However, when I read the value back I get 'O\\u2083'. The SQLite database is created from a CSV file that contains the string 'O\u2083', among others. I understand that \u2083 is not being stored in the SQLite database as a single Unicode character but as 6 characters (\, u, 2, 0, 8, 3). Is there any way to recognize Unicode escapes in this context? My first idea is to write a function that recognizes such character sequences and replaces them with the corresponding Unicode characters. Is there anything like this already implemented?

jfs · Accepted Answer · 2016-07-01 17:04:39Z

2

SQLite allows you to read/write Unicode text directly. u'O\u2083' is two characters: u'O' and u'\u2083' (your question has a typo: 'u\2083' != '\u2083').

"I understand that u\2083 is not being stored in sqlite database as unicode character but as 6 unicode characters (which would be u,\,2,0,8,3)"

Don't confuse u'u\2083' and u'\u2083': the latter is a single character, while the former is a 4-character sequence: u'u', u'\x10' ('\20' is read as an octal escape in Python), u'8', u'3'.
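A quick way to see the difference is to inspect both literals. This is a minimal sketch for Python 3, where plain string literals are already Unicode text; the variable names are only illustrative.

```python
# Python 3: plain string literals are Unicode text, so no u'' prefix is needed.
typo = 'u\2083'       # 'u', then the octal escape \20 (U+0010), then '8' and '3'
subscript = '\u2083'  # one character: SUBSCRIPT THREE

# Show the length and the code points of each string.
print(len(typo), [hex(ord(c)) for c in typo])
print(len(subscript), hex(ord(subscript)))
```

The first line reports 4 characters (0x75, 0x10, 0x38, 0x33), the second reports a single character, 0x2083.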

If you save the single Unicode character u'\u2083' into a SQLite database, it is stored as a single Unicode character (the internal representation of Unicode inside the database is irrelevant as long as the abstraction holds).
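The round trip can be checked with the standard sqlite3 module. This is a sketch for Python 3 using an in-memory database; the table and column names are made up for illustration and are not taken from the question.

```python
import sqlite3

# In-memory database; `species` and `name` are hypothetical names.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE species (name TEXT)')

# Pass the value as a parameter so it is stored as text, unchanged.
conn.execute('INSERT INTO species (name) VALUES (?)', ('O\u2083',))

(name,) = conn.execute('SELECT name FROM species').fetchone()
conn.close()

print(len(name))          # 'O' plus U+2083
print(name == 'O\u2083')  # the value read back equals the value written
```

If the value read back has length 7 instead of 2, the escape text was already present in whatever was inserted, which points at the import step rather than at SQLite.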

On Python 2, if there is no from __future__ import unicode_literals at the top of the module, then a string literal such as 'abc' creates a bytestring instead of a Unicode string. In that case both 'u\2083' and '\u2083' are sequences of bytes, not text characters (\uxxxx is not recognized as a Unicode escape sequence inside bytestrings).

edited Jul 1, 2016 at 17:04

answered Jul 1, 2016 at 12:59

1 Comment

awulll Over a year ago

I've edited it now. It was my fault. It is \u2083! Sorry for that!

Mark Tolonen · Accepted Answer · 2016-06-30 00:39:22Z

1

If you have a byte string that contains the literal escape text (length 7), decode the Unicode escape. On Python 2:

>>> s = 'O\u2083'
>>> len(s)
7
>>> s
'O\\u2083'
>>> print(s)
O\u2083
>>> u = s.decode('unicode-escape')
>>> len(u)
2
>>> u
u'O\u2083'
>>> print(u)
O₃
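The session above relies on Python 2, where a plain string literal is a byte string with a .decode() method. A rough Python 3 equivalent is sketched below; `raw` is a stand-in for the value read from the database, and the sketch assumes the data contains no other backslashes that should stay literal.

```python
# `raw` is 'O' followed by a literal backslash and the characters u, 2, 0, 8, 3.
raw = 'O\\u2083'

# str has no .decode() in Python 3, so go through bytes first.
# 'backslashreplace' turns any non-ASCII character into an escape that
# unicode_escape restores afterwards, instead of raising UnicodeEncodeError.
text = raw.encode('ascii', 'backslashreplace').decode('unicode_escape')

print(len(raw), len(text))
```

Here len(raw) is 7 and len(text) is 2. Note that every backslash sequence in the input is interpreted, so a value containing a literal backslash followed by 'n' would become a newline.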

Caveat: Your console/IDE used to print the character needs to use an encoding that supports the character or you'll get a UnicodeEncodeError when printing. The font must support the symbol as well.

edited Jun 30, 2016 at 0:39

answered Jun 30, 2016 at 0:34

2 Comments

jfs Over a year ago

@awulll: in most cases, .decode('unicode-escape') indicates a bug in your code (or upstream). Do not use it: it fixes surface symptoms while ignoring the core issue. For example, if the input format contains JSON text, then the correct solution is to parse it with the json module instead of unicode-escape. There is not enough detail in your question to work out what your actual input format is. SQLite can and should store a single Unicode character instead of the byte sequence; fix the process that writes data to the database and/or the CSV file.
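To illustrate the JSON case mentioned in this comment: if the upstream data really were JSON, the json module would turn the escape into the intended character while parsing. The input below is hypothetical, since the question does not say what the actual format is.

```python
import json

# Hypothetical input: JSON text in which the subscript is written as \u2083.
json_text = '{"formula": "O\\u2083"}'

# The JSON parser interprets the escape according to the JSON grammar.
record = json.loads(json_text)
formula = record['formula']

print(len(formula))  # 'O' plus U+2083
```

The parsed value has length 2, so it can be written to SQLite as-is without any later unicode-escape step.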

awulll Over a year ago

@J.F.Sebastian, thanks for the comment. Your answer is useful too. For now, 'unicode-escape' helps me because this is just a small part of my work, but in the future I'll have to rewrite some SQLite databases and I'll take your advice into account. Thank you!

sytech · Accepted Answer · 2016-06-29 18:19:40Z

1

It's important to remember everything is bytes. To pull bytes into something useful to you, you kind of have to know what encoding is used when you pull in data. There are too many ambiguous cases to determine encoding by analyzing the data. When you send data out of your program, it's all back out to bytes again. Depending on whether you're using Python 2.x or 3.x you'll have a very different experience with Unicode and Python.

You can, however, attempt encoding and simply do a "replace" on errors. For example, the_string.encode("utf-8", "replace") will try to encode as UTF-8 and will replace problem characters with a ?. You could also anticipate problem characters and replace them beforehand, but that gets unmanageable quickly. Take a look at the codecs module for more replacement options.
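The sketch below shows what a few of the standard error handlers do with the string from the question (Python 3). UTF-8 can represent U+2083, so "replace" only has a visible effect with a narrower target encoding; ASCII is used here purely as an illustration.

```python
text = 'O\u2083'

# UTF-8 covers U+2083, so nothing needs replacing.
utf8 = text.encode('utf-8')                            # b'O\xe2\x82\x83'

# ASCII cannot represent U+2083, so the error handler decides the outcome.
ascii_q = text.encode('ascii', 'replace')              # b'O?'
ascii_esc = text.encode('ascii', 'backslashreplace')   # b'O\\u2083'
ascii_xml = text.encode('ascii', 'xmlcharrefreplace')  # b'O&#8323;'

for encoded in (utf8, ascii_q, ascii_esc, ascii_xml):
    print(encoded)
```

Note that 'replace' discards the original character, while 'backslashreplace' and 'xmlcharrefreplace' keep enough information to recover it.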

answered Jun 29, 2016 at 18:19

1 Comment

awulll Over a year ago

Thanks, but the point is that in my case I have only Unicode text; the problem is that '\u2083' is 6 characters and not one, as I need. What I need is to convert a string of 6 characters ('\u2083') into one Unicode character (\u2083, the small 3 in ozone), and to do the same for any other character. I could write a function with a Unicode table and do replacements where necessary, but it would be nice if there were another way to handle it.
