Writing from Python to a database with an encoding different from utf8

Question

Python 3.7.2

I write strings from my Python code into my database. The strings contain Latin and Cyrillic characters, so in the database I use the 1-byte encoding koi8-r. The miracle is that my strings are written to the database without distortion, although utf8 and koi8r have completely different character sequences (for example, as in ascii and utf8). Sometimes characters from other scripts appear in the text, and then write errors occur.

So my questions are:

1. Who converts the strings: the database, or the aiomysql library that I use to write to the database?
2. How can I quickly remove non-koi8-r characters in Python / MariaDB to avoid errors?
3. Is there a multibyte encoding that stores Latin and Cyrillic characters in a single byte, and characters from other scripts in additional bytes?

Thank you in advance for participating in the conversation.

Perhaps there are databases that support an "economical" multibyte encoding that satisfies item 3? MariaDB stores utf8 characters in 3 bytes and therefore does not satisfy this condition.
– MihailDr, Commented Mar 4, 2019 at 8:11
1: It's possible that either does it; the important part is that there are mechanisms in the driver/API that are aware of encodings and handle them appropriately.
– deceze ♦, Commented Mar 4, 2019 at 8:26
If you can choose your encoding at all, why not use UTF-8?!
– deceze ♦, Commented Mar 4, 2019 at 8:26
Use utf-8 all the way - if you care about your data and mental health...
– bruno desthuilliers, Commented Mar 4, 2019 at 9:39
I don't want to use utf8, because in MariaDB each character takes 3 bytes.
– MihailDr, Commented Mar 4, 2019 at 10:33

Rick James · Accepted Answer · 2019-03-06 22:18:58Z

Here's the processing when INSERTing:

The Client has the characters encoded with charset-1.
You told MySQL that that was the case when you connected or via SET NAMES.
The column that the characters will be inserted into is declared to be charset-2.
The INSERT converts from charset-1 to charset-2. So, all is well.

Upon SELECTing, the same thing happens, except that the conversion is in the other direction.
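
The round trip described above can be imitated in plain Python, without a database, to see what each step does to the bytes. This is only a sketch under assumptions: the function names are hypothetical, and Python's `utf-8` and `koi8_r` codecs stand in for the conversion that the server performs internally between the connection charset and the column charset.

```python
# Illustration only: Python codecs stand in for the server-side conversion.
CLIENT_CHARSET = "utf-8"   # charset-1: what the connection / SET NAMES declares
COLUMN_CHARSET = "koi8_r"  # charset-2: what the column is declared as


def simulate_insert(text):
    """Return (bytes sent by the client, bytes stored in the column)."""
    wire = text.encode(CLIENT_CHARSET)          # client sends charset-1 bytes
    characters = wire.decode(CLIENT_CHARSET)    # server reads them as charset-1
    stored = characters.encode(COLUMN_CHARSET)  # and converts them to charset-2
    return wire, stored


def simulate_select(stored):
    """Convert stored charset-2 bytes back into the client's string."""
    characters = stored.decode(COLUMN_CHARSET)  # server reads the column bytes
    wire = characters.encode(CLIENT_CHARSET)    # and converts them to charset-1
    return wire.decode(CLIENT_CHARSET)          # client decodes what it receives


wire, stored = simulate_insert("Привет, world")
print(len(wire), len(stored))   # 19 13
print(simulate_select(stored))  # Привет, world
```

The same text occupies different byte sequences on the wire and in the column, yet the string survives the round trip because each side knows which charset the bytes are in.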

What you are doing is OK. But, going forward, everyone 'should' use UTF-8 characters in clients and CHARACTER SET utf8mb4 for columns. You will essentially have to make that change if you ever need characters beyond what your current character set covers, which may be little more than Russian and English.
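
Until such a switch, characters that koi8r cannot represent can be dropped on the Python side before the INSERT, which avoids the write errors mentioned in the question. A minimal sketch, assuming Python's built-in `koi8_r` codec covers the same repertoire as the column's koi8r charset; both helper names are hypothetical:

```python
def strip_non_koi8r(text):
    """Drop every character that has no KOI8-R encoding."""
    # errors="ignore" skips unencodable characters instead of raising.
    return text.encode("koi8_r", errors="ignore").decode("koi8_r")


def find_non_koi8r(text):
    """Return the characters KOI8-R cannot represent, in order of appearance."""
    bad = []
    for ch in text:
        try:
            ch.encode("koi8_r")
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


sample = "Привет, 你好!"
print(strip_non_koi8r(sample))  # Привет, !
print(find_non_koi8r(sample))   # ['你', '好']
```

Passing `errors="replace"` instead of `errors="ignore"` would substitute `?` for each dropped character, which keeps a visible trace of the loss.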


1 Comment

Rick James Over a year ago

PS: Cyrillic (non-Latin) characters take 2 bytes each in UTF-8. So, using utf8mb4 makes text up to twice as bulky as when using koi8r.
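
The size difference is easy to check in Python by encoding the same text both ways. This is a small sketch with a hypothetical helper name; it measures the encoded text only and says nothing about any per-row overhead in the database.

```python
def encoded_sizes(text):
    """Return the byte length of text in KOI8-R and in UTF-8."""
    return {name: len(text.encode(name)) for name in ("koi8_r", "utf-8")}


print(encoded_sizes("Привет"))         # {'koi8_r': 6, 'utf-8': 12}
print(encoded_sizes("Hello"))          # {'koi8_r': 5, 'utf-8': 5}
print(encoded_sizes("Привет, world"))  # {'koi8_r': 13, 'utf-8': 19}
```

Pure Cyrillic text doubles in size, Latin text does not grow at all, and mixed text falls in between.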
