Python 3.7.2

I write strings from my Python code into my database. The strings contain Latin and Cyrillic characters, so in the database I use the single-byte encoding koi8-r. Surprisingly, the strings are written to the database without distortion, even though utf8 and koi8r encode the same characters as completely different byte sequences (unlike, for example, ascii and utf8). Sometimes characters from other scripts appear in the text, and then the write fails with an error.

So my questions are:

  1. Who converts the strings: the database, or the aiomysql library that I use to write to the database?
  2. What is a fast way, in Python or MariaDB, to remove non-koi8-r characters to avoid these errors?
  3. Is there a multibyte encoding that stores Latin and Cyrillic characters in a single byte and characters of other scripts in additional bytes?

Thank you in advance for participating in the conversation.

  • Perhaps there are databases that support "economical" multibyte encoding that satisfies item 3? MariaDB stores utf8 characters in 3 bytes and therefore does not satisfy these conditions. Commented Mar 4, 2019 at 8:11
  • 1: Either one could be doing it; the important part is that the driver/API has mechanisms that are aware of encodings and handle them appropriately. Commented Mar 4, 2019 at 8:26
  • If you can choose your encoding at all, why not use UTF-8?! Commented Mar 4, 2019 at 8:26
  • Use utf-8 all the way - if you care about your data and mental health... Commented Mar 4, 2019 at 9:39
  • I don't want to use utf8, because in MariaDB each character takes 3 bytes. Commented Mar 4, 2019 at 10:33

1 Answer

Here's the processing when INSERTing:

  1. The Client has the characters encoded with charset-1.
  2. You told MySQL that that was the case when you connected or via SET NAMES.
  3. The column that the characters will be inserted into is declared to be charset-2.
  4. The INSERT converts from charset-1 to charset-2. So, all is well.

Upon SELECTing, the same thing happens, except that the conversion is in the other direction.
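
To make the two directions concrete, here is a minimal sketch that imitates the conversion with Python's built-in codecs. The real conversion happens inside the MySQL/MariaDB server, not in Python; the helper name `convert` and the sample string are only illustrative, and I am assuming the client sends UTF-8 (charset-1) while the column is koi8r (charset-2).

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from charset `src` to charset `dst`.

    Hypothetical helper: it only imitates what the server does,
    using Python's codecs instead of MySQL's own conversion tables.
    """
    return data.decode(src).encode(dst)


text = "Привет, world"

# What the client sends (charset-1, assumed to be UTF-8 here).
client_bytes = text.encode("utf-8")

# INSERT direction: charset-1 -> charset-2 (the column's charset).
stored_bytes = convert(client_bytes, "utf-8", "koi8_r")

# SELECT direction: charset-2 -> charset-1.
returned_bytes = convert(stored_bytes, "koi8_r", "utf-8")

# The bytes differ on the way in, but the text survives the round trip.
roundtrip_ok = returned_bytes.decode("utf-8") == text
```

The stored bytes are not the bytes the client sent, yet the characters are the same, which is why nothing looks distorted as long as every character exists in both charsets.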

What you are doing is OK. But, going forward, everyone 'should' use UTF-8 characters in clients and CHARACTER SET utf8mb4 for columns. You will essentially have to switch to that if you ever branch out beyond what your character sets allow, which may be nothing more than Russian and English.
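
Until then, for question 2, one way to drop characters that koi8-r cannot represent is to do it in Python before the INSERT, using the codec's error handlers. This is only a sketch: the function names are made up for illustration, and Python's `koi8_r` codec may not match MariaDB's koi8r table in every corner case.

```python
def strip_non_koi8r(text: str) -> str:
    """Silently drop characters that have no koi8-r representation."""
    return text.encode("koi8_r", errors="ignore").decode("koi8_r")


def replace_non_koi8r(text: str) -> str:
    """Replace unrepresentable characters with '?' instead of dropping them."""
    return text.encode("koi8_r", errors="replace").decode("koi8_r")


# Chinese characters and the Ukrainian letter 'ї' are outside koi8-r.
cleaned = strip_non_koi8r("Привет, 世界!")
marked = replace_non_koi8r("Київ")
```

Both helpers do a single encode/decode pass in C, so they should be cheap even for long strings. Whether silently losing characters is acceptable is a separate decision.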

1 Comment

PS: Cyrillic characters take 2 bytes in UTF-8, while Latin letters take 1. So, using utf8mb4 makes text up to twice as bulky as when using koi8r.
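
A quick way to check the size difference for your own data is to compare encoded lengths in Python. This is a rough sketch with made-up sample strings; it measures only the encoded text, not any per-row or per-column overhead the storage engine adds.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in each encoding."""
    return {name: len(text.encode(name)) for name in ("koi8_r", "utf-8")}


# Pure Cyrillic: every character doubles in size under UTF-8.
sizes_cyrillic = encoded_sizes("Привет")

# Mixed text: only the Cyrillic part doubles, the Latin part stays 1 byte each.
sizes_mixed = encoded_sizes("Привет, world")
```

For mixed Russian and English text the overhead therefore lands somewhere between 1x and 2x, depending on the share of Cyrillic.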
