I'm using ftfy to repair mojibake (UTF-8 Cyrillic text that was decoded as CP1252) and turn it back into proper UTF-8 Cyrillic, but I've found that some letters can't be fixed.
I have a string Ð'010Ð¡Ð¡199, which I converted to bytes, b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199", and from which I worked out the following pairs:
\xc3\x90' -> \xd0\x92 -> Cyrillic В
\xc3\x90\xc2\xa1 -> \xd0\xa1 -> Cyrillic С
As you can see, Ð' has length 2, so ord() won't work in this case.
To use slicing I would have to know where each sequence starts and ends.
translate() doesn't work here either.
Previously I used simple string replacement, but now I'd like to improve my method and rule out mistakes.
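For context, the pairs above are what you get when UTF-8 bytes are decoded as CP1252 and then re-encoded as UTF-8. Below is a minimal sketch of that round trip using only the standard library. The helper names `garble` and `ungarble` are made up for illustration. It suggests why Ð' is the problem case: byte 0x92 of В becomes the curly quote U+2019, and once something straightens it to a plain apostrophe, the reverse conversion no longer decodes.

```python
def garble(text: str) -> str:
    """Reproduce the mojibake: UTF-8 bytes misread as CP1252."""
    return text.encode("utf-8").decode("cp1252")


def ungarble(text: str) -> str:
    """Undo the mojibake; only works if no character was altered in between."""
    return text.encode("cp1252").decode("utf-8")


PLATE = "\u0412010\u0421\u0421199"      # В010СС199

good = garble(PLATE)                    # Ð + U+2019, then 010, Ð¡, Ð¡, 199
roundtrip = ungarble(good)              # back to the original string

# Assumption: some upstream step replaced the curly quote with a plain one.
damaged = good.replace("\u2019", "'")

error = None
try:
    ungarble(damaged)
except UnicodeDecodeError as exc:
    # 0xD0 is now followed by 0x27, which is not a UTF-8 continuation byte.
    error = exc
```

Here `damaged` is exactly the string from the question, which is why a plain encode/decode round trip cannot repair it.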
Original Ð'010Ð¡Ð¡199 -> conversion -> output В010СС199
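One way the conversion step might look is sketched below. It rests on a loud assumption: the only damage is that a curly quote (U+2019, CP1252 byte 0x92) directly after Ð or Ñ was straightened to a plain apostrophe. The function name `repair_cyrillic` is hypothetical and this is not ftfy's API. Note that a straightened U+2018 (byte 0x91, which would give Б) looks identical, so this sketch follows the question's mapping and always picks В.

```python
import re

# Assumption: an apostrophe right after a mojibake lead character
# (U+00D0 or U+00D1, i.e. bytes 0xD0 / 0xD1) used to be U+2019.
_STRAIGHTENED = re.compile("(?<=[\u00d0\u00d1])'")


def repair_cyrillic(text: str) -> str:
    """Restore the curly quote, then reverse the CP1252 misdecoding."""
    restored = _STRAIGHTENED.sub("\u2019", text)
    try:
        return restored.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not recoverable with this assumption; leave the input untouched.
        return text
```

Calling `repair_cyrillic` on the string from the question should give В010СС199, and text that does not fit the assumption is returned unchanged rather than mangled further.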
EDIT:
text = "Ð'010Ð¡Ð¡199"  # renamed from str to avoid shadowing the built-in
str_to_bytes = text.encode("UTF-8")
print(str_to_bytes)
# UTF-8 bytes
# \xc3\x90\xc2\xa0 : \xd0\xa0 -> Cyrillic Р
# \xc3\x90\xc2\xa1 : \xd0\xa1 -> Cyrillic С
# \xc3\x90\xe2\x80\x94 : \xd0\x97 -> Cyrillic З
# \xc3\x90' : \xd0\x92 -> Cyrillic В
test_str = b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199"
t1 = test_str.replace(b'\xc3\x90\xc2\xa1', b'\xd0\xa1')
print(t1)
dict_cyr = {"Ð'": "P",
"С":"C"}
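# bytes.translate() expects a 256-byte table, so this call raises ValueError.
# It also maps single bytes only, so it cannot use multi-character keys like those in dict_cyr.
t2 = test_str.translate(test_str)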
print(t2)
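Since translate() only handles single-character (or single-byte) keys, a table-driven regex substitution is one possible replacement for chained replace() calls. The sketch below uses the four byte pairs listed in the comments above. The names `PAIRS` and `replace_pairs` are made up for illustration, and the table would need to be extended for other letters.

```python
import re

# Mojibake byte sequence -> intended UTF-8 bytes (pairs taken from the question).
PAIRS = {
    b"\xc3\x90\xc2\xa0": b"\xd0\xa0",      # Cyrillic Р
    b"\xc3\x90\xc2\xa1": b"\xd0\xa1",      # Cyrillic С
    b"\xc3\x90\xe2\x80\x94": b"\xd0\x97",  # Cyrillic З
    b"\xc3\x90'": b"\xd0\x92",             # Cyrillic В
}

# Longest keys first, so a longer sequence is never shadowed by a shorter one.
_PATTERN = re.compile(
    b"|".join(re.escape(key) for key in sorted(PAIRS, key=len, reverse=True))
)


def replace_pairs(data: bytes) -> bytes:
    """Replace every known mojibake sequence in a single pass."""
    return _PATTERN.sub(lambda match: PAIRS[match.group(0)], data)
```

Applied to `test_str`, `replace_pairs(test_str).decode("utf-8")` should yield В010СС199. Bytes that match no key pass through unchanged.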
I can explain how I got these results. 1. I used the 2cyr.com decoder, but even that failed in some cases. 2. I have manually translated strings, so I compared them and worked out which bytes correspond to which Cyrillic letter with the help of a UTF-8 chart table.
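The chart lookup in step 2 could probably be automated. The sketch below generates the mojibake form of each letter from А to я by encoding it as UTF-8 and decoding the bytes as CP1252. The function name `build_mojibake_table` is hypothetical. Letters whose second byte has no mapping in Python's cp1252 codec (for example А, byte 0x90) are skipped here and would need separate handling.

```python
def build_mojibake_table(first: int = 0x0410, last: int = 0x044F) -> dict:
    """Map the CP1252-garbled form of each letter back to the letter itself."""
    table = {}
    for code_point in range(first, last + 1):
        letter = chr(code_point)
        try:
            garbled = letter.encode("utf-8").decode("cp1252")
        except UnicodeDecodeError:
            # Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in cp1252.
            continue
        table[garbled] = letter
    return table


MOJIBAKE_TABLE = build_mojibake_table()
```

The keys are strings with undamaged characters (curly quote, dash, no-break space), so any straightened variants, such as Ð followed by a plain apostrophe, would still have to be added by hand.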
list[str_to_bytes] and use decimal values. Because \xc3\x90 looks like a control character. str object and a test_str value.