Replacing byte in bytes array to fix encoding

Question

I'm using ftfy to fix broken text (UTF-8 that was decoded as CP1252) and convert it back to Cyrillic UTF-8, but I've found that some letters can't be fixed.

I have a string Ð'010Ð¡Ð¡199 that I convert to bytes, b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199", and I have worked out these byte pairs:

\xc3\x90' -> \xd0\x92 -> Cyrillic В
\xc3\x90\xc2\xa1 -> \xd0\xa1 -> Cyrillic С

As you can see, Ð' has a length of 2, so ord won't work in this case.

To use slicing, I would have to know where each sequence starts and ends.

translate doesn't work here either.

Previously I used simple string replacement, but now I'd like to improve my method and rule out mistakes.

Original Ð'010Ð¡Ð¡199 -> conversion -> output В010СС199

EDIT:

    text = "Ð'010Ð¡Ð¡199"
    str_to_bytes = text.encode("utf-8")
    print(str_to_bytes)
    # Mojibake bytes (as UTF-8) : intended UTF-8 bytes -> letter
    # \xc3\x90\xc2\xa0 : \xd0\xa0 -> Cyrillic Р
    # \xc3\x90\xc2\xa1 : \xd0\xa1 -> Cyrillic С
    # \xc3\x90\xe2\x80\x94 : \xd0\x97 -> Cyrillic З
    # \xc3\x90' : \xd0\x92 -> Cyrillic В
    test_str = b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199"
    t1 = test_str.replace(b'\xc3\x90\xc2\xa1', b'\xd0\xa1')
    print(t1)
    dict_cyr = {"Ð'": "В",
                "Ð¡": "С"}
    # This attempt fails: bytes.translate() expects a 256-byte table that
    # maps single bytes, so passing test_str raises ValueError.
    t2 = test_str.translate(test_str)
    print(t2)
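
For reference, the multi-byte replacement attempted above can be sketched with a plain lookup table of byte sequences. `bytes.translate` only maps single bytes, so chained `bytes.replace` calls are used instead. `PAIRS` and `replace_pairs` are hypothetical names for illustration, and the table holds only three of the pairs listed above; it is a sketch of the manual approach, not a general fix.

```python
# Hypothetical lookup table: mojibake byte sequence -> intended UTF-8 bytes.
PAIRS = {
    b"\xc3\x90\xc2\xa0": b"\xd0\xa0",  # Cyrillic Р
    b"\xc3\x90\xc2\xa1": b"\xd0\xa1",  # Cyrillic С
    b"\xc3\x90'": b"\xd0\x92",         # Cyrillic В (apostrophe variant)
}


def replace_pairs(data, pairs):
    """Replace every broken sequence in data, longest sequences first."""
    for broken in sorted(pairs, key=len, reverse=True):
        data = data.replace(broken, pairs[broken])
    return data


fixed = replace_pairs(b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199", PAIRS)
print(fixed.decode("utf-8"))
```

Trying longer sequences first keeps a short entry from matching inside a longer one if the table grows.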

I can explain how I got these results. 1. I used the 2cyr.com decoder, but even it failed in some cases. 2. I have manually translated strings, so I compared them and worked out which bytes correspond to each Cyrillic letter with the help of a UTF-8 character table.
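
The manual comparison described above could also be generated programmatically, assuming the damage really is UTF-8 text decoded as CP1252. The sketch below builds the table for a handful of letters. `mojibake_table` is a hypothetical helper name, and the sample letters are chosen only for illustration. Letters whose UTF-8 bytes have no CP1252 character in Python's codec are collected separately, because they cannot be represented in the broken text at all.

```python
def mojibake_table(letters):
    """Map each letter's CP1252 mojibake back to the letter itself."""
    table = {}
    unmappable = []
    for ch in letters:
        try:
            broken = ch.encode("utf-8").decode("cp1252")
        except UnicodeDecodeError:
            # One of the UTF-8 bytes is undefined in Python's cp1252 codec.
            unmappable.append(ch)
        else:
            table[broken] = ch
    return table, unmappable


# Sample letters: Р, С, З, В and А (the last one hits an undefined byte).
table, unmappable = mojibake_table("РСЗВА")
for broken, letter in table.items():
    print(repr(broken), "->", letter)
print("no CP1252 equivalent:", unmappable)
```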

I can, but it's useless anyway. I just convert the string and manually define byte pairs.
– Rostislav Aleev, Commented Feb 11, 2019 at 10:51
What I'm thinking about is using list(str_to_bytes) and working with the decimal values, because \xc3\x90 looks like a control character.
– Rostislav Aleev, Commented Feb 11, 2019 at 11:05
What you have is a UTF-8 / CP1252 mojibake, and recovering the missing bytes is not going to be straightforward. UTF-8 pairs follow a specific pattern, but not all UTF-8 bytes have CP1252 equivalents. When those are missing, you have to guess what can replace them.
– Martijn Pieters, Commented Feb 11, 2019 at 11:15
You said you were working with ftfy; how are you using it? Do you have the original binary data? You show a str object and a test_str value.
– Martijn Pieters, Commented Feb 11, 2019 at 11:16

Serge Ballesta · Accepted Answer · 2019-02-11 14:14:21Z

2

A common encoding/decoding problem is encoding a string as UTF-8 and later decoding the byte string as if it were cp1252 (often because of a stupid Windows app).

That could be what happened here, because CYRILLIC CAPITAL LETTER VE ('В' or '\u0412') and CYRILLIC CAPITAL LETTER ES ('С' or '\u0421') respectively translate as:

>>> '\u0412'.encode().decode('cp1252')
'Ð’'
>>> '\u0421'.encode().decode('cp1252')
'Ð¡'

This is close to your original string, except that my transformation produces a RIGHT SINGLE QUOTATION MARK (’ or U+2019), while your string contains an APOSTROPHE (' or U+0027).
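
The two characters look almost identical on screen, so it can help to list the code points of a suspect string. A minimal sketch using the standard `unicodedata` module; `describe` is a hypothetical helper name:

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' entry per character of text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# The first string uses a plain apostrophe, the second U+2019.
for sample in ("Ð'", "Ð\u2019"):
    print(describe(sample))
```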

If the string actually contains an APOSTROPHE, it could be caused by an attempt to filter non-Latin characters out of a cp1252-encoded string. The downside is that it is hard to tell whether an apostrophe is a genuine one or a filtered right single quotation mark.

If it does contain a right single quotation mark, it can be transformed back as simply as:

>>> 'Ð’010Ð¡Ð¡199'.encode('cp1252').decode()
'В010СС199'
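
If the string contains a plain apostrophe instead, one possible sketch is to restore the quotation mark first and then apply the same round trip. This assumes that every apostrophe directly after Ð or Ñ was originally U+2019, which, as noted above, cannot be known for certain. `fix_mojibake` is a hypothetical helper name.

```python
import re


def fix_mojibake(text):
    """Undo UTF-8-read-as-cp1252 mojibake, guessing at filtered quotes."""
    # Assumption: an apostrophe right after Ð or Ñ was a U+2019 that got
    # replaced by U+0027 somewhere along the way.
    restored = re.sub("([\u00d0\u00d1])'", "\\1\u2019", text)
    return restored.encode("cp1252").decode("utf-8")


print(fix_mojibake("Ð'010Ð¡Ð¡199"))
```

A genuine apostrophe that happens to follow Ð or Ñ would be converted as well, so the result is worth checking against known-good data.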

edited Feb 11, 2019 at 14:14

answered Feb 11, 2019 at 13:14


1 Comment

Rostislav Aleev Over a year ago

I just have no words. Accepted.
