I'm using ftfy to fix UTF-8 text that was mis-decoded as CP1252 and recover the original Cyrillic, but I've found that some letters can't be fixed.

I have a string Ð'010Ð¡Ð¡199 that I convert to bytes, b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199", and then map byte sequences manually, where:

\xc3\x90' -> \xd0\x92 -> Cyrillic В
\xc3\x90\xc2\xa1 -> \xd0\xa1 -> Cyrillic С

As you can see, Ð' is two characters long, so ord() won't work in this case.

To use slicing, I would need to know where each sequence starts and ends.

translate() doesn't work here either.

Previously I used simple string replacement, but now I'd like to improve my method and rule out mistakes.

Original Ð'010Ð¡Ð¡199 -> conversion -> output В010СС199

EDIT:

    s = "Ð'010Ð¡Ð¡199"  # renamed from str to avoid shadowing the built-in
    str_to_bytes = s.encode("UTF-8")
    print(str_to_bytes)
    # UTF-8 bytes
    # \xc3\x90\xc2\xa0 : \xd0\xa0 -> cyrillic Р
    # \xc3\x90\xc2\xa1 : \xd0\xa1 -> cyrillic С
    # \xc3\x90\xe2\x80\x94' : \xd0\x97 -> cyrillic З
    # \xc3\x90' : \xd0\x92 -> Cyrillic В
    test_str = b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199"
    t1 = test_str.replace(b'\xc3\x90\xc2\xa1', b'\xd0\xa1')
    print(t1)
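    # Mapping taken from the byte pairs listed above (not used yet)
    dict_cyr = {"Ð'": "В",
                "Ð¡": "С"}
    # Fails with ValueError: bytes.translate() expects a 256-byte table that
    # maps single bytes, so it cannot replace multi-byte sequences
    t2 = test_str.translate(test_str)
    print(t2)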

To explain how I got these results: 1. I used the 2cyr.com decoder, but even it failed in some cases. 2. I have manually translated strings, so I compared them and worked out which bytes correspond to each Cyrillic letter with the help of a UTF-8 chart.

  • Can you share your code? Commented Feb 11, 2019 at 10:28
  • I can, but it's useless anyway. I just convert string and manually define bytes pairs. Commented Feb 11, 2019 at 10:51
  • What I'm thinking about is to use list[str_to_bytes] and use decimal values. Because \xc3\x90 looks like a control character. Commented Feb 11, 2019 at 11:05
  • What you have is a UTF-8 - CP1252 Mojibake, and recovering the missing bytes is not going to be straightforward. UTF-8 pairs follow a specific pattern, but not all UTF-8 bytes have CP1252 equivalents. When those are missing, you have to guess what can replace them. Commented Feb 11, 2019 at 11:15
  • You said you were working with ftfy, how are you using that? Do you have the original binary data? You show a str object and a test_str value. Commented Feb 11, 2019 at 11:16

1 Answer


A common problem in encoding/decoding is encoding a string as UTF-8 and later decoding the byte string as if it were cp1252 (often because of a stupid Windows app).

That could be what happened here, because CYRILLIC CAPITAL LETTER VE ('В' or '\u0412') and CYRILLIC CAPITAL LETTER ES ('С' or '\u0421') respectively turn into:

    >>> '\u0412'.encode().decode('cp1252')
    'Ð’'
    >>> '\u0421'.encode().decode('cp1252')
    'Ð¡'

This is close to your original string, except that my transformation produces a RIGHT SINGLE QUOTATION MARK (’ or U+2019) while your string contains an APOSTROPHE (' or U+0027).

If the string actually contains an APOSTROPHE, it could be caused by an attempt to filter non-Latin characters out of a cp1252 encoded string. The downside is that it is hard to guess whether an apostrophe is a true one or a filtered right single quotation mark.
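One way to cope with the filtered case is a heuristic. The sketch below assumes that a filter turned U+2019 into U+0027, and that an apostrophe directly after Ð or Ñ (the cp1252 rendering of the UTF-8 lead bytes 0xD0 and 0xD1, which start most Cyrillic letters) is really a damaged right single quotation mark. The helper name `restore_quote` is hypothetical, and a genuine apostrophe in that position would be misrepaired.

```python
import re

def restore_quote(text):
    """Put back U+2019 where an apostrophe directly follows Ð or Ñ (assumed mojibake)."""
    return re.sub("(?<=[ÐÑ])'", "\u2019", text)

damaged = "Ð'010Ð¡Ð¡199"
# Restore the quotation mark, then undo the wrong cp1252 decoding
repaired = restore_quote(damaged).encode("cp1252").decode("utf-8")
print(repaired)  # В010СС199
```

Apostrophes elsewhere in the text are left alone, so ordinary words such as "it's" pass through unchanged.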

If it does contain a right single quotation mark, then the string can be transformed back as simply as:

    >>> 'Ð’010Ð¡Ð¡199'.encode('cp1252').decode()
    'В010СС199'
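The round trip above raises an error for letters whose UTF-8 bytes include 0x81, 0x8D, 0x8F, 0x90 or 0x9D, because Python's cp1252 codec leaves those five bytes undefined; CYRILLIC CAPITAL LETTER A, for instance, is b'\xd0\x90'. Here is a minimal sketch of a more tolerant version, assuming such bytes survived in the text as the matching C1 control characters (the function name `undo_cp1252` is hypothetical):

```python
def undo_cp1252(text):
    """Rebuild the original bytes from cp1252 mojibake and decode them as UTF-8."""
    raw = bytearray()
    for ch in text:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Assumption: a C1 control stands for a raw byte cp1252 leaves undefined
            if not 0x80 <= ord(ch) <= 0x9F:
                raise
            raw.append(ord(ch))
    return raw.decode("utf-8")

print(undo_cp1252("Ð\u2019010Ð¡Ð¡199"))  # В010СС199
print(undo_cp1252("Ð\x90"))              # А
```

If the undefined bytes were dropped or replaced with something else along the way, they cannot be recovered like this and have to be guessed from context.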

1 Comment

I just have no words. Accepted.
