
Can anyone tell me what is going on here?

        byte[] stamp = new byte[]{0,0,0,0,0,1,177,115};
        string serialize = System.Text.Encoding.UTF8.GetString(stamp);
        byte[] deserialize = System.Text.Encoding.UTF8.GetBytes(serialize);

        //deserialize == byte[]{0,0,0,0,0,1,239,191,189,115}

Why is stamp != deserialize??

  • Are you sure they're not still the same string? Encoding isn't needed to preserve raw bytes... Commented Jul 24, 2013 at 14:54
  • They may well be the same string, but I'm working with an SQL timestamp, so I care about the bytes, not the string... Commented Jul 24, 2013 at 15:02

2 Answers


In your original byte array, you have the value 177, which is the code point of the plus-minus sign (±, U+00B1). However, when the bytes are decoded as UTF-8, that byte isn't recognized, and it is replaced by the REPLACEMENT CHARACTER (U+FFFD), which encodes back to UTF-8 as 239 191 189.

Here's a chart for reference. http://www.utf8-chartable.de/unicode-utf8-table.pl?start=65280&utf8=dec

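The plus-minus sign isn't recognized because a lone 177 is not a valid UTF-8 sequence (in UTF-8, U+00B1 is encoded as the two bytes 194 177), and that's why the byte arrays aren't equal. Apart from that swap the other bytes come back unchanged, but the original 177 is lost: it cannot be recovered from the string.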
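The substitution described above can be reproduced with a short sketch. The class and method names (`Utf8RoundTripDemo`, `RoundTrip`, `IsValidUtf8`) are made up for illustration; only the `System.Text` calls come from the framework. The strict check assumes that a `UTF8Encoding` constructed with `throwOnInvalidBytes: true` is an acceptable way to detect the problem up front.

```csharp
using System;
using System.Text;

public static class Utf8RoundTripDemo
{
    // Decodes the bytes as UTF-8, then encodes the resulting string back to UTF-8.
    public static byte[] RoundTrip(byte[] input)
    {
        string text = Encoding.UTF8.GetString(input);
        return Encoding.UTF8.GetBytes(text);
    }

    // Returns false if the bytes are not well-formed UTF-8.
    // A strict decoder throws instead of silently substituting U+FFFD.
    public static bool IsValidUtf8(byte[] input)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            strict.GetString(input);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] stamp = { 0, 0, 0, 0, 0, 1, 177, 115 };
        string text = Encoding.UTF8.GetString(stamp);

        // The lone byte 177 was decoded to U+FFFD (REPLACEMENT CHARACTER).
        Console.WriteLine(text[6] == '\uFFFD');

        // Re-encoding yields 239,191,189 in place of the original 177.
        Console.WriteLine(string.Join(",", RoundTrip(stamp)));

        // The plus-minus sign itself takes two bytes in UTF-8: 194,177.
        Console.WriteLine(string.Join(",", Encoding.UTF8.GetBytes("\u00B1")));

        // The strict decoder reports the input as invalid.
        Console.WriteLine(IsValidUtf8(stamp));
    }
}
```

Checking with `IsValidUtf8` before decoding is one way to fail loudly rather than end up with silently altered bytes.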


1 Comment

Good question, it got me thinking and reading up about it. It's an interesting problem!

The array of bytes does not encode a valid text string in UTF-8, so when you "serialize" it the parts that can't be recognized are replaced by a "replacement character." If you must convert byte arrays into strings you should find an encoding that does not have restrictions like this, such as ISO-8859-1.

In particular, the byte 177 cannot appear on its own in valid UTF-8: bytes in range 128 - 191 are "continuation bytes" that can appear only after a byte in range 194-244 has been seen. You can read more about UTF-8 here: https://en.wikipedia.org/wiki/UTF-8
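As a minimal sketch of the ISO-8859-1 suggestion, the same `stamp` bytes from the question survive a round trip when every byte value maps to exactly one character. The helper names (`Latin1RoundTripDemo`, `ToText`, `ToBytes`) are hypothetical; `Encoding.GetEncoding("ISO-8859-1")` is the framework call.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Latin1RoundTripDemo
{
    // ISO-8859-1 maps each byte 0-255 to the code point with the same value,
    // so no byte sequence is invalid and nothing is replaced.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    public static string ToText(byte[] bytes) => Latin1.GetString(bytes);

    public static byte[] ToBytes(string text) => Latin1.GetBytes(text);

    public static void Main()
    {
        byte[] stamp = { 0, 0, 0, 0, 0, 1, 177, 115 };

        string serialized = ToText(stamp);
        byte[] restored = ToBytes(serialized);

        // Byte 177 became the character U+00B1 and came back as 177.
        Console.WriteLine(serialized[6] == '\u00B1');
        Console.WriteLine(stamp.SequenceEqual(restored));
    }
}
```

The resulting string still contains control characters such as U+0000, so it may not be safe to pass through every text channel. If that matters, `Convert.ToBase64String` and `Convert.FromBase64String` are another lossless option, at the cost of a longer string.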

2 Comments

It seems to be a table of unicode characters from U+0080 to U+017F, with how they are encoded in UTF-8 and what they mean. For example, U+00F8 is called LATIN SMALL LETTER O WITH STROKE, it's encoded as (195, 184) in UTF-8, and this is what it looks like: ø
