Convert byte array -> string -> byte array corrupts data

Question

Can anyone tell me what is going on here?

        byte[] stamp = new byte[]{0,0,0,0,0,1,177,115};
        string serialize = System.Text.Encoding.UTF8.GetString(stamp);
        byte[] deserialize = System.Text.Encoding.UTF8.GetBytes(serialize);

        //deserialize == byte[]{0,0,0,0,0,1,239,191,189,115}

Why is stamp != deserialize??

Are you sure they're not still the same string? Encoding isn't needed to preserve raw bytes...
– Adriano Repetti, Commented Jul 24, 2013 at 14:54
They may well be the same string, but I'm working with an SQL timestamp, so I care about the bytes, not the string...
– sǝɯɐſ, Commented Jul 24, 2013 at 15:02

Eric Wich · Accepted Answer · 2013-07-24 15:11:38Z

5

Your original byte array contains the byte 177. As a Unicode code point, 177 (U+00B1) is the plus-minus sign. During the conversion to a string, however, that byte isn't recognized as valid UTF-8, so it's replaced by the REPLACEMENT CHARACTER (U+FFFD), which UTF-8 encodes as 239 191 189.

Here's a chart for reference. http://www.utf8-chartable.de/unicode-utf8-table.pl?start=65280&utf8=dec

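The plus-minus sign isn't recognized because UTF-8 encodes U+00B1 as the two bytes 194 177; a lone 177 is not a valid sequence. That swap is the only difference between the two arrays, and every other byte survives the round trip. The original value 177 can't be recovered from the string, though, so a UTF-8 round trip isn't safe for arbitrary bytes.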
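To make the swap concrete, here is a minimal sketch that round-trips a lone 177 and the two-byte pair 194 177 through `Encoding.UTF8`. The class and method names (`ReplacementDemo`, `RoundTripUtf8`) are made up for illustration, and the sketch assumes the default `Encoding.UTF8` instance, which substitutes U+FFFD for invalid input instead of throwing.

```csharp
using System;
using System.Text;

static class ReplacementDemo
{
    // Hypothetical helper: decode bytes as UTF-8, then encode the string again.
    public static byte[] RoundTripUtf8(byte[] input)
    {
        string text = Encoding.UTF8.GetString(input);
        return Encoding.UTF8.GetBytes(text);
    }

    static void Main()
    {
        // A lone 177 is not valid UTF-8, so the decoder substitutes U+FFFD.
        byte[] lone = RoundTripUtf8(new byte[] { 177 });
        Console.WriteLine(string.Join(",", lone));   // 239,191,189

        // 194,177 is the UTF-8 encoding of U+00B1 (plus-minus) and survives.
        byte[] pair = RoundTripUtf8(new byte[] { 194, 177 });
        Console.WriteLine(string.Join(",", pair));   // 194,177
    }
}
```

Running the same helper on the array from the question should reproduce the ten-byte result shown there.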

edited Jul 24, 2013 at 15:11

answered Jul 24, 2013 at 14:58


1 Comment

Eric Wich Over a year ago

Good question, it got me thinking and reading up about it. It's an interesting problem!

Joni · Accepted Answer · 2013-07-24 15:09:03Z

4

The array of bytes does not encode a valid text string in UTF-8, so when you "serialize" it the parts that can't be recognized are replaced by a "replacement character." If you must convert byte arrays into strings you should find an encoding that does not have restrictions like this, such as ISO-8859-1.
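As a rough sketch of that suggestion, the snippet below round-trips the array from the question through ISO-8859-1, which maps each byte value 0-255 to the code point with the same number. The names `Latin1RoundTrip`, `Serialize`, and `Deserialize` are hypothetical and only mirror the variable names in the question.

```csharp
using System;
using System.Linq;
using System.Text;

static class Latin1RoundTrip
{
    // ISO-8859-1 has a character for every byte value, so no byte is rejected.
    static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    public static string Serialize(byte[] data)
    {
        return Latin1.GetString(data);
    }

    public static byte[] Deserialize(string text)
    {
        return Latin1.GetBytes(text);
    }

    static void Main()
    {
        byte[] stamp = { 0, 0, 0, 0, 0, 1, 177, 115 };
        byte[] restored = Deserialize(Serialize(stamp));
        Console.WriteLine(stamp.SequenceEqual(restored)); // True
    }
}
```

The resulting string contains control characters such as U+0000, so it may not be safe to print or store as ordinary text. If the string only needs to carry the bytes, `Convert.ToBase64String` and `Convert.FromBase64String` are a different technique that may be a better fit.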

In particular, the byte 177 cannot appear on its own in valid UTF-8: bytes in the range 128-191 are "continuation bytes" that can appear only after a lead byte in the range 194-244. You can read more about UTF-8 here: https://en.wikipedia.org/wiki/UTF-8
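The ranges above can be turned into a small classifier. This is only a sketch: `Utf8ByteKind.Classify` is a made-up helper that labels a single byte by its range, and it is not a full UTF-8 validator (it does not check sequence lengths or overlong forms).

```csharp
using System;

static class Utf8ByteKind
{
    // Hypothetical helper: label one byte according to the UTF-8 byte ranges.
    public static string Classify(byte b)
    {
        if (b <= 127) return "ascii";                      // complete character on its own
        if (b <= 191) return "continuation";               // valid only after a lead byte
        if (b >= 194 && b <= 244) return "lead";           // starts a multi-byte sequence
        return "invalid";                                  // 192, 193, 245-255 never occur
    }

    static void Main()
    {
        byte[] stamp = { 0, 0, 0, 0, 0, 1, 177, 115 };
        foreach (byte b in stamp)
        {
            Console.WriteLine(b + ": " + Classify(b));
        }
    }
}
```

For the array in the question, 177 is the only byte labeled as a continuation byte, and no lead byte precedes it, which is why the decoder rejects it.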

edited Jul 24, 2013 at 15:09

answered Jul 24, 2013 at 14:56


2 Comments

Cédric Bignon Over a year ago

What does this table mean utf8-chartable.de/unicode-utf8-table.pl?start=128&utf8=dec ?

Joni Over a year ago

It seems to be a table of unicode characters from U+0080 to U+017F, with how they are encoded in UTF-8 and what they mean. For example, U+00F8 is called LATIN SMALL LETTER O WITH STROKE, it's encoded as (195, 184) in UTF-8, and this is what it looks like: ø
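A row of such a table can be reproduced in code. The sketch below assumes a hypothetical helper, `Utf8TableRow.EncodeCodePoint`, that turns a code point into a string with `char.ConvertFromUtf32` and then asks `Encoding.UTF8` for its bytes.

```csharp
using System;
using System.Text;

static class Utf8TableRow
{
    // Hypothetical helper: return the UTF-8 bytes for a single Unicode code point.
    public static byte[] EncodeCodePoint(int codePoint)
    {
        string text = char.ConvertFromUtf32(codePoint);
        return Encoding.UTF8.GetBytes(text);
    }

    static void Main()
    {
        // U+00F8, LATIN SMALL LETTER O WITH STROKE
        Console.WriteLine(string.Join(", ", EncodeCodePoint(0x00F8))); // 195, 184
    }
}
```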
