String, byte[] and compression

Question

We can convert a String to and from byte[] easily:

        String s = "my string";
        byte[] b = s.getBytes();
        System.out.println(new String(b)); // my string

When compression is involved, however, there seem to be some issues. Suppose you have two methods, compress and uncompress (the code below works fine):

public static byte[] compress(String data) 
             throws UnsupportedEncodingException, IOException {
    byte[] input = data.getBytes("UTF-8");
    Deflater df = new Deflater();
    df.setLevel(Deflater.BEST_COMPRESSION);
    df.setInput(input);

    ByteArrayOutputStream baos = new ByteArrayOutputStream(input.length);
    df.finish();
    byte[] buff = new byte[1024];
    while (!df.finished()) {
        int count = df.deflate(buff);
        baos.write(buff, 0, count);
    }
    baos.close();
    byte[] output = baos.toByteArray();

    return output;
}

public static String uncompress(byte[] input) 
            throws UnsupportedEncodingException, IOException,
        DataFormatException {
    Inflater ifl = new Inflater();
    ifl.setInput(input);

    ByteArrayOutputStream baos = new ByteArrayOutputStream(input.length);
    byte[] buff = new byte[1024];
    while (!ifl.finished()) {
        int count = ifl.inflate(buff);
        baos.write(buff, 0, count);
    }
    baos.close();
    byte[] output = baos.toByteArray();

    // Decode with the same charset that compress() used to encode
    return new String(output, "UTF-8");
}

My tests are as follows (they work fine):

String text = "some text";
byte[] bytes = Compressor.compress(text);
assertEquals(Compressor.uncompress(bytes), text); // works

For no reason other than "why not", I'd like to modify the first method to return a String instead of the byte[].

So I return new String(output) from the compress method and modify my tests to:

String text = "some text";
String compressedText = Compressor.compress(text);
assertEquals(Compressor.uncompress(compressedText.getBytes()), text); // fails

This test fails with java.util.zip.DataFormatException: incorrect header check

Why is that? What needs to be done to make it work?

I would use a DeflaterOutputStream and an InflaterInputStream.
– Peter Lawrey, Commented Aug 1, 2012 at 15:53
To convert between strings and bytes, you should really, really specify a specific encoding.
– Louis Wasserman, Commented Aug 1, 2012 at 17:28
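
A minimal sketch of the stream-based approach suggested in Peter Lawrey's comment, assuming Java 7 or later for try-with-resources and StandardCharsets. The class name StreamCompressor is hypothetical and not part of the original question. Note that compress still returns a byte[]: the streams simplify the buffering loop, they do not make the compressed data valid text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical class name, used only for this illustration.
public class StreamCompressor {

    // Compresses the UTF-8 bytes of the text. The result is binary data, not text.
    public static byte[] compress(String data) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // Closing the DeflaterOutputStream finishes the compressed stream.
        try (DeflaterOutputStream out = new DeflaterOutputStream(baos)) {
            out.write(data.getBytes(StandardCharsets.UTF_8));
        }
        return baos.toByteArray();
    }

    // Decompresses the bytes and decodes them with the charset used in compress().
    public static String uncompress(byte[] input) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(input))) {
            byte[] buff = new byte[1024];
            int count;
            while ((count = in.read(buff)) != -1) {
                baos.write(buff, 0, count);
            }
        }
        return new String(baos.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String text = "some text";
        System.out.println(text.equals(uncompress(compress(text)))); // prints true
    }
}
```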

Tomasz Nurkiewicz · Accepted Answer · 2012-08-01 16:01:21Z


The String(byte[]) constructor is the problem. You cannot simply take arbitrary bytes, convert them to a String and then convert that back to a byte array. The String class decodes the bytes using a charset (the platform default if none is given). Any byte sequence that is not valid in that charset is dropped or replaced with something else. The conversion from bytes to String and back to bytes is lossless only if the bytes really do represent some String in that encoding.

Here is the simplest example:

new String(new byte[]{-128}, "UTF-8").getBytes("UTF-8")

The above returns -17, -65, -67, whereas an input of 127 comes back unchanged.

edited Aug 1, 2012 at 16:01

answered Aug 1, 2012 at 15:55

Tomasz Nurkiewicz


5 Comments

James Raitsev Over a year ago

Can you provide an example of lossless conversion please? Also, in my example, bytes really do represent some String.

Tomasz Nurkiewicz Over a year ago

@Jam: what do you mean? I added an example where this conversion breaks. Lossless conversion would be base64 which can encode any byte array into a portable ASCII string.

Tomasz Nurkiewicz Over a year ago

@Jam in your example code, the compress() method returns the compressed data as a String. Data after compression is clearly not valid text.

Tomasz Nurkiewicz Over a year ago

@Alex: I've no idea where these bytes are coming from. But surely they represent some valid UTF-8 character.

Alex Over a year ago

Integer.toHexString(new String(new byte[]{-128}, "UTF-8").codePointAt(0)) = "fffd". Unicode replacement character = U+FFFD. So that's where the bytes are coming from :)

Arne · Accepted Answer · 2012-08-01 16:16:13Z

It fails because you convert from bytes to a String using the current default encoding of your platform. Most bytes will be converted to their equivalent character codes, but some may be replaced by other codes, depending on that encoding. To see what happens to your bytes, just run:

byte[] b = new byte[256];
for(int i = 0; i < b.length; ++i) {
    b[i] = (byte)i;
}
String s = new String(b);

for(int i = 0; i< s.length(); ++i) {
    System.out.println(i + ": " + s.substring(i, i+1) + " " + (int)s.charAt(i));
}

As you can see, if you convert that back to bytes, several codes all collapse to the same value. Also, this sample does not handle encodings where a character is encoded with more than one byte, as in UTF-8.
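
To make the collapse measurable, here is a small sketch that round-trips each of the 256 single-byte values and counts how many come back unchanged. It fixes the charset to UTF-8 instead of the platform default so the result is reproducible; that choice, and the class name RoundTripCheck, are assumptions made for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class name, used only for this illustration.
public class RoundTripCheck {

    // True if the single byte survives bytes -> String -> bytes in UTF-8.
    public static boolean survivesUtf8(byte value) {
        byte[] original = {value};
        byte[] back = new String(original, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(original, back);
    }

    // Counts how many of the 256 possible byte values round-trip unchanged.
    public static int countSurvivors() {
        int survivors = 0;
        for (int i = 0; i < 256; i++) {
            if (survivesUtf8((byte) i)) {
                survivors++;
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        // Prints "128 of 256 byte values survive"
        System.out.println(countSurvivors() + " of 256 byte values survive");
    }
}
```

Only the values 0 to 127 survive, because a lone byte with the high bit set is never a complete UTF-8 sequence. Compressed output uses the full byte range, so it is almost certain to contain bytes that get altered.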

In general, one should avoid calling String.getBytes() and new String(byte[]) without supplying an appropriate encoding parameter. Most encodings do not map each byte to the character with the same code. ISO-8859-1 is the notable exception and does round-trip arbitrary bytes, but the resulting String is still not meaningful text.

If you really want to handle your compressed data as a String, then use a base64 representation or a hex dump. But beware: the String representation can need twice as much memory as the raw bytes, and on top of that base64 inflates the data by a factor of 4/3 and hex by a factor of 2. This might eat up the benefit of compression.
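
A minimal sketch of the base64 variant, assuming Java 8 or later for java.util.Base64. The class name Base64Compressor is hypothetical. The Deflater and Inflater loops follow the question's code, with an added guard against truncated input.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical class name, used only for this illustration.
public class Base64Compressor {

    // Compresses the text and returns the compressed bytes as a base64 String.
    public static String compress(String data) {
        Deflater df = new Deflater(Deflater.BEST_COMPRESSION);
        df.setInput(data.getBytes(StandardCharsets.UTF_8));
        df.finish();

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buff = new byte[1024];
        while (!df.finished()) {
            int count = df.deflate(buff);
            baos.write(buff, 0, count);
        }
        df.end(); // release native resources
        return Base64.getEncoder().encodeToString(baos.toByteArray());
    }

    // Decodes the base64 String back to bytes, then decompresses them.
    public static String uncompress(String encoded) throws DataFormatException {
        Inflater ifl = new Inflater();
        ifl.setInput(Base64.getDecoder().decode(encoded));

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buff = new byte[1024];
        while (!ifl.finished()) {
            int count = ifl.inflate(buff);
            // Avoid looping forever if the compressed data is truncated.
            if (count == 0 && !ifl.finished() && ifl.needsInput()) {
                throw new DataFormatException("truncated input");
            }
            baos.write(buff, 0, count);
        }
        ifl.end();
        return new String(baos.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws DataFormatException {
        String text = "some text";
        String compressedText = compress(text);
        System.out.println(text.equals(uncompress(compressedText))); // prints true
    }
}
```

The String returned by compress contains only ASCII characters from the base64 alphabet, so it can be stored or transmitted as text and decoded later without loss.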
