We can easily convert a String to a byte[] and back:

        String s = "my string";
        byte[] b = s.getBytes();
        System.out.println(new String(b)); // my string

When compression is involved, however, there seem to be some issues. Suppose you have two methods, compress and uncompress (the code below works fine):

public static byte[] compress(String data) 
             throws UnsupportedEncodingException, IOException {
    byte[] input = data.getBytes("UTF-8");
    Deflater df = new Deflater();
    df.setLevel(Deflater.BEST_COMPRESSION);
    df.setInput(input);

    ByteArrayOutputStream baos = new ByteArrayOutputStream(input.length);
    df.finish();
    byte[] buff = new byte[1024];
    while (!df.finished()) {
        int count = df.deflate(buff);
        baos.write(buff, 0, count);
    }
    baos.close();
    byte[] output = baos.toByteArray();

    return output;
}

public static String uncompress(byte[] input) 
            throws UnsupportedEncodingException, IOException,
        DataFormatException {
    Inflater ifl = new Inflater();
    ifl.setInput(input);

    ByteArrayOutputStream baos = new ByteArrayOutputStream(input.length);
    byte[] buff = new byte[1024];
    while (!ifl.finished()) {
        int count = ifl.inflate(buff);
        baos.write(buff, 0, count);
    }
    baos.close();
    byte[] output = baos.toByteArray();

    return new String(output, "UTF-8"); // decode with the same charset compress() used to encode
}

My test looks like this (and passes):

String text = "some text";
byte[] bytes = Compressor.compress(text);
assertEquals(Compressor.uncompress(bytes), text); // works

For no reason other than "why not", I'd like to modify the first method to return a String instead of a byte[].

So I return new String(output) from the compress method and modify my test to:

String text = "some text";
String compressedText = Compressor.compress(text);
assertEquals(Compressor.uncompress(compressedText.getBytes()), text); // fails

This test fails with java.util.zip.DataFormatException: incorrect header check

Why is that? What needs to be done to make it work?

  • I would use a DeflaterOutputStream and an InflaterInputStream Commented Aug 1, 2012 at 15:53
  • To convert between strings and bytes, you should really, really specify a specific encoding. Commented Aug 1, 2012 at 17:28

2 Answers

The String(byte[]) constructor is the problem. You cannot take arbitrary bytes, convert them to a String, and then convert that String back to the same byte array. The constructor decodes the bytes according to a charset (the platform default if you do not name one). Any byte sequence that is not valid in that charset is replaced with a substitute character rather than preserved. The conversion from bytes to String and back to bytes is lossless only if the bytes really are the encoding of some text in that charset.

Here is the simplest example:

new String(new byte[]{-128}, "UTF-8").getBytes("UTF-8")

The above returns the three bytes -17, -65, -67 instead of the original -128, whereas an input of 127 comes back unchanged.
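The same effect can be reproduced on deflated data. The following is a minimal sketch, assuming UTF-8 is the charset used for the String round trip; the class and method names (RoundTripDemo, deflate, viaString) are made up for illustration. It compresses a short text and then pushes the compressed bytes through a String, as the failing test does.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

public class RoundTripDemo {
    // Deflate the UTF-8 bytes of the text, following the question's compress().
    static byte[] deflate(String text) {
        Deflater df = new Deflater(Deflater.BEST_COMPRESSION);
        df.setInput(text.getBytes(StandardCharsets.UTF_8));
        df.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!df.finished()) {
            out.write(buf, 0, df.deflate(buf));
        }
        df.end(); // release the native zlib resources
        return out.toByteArray();
    }

    // Push bytes through a String and back (assumption: UTF-8 on both sides).
    static byte[] viaString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] compressed = deflate("some text");
        byte[] mangled = viaString(compressed);
        System.out.println("compressed: " + Arrays.toString(compressed));
        System.out.println("via String: " + Arrays.toString(mangled));
        System.out.println("identical:  " + Arrays.equals(compressed, mangled));
    }
}
```

Comparing the two printed arrays shows which bytes were replaced. If the damage lands in the first two bytes, the Inflater rejects the stream at its header check.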

5 Comments

Can you provide an example of lossless conversion please? Also, in my example, bytes really do represent some String.
@Jam: what do you mean? I added an example where this conversion breaks. Lossless conversion would be base64 which can encode any byte array into a portable ASCII string.
@Jam in your example code, the compress() method returns compressed data as a String. Data after compression is clearly not valid text.
@Alex: I have no idea where these bytes are coming from. But surely they represent some valid UTF-8 character.
Integer.toHexString(new String(new byte[]{-128}, "UTF-8").codePointAt(0)) = "fffd". Unicode replacement character = U+FFFD. So that's where the bytes are coming from :)

It fails because you convert the bytes to a String using your platform's default encoding. Most bytes will be converted to their equivalent character codes, but some may be replaced by other codes, depending on that encoding. To see what happens to your bytes, just run:

byte[] b = new byte[256];
for(int i = 0; i < b.length; ++i) {
    b[i] = (byte)i;
}
String s = new String(b);

for(int i = 0; i < s.length(); ++i) {
    System.out.println(i + ": " + s.substring(i, i + 1) + " " + (int)s.charAt(i));
}

As you can see, several byte values can end up as the same character, so converting back to bytes cannot recover the originals. Note that this sample does not handle encodings such as UTF-8, where a single character can be encoded with more than one byte.

In general, avoid calling String.getBytes() and new String(byte[]) without supplying an explicit charset. Most charsets do not map each byte one-to-one to the character with the same code. ISO-8859-1 is the notable exception in Java, and even then the resulting String is binary data rather than meaningful text.
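As a small illustration of naming the charset explicitly, here is a sketch using the constants in java.nio.charset.StandardCharsets (available since Java 7). The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Encode with an explicit charset instead of the platform default.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same charset that was used for encoding.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent, escaped so the
        // source-file encoding does not matter.
        String text = "caf\u00e9";
        byte[] utf8 = encode(text);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length);   // 5: the accented letter takes two bytes in UTF-8
        System.out.println(latin1.length); // 4: one byte per character in ISO-8859-1
        System.out.println(decode(utf8).equals(text)); // true
    }
}
```

The same text yields different byte arrays under different charsets, which is why both sides of a conversion need to agree on one.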

If you really want to handle your compressed data as a String, use a Base64 representation or a hex dump. But beware of the size: a Java char is two bytes, so a String can need twice as much memory as a byte array of the same length, and on top of that Base64 grows the data by a factor of 4/3 and hex by a factor of 2. This might eat up the benefit of compression.
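A rough sketch of the Base64 route, assuming Java 8 or later for java.util.Base64 (on older versions a third-party codec would be needed). The class and method names are hypothetical, and wrapping DataFormatException in an unchecked exception is just one possible design choice.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class Base64Compressor {
    // Deflate the UTF-8 bytes of the text, then encode them as Base64 text.
    static String compressToBase64(String text) {
        Deflater df = new Deflater(Deflater.BEST_COMPRESSION);
        df.setInput(text.getBytes(StandardCharsets.UTF_8));
        df.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!df.finished()) {
            out.write(buf, 0, df.deflate(buf));
        }
        df.end();
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    // Reverse the steps: Base64-decode, inflate, then decode as UTF-8.
    static String uncompressFromBase64(String encoded) {
        Inflater inf = new Inflater();
        inf.setInput(Base64.getDecoder().decode(encoded));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inf.finished()) {
                int n = inf.inflate(buf);
                // Guard against looping forever on truncated input.
                if (n == 0 && !inf.finished()
                        && (inf.needsInput() || inf.needsDictionary())) {
                    throw new IllegalArgumentException("incomplete compressed data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inf.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "some text";
        String packed = compressToBase64(text);
        System.out.println(packed);
        System.out.println(uncompressFromBase64(packed).equals(text)); // true
    }
}
```

Because the Base64 alphabet is plain ASCII, the String returned by compressToBase64 survives storage and transport as text, at the cost of the size overhead described above.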
