2

I have some extremely large array of integers which i would like to compress.
However the way to do it in java is to use something like this -

int[] myIntArray;
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(1024);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(new DeflaterOutputStream(byteArrayOutputStream));
objectOutputStream.writeObject(myIntArray);

Note that the int array first needs to be converted to bytes by java. Now I know that is fast but it still needs to create a whole new byte array and scan through the entire original int array converting it to bytes and copying the value to the new byte array.

Is there any way to skip the byte conversion and make it compress the integers right away?

4
  • It doesn't make sense. The compression algorithm needs bytes, not ints. Commented Jul 3, 2009 at 22:52
  • Where is your int array being converted to bytes? ObjectOutputStream takes your object and directly serializes it. DeflaterOutputStream compresses the serialized result, then the compressed result is stored in a ByteArrayOutputStream. I think that's exactly what you want to happen... Commented Jul 3, 2009 at 23:20
  • in my case the object i want to compress is an int[] array. the serialization process converts it to bytes which is the step i want to skip. Commented Jul 4, 2009 at 1:11
  • why? You say you want to skip a step you don't need to know is there. Each stage there is a copying of data, and the same when you uncompress it. If performance is your issue, how much time are you looking to save? Do you realise that creating the objects is far more expensive for even modest amount of data. Commented Jul 4, 2009 at 21:35

6 Answers 6

4

Skip the ObjectOutputStream and just store the ints directly as four bytes each. DataOutputStream.writeInt for instance is an easy way to do it.

Sign up to request clarification or add additional context in comments.

Comments

2

Hmm. A general-purpose compression algorithm won't necessarily do a good job compressing an array of binary values, unless there's a lot of redundancy. You might do better to develop something of your own, based on what you know about the data.

What is it that you're actually trying to compress?

Comments

2

You could use the representation used by Protocol Buffers. Each integer is represented by 1-5 bytes, depending on its magnitude.

Additionally, the new "packed" representation means you get basically a bit of "header" to say how big it is (and which field it's in) and then just the data. That's probably what ObjectOutputStream does as well, but it's a recent innovation in PB :)

Note that this will compress based on magnitude, not based on how often the integer has seen. That will dramatically affect whether it's useful for you or not.

Comments

0

A byte array is not going to save you much memory unless you make it a byte array holding unsigned ints, which is very dangerous in Java. It will replace memory overhead with extra processing time for the step checking of the code. This may be aright for data storage, but there already is data storage solution out there.
Unless you are doing this for serialization purposes, I think that you are wasting your time.

Comments

0

In your example, you are writing the compressed stream to the ByteArrayOutputStream. Your compressed array needs to exist somewhere, and if the destination is memory, then ByteArrayOutputStream is your likely choice. You could also write the stream to a socket or file. In that case, you wouldn't duplicate the stream in memory. If your array is 800MB and your running in a 1GB, you could easily write the array to a compressed file with the example you included. The change would be replacing the ByteArrayOutputStream with a file stream.

The ObjectOutputStream format is actually fairly efficient. It will not duplicate your array in memory, and has special code for efficiently writing arrays.

Are wanting to work with the compressed array in memory? Would you data lend itself well to a sparse array? Sparse array's are good when you have large gaps in your data.

Comments

0

If the array of ints is guaranteed to have no duplicates, you can use a java.util.BitSet, instead.

As its base implementation is an array of bits, with each bit indicating if a certain integer is present or not in the BitSet, its memory usage is quite low, therefore needing less space to be serialized.

2 Comments

but bitset would help only if the values are 0 or 1, while i have values ranging uptil integer.maxvalue, with no duplicates of course
No, java.util.BitSet can store as many unique integers you need.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.