14

I have a large sequence of random integers sorted from lowest to highest. The numbers range from 1 bit up to nearly 45 bits. At the beginning of the list the numbers are very close to each other: 4, 20, 23, 40, 66. As the numbers get larger, the distances between them grow as well (in fact, the distances are random). There are no duplicate numbers.

I'm using bit packing to save some space. Nonetheless, this file can get really big.

I would like to know what kind of compression algorithm can be used in this situation, or any other technique to save as much space as possible.

Thank you.

8
  • I'm not following. Could you draw a (perhaps ASCII) picture or something of the sequence? Commented Sep 30, 2012 at 19:42
  • @Keyser 1, 23, 33, 34, 39, 80, 122, 125, 169, 168, 203, and so on Commented Sep 30, 2012 at 19:46
  • 5
    You can express the list as the distance between adjacent numbers -- 22, 10, 1, 5, 41, 42, 3... (if my arithmetic is right). Since these numbers are shorter you can pack them together more closely. (Note that you should add a virtual zero to the start of the sequence to generate the initial 1, or otherwise record the start number.) You can probably even play the trick of doing the difference a second time to come up with a signed number -- -12, -9, 4, 36, 1, -39... which may or may not compact better (depending on the number distribution). Commented Sep 30, 2012 at 19:53
  • And note that you can use one of several variable-length notations. E.g., a number beginning 11b is a 3-byte (22-bit) number, a number beginning 10b is a 2-byte (14-bit) number, and a number beginning 0b is a 1-byte (7-bit) number. You can combine this scheme with your bit packing to come up with whatever size numbers best fit your needs (see the sketch after these comments). Commented Sep 30, 2012 at 19:58
  • (It would help to know your usage scenario -- some schemes make it much harder to access, say, the 27th entry than others, if you need to do random access without unpacking the entire list.) Commented Sep 30, 2012 at 20:00
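
For illustration, a rough sketch of the variable-length byte scheme from the comment above (the function name and container choice are assumptions, not part of the original suggestion):

#include <cstdint>
#include <stdexcept>
#include <vector>

// Sketch only: encode each delta in 1-3 bytes. A leading 0 bit marks a 1-byte
// (7-bit) delta, 10 marks a 2-byte (14-bit) delta, 11 a 3-byte (22-bit) delta.
std::vector<uint8_t> EncodeDeltas(const std::vector<uint64_t>& sorted) {
  std::vector<uint8_t> out;
  uint64_t prev = 0;  // virtual zero before the first value
  for (uint64_t val : sorted) {
    uint64_t d = val - prev;
    prev = val;
    if (d < (1u << 7)) {                 // 0xxxxxxx
      out.push_back(static_cast<uint8_t>(d));
    } else if (d < (1u << 14)) {         // 10xxxxxx xxxxxxxx
      out.push_back(static_cast<uint8_t>(0x80 | (d >> 8)));
      out.push_back(static_cast<uint8_t>(d & 0xFF));
    } else if (d < (1u << 22)) {         // 11xxxxxx xxxxxxxx xxxxxxxx
      out.push_back(static_cast<uint8_t>(0xC0 | (d >> 16)));
      out.push_back(static_cast<uint8_t>((d >> 8) & 0xFF));
      out.push_back(static_cast<uint8_t>(d & 0xFF));
    } else {
      throw std::runtime_error("delta needs more than 22 bits");
    }
  }
  return out;
}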

6 Answers

10

You can compress optimally if you know the true distribution of the data. If you can provide a probability distribution for each integer you can use arithmetic coding or other entropy coding techniques to compress to theoretical minimal size.

The trick is in predicting accurately.

First, you should probably compress the distances between the numbers because that allows you to make statistical statements. If you were to compress the numbers directly you'd have a hard time modelling them because they occur only once.

Next, you could try to build a very simple model to predict the next distance. Keep a histogram of all previously seen distances and calculate the probabilities from the frequencies.

You probably need to account for missing values (you clearly can't assign them 0 probability because that is not expressible) but you can use heuristics for that, like encoding the next distance bit-by-bit and predicting each bit individually. You will pay almost nothing for the high-order bits because they are almost always 0 and entropy encoding optimizes them away.

All of this is much simpler if you know the distribution. Example: if you are compressing a list of all prime numbers, you know the theoretical distribution of distances because there are formulae for that. So you already have a perfect model.
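
For illustration, a sketch of the modelling step only (no arithmetic coder is shown; the function name is made up, and it uses a static two-pass histogram rather than the adaptive one described above). It builds the gap histogram and reports the number of bits an entropy coder driven by that model would approach:

#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

// Build a histogram of the gaps between consecutive values and report the
// total information content in bits.
double EstimatedGapBits(const std::vector<uint64_t>& sorted) {
  std::map<uint64_t, uint64_t> freq;  // gap value -> number of occurrences
  uint64_t prev = 0;
  for (uint64_t val : sorted) {
    ++freq[val - prev];
    prev = val;
  }
  const double total = static_cast<double>(sorted.size());
  double bits = 0.0;
  for (const auto& entry : freq) {
    const double p = entry.second / total;   // empirical probability of this gap
    bits += entry.second * -std::log2(p);    // -log2(p) bits per occurrence
  }
  return bits;
}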


6 Comments

There is a probability. I'm counting from 1 bit to near 43 bits, but each number is stored in a different file, and this is random. That's why each file has random but sorted numbers. There are 256 files, so the probability is 1/256 that the next number goes to one file rather than another.
So this means that the distance follows an exactly known distribution? If yes, you have the answer in the last paragraph.
You could also have a single file, and for each number store an 8-bit integer to denote which file it belongs to. That gives you optimal compression (because the files are equally distributed) and 8 bits per number.
They are not equally distributed, sorry. I'm using hash functions to determine which file each number should go to, so the distribution is not exactly 1/256.
It is almost a perfect distribution. If the distribution of values to files is controlled by a hash function this scheme is the best you can do. For each number you are essentially generating a random 8-bit value.
10

There's a very simple and fairly effective compression technique which can be used for sorted integers in a known range. Like most compression schemes, it is optimized for serial access, although you can build an index to speed up random access if needed.

It's a type of delta encoding (i.e. each number is represented by the distance from the previous one), consisting of a vector of codes which are either

  • a single 1-bit, representing a delta of 2^k which is added to the delta in the following code, or

  • a 0-bit followed by a k-bit delta, indicating that the next number is the specified delta from the previous one.

For example, if k is 4, the sequence:

00011 1 1 00000 1 00001

codes three numbers. The first code (a 0 followed by the four-bit delta 0011, i.e. 3) is the first delta, taken from an initial value of 0, so the first number is 3. The next two solitary 1's accumulate to a delta of 2·2^4, or 32, which is added to the following delta of 0000, for a total of 32. So the second number is 3+32=35. Finally, the last delta is a single 2^4 plus 1, for a total of 17, and the third number is 35+17=52.

The 1-bit indicates that the next delta should be incremented by 2^k (or, more generally, each delta is incremented by 2^k times the number of immediately preceding 1-bits).

Another, possibly better, way of thinking of this is that each delta is coded as a variable-length bit sequence: 1^i 0 (1|0)^k, representing a delta of i·2^k + [the k-bit suffix]. But the first presentation aligns better with the optimality proof.

Since each "1" code represents an increment of 2^k, there cannot be more than m/2^k of them, where m is the largest number in the set to be compressed. The remaining codes all correspond to numbers, and have a total length of n·(k + 1) where n is the size of the set. The optimal value of k is roughly log2(m/n), which in your case would be 7 or 8.

I did a quick proof of concept of the algorithm, without worrying about optimizations. It's still plenty fast; sorting the random sample takes a lot longer than compressing/decompressing it. I tried it with a few different seeds and vector sizes from 16,400,000 to 31,000,000 with a value range of [0, 4,000,000,000). The bits used per data value ranged from 8.59 (n=31000000) to 9.45 (n=16400000). All of the tests were done with 7-bit suffixes; log2(m/n) varies from 7.01 (n=31000000) to 7.93 (n=16400000). I tried with 6-bit and 8-bit suffixes; except in the case of n=31000000 where the 6-bit suffixes were slightly smaller, the 7-bit suffix was always the best. So I guess that the optimal k is not exactly floor(log2(m/n)) but it's not far off.

Compression code:

// BitOut is a simple bit-stream writer over std::ostream (not shown here).
void Compress(std::ostream& os,
              const std::vector<unsigned long>& v,
              unsigned long k = 0) {
  BitOut out(os);
  out.put(v.size(), 64);                  // element count
  if (v.size()) {
    unsigned long twok;
    if (k == 0) {
      // Pick k automatically, roughly log2(max value / element count).
      unsigned long ratio = v.back() / v.size();
      for (twok = 1; twok <= ratio / 2; ++k, twok *= 2) { }
    } else {
      twok = 1UL << k;
    }
    out.put(k, 32);                       // suffix width

    unsigned long prev = 0;
    for (unsigned long val : v) {
      // Emit one 1-bit per 2^k step towards val, then a 0-bit and the k-bit remainder.
      while (val - prev >= twok) { out.put(1); prev += twok; }
      out.put(0);
      out.put(val - prev, k);
      prev = val;
    }
  }
  out.flush(1);
}

Decompression:

// BitIn is the matching bit-stream reader over std::istream (not shown here).
std::vector<unsigned long> Decompress(std::istream& is) {
  BitIn in(is);
  unsigned long size = in.get(64);        // element count
  std::vector<unsigned long> v;           // declared here so it is in scope at the return
  if (size) {
    unsigned long k = in.get(32);         // suffix width
    unsigned long twok = 1UL << k;

    v.reserve(size);
    unsigned long prev = 0;
    for (; size; --size) {
      while (in.get()) prev += twok;      // each leading 1-bit adds 2^k
      prev += in.get(k);                  // k-bit remainder after the 0-bit
      v.push_back(prev);
    }
  }
  return v;
}

It can be a bit awkward to use variable-length encodings; an alternative is to store the first bit of each code (1 or 0) in a bit vector, and the k-bit suffixes in a separate vector. This would be particularly convenient if k is 8.
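
A minimal sketch of that split layout for k = 8 (the struct and function names are assumptions):

#include <cstddef>
#include <cstdint>
#include <vector>

struct SplitCode {
  std::vector<uint8_t> flags;     // the 1/0 code bits, packed 8 per byte
  std::vector<uint8_t> suffixes;  // one 8-bit suffix per 0-code
};

SplitCode CompressSplit(const std::vector<unsigned long>& v) {
  SplitCode out;
  std::size_t bitpos = 0;
  auto put_bit = [&](int b) {
    if (bitpos % 8 == 0) out.flags.push_back(0);
    if (b) out.flags.back() |= static_cast<uint8_t>(1u << (bitpos % 8));
    ++bitpos;
  };
  unsigned long prev = 0;
  for (unsigned long val : v) {
    while (val - prev >= 256) { put_bit(1); prev += 256; }  // each 1-bit adds 256
    put_bit(0);                                             // 0-bit ends the run
    out.suffixes.push_back(static_cast<uint8_t>(val - prev));
    prev = val;
  }
  return out;
}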

A variant, which results in slightly longer files but is a bit easier to build indexes for, is to only use the 1-bits as deltas. Then the deltas are always a·2^k for some a, possibly 0, where a is the number of consecutive 1-bits preceding the suffix code. The index then consists of the locations of every Nth 1-bit in the bit vector, and the corresponding index into the suffix vector (i.e. the index of the suffix corresponding to the next 0 in the bit vector).


13 Comments

This works, but it is not the most efficient approach because the code makes certain assumptions about the data distribution. If that assumption is wrong it will waste bits, potentially producing output larger than the uncompressed data. It is a valid answer though.
@usr, It cannot be longer than uncompressed if you choose a k corresponding to the size of the original data vector. I added a bit of elaboration about this. It does not make any assumption about distribution of data; only about the quantity of data.
@usr, ... the quantity and range of data (didn't edit that fast enough)
@rici is there any better documentation/manual/explanation for your solution? Sorry for my ignorance, but I still can't fully understand it
@JonathanAllan: Yes, I think you're right. If k is four then the binary section must be five bits (of which the first is 0). I think I fixed the example (and a couple of lingering formatting issues). Thanks.
6

One option that worked well for me in the past was to store a list of 64-bit integers as 8 different lists of 8-bit values. You store the high 8 bits of the numbers, then the next 8 bits, etc. For example, say you have the following 32-bit numbers:

0x12345678
0x12349785
0x13111111
0x13444444

The data stored would be (in hex):

12,12,13,13
34,34,11,44
56,97,11,44
78,85,11,44

I then ran that through the deflate compressor.

I don't recall what compression ratios I was able to achieve with this, but it was significantly better than compressing the numbers themselves.
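
For illustration, a rough sketch of that byte-plane transform for 32-bit values (the helper name is an assumption; the deflate step itself, e.g. via zlib, is not shown):

#include <cstdint>
#include <vector>

// Plane 0 holds every number's most significant byte, plane 3 the least
// significant one. Each plane is then a highly repetitive stream that a
// general-purpose compressor such as deflate handles well.
std::vector<std::vector<uint8_t>> ToBytePlanes(const std::vector<uint32_t>& v) {
  std::vector<std::vector<uint8_t>> planes(4);
  for (uint32_t x : v) {
    planes[0].push_back(static_cast<uint8_t>(x >> 24));
    planes[1].push_back(static_cast<uint8_t>(x >> 16));
    planes[2].push_back(static_cast<uint8_t>(x >> 8));
    planes[3].push_back(static_cast<uint8_t>(x));
  }
  return planes;  // concatenate the planes and feed the result to deflate
}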

Comments

5

I want to add another answer with the simplest possible solution:

  1. Convert the numbers to deltas as discussed previously
  2. Run it through the 7-zip LZMA2 algorithm. It is even multi-core ready

I think this will give almost perfect results in your case because the distances have a simple distribution. 7-zip will be able to pick it up.

2 Comments

I tried LZMA and LZMA2 but zpaq got slightly better results; thank you anyway.
Nanozip is also worth a try. It is among the top compressors and still usably fast.
4

You can simply use delta encoding together with Protocol Buffers.

Take your example: 4, 20, 23, 40, 66.

After delta encoding: 4, 16, 3, 17, 26.

Then you store all the numbers directly as varints, as Protocol Buffers does. You only need 1 byte for a number between 0 and 127, 2 bytes for a number between 128 and 16383, and so on. This is enough for most cases.
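
A hedged sketch of the varint step (the helper names are assumptions; a real Protocol Buffers library does the equivalent for you):

#include <cstdint>
#include <vector>

// Each byte carries 7 payload bits; the high bit marks "more bytes follow".
void PutVarint(std::vector<uint8_t>& out, uint64_t x) {
  while (x >= 0x80) {
    out.push_back(static_cast<uint8_t>((x & 0x7F) | 0x80));  // 7 bits + continuation
    x >>= 7;
  }
  out.push_back(static_cast<uint8_t>(x));                    // final byte
}

std::vector<uint8_t> EncodeDeltasVarint(const std::vector<uint64_t>& sorted) {
  std::vector<uint8_t> out;
  uint64_t prev = 0;
  for (uint64_t val : sorted) {
    PutVarint(out, val - prev);  // e.g. 4, 16, 3, 17, 26 for the example above
    prev = val;
  }
  return out;
}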

Furthermore, you can use entropy coding (Huffman) to achieve a better compression rate than varints, even less than 8 bits per number.

Divide each number into two parts. For example, 17 = ...0001 0001 (binary) = (5)0001. The first part (5) is the count of significant bits; the suffix part (0001) is the number with its leading 1 removed.

For the example above: 4, 16, 3, 17, 26 = (3)00 (5)0000 (2)1 (5)0001 (5)1010

The first parts stay between 0 and 45 even when there are a lot of numbers, so they can be compressed effectively by entropy coding such as Huffman.
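
A small sketch of that split (the function name is an assumption); the lengths would be fed to a Huffman coder and the suffixes written out raw:

#include <cstdint>
#include <utility>

// Split a delta into (significant bit count, suffix without the leading 1).
// Example: 17 = 10001b -> {5, 0b0001}; 4 = 100b -> {3, 0b00}.
std::pair<unsigned, uint64_t> SplitDelta(uint64_t d) {
  unsigned len = 0;
  for (uint64_t t = d; t; t >>= 1) ++len;                       // number of significant bits
  uint64_t suffix = len ? d - (uint64_t{1} << (len - 1)) : 0;   // drop the leading 1
  return {len, suffix};
}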

Comments

3

If your sequence is made up of pseudo-random numbers, such as might be generated by a typical digital computer, then I don't think that any compression scheme will beat, for brevity of representation, simply storing the code for the generator and whatever parameters you need to define its initial state.

If your sequence is made up of truly random numbers generated in some non-deterministic way then the other answers already posted offer a variety of good advice.

Comments
