14

I have a large sequence of random integers sorted from lowest to highest. The numbers range from 1 bit up to nearly 45 bits. At the beginning of the list the numbers are very close to each other: 4, 20, 23, 40, 66. As the numbers get larger, the distances between them grow as well (in fact, the distances are random). There are no duplicate numbers.

I'm using bit packing to save some space. Nonetheless, this file can get really big.

I would like to know what kind of compression algorithm can be used in this situation, or any other technique to save as much space as possible.

Thank you.

8
  • I'm not following. Could you draw a (perhaps ASCII) picture or something of the sequence? Commented Sep 30, 2012 at 19:42
  • @Keyser 1, 23, 33, 34, 39, 80, 122, 125, 169, 168, 203, and so on Commented Sep 30, 2012 at 19:46
  • 5
    You can express the list as the distance between adjacent numbers -- 22, 10, 1, 5, 41, 42, 3... (if my arithmetic is right). Since these numbers are shorter you can pack them together more closely. (Note that you should add a virtual zero to the start of the sequence to generate the initial 1, or otherwise record the start number.) You can probably even play the trick of doing the difference a second time to come up with a signed number -- -12, -9, 4, 36, 1, -39... which may or may not compact better (depending on the number distribution). Commented Sep 30, 2012 at 19:53
  • And note that you can use one of several variable-length notations. E.g., a number beginning 11b is a 3-byte (22-bit) number, a number beginning 10b is a 2-byte (14-bit) number, and a number beginning 0b is a 1-byte (7-bit) number. You can combine this scheme with your bit packing to come up with whatever size numbers best fit your needs (see the sketch after these comments). Commented Sep 30, 2012 at 19:58
  • (It would help to know your usage scenario -- some schemes make it much harder to access, say, the 27th entry than others, if you need to do random access without unpacking the entire list.) Commented Sep 30, 2012 at 20:00
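
For illustration, a rough sketch of the variable-length byte scheme from the comment above (the function name and container choice are assumptions, not part of the original suggestion):

#include <cstdint>
#include <stdexcept>
#include <vector>

// Sketch only: encode each delta in 1-3 bytes. A leading 0 bit marks a 1-byte
// (7-bit) delta, 10 marks a 2-byte (14-bit) delta, 11 a 3-byte (22-bit) delta.
std::vector<uint8_t> EncodeDeltas(const std::vector<uint64_t>& sorted) {
  std::vector<uint8_t> out;
  uint64_t prev = 0;  // virtual zero before the first value
  for (uint64_t val : sorted) {
    uint64_t d = val - prev;
    prev = val;
    if (d < (1u << 7)) {                 // 0xxxxxxx
      out.push_back(static_cast<uint8_t>(d));
    } else if (d < (1u << 14)) {         // 10xxxxxx xxxxxxxx
      out.push_back(static_cast<uint8_t>(0x80 | (d >> 8)));
      out.push_back(static_cast<uint8_t>(d & 0xFF));
    } else if (d < (1u << 22)) {         // 11xxxxxx xxxxxxxx xxxxxxxx
      out.push_back(static_cast<uint8_t>(0xC0 | (d >> 16)));
      out.push_back(static_cast<uint8_t>((d >> 8) & 0xFF));
      out.push_back(static_cast<uint8_t>(d & 0xFF));
    } else {
      throw std::runtime_error("delta needs more than 22 bits");
    }
  }
  return out;
}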

6 Answers

10

You can compress optimally if you know the true distribution of the data. If you can provide a probability distribution for each integer you can use arithmetic coding or other entropy coding techniques to compress to theoretical minimal size.

The trick is in predicting accurately.

First, you should probably compress the distances between the numbers because that allows you to make statistical statements. If you were to compress the numbers directly you'd have a hard time modelling them because they occur only once.

Next, you could try to build a very simple model to predict the next distance. Keep a histogram of all previously seen distances and calculate the probabilities from the frequencies.

You probably need to account for missing values (you clearly can't assign them 0 probability because that is not expressible) but you can use heuristics for that, like encoding the next distance bit-by-bit and predicting each bit individually. You will pay almost nothing for the high-order bits because they are almost always 0 and entropy encoding optimizes them away.

All of this is much simpler if you know the distribution. Example: if you are compressing a list of all prime numbers, you know the theoretical distribution of distances because there are formulae for that. So you already have a perfect model.
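
For illustration, a sketch of the modelling step only (no arithmetic coder is shown; the function name is made up, and it uses a static two-pass histogram rather than the adaptive one described above). It builds the gap histogram and reports the number of bits an entropy coder driven by that model would approach:

#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

// Build a histogram of the gaps between consecutive values and report the
// total information content in bits.
double EstimatedGapBits(const std::vector<uint64_t>& sorted) {
  std::map<uint64_t, uint64_t> freq;  // gap value -> number of occurrences
  uint64_t prev = 0;
  for (uint64_t val : sorted) {
    ++freq[val - prev];
    prev = val;
  }
  const double total = static_cast<double>(sorted.size());
  double bits = 0.0;
  for (const auto& entry : freq) {
    const double p = entry.second / total;   // empirical probability of this gap
    bits += entry.second * -std::log2(p);    // -log2(p) bits per occurrence
  }
  return bits;
}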


6 Comments

There is a probability. I'm counting from 1 bit to near 43 bits, but each number is stored in a different file, and this is random. That's why each file has random but sorted numbers. There are 256 files, so the probability is 1/256 that the next number goes to one file rather than another.
So this means that the distance follows an exactly known distribution? If yes, you have the answer in the last paragraph.
You could also have a single file, and for each number store an 8-bit integer to denote which file it belongs to. That gives you optimal compression (because the files are equally distributed) and 8 bits per number.
They are not equally distributed, sorry. I'm using hash functions to determine which file each number should go to, so the distribution is not exactly 1/256.
It is almost a perfect distribution. If the distribution of values to files is controlled by a hash function this scheme is the best you can do. For each number you are essentially generating a random 8-bit value.
10

There's a very simple and fairly effective compression technique which can be used for sorted integers in a known range. Like most compression schemes, it is optimized for serial access, although you can build an index to speed up random access if needed.

It's a type of delta encoding (i.e. each number is represented by the distance from the previous one), consisting of a vector of codes which are either

  • a single 1-bit, representing a delta of 2^k which is added to the delta in the following code, or

  • a 0-bit followed by a k-bit delta, indicating that the next number is the specified delta from the previous one.

For example, if k is 4, the sequence:

00011 1 1 00000 1 00001

codes three numbers. The first code (a 0 followed by the four-bit delta 0011, i.e. 3) is the first delta, taken from an initial value of 0, so the first number is 3. The next two solitary 1's accumulate to a delta of 2·2^4, or 32, which is added to the following delta of 0000, for a total of 32. So the second number is 3+32=35. Finally, the last delta is a single 2^4 plus 1, for a total of 17, and the third number is 35+17=52.

The 1-bit indicates that the next delta should be incremented by 2^k (or, more generally, each delta is incremented by 2^k times the number of immediately preceding 1-bits).

Another, possibly better, way of thinking of this is that each delta is coded as a variable-length bit sequence: 1^i 0 (1|0)^k, representing a delta of i·2^k + [the k-bit suffix]. But the first presentation aligns better with the optimality proof.

Since each "1" code represents an increment of 2^k, there cannot be more than m/2^k of them, where m is the largest number in the set to be compressed. The remaining codes all correspond to numbers, and have a total length of n·(k + 1) where n is the size of the set. The optimal value of k is roughly log2(m/n), which in your case would be 7 or 8.

I did a quick proof of concept of the algorithm, without worrying about optimizations. It's still plenty fast; sorting the random sample takes a lot longer than compressing/decompressing it. I tried it with a few different seeds and vector sizes from 16,400,000 to 31,000,000 with a value range of [0, 4,000,000,000). The bits used per data value ranged from 8.59 (n=31000000) to 9.45 (n=16400000). All of the tests were done with 7-bit suffixes; log2(m/n) varies from 7.01 (n=31000000) to 7.93 (n=16400000). I tried with 6-bit and 8-bit suffixes; except in the case of n=31000000 where the 6-bit suffixes were slightly smaller, the 7-bit suffix was always the best. So I guess that the optimal k is not exactly floor(log2(m/n)) but it's not far off.

Compression code:

// BitOut is a simple bit-stream writer over std::ostream (not shown here).
void Compress(std::ostream& os,
              const std::vector<unsigned long>& v,
              unsigned long k = 0) {
  BitOut out(os);
  out.put(v.size(), 64);                  // element count
  if (v.size()) {
    unsigned long twok;
    if (k == 0) {
      // Pick k automatically, roughly log2(max value / element count).
      unsigned long ratio = v.back() / v.size();
      for (twok = 1; twok <= ratio / 2; ++k, twok *= 2) { }
    } else {
      twok = 1UL << k;
    }
    out.put(k, 32);                       // suffix width

    unsigned long prev = 0;
    for (unsigned long val : v) {
      // Emit one 1-bit per 2^k step towards val, then a 0-bit and the k-bit remainder.
      while (val - prev >= twok) { out.put(1); prev += twok; }
      out.put(0);
      out.put(val - prev, k);
      prev = val;
    }
  }
  out.flush(1);
}

Decompression:

// BitIn is the matching bit-stream reader over std::istream (not shown here).
std::vector<unsigned long> Decompress(std::istream& is) {
  BitIn in(is);
  unsigned long size = in.get(64);        // element count
  std::vector<unsigned long> v;           // declared here so it is in scope at the return
  if (size) {
    unsigned long k = in.get(32);         // suffix width
    unsigned long twok = 1UL << k;

    v.reserve(size);
    unsigned long prev = 0;
    for (; size; --size) {
      while (in.get()) prev += twok;      // each leading 1-bit adds 2^k
      prev += in.get(k);                  // k-bit remainder after the 0-bit
      v.push_back(prev);
    }
  }
  return v;
}

It can be a bit awkward to use variable-length encodings; an alternative is to store the first bit of each code (1 or 0) in a bit vector, and the k-bit suffixes in a separate vector. This would be particularly convenient if k is 8.
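
A minimal sketch of that split layout for k = 8 (the struct and function names are assumptions):

#include <cstddef>
#include <cstdint>
#include <vector>

struct SplitCode {
  std::vector<uint8_t> flags;     // the 1/0 code bits, packed 8 per byte
  std::vector<uint8_t> suffixes;  // one 8-bit suffix per 0-code
};

SplitCode CompressSplit(const std::vector<unsigned long>& v) {
  SplitCode out;
  std::size_t bitpos = 0;
  auto put_bit = [&](int b) {
    if (bitpos % 8 == 0) out.flags.push_back(0);
    if (b) out.flags.back() |= static_cast<uint8_t>(1u << (bitpos % 8));
    ++bitpos;
  };
  unsigned long prev = 0;
  for (unsigned long val : v) {
    while (val - prev >= 256) { put_bit(1); prev += 256; }  // each 1-bit adds 256
    put_bit(0);                                             // 0-bit ends the run
    out.suffixes.push_back(static_cast<uint8_t>(val - prev));
    prev = val;
  }
  return out;
}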

A variant, which results in slightly longer files but is a bit easier to build indexes for, is to only use the 1-bits as deltas. Then the deltas are always a·2^k for some a, possibly 0, where a is the number of consecutive 1-bits preceding the suffix code. The index then consists of the locations of every Nth 1-bit in the bit vector, and the corresponding index into the suffix vector (i.e. the index of the suffix corresponding to the next 0 in the bit vector).


13 Comments

This works, but it is not the most efficient approach because the code makes certain assumptions about the data distribution. If that assumption is wrong it will waste bits, potentially producing output larger than the uncompressed data. It is a valid answer though.
@usr, It cannot be longer than uncompressed if you choose a k corresponding to the size of the original data vector. I added a bit of elaboration about this. It does not make any assumption about distribution of data; only about the quantity of data.
@usr, ... the quantity and range of data (didn't edit that fast enough)
@rici is there any better documentation/manual/explanation for your solution? Sorry for my ignorance, but I still can't fully understand it
@JonathanAllan: Yes, I think you're right. If k is four then the binary section must be five bits (of which the first is 0). I think I fixed the example (and a couple of lingering formatting issues). Thanks.
6

One option that worked well for me in the past was to store a list of 64-bit integers as 8 different lists of 8-bit values. You store the high 8 bits of the numbers, then the next 8 bits, etc. For example, say you have the following 32-bit numbers:

0x12345678
0x12349785
0x13111111
0x13444444

The data stored would be (in hex):

12,12,13,13
34,34,11,44
56,97,11,44
78,85,11,44

I then ran that through the deflate compressor.

I don't recall what compression ratios I was able to achieve with this, but it was significantly better than compressing the numbers themselves.
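
For illustration, a rough sketch of that byte-plane transform for 32-bit values (the helper name is an assumption; the deflate step itself, e.g. via zlib, is not shown):

#include <cstdint>
#include <vector>

// Plane 0 holds every number's most significant byte, plane 3 the least
// significant one. Each plane is then a highly repetitive stream that a
// general-purpose compressor such as deflate handles well.
std::vector<std::vector<uint8_t>> ToBytePlanes(const std::vector<uint32_t>& v) {
  std::vector<std::vector<uint8_t>> planes(4);
  for (uint32_t x : v) {
    planes[0].push_back(static_cast<uint8_t>(x >> 24));
    planes[1].push_back(static_cast<uint8_t>(x >> 16));
    planes[2].push_back(static_cast<uint8_t>(x >> 8));
    planes[3].push_back(static_cast<uint8_t>(x));
  }
  return planes;  // concatenate the planes and feed the result to deflate
}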

Comments

5

I want to add another answer with the simplest possible solution:

  1. Convert the numbers to deltas as discussed previously
  2. Run it through the 7-zip LZMA2 algorithm. It is even multi-core ready

I think this will give almost perfect results in your case because the distances have a simple distribution. 7-zip will be able to pick it up.

2 Comments

I tried LZMA and LZMA2 but zpaq got slightly better results; thank you anyway.
Nanozip is also worth a try. It is among the top compressors and still usably fast.
4

You can simply use delta encoding together with Protocol Buffers.

Take your example: 4, 20, 23, 40, 66.

After delta encoding: 4, 16, 3, 17, 26.

Then you store all the numbers directly as varints, as Protocol Buffers does. You only need 1 byte for a number between 0 and 127, 2 bytes for a number between 128 and 16383, and so on. This is enough for most cases.
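
A hedged sketch of the varint step (the helper names are assumptions; a real Protocol Buffers library does the equivalent for you):

#include <cstdint>
#include <vector>

// Each byte carries 7 payload bits; the high bit marks "more bytes follow".
void PutVarint(std::vector<uint8_t>& out, uint64_t x) {
  while (x >= 0x80) {
    out.push_back(static_cast<uint8_t>((x & 0x7F) | 0x80));  // 7 bits + continuation
    x >>= 7;
  }
  out.push_back(static_cast<uint8_t>(x));                    // final byte
}

std::vector<uint8_t> EncodeDeltasVarint(const std::vector<uint64_t>& sorted) {
  std::vector<uint8_t> out;
  uint64_t prev = 0;
  for (uint64_t val : sorted) {
    PutVarint(out, val - prev);  // e.g. 4, 16, 3, 17, 26 for the example above
    prev = val;
  }
  return out;
}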

Furthermore, you can use entropy coding (Huffman) to achieve a better compression rate than varints, even less than 8 bits per number.

Divide each number into two parts. For example, 17 = ...0001 0001 (binary) = (5)0001. The first part (5) is the count of significant bits; the suffix part (0001) is the number with its leading 1 removed.

For the example above: 4, 16, 3, 17, 26 = (3)00 (5)0000 (2)1 (5)0001 (5)1010

The first parts stay between 0 and 45 even when there are a lot of numbers, so they can be compressed effectively by entropy coding such as Huffman.
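
A small sketch of that split (the function name is an assumption); the lengths would be fed to a Huffman coder and the suffixes written out raw:

#include <cstdint>
#include <utility>

// Split a delta into (significant bit count, suffix without the leading 1).
// Example: 17 = 10001b -> {5, 0b0001}; 4 = 100b -> {3, 0b00}.
std::pair<unsigned, uint64_t> SplitDelta(uint64_t d) {
  unsigned len = 0;
  for (uint64_t t = d; t; t >>= 1) ++len;                       // number of significant bits
  uint64_t suffix = len ? d - (uint64_t{1} << (len - 1)) : 0;   // drop the leading 1
  return {len, suffix};
}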

Comments

3

If your sequence is made up of pseudo-random numbers, such as might be generated by a typical digital computer, then I don't think that any compression scheme will beat, for brevity of representation, simply storing the code for the generator and whatever parameters you need to define its initial state.

If your sequence is made up of truly random numbers generated in some non-deterministic way then the other answers already posted offer a variety of good advice.

Comments
