
I have many unique numbers, all positive and the order doesn't matter, 0 < num < 2^32.
Example: 23 56 24 26

The biggest, 56, needs 6 bits of space. So I need 4*6 = 24 bits in total.

I do the following to save space:
I sort them first: 23 24 26 56 (because the order doesn't matter)
Now I get the difference of each from the previous: 23 1 2 30

The biggest, 30, needs 5 bits of space.
After this I can store all the numbers in 4*5 = 20 bits of space.
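
In code, the sort-and-delta step looks roughly like this (a quick Python sketch; the function names are mine, not part of the question):

    def delta_encode(nums):
        """Sort, then store the first value followed by successive differences."""
        s = sorted(nums)
        return [s[0]] + [b - a for a, b in zip(s, s[1:])]

    def delta_decode(deltas):
        """Rebuild the sorted list with a running sum."""
        out, total = [], 0
        for d in deltas:
            total += d
            out.append(total)
        return out

    assert delta_encode([23, 56, 24, 26]) == [23, 1, 2, 30]
    assert delta_decode([23, 1, 2, 30]) == [23, 24, 26, 56]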

Question: how to further improve this algorithm?

More information, since it was requested: the numbers are mostly in the 2,000-4,000 range. Numbers below 300 are pretty rare, and numbers above 16,000 are pretty rare as well. Generally speaking, all the numbers will be close together. For example, they may all be in the 1,000-2,000 range, or they may all be in the 16,000-20,000 range. The total count of numbers will be somewhere in the range of 500-5,000.

  • The difference between 56 and 26 is 30, so that would be the biggest ;) Commented Jan 20, 2014 at 11:10
  • I don't understand the bit calculations... where do you store the format so it can be decoded? Commented Jan 20, 2014 at 11:54
  • anyway, this question cannot be properly answered, as it all depends on the distribution and the number of numbers... or you have to explain at least 20 different compression techniques.. :/ Commented Jan 20, 2014 at 11:57

4 Answers


Your first step is a good one to take, because sorting makes the differences as small as possible. Here is a way to improve your algorithm:

  1. Sort and calculate the differences, as you have done.
  2. Use Huffman coding on the differences.

Use of Huffman coding is even more important than your step; I'll show you why.

Consider the following data:

1 2 3 4 5 6 7 4294967295

where 4294967295 = 2^32-1. Using your algorithm:

1 1 1 1 1 1 1 4294967288

The total number of bits needed is still 8*32 = 256, because the largest difference, 4294967288, still needs 32 bits.

Using Huffman coding, the frequencies are:

1 => 7
4294967288 => 1

Huffman codes are 1 => 0 and 4294967288 => 1

Total bits needed = 7*1 + 1*1 = 8 bits

Huffman coding reduces the size by a factor of 256/8 = 32.
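
A minimal Python sketch of both steps (the helper names are mine; a real encoder would also have to store the code table, as the comments below point out):

    import heapq
    from collections import Counter
    from itertools import count

    def huffman_codes(values):
        """Build a Huffman code (value -> bit string) for a list of values."""
        freq = Counter(values)
        if len(freq) == 1:                       # degenerate case: one distinct symbol
            return {next(iter(freq)): "0"}
        tiebreak = count()                       # keeps heap entries comparable
        heap = [(f, next(tiebreak), v) for v, f in freq.items()]
        heapq.heapify(heap)
        while len(heap) > 1:                     # merge the two rarest nodes
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node: recurse
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                                # leaf: an actual delta value
                codes[node] = prefix
        walk(heap[0][2], "")
        return codes

    nums = sorted([1, 2, 3, 4, 5, 6, 7, 4294967295])
    deltas = [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]
    codes = huffman_codes(deltas)
    bits = "".join(codes[d] for d in deltas)
    print(len(bits))                             # 8, as computed above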


5 Comments

@KarolyHorvath I didn't want to convey anything special there; I just didn't want to evaluate that, because it is easier to show that it takes 32 bits this way.
I was confused by the parentheses as well. I'll remove them.
aaah.. is that the last number in the list?
You still need to store the mapping 4294967288 => 1 somewhere, so I guess it's a bit more than 8 bits in the end. Anyway, +1
@tobias_k Good point; I guess I realized that later, but thought it's better to leave it uncomplicated.

This problem is well known in the database community as "inverted index compression". You can google for some papers.

The following are some of the most common techniques:

  • Variable byte coding (VByte)
  • Simple9, Simple16
  • "Frame Of Reference" family of techniques
    • PForDelta
    • Adaptive Frame Of Reference (AFOR)
  • Rice-Golomb coding (often used as a part of other techniques)

VByte and Simple9/16 are the easiest to implement, are fast, and have good compression ratios in practice; see the VByte sketch below.
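
As a rough illustration, here is one common VByte convention in Python (7 payload bits per byte, with the high bit set on every byte except the last; other implementations flip this flag, so treat the details as an assumption):

    def vbyte_encode(n):
        """Encode one non-negative integer as a variable number of bytes."""
        out = bytearray()
        while True:
            byte = n & 0x7F                # low 7 bits of payload
            n >>= 7
            if n:
                out.append(byte | 0x80)    # continuation bit: more bytes follow
            else:
                out.append(byte)           # final byte: high bit clear
                return bytes(out)

    def vbyte_decode(data):
        """Decode a concatenated stream of VByte-encoded integers."""
        result, n, shift = [], 0, 0
        for byte in data:
            n |= (byte & 0x7F) << shift
            if byte & 0x80:
                shift += 7                 # number continues in the next byte
            else:
                result.append(n)           # number complete
                n, shift = 0, 0
        return result

    deltas = [23, 1, 2, 30]                # the question's example deltas
    stream = b"".join(vbyte_encode(d) for d in deltas)
    assert vbyte_decode(stream) == deltas  # 4 bytes total here; deltas in the
                                           # 2,000-4,000 range would take 2 bytes each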

Huffman coding is not very good for index compression because it is slow and the differences are quite random in practice. (But it may be a good choice in your case.)

9 Comments

I wouldn't call Huffman coding slow.. unless it's done wrong, of course.
Huffman coding is heavily bitwise; faster methods concentrate more on byte/word manipulation. And of course it is not fair to compare Huffman with something like Simple9; they have different use cases.
If you only know the bit-by-bit way to do Huffman coding, it's not surprising you think it's slow. Outside of the classroom, however, no one actually implements Huffman coding that way. There are several table-based decoding algorithms that are used in practice.
@MaxTaldykin If the numbers are large, the differences cannot be that random; moreover, if the count of numbers is small, Huffman takes O(log n) bits per number, so in either case Huffman gives good results.
It is slow only when compared to the techniques I listed. Table-based Huffman is not used for index compression due to bad cache locality.

How many numbers do you have? If your set covers the range [0..(2^32)-1] densely enough (you do the maths), then a 512 MiB bitfield (2^32 bits), where the n-th bit represents the presence, or absence, of the natural number n, may be useful.
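
A sketch of the bitfield idea in Python (a plain bytearray used as a bitset; note the up-front 512 MiB allocation, which only pays off for very dense sets, not for the 500-5,000 numbers in the question):

    NBITS = 1 << 32                  # one bit per possible value
    bits = bytearray(NBITS // 8)     # 512 MiB of zeroed memory

    def add(n):
        bits[n >> 3] |= 1 << (n & 7)

    def contains(n):
        return bool(bits[n >> 3] & (1 << (n & 7)))

    for n in (23, 24, 26, 56):
        add(n)
    assert contains(56) and not contains(57)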

1 Comment

If it is really dense, encode the numbers which are missing instead; that will probably take less space than 512 MiB.

If your numbers are not uniformly distributed, better compression can be achieved by using the frequencies of the numbers and assigning fewer bits to the most frequent ones. This is the idea behind Huffman coding.

2 Comments

To quote from the question: "I have many unique numbers...". I guess that means "unique" as in "each number has the same frequency, and that is 1".
@tobias_k You are right, I missed the "unique" point. Huffman can be applied to the differences.
