
I have many unique numbers, all positive and the order doesn't matter, 0 < num < 2^32.
Example: 23 56 24 26

The biggest, 56, needs 6 bits of space. So I need 4*6 = 24 bits in total.

I do the following to save space:
I sort them first: 23 24 26 56 (because the order doesn't matter)
Now I get the difference of each from the previous: 23 1 2 30

The biggest, 30, needs 5 bits of space.
After this I can store all the numbers in 4*5 = 20 bits of space.
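
In code, the sort-and-delta step looks roughly like this (a quick Python sketch; the function names are mine, not part of the question):

    def delta_encode(nums):
        """Sort, then store the first value followed by successive differences."""
        s = sorted(nums)
        return [s[0]] + [b - a for a, b in zip(s, s[1:])]

    def delta_decode(deltas):
        """Rebuild the sorted list with a running sum."""
        out, total = [], 0
        for d in deltas:
            total += d
            out.append(total)
        return out

    assert delta_encode([23, 56, 24, 26]) == [23, 1, 2, 30]
    assert delta_decode([23, 1, 2, 30]) == [23, 24, 26, 56]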

Question: how to further improve this algorithm?

More information, since it was requested: the numbers are mostly in the 2,000-4,000 range. Numbers below 300 are pretty rare, and numbers above 16,000 are pretty rare as well. Generally speaking, all the numbers will be close together. For example, they may all be in the 1,000-2,000 range, or they may all be in the 16,000-20,000 range. The total count of numbers will be somewhere in the range of 500-5,000.

  • The difference between 56 and 26 is 30, so that would be the biggest ;) Commented Jan 20, 2014 at 11:10
  • I don't understand the bit calculations... where do you store the format so it can be decoded? Commented Jan 20, 2014 at 11:54
  • anyway, this question cannot be properly answered, as it all depends on the distribution and the number of numbers... or you have to explain at least 20 different compression techniques.. :/ Commented Jan 20, 2014 at 11:57

4 Answers


Your first step is a good one to take, because sorting makes the differences as small as possible. Here is a way to improve your algorithm:

  1. Sort and calculate the differences, as you have done.
  2. Use Huffman coding on the differences.

Use of Huffman coding is even more important than your step; I'll show you why.

Consider the following data:

1 2 3 4 5 6 7 4294967295

where 4294967295 = 2^32-1. Using your algorithm:

1 1 1 1 1 1 1 4294967288

The total number of bits needed is still 8*32 = 256, because the largest difference, 4294967288, still needs 32 bits.

Using Huffman coding, the frequencies are:

1 => 7
4294967288 => 1

Huffman codes are 1 => 0 and 4294967288 => 1

Total bits needed = 7*1 + 1*1 = 8 bits

Huffman coding reduces the size by a factor of 256/8 = 32.
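
A minimal Python sketch of both steps (the helper names are mine; a real encoder would also have to store the code table, as the comments below point out):

    import heapq
    from collections import Counter
    from itertools import count

    def huffman_codes(values):
        """Build a Huffman code (value -> bit string) for a list of values."""
        freq = Counter(values)
        if len(freq) == 1:                       # degenerate case: one distinct symbol
            return {next(iter(freq)): "0"}
        tiebreak = count()                       # keeps heap entries comparable
        heap = [(f, next(tiebreak), v) for v, f in freq.items()]
        heapq.heapify(heap)
        while len(heap) > 1:                     # merge the two rarest nodes
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node: recurse
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                                # leaf: an actual delta value
                codes[node] = prefix
        walk(heap[0][2], "")
        return codes

    nums = sorted([1, 2, 3, 4, 5, 6, 7, 4294967295])
    deltas = [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]
    codes = huffman_codes(deltas)
    bits = "".join(codes[d] for d in deltas)
    print(len(bits))                             # 8, as computed above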


5 Comments

@KarolyHorvath I didn't want to convey anything special there; I just didn't want to evaluate that, because it is easier to show that it takes 32 bits this way.
I was confused by the parentheses as well. I'll remove them.
aaah.. is that the last number in the list?
You still need to store the mapping 4294967288 => 1 somewhere, so I guess it's a bit more than 8 bits in the end. Anyway, +1
@tobias_k Good point; I guess I realized that later, but thought it's better to leave it uncomplicated.

This problem is well known in the database community as "inverted index compression". You can google for some papers.

The following are some of the most common techniques:

  • Variable byte coding (VByte)
  • Simple9, Simple16
  • "Frame Of Reference" family of techniques
    • PForDelta
    • Adaptive Frame Of Reference (AFOR)
  • Rice-Golomb coding (often used as a part of other techniques)

VByte and Simple9/16 are the easiest to implement, are fast, and have good compression ratios in practice; see the VByte sketch below.
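
As a rough illustration, here is one common VByte convention in Python (7 payload bits per byte, with the high bit set on every byte except the last; other implementations flip this flag, so treat the details as an assumption):

    def vbyte_encode(n):
        """Encode one non-negative integer as a variable number of bytes."""
        out = bytearray()
        while True:
            byte = n & 0x7F                # low 7 bits of payload
            n >>= 7
            if n:
                out.append(byte | 0x80)    # continuation bit: more bytes follow
            else:
                out.append(byte)           # final byte: high bit clear
                return bytes(out)

    def vbyte_decode(data):
        """Decode a concatenated stream of VByte-encoded integers."""
        result, n, shift = [], 0, 0
        for byte in data:
            n |= (byte & 0x7F) << shift
            if byte & 0x80:
                shift += 7                 # number continues in the next byte
            else:
                result.append(n)           # number complete
                n, shift = 0, 0
        return result

    deltas = [23, 1, 2, 30]                # the question's example deltas
    stream = b"".join(vbyte_encode(d) for d in deltas)
    assert vbyte_decode(stream) == deltas  # 4 bytes total here; deltas in the
                                           # 2,000-4,000 range would take 2 bytes each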

Huffman coding is not very good for index compression because it is slow and the differences are quite random in practice. (But it may be a good choice in your case.)

9 Comments

I wouldn't call Huffman coding slow.. unless it's done wrong, of course.
Huffman coding is heavily bitwise; faster methods concentrate more on byte/word manipulation. And of course it is not fair to compare Huffman with something like Simple9; they have different use cases.
If you only know the bit-by-bit way to do Huffman coding, it's not surprising you think it's slow. Outside of the classroom, however, no one actually implements Huffman coding that way. There are several table-based decoding algorithms that are used in practice.
@MaxTaldykin If the numbers are large, the differences cannot be that random; moreover, if the count of numbers is small, Huffman takes O(log n) bits per number, so in either case Huffman gives good results.
It is slow only when compared to the techniques I listed. Table-based Huffman is not used for index compression due to bad cache locality.

How many numbers do you have? If your set covers the range [0..(2^32)-1] densely enough (you do the maths), then a 512 MiB bitfield (2^32 bits), where the n-th bit represents the presence, or absence, of the natural number n, may be useful.
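
A sketch of the bitfield idea in Python (a plain bytearray used as a bitset; note the up-front 512 MiB allocation, which only pays off for very dense sets, not for the 500-5,000 numbers in the question):

    NBITS = 1 << 32                  # one bit per possible value
    bits = bytearray(NBITS // 8)     # 512 MiB of zeroed memory

    def add(n):
        bits[n >> 3] |= 1 << (n & 7)

    def contains(n):
        return bool(bits[n >> 3] & (1 << (n & 7)))

    for n in (23, 24, 26, 56):
        add(n)
    assert contains(56) and not contains(57)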

1 Comment

If it is really dense, encode the numbers which are missing instead; that will probably take less space than 512 MiB.

If your numbers are not uniformly distributed, better compression can be achieved by using the frequencies of the numbers and assigning fewer bits to the most frequent ones. This is the idea behind Huffman coding.

2 Comments

To quote from the question: "I have many unique numbers...". I guess that means "unique" as in "each number has the same frequency, and that is 1".
@tobias_k You are right, I missed the "unique" point. Huffman can be applied to the differences.
