
I'm looking for an efficient encoding for storing simulated coefficients.

The data consists of thousands of curves, each made of 512 contiguous single-precision numbers. The data may be stored as fixed point, as long as it preserves about 23-bit precision (relative to unity level).

The curves could look like these:

[image: explanatory curves]

My best approach was to convert the numbers to 24-bit fixed point. I then repeatedly took adjacent differences as long as the sum of squares decreased. Compressing the resulting data with LZMA (xz, lzip) gives about 7.5× compression (compared to float32).
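A minimal sketch of that pipeline in Python (my reconstruction of the described approach, not the actual code; the 24-bit scale and the exact stopping rule are assumptions):

```python
import lzma

import numpy as np

def encode(curve, bits=24):
    """Quantize to fixed point, then difference while the sum of squares shrinks."""
    q = np.round(curve * (1 << (bits - 1))).astype(np.int64)
    order = 0
    while True:
        # length-preserving delta: keep the first element so cumsum inverts it
        d = np.concatenate(([q[0]], np.diff(q)))
        if np.sum(d.astype(float) ** 2) >= np.sum(q.astype(float) ** 2):
            break
        q, order = d, order + 1
    return order, lzma.compress(q.astype('<i4').tobytes())

def decode(order, blob, bits=24):
    q = np.frombuffer(lzma.decompress(blob), dtype='<i4').astype(np.int64)
    for _ in range(order):
        q = np.cumsum(q)  # cumsum exactly undoes one delta pass
    return q / float(1 << (bits - 1))
```

Keeping the first element in each delta pass makes the transform length-preserving and exactly invertible, so decoding reproduces the quantized values bit-for-bit.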

Adjacent differences work well at first, but with each pass they amplify the quantization noise at every least-significant-bit step.

I've also tried the cosine transform after subtracting the slope/curvature at the boundaries; the resulting compression was much weaker.

I also tried AEC, but LZMA compressed much more strongly. The highest compression came from bzip3 (after adjacent differences).

I found no function to fit the data with high precision and a limited parameter count.

Is there a way to reduce the penalty of quantization noise when using adjacent differences?

Are there encodings which are better suited for this type of data?

5 Comments
  • What exactly do you mean by "contiguous"? Where is the quantization noise coming from? Do you mean just the noise in the least significant bit of the 23? When you say "simulated", are they generated by an algorithm as opposed to measured? If they are generated by an algorithm, then why not use the algorithm and parameters as the compressed representation? Commented Feb 6, 2023 at 17:59
  • Data model: The data is generated by a model (based on maybe 500 MB of measured data). Calculating the model is expensive, so it should be precalculated for various parameters and distributed. I'd like to compress the resulting data of several tens of GB. The quantization noise I'm talking about comes from the digital representation: after one adjacent difference, a least-significant step becomes a spike. With the next adjacent difference you get a positive spike at the leading edge and a negative spike at the falling edge. Commented Feb 6, 2023 at 18:19
  • By contiguous I mean that the data follows a strict curve. I used this word to emphasize that the data should be highly predictable. Commented Feb 6, 2023 at 18:31
  • Then you mean continuous, not contiguous. Commented Feb 6, 2023 at 19:54
  • Can you provide a link to a few examples, each with 512 samples? Commented Feb 6, 2023 at 21:54

1 Answer


You could try a higher-order predictor. Your "adjacent difference" is a zeroth-order predictor, where the next sample is predicted to be equal to the last sample. You take the differences between the actuals and the predictions, and then compress those differences.

You can try first, second, etc. order predictors. A first-order predictor would look at the last two samples, draw a line between those, and predict that the next sample will fall on the line. A second-order predictor would look at the last three samples, fit those to a parabola, and predict that the next sample will fall on the parabola. And so on.

Assuming that your samples are equally spaced on your x-axis, then the predictors for x[0] up through cubics are:

  1. x[-1] (what you're using now)
  2. 2*x[-1] - x[-2]
  3. 3*x[-1] - 3*x[-2] + x[-3]
  4. 4*x[-1] - 6*x[-2] + 4*x[-3] - x[-4]

(Note that the coefficients are alternating-sign binomial coefficients.)

I doubt that the cubic polynomial predictor will be useful for you, but experiment with all of them to see if any help.
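Since the residual of the order-k predictor above is just the (k+1)-th finite difference, all four orders can be compared with a couple of lines of NumPy (a sketch; the quantized sine curve is a made-up stand-in for real data):

```python
import numpy as np

def predictor_residuals(x, order):
    # Residuals of the order-`order` binomial predictor, e.g. for order 1:
    # x[n] - (2*x[n-1] - x[n-2]), which is the (order+1)-th finite difference.
    return np.diff(x, n=order + 1)

# pick the order whose residuals are smallest in the sum-of-squares sense
x = np.round(np.sin(np.linspace(0, 3, 512)) * (1 << 23)).astype(np.int64)
best = min(range(4),
           key=lambda k: float(np.sum(predictor_residuals(x, k).astype(float) ** 2)))
```

Past the best order, each extra differencing pass amplifies the quantization noise faster than it shrinks the signal, so the sum of squares starts growing again.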

Assuming that the differences are small, you should use a variable-length integer to represent them. The idea would be to use one byte for each difference most of the time. For example, you could code seven bits of difference, say -64 to 63, in one byte with the high bit clear. If the difference doesn't fit in that, then make the high bit set, and have a second byte with another seven bits for a total of 14 with that second high bit clear. And so on for larger differences.
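One concrete way to realize this scheme (a sketch, with one deviation from the description above: a zigzag map folds negatives into small unsigned codes instead of reserving a sign bit in the first byte):

```python
def encode_varint(values):
    """LEB128-style variable-length bytes with zigzag for the sign.
    Works for values in the signed 64-bit range."""
    out = bytearray()
    for v in values:
        u = (v << 1) ^ (v >> 63)           # zigzag: 0,-1,1,-2,2 -> 0,1,2,3,4
        while u >= 0x80:
            out.append((u & 0x7F) | 0x80)  # high bit set: more bytes follow
            u >>= 7
        out.append(u)                      # high bit clear: last byte of value
    return bytes(out)

def decode_varint(data):
    values, u, shift = [], 0, 0
    for b in data:
        u |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append((u >> 1) ^ -(u & 1))  # undo zigzag
            u, shift = 0, 0
    return values
```

With this layout any residual in [-64, 63] costs exactly one byte, matching the range given in the answer.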


1 Comment

Thanks for the hint. I'm currently quite occupied, so I haven't been able to test this yet. I'll look at it as soon as possible.
