Algorithm to compress a lot of small strings?

Question

I am looking for an algorithm to compress small ASCII strings. They contain lots of letters but they also can contain numbers and rarely special characters. They will be small, about 50-100 bytes average, 250 max.

Examples:

Android show EditText.setError() above the EditText and not below it
ImageView CENTER_CROP dont work
Prevent an app to show on recent application list on android kitkat 4.4.2
Image can't save validable in android
Android 4.4 SMS - Not receiving sentIntents
Imported android-map-extensions version 2.0 now my R.java file is missing
GCM registering but not receiving messages on pre 4.0.4. devices

I want to compress the titles one by one, not many titles together and I don't care much about CPU and memory usage.

Text lends itself to zip compression pretty well, you might want to do that, it is available as standalone as well as a part of few libraries. — dtech
– dtech, Commented Jan 17, 2014 at 11:21
Searching for compression library should give you something to start with. — Some programmer dude
– Some programmer dude, Commented Jan 17, 2014 at 11:22
I want to compress the titles one by one, not many titles together. — Luka
– Luka, Commented Jan 17, 2014 at 11:22
Compressing a small quantity of data rarely achieves a result better (smaller) than the original input, due to the overhead of the additional header, instructing how the compressed data should be decoded. You can apply a proprietary (your own) method for this specific case. You have 52 letters, 10 digits and space. 63 symbols altogether. You can encode each symbol using 7 bits (instead of 8). The first bit will always be 0, and the remaining 6 bits will be mapped to any of your 63 symbols. For any "non regular symbol", use 1 followed by the ASCII code of that symbol (i.e., 9 bits instead of 8) — barak manos
– barak manos, Commented Jan 17, 2014 at 11:31
Not usually the best solution, but in this case... You could build a Hoffman compression table from a set of typical titles, then use Hoffman compression using this table for each title. — James Kanze
– James Kanze, Commented Jan 17, 2014 at 11:33

leemes · Accepted Answer · 2014-01-17 19:40:18Z

You can use Huffman coding with a shared Huffman tree among all texts you want to compress.

While you typically construct a Huffman tree for each string to be compressed separately, this would require a lot of overhead in storage which should be avoided here. That's also the major problem when using a standard compression scheme for your case: most of them have some overhead which kills your compression efficiency for very short strings. Some of them don't have a (big) overhead but those are typically less efficient in general.

When constructing a Huffman tree which is later used for compression and decompression, you typically use the texts which will be compressed to decide which character is encoded with which bits. Since in your case the texts to be compressed seem to be unknown in advance, you need to have some "pseudo" texts to build the tree, maybe from a dictionary of the human language or some experience of previous user data.

Then construct the Huffman tree and store it once in your application; either hardcode it into the binary or provide it in the form of a file. Then you can compress and decompress any texts using this tree. Whenever you decide to change the tree since you gain better experience on which texts are compressed, the compressed string representation also changes. It might be a good idea to introduce versioning and store the tree version together with each string you compress.

Another improvement you might think about is to use multi-character Huffman encoding. Instead of compressing the texts character by character, you could find frequent syllables or words and put them into the tree too; then they require even less bits in the compressed string. This however requires a little bit more complicated compression algorithm, but it might be well worth the effort.

To process a string of bits in the compression and decompression routine in C++^(*), I recommend either boost::dynamic_bitset or std::vector<bool>. Both internally pack multiple bits into bytes.

^(*)_{The question once had the c++ tag, so OP obviously wanted to implement it in C++. But as the general problem is not specific to a programming language, the tag was removed. But I still kept the C++-specific part of the answer.}

Collectives™ on Stack Overflow

Algorithm to compress a lot of small strings?

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related