3

I am looking for an algorithm to compress small ASCII strings. They contain lots of letters but they also can contain numbers and rarely special characters. They will be small, about 50-100 bytes average, 250 max.

Examples:

Android show EditText.setError() above the EditText and not below it
ImageView CENTER_CROP dont work
Prevent an app to show on recent application list on android kitkat 4.4.2
Image can't save validable in android
Android 4.4 SMS - Not receiving sentIntents
Imported android-map-extensions version 2.0 now my R.java file is missing
GCM registering but not receiving messages on pre 4.0.4. devices

I want to compress the titles one by one, not many titles together and I don't care much about CPU and memory usage.

30
  • 1
    Text lends itself to zip compression pretty well, you might want to do that, it is available as standalone as well as a part of few libraries. Commented Jan 17, 2014 at 11:21
  • Searching for compression library should give you something to start with. Commented Jan 17, 2014 at 11:22
  • I want to compress the titles one by one, not many titles together. Commented Jan 17, 2014 at 11:22
  • 2
    Compressing a small quantity of data rarely achieves a result better (smaller) than the original input, due to the overhead of the additional header, instructing how the compressed data should be decoded. You can apply a proprietary (your own) method for this specific case. You have 52 letters, 10 digits and space. 63 symbols altogether. You can encode each symbol using 7 bits (instead of 8). The first bit will always be 0, and the remaining 6 bits will be mapped to any of your 63 symbols. For any "non regular symbol", use 1 followed by the ASCII code of that symbol (i.e., 9 bits instead of 8) Commented Jan 17, 2014 at 11:31
  • 4
    Not usually the best solution, but in this case... You could build a Hoffman compression table from a set of typical titles, then use Hoffman compression using this table for each title. Commented Jan 17, 2014 at 11:33

1 Answer 1

3

You can use Huffman coding with a shared Huffman tree among all texts you want to compress.

While you typically construct a Huffman tree for each string to be compressed separately, this would require a lot of overhead in storage which should be avoided here. That's also the major problem when using a standard compression scheme for your case: most of them have some overhead which kills your compression efficiency for very short strings. Some of them don't have a (big) overhead but those are typically less efficient in general.

When constructing a Huffman tree which is later used for compression and decompression, you typically use the texts which will be compressed to decide which character is encoded with which bits. Since in your case the texts to be compressed seem to be unknown in advance, you need to have some "pseudo" texts to build the tree, maybe from a dictionary of the human language or some experience of previous user data.

Then construct the Huffman tree and store it once in your application; either hardcode it into the binary or provide it in the form of a file. Then you can compress and decompress any texts using this tree. Whenever you decide to change the tree since you gain better experience on which texts are compressed, the compressed string representation also changes. It might be a good idea to introduce versioning and store the tree version together with each string you compress.

Another improvement you might think about is to use multi-character Huffman encoding. Instead of compressing the texts character by character, you could find frequent syllables or words and put them into the tree too; then they require even less bits in the compressed string. This however requires a little bit more complicated compression algorithm, but it might be well worth the effort.

To process a string of bits in the compression and decompression routine in C++(*), I recommend either boost::dynamic_bitset or std::vector<bool>. Both internally pack multiple bits into bytes.


(*)The question once had the tag, so OP obviously wanted to implement it in C++. But as the general problem is not specific to a programming language, the tag was removed. But I still kept the C++-specific part of the answer.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.