I want to take an arbitrary string of ASCII text, like "Hello world", and compress it into a version with as few characters as possible, but in a way that it can be decompressed. The compressed version should be composed only of ASCII characters. Is there a way to accomplish this, especially in Ruby?
4 Answers
If you know that only ASCII characters will be used, each byte carries only the 7 low-order bits. With bit manipulation you can pack every 8 bytes into 7 (a 12.5% saving). If you can get the input into a smaller range (only 64 valid characters), each character fits in 6 bits and you can drop another byte per group of 8.
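For illustration, here is a minimal Ruby sketch of that 8-into-7 packing, assuming every input byte is below 128 and the input length is a multiple of 8. Note that the packed result is raw bytes, not ASCII, which is exactly the problem the next paragraph raises:

```ruby
def pack7(str)
  str.bytes.each_slice(8).flat_map { |chunk|
    # Concatenate eight 7-bit values into one 56-bit integer...
    n = chunk.inject(0) { |acc, b| (acc << 7) | b }
    # ...then emit it as seven 8-bit bytes, most significant first.
    6.downto(0).map { |i| (n >> (i * 8)) & 0xFF }
  }.pack('C*')
end

def unpack7(packed)
  packed.bytes.each_slice(7).flat_map { |chunk|
    n = chunk.inject(0) { |acc, b| (acc << 8) | b }
    7.downto(0).map { |i| (n >> (i * 7)) & 0x7F }
  }.pack('C*')
end

s = "Hello world!!!!!"   # 16 bytes, a multiple of 8
packed = pack7(s)        # 14 raw bytes (not necessarily ASCII)
unpack7(packed) == s     # => true
```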
However, because you want the compressed form to ALSO contain only ASCII characters, the output itself can carry at most 7 bits per character, which cancels the gain and puts you back at square one, unless your input can be restricted to 64 characters (e.g. lossy compression substituting some characters with others, storing only in lower case, etc.).
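The 64-character case does fit in ASCII output, since 6-bit input codes can be regrouped into 7-bit output characters. A sketch under those assumptions (the 64-symbol ALPHABET below is hypothetical, and the output may include ASCII control characters):

```ruby
# 26 + 26 + 10 + 2 = 64 symbols; an assumed, restricted input alphabet.
ALPHABET = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a + [' ', '.']

def squeeze(str)
  bits = str.chars.map { |c| format('%06b', ALPHABET.index(c)) }.join
  bits += '0' * (-bits.length % 7)               # pad to a multiple of 7
  bits.scan(/.{7}/).map { |g| g.to_i(2) }.pack('C*')
end

def unsqueeze(ascii, original_length)
  bits = ascii.bytes.map { |b| format('%07b', b) }.join
  bits.scan(/.{6}/).first(original_length).map { |g| ALPHABET[g.to_i(2)] }.join
end

squeeze("hello world").length                      # => 10, instead of 11
unsqueeze(squeeze("hello world"), 11)              # => "hello world"
```

Asymptotically this packs 7 input characters into 6 output characters (42 bits each), a saving of about 14%.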
If your strings are small (under about 1 KB), there are minimal savings to be had using gzip/bzip2 etc. because of the size of their headers. If you had a predefined dictionary to use as a Huffman table, you might get some compression, but in other cases you can get bloat relative to the original text.
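You can see the header overhead directly with Ruby's built-in Zlib from the standard library:

```ruby
require 'zlib'

s = "Hello world"
Zlib::Deflate.deflate(s).bytesize
# => around 19 with default settings: bigger than the 11-byte original
Zlib::Inflate.inflate(Zlib::Deflate.deflate(s))  # => "Hello world"
```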
Prior discussion on SO: An efficient compression algorithm for short text strings
There are many good text compression algorithms, like Huffman coding or LZW, that are good at compressing text strings into bitstrings with many fewer bits than the standard ASCII encoding. Once you have such an encoding, you can always split the bitstring into groups of seven bits to pack them into standard ASCII characters. I'm sure there are libraries out there that do this, but I'm not much of a Ruby coder and don't know of any off the top of my head.
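The regrouping step itself is easy in Ruby. A minimal sketch, assuming the compressor hands you a String of '0'/'1' characters:

```ruby
# Pack a bitstring into 7-bit ASCII characters (zero-padded at the end).
# Some of the resulting characters may be ASCII control characters.
def bits_to_ascii(bits)
  padded = bits + '0' * (-bits.length % 7)
  padded.scan(/.{7}/).map { |g| g.to_i(2).chr }.join
end

# Recover the bitstring; the caller must remember the original bit count.
def ascii_to_bits(ascii, bit_count)
  ascii.each_char.map { |c| format('%07b', c.ord) }.join[0, bit_count]
end

bits = "110100001101"  # e.g. the output of a (hypothetical) Huffman encoder
ascii_to_bits(bits_to_ascii(bits), bits.length) == bits  # => true
```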
The simplest way to do this would be to compress the string using a standard algorithm, then Base64-encode the result. This isn't likely to help on a string as short as 'Hello world', though: at that size there's very little you can do to decrease the length, unless all your strings share a similarly restricted character set, or patterns that something like Huffman encoding can take advantage of.
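In Ruby that's a few lines with the standard library's Zlib and Base64; a sketch:

```ruby
require 'zlib'
require 'base64'

def compress64(str)
  Base64.strict_encode64(Zlib::Deflate.deflate(str))
end

def decompress64(encoded)
  Zlib::Inflate.inflate(Base64.strict_decode64(encoded))
end

c = compress64("Hello world")  # around 28 printable chars: longer than the input
decompress64(c)                # => "Hello world"
```

Keep in mind that Base64 expands its input by a third (3 bytes become 4 characters), so the underlying compressor has to beat that ratio before you see any net saving.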
If your language is given, say English, then you can get away with leaving common characters out as long as each word stays unambiguous. For example, "Hello world" could become "Hll wrld" if your dictionary contains only Hello to match Hll and world to match wrld. Semitic languages like Arabic largely omit vowels in their written form, and people still manage to read them. Other rules, like when a word is supposed to be capitalized, can also be used to reduce the character set to lower-case characters (supposing that a given text follows these rules).
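A toy Ruby illustration of the idea, assuming a (hypothetical) shared dictionary in which every word stays unambiguous once its vowels are removed:

```ruby
DICTIONARY = %w[Hello world]  # hypothetical shared word list

def shrink(word)
  word.delete('aeiouAEIOU')
end

def strip_vowels(text)
  text.split.map { |w| shrink(w) }.join(' ')
end

def restore_vowels(text)
  lookup = DICTIONARY.group_by { |w| shrink(w) }
  # Fall back to the shrunken form for words not in the dictionary.
  text.split.map { |w| lookup.fetch(w, [w]).first }.join(' ')
end

strip_vowels("Hello world")  # => "Hll wrld"
restore_vowels("Hll wrld")   # => "Hello world"
```

Real vocabularies do collide ("bear" and "bore" both shrink to "br"), which is why the unambiguity caveat above matters.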
Also, while byte-wise compression works well for text in general, actual natural language can be compressed far better if you encode whole words, because the vocabulary size is very limited (even more limited if you look at a restricted set of texts). But that was not the question; I'm getting off-topic here.
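For completeness, a sketch of that word-level coding, assuming sender and receiver share the same (hypothetical) vocabulary of fewer than 256 words:

```ruby
WORDS = %w[Hello world how are you]  # shared vocabulary (an assumption)

def encode_words(text)
  # One byte per word; assumes every word appears in WORDS.
  text.split.map { |w| WORDS.index(w) }.pack('C*')
end

def decode_words(bytes)
  bytes.bytes.map { |i| WORDS[i] }.join(' ')
end

encode_words("Hello world").bytesize       # => 2
decode_words(encode_words("Hello world"))  # => "Hello world"
```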
Compress a string over an alphabet consisting of a-zA-Z into an alphabet consisting of a-zA-Z? I don't think that's possible. To reduce the length, you need to increase the available characters. If your input alphabet is limited to, say, a-zA-Z and your output can contain all 128 ASCII code points, you may be on to something...