10

I want to take an arbitrary string of ASCII text, like "Hello world", and compress it into a version with fewer characters (as few as possible), but in a way that it can be decompressed. The compressed version should be composed only of ascii characters. Is there a way to accomplish this, especially in Ruby?

5
  • You want to compress text in an alphabet consisting of a-zA-Z into an alphabet consisting of a-zA-Z? I don't think that's possible. To reduce the length, you need to increase the available characters. If your input alphabet is limited to, say, a-zA-Z and your output can contain all 255 ASCII codepoints, you may be on to something... Commented Jan 27, 2011 at 3:04
  • 2
    When you say that the compressed version should be composed only of ascii characters, do you mean that 0x00-0x19 chars are not allowed? If you take down the possible chars to A-Za-z0-9, you might be able to get 5 chars / int. But it won't be an ascii string anymore, though Commented Jan 27, 2011 at 3:07
  • @deceze If it couldn't be done, binary files could not be compressed (as thay are already 8 bits). It can be done, but you'll get an output shorter than the input only if you have (a sizable number of) repetitions and so a dictionary helps. Commented Jan 27, 2011 at 3:41
  • @belisarius Yes, you're right. Point accepted. It is possible, just hardly so with strings as short as "Hello world". Commented Jan 27, 2011 at 3:54
  • 1
    @deceze The cutoff point when your repetitions start to gain some benefit for you depends on the compression algorithm. But one thing is sure: you need repetitions comprising more than one char ... and Hello world is not the most beautiful example ... Commented Jan 27, 2011 at 4:00

4 Answers 4

9

If you know that only ASCII characters will be used, that is the 7 low order bits of every byte. With bit manipulation, you can mash every 8 bytes into 7 (12.5% savings). If you can get it into a smaller range (64 valid chars only), you can drop another byte.

However, because you want the compressed form to ALSO contain only ASCII characters, that loses you one byte - which goes back to square one unless your input can be restricted to 64-chars (e.g. lossy compression substituting some chars with others, storing only in lower case etc).

If your strings are not large (>1k), then there is minimal savings to be had using gzip/bzip2 etc because of the size of the headers. If you had a predefined dictionary to use as a Huffman table, you may get some compression but in other cases, you can get bloat against the original text.

Prior discussion on SO An efficient compression algorithm for short text strings

Sign up to request clarification or add additional context in comments.

1 Comment

The question, as I read it, is about compressing text where the output is also 7-bit ASCII. Removing the high bit isn't going to work as compression in that case.
4

There are many good text compression algorithms like Huffman encoding or LZW that are good at compressing text strings into bitstrings with many fewer bits than the standard ASCII encoding. Once you have such an encoding, you can always split the bitstring up into groups of seven bits to pack them into standard ASCII characters. I'm sure that there are libraries out there that do this, but I'm not much of a Ruby coder and don't know of any off the top of my head.

1 Comment

Unless you use a fixed Huffman table, the table size itself will likely "compress" into a larger size on short strings.
1

The simplest way to do this would be to compress it using a standard algorithm, then base64 encode the result. This isn't likely to help on a string as short as 'Hello world', though - at that size, there's very little you could do to decrease the size, unless all your strings have a similar restricted character set, or patterns that something like huffman encoding can take advantage of.

Comments

0

If your language is given, say english, then you can get away by leaving common characters out if your word stays unambiguous. For example, "Hello world" could become "Hll wrld" if your dictionary only contains Hello to match Hll and world to match wrld. Semitic languages like arabic actually have no vocals in their written language, and people still manage to read them. Also, other rules like when a word is supposed to be uppercase can be used to reduce the character set to lower case characters(supposing that a given text follows these rules).

Also, while byte-wise compression works well for texts, actual natural language can be far better compressed if you encode whole words, because the vocabulary size is very limited (even more limited if you look at a restricted set of texts). But that was not the question, i'm getting off-topic here.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.