4

I'm looking for an algorithm that would compress one string into another string (i.e. one without "\0" or special control characters), but I can't find anything on the internet. Is there such an algorithm? It doesn't have to be particularly efficient, just something basic.

8
  • see stackoverflow.com/questions/1138345/… Commented Sep 21, 2011 at 10:57
  • 5
    Any compression algorithm does exactly that. Commented Sep 21, 2011 at 10:59
  • @MAKKAM, this is not really related. He wants an algorithm for a specific kind of string (short strings) while I need something general. Commented Sep 21, 2011 at 10:59
  • try Huffman coding and RLE Commented Sep 21, 2011 at 11:20
  • 1
    @static_rtti, I'm looking for a compression algorithm that would output a string that can be copied and pasted. I think most compression algorithms output binary data with \0 characters. Commented Sep 21, 2011 at 11:29

4 Answers

8

Easy:

$ echo "Hello world" | gzip -c | base64
H4sIALnHeU4AA/NIzcnJVyjPL8pJ4QIA1eA5twwAAAA=

$ echo "H4sIALnHeU4AA/NIzcnJVyjPL8pJ4QIA1eA5twwAAAA=" | base64 -d | gzip -dc
Hello world

Note: here the output is actually longer than the input, so it looks like there is no compression, but for bigger data the compression ratio will be better :-)
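The same round trip works inside a program; here is a minimal Python sketch using the standard zlib and base64 modules (zlib rather than the gzip container, so the exact encoded text differs from the shell example above):

```python
# Compress to printable ASCII and back, using only the standard library.
import base64
import zlib

def compress_to_text(s: str) -> str:
    """Compress a string and encode the result as printable ASCII."""
    return base64.b64encode(zlib.compress(s.encode("utf-8"))).decode("ascii")

def decompress_from_text(t: str) -> str:
    """Reverse of compress_to_text."""
    return zlib.decompress(base64.b64decode(t)).decode("utf-8")
```

The base64 output contains no "\0" or control characters, so it can be copied and pasted freely.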


9 Comments

base64 is not as efficient as possible under the constraints the OP mentioned; there are over 90 printable ASCII characters, and even more in extended ASCII
Establishes the principle, though. You could substitute Ascii85, or some other scheme using even more printable characters, and the questioner says, "It doesn't have to be particularly efficient".
@harold, yes, might not be as space efficient as possible, but the OP said "It doesn't have to be particularly efficient, just something basic" - and that's it! Base64 is very basic, easy and portable. It is ready-to-use in many languages. And there's no need to implement anything else for yourself, no need to reinvent the wheel.
Does it have to be particularly inefficient? OK, it's not that bad, but it kind of hurts my "it could have been better" senses. If there weren't a compression stage first, it would have made sense to me to just base64 it, but now not so much.
I tried this, and had to add --ignore-garbage to base64 -d
4

Your requirement for no "special characters" is very restrictive, unless you can guarantee that a subset of characters (say "~") will never be used. Then you can use those characters to mark your compression:

~a -> the
~b -> The
~c -> and
~d -> And
~e -> Sirius Robotics Corporation Ltd.
etc.

Just add commonly used words to the codebook. The codebook can be fixed, as above, or vary with the text to be compressed. Either way the decompressing side will need access to the correct codebook to do the decompression.

ETA: You could use "~~" to denote the special character in the codebook, which would allow the special character to be used in the text to be compressed.
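For illustration, a small Python sketch of this codebook scheme; the codebook and the "~~" escape follow the answer, and the left-to-right scan in the decompressor is one way to keep the escape unambiguous:

```python
# Sketch of the codebook scheme from this answer: "~" marks a codebook
# entry, and a literal "~" in the input text is escaped as "~~".
CODEBOOK = {
    "a": "the", "b": "The", "c": "and", "d": "And",
    "e": "Sirius Robotics Corporation Ltd.",
}

def compress(text: str) -> str:
    out = text.replace("~", "~~")          # escape the marker first
    # substitute longer phrases first so short ones don't clobber them
    for code, phrase in sorted(CODEBOOK.items(), key=lambda kv: -len(kv[1])):
        out = out.replace(phrase, "~" + code)
    return out

def decompress(text: str) -> str:
    pieces, i = [], 0
    while i < len(text):
        if text[i] == "~":                 # marker: escape or codebook entry
            nxt = text[i + 1]
            pieces.append("~" if nxt == "~" else CODEBOOK[nxt])
            i += 2
        else:
            pieces.append(text[i])
            i += 1
    return "".join(pieces)
```

Note that this naive compress also matches phrases inside longer words ("then" becomes "~an"), which is still lossless but probably not what a real codebook wants; a production version would match on word boundaries.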


3

Apparently you have some specific character set in mind and you want to use it for both the original string and the compressed string.

Standard compression routines (e.g. gzip) work on byte strings.

One idea is to take existing code (e.g. gzip's) and rewrite it to use your character set instead of bytes.

Another is to construct a 1-to-1 mapping between strings in your character set and arbitrary byte strings, map the original string to a byte string, compress the byte string using a standard compression utility or function, and map the result back to a string using your character set. (Strictly speaking you can use two different mappings.)

One way to construct the mapping is to pad your character set with dummies and a special pad character until you have 2^k different characters (for some k); then each 8 of your characters correspond to k bytes (and shorter strings can be padded with the pad character).
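As a sketch of that packing, suppose the padded character set has 64 symbols (k = 6), so every 4 characters pack into exactly 3 bytes. This hypothetical Python version omits the answer's pad character and assumes the input length is a multiple of 4:

```python
# Hypothetical sketch of the 2^k mapping idea: a character set padded out
# to 64 symbols (k = 6). The pad character from the answer is omitted, so
# the input length must be a multiple of 4 (4 chars <-> 3 bytes).
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz0123456789+/")
INDEX = {c: i for i, c in enumerate(ALPHABET)}
K = 6  # bits per character, since len(ALPHABET) == 2**6

def chars_to_bytes(s: str) -> bytes:
    """Pack each character's 6-bit index into a byte string."""
    bits = nbits = 0
    out = bytearray()
    for c in s:
        bits = (bits << K) | INDEX[c]
        nbits += K
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    return bytes(out)

def bytes_to_chars(b: bytes) -> str:
    """Inverse mapping: unpack bytes back into 6-bit characters."""
    bits = nbits = 0
    out = []
    for byte in b:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= K:
            nbits -= K
            out.append(ALPHABET[(bits >> nbits) & (2**K - 1)])
    return "".join(out)
```

The byte string produced by chars_to_bytes is what you would hand to a standard compressor, and bytes_to_chars maps the compressed result back into the allowed character set.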


1

As far as I can tell, the most popular compression algorithm that allows standard C string-handling routines to be re-used to handle compressed text (i.e., carefully avoids putting any 0x00 bytes in the compressed string, except as the end-of-compressed-data marker) is simple byte-pair encoding, also called dual-tile encoding or DTE. DTE is often used to compress text in video game ROMs.

When the DTE decompressor prints out a DTE-compressed string, it reads one byte at a time from the compressed string and emits one or two bytes:

  • If the compressed byte B is in the range 0x01..0xFF, the decoder uses it as an index into the "dictionary" and prints out the one or two bytes stored in the dictionary at that index.
  • If the compressed byte B is 0x00, that's the end of the string -- done.

A typical DTE implementation has a hard-wired dictionary stored in both the encoder and the decoder something like this:

  • indexes of frequently-used letters -- perhaps the entire ASCII isprint() range 0x20 to 0x7e, and the newline character 0x0A -- represent themselves. (The compressed byte 'a' is decoded as the single letter 'a')
  • indexes from 0xc0 to 0xff: the byte is decoded into 2 characters: a space character, and a letter formed from this byte XORed with 0x80. (The compressed byte (0x80 xor 'a') is decoded into 2 characters, the space character and the letter 'a').
  • Any other available indexes ( 0x7f..0xbf ) store other common bigrams ("th", "re", etc.).
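A minimal Python sketch of such a DTE decoder, with a made-up bigram table (real ROMs hard-wire their own):

```python
# Sketch of a DTE decoder following the dictionary layout above.
# The bigram table is a hypothetical example for indexes 0x7f..0xbf.
BIGRAMS = {0x7F: "th", 0x80: "re", 0x81: "in"}

def dte_decode(data: bytes) -> str:
    out = []
    for b in data:
        if b == 0x00:                # end-of-compressed-data marker
            break
        elif 0xC0 <= b <= 0xFF:      # space + letter (letter = b XOR 0x80)
            out.append(" " + chr(b ^ 0x80))
        elif b in BIGRAMS:           # common bigram
            out.append(BIGRAMS[b])
        else:                        # printable byte represents itself
            out.append(chr(b))
    return "".join(out)
```

For example, dte_decode(bytes([0x7F, 0x65, 0xE1, 0x00])) yields "the a" under this table.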

