4

I'm looking for an algorithm that would compress one string into another string (i.e. one without "\0" or special control characters), but I can't find anything on the internet. Is there such an algorithm? It doesn't have to be particularly efficient, just something basic.

8
  • see stackoverflow.com/questions/1138345/… Commented Sep 21, 2011 at 10:57
  • 5
    Any compression algorithm does exactly that. Commented Sep 21, 2011 at 10:59
  • @MAKKAM, this is not really related. He wants an algorithm for a specific kind of string (short strings) while I need something general. Commented Sep 21, 2011 at 10:59
  • try Huffman coding and RLE Commented Sep 21, 2011 at 11:20
  • 1
    @static_rtti, I'm looking for a compression algorithm that would output a string that can be copied and pasted. I think most compression algorithms output binary data with \0 characters. Commented Sep 21, 2011 at 11:29

4 Answers

8

Easy:

$ echo "Hello world" | gzip -c | base64
H4sIALnHeU4AA/NIzcnJVyjPL8pJ4QIA1eA5twwAAAA=

$ echo "H4sIALnHeU4AA/NIzcnJVyjPL8pJ4QIA1eA5twwAAAA=" | base64 -d | gzip -dc
Hello world

Note: here the output is actually longer than the input, so it looks like there is no compression, but for bigger data the compression ratio will be better :-)
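The same round trip works inside a program; here is a minimal Python sketch using the standard zlib and base64 modules (zlib rather than the gzip container, so the exact encoded text differs from the shell example above):

```python
# Compress to printable ASCII and back, using only the standard library.
import base64
import zlib

def compress_to_text(s: str) -> str:
    """Compress a string and encode the result as printable ASCII."""
    return base64.b64encode(zlib.compress(s.encode("utf-8"))).decode("ascii")

def decompress_from_text(t: str) -> str:
    """Reverse of compress_to_text."""
    return zlib.decompress(base64.b64decode(t)).decode("utf-8")
```

The base64 output contains no "\0" or control characters, so it can be copied and pasted freely.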


9 Comments

base64 is not as efficient as possible under the constraints the OP mentioned; there are over 90 printable ASCII characters, and even more in extended ASCII
Establishes the principle, though. You could substitute Ascii85, or some other scheme using even more printable characters, and the questioner says, "It doesn't have to be particularly efficient".
@harold, yes, might not be as space efficient as possible, but the OP said "It doesn't have to be particularly efficient, just something basic" - and that's it! Base64 is very basic, easy and portable. It is ready-to-use in many languages. And there's no need to implement anything else for yourself, no need to reinvent the wheel.
Does it have to be particularly inefficient? OK, it's not that bad, but it kind of hurts my "it could have been better" senses. If there weren't a compression stage first, it would have made sense to me to just base64 it, but now not so much.
I tried this, and had to add --ignore-garbage to base64 -d
4

Your requirement for no "special characters" is very restrictive, unless you can guarantee that a subset of characters (say "~") will never be used. Then you can use those characters to mark your compression:

~a -> the
~b -> The
~c -> and
~d -> And
~e -> Sirius Robotics Corporation Ltd.
etc.

Just add commonly used words to the codebook. The codebook can be fixed, as above, or vary with the text to be compressed. Either way the decompressing side will need access to the correct codebook to do the decompression.

ETA: You could use "~~" to denote the special character in the codebook, which would allow the special character to be used in the text to be compressed.
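For illustration, a small Python sketch of this codebook scheme; the codebook and the "~~" escape follow the answer, and the left-to-right scan in the decompressor is one way to keep the escape unambiguous:

```python
# Sketch of the codebook scheme from this answer: "~" marks a codebook
# entry, and a literal "~" in the input text is escaped as "~~".
CODEBOOK = {
    "a": "the", "b": "The", "c": "and", "d": "And",
    "e": "Sirius Robotics Corporation Ltd.",
}

def compress(text: str) -> str:
    out = text.replace("~", "~~")          # escape the marker first
    # substitute longer phrases first so short ones don't clobber them
    for code, phrase in sorted(CODEBOOK.items(), key=lambda kv: -len(kv[1])):
        out = out.replace(phrase, "~" + code)
    return out

def decompress(text: str) -> str:
    pieces, i = [], 0
    while i < len(text):
        if text[i] == "~":                 # marker: escape or codebook entry
            nxt = text[i + 1]
            pieces.append("~" if nxt == "~" else CODEBOOK[nxt])
            i += 2
        else:
            pieces.append(text[i])
            i += 1
    return "".join(pieces)
```

Note that this naive compress also matches phrases inside longer words ("then" becomes "~an"), which is still lossless but probably not what a real codebook wants; a production version would match on word boundaries.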


3

Apparently you have some specific character set in mind and you want to use it for both the original string and the compressed string.

Standard compression routines (e.g. gzip) work on byte strings.

One idea is to take existing code (e.g. gzip's) and rewrite it to use your character set instead of bytes.

Another is to construct a 1-to-1 mapping between strings in your character set and arbitrary byte strings, map the original string to a byte string, compress the byte string using a standard compression utility or function, and map the result back to a string using your character set. (Strictly speaking you can use two different mappings.)

One way to construct the mapping is to pad your character set with dummies and a special pad character until you have 2^k different characters (for some k); then each 8 of your characters correspond to k bytes (and shorter strings can be padded with the pad character).
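As a sketch of that packing, suppose the padded character set has 64 symbols (k = 6), so every 4 characters pack into exactly 3 bytes. This hypothetical Python version omits the answer's pad character and assumes the input length is a multiple of 4:

```python
# Hypothetical sketch of the 2^k mapping idea: a character set padded out
# to 64 symbols (k = 6). The pad character from the answer is omitted, so
# the input length must be a multiple of 4 (4 chars <-> 3 bytes).
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz0123456789+/")
INDEX = {c: i for i, c in enumerate(ALPHABET)}
K = 6  # bits per character, since len(ALPHABET) == 2**6

def chars_to_bytes(s: str) -> bytes:
    """Pack each character's 6-bit index into a byte string."""
    bits = nbits = 0
    out = bytearray()
    for c in s:
        bits = (bits << K) | INDEX[c]
        nbits += K
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    return bytes(out)

def bytes_to_chars(b: bytes) -> str:
    """Inverse mapping: unpack bytes back into 6-bit characters."""
    bits = nbits = 0
    out = []
    for byte in b:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= K:
            nbits -= K
            out.append(ALPHABET[(bits >> nbits) & (2**K - 1)])
    return "".join(out)
```

The byte string produced by chars_to_bytes is what you would hand to a standard compressor, and bytes_to_chars maps the compressed result back into the allowed character set.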


1

As far as I can tell, the most popular compression algorithm that allows standard C string-handling routines to be re-used to handle compressed text (i.e., carefully avoids putting any 0x00 bytes in the compressed string, except as the end-of-compressed-data marker) is simple byte-pair encoding, also called dual-tile encoding or DTE. DTE is often used to compress text in video game ROMs.

When the DTE decompressor prints out a DTE-compressed string, it reads one byte at a time from the compressed string and emits one or two bytes:

  • If the compressed byte B is in the range 0x01..0xFF, the decoder uses it as an index into the "dictionary" and prints out the one or two bytes stored in the dictionary at that index.
  • If the compressed byte B is 0x00, that's the end of the string -- done.

A typical DTE implementation has a hard-wired dictionary stored in both the encoder and the decoder something like this:

  • indexes of frequently-used letters -- perhaps the entire ASCII isprint() range 0x20 to 0x7e, and the newline character 0x0A -- represent themselves. (The compressed byte 'a' is decoded as the single letter 'a')
  • indexes from 0xc0 to 0xff: the byte is decoded into 2 characters: a space character, and a letter formed from this byte XORed with 0x80. (The compressed byte (0x80 xor 'a') is decoded into 2 characters, the space character and the letter 'a').
  • Any other available indexes ( 0x7f..0xbf ) store other common bigrams ("th", "re", etc.).
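A minimal Python sketch of such a DTE decoder, with a made-up bigram table (real ROMs hard-wire their own):

```python
# Sketch of a DTE decoder following the dictionary layout above.
# The bigram table is a hypothetical example for indexes 0x7f..0xbf.
BIGRAMS = {0x7F: "th", 0x80: "re", 0x81: "in"}

def dte_decode(data: bytes) -> str:
    out = []
    for b in data:
        if b == 0x00:                # end-of-compressed-data marker
            break
        elif 0xC0 <= b <= 0xFF:      # space + letter (letter = b XOR 0x80)
            out.append(" " + chr(b ^ 0x80))
        elif b in BIGRAMS:           # common bigram
            out.append(BIGRAMS[b])
        else:                        # printable byte represents itself
            out.append(chr(b))
    return "".join(out)
```

For example, dte_decode(bytes([0x7F, 0x65, 0xE1, 0x00])) yields "the a" under this table.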

