Skip to main content
added 299 characters in body
Source Link
finnw
  • 48.8k
  • 24
  • 150
  • 223

Any algorithm/library that supports a preset dictionary, e.g. zlib.

This way you can prime the compressor with the same kind of text that is likely to appear in the input. If the files are similar in some way (e.g. all URLs, all C programs, all StackOverflow posts, all ASCII-art drawings) then certain substrings will appear in most or all of the input files.

Moving these common substrings into the preset dictionaryEvery compression algorithm will sharesave space if the costsame substring is repeated multiple times in one input file (e.g. "the" in English text or "int" in C code.)

But in the case of storing these substrings among all filesURLs certain strings (e.g. "http://www.", ".com", ".html", ".aspx" will typically appear once in each input file. So you need to share them between files somehow rather than repeating them for everyhaving one compressed occurrence per file. Placing them in a preset dictionary will achieve this.

Any algorithm/library that supports a preset dictionary, e.g. zlib.

This way you can prime the compressor with the same kind of text that is likely to appear in the input. If the files are similar in some way (e.g. all C programs, all StackOverflow posts, all ASCII-art drawings) then certain substrings will appear in most or all of the input files.

Moving these common substrings into the preset dictionary will share the cost of storing these substrings among all files, rather than repeating them for every file.

Any algorithm/library that supports a preset dictionary, e.g. zlib.

This way you can prime the compressor with the same kind of text that is likely to appear in the input. If the files are similar in some way (e.g. all URLs, all C programs, all StackOverflow posts, all ASCII-art drawings) then certain substrings will appear in most or all of the input files.

Every compression algorithm will save space if the same substring is repeated multiple times in one input file (e.g. "the" in English text or "int" in C code.)

But in the case of URLs certain strings (e.g. "http://www.", ".com", ".html", ".aspx" will typically appear once in each input file. So you need to share them between files somehow rather than having one compressed occurrence per file. Placing them in a preset dictionary will achieve this.

Source Link
finnw
  • 48.8k
  • 24
  • 150
  • 223

Any algorithm/library that supports a preset dictionary, e.g. zlib.

This way you can prime the compressor with the same kind of text that is likely to appear in the input. If the files are similar in some way (e.g. all C programs, all StackOverflow posts, all ASCII-art drawings) then certain substrings will appear in most or all of the input files.

Moving these common substrings into the preset dictionary will share the cost of storing these substrings among all files, rather than repeating them for every file.