0

I have a list of short strings that I would like to compress quickly. What is a good approach? I don't know of any special properties of the strings, other than that there are ~13 million of them, each 5 to 30 characters long.

Update: From the comments: the strings are sent over a network and used for a join, so I don't know their specific properties. Order doesn't matter, and I am sending them in bulk.

8
  • 2
    Strings of what? Lower-case English words? Mixed case? Random alphanumeric characters? Random 8-bit bytes? Commented Mar 15, 2018 at 13:58
  • Have a look at Huffman Encoding. Commented Mar 15, 2018 at 14:21
  • 2
    If they really don't have any other properties, then they're totally random, and nothing will work. If they have at least some nonrandomness, then probably a fast general-purpose compressor like gzip will find most of it. You also don't say if you care about the order of the strings; if you don't, you might be able to save a bit more space with a good ordering (e.g. lexicographical will at least tend to put strings with the same prefix near each other). Commented Mar 15, 2018 at 14:24
  • 3
    If you don't have any operations you need the compressed form for, you can semi-instantaneously compress them to zero memory occupancy by just deleting them: state what the compressed form shall be useful for. Not knowing properties of input data before processing is (luckily) different from lack thereof. Commented Mar 15, 2018 at 14:39
  • 1
    Don't reply with comments to comments asking for additional information or clarification: edit your question. Especially for essential information like send them over a network, bulk decompress or [decompressed] order doesn't matter. Commented Mar 15, 2018 at 20:27

1 Answer

1

From your comments, you don't need to be able to decompress an individual small string.
Sorting the strings before applying whichever standard compression/decompression method is easiest for you to use should go a long way.
Measure the difference in effect; a report would be welcome!
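
A minimal sketch of such a measurement, in Python with the standard `zlib` module. The helper names (`pack`, `unpack`, `make_sample`), the newline separator, and the random sample data are all assumptions made for illustration; random alphanumeric strings are close to a worst case, so real join keys may well behave differently.

```python
import random
import zlib


def pack(strings, sort_first):
    """Join the strings with newlines and compress the result with zlib.

    Assumes no string contains a newline and no string is empty.
    """
    items = sorted(strings) if sort_first else list(strings)
    return zlib.compress("\n".join(items).encode("utf-8"), 6)


def unpack(blob):
    """Inverse of pack: returns the strings in the order they were packed."""
    text = zlib.decompress(blob).decode("utf-8")
    return text.split("\n") if text else []


def make_sample(n, seed=0):
    """Hypothetical stand-in data: n random strings of 5 to 30 characters."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
    return [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(5, 30)))
        for _ in range(n)
    ]


if __name__ == "__main__":
    sample = make_sample(20_000)
    raw_size = len("\n".join(sample).encode("utf-8"))
    print("raw bytes:       ", raw_size)
    print("unsorted + zlib: ", len(pack(sample, sort_first=False)))
    print("sorted + zlib:   ", len(pack(sample, sort_first=True)))
```

Run it on a representative slice of the real data rather than on the synthetic sample, and time it as well if speed matters more than size.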

"As compressed as possible" is as dangerous as any "optimisation".
Fix a goal upfront, along with a way to tell "not there yet" from "good enough", and move on once it is achieved.
