11

I have more than 100 million unique strings (a VARCHAR(100) UNIQUE column in a MySQL database). I currently use the code below to create a unique hash from each of them (VARCHAR(32) UNIQUE) in order to reduce the index size of the InnoDB table (a unique index on a VARCHAR(100) field is roughly 3 times larger than one on a VARCHAR(32) field).

id = hashlib.md5(str).hexdigest()

Is there any other method to create shorter ids from those strings while keeping reasonable uniqueness guarantees?

  • SHA1? Note that you can make the strings shorter still by using a base-64 version of the digest rather than a hex one: base64.b64encode(hashlib.md5("foo").digest()) Commented Jun 19, 2012 at 6:23
  • You could use a BINARY(16) column to store the MD5 hash, without either hex or base64 encoding. Commented Jun 19, 2012 at 6:53

4 Answers

13

You can save it as an integer:

id_ = int(hashlib.md5(your_str).hexdigest(), 16)

Or as a binary string:

id_ = hashlib.md5(your_str).digest()
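
For readers on Python 3, where hashlib.md5 expects bytes rather than text, the two forms above could be sketched as follows. The helper names md5_as_int and md5_as_bytes and the UTF-8 encoding are assumptions made for this illustration, not part of the original answer.

```python
import hashlib


def md5_as_int(text: str) -> int:
    """Return the MD5 digest of text as a 128-bit integer."""
    # Assumption: the strings are encoded as UTF-8 before hashing.
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def md5_as_bytes(text: str) -> bytes:
    """Return the raw 16-byte MD5 digest of text."""
    return hashlib.md5(text.encode("utf-8")).digest()


key = "example string"  # placeholder input
as_int = md5_as_int(key)
as_bytes = md5_as_bytes(key)

# Both forms carry the same 128 bits of information.
assert as_int == int.from_bytes(as_bytes, "big")
assert len(as_bytes) == 16
```

Note that a 128-bit integer is wider than MySQL's 64-bit BIGINT, so the binary form in a BINARY(16) column, as suggested in the comments on the question, is probably the more practical of the two.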


3

One crude way is to compute the MD5 and keep only the first 16 hex characters instead of all 32. That still leaves 64 bits, so collisions should remain rare and you keep a reasonable uniqueness guarantee.
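
A minimal sketch of this truncation, assuming Python 3 and UTF-8 input: the first 16 hex characters are the first 8 bytes of the digest, which fit in an unsigned 64-bit integer (for example a BIGINT UNSIGNED column). The function name md5_64 is hypothetical.

```python
import hashlib


def md5_64(text: str) -> int:
    """Return the first 64 bits of the MD5 digest as an unsigned integer."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    # The first 8 bytes correspond to the first 16 hex characters.
    return int.from_bytes(digest[:8], "big")


key = "example string"  # placeholder input
short_id = md5_64(key)

# Same value as truncating the hex digest to 16 characters.
assert short_id == int(hashlib.md5(key.encode("utf-8")).hexdigest()[:16], 16)
assert 0 <= short_id < 2**64
```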


2

The simplest solution is to convert the hexadecimal data (your digests are in base 16) to something else, e.g. base 64.

If you can accept a somewhat higher risk, you can use only, e.g., the first ten hexadecimal digits of the digest. That gives you 16**10 (more than 10**12) possibilities instead of 16**32 (more than 10**38), but it is still huge, and it is a commonly used technique (Git and GitHub usually use 7 digits for identifying commits, afair).
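
How much risk a given truncation carries depends on the number of keys. A rough way to check is the standard birthday approximation, sketched below under the assumption that the hash output is uniformly distributed; the function names are made up for this example. For the 100 million keys in the question, it suggests that 40 bits (ten hex digits) would almost certainly produce collisions, 64 bits gives a probability on the order of 10**-4, and the full 128 bits is negligible.

```python
import math


def collision_probability(n_keys: int, bits: int) -> float:
    """Approximate chance of at least one collision among n_keys hashes.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    space = 2.0 ** bits
    return -math.expm1(-n_keys * (n_keys - 1) / (2.0 * space))


def expected_colliding_pairs(n_keys: int, bits: int) -> float:
    """Expected number of key pairs that share the same truncated hash."""
    return n_keys * (n_keys - 1) / (2.0 * 2.0 ** bits)


n = 100_000_000  # number of keys from the question
for bits in (40, 64, 128):
    print(bits, collision_probability(n, bits), expected_colliding_pairs(n, bits))
```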

1 Comment

Btw. MD5 encoded using base64 takes 22 characters and you have the same information as with raw MD5. If going from 32 to 22 characters satisfies you, then this is ok.
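
As a sketch of where the 22 comes from (assuming Python 3 and UTF-8 input; md5_b64 is a hypothetical helper name): the 16-byte digest encodes to 24 base64 characters, of which the last two are '=' padding that can be dropped.

```python
import base64
import hashlib


def md5_b64(text: str) -> str:
    """Return the MD5 digest of text as 22 unpadded base64 characters."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    # 16 bytes -> 24 base64 characters ending in "=="; strip the padding.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


short_id = md5_b64("foo")
assert len(short_id) == 22
```

One caveat: base64 is case-sensitive, so if the column uses a case-insensitive collation, two different ids could compare as equal. A binary or case-sensitive collation avoids that.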
0

Since hashing and compression are very similar, an obvious solution is to use a compression algorithm to compress your keys. This will also preserve the uniqueness of the keys.

2 Comments

Can you suggest a compression algorithm that can achieve the requested compression ratio on such short inputs?
Please check this post and this one.
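
To put the first comment's concern in concrete terms, here is a small experiment with zlib from the Python standard library. The sample key is invented for illustration and zlib is only one possible compressor. General-purpose compressors add framing (zlib uses a 2-byte header and a 4-byte checksum), and a short string has little redundancy to exploit, so the result is unlikely to get anywhere near the 16 bytes of a raw MD5 digest.

```python
import zlib


def compressed_size(text: str):
    """Return (original_size, compressed_size) in bytes for a UTF-8 string."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, 9)  # level 9 = maximum compression
    # Unlike hashing, compression is lossless, so the key can be recovered.
    assert zlib.decompress(packed) == raw
    return len(raw), len(packed)


sample = "user/4f9a:session-2012-06-19T06:23:00Z"  # hypothetical short key
original, packed = compressed_size(sample)
print(original, packed)
```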
