9

I need to produce unique identifiers that can be used in filenames and can be reproduced given the same input values. I need to produce millions of these identifiers as the source input has millions of combinations.

For simplicity's sake, I will use a small set in the example, but the actual sets can be rather large (hundreds, maybe thousands, of items); larger than could be manually encoded into a filename.

I noticed that the 5th method of generating UUID's allows you to provide a string input.

> input_set = {'apple', 'banana', 'orange'}
> uuid.uuid5(uuid.NAMESPACE_URL, pickle.dumps(input_set)).hex
'f39926529ad45997984643816c1bc403'

The documentation says it uses SHA1 under the hood. Is the risk of a collision too high? Is there a better way of reliably hashing unique identifiers?

14
  • 2
    Here's one resource that addresses the question of UUID collisions: en.wikipedia.org/wiki/Universally_unique_identifier#Collisions Commented Dec 1, 2017 at 21:17
  • Thanks @blurp though that only deals with version 1 and 2 of generating UUID's. I'm looking to use version 5 which generates them from a input string and namespace identifier. Commented Dec 1, 2017 at 21:23
  • 1
    The Wikipedia article talks specifically about the chances of a collision with versions 3, 4, and 5. Commented Dec 1, 2017 at 21:25
  • 1
    @BrendanAbel: What kinds of strings? Are they all known in advance? All hash functions are guaranteed to have collisions. While the chances of a collision are extremely low, there might be ways to guarantee that you will never have collisions if you can make assumptions about your input data. Commented Dec 1, 2017 at 21:55
  • 1
    @BrendanAbel: If you can enumerate all the valid combinations and bijectively map sets of strings to their corresponding index, you would only need 20 bits (or 4 alphanumeric characters). If you have only a million valid inputs, it's probably much easier to just use a cryptographic hash function and manually verify that there are no collisions. Since it takes thousands of years of processor time to deliberately find one, I doubt you'll accidentally find one. Commented Dec 2, 2017 at 1:16

3 Answers 3

10

The odds that you'd get an SHA1 collision from strings is astoundingly low. Currently there are less than 63 known collisions for SHA1.

First ever SHA1 collision found

First ever' SHA-1 hash collision calculated. All it took were five clever brains... and 6,610 years of processor time

SHA1 is no longer considered secure in the cryptography world, but certainly exceeds your expectations here.

Cryptographic hashing functions are designed to be one way functions.This means the functions inverse is "hard" to calculate. (i.e. knowing the output in no way helps you determine the input) As Blender pointed out in the comments this has nothing to do with the chance of collisions.

Take a look at the Birthday Paradox for some basic information on how the probability of a collision is calculated.

This question addresses the likely hood of a SHA1 collision. This article states

A cryptographic hash function has provable security against collision attacks if finding collisions is provably polynomial-time reducible from problem P which is supposed to be unsolvable in polynomial time. The function is then called provably secure, or just provable.

Here is a list of "secure" hash algorithms.

UPDATE You stated in the comments your input is much larger than the 160 bit limit for SHA1. I recommend you use SHA3 in this case as there is no limit on the size of your input. Check out the Python documentation for more information.

Here is a basic example:

import sha3
k = sha3.keccak_512()
k.update(b"data")
k.hexdigest()
'1065aceeded3a5e4412e2187e919bffeadf815f5bd73d37fe00d384fe29f55f08462fdabe1007b993ce5b8119630e7db93101d9425d6e352e22ffe3dcb56b825'
Sign up to request clarification or add additional context in comments.

6 Comments

Thanks. Though I want to point out that I'm not using filenames as the source input, I'm using very large strings; much larger than the 160 bit SHA1 or the 128 bit UUID strings.
Collisions are guaranteed to exist for any hash function, regardless of the existence of one-way functions.
If the names are longer than maybe consider using a different hash algorithm. Some can handle strings of arbitrary length.
@BrendanAbel: SHA-1's output is 160 bits. Every hash function has a fixed-length output, that's why they're useful. They can take any amount of data and produce a fixed-length hash derived from it.
|
8

Instead of using pysha3 (see DoesData's answer), you could also use the built-in module hashlib:

import hashlib

h = hashlib.sha3_512() # Python 3.6+
h.update(b"Hello World")
h.hexdigest()

Output:

'3d58a719c6866b0214f96b0a67b37e51a91e233ce0be126a08f35fdf4c043c6126f40139bfbc338d44eb2a03de9f7bb8eff0ac260b3629811e389a5fbee8a894'

1 Comment

0

If the smaller base64.urlsafe_b64encode output would be preferable:

> import base64, hashlib

> base64.urlsafe_b64encode(hashlib.sha3_512('asdf'.encode()).digest())
b'jYjPWyD1Os164UebWzbcICF1OwSZAsdyR7snsTGzAL08qL7vKHVtzie4mQhnxFd6JTXn47dRQTmcoalMyEsOuQ=='

The above output is of length 88 whereas the corresponding hex would be of length 128.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.