Safest way to generate a unique hash in Python

Question

I need to produce unique identifiers that can be used in filenames and can be reproduced given the same input values. I need to produce millions of these identifiers as the source input has millions of combinations.

For simplicity's sake, I will use a small set in the example, but the actual sets can be rather large (hundreds, maybe thousands, of items); larger than could be manually encoded into a filename.

I noticed that the 5th method of generating UUID's allows you to provide a string input.

> input_set = {'apple', 'banana', 'orange'}
> uuid.uuid5(uuid.NAMESPACE_URL, pickle.dumps(input_set)).hex
'f39926529ad45997984643816c1bc403'

The documentation says it uses SHA1 under the hood. Is the risk of a collision too high? Is there a better way of reliably hashing unique identifiers?

Here's one resource that addresses the question of UUID collisions: en.wikipedia.org/wiki/Universally_unique_identifier#Collisions — user8651755
– user8651755, Commented Dec 1, 2017 at 21:17
Thanks @blurp though that only deals with version 1 and 2 of generating UUID's. I'm looking to use version 5 which generates them from a input string and namespace identifier. — Brendan Abel
– Brendan Abel, Commented Dec 1, 2017 at 21:23
The Wikipedia article talks specifically about the chances of a collision with versions 3, 4, and 5. — user8651755
– user8651755, Commented Dec 1, 2017 at 21:25
@BrendanAbel: What kinds of strings? Are they all known in advance? All hash functions are guaranteed to have collisions. While the chances of a collision are extremely low, there might be ways to guarantee that you will never have collisions if you can make assumptions about your input data. — Blender
– Blender, Commented Dec 1, 2017 at 21:55
@BrendanAbel: If you can enumerate all the valid combinations and bijectively map sets of strings to their corresponding index, you would only need 20 bits (or 4 alphanumeric characters). If you have only a million valid inputs, it's probably much easier to just use a cryptographic hash function and manually verify that there are no collisions. Since it takes thousands of years of processor time to deliberately find one, I doubt you'll accidentally find one. — Blender
– Blender, Commented Dec 2, 2017 at 1:16

Brendan Abel · Accepted Answer · 2021-04-13 14:00:14Z

10

The odds that you'd get an SHA1 collision from strings is astoundingly low. Currently there are less than 63 known collisions for SHA1.

First ever SHA1 collision found

First ever' SHA-1 hash collision calculated. All it took were five clever brains... and 6,610 years of processor time

SHA1 is no longer considered secure in the cryptography world, but certainly exceeds your expectations here.

Cryptographic hashing functions are designed to be one way functions.This means the functions inverse is "hard" to calculate. (i.e. knowing the output in no way helps you determine the input) As Blender pointed out in the comments this has nothing to do with the chance of collisions.

Take a look at the Birthday Paradox for some basic information on how the probability of a collision is calculated.

This question addresses the likely hood of a SHA1 collision. This article states

A cryptographic hash function has provable security against collision attacks if finding collisions is provably polynomial-time reducible from problem P which is supposed to be unsolvable in polynomial time. The function is then called provably secure, or just provable.

Here is a list of "secure" hash algorithms.

UPDATE You stated in the comments your input is much larger than the 160 bit limit for SHA1. I recommend you use SHA3 in this case as there is no limit on the size of your input. Check out the Python documentation for more information.

Here is a basic example:

import sha3
k = sha3.keccak_512()
k.update(b"data")
k.hexdigest()
'1065aceeded3a5e4412e2187e919bffeadf815f5bd73d37fe00d384fe29f55f08462fdabe1007b993ce5b8119630e7db93101d9425d6e352e22ffe3dcb56b825'

edited Apr 13, 2021 at 14:00

Brendan Abel

38k15 gold badges101 silver badges126 bronze badges

answered Dec 1, 2017 at 21:42

DoesData

7,1174 gold badges45 silver badges65 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Brendan Abel Over a year ago

Thanks. Though I want to point out that I'm not using filenames as the source input, I'm using very large strings; much larger than the 160 bit SHA1 or the 128 bit UUID strings.

Blender Over a year ago

Collisions are guaranteed to exist for any hash function, regardless of the existence of one-way functions.

DoesData Over a year ago

If the names are longer than maybe consider using a different hash algorithm. Some can handle strings of arbitrary length.

Blender Over a year ago

@BrendanAbel: SHA-1's output is 160 bits. Every hash function has a fixed-length output, that's why they're useful. They can take any amount of data and produce a fixed-length hash derived from it.

Charlie Parker Over a year ago

is md5 ok? e.g. stackoverflow.com/questions/22974499/…

|

Thomas · Accepted Answer · 2024-01-03 18:06:10Z

8

Instead of using pysha3 (see DoesData's answer), you could also use the built-in module hashlib:

import hashlib

h = hashlib.sha3_512() # Python 3.6+
h.update(b"Hello World")
h.hexdigest()

Output:

'3d58a719c6866b0214f96b0a67b37e51a91e233ce0be126a08f35fdf4c043c6126f40139bfbc338d44eb2a03de9f7bb8eff0ac260b3629811e389a5fbee8a894'

edited Jan 3, 2024 at 18:06

answered Apr 13, 2021 at 13:42

Thomas

10.8k17 gold badges54 silver badges103 bronze badges

1 Comment

Charlie Parker Over a year ago

is md5 ok? e.g. stackoverflow.com/questions/22974499/…

Asclepius · Accepted Answer · 2022-11-24 02:39:02Z

0

If the smaller base64.urlsafe_b64encode output would be preferable:

> import base64, hashlib

> base64.urlsafe_b64encode(hashlib.sha3_512('asdf'.encode()).digest())
b'jYjPWyD1Os164UebWzbcICF1OwSZAsdyR7snsTGzAL08qL7vKHVtzie4mQhnxFd6JTXn47dRQTmcoalMyEsOuQ=='

The above output is of length 88 whereas the corresponding hex would be of length 128.

answered Nov 24, 2022 at 2:39

Asclepius

64.6k20 gold badges188 silver badges164 bronze badges

Collectives™ on Stack Overflow

Safest way to generate a unique hash in Python

3 Answers 3

6 Comments

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

6 Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related