
I am parsing data from multiple sources and I want to assign a unique (string) id to each entry. Each entry contains a title (string), a url (string) and a body (string). The same title can come from multiple sources, but those entries will have different urls, and in that case I would like to store both items. I am thinking of creating a hash of the title and url and assigning that as the id. That way, if I get the same title and url from different sources, the id will be the same and I will be able to identify the entry as a duplicate.

import hashlib
title, url = "Some title", "https://example.com/item"  # placeholder values
hashlib.sha256(f"{title} {url}".encode('utf-8')).hexdigest()

But I think there can be a case where 2 different title/url combinations generate the same hash, and I am not sure how to handle that clash. Can someone suggest a way of generating a unique identifier from strings? I don't want to use a timestamp, because I might get the same row from different sources at different times.

  • Well, you can just check `if hash not in hashes:`. Commented Apr 13, 2021 at 8:20
  • What about combining title + url + body, instead of the title alone, for generating hashes? Commented Apr 13, 2021 at 8:21
  • No, you won't. There is no realistic way you get a SHA256 collision; it is a cryptographic hash function, so you're safe. Commented Apr 13, 2021 at 8:21

1 Answer

You're safe: in practice you won't have 2 different title/url combinations generating the same hash with SHA-256.
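One detail worth guarding against is ambiguity in how the two fields are joined before hashing: with a plain space or no separator, ("ab", "c") and ("a", "bc") could produce the same input string. A minimal sketch, assuming the fields are serialized as a JSON list first (the helper name `make_id` and the sample values are hypothetical, not from the question):

```python
import hashlib
import json


def make_id(title: str, url: str) -> str:
    """Return a deterministic hex id for a (title, url) pair.

    Hypothetical helper: JSON-encoding the pair keeps the field
    boundary unambiguous before the bytes are hashed.
    """
    payload = json.dumps([title, url])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same title and url from two sources -> same id (detected as duplicate).
id_a = make_id("Some title", "https://example.com/item")
id_b = make_id("Some title", "https://example.com/item")

# Same title, different url -> different id (both entries are stored).
id_c = make_id("Some title", "https://example.org/other")
```

Storing the ids in a set or as a dictionary key then makes the duplicate check a simple membership test.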


SHA256 is a cryptographic hash function from the SHA-2 hash family, a standard first published by NIST in 2001.

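For two given distinct inputs, the probability that they produce the same output is about 1/(2^256). Because of the birthday bound, you would need on the order of 2^128 hashed inputs before a collision anywhere in the set becomes likely, and 1/(2^128) is about 3e-39.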
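To put a number on this for a particular dataset size, the standard birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) can be evaluated directly. This is a rough sketch that assumes the hash behaves like a uniform random function over b bits; the function name `collision_probability` is hypothetical:

```python
import math


def collision_probability(n_items: int, hash_bits: int = 256) -> float:
    """Approximate chance that any two of n_items share a hash value.

    Uses the birthday approximation and assumes uniformly
    distributed hash outputs of hash_bits bits.
    """
    exponent = -n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(exponent)


# One billion entries hashed with SHA-256.
p_billion = collision_probability(10**9)

# Around 2^128 entries the probability is no longer negligible.
p_birthday = collision_probability(2**128)
```

For a billion entries the result is many orders of magnitude below anything that matters in practice, so the hash can be used as the id without a separate collision-handling path.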


See: SHA-256 collisions on crypto.stackexchange
