How would you convert arbitrary strings into unique integers, which would be the same across Python sessions and platforms? hash('my string') wouldn't work because a different value is returned for each Python session and platform.
-
1It would be helpful if you could clarify if you want a guarantee of uniqueness, or if you're satisfied with a high probability of uniqueness, as with a hash function. The fact that you're talking about hash() suggests the latter...? Do you need to be able to invert the mapping, or not?user1142217– user11422172021-07-05 14:13:02 +00:00Commented Jul 5, 2021 at 14:13
5 Answers
Use a hash algorithm such as MD5 or SHA1, then convert the hexdigest via int():
>>> import hashlib
>>> int(hashlib.md5('Hello, world!'.encode('utf-8')).hexdigest(), 16)
144653930895353261282233826065192032313L
7 Comments
int(hashlib.md5('Hello, world!').hexdigest()[:8], 16) will be < 232, int(hashlib.md5('Hello, world!').hexdigest()[:16], 16) will be < 264.TypeError: Unicode-objects must be encoded before hashing you need to encode your string, for example: int(hashlib.md5('Hello, world!'.encode('utf-8')).hexdigest(), 16)If a hash function really won't work for you, you can turn the string into a number.
my_string = 'my string'
def string_to_int(s):
ord3 = lambda x : '%.3d' % ord(x)
return int(''.join(map(ord3, s)))
In[10]: string_to_int(my_string)
Out[11]: 109121032115116114105110103L
This is invertible, by mapping each triplet through chr.
def int_to_string(n)
s = str(n)
return ''.join([chr(int(s[i:i+3])) for i in range(0, len(s), 3)])
In[12]: int_to_string(109121032115116114105110103L)
Out[13]: 'my string'
1 Comment
Here are my python27 implementation for algorithms listed here: http://www.cse.yorku.ca/~oz/hash.html. No idea if they are efficient or not.
from ctypes import c_ulong
def ulong(i): return c_ulong(i).value # numpy would be better if available
def djb2(L):
"""
h = 5381
for c in L:
h = ((h << 5) + h) + ord(c) # h * 33 + c
return h
"""
return reduce(lambda h,c: ord(c) + ((h << 5) + h), L, 5381)
def djb2_l(L):
return reduce(lambda h,c: ulong(ord(c) + ((h << 5) + h)), L, 5381)
def sdbm(L):
"""
h = 0
for c in L:
h = ord(c) + (h << 6) + (h << 16) - h
return h
"""
return reduce(lambda h,c: ord(c) + (h << 6) + (h << 16) - h, L, 0)
def sdbm_l(L):
return reduce(lambda h,c: ulong(ord(c) + (h << 6) + (h << 16) - h), L, 0)
def loselose(L):
"""
h = 0
for c in L:
h += ord(c);
return h
"""
return sum(ord(c) for c in L)
def loselose_l(L):
return reduce(lambda h,c: ulong(ord(c) + h), L, 0)
Comments
First off, you probably don't really want the integers to be actually unique. If you do then your numbers might be unlimited in size. If that really is what you want then you could use a bignum library and interpret the bits of the string as the representation of a (potentially very large) integer. If your strings can include the \0 character then you should prepend a 1, so you can distinguish e.g. "\0\0" from "\0".
Now, if you prefer bounded-size numbers you'll be using some form of hashing. MD5 will work but it's overkill for the stated purpose. I recommend using sdbm instead, it works very well. In C it looks like this:
static unsigned long sdbm(unsigned char *str)
{
unsigned long hash = 0;
int c;
while (c = *str++)
hash = c + (hash << 6) + (hash << 16) - hash;
return hash;
}
The source, http://www.cse.yorku.ca/~oz/hash.html, also presents a few other hash functions.
1 Comment
Here's another option, quite crude (probably has many collisions) and not very legible.
It worked for the purpose of generating an int (and later on, a random color) for different strings:
aString = "don't panic"
reduce( lambda x,y:x+y, map( lambda x:ord(x[0])*x[1],zip( aString, range( 1, len( aString ) ) ) ) )