42

How would you convert arbitrary strings into unique integers, which would be the same across Python sessions and platforms? hash('my string') wouldn't work because a different value is returned for each Python session and platform.

1
  • 1
    It would be helpful if you could clarify if you want a guarantee of uniqueness, or if you're satisfied with a high probability of uniqueness, as with a hash function. The fact that you're talking about hash() suggests the latter...? Do you need to be able to invert the mapping, or not? Commented Jul 5, 2021 at 14:13

5 Answers 5

54

Use a hash algorithm such as MD5 or SHA1, then convert the hexdigest via int():

>>> import hashlib
>>> int(hashlib.md5('Hello, world!'.encode('utf-8')).hexdigest(), 16)
144653930895353261282233826065192032313L
Sign up to request clarification or add additional context in comments.

7 Comments

This is a good answer, but technically the integer produced is not unique. There are less MD5 hashes than available strings. However, chances of a collision are very low
This is the case for any hash method.
What does "very low" mean? Would it be unwise to use this algorithm in production when uniqueness is required?
Slight modification: If you want to constrain the size of the int: int(hashlib.md5('Hello, world!').hexdigest()[:8], 16) will be < 232, int(hashlib.md5('Hello, world!').hexdigest()[:16], 16) will be < 264.
If you get TypeError: Unicode-objects must be encoded before hashing you need to encode your string, for example: int(hashlib.md5('Hello, world!'.encode('utf-8')).hexdigest(), 16)
|
8

If a hash function really won't work for you, you can turn the string into a number.

my_string = 'my string'
def string_to_int(s):
    ord3 = lambda x : '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))

In[10]: string_to_int(my_string)
Out[11]: 109121032115116114105110103L

This is invertible, by mapping each triplet through chr.

def int_to_string(n)
    s = str(n)
    return ''.join([chr(int(s[i:i+3])) for i in range(0, len(s), 3)])

In[12]: int_to_string(109121032115116114105110103L)
Out[13]: 'my string'

1 Comment

This maps '\0' and '\0\0' to the same thing - you should prepend a '1'. Also this is a little inefficient, could use the hex representation so you'd have smaller numbers (this is then equivalent to using the string's binary representation and interpreting it as a number).
3

Here are my python27 implementation for algorithms listed here: http://www.cse.yorku.ca/~oz/hash.html. No idea if they are efficient or not.

from ctypes import c_ulong

def ulong(i): return c_ulong(i).value  # numpy would be better if available

def djb2(L):
  """
  h = 5381
  for c in L:
    h = ((h << 5) + h) + ord(c) # h * 33 + c
  return h
  """
  return reduce(lambda h,c: ord(c) + ((h << 5) + h), L, 5381)

def djb2_l(L):
  return reduce(lambda h,c: ulong(ord(c) + ((h << 5) + h)), L, 5381)

def sdbm(L):
  """
  h = 0
  for c in L:
    h = ord(c) + (h << 6) + (h << 16) - h
  return h
  """
  return reduce(lambda h,c: ord(c) + (h << 6) + (h << 16) - h, L, 0)

def sdbm_l(L):
  return reduce(lambda h,c: ulong(ord(c) + (h << 6) + (h << 16) - h), L, 0)

def loselose(L):
  """
  h = 0
  for c in L:
    h += ord(c);
    return h
  """
  return sum(ord(c) for c in L)

def loselose_l(L):
  return reduce(lambda h,c: ulong(ord(c) + h), L, 0)

Comments

1

First off, you probably don't really want the integers to be actually unique. If you do then your numbers might be unlimited in size. If that really is what you want then you could use a bignum library and interpret the bits of the string as the representation of a (potentially very large) integer. If your strings can include the \0 character then you should prepend a 1, so you can distinguish e.g. "\0\0" from "\0".

Now, if you prefer bounded-size numbers you'll be using some form of hashing. MD5 will work but it's overkill for the stated purpose. I recommend using sdbm instead, it works very well. In C it looks like this:

static unsigned long sdbm(unsigned char *str)
{
    unsigned long hash = 0;
    int c;

    while (c = *str++)
        hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}

The source, http://www.cse.yorku.ca/~oz/hash.html, also presents a few other hash functions.

1 Comment

You're quite correct. This would definitely be a problem if I was trying to convert entire documents to a number. However, for my application, I'll only be converting short strings, usually less than a few dozen characters.
0

Here's another option, quite crude (probably has many collisions) and not very legible.

It worked for the purpose of generating an int (and later on, a random color) for different strings:

aString = "don't panic"
reduce( lambda x,y:x+y, map( lambda x:ord(x[0])*x[1],zip( aString, range( 1, len( aString ) ) ) ) )

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.