2

Requirement: I have a big set of unique strings. I need to assign an unique int id to each of them. After that, either get id from string or get string from id should be efficient enough (both memory and speed).

If in c/c++, one can store these strings in a hash table (an array of const char * for example), and assign the index of a string in the table as the id.

Is it possible do the same thing or other solution in python? Otherwise I need to maintain two dict that map strings to ids and vice vasa.

Update: the set is frozend, no need to change.

4

2 Answers 2

1

If only string -> id is enough, just use hash function:

In [2]: hash( 'hello' )
Out[2]: 840651671246116861

In [3]: hash( 'helloo' )
Out[3]: -827725961091893887

If you need both ways, as @njzk2 suggested:

values = {hash(value): value for value in string_list}
# from id -> string:
values[id]
# from string -> id:
hash(string)

If you are cautious for collisions in hashing and your data is static, you can check if there are any collisions:

hashes = set()
for value in string_list:
   hashed = hash(value)
   if hashed in hashes:
      print('at least one collision in hashing')
      break
   hashes.add(hashed)
print('no collisions at hashing')

If you have any collisions, which is very unlikely, you can do:

myDict1 = {} # string --> id dictonary
myDict2 = {} # id --> string dictionary

counter = 0
for value in string_list:
   myDict1[value] = counter
   myDict2[counter] = value
   counter += 1
Sign up to request clarification or add additional context in comments.

11 Comments

why would you need 2 dictionnaries? values = {hash(value): value for value in string_list} is enough. (since, as you pointed out, you can always obtain the hash from the string)
also, as per the question I have a big set of unique strings
@njzk2 Good catch! Do you want to add a new answer, or should I update mine?
But hash(string) maybe collide for two different strings.
This is great if you don't need to insert and delete. Otherwise, it's best to use the suggested two-way-reverse-map in the questions comment.
|
1

Here is my solution now, using only one dict.

class StringTable(object):

    def __init__(self):
        self.table = {}

    def probe_id(self, ss):
        table = self.table
        id_ = hash(ss)

        while True:
            s = table.get(id_, None)

            # id_ not use, return it
            if s is None:
                return id_, False

            # id_ used and equal to ss
            if ss == s:
                return id_, True

            # id_ used by other string, probe again!
            id_ = id_ + 1

            # don't use this, endless loop, why?
            #id_ = hash(id_)

    def store(self, ss):
        id_, occupied = self.probe_id(ss)
        if not occupied:
            self.table[id_] = ss
            print 'store {%s: %r}' % (id_, ss)
        return id_

    def get_string_from_id(self, id_):
        ret = self.table.get(id_, None)
        print 'get_string_from_id %s -> %r' % (id_, ret)
        return ret

    def get_id_from_string(self, ss):
        id_, occupied = self.probe_id(ss)
        ret = id_ if occupied else None
        print 'get_id_from_string %r -> %s' % (ss, ret)
        return ret


st = StringTable()
# http://bugs.python.org/issue14621
s1 = "\x00\xcf\x0b\x96\x19"
s2 = "\x00\x6d\x29\x45\x18"

print repr(s1), hash(s1)
print repr(s2), hash(s2)

id1 = st.store(s1)
id2 = st.store(s2)

st.get_string_from_id(id1)
st.get_string_from_id(id2)

st.get_id_from_string(s1)
st.get_id_from_string(s2)

Output:

jayven@laptop:~/test$ python stringtable.py
'\x00\xcf\x0b\x96\x19' 1220138288
'\x00m)E\x18' 1220138288
store {1220138288: '\x00\xcf\x0b\x96\x19'}
store {1220138289: '\x00m)E\x18'}
get_string_from_id 1220138288 -> '\x00\xcf\x0b\x96\x19'
get_string_from_id 1220138289 -> '\x00m)E\x18'
get_id_from_string '\x00\xcf\x0b\x96\x19' -> 1220138288
get_id_from_string '\x00m)E\x18' -> 1220138289

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.