I'm wondering if anyone knows how to vectorize feature hashing in Python. For example, this is my code:
import numpy as np
hashlen = 5
x = np.array([4, 7, 4, 2, 6, 8, 0, 6, 3, 1])
h = np.array([0, 3, 1, 2, 4, 2, 1, 0, 3, 1])
In feature hashing, h represents the indices of the new vector I am hashing x to, i.e the index 0 of the hashed vector should have 4 and 6 summed up, index 1 should have 4, 0 and 1 summed up, etc. The resulting hashed vector should be:
w = np.array([ 10, 5, 10, 10, 6])
One way of doing this is of course by looping through the hash indices, i.e:
for itr in range(hashlen):
w[itr] = np.sum(x[np.where(h==itr)])
For large vectors, the complexity is a function of hashlen (the length of the hashed vector). It could take too long, especially with a np.where() in it.
I want to do something like:
w = np.zeros(hashlen)
w[h]+= x
However, the result of this is the same as doing
w = np.zeros(hashlen)
w[h] = x
Can anyone let me know if I'm missing something here? Or if there's an 'easy' way of doing the feature hashing that doesn't involve too many computations?