5

I'm wondering if anyone knows how to vectorize feature hashing in Python. For example, this is my code:

    import numpy as np
    hashlen = 5
    x = np.array([4, 7, 4, 2, 6, 8, 0, 6, 3, 1])
    h = np.array([0, 3, 1, 2, 4, 2, 1, 0, 3, 1])

In feature hashing, h represents the indices of the new vector I am hashing x to, i.e the index 0 of the hashed vector should have 4 and 6 summed up, index 1 should have 4, 0 and 1 summed up, etc. The resulting hashed vector should be:

    w = np.array([ 10, 5, 10, 10, 6])

One way of doing this is of course by looping through the hash indices, i.e:

    for itr in range(hashlen):
        w[itr] = np.sum(x[np.where(h==itr)])

For large vectors, the complexity is a function of hashlen (the length of the hashed vector). It could take too long, especially with a np.where() in it.

I want to do something like:

    w = np.zeros(hashlen)
    w[h]+= x

However, the result of this is the same as doing

    w = np.zeros(hashlen)
    w[h] = x

Can anyone let me know if I'm missing something here? Or if there's an 'easy' way of doing the feature hashing that doesn't involve too many computations?

1 Answer 1

5

You can use bincount with weights to do what you are asking:

>>> np.bincount(h,weights=x)
array([ 10.,   5.,  10.,  10.,   6.])

For matrices:

>>> import numpy as np
>>> a=np.random.randint(0,5,(50,50))
>>> rand=np.random.rand(5)
>>> rand
array([ 0.10899745,  0.35296303,  0.21127571,  0.56433924,  0.27895281])
>>> b=np.take(rand,a)

#Unfortunately you cannot do it like this:
>>> np.bincount(a,weights=b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: object too deep for desired array

#There we go:
>>> np.bincount(a.flat,weights=b.flat)
array([  55.04371257,  172.59892108,   96.34172236,  297.40677707,
        145.89232039])

This used fancy indexing to see what was happening:

>>> np.bincount(a.flat)
array([505, 489, 456, 527, 523])
>>> np.bincount(a.flat)*rand
array([  55.04371257,  172.59892108,   96.34172236,  297.40677707,
        145.89232039])
Sign up to request clarification or add additional context in comments.

4 Comments

Just curious - is it possible to do something like that on matrices instead of vectors? Without looping of course.
Updated for matrices. There are some other ways if you are looking for something else.
No I mean, when you hash it, you shouldn't be hashing it to a vector; you should be hashing it to a matrix of size 50x5 (in this case). Something like: z = np.zeros([50,5]) for itr in range(50): z[itr] = np.bincount(a[itr], weights=b[itr])
Im not sure thats possible- looking at a numpy function that does something similar, histogramdd, it actually loops through the multidimensional array with np.digitize. Its probably best to ask this as a separate question with a discrete example.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.