Vectorizing feature hashing in python

Question

I'm wondering if anyone knows how to vectorize feature hashing in Python. For example, this is my code:

    import numpy as np
    hashlen = 5
    x = np.array([4, 7, 4, 2, 6, 8, 0, 6, 3, 1])
    h = np.array([0, 3, 1, 2, 4, 2, 1, 0, 3, 1])

In feature hashing, h represents the indices of the new vector I am hashing x to, i.e the index 0 of the hashed vector should have 4 and 6 summed up, index 1 should have 4, 0 and 1 summed up, etc. The resulting hashed vector should be:

    w = np.array([ 10, 5, 10, 10, 6])

One way of doing this is of course by looping through the hash indices, i.e:

    for itr in range(hashlen):
        w[itr] = np.sum(x[np.where(h==itr)])

For large vectors, the complexity is a function of hashlen (the length of the hashed vector). It could take too long, especially with a np.where() in it.

I want to do something like:

    w = np.zeros(hashlen)
    w[h]+= x

However, the result of this is the same as doing

    w = np.zeros(hashlen)
    w[h] = x

Can anyone let me know if I'm missing something here? Or if there's an 'easy' way of doing the feature hashing that doesn't involve too many computations?

Daniel · Accepted Answer · 2013-08-01 17:30:33Z

5

You can use bincount with weights to do what you are asking:

>>> np.bincount(h,weights=x)
array([ 10.,   5.,  10.,  10.,   6.])

For matrices:

>>> import numpy as np
>>> a=np.random.randint(0,5,(50,50))
>>> rand=np.random.rand(5)
>>> rand
array([ 0.10899745,  0.35296303,  0.21127571,  0.56433924,  0.27895281])
>>> b=np.take(rand,a)

#Unfortunately you cannot do it like this:
>>> np.bincount(a,weights=b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: object too deep for desired array

#There we go:
>>> np.bincount(a.flat,weights=b.flat)
array([  55.04371257,  172.59892108,   96.34172236,  297.40677707,
        145.89232039])

This used fancy indexing to see what was happening:

>>> np.bincount(a.flat)
array([505, 489, 456, 527, 523])
>>> np.bincount(a.flat)*rand
array([  55.04371257,  172.59892108,   96.34172236,  297.40677707,
        145.89232039])

edited Aug 1, 2013 at 17:30

answered Jul 31, 2013 at 16:30

Daniel

19.6k7 gold badges64 silver badges74 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Ganesh Sundar Over a year ago

Just curious - is it possible to do something like that on matrices instead of vectors? Without looping of course.

Daniel Over a year ago

Updated for matrices. There are some other ways if you are looking for something else.

Ganesh Sundar Over a year ago

No I mean, when you hash it, you shouldn't be hashing it to a vector; you should be hashing it to a matrix of size 50x5 (in this case). Something like: z = np.zeros([50,5]) for itr in range(50): z[itr] = np.bincount(a[itr], weights=b[itr])

Daniel Over a year ago

Im not sure thats possible- looking at a numpy function that does something similar, histogramdd, it actually loops through the multidimensional array with np.digitize. Its probably best to ask this as a separate question with a discrete example.

Collectives™ on Stack Overflow

Vectorizing feature hashing in python

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related