Group numpy into multiple sub-arrays using an array of values

Question

I have an array of points along a line:

a = np.array([18, 56, 32, 75, 55, 55])

I have another array that holds the group index for each element of a (the two arrays will always have equal lengths). Neither array a nor array b is sorted.

b = np.array([0, 2, 3, 2, 2, 2])

I want to group a into multiple sub-arrays such that the following would be possible:

c[0] -> array([18])
c[2] -> array([56, 75, 55, 55])
c[3] -> array([32])

Although the above example is simple, I will be dealing with millions of points, so efficient methods are preferred. It is also essential that any sub-array of points can be accessed in this fashion later in the program by automated methods.

Might be helpful later on, depending on your goal: Vectorized groupby with NumPy
– Brad Solomon, Commented Mar 19, 2018 at 22:01

Divakar · Accepted Answer · 2018-03-19 23:00:10Z


Here's one approach -

import numpy as np

def groupby(a, b):
    # Get argsort indices, to be used to sort a and b in the next steps
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]

    # Get the group limit indices (start, stop of groups)
    cut_idx = np.flatnonzero(np.r_[True,b_sorted[1:] != b_sorted[:-1],True])

    # Split input array with those start, stop ones
    out = [a_sorted[i:j] for i,j in zip(cut_idx[:-1],cut_idx[1:])]
    return out

A simpler but less efficient approach would be to replace the last few lines with np.split and get the output, like so -

out = np.split(a_sorted, np.flatnonzero(b_sorted[1:] != b_sorted[:-1])+1 )
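For reference, the np.split variant can be written out as a complete function. This is only a sketch of the same idea; the name groupby_split is made up for illustration and is not part of the original answer.

```python
import numpy as np

def groupby_split(a, b):
    # Hypothetical helper name; same steps as groupby(), but using np.split.
    # Stable sort so elements keep their relative order within each group.
    sidx = np.argsort(b, kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]

    # Positions where the sorted ID changes mark the start of a new group.
    boundaries = np.flatnonzero(b_sorted[1:] != b_sorted[:-1]) + 1

    # np.split cuts a_sorted at each boundary and returns a list of arrays.
    return np.split(a_sorted, boundaries)

a = np.array([18, 56, 32, 75, 55, 55])
b = np.array([0, 2, 3, 2, 2, 2])
groups = groupby_split(a, b)
```

Here groups holds one array per distinct ID in b, in ascending ID order.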

Sample run -

In [38]: a
Out[38]: array([18, 56, 32, 75, 55, 55])

In [39]: b
Out[39]: array([0, 2, 3, 2, 2, 2])

In [40]: groupby(a, b)
Out[40]: [array([18]), array([56, 75, 55, 55]), array([32])]
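Note that the list returned above is indexed by position, not by ID, so its element 1 is the group for ID 2. If lookup by ID is needed, as in c[2] from the question, one possible sketch is to build a dict keyed by ID. The name groupby_dict is hypothetical, and np.unique with return_index is used here in place of the flatnonzero step.

```python
import numpy as np

def groupby_dict(a, b):
    # Hypothetical helper: returns {ID: sub-array of a} for IDs present in b.
    sidx = np.argsort(b, kind='mergesort')  # stable sort keeps order within groups
    a_sorted = a[sidx]
    b_sorted = b[sidx]

    # On a sorted array, return_index gives the first position of each ID,
    # which is the start of that ID's group.
    ids, starts = np.unique(b_sorted, return_index=True)
    stops = np.r_[starts[1:], len(b_sorted)]

    return {int(k): a_sorted[i:j] for k, i, j in zip(ids, starts, stops)}

a = np.array([18, 56, 32, 75, 55, 55])
b = np.array([0, 2, 3, 2, 2, 2])
c = groupby_dict(a, b)
```

With this, c[0], c[2] and c[3] return the sub-arrays listed in the question, and an ID that never occurs (such as 1) is simply absent from the dict.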

To get sub-arrays covering the entire range of IDs from 0 to the largest ID in b, with an empty array for any ID that does not occur -

def groupby_perID(a, b):
    # Get argsort indices, to be used to sort a and b in the next steps
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]

    # Get the group limit indices (start, stop of groups)
    cut_idx = np.flatnonzero(np.r_[True,b_sorted[1:] != b_sorted[:-1],True])

    # Create cut indices for all unique IDs in b
    n = b_sorted[-1]+2
    cut_idxe = np.full(n, cut_idx[-1], dtype=int)

    insert_idx = b_sorted[cut_idx[:-1]]
    cut_idxe[insert_idx] = cut_idx[:-1]
    cut_idxe = np.minimum.accumulate(cut_idxe[::-1])[::-1]

    # Split input array with those start, stop ones
    out = [a_sorted[i:j] for i,j in zip(cut_idxe[:-1],cut_idxe[1:])]
    return out

Sample run -

In [241]: a
Out[241]: array([18, 56, 32, 75, 55, 55])

In [242]: b
Out[242]: array([0, 2, 3, 2, 2, 2])

In [243]: groupby_perID(a, b)
Out[243]: [array([18]), array([], dtype=int64), 
           array([56, 75, 55, 55]), array([32])]
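As an alternative sketch (not the method used in the answer), the per-ID cut indices can also be computed with np.searchsorted on the sorted IDs. This assumes b is non-empty and contains only non-negative integers; the function name is made up for illustration.

```python
import numpy as np

def groupby_per_id_searchsorted(a, b):
    # Hypothetical alternative to groupby_perID().
    # Assumes b is non-empty and holds non-negative integer IDs.
    sidx = np.argsort(b, kind='mergesort')  # stable sort keeps order within groups
    a_sorted = a[sidx]
    b_sorted = b[sidx]

    # For each candidate ID 0..max(b)+1, find the first position in b_sorted
    # whose value is >= that ID. Consecutive entries are (start, stop) pairs;
    # an ID that never occurs gets start == stop, i.e. an empty slice.
    n_ids = int(b_sorted[-1]) + 1
    cut = np.searchsorted(b_sorted, np.arange(n_ids + 1), side='left')

    return [a_sorted[i:j] for i, j in zip(cut[:-1], cut[1:])]

a = np.array([18, 56, 32, 75, 55, 55])
b = np.array([0, 2, 3, 2, 2, 2])
per_id = groupby_per_id_searchsorted(a, b)
```

The result has one entry per ID, so per_id[k] is the sub-array for ID k and can be indexed directly.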

edited Mar 19, 2018 at 23:00

answered Mar 19, 2018 at 21:57

Divakar


2 Comments

berkelem Over a year ago

Why do you specify kind="mergesort" in argsort()?

Divakar Over a year ago

@berkelem To keep the elements within each group in the same order as they appear in a. With the default sort kind, that's not guaranteed.
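A small check of that point, using the arrays from the question: with kind='mergesort', which NumPy documents as a stable sort, the original positions of elements sharing an ID come out in increasing order. The variable names are illustrative only.

```python
import numpy as np

a = np.array([18, 56, 32, 75, 55, 55])
b = np.array([0, 2, 3, 2, 2, 2])

# Stable argsort: ties in b are resolved by original position.
sidx = np.argsort(b, kind='mergesort')
b_sorted = b[sidx]

# Mark neighbouring pairs in the sorted order that share the same ID.
same_id = b_sorted[1:] == b_sorted[:-1]

# Within a group, each original position must be larger than the previous one.
in_order = bool(np.all(np.diff(sidx)[same_id] > 0))
```

The default kind is quicksort, which is not a stable sort, so the same check is not guaranteed to pass with a plain b.argsort().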
