6

I have an array of points along a line:

a = np.array([18, 56, 32, 75, 55, 55])

I have another array that corresponds to the indices I want to use to access the information in a (they will always have equal lengths). Neither array a nor array b are sorted.

b = np.array([0, 2, 3, 2, 2, 2])

I want to group a into multiple sub-arrays such that the following would be possible:

c[0] -> array([18])
c[2] -> array([56, 75, 55, 55])
c[3] -> array([32])

Although the above example is simple, I will be dealing with millions of points, so efficient methods are preferred. It is also essential later that any sub-array of points can be accessed in this fashion later in the program by automated methods.

1

1 Answer 1

5

Here's one approach -

def groupby(a, b):
    # Get argsort indices, to be used to sort a and b in the next steps
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]

    # Get the group limit indices (start, stop of groups)
    cut_idx = np.flatnonzero(np.r_[True,b_sorted[1:] != b_sorted[:-1],True])

    # Split input array with those start, stop ones
    out = [a_sorted[i:j] for i,j in zip(cut_idx[:-1],cut_idx[1:])]
    return out

A simpler, but lesser efficient approach would be to use np.split to replace the last few lines and get the output, like so -

out = np.split(a_sorted, np.flatnonzero(b_sorted[1:] != b_sorted[:-1])+1 )

Sample run -

In [38]: a
Out[38]: array([18, 56, 32, 75, 55, 55])

In [39]: b
Out[39]: array([0, 2, 3, 2, 2, 2])

In [40]: groupby(a, b)
Out[40]: [array([18]), array([56, 75, 55, 55]), array([32])]

To get sub-arrays covering the entire range of IDs in b -

def groupby_perID(a, b):
    # Get argsort indices, to be used to sort a and b in the next steps
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]

    # Get the group limit indices (start, stop of groups)
    cut_idx = np.flatnonzero(np.r_[True,b_sorted[1:] != b_sorted[:-1],True])

    # Create cut indices for all unique IDs in b
    n = b_sorted[-1]+2
    cut_idxe = np.full(n, cut_idx[-1], dtype=int)

    insert_idx = b_sorted[cut_idx[:-1]]
    cut_idxe[insert_idx] = cut_idx[:-1]
    cut_idxe = np.minimum.accumulate(cut_idxe[::-1])[::-1]

    # Split input array with those start, stop ones
    out = [a_sorted[i:j] for i,j in zip(cut_idxe[:-1],cut_idxe[1:])]
    return out

Sample run -

In [241]: a
Out[241]: array([18, 56, 32, 75, 55, 55])

In [242]: b
Out[242]: array([0, 2, 3, 2, 2, 2])

In [243]: groupby_perID(a, b)
Out[243]: [array([18]), array([], dtype=int64), 
           array([56, 75, 55, 55]), array([32])]
Sign up to request clarification or add additional context in comments.

2 Comments

Why do you specify kind="mergesort" in argsort()?
@berkelem To keep the order of elements as it was in a. With the default, that's not guaranteed.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.