
I have multiple numpy 2-D arrays which I want to compare row-wise. The output of my function should be a 2-D boolean array covering all rows of the three input arrays. I want to detect the first time a row occurs; every second or third occurrence of a duplicate row should be flagged as False in the output. It is not possible to have duplicate rows within a single array.

If possible I would like to avoid loops, as they slow down the calculation.

Example:

array1 = array([[444, 427],
                [444, 428],
                [444, 429],
                [444, 430],
                [445, 421]], dtype=uint64)

array2 = array([[446, 427],
                [446, 440],
                [444, 429],
                [444, 432],
                [445, 421]], dtype=uint64)

array3 = array([[447, 427],
                [446, 441],
                [444, 429],
                [444, 432],
                [445, 421]], dtype=uint64)

# output
array([[ True,  True,  True,  True,  True],
       [ True,  True, False,  True, False],
       [ True,  True, False, False, False]], dtype=bool)

Any ideas?

  • Do you care about duplicate rows within the same array? Commented May 8, 2016 at 19:50
  • Good one. It is not possible to have duplicate rows within the same array (I will add that to the question). Commented May 8, 2016 at 20:16
  • What if the last row of array3 were [444, 427] instead; what would the output's last-row, last-column element have to be? Commented May 8, 2016 at 21:35
  • As Divakar says, it is not clear from your question whether you want an hstack or a vstack. Commented May 10, 2016 at 6:32
  • @Divakar it must be False. Commented May 10, 2016 at 6:44

3 Answers


Here's a fast vectorized approach:

import numpy as np

def find_dupe_rows(*arrays):

    # Stack the inputs into one tall (n_arrays * n_rows, n_cols) array
    A = np.vstack(arrays)
    # View each row as a single void scalar so np.unique can compare whole rows
    rtype = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))
    _, first_idx = np.unique(A.view(rtype), return_index=True)
    # Only the first occurrence of each unique row stays True
    out = np.zeros(A.shape[0], dtype=bool)
    out[first_idx] = True

    return out.reshape(len(arrays), -1)

Example usage:

print(find_dupe_rows(array1, array2, array3))
# [[ True  True  True  True  True]
#  [ True  True False  True False]
#  [ True  True False False False]]

To break that down a bit:

  1. Stack the three subarrays to produce a (15, 2) array:

    A = np.vstack((array1, array2, array3))
    
  2. Use np.unique on a void view of A (a common trick for treating each row as a single scalar) to efficiently find the indices where each unique row first occurs within A; a variant that avoids this trick is sketched after this list:

    rtype = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))
    _, first_idx = np.unique(A.view(rtype), return_index=True)
    
  3. Every row that isn't the first occurrence of a unique row can be treated as a duplicate:

    out = np.zeros(A.shape[0], dtype=bool)  # output is False by default
    out[first_idx] = True                   # set first occurrences to True
    
  4. Finally, reshape this boolean vector to (narrays, nrows), as per your example output:

    return out.reshape(len(arrays), -1)
    
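As a side note, if you are on NumPy 1.13 or newer, np.unique accepts an axis argument, which lets you skip the void-view trick entirely. Here is a minimal sketch of that variant (an alternative to the code above, not the answer's original code; same idea and output shape):

import numpy as np

def find_dupe_rows_axis(*arrays):
    # Stack the inputs into one tall (n_arrays * n_rows, n_cols) array
    A = np.vstack(arrays)
    # np.unique(..., axis=0) compares whole rows directly and returns the
    # index of the first occurrence of each distinct row (NumPy >= 1.13)
    _, first_idx = np.unique(A, axis=0, return_index=True)
    out = np.zeros(A.shape[0], dtype=bool)
    out[first_idx] = True
    return out.reshape(len(arrays), -1)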

1 Comment

Thanks a lot! This is perfect, I will implement it in my script today.

If you are looking for duplicates along the same row IDs across the 2D input arrays, here's a vectorized approach -

import numpy as np

# Concatenate all input 2D arrays to form a single tall 2D array
A = np.vstack((array1, array2, array3))

# Treat each row as a tuple of indices and collapse it to a single
# linear index. This shrinks the data by a factor of the row length,
# which simplifies the rest of the solution and helps performance.
A_lidx = np.ravel_multi_index(A.T, A.max(0) + 1).reshape(-1, array1.shape[0])

# Finally, use broadcasting to compare the elements of each column
# against themselves. argmax along the first axis gives the first
# index at which each value occurs; comparing that with the row index
# yields the final boolean array.
out = (A_lidx[:, None] == A_lidx).argmax(0) == np.arange(A_lidx.shape[0])[:, None]
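
For example, reusing the three arrays from the question (dtype omitted here for brevity); in this particular example the duplicates all sit at the same row positions, so the result matches the expected output from the question:

import numpy as np

array1 = np.array([[444, 427], [444, 428], [444, 429], [444, 430], [445, 421]])
array2 = np.array([[446, 427], [446, 440], [444, 429], [444, 432], [445, 421]])
array3 = np.array([[447, 427], [446, 441], [444, 429], [444, 432], [445, 421]])

A = np.vstack((array1, array2, array3))
A_lidx = np.ravel_multi_index(A.T, A.max(0) + 1).reshape(-1, array1.shape[0])
out = (A_lidx[:, None] == A_lidx).argmax(0) == np.arange(A_lidx.shape[0])[:, None]

print(out)
# [[ True  True  True  True  True]
#  [ True  True False  True False]
#  [ True  True False False False]]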

If instead you want a global search across all rows of all the 2D input arrays, you need a slightly modified approach, like so -

# Concatenate all input 2D arrays to form a single tall 2D array
A = np.vstack((array1, array2, array3))

# Treat each row as a tuple of indices and collapse it to a single
# linear index, as before, but keep it as one flat 1D array.
A_lidx = np.ravel_multi_index(A.T, A.max(0) + 1).ravel()

# Find first occurrences by sorting and looking for indices where the
# sorted values change.
sidx = A_lidx.argsort()
first_occ = sidx[np.append(0, np.where(np.diff(A_lidx[sidx]) > 0)[0] + 1)]

# Finally, set those indices to True in an output array of appropriate length
out = np.in1d(np.arange(len(A_lidx)), first_occ).reshape(3, -1)

Please note that the step to calculate first_occ is basically a crude way of using np.unique(..., return_index=True), as used in @ali_m's solution.
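
For reference, here is a minimal sketch of that step written with np.unique instead, assuming the A_lidx from the snippet above:

# np.unique returns the sorted unique values together with the index of
# the first occurrence of each one, which is exactly what first_occ holds
_, first_occ = np.unique(A_lidx, return_index=True)
out = np.in1d(np.arange(len(A_lidx)), first_occ).reshape(3, -1)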

import numpy as np

array1 = np.array([[444, 427],
                   [444, 428],
                   [444, 429],
                   [444, 430],
                   [445, 421]])

array2 = np.array([[446, 427],
                   [446, 440],
                   [444, 429],
                   [444, 432],
                   [445, 421]])

array3 = np.array([[447, 427],
                   [446, 441],
                   [444, 429],
                   [444, 432],
                   [445, 421]])

array_list = [array1, array2, array3]
x = set()  # rows seen so far, as tuples
check = np.ones((len(array_list), array1.shape[0]), dtype=bool)
for i, item in enumerate(array_list):
    for j in range(item.shape[0]):
        temp = tuple(item[j, :])
        if temp in x:
            check[i, j] = False
        else:
            x.add(temp)

print(check)

Hope this will help.

1 Comment

Awesome, the output is indeed correct! However, a solution without loops would be preferable, so I will not yet mark this question as solved. Still, thanks a lot!
