
I have multiple numpy 2-D arrays which I want to compare row-wise. The output of my function should be a 2-D boolean array covering all rows of the three input arrays. I want to detect the first time a row occurs; every second or third occurrence of a duplicate row should be flagged as False in the output. It is not possible to have duplicate rows within a single array.

If possible I would like to avoid loops, as they slow down the calculation.

Example:

array1 = array([[444, 427],
                [444, 428],
                [444, 429],
                [444, 430],
                [445, 421]], dtype=uint64)

array2 = array([[446, 427],
                [446, 440],
                [444, 429],
                [444, 432],
                [445, 421]], dtype=uint64)

array3 = array([[447, 427],
                [446, 441],
                [444, 429],
                [444, 432],
                [445, 421]], dtype=uint64)

# output
array([[ True,  True,  True,  True,  True],
       [ True,  True, False,  True, False],
       [ True,  True, False, False, False]], dtype=bool)

Any ideas?

  • Do you care about duplicate rows within the same array? Commented May 8, 2016 at 19:50
  • Good one. It is not possible to have duplicate rows within the same array (I will add that to the question). Commented May 8, 2016 at 20:16
  • What if the last row of array3 were [444, 427] instead; what would the output's last-row, last-column element have to be? Commented May 8, 2016 at 21:35
  • As Divakar says, it is not clear from your question whether you want an hstack or a vstack. Commented May 10, 2016 at 6:32
  • @Divakar it must be False. Commented May 10, 2016 at 6:44

3 Answers


Here's a fast vectorized approach:

import numpy as np

def find_dupe_rows(*arrays):

    # Stack the inputs into one tall (n_arrays * n_rows, n_cols) array
    A = np.vstack(arrays)
    # View each row as a single void scalar so np.unique can compare whole rows
    rtype = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))
    _, first_idx = np.unique(A.view(rtype), return_index=True)
    # Only the first occurrence of each unique row stays True
    out = np.zeros(A.shape[0], dtype=bool)
    out[first_idx] = True

    return out.reshape(len(arrays), -1)

Example usage:

print(find_dupe_rows(array1, array2, array3))
# [[ True  True  True  True  True]
#  [ True  True False  True False]
#  [ True  True False False False]]

To break that down a bit:

  1. Stack the three subarrays to produce a (15, 2) array:

    A = np.vstack((array1, array2, array3))
    
  2. Use np.unique on a void view of A (a common trick for treating each row as a single scalar) to efficiently find the indices where each unique row first occurs within A; a variant that avoids this trick is sketched after this list:

    rtype = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))
    _, first_idx = np.unique(A.view(rtype), return_index=True)
    
  3. Every row that isn't the first occurrence of a unique row can be treated as a duplicate:

    out = np.zeros(A.shape[0], dtype=bool)  # output is False by default
    out[first_idx] = True                   # set first occurrences to True
    
  4. Finally, reshape this boolean vector to (narrays, nrows), as per your example output:

    return out.reshape(len(arrays), -1)
    
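As a side note, if you are on NumPy 1.13 or newer, np.unique accepts an axis argument, which lets you skip the void-view trick entirely. Here is a minimal sketch of that variant (an alternative to the code above, not the answer's original code; same idea and output shape):

import numpy as np

def find_dupe_rows_axis(*arrays):
    # Stack the inputs into one tall (n_arrays * n_rows, n_cols) array
    A = np.vstack(arrays)
    # np.unique(..., axis=0) compares whole rows directly and returns the
    # index of the first occurrence of each distinct row (NumPy >= 1.13)
    _, first_idx = np.unique(A, axis=0, return_index=True)
    out = np.zeros(A.shape[0], dtype=bool)
    out[first_idx] = True
    return out.reshape(len(arrays), -1)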

1 Comment

Thanks a lot! This is perfect, I will implement it in my script today.

If you are looking for duplicates along the same row IDs across the 2D input arrays, here's a vectorized approach -

import numpy as np

# Concatenate all input 2D arrays to form a single tall 2D array
A = np.vstack((array1, array2, array3))

# Treat each row as a tuple of indices and collapse it to a single
# linear index. This shrinks the data by a factor of the row length,
# which simplifies the rest of the solution and helps performance.
A_lidx = np.ravel_multi_index(A.T, A.max(0) + 1).reshape(-1, array1.shape[0])

# Finally, use broadcasting to compare the elements of each column
# against themselves. argmax along the first axis gives the first
# index at which each value occurs; comparing that with the row index
# yields the final boolean array.
out = (A_lidx[:, None] == A_lidx).argmax(0) == np.arange(A_lidx.shape[0])[:, None]
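
For example, reusing the three arrays from the question (dtype omitted here for brevity); in this particular example the duplicates all sit at the same row positions, so the result matches the expected output from the question:

import numpy as np

array1 = np.array([[444, 427], [444, 428], [444, 429], [444, 430], [445, 421]])
array2 = np.array([[446, 427], [446, 440], [444, 429], [444, 432], [445, 421]])
array3 = np.array([[447, 427], [446, 441], [444, 429], [444, 432], [445, 421]])

A = np.vstack((array1, array2, array3))
A_lidx = np.ravel_multi_index(A.T, A.max(0) + 1).reshape(-1, array1.shape[0])
out = (A_lidx[:, None] == A_lidx).argmax(0) == np.arange(A_lidx.shape[0])[:, None]

print(out)
# [[ True  True  True  True  True]
#  [ True  True False  True False]
#  [ True  True False False False]]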

If instead you want a global search across all rows of all the 2D input arrays, you need a slightly modified approach, like so -

# Concatenate all input 2D arrays to form a single tall 2D array
A = np.vstack((array1, array2, array3))

# Treat each row as a tuple of indices and collapse it to a single
# linear index, as before, but keep it as one flat 1D array.
A_lidx = np.ravel_multi_index(A.T, A.max(0) + 1).ravel()

# Find first occurrences by sorting and looking for indices where the
# sorted values change.
sidx = A_lidx.argsort()
first_occ = sidx[np.append(0, np.where(np.diff(A_lidx[sidx]) > 0)[0] + 1)]

# Finally, set those indices to True in an output array of appropriate length
out = np.in1d(np.arange(len(A_lidx)), first_occ).reshape(3, -1)

Please note that the step to calculate first_occ is basically a crude way of using np.unique(..., return_index=True), as used in @ali_m's solution.
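
For reference, here is a minimal sketch of that step written with np.unique instead, assuming the A_lidx from the snippet above:

# np.unique returns the sorted unique values together with the index of
# the first occurrence of each one, which is exactly what first_occ holds
_, first_occ = np.unique(A_lidx, return_index=True)
out = np.in1d(np.arange(len(A_lidx)), first_occ).reshape(3, -1)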

import numpy as np

array1 = np.array([[444, 427],
                   [444, 428],
                   [444, 429],
                   [444, 430],
                   [445, 421]])

array2 = np.array([[446, 427],
                   [446, 440],
                   [444, 429],
                   [444, 432],
                   [445, 421]])

array3 = np.array([[447, 427],
                   [446, 441],
                   [444, 429],
                   [444, 432],
                   [445, 421]])

array_list = [array1, array2, array3]
x = set()  # rows seen so far, as tuples
check = np.ones((len(array_list), array1.shape[0]), dtype=bool)
for i, item in enumerate(array_list):
    for j in range(item.shape[0]):
        temp = tuple(item[j, :])
        if temp in x:
            check[i, j] = False
        else:
            x.add(temp)

print(check)

Hope this will help.

1 Comment

Awesome, the output is indeed correct! However, a solution without loops would be preferable, so I will not yet mark this question as solved. Still, thanks a lot!
