
I want to randomly select rows from a NumPy array. Say I have this array:

import numpy as np

A = np.array([[1, 3, 0],
              [3, 2, 0],
              [0, 2, 1],
              [1, 1, 4],
              [3, 2, 2],
              [0, 1, 0],
              [1, 3, 1],
              [0, 4, 1],
              [2, 4, 2],
              [3, 3, 1]])

To randomly select, say, 6 rows, I am doing this:

B = A[np.random.choice(A.shape[0], size=6, replace=False), :]

I want another array C containing the rows of A that were not selected for B.

Is there a built-in method for this, or do I need a brute-force approach that compares the rows of B against the rows of A?

Look into np.setdiff1d and np.in1d. Commented Dec 19, 2015 at 11:18

3 Answers

You can make any number of row-wise random partitions of A by slicing a shuffled sequence of row indices:

ind = np.arange(A.shape[0])  # row indices 0..n-1
np.random.shuffle(ind)       # shuffles in place
B = A[ind[:6]]               # first 6 shuffled indices
C = A[ind[6:]]               # all remaining indices

If you want each subset to keep the original row order, sort each slice of the indices:

B = A[np.sort(ind[:6])]
C = A[np.sort(ind[6:])]

(Note that the solution provided by @MaxNoe also preserves row order.)
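
To illustrate the "any number of partitions" point, here is a minimal sketch that cuts one shuffled index sequence into three groups. The cut positions (3 and 6), the seed, and the names `rng`, `groups` and `parts` are assumptions for illustration only; it uses `np.random.default_rng` rather than the `np.random.shuffle` call above.

```python
import numpy as np

# Same example array as in the question.
A = np.array([[1, 3, 0], [3, 2, 0], [0, 2, 1], [1, 1, 4], [3, 2, 2],
              [0, 1, 0], [1, 3, 1], [0, 4, 1], [2, 4, 2], [3, 3, 1]])

rng = np.random.default_rng(0)      # seed chosen only for reproducibility
ind = rng.permutation(A.shape[0])   # shuffled copy of 0..n-1

# Cut the shuffled indices at positions 3 and 6: groups of 3, 3 and 4 rows.
groups = np.split(ind, [3, 6])

# Sort each group so every part keeps the original row order.
parts = [A[np.sort(g)] for g in groups]
```

Every row index lands in exactly one group, so the parts are disjoint and together cover A.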

Solution

This gives you the indices for the selection:

sel = np.random.choice(A.shape[0], size=6, replace=False)

and this gives B:

B = A[sel]

Get all indices that were not selected:

unsel = list(set(range(A.shape[0])) - set(sel))

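and use them for C (the iteration order of a set is not guaranteed, so sort the indices first if the row order of C matters):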

C = A[unsel]
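
Put together, the steps above might look like the following self-contained sketch. The seeded `np.random.default_rng(0)` generator and the `sorted(...)` call are assumptions added here for reproducibility and a stable row order; they are not part of the answer as written.

```python
import numpy as np

# Same example array as in the question.
A = np.array([[1, 3, 0], [3, 2, 0], [0, 2, 1], [1, 1, 4], [3, 2, 2],
              [0, 1, 0], [1, 3, 1], [0, 4, 1], [2, 4, 2], [3, 3, 1]])

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Indices of the 6 selected rows, drawn without replacement.
sel = rng.choice(A.shape[0], size=6, replace=False)

# Set difference of all row indices and the selected ones, sorted so
# that C keeps the original row order.
unsel = sorted(set(range(A.shape[0])) - set(sel.tolist()))

B = A[sel]
C = A[unsel]
```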

Variation with NumPy functions

Instead of using set and list, you can use np.setdiff1d, which returns the unselected indices as a sorted array:

unsel2 = np.setdiff1d(np.arange(A.shape[0]), sel)

For the small example array, the pure-Python version:

%%timeit
unsel1 = list(set(range(A.shape[0])) - set(sel)) 

100000 loops, best of 3: 8.42 µs per loop

is faster than the NumPy version:

%%timeit
unsel2 = np.setdiff1d(np.arange(A.shape[0]), sel)

10000 loops, best of 3: 77.5 µs per loop

For a larger A, the NumPy version is faster:

A = np.random.random((int(1e4), 3))
sel = np.random.choice(A.shape[0], size=6, replace=False)


%%timeit
unsel1 = list(set(range(A.shape[0])) - set(sel))

1000 loops, best of 3: 1.4 ms per loop


%%timeit
unsel2 = np.setdiff1d(np.arange(A.shape[0]), sel)

1000 loops, best of 3: 315 µs per loop
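
The timings above use IPython's %%timeit magic. Outside IPython, a comparable measurement could be sketched with the standard-library timeit module, as below. The function names, the repeat counts and the seed are assumptions for illustration, and the numbers will differ by machine and NumPy version, so no expected output is shown.

```python
import timeit

import numpy as np

n_rows = 10_000
sel = np.random.default_rng(0).choice(n_rows, size=6, replace=False)


def unsel_python():
    # Pure-Python set difference of all row indices and the selected ones.
    return list(set(range(n_rows)) - set(sel))


def unsel_numpy():
    # NumPy set difference; the result is a sorted array.
    return np.setdiff1d(np.arange(n_rows), sel)


# Best of 3 runs of 100 calls each, reported per call.
t_python = min(timeit.repeat(unsel_python, number=100, repeat=3)) / 100
t_numpy = min(timeit.repeat(unsel_numpy, number=100, repeat=3)) / 100

print(f"pure Python: {t_python * 1e6:.1f} µs per call")
print(f"NumPy:       {t_numpy * 1e6:.1f} µs per call")
```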

You can use a boolean mask: draw random indices from an integer array with one entry per row of A, set those positions of the mask to True, and index A with the mask and with its negation. The ~ operator is an elementwise NOT:

idx = np.arange(A.shape[0])
mask = np.zeros_like(idx, dtype=bool)

selected = np.random.choice(idx, 6, replace=False)
mask[selected] = True

B = A[mask]
C = A[~mask]
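
The mask approach above can be wrapped in a small function. This is a minimal sketch: `split_rows` is a hypothetical helper name, and the optional `rng` argument and the `np.arange(30)` test array are assumptions for illustration. Because boolean indexing walks the rows in order, both outputs keep the original row order.

```python
import numpy as np


def split_rows(A, n_selected, rng=None):
    """Hypothetical helper: randomly split the rows of A into two arrays."""
    rng = np.random.default_rng() if rng is None else rng

    # One boolean flag per row; True marks a selected row.
    mask = np.zeros(A.shape[0], dtype=bool)
    selected = rng.choice(A.shape[0], size=n_selected, replace=False)
    mask[selected] = True

    # ~mask is the elementwise NOT, i.e. the rows that were not selected.
    return A[mask], A[~mask]


# Illustrative array whose first column identifies each row.
A = np.arange(30).reshape(10, 3)
B, C = split_rows(A, 6, rng=np.random.default_rng(0))
```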

2 Comments

Ah, sorry. Typo. You need to set mask[selected] = True, not all.
Just the mask is enough. But it was wrong. Now corrected.
