
I have a large array and need to compare the distances of a set of sample rows from this array against all the other rows of the array. Below is a very simple example of my data set.

import numpy as np
import scipy.spatial.distance as sd

data = np.array(
    [[ 0.93825827,  0.26701143],
     [ 0.99121108,  0.35582816],
     [ 0.90154837,  0.86254049],
     [ 0.83149103,  0.42222948],
     [ 0.27309625,  0.38925281],
     [ 0.06510739,  0.58445673],
     [ 0.61469637,  0.05420098],
     [ 0.92685408,  0.62715114],
     [ 0.22587817,  0.56819403],
     [ 0.28400409,  0.21112043]]
)


sample_indexes = [1,2,3]

# I'd rather not make this
other_indexes = list(set(range(len(data))) - set(sample_indexes))

sample_data = data[sample_indexes]
other_data = data[other_indexes]

# compare them
dists = sd.cdist(sample_data, other_data)

Is there a way to index a numpy array for indexes that are NOT the sample indexes? In my example above I make a list called other_indexes. I'd rather not have to do this for various reasons (large data set, threading, very VERY little memory on the system this is running on, etc.). Is there a way to do something like...

other_data = data[ indexes not in sample_indexes]

I read that numpy masks can do this, but I tried...

other_data = data[~sample_indexes]

And this gives me an error. Do I have to create a mask?

2 Comments
  • Can data be arranged so that the first N rows form the sample_data and the remainder form the other_data? If so, you could define sample_data and other_data using basic slices, which return views. This would require very little extra memory since the views share the same underlying data. Commented Aug 15, 2014 at 18:32
  • Also, if you are very memory-constrained, you might consider making the arrays file-based using np.memmap (a small sketch of both ideas follows below). Commented Aug 15, 2014 at 18:33
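
A minimal sketch of the two suggestions above, assuming the rows can be arranged so that the samples come first; the file name, dtype, and shape passed to np.memmap are illustrative and must match how the file was actually written:

import numpy as np

# If the array lives in a binary file on disk, np.memmap avoids loading it
# into RAM at all (file name, dtype, and shape here are illustrative).
data = np.memmap('data.dat', dtype=np.float64, mode='r', shape=(10, 2))

N = 3                    # number of sample rows, assuming they occupy rows 0..N-1
sample_data = data[:N]   # basic slice: a view, no extra memory
other_data = data[N:]    # also a view sharing the same underlying buffer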

4 Answers

mask = np.ones(len(data), dtype=bool)  # start with every row selected
mask[sample_indexes] = False           # deselect the sample rows
other_data = data[mask]                # boolean indexing yields the remaining rows

Not the most elegant solution for what should perhaps be a single-line statement, but it's fairly efficient, and the memory overhead is minimal too.

If memory is your prime concern, np.delete would avoid creating the mask, and fancy indexing creates a copy anyway.

On second thought: np.delete does not modify the existing array, so it's pretty much exactly the single-line statement you are looking for.
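
For reference, that one-liner would look something like the sketch below (axis=0 removes whole rows; the result is a new array and data itself is untouched):

import numpy as np

# np.delete returns a new array with the listed rows removed;
# the original `data` is left as-is.
other_data = np.delete(data, sample_indexes, axis=0)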


5 Comments

Doesn't np.delete create a new copy of the array with the specified elements deleted? I'd rather not create a new array; I'd rather read from the existing one in place.
Yes, delete creates a copy. If memory is really that tight, have you considered storing your data in a PyTables array and operating on that? See for instance pytables.github.io/usersguide/…
Note: if the number of deletions is truly small, a Python loop that swaps those rows to the end of the array, followed by taking a view of the remaining leading rows, would be a simple and efficient solution (a sketch of this idea follows after these comments).
Okay, that works for my purposes. You said I can use a Python loop to swap elements to the end of an array. Can elements be swapped in place in a numpy array? How would I do this? Thanks
This is precisely what the official documentation recommends: docs.scipy.org/doc/numpy/reference/generated/numpy.delete.html
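
A minimal sketch of the swap-to-the-end idea from the comments above. It assumes sample_indexes contains no duplicates and that the order of the remaining rows does not matter (the swaps reorder them in place); partition_to_end is an illustrative name, not a numpy function:

import numpy as np
import scipy.spatial.distance as sd

def partition_to_end(data, sample_indexes):
    """Swap the sampled rows to the tail of `data` in place and return a
    view of the leading rows, i.e. everything not in sample_indexes."""
    end = len(data)
    # Go from the largest index down, so a row swapped in from the tail can
    # never itself be an unprocessed sample row.
    for idx in sorted(sample_indexes, reverse=True):
        end -= 1
        data[[idx, end]] = data[[end, idx]]  # in-place row swap
    return data[:end]  # basic slice: a view, no copy of the kept rows

# With `data` and `sample_indexes` as in the question:
other_data = partition_to_end(data, sample_indexes)    # view of the 7 other rows
sample_data = data[len(data) - len(sample_indexes):]   # the sample rows, now in the tail
dists = sd.cdist(sample_data, other_data)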

You may want to try np.in1d:

In [5]: select = np.in1d(range(data.shape[0]), sample_indexes)

In [6]: print(data[select])
[[ 0.99121108  0.35582816]
 [ 0.90154837  0.86254049]
 [ 0.83149103  0.42222948]]
In [7]: print(data[~select])
[[ 0.93825827  0.26701143]
 [ 0.27309625  0.38925281]
 [ 0.06510739  0.58445673]
 [ 0.61469637  0.05420098]
 [ 0.92685408  0.62715114]
 [ 0.22587817  0.56819403]
 [ 0.28400409  0.21112043]]

2 Comments

Run this: a = np.array([[1,2],[3,4],[5,6]]); a[~np.array([0,1])]
For data with shape[0] == 25,000, @eelco-hoogendoorn's way of creating the boolean mask is 25x faster on my machine, which makes sense because you simply index the relevant positions, whereas here you do a lookup for each index. By the way, this lookup can be sped up 7x by using np.arange instead of range (see the sketch below).
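
A small sketch of the speed-up mentioned in the previous comment, using np.arange instead of range; np.isin is the element-wise successor to np.in1d in newer NumPy versions:

import numpy as np

select = np.isin(np.arange(data.shape[0]), sample_indexes)
sample_data = data[select]
other_data = data[~select]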

You may also use setdiff1d:

In [11]: data[np.setdiff1d(np.arange(data.shape[0]), sample_indexes)]
Out[11]: 
array([[ 0.93825827,  0.26701143],
       [ 0.27309625,  0.38925281],
       [ 0.06510739,  0.58445673],
       [ 0.61469637,  0.05420098],
       [ 0.92685408,  0.62715114],
       [ 0.22587817,  0.56819403],
       [ 0.28400409,  0.21112043]])
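
If both index collections are known to be free of duplicates, setdiff1d can skip its internal de-duplication by passing assume_unique=True; a small sketch:

import numpy as np

other_idx = np.setdiff1d(np.arange(data.shape[0]), sample_indexes,
                         assume_unique=True)
other_data = data[other_idx]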



I'm not familiar with the specifics of numpy, but here's a general solution. Suppose you have the following list:

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

You create another list of the indices you don't want:

inds = [1, 3, 6]

Now simply do this:

good_data = [x for x in a if x not in inds]

resulting in good_data = [0, 2, 4, 5, 7, 8, 9].
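
Applied to the question's data, the same idea would filter row indices rather than values and then index the array once; a sketch (this still builds a Python list and copies the selected rows):

keep = [i for i in range(len(data)) if i not in sample_indexes]
other_data = data[keep]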

1 Comment

This will create lots of Python objects and thus will be wildly memory-inefficient compared to numpy.
