
I have a list of arrays with variable length. I have something like this:

import numpy as np

a = [np.array([0, 3, 4]), np.array([1, 8]), np.array([2, 5, 7]), np.array([6])]

From every array that contains more than one value, I would like to extract all values but the first. It is quite straightforward to do this in a for-loop, but I would highly appreciate knowing how to do it without a for loop, to save time. My for-loop looks like this:

duplicate_pos = []
for i in range(len(a)):
    if len(a[i]) > 1:
        duplicate_pos.append(a[i][1:])

Thanks a lot.

PS: Even though this is the first question I have ever asked here, Stack Overflow has been my daily science companion since I started my PhD several years ago. Thanks to this amazing community.

  • Despite your love for stackoverflow, this question better suits codereview... Commented Apr 10, 2016 at 13:42
  • I agree with @Francesco Commented Apr 10, 2016 at 13:44
  • Why do you think that doing it without a for loop will save you time? Did you try to profile it? Commented Apr 10, 2016 at 14:13
  • How to do numpy tasks without a loop is a very common type of SO question. It belongs here. Commented Apr 10, 2016 at 15:15

5 Answers


You can use a combination of filter (to get rid of shorties) and map (to slice):

b = map(lambda li: li[1:], filter(lambda li: len(li) > 1, a))

# [array([3, 4]), array([8]), array([5, 7])]

In Python3, b is a map object which can be listified like any other iterable via list(b). In Python2, map returns a list.
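For example, in Python 3 (a quick sketch, reusing the question's list a):

b = map(lambda li: li[1:], filter(lambda li: len(li) > 1, a))
list(b)
# [array([3, 4]), array([8]), array([5, 7])]

Note that b is consumed after one pass; a second list(b) returns [].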


2 Comments

Combining filter or map with lambda is really discouraged and does not give any performance improvement; most probably just the opposite. Did you profile it against the OP's method?
No, I did not, and I do not think it does. It is, however, the only suggested way to avoid using the for keyword, which is kind of what the OP asked for ;) Under the hood this cannot be done without a loop anyway (I would strongly assume).

You can do this in one line as follows:

duplicate_pos = [i[1:] for i in a if len(i) > 1]

Comments


You can use a list comprehension:

duplicate_pos = [subarray[1:] for subarray in a if len(subarray)>1]

Or, if you are going to use the values only once, you could use a generator expression:

duplicate_pos = (subarray[1:] for subarray in a if len(subarray)>1)
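For example, consuming the generator once (a small sketch using the question's list a; a second pass would yield nothing, since generators are single-use):

gen = (subarray[1:] for subarray in a if len(subarray) > 1)
for tail in gen:
    print(tail)
# [3 4]
# [8]
# [5 7]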

Comments


In case you want to use pure numpy to solve this problem:

Numpy supports multidimensional arrays and has very fast reduce-like functions, but it requires multidimensional arrays to have a constant length in each dimension. So you could (though not necessarily should) use a masked array to solve this problem:

>>> a=[[0., 3, 4], [1, 8, np.nan], [2, 5, 7], [6, np.nan, np.nan]] # nan to fill the rows
>>> b = np.ma.masked_invalid(a)
>>> b
masked_array(
 data =
   [[0.0 3.0 4.0]
    [1.0 8.0 --]
    [2.0 5.0 7.0]
    [6.0 -- --]],
 mask =
   [[False False False]
    [False False  True]
    [False False False]
    [False  True  True]],
 fill_value = 1e+20)

To discard all rows containing fewer than 2 elements, use count (which counts unmasked values in this case) followed by boolean indexing:

>>> b[np.ma.count(b, axis=1) > 1][:,1:]
masked_array(
 data =
   [[3.0 4.0]
    [8.0 --]
    [5.0 7.0]],
 mask =
   [[False False]
    [False  True]
    [False False]],
 fill_value = 1e+20)

I've included the intermediate steps here:

>>> np.ma.count(b, axis=1)
array([3, 2, 3, 1], dtype=int64)
>>> np.ma.count(b, axis=1) > 1
array([ True,  True,  True, False], dtype=bool)
>>> b[np.ma.count(b, axis=1) > 1]
masked_array(
 data =
   [[0.0 3.0 4.0]
    [1.0 8.0 --]
    [2.0 5.0 7.0]],
 mask =
   [[False False False]
    [False False  True]
    [False False False]],
 fill_value = 1e+20)
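If you need the result back as a list of 1-D arrays, like in the question, you can compress each masked row (a sketch; note this reintroduces a Python-level loop, and the values are floats because of the nan padding):

>>> result = b[np.ma.count(b, axis=1) > 1][:,1:]
>>> [row.compressed() for row in result]
[array([ 3.,  4.]), array([ 8.]), array([ 5.,  7.])]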

2 Comments

Using nan turns the integer arrays into floats.
@hpaulj - This is just to illustrate how it could be done. It was just convenient to use np.ma.masked_invalid to create the masked array. In practice this would be rather done by some kind of preprocessing. The important point is just the b[np.ma.count(b, axis=1) > 1][:,1:]-line which replaces the for-loop.

Since the list contains numpy arrays, I suspect you are hoping to replace the loop with a numpy operation, not just another form of Python iteration. That can speed things up by moving the iteration to compiled code. For small arrays it isn't faster, because of numpy overhead.

In this case you are starting with a list, not a 2d array, and the list contains arrays of varying size. That's a good indicator that there isn't a pure numpy solution.

A cleaner version of your loop (no need to use an index):

def foo(a):
    b = []
    for i in a:
        if i.shape[0] > 1:   # use len(i) if i might be a list
            b.append(i[1:])
    return b

But this is expressed more nicely as a list comprehension:

[i[1:] for i in a if i.shape[0]>1]

In timeit tests, this is 50% faster than the for loop. But the test case is so small I wouldn't put too much stock in the time differences.
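If you want to reproduce the comparison, a minimal timeit sketch might look like this (the function name loop is just for illustration; absolute numbers will vary with list size and machine):

import timeit
import numpy as np

a = [np.array([0, 3, 4]), np.array([1, 8]), np.array([2, 5, 7]), np.array([6])]

def loop(a):
    b = []
    for i in a:
        if i.shape[0] > 1:
            b.append(i[1:])
    return b

print(timeit.timeit(lambda: loop(a), number=100000))                               # explicit loop
print(timeit.timeit(lambda: [i[1:] for i in a if i.shape[0] > 1], number=100000))  # comprehension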

I expect the other iterators - generators, maps, itertools - will time about the same. Others are welcome to elaborate on times.

i[1:] runs ok on a 1 (or 0) element array, so you might not need the if test. Or you could filter out empty arrays in another iteration. For small lists, the iteration choice is usually a matter of style, what expresses the task most clearly to the reader, rather than a matter of time.
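For example, dropping the if test and filtering the empty slices in a second pass (a sketch; x.size is the number of elements in each slice):

[x for x in (i[1:] for i in a) if x.size]
# [array([3, 4]), array([8]), array([5, 7])]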


If the subarrays were all the same length, or possibly padded with something like -1, you could combine them into a 2d array, and select from that

A = np.vstack(a)
A[:,1:]

But vstack iterates on the list, turning each sub array into a 2d array before applying concatenate. That alone makes it slower than the list solutions.
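For completeness, a hypothetical padding step using itertools.zip_longest (Python 3; this assumes -1 never occurs in your data):

from itertools import zip_longest

A = np.array(list(zip_longest(*a, fillvalue=-1))).T
# array([[ 0,  3,  4],
#        [ 1,  8, -1],
#        [ 2,  5,  7],
#        [ 6, -1, -1]])
A[:,1:]   # everything after the first column; -1 marks padding

Like vstack, this still iterates over the list in Python, so it won't beat the comprehensions either.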

1 Comment

Thanks. I accepted this answer because of the additional info provided.
