I am looking to apply a function to each row of a numpy array. If this function evaluates to True I will keep the row, otherwise I will discard it. For example, my function might be:

def f(row):
    if sum(row)>10: return True
    else: return False

I was wondering if there was something similar to:

np.apply_over_axes()

which applies a function to each row of a numpy array and returns the result. I was hoping for something like:

np.filter_over_axes()

which would apply a function to each row of a numpy array and only return rows for which the function returned True. Is there anything like this? Or should I just use a for loop?

2 Answers

Ideally, you would implement a vectorized version of your function and use its result for boolean indexing. For the vast majority of problems this is the right solution. NumPy provides many functions that operate along a chosen axis, as well as all the basic arithmetic operations and comparisons, so most useful conditions should be vectorizable.

import numpy as np

x = np.random.randn(20, 3)
x_new = x[np.sum(x, axis=1) > .5]
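
Conditions more involved than a single sum can often be vectorized too, by combining per-row reductions with the element-wise operators & (and), | (or) and ~ (not). Here is a minimal sketch, assuming a made-up rule of "keep rows whose sum exceeds 0.5 and whose smallest value is above -2". The thresholds and the names row_sum_ok, row_min_ok and mask are illustrative only and do not come from the question:

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded only so the sketch is repeatable
x = rng.standard_normal((20, 3))

# Each reduction along axis=1 yields one value per row.
row_sum_ok = x.sum(axis=1) > 0.5    # hypothetical condition 1
row_min_ok = x.min(axis=1) > -2.0   # hypothetical condition 2

# Combine the per-row masks element-wise. If the comparisons are written
# inline, wrap each in parentheses, because & binds tighter than >:
#   (x.sum(axis=1) > 0.5) & (x.min(axis=1) > -2.0)
mask = row_sum_ok & row_min_ok
x_new = x[mask]
```

The result is still a single boolean array with one entry per row, so the indexing step stays the same however many conditions are combined.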

If you are absolutely sure that you can't do the above, I would suggest using a list comprehension (or np.apply_along_axis) to create an array of bools to index with.

def myfunc(row):
    return sum(row) > .5

bool_arr = np.array([myfunc(row) for row in x])
x_new = x[bool_arr]
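
For completeness, the np.apply_along_axis variant mentioned above could look roughly like the sketch below. It reuses myfunc and the shape of x from this answer. Note that np.apply_along_axis still calls the Python function once per row, so it should not be expected to perform like a truly vectorized expression:

```python
import numpy as np

def myfunc(row):
    return sum(row) > .5

x = np.random.randn(20, 3)

# Call myfunc on every 1-D slice taken along axis 1, i.e. on every row.
# The result has one boolean entry per row.
bool_arr = np.apply_along_axis(myfunc, 1, x)
x_new = x[bool_arr]
```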

This gets the job done in a relatively clean way, but it is significantly slower than a vectorized version (roughly 40 times slower in the timing below):

x = np.random.randn(5000, 200)

%timeit x[np.sum(x, axis=1) > .5]
# 100 loops, best of 3: 5.71 ms per loop

%timeit x[np.array([myfunc(row) for row in x])]
# 1 loops, best of 3: 217 ms per loop
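
%timeit is an IPython magic, so it is not available in a plain Python script. A rough equivalent using the standard-library timeit module might look like the sketch below. The helper names vectorized and list_comp and the repeat counts are choices made here, and the absolute numbers will differ from machine to machine:

```python
import timeit

import numpy as np

def myfunc(row):
    return sum(row) > .5

x = np.random.randn(5000, 200)

def vectorized():
    return x[np.sum(x, axis=1) > .5]

def list_comp():
    return x[np.array([myfunc(row) for row in x])]

# Take the best of 3 repeats and divide by the number of calls per repeat
# to get seconds per call.
t_vec = min(timeit.repeat(vectorized, number=10, repeat=3)) / 10
t_loop = min(timeit.repeat(list_comp, number=1, repeat=3))

print(f"vectorized:         {t_vec * 1e3:.2f} ms per loop")
print(f"list comprehension: {t_loop * 1e3:.2f} ms per loop")
```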

1 Comment

Thanks Roger, the function I wanted to use was a bit more complex than just taking the sum, so I might end up using the list comprehension solution.

As @Roger Fan mentioned, applying a function row-wise should really be done in a vectorized fashion on the entire array. The canonical way to filter is to construct a boolean mask and apply it to the array. That said, if the function is too complex to vectorize, it is faster to convert the array into a Python list first (especially if the function uses Python built-ins such as sum()) and apply the function to each row of that list.

msk = arr.sum(axis=1)>10                # best way to create a boolean mask

msk = [f(row) for row in arr.tolist()]  # second best way
#                            ^^^^^^^^   <---- convert to list

filtered_arr = arr[msk]                 # filtered via boolean indexing
A working example and a performance test

As you can see from the timeit test below, looping over a list (arr.tolist()) is roughly twice as fast as looping over a NumPy array (arr), partly because the function f() calls Python's sum() rather than np.sum(). That said, the vectorized method is about ten times faster again than the list version.

def f(row):
    if sum(row)>10: return True
    else: return False
    
arr = np.random.rand(10000, 200)

%timeit arr[[f(row) for row in arr]]
# 260 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit arr[[f(row) for row in arr.tolist()]]
# 114 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit arr[arr.sum(axis=1)>10]
# 10.8 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
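
One way to see what arr.tolist() changes is to compare the element types that f() receives in each loop. This is a small sketch on a tiny array; the names row_from_array and row_from_list are used only for illustration:

```python
import numpy as np

arr = np.random.rand(4, 3)

row_from_array = arr[0]           # what "for row in arr" yields
row_from_list = arr.tolist()[0]   # what "for row in arr.tolist()" yields

# Iterating over a NumPy row produces NumPy scalar objects, whereas
# tolist() has already converted every element to a plain Python float.
print(type(row_from_array), type(row_from_array[0]))
print(type(row_from_list), type(row_from_list[0]))
```

Python's built-in sum() then adds plain floats in the list case, instead of pulling NumPy scalars out of the array one at a time.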
