1

I have around 40k rows and I want to test all kinds of selection combinations on the rows. By selection I mean boolean masks. The number of masks/filters is around 250MM.

The current simplified code:

np_arr = np.random.randint(1, 40000, 40000)
results = np.empty(250000000)
filters = np.random.randint(1, size=(250000000, 40000))
for i in range(250000000):
    row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
    # Performing simple calculations such as sum, prod, count on selected rows and saving to result
    results[i] = row_selection.sum() # Save simple calculation result to results array

I tried Numba and Multiprocessing, but since most of the processing is in the filter selection rather than the computation, that doesn't help much.

What would be the most efficient way to solve this? Is there any way to parallelize this? As far as I see I need to loop through each filter to then individually calculate the sum, prod, count etc because I can't apply filters in parallel (even though the calculations after applying the filters are very simple).

Appreciate any suggestions on performance improvement/speedup.

4
  • Are all functions you want to apply available in Numba, or at least easy to implement? Commented Nov 22, 2018 at 16:49
  • With randint(1, ...), filters[i, j] == 0 for all i, j. Use randint(2, ...) instead. Commented Nov 22, 2018 at 18:26
  • Hi, yes calculations are easy to implement in Numba, but the tricky part is the loop which applies the filter 250MM times. Commented Nov 23, 2018 at 2:51
  • Where do you get the filter array from in your calculation? A boolean array of size (250000000, 40000) has 10TB and would not fit into RAM. Or do you want to create some random numbers in the loop which applies the filter? Commented Nov 23, 2018 at 9:00
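
The 10 TB figure in the last comment can be checked with a quick calculation. This is only a back-of-the-envelope sketch: it assumes one byte per element (as for np.bool_) and, as a hypothetical alternative not mentioned in the thread, one bit per element if the masks were bit-packed.

```python
# Rough memory estimate for holding every filter at once.
n_filters = 250_000_000   # number of masks, from the question
n_rows = 40_000           # number of rows, from the question

bytes_bool = n_filters * n_rows          # np.bool_ stores 1 byte per element
bytes_packed = n_filters * n_rows // 8   # assumption: 1 bit per element (bit-packed)

tb_bool = bytes_bool / 1e12
tb_packed = bytes_packed / 1e12
print(f"bool array: {tb_bool:.2f} TB")
print(f"bit-packed: {tb_packed:.2f} TB")
```

Either way the full array does not fit into typical RAM, so the filters would have to be generated or loaded in chunks.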

2 Answers

3

To get good performance with Numba, avoid boolean masking, because every masked selection creates a costly array copy. You have to implement the filtered reductions yourself, but that shouldn't be a problem for the operations you mentioned.

Parallelization is also very easy to do.

Example

import numpy as np
import numba as nb

max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
filters = np.random.randint(low=0,high=2, size=(max_num, max_num2)).astype(np.bool_)

# Implement each reduction as an explicit loop; this avoids masking and its array copy
# Sum filter
@nb.njit(fastmath=True)
def sum_filter(mask, arr):
  total = 0.
  for i in range(mask.shape[0]):
    if mask[i]:
      total += arr[i]
  return total

# Prod filter (same pattern as the sum filter)
@nb.njit(fastmath=True)
def prod_filter(mask, arr):
  prod = 1.
  for i in range(mask.shape[0]):
    if mask[i]:
      prod *= arr[i]
  return prod

@nb.njit(parallel=True)
def main_func(np_arr, filters):
  results = np.empty(filters.shape[0])
  # prange distributes the independent per-filter iterations across threads
  for i in nb.prange(filters.shape[0]):
    results[i] = sum_filter(filters[i], np_arr)
    #results[i] = prod_filter(filters[i], np_arr)
  return results
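
For the sum and count specifically, a loop-free alternative is to treat each chunk of boolean filters as a 0/1 matrix and multiply it by the data vector. This is a different technique from the Numba kernels above and is not part of the original answer; it is a minimal sketch, assuming the filters can be produced chunk by chunk. The names `masked_sums_counts` and `chunk_size` are hypothetical and introduced only for illustration.

```python
import numpy as np

def masked_sums_counts(arr, filters, chunk_size=1024):
    """Return (sums, counts) of arr under each boolean row of filters."""
    n = filters.shape[0]
    sums = np.empty(n, dtype=np.float64)
    counts = np.empty(n, dtype=np.int64)
    arr_f = arr.astype(np.float64)
    for start in range(0, n, chunk_size):
        chunk = filters[start:start + chunk_size]
        # A 0/1 matrix times the data vector yields one masked sum per row.
        sums[start:start + chunk_size] = chunk.astype(np.float64) @ arr_f
        # Counting selected rows is a row-wise sum of the boolean chunk.
        counts[start:start + chunk_size] = chunk.sum(axis=1)
    return sums, counts

# Small demo with reduced sizes (assumed here; the question uses far larger ones)
rng = np.random.default_rng(0)
arr = rng.integers(1, 4000, size=4000)
filters = rng.integers(0, 2, size=(2500, 4000)).astype(np.bool_)
sums, counts = masked_sums_counts(arr, filters)

# Cross-check against plain boolean masking
expected = np.array([arr[f].sum() for f in filters], dtype=np.float64)
print(np.array_equal(sums, expected))
```

Whether this beats the parallel Numba loop depends on the BLAS library and hardware, and it is not benchmarked here. It also does not extend directly to the product.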
1

One way to improve this is to move the astype call outside the loop, so the whole filter array is converted to boolean once instead of row by row. In my tests this cut the execution time by more than half. For comparison, check the two versions below:

import numpy as np
import time

max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2))
start = time.time()
for i in range(max_num):
    row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
    # Performing simple calculations such as sum, prod, count on selected rows and saving to result
    results[i] = row_selection.sum() # Save simple calculation result to results array

end = time.time()
print(end - start)

takes 2.12 s

while

import numpy as np
import time

max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2)).astype(np.bool_)
start = time.time()
for i in range(max_num):
    row_selection = np_arr[filters[i]] # Select rows based on next filter
    # Performing simple calculations such as sum, prod, count on selected rows and saving to result
    results[i] = row_selection.sum() # Save simple calculation result to results array

end = time.time()
print(end - start)

takes 0.940 s
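
As noted in the comments on the question, randint(1, ...) only produces zeros, so both timings above measure empty selections. The following is a hedged sketch of the same comparison with masks that are roughly half set; the function names `run_convert_inside` and `run_convert_once` and the reduced sizes are assumptions made for illustration, and no timings are claimed.

```python
import time
import numpy as np

max_num = 2000    # reduced from the sizes used above to keep the demo quick
max_num2 = 4000
rng = np.random.default_rng(0)
np_arr = rng.integers(1, max_num2, size=max_num2)
filters_int = rng.integers(0, 2, size=(max_num, max_num2))  # values 0 or 1

def run_convert_inside(arr, filters):
    # Converts each row to bool inside the loop (first version above)
    out = np.empty(filters.shape[0])
    for i in range(filters.shape[0]):
        out[i] = arr[filters[i].astype(np.bool_)].sum()
    return out

def run_convert_once(arr, filters):
    # Converts the whole array to bool once, then loops (second version above)
    bool_filters = filters.astype(np.bool_)
    out = np.empty(filters.shape[0])
    for i in range(filters.shape[0]):
        out[i] = arr[bool_filters[i]].sum()
    return out

for fn in (run_convert_inside, run_convert_once):
    start = time.perf_counter()
    res = fn(np_arr, filters_int)
    print(fn.__name__, time.perf_counter() - start)
```

Unlike the second listing above, this version includes the one-off conversion in the measured time, so the two variants are compared on the same work.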
