
Sorry, I'm not sure how to phrase the title more accurately.

I have an array that I would like to split into 3 sub-arrays; each sub-array should then be downsampled to a different size by averaging groups of consecutive samples.

Here is what I have:

import numpy as np
a = np.arange(100)
bins = [5, 4, 3]                              # number of averaged values per band
split_index = [[20, 39], [40, 59], [60, 80]]  # [start, end] index of each band in a
b = []
for count, item in enumerate(bins):
    start = split_index[count][0]
    end = split_index[count][1]
    increment = (end - start) // item         # samples averaged into one output value
    b_per_band = []
    for i in range(item):
        each_slice = a[start + i * increment : start + (i + 1) * increment]
        b_per_band.append(each_slice.mean())
    b.append(b_per_band)
print(b)

Result:

[[21.0, 24.0, 27.0, 30.0, 33.0], [41.5, 45.5, 49.5, 53.5], [62.5, 68.5, 74.5]]

So I loop through bins, work out the increment for each step, slice accordingly, and append each mean to the result.

But this is really ugly and, most importantly, performs badly. Since I am dealing with audio spectra in my case, I would really like to learn a more efficient way of achieving the same result.

Any suggestions?

  • to clarify: 1) you take slices given by split_index from an array a 2) for each slice, you calculate 'sub-slices' with lengths given in bins, 3) for each sub-slice, you take the average. Is that correct? Commented Aug 7, 2019 at 11:49
  • (1)(2) Correct. (3) For this particular case, each slice has a size of 20, which I would like to downsample to one of the bins, for example 5; that means for every 4 samples I take the average of those 4 and append it (see the sketch below). Commented Aug 7, 2019 at 11:56
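
For example, per that comment: a 20-sample slice downsampled to 5 values means averaging each group of 4 consecutive samples. A minimal sketch (my illustration, assuming the slice length divides evenly by the number of bins):

import numpy as np

slice_20 = np.arange(20, 40)                 # one band of 20 samples
means = slice_20.reshape(5, 4).mean(axis=1)  # average every 4 consecutive samples
print(means)  # [21.5 25.5 29.5 33.5 37.5]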

2 Answers


Here's an option using np.add.reduceat:

import numpy as np

a = np.arange(100)
n_in_bin = [5, 4, 3]
split_index = [[20, 39], [40, 59], [60, 80]]
b = []
for i, sl in enumerate(split_index):
    n_bins = (sl[1] - sl[0]) // n_in_bin[i]      # samples averaged into one output value
    v = a[sl[0]:sl[0] + n_in_bin[i] * n_bins]    # truncate to a whole number of chunks
    sel_bins = np.linspace(0, len(v), n_in_bin[i] + 1, True).astype(int)
    b.append(np.add.reduceat(v, sel_bins[:-1]) / np.diff(sel_bins))
print(b)
# [array([21., 24., 27., 30., 33.]), array([41.5, 45.5, 49.5, 53.5]), array([62.5, 68.5, 74.5])]

Some notes:

  • I changed the name bins to n_in_bin to clarify a bit.
  • Using floor division, you discard some data at the end of each slice. I don't know if that's really important to you; just a hint.
  • The thing that should make this code faster, at least for large array sizes and 'chunks', is the use of np.add.reduceat. In my experience, this can be more efficient than looping (see the sketch after this list).
  • If you have NaNs in your input data, check out this Q&A.
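
To illustrate what the reduceat call computes for the first band, here is a minimal sketch (my illustration, not part of the answer above):

import numpy as np

v = np.arange(20, 35)             # first band, truncated to 5 whole chunks of 3
idx = np.array([0, 3, 6, 9, 12])  # start index of each chunk
sums = np.add.reduceat(v, idx)    # sums v[0:3], v[3:6], ..., v[12:15]
means = sums / 3                  # each chunk has length 3 here
print(means)  # [21. 24. 27. 30. 33.]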

EDIT/REVISION

Since I'm also working on binning stuff at the moment, I tried a couple of things and ran timeit for the three methods shown so far: 'looped' for the one in the question, 'npredat' using np.add.reduceat, and 'npsplit' using np.split. Over 100000 iterations I got the following average times per iteration in [µs]:

a = np.arange(10000)
bins = [5, 4, 3]
split_index = [[20, 3900], [40, 5900], [60, 8000]]
-->
looped: 127.3, npredat: 116.9, npsplit: 135.5

vs.

a = np.arange(100)
bins = [5, 4, 3]
split_index = [[20, 39], [40, 59], [60, 80]]
-->
looped: 95.2, npredat: 103.5, npsplit: 100.5

However, the results were slightly inconsistent across multiple runs of the 100k iterations and might differ on machines other than the one I tried this on. So my conclusion so far would be that the differences are marginal: all 3 options fall somewhere in the 1 µs to 1 ms range per iteration.
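
The harness itself isn't shown above; a minimal sketch of how such per-iteration averages could be obtained with timeit, using the question's looped method as an example (the other two variants can be wrapped and timed the same way):

import timeit
import numpy as np

a = np.arange(10000)
bins = [5, 4, 3]
split_index = [[20, 3900], [40, 5900], [60, 8000]]

def looped():
    # the pure-Python loop from the question
    b = []
    for count, item in enumerate(bins):
        start, end = split_index[count]
        increment = (end - start) // item
        b.append([a[start + i * increment:start + (i + 1) * increment].mean()
                  for i in range(item)])
    return b

n = 100000
print(timeit.timeit(looped, number=n) / n * 1e6, 'µs per iteration')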


2 Comments

Yeah, this is also what I found when comparing different methods. There seems to be no clear advantage to any of them with sample sizes smaller than 100000; larger than that, this one seems more efficient. I don't know why, as the operations all look pretty linear to me.
Right, maybe there is some memory optimization going on that also depends on the current state of the memory, but that's just guessing. Maybe the question is also suited for the Code Review forum.

What you're doing looks very weird to me, including the setup, which could probably use a different approach that would make the problem much simpler.

However, using the same approach, you could try this:

b = []

for count, item in enumerate(bins):
    start = split_index[count][0]
    end = split_index[count][1]
    increment = (end - start) // item

    # split the truncated band into `item` equal chunks and average each one
    b_per_band = np.mean(np.split(a[start:start + item * increment], item), axis=1)

    b.append(b_per_band)
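
Since each band is cut into equal-size chunks, the np.split/np.mean pair is equivalent to a reshape followed by a row-wise mean, which avoids building an intermediate list of sub-arrays. A minimal sketch of that variant (my addition, not from the answer):

import numpy as np

a = np.arange(100)
bins = [5, 4, 3]
split_index = [[20, 39], [40, 59], [60, 80]]

b = []
for count, item in enumerate(bins):
    start, end = split_index[count]
    increment = (end - start) // item
    # view the truncated band as (item, increment) and average each row
    b.append(a[start:start + item * increment].reshape(item, increment).mean(axis=1))
print(b)
# [array([21., 24., 27., 30., 33.]), array([41.5, 45.5, 49.5, 53.5]), array([62.5, 68.5, 74.5])]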

1 Comment

Hmm, this is correct, but I timed an 8000-sample array split into 3×2000 chunks, and using np.split() was actually slower: the previous method takes 107 µs vs. 125 µs.
