
Sorry, I'm not sure how to phrase the title more accurately.

I have an array that I would like to split into 3 sub-arrays; each sub-array should then be downsampled to a different size by averaging groups of consecutive samples.

Here is what I have:

import numpy as np
a = np.arange(100)
bins = [5, 4, 3]                              # number of averaged values per band
split_index = [[20, 39], [40, 59], [60, 80]]  # [start, end] index of each band in a
b = []
for count, item in enumerate(bins):
    start = split_index[count][0]
    end = split_index[count][1]
    increment = (end - start) // item         # samples averaged into one output value
    b_per_band = []
    for i in range(item):
        each_slice = a[start + i * increment : start + (i + 1) * increment]
        b_per_band.append(each_slice.mean())
    b.append(b_per_band)
print(b)

Result:

[[21.0, 24.0, 27.0, 30.0, 33.0], [41.5, 45.5, 49.5, 53.5], [62.5, 68.5, 74.5]]

So I loop through bins, work out the increment for each step, slice accordingly, and append each mean to the result.

But this is really ugly and, most importantly, performs badly. Since I am dealing with audio spectra in my case, I would really like to learn a more efficient way of achieving the same result.

Any suggestions?

  • to clarify: 1) you take slices given by split_index from an array a 2) for each slice, you calculate 'sub-slices' with lengths given in bins, 3) for each sub-slice, you take the average. Is that correct? Commented Aug 7, 2019 at 11:49
  • (1)(2) Correct. (3) For this particular case, each slice has a size of 20, which I would like to downsample to one of the bins, for example 5; that means for every 4 samples I take the average of those 4 and append it (see the sketch below). Commented Aug 7, 2019 at 11:56
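
For example, per that comment: a 20-sample slice downsampled to 5 values means averaging each group of 4 consecutive samples. A minimal sketch (my illustration, assuming the slice length divides evenly by the number of bins):

import numpy as np

slice_20 = np.arange(20, 40)                 # one band of 20 samples
means = slice_20.reshape(5, 4).mean(axis=1)  # average every 4 consecutive samples
print(means)  # [21.5 25.5 29.5 33.5 37.5]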

2 Answers


Here's an option using np.add.reduceat:

import numpy as np

a = np.arange(100)
n_in_bin = [5, 4, 3]
split_index = [[20, 39], [40, 59], [60, 80]]
b = []
for i, sl in enumerate(split_index):
    n_bins = (sl[1] - sl[0]) // n_in_bin[i]      # samples averaged into one output value
    v = a[sl[0]:sl[0] + n_in_bin[i] * n_bins]    # truncate to a whole number of chunks
    sel_bins = np.linspace(0, len(v), n_in_bin[i] + 1, True).astype(int)
    b.append(np.add.reduceat(v, sel_bins[:-1]) / np.diff(sel_bins))
print(b)
# [array([21., 24., 27., 30., 33.]), array([41.5, 45.5, 49.5, 53.5]), array([62.5, 68.5, 74.5])]

Some notes:

  • I changed the name bins to n_in_bin to clarify a bit.
  • Using floor division, you discard some data at the end of each slice. I don't know if that's really important to you; just a hint.
  • The thing that should make this code faster, at least for large array sizes and 'chunks', is the use of np.add.reduceat. In my experience, this can be more efficient than looping (see the sketch after this list).
  • If you have NaNs in your input data, check out this Q&A.
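
To illustrate what the reduceat call computes for the first band, here is a minimal sketch (my illustration, not part of the answer above):

import numpy as np

v = np.arange(20, 35)             # first band, truncated to 5 whole chunks of 3
idx = np.array([0, 3, 6, 9, 12])  # start index of each chunk
sums = np.add.reduceat(v, idx)    # sums v[0:3], v[3:6], ..., v[12:15]
means = sums / 3                  # each chunk has length 3 here
print(means)  # [21. 24. 27. 30. 33.]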

EDIT/REVISION

Since I'm also working on binning stuff at the moment, I tried a couple of things and ran timeit for the three methods shown so far: 'looped' for the one in the question, 'npredat' using np.add.reduceat, and 'npsplit' using np.split. Over 100000 iterations I got the following average times per iteration in [µs]:

a = np.arange(10000)
bins = [5, 4, 3]
split_index = [[20, 3900], [40, 5900], [60, 8000]]
-->
looped: 127.3, npredat: 116.9, npsplit: 135.5

vs.

a = np.arange(100)
bins = [5, 4, 3]
split_index = [[20, 39], [40, 59], [60, 80]]
-->
looped: 95.2, npredat: 103.5, npsplit: 100.5

However, the results were slightly inconsistent across multiple runs of the 100k iterations and might differ on machines other than the one I tried this on. So my conclusion so far would be that the differences are marginal: all 3 options fall somewhere in the 1 µs to 1 ms range per iteration.
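
The harness itself isn't shown above; a minimal sketch of how such per-iteration averages could be obtained with timeit, using the question's looped method as an example (the other two variants can be wrapped and timed the same way):

import timeit
import numpy as np

a = np.arange(10000)
bins = [5, 4, 3]
split_index = [[20, 3900], [40, 5900], [60, 8000]]

def looped():
    # the pure-Python loop from the question
    b = []
    for count, item in enumerate(bins):
        start, end = split_index[count]
        increment = (end - start) // item
        b.append([a[start + i * increment:start + (i + 1) * increment].mean()
                  for i in range(item)])
    return b

n = 100000
print(timeit.timeit(looped, number=n) / n * 1e6, 'µs per iteration')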


2 Comments

Yeah, this is also what I found when comparing different methods. There seems to be no clear advantage to any of them with sample sizes smaller than 100000; larger than that, this one seems more efficient. I don't know why, as the operations all look pretty linear to me.
Right, maybe there is some memory optimization going on that also depends on the current state of the memory, but that's just guessing. Maybe the question is also suited for the Code Review forum.

What you're doing looks very weird to me, including the setup, which could probably use a different approach that would make the problem much simpler.

However, using the same approach, you could try this:

b = []

for count, item in enumerate(bins):
    start = split_index[count][0]
    end = split_index[count][1]
    increment = (end - start) // item

    # split the truncated band into `item` equal chunks and average each one
    b_per_band = np.mean(np.split(a[start:start + item * increment], item), axis=1)

    b.append(b_per_band)
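
Since each band is cut into equal-size chunks, the np.split/np.mean pair is equivalent to a reshape followed by a row-wise mean, which avoids building an intermediate list of sub-arrays. A minimal sketch of that variant (my addition, not from the answer):

import numpy as np

a = np.arange(100)
bins = [5, 4, 3]
split_index = [[20, 39], [40, 59], [60, 80]]

b = []
for count, item in enumerate(bins):
    start, end = split_index[count]
    increment = (end - start) // item
    # view the truncated band as (item, increment) and average each row
    b.append(a[start:start + item * increment].reshape(item, increment).mean(axis=1))
print(b)
# [array([21., 24., 27., 30., 33.]), array([41.5, 45.5, 49.5, 53.5]), array([62.5, 68.5, 74.5])]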

1 Comment

Hmm, this is correct, but I timed an 8000-sample array split into 3×2000 chunks, and using np.split() was actually slower: the previous method takes 107 µs vs. 125 µs.
