
I have a big data frame with a DatetimeIndex and multiple columns. Now I would like to have an operation resample_3d which can be used like this:

index, array = df.resample_3d("1h", fill_value=0)

... and transforms the data frame

index | A | B | C | D
10:00 | 1 |   |   |
10:01 | 1 |   |   |
12:00 | 1 |   |   |
13:00 | 1 |   |   |

into a 3D NumPy array of shape (3, 2, 4). The first dimension is the time (which can be looked up in the separately returned index), the second dimension is the row index within the "resample group", and the third dimension is the features. The size of the second dimension equals the maximum number of rows in a single resample group. Unused entries are filled (e.g. with zeros).

Is there such a function (or a similar one) in Pandas or another library, or is there a way to implement something like this efficiently on top of Pandas without too much work?

I am aware that I could build something on top of df.resample().apply(list), but this is way too slow for bigger data frames.

I have already started my own implementation with Numba, but quickly realized that this is quite a lot of work.
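To make the target semantics concrete, here is a small, untested sketch (plain Pandas/NumPy, with made-up data matching the table above) of the transformation I am describing; a fast, general version of this is essentially what I am looking for:

import numpy as np
import pandas as pd

# toy frame matching the table above (hypothetical values; only the shape matters)
idx = pd.to_datetime(["2020-01-01 10:00", "2020-01-01 10:01",
                      "2020-01-01 12:00", "2020-01-01 13:00"])
df = pd.DataFrame(1, index=idx, columns=list("ABCD"))

def resample_3d(df, freq, fill_value=0):
    # label every row with its time bucket and its position inside that bucket
    bucket = df.index.floor(freq)
    pos = df.groupby(bucket).cumcount().to_numpy()

    index = bucket.unique()                        # 1st dim: one entry per bucket
    shape = (len(index), pos.max() + 1, df.shape[1])
    array = np.full(shape, fill_value, dtype=float)
    array[index.get_indexer(bucket), pos, :] = df.to_numpy()
    return index, array

index, array = resample_3d(df, "1h")
print(array.shape)  # (3, 2, 4)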

(I have just discovered xarray and thought I would tag this question with it, because it may be a better base for this than Pandas.)

1 Answer

It is unclear what your data is like, but yes, xarray might be what you are searching for.

Once your data is well-formatted as a DataArray, you can then just do:

da.resample(time="1h")

It will return a DataArrayResample object.

Usually, when resampling, the new coordinates grid doesn't match the previous grid.

Thus, from there, you need to apply one of the numerous methods of the DataArrayResample object to tell xarray how to fill this new grid.

For example, you may want to interpolate values using the original data as knots:

da.resample(time="1h").interpolate("linear")

But you can also backfill, pad, use the nearest values, etc.
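For instance, a quick sketch of those variants (standard fill methods on the resample object):

da.resample(time="1h").backfill()   # fill from the next original value
da.resample(time="1h").pad()        # fill from the previous original value
da.resample(time="1h").nearest()    # fill from the closest original value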

If you don't want to fill the new grid, use .asfreq() and new times will be set to NaN. You'll still be able to interpolate later using interpolate_na().
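A sketch of that combination (the dim="time" argument here is just how I would write it; adapt it to your dimension name):

da.resample(time="1h").asfreq()                              # new times become NaN
da.resample(time="1h").asfreq().interpolate_na(dim="time")   # fill them later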

Your case

In your case, it seems that you are doing a down-sampling, and thus that there is an exact match between new grid coordinates and original grid coordinates.

So any of .nearest(), .asfreq(), or .interpolate() will work for you (note that .interpolate() will convert int to float).

However, since you are downsampling at exact grid knots, what you are really doing is selecting a subset of your array, so you might want to use the .sel() method instead.

Example

An example of down-sampling on exact grid knots.

Create the data:

>>> import string
>>> import numpy as np
>>> import pandas as pd
>>> import xarray as xr

>>> dims = ("time", "features")
>>> sizes = (6, 3)
>>> h_step = 0.5

>>> da = xr.DataArray(
        dims=dims,
        data=np.arange(np.prod(sizes)).reshape(*sizes),
        coords=dict(
            time=pd.date_range(
                "04/07/2020",
                periods=sizes[0],
                freq=pd.DateOffset(hours=h_step),
            ),
            features=list(string.ascii_uppercase[: sizes[1]]),
        ),
    )

>>> da
<xarray.DataArray (time: 6, features: 3)>
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17]])
Coordinates:
  * time      (time) datetime64[ns] 2020-04-07 ... 2020-04-07T02:30:00
  * features  (features) <U1 'A' 'B' 'C'

>>> da.time.values
array(['2020-04-07T00:00:00.000000000',
       '2020-04-07T00:30:00.000000000',
       '2020-04-07T01:00:00.000000000', 
       '2020-04-07T01:30:00.000000000',
       '2020-04-07T02:00:00.000000000',
       '2020-04-07T02:30:00.000000000'],
      dtype='datetime64[ns]')

Downsampling using .resample() and .nearest():

>>> da.resample(time="1h").nearest()
<xarray.DataArray (time: 3, features: 3)>
array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])
Coordinates:
  * time      (time) datetime64[ns] 2020-04-07 ... 2020-04-07T02:00:00
  * features  (features) <U1 'A' 'B' 'C'

>>> da.resample(time="1h").nearest().time.values
array(['2020-04-07T00:00:00.000000000',
       '2020-04-07T01:00:00.000000000',
       '2020-04-07T02:00:00.000000000'],
      dtype='datetime64[ns]')
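To illustrate the int-to-float note from the previous section, a quick check (sketch; the exact repr may differ):

>>> da.resample(time="1h").interpolate("linear").dtype
dtype('float64')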

Down-sampling by selection:

>>> dwn_step = 2

>>> new_time = pd.date_range(
        "04/07/2020",
        periods=sizes[0] // dwn_step,
        freq=pd.DateOffset(hours=h_step * dwn_step),
    )

>>> da.sel(time=new_time)
<xarray.DataArray (time: 3, features: 3)>
array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])
Coordinates:
  * time      (time) datetime64[ns] 2020-04-07 ... 2020-04-07T02:00:00
  * features  (features) <U1 'A' 'B' 'C'

>>> da.sel(time=new_time).time.values
array(['2020-04-07T00:00:00.000000000',
       '2020-04-07T01:00:00.000000000',
       '2020-04-07T02:00:00.000000000'],
      dtype='datetime64[ns]')

Another option to create the new_time index is simply:

new_time = da.time[::dwn_step]

It is more straightforward, but you can't choose the first selected time (which can be either good or bad, depending on your case).
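Equivalently (just another way to write the same positional selection, not something you have to use), you could do:

da.isel(time=slice(None, None, dwn_step))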


2 Comments

Yes, but this returns a DataArrayResample. It is unclear to me how I could transform this into a 3D array (except with apply, which is too slow). The idea is to resample all data into buckets (the 2nd dimension in the example above) without reducing (e.g. mean) it.
Not all of the methods that can be applied to a DataArrayResample object are reducing methods. I added more in-depth explanations; hope it helps.
