
I have a big data frame with a DatetimeIndex and multiple columns. Now I would like to have an operation resample_3d which can be used like this:

index, array = df.resample_3d("1h", fill_value=0)

... and transforms the data frame

index | A | B | C | D
10:00 | 1 |   |   |
10:01 | 1 |   |   |
12:00 | 1 |   |   |
13:00 | 1 |   |   |

into a 3D NumPy array of shape (3, 2, 4). The first dimension is the time (which can be looked up in the separately returned index), the second dimension is the row index within the "resample group", and the third dimension is the features. The size of the second dimension equals the maximum number of rows in a single resample group. Unused entries are filled (e.g. with zeros).

Is there such a function (or a similar one) in Pandas or another library, or is there a way to implement something like this efficiently on top of Pandas without too much work?

I am aware that I could build something on top of df.resample().apply(list), but this is way too slow for bigger data frames.

I have already started my own implementation with Numba, but quickly realized that this is quite a lot of work.
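To make the target semantics concrete, here is a small, untested sketch (plain Pandas/NumPy, with made-up data matching the table above) of the transformation I am describing; a fast, general version of this is essentially what I am looking for:

import numpy as np
import pandas as pd

# toy frame matching the table above (hypothetical values; only the shape matters)
idx = pd.to_datetime(["2020-01-01 10:00", "2020-01-01 10:01",
                      "2020-01-01 12:00", "2020-01-01 13:00"])
df = pd.DataFrame(1, index=idx, columns=list("ABCD"))

def resample_3d(df, freq, fill_value=0):
    # label every row with its time bucket and its position inside that bucket
    bucket = df.index.floor(freq)
    pos = df.groupby(bucket).cumcount().to_numpy()

    index = bucket.unique()                        # 1st dim: one entry per bucket
    shape = (len(index), pos.max() + 1, df.shape[1])
    array = np.full(shape, fill_value, dtype=float)
    array[index.get_indexer(bucket), pos, :] = df.to_numpy()
    return index, array

index, array = resample_3d(df, "1h")
print(array.shape)  # (3, 2, 4)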

(I have just discovered xarray and thought I would tag this question with it, because it may be a better base for this than Pandas.)

1 Answer

It is unclear what your data is like, but yes, xarray might be what you are searching for.

Once your data is well-formatted as a DataArray, you can then just do:

da.resample(time="1h")

It will return a DataArrayResample object.

Usually, when resampling, the new coordinates grid doesn't match the previous grid.

Thus, from there, you need to apply one of the numerous methods of the DataArrayResample object to tell xarray how to fill this new grid.

For example, you may want to interpolate values using the original data as knots:

da.resample(time="1h").interpolate("linear")

But you can also backfill, pad, use the nearest values, etc.
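For instance, a quick sketch of those variants (standard fill methods on the resample object):

da.resample(time="1h").backfill()   # fill from the next original value
da.resample(time="1h").pad()        # fill from the previous original value
da.resample(time="1h").nearest()    # fill from the closest original value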

If you don't want to fill the new grid, use .asfreq() and new times will be set to NaN. You'll still be able to interpolate later using interpolate_na().
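A sketch of that combination (the dim="time" argument here is just how I would write it; adapt it to your dimension name):

da.resample(time="1h").asfreq()                              # new times become NaN
da.resample(time="1h").asfreq().interpolate_na(dim="time")   # fill them later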

Your case

In your case, it seems that you are doing a down-sampling, and thus that there is an exact match between new grid coordinates and original grid coordinates.

So any of .nearest(), .asfreq(), or .interpolate() will work for you (note that .interpolate() will convert int to float).

However, since you are downsampling at exact grid knots, what you are really doing is selecting a subset of your array, so you might want to use the .sel() method instead.

Example

An example of down-sampling on exact grid knots.

Create the data:

>>> import string
>>> import numpy as np
>>> import pandas as pd
>>> import xarray as xr

>>> dims = ("time", "features")
>>> sizes = (6, 3)
>>> h_step = 0.5

>>> da = xr.DataArray(
        dims=dims,
        data=np.arange(np.prod(sizes)).reshape(*sizes),
        coords=dict(
            time=pd.date_range(
                "04/07/2020",
                periods=sizes[0],
                freq=pd.DateOffset(hours=h_step),
            ),
            features=list(string.ascii_uppercase[: sizes[1]]),
        ),
    )

>>> da
<xarray.DataArray (time: 6, features: 3)>
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17]])
Coordinates:
  * time      (time) datetime64[ns] 2020-04-07 ... 2020-04-07T02:30:00
  * features  (features) <U1 'A' 'B' 'C'

>>> da.time.values
array(['2020-04-07T00:00:00.000000000',
       '2020-04-07T00:30:00.000000000',
       '2020-04-07T01:00:00.000000000', 
       '2020-04-07T01:30:00.000000000',
       '2020-04-07T02:00:00.000000000',
       '2020-04-07T02:30:00.000000000'],
      dtype='datetime64[ns]')

Downsampling using .resample() and .nearest():

>>> da.resample(time="1h").nearest()
<xarray.DataArray (time: 3, features: 3)>
array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])
Coordinates:
  * time      (time) datetime64[ns] 2020-04-07 ... 2020-04-07T02:00:00
  * features  (features) <U1 'A' 'B' 'C'

>>> da.resample(time="1h").nearest().time.values
array(['2020-04-07T00:00:00.000000000',
       '2020-04-07T01:00:00.000000000',
       '2020-04-07T02:00:00.000000000'],
      dtype='datetime64[ns]')
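To illustrate the int-to-float note from the previous section, a quick check (sketch; the exact repr may differ):

>>> da.resample(time="1h").interpolate("linear").dtype
dtype('float64')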

Down-sampling by selection:

>>> dwn_step = 2

>>> new_time = pd.date_range(
        "04/07/2020",
        periods=sizes[0] // dwn_step,
        freq=pd.DateOffset(hours=h_step * dwn_step),
    )

>>> da.sel(time=new_time)
<xarray.DataArray (time: 3, features: 3)>
array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])
Coordinates:
  * time      (time) datetime64[ns] 2020-04-07 ... 2020-04-07T02:00:00
  * features  (features) <U1 'A' 'B' 'C'

>>> da.sel(time=new_time).time.values
array(['2020-04-07T00:00:00.000000000',
       '2020-04-07T01:00:00.000000000',
       '2020-04-07T02:00:00.000000000'],
      dtype='datetime64[ns]')

Another option to create the new_time index is simply:

new_time = da.time[::dwn_step]

It is more straightforward, but you can't choose the first selected time (which can be either good or bad, depending on your case).
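Equivalently (just another way to write the same positional selection, not something you have to use), you could do:

da.isel(time=slice(None, None, dwn_step))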


2 Comments

Yes, but this returns a DataArrayResample. It is unclear to me how I could transform this into a 3D array (except with apply, which is too slow). The idea is to resample all data into buckets (the 2nd dimension in the example above) without reducing (e.g. mean) it.
Not all of the methods that can be applied to a DataArrayResample object are reducing methods. I added more in-depth explanations; hope it helps.
