
I have this code which generates autoregressive terms within each unique combination of variables 'grouping A' and 'grouping B'.

for i in range(1, 5):
    df.loc[:, 'var_' + str(i)] = df.sort_values(by='date') \
                                     .groupby(['grouping A', 'grouping B']) \
                                     ['target'].sum().shift(i).ffill().bfill().values

Is it possible to sort values, group, shift, and then assign to a new variable without computing in Dask?

  • I'm guessing that df is small and fits in memory, and that your main goal with Dask is to speed things up by parallelizing over the for loop. Is this guess correct? Commented Mar 15, 2017 at 19:26
  • It is indeed small (150 million rows) and fits in memory - but I'm trying to build the script to utilize much, much larger data frames. Commented Mar 15, 2017 at 19:28

1 Answer


Dask.delayed

So if you just want to parallelize the for loop, you might do the following with dask.delayed:

import dask

ddf = dask.delayed(df)
results = []

for i in range(1, 5):
    result = ddf.sort_values(by='date') \
                .groupby(['grouping A', 'grouping B']) \
                ['target'].sum().shift(i).ffill().bfill().values
    results.append(result)

results = dask.compute(*results)

for i, result in enumerate(results, start=1):
    df[...] = result  # mutate dataframe as you like

That is, we wrap the dataframe in dask.delayed. Any method call on it becomes lazy. We collect up all of these lazy method calls and then compute them together with dask.compute. We don't want to mutate the dataframe during this period (that would be weird), so we do it afterwards.
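As a minimal, self-contained illustration of that pattern (the dataframe and column names here are made up, not from the question):

import dask
import pandas as pd

pdf = pd.DataFrame({'g': ['a', 'b', 'a', 'b'], 'x': [1, 2, 3, 4]})

lazy = dask.delayed(pdf)                        # wrap; nothing is computed yet
tasks = [lazy.groupby('g')['x'].sum().shift(i)  # each chained call stays lazy
         for i in range(1, 3)]
outputs = dask.compute(*tasks)                  # run all tasks in one pass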

Large dataframe

If you want to do this with a large dataframe then you would probably want to use dask.dataframe instead. This will be less straightforward, but will hopefully work decently well. You should really look out for the sort_values operation. Distributed sorting is a very hard problem and very expensive. You want to minimize this if possible.

import dask.dataframe as dd

# Load a distributed dataframe with `dd.read_csv`, `dd.read_parquet`, etc.
ddf = dd.read_parquet('...')

ddf = ddf.set_index('date').persist()

results = []
for i in range(1, 5):
    result = ddf.groupby(['grouping A', 'grouping B']) \
                ['target'].sum().shift(i).ffill().bfill()
    results.append(result)

ddf2 = dd.concat([ddf] + results, axis=1)

Here we use set_index rather than sort_values, and we make sure to do it exactly once (it's likely to take 10-100x longer than any other operation here). We then use the normal groupby syntax and things should be fine (although I have to admit I haven't verified that ffill and bfill are definitely implemented; I assume so though). As before, we don't want to mutate our data during computation (that would be weird), so we do a concat afterwards.
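Note that building ddf2 doesn't actually run anything yet; a sketch of two common ways to trigger the computation (the output path is hypothetical):

# Pull everything into one pandas dataframe (only if the result fits in memory)
pdf = ddf2.compute()

# Or stay out-of-core and write the partitions to disk
ddf2.to_parquet('output/')  # hypothetical path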

Maybe simpler

Probably you'll get a greatly reduced dataframe after the groupby-sum. Use dask.dataframe for that step, then ditch Dask and head back to the comfort of pandas.

# Load a distributed dataframe with `dd.read_csv`, `dd.read_parquet`, etc.
ddf = dd.read_parquet('...')
pdf = ddf.groupby(['grouping A', 'grouping B']).target.sum().compute()
# ... do whatever you want with a much smaller pandas dataframe ...
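For instance, a sketch of that hand-off, assuming the lags are meant to be taken over `date` within each group (so `date` is kept as a grouping key in the Dask step; column names follow the question):

# Reduce with Dask: one row per (grouping A, grouping B, date) combination
pdf = (ddf.groupby(['grouping A', 'grouping B', 'date'])['target']
          .sum().compute()
          .reset_index()
          .sort_values('date'))

# Back in pandas: lagged autoregressive terms within each group
for i in range(1, 5):
    pdf['var_' + str(i)] = (pdf.groupby(['grouping A', 'grouping B'])['target']
                               .shift(i).ffill().bfill())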