How to create multiple list aggregations using groupby on a pandas dataframe in Python?

Question

Take the following example pandas DataFrame:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "start": ["jan1", "jan1", "jan4", "feb17", "jan4", "mar3"],
                   "end": ["jan3", "jan3", "jan21", "feb17", "jan21", "mar4"],
                   "duration": [2, 2, 17, 0, 17, 1],
                   "case_id": ["case1", "case43", "case6", "case1", "case22", "case69"]
                  })

I want to use a pandas groupby operation on columns start, end and duration to perform two list aggregations on the dataframe:

a list of the id values for each group
a list of the case_id values for each group

My desired output would look like this:

start    end    duration    ids    cases
jan1     jan3   2           [1, 2] [case1, case43]
jan4     jan21  17          [3, 5] [case6, case22]
feb17    feb17  0           [4]    [case1]
mar3     mar4   1           [6]    [case69]

How to do this efficiently using pandas groupby?

I know that if I would need only one aggregation I could do it like this:

df = df.groupby(['start', 'end', 'duration'])['id'].apply(list).to_frame()

How to do this for multiple list aggregations? And if there are multiple options, what would be the least time consuming? (the DataFrames I'm transforming are quite large)

sophocles · Accepted Answer · 2021-03-27 11:57:15Z

1

You will need to use pandas.groupby.agg, and specify the columns you want to return as list.

To lessen the time needed, since you have categorical columns in your data, make sure you use the observed=True option in your groupby command. This makes sure it only creates lines where an entry is present (more information on this here)

res = df.groupby(['start', 'end', 'duration'],observed=True)[['id','case_id']].agg(list).reset_index().sort_values(by='id')

Output:

res
Out[164]: 
   start    end  duration      id          case_id
1   jan1   jan3         2  [1, 2]  [case1, case43]
2   jan4  jan21        17  [3, 5]  [case6, case22]
0  feb17  feb17         0     [4]          [case1]
3   mar3   mar4         1     [6]         [case69]

Assuming that your unique categories are not too many and your dataset is not excessively large, this shouldn't be a problem. Generally, processing strings takes a lot longer than processing numbers, so if this takes too long to run, you could try converting your object columns to numeric columns and re-doing your groupby.

edited Mar 27, 2021 at 11:57

answered Mar 27, 2021 at 11:40

sophocles

13.9k3 gold badges18 silver badges37 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Peter Over a year ago

Excellent and clear answer! Thank you very much, it works ;)

Collectives™ on Stack Overflow

How to create multiple list aggregations using groupby on a pandas dataframe in Python?

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related