1

Take the following example pandas DataFrame:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "start": ["jan1", "jan1", "jan4", "feb17", "jan4", "mar3"],
                   "end": ["jan3", "jan3", "jan21", "feb17", "jan21", "mar4"],
                   "duration": [2, 2, 17, 0, 17, 1],
                   "case_id": ["case1", "case43", "case6", "case1", "case22", "case69"]
                  })

I want to use a pandas groupby operation on columns start, end and duration to perform two list aggregations on the dataframe:

  • a list of the id values for each group
  • a list of the case_id values for each group

My desired output would look like this:

start    end    duration    ids    cases
jan1     jan3   2           [1, 2] [case1, case43]
jan4     jan21  17          [3, 5] [case6, case22]
feb17    feb17  0           [4]    [case1]
mar3     mar4   1           [6]    [case69]

How to do this efficiently using pandas groupby?

I know that if I would need only one aggregation I could do it like this:

df = df.groupby(['start', 'end', 'duration'])['id'].apply(list).to_frame()

How to do this for multiple list aggregations? And if there are multiple options, what would be the least time consuming? (the DataFrames I'm transforming are quite large)

1 Answer 1

1

You will need to use pandas.groupby.agg, and specify the columns you want to return as list.

To lessen the time needed, since you have categorical columns in your data, make sure you use the observed=True option in your groupby command. This makes sure it only creates lines where an entry is present (more information on this here)

res = df.groupby(['start', 'end', 'duration'],observed=True)[['id','case_id']].agg(list).reset_index().sort_values(by='id')

Output:

res
Out[164]: 
   start    end  duration      id          case_id
1   jan1   jan3         2  [1, 2]  [case1, case43]
2   jan4  jan21        17  [3, 5]  [case6, case22]
0  feb17  feb17         0     [4]          [case1]
3   mar3   mar4         1     [6]         [case69]

Assuming that your unique categories are not too many and your dataset is not excessively large, this shouldn't be a problem. Generally, processing strings takes a lot longer than processing numbers, so if this takes too long to run, you could try converting your object columns to numeric columns and re-doing your groupby.

Sign up to request clarification or add additional context in comments.

1 Comment

Excellent and clear answer! Thank you very much, it works ;)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.