add column with count by constraint

Question

Could someone please help me out? I'm trying to remove the need to iterate through the dataframe, and I suspect this is easy for someone with the right pandas knowledge.

Dataframe:

             id  racecourse     going  distance  runners  draw  draw_bias
0        253375         178  Standard       7.0       13     2       0.50
1        253375         178  Standard       7.0       13    11       0.25
2        253375         178  Standard       7.0       13    12       1.00
3        253376         178  Standard       6.0       12     2       1.00
4        253376         178  Standard       6.0       12     8       0.50
...         ...         ...       ...       ...      ...   ...        ...
378867  4802789         192  Standard       7.0       16    11       0.50
378868  4802789         192  Standard       7.0       16    16       0.10
378869  4802790         192  Standard       7.0       16     1       0.25
378870  4802790         192  Standard       7.0       16     3       0.50
378871  4802790         192  Standard       7.0       16     8       1.00
378872 rows × 7 columns

What I need is to add a new column holding the count of unique races (id) among all rows that share the same racecourse, going, distance and runners. The code below works as expected, but it is very slow:

df['race_count'] = None
for i, row in df.iterrows():
  df.at[i, 'race_count'] = df.loc[(df.racecourse==row.racecourse)&(df.going==row.going)&(df.distance==row.distance)&(df.runners==row.runners), 'id'].nunique()

Flursch · Accepted Answer · 2021-08-13 21:56:57Z

Sorry, this is not a complete solution, just an idea.

In pandas, you can split a data frame into subgroups based on one or more grouping variables using the groupby method. You can then apply an operation (in this case nunique) to each of the subgroups:

df.groupby(['racecourse', 'going', 'distance', 'runners'])['id'].nunique()

This should give you, for each combination of characteristics (racecourse, going, ...), the number of distinct id values, i.e. the number of different races that share those characteristics.

Most importantly, this should be much faster than looping over the rows, especially for larger data frames.
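To make the grouping step concrete, here is a minimal sketch on a toy frame. The six rows are only modelled on the sample shown in the question, and the variable names keys and counts are made up for this illustration.

```python
import pandas as pd

# Toy data modelled on the rows shown in the question (illustrative only).
df = pd.DataFrame({
    "id":         [253375, 253375, 253376, 4802789, 4802789, 4802790],
    "racecourse": [178, 178, 178, 192, 192, 192],
    "going":      ["Standard"] * 6,
    "distance":   [7.0, 7.0, 6.0, 7.0, 7.0, 7.0],
    "runners":    [13, 13, 12, 16, 16, 16],
})

# Columns that define "the same kind of race" (hypothetical helper name).
keys = ["racecourse", "going", "distance", "runners"]

# One entry per distinct key combination, holding the number of distinct ids.
# The result is a Series indexed by a MultiIndex built from the key columns.
counts = df.groupby(keys)["id"].nunique()
print(counts)

# The group (192, "Standard", 7.0, 16) contains two distinct ids.
print(counts.loc[(192, "Standard", 7.0, 16)])
```

Note that the result has one entry per group, not one per row of df, which is why a further step is needed to attach the counts to the original frame.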

EDIT:

Here's a complete solution that also combines the result with the original data frame (thanks to ojdo for suggesting join/merge):

race_count = df.groupby(['racecourse', 'going', 'distance', 'runners'])['id'].nunique()
race_count.name = 'race_count'
df.merge(race_count, on=['racecourse', 'going', 'distance', 'runners'])

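Conveniently, merge broadcasts the values in race_count to all rows of df based on the values in the columns specified by the on parameter. Note that merge returns a new data frame rather than modifying df in place, so assign the result if you want to keep it.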

For the sample rows shown in the question, this outputs:

        id  racecourse     going  distance  runners  draw  draw_bias  race_count  
0   253375         178  Standard       7.0       13     2       0.50           1  
1   253375         178  Standard       7.0       13    11       0.25           1  
2   253375         178  Standard       7.0       13    12       1.00           1  
3   253376         178  Standard       6.0       12     2       1.00           1  
4   253376         178  Standard       6.0       12     8       0.50           1  
5  4802789         192  Standard       7.0       16    11       0.50           2  
6  4802789         192  Standard       7.0       16    16       0.10           2  
7  4802790         192  Standard       7.0       16     1       0.25           2  
8  4802790         192  Standard       7.0       16     3       0.50           2  
9  4802790         192  Standard       7.0       16     8       1.00           2
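The merge step above can also be written a little more defensively. The sketch below is a variation under assumptions that are not in the original answer: it uses how="left" so that every row of df is kept, and validate="many_to_one" so that pandas raises an error if the right-hand side ever contains duplicate keys. The toy data and the name result are illustrative.

```python
import pandas as pd

# Toy data modelled on the rows shown in the question (illustrative only).
df = pd.DataFrame({
    "id":         [253375, 253375, 253376, 4802789, 4802789, 4802790],
    "racecourse": [178, 178, 178, 192, 192, 192],
    "going":      ["Standard"] * 6,
    "distance":   [7.0, 7.0, 6.0, 7.0, 7.0, 7.0],
    "runners":    [13, 13, 12, 16, 16, 16],
    "draw":       [2, 11, 2, 11, 16, 1],
})

keys = ["racecourse", "going", "distance", "runners"]

# Split + apply: one row per key combination with its distinct-id count.
# reset_index() turns the group keys back into ordinary columns.
race_count = (
    df.groupby(keys)["id"]
      .nunique()
      .rename("race_count")
      .reset_index()
)

# Combine: a left merge keeps every row of df in its original order;
# validate="many_to_one" checks that each key occurs once on the right.
result = df.merge(race_count, on=keys, how="left", validate="many_to_one")
print(result)
```

With the default inner merge, rows whose key columns contain NaN would be dropped, because groupby skips NaN keys by default. With how="left" those rows are kept and get NaN in race_count instead.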

And to complete this thought: for the remaining "combine" step, simply set the index of this result to the desired column and join it with the original. (The general pattern is called split-apply-combine, and is often a good way to express operations.)
– ojdo, Commented Aug 13, 2021 at 12:49

Yes, good call. To be honest, I had problems coming up with a good implementation for combining the results (number of unique elements) with the original data frame df.
– Flursch, Commented Aug 13, 2021 at 13:05

This is fantastic, exactly what I needed. Thank you very much and may the force continue to be with you 😊🥳
– kfcobrien, Commented Aug 14, 2021 at 14:11

You can do this directly with groupby transform instead of needing merge at all: df['race_count'] = df.groupby(['racecourse', 'going', 'distance', 'runners'])['id'].transform('nunique')
– Henry Ecker, Commented Sep 21, 2021 at 3:41

@Henry Ecker: Nice solution, very elegant and concise. I did not know about the groupby transform method. Cool!
– Flursch, Commented Sep 22, 2021 at 14:44
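
As a rough check of the transform approach suggested in the comments, the sketch below runs it on a toy frame and compares the result with the row-by-row loop from the question. The data and the name slow are illustrative, not taken from the thread.

```python
import pandas as pd

# Toy data modelled on the rows shown in the question (illustrative only).
df = pd.DataFrame({
    "id":         [253375, 253375, 253376, 4802789, 4802789, 4802790],
    "racecourse": [178, 178, 178, 192, 192, 192],
    "going":      ["Standard"] * 6,
    "distance":   [7.0, 7.0, 6.0, 7.0, 7.0, 7.0],
    "runners":    [13, 13, 12, 16, 16, 16],
})

keys = ["racecourse", "going", "distance", "runners"]

# Reference result: the same filter-and-count logic as the loop in the question.
slow = [
    df.loc[
        (df.racecourse == row.racecourse)
        & (df.going == row.going)
        & (df.distance == row.distance)
        & (df.runners == row.runners),
        "id",
    ].nunique()
    for _, row in df.iterrows()
]

# transform returns a Series aligned with df's index, so each row receives
# the distinct-id count of its own group and no merge is needed.
df["race_count"] = df.groupby(keys)["id"].transform("nunique")

# Both approaches should agree on this toy data.
assert df["race_count"].tolist() == slow
print(df)
```

Because transform keeps the original index, this form also leaves the row order of df untouched.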
