Pandas: sorting a dataframe in subgroups, issue with sorting equal values

Question

I am trying to solve the following problem. I have the following dataframe df:

import pandas as pd

df = pd.DataFrame({'A': ['id1', 'id1', 'id2', 'id2', 'id2', 'id2', 'id2', 'id2', 'id2', 'id3', 'id3', 'id3'],
                   'B': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
                   'C': [101, 32, 10, 9, 15, 15, 15, 15, 15, 40, 36, 36]})
df

Out[16]: 
      A   B    C
0   id1  10  101
1   id1  11   32
2   id2  12   10
3   id2  13    9
4   id2  14   15
5   id2  15   15
6   id2  16   15
7   id2  17   15
8   id2  18   15
9   id3  19   40
10  id3  20   36
11  id3  21   36

I now wish to rearrange the dataframe such that the values in column C are sorted in ascending order for each subgroup defined by the id values in column A. I use the following piece of code:

df2 = df
df2 = df2.sort_values(by=['A','C'], ascending=True).groupby('A').head()

and I get this:

df2
Out[18]: 
      A   B    C
1   id1  11   32
0   id1  10  101
3   id2  13    9
2   id2  12   10
4   id2  14   15
5   id2  15   15
6   id2  16   15
10  id3  20   36
11  id3  21   36
9   id3  19   40

The values in C for subgroup id1 in column A have all been sorted correctly, as have those for subgroup id3. However, the sorted result for id2 is missing two rows:

print(len(df.index), len(df2.index))
12 10

Any idea why this happens and how to fix it? Any help is very much appreciated.

Thanks, MarcoC

head() by default gets top 5 values.

– Jarad, Commented Oct 31, 2016 at 18:26
Thank you @Jarad. Indeed, that is a mistake.

– MarcoC, Commented Nov 1, 2016 at 14:52

Kartik · Accepted Answer · 2016-10-31 18:23:49Z

2

This happens because of your .groupby('A').head(). Called on a groupby object, .head() returns the first n rows of each group, and n defaults to 5, so only the first 5 of the 7 rows in group 'id2' are kept. Drop the .groupby('A').head() and you will get the right answer:

df2 = df.sort_values(by=['A','C'], ascending=True)  # Note: no .groupby('A').head()
print(len(df.index), len(df2.index))
12 12
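To see where the two rows go, here is a minimal sketch that rebuilds the question's dataframe and compares group sizes before and after .groupby('A').head(). The variable names (sorted_df, truncated, and so on) are only illustrative and do not come from the question.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'A': ['id1', 'id1', 'id2', 'id2', 'id2', 'id2', 'id2', 'id2', 'id2', 'id3', 'id3', 'id3'],
    'B': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
    'C': [101, 32, 10, 9, 15, 15, 15, 15, 15, 40, 36, 36],
})

sorted_df = df.sort_values(by=['A', 'C'])

# head() on a groupby keeps at most n rows per group; n defaults to 5.
truncated = sorted_df.groupby('A').head()

rows_per_group = sorted_df.groupby('A').size()
kept_per_group = truncated.groupby('A').size()

print(len(sorted_df), len(truncated))  # 12 10
print(rows_per_group['id2'])           # 7
print(kept_per_group['id2'])           # 5
```

Only id2 has more than 5 rows, so it is the only group that gets cut. If the first n rows per group are what you actually want, passing n explicitly, for example .groupby('A').head(2), makes that intent visible.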


jezrael · Accepted Answer · 2016-10-31 18:30:22Z

1

I think you need only DataFrame.sort_values:

df2 = df.sort_values(by=['A', 'C'], ascending=True)
print(df2)
      A   B    C
1   id1  11   32
0   id1  10  101
3   id2  13    9
2   id2  12   10
4   id2  14   15
5   id2  15   15
6   id2  16   15
7   id2  17   15
8   id2  18   15
10  id3  20   36
11  id3  21   36
9   id3  19   40

You lose rows because head() defaults to n=5, so .groupby('A').head() returns only the first 5 rows of each group.
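Since the id2 group contains several equal values in C, you may also want the order among those ties to be explicit. One possible approach, which is an assumption about what you want and not something stated in the question, is to add B as a final sort key so that tied rows are ordered by B instead of by however the sort happens to treat ties:

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'A': ['id1', 'id1', 'id2', 'id2', 'id2', 'id2', 'id2', 'id2', 'id2', 'id3', 'id3', 'id3'],
    'B': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
    'C': [101, 32, 10, 9, 15, 15, 15, 15, 15, 40, 36, 36],
})

# Sort by group, then by C, then break ties in C using B.
df_sorted = df.sort_values(by=['A', 'C', 'B'])

print(df_sorted.index.tolist())
# [1, 0, 3, 2, 4, 5, 6, 7, 8, 10, 11, 9]
```

All 12 rows are kept, and the five rows with C equal to 15 appear in increasing order of B.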

edited Oct 31, 2016 at 18:30

answered Oct 31, 2016 at 18:24


1 Comment

MarcoC Over a year ago

Thank you @jezrael. The sorting command was wrong in the first place, did not need to groupby, and did not need head().
