I am trying to solve the following problem. I have the following dataframe df:

df = pd.DataFrame({'A': ['id1', 'id1', 'id2', 'id2', 'id2','id2', 'id2', 'id2','id2', 'id3', 'id3', 'id3'] ,
                   'B': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21] , 
                 'C': [101 , 32, 10, 9, 15, 15, 15, 15, 15, 40, 36, 36]} )
df

Out[16]: 
      A   B    C
0   id1  10  101
1   id1  11   32
2   id2  12   10
3   id2  13    9
4   id2  14   15
5   id2  15   15
6   id2  16   15
7   id2  17   15
8   id2  18   15
9   id3  19   40
10  id3  20   36
11  id3  21   36

I now wish to rearrange the dataframe such that the values in column C are sorted in ascending order for each subgroup defined by the id values in column A. I use the following piece of code:

df2 = df
df2 = df2.sort_values(by=['A','C'], ascending=True).groupby('A').head()

and I get this:

df2
Out[18]: 
      A   B    C
1   id1  11   32
0   id1  10  101
3   id2  13    9
2   id2  12   10
4   id2  14   15
5   id2  15   15
6   id2  16   15
10  id3  20   36
11  id3  21   36
9   id3  19   40

The values in C for subgroup id1 in column A have been sorted correctly, as have those for id3. However, the result for id2 is missing two rows (indices 7 and 8):

print(len(df.index), len(df2.index))
12 10

Any idea why this happens and how to fix it? Any help is very much appreciated.

Thanks, MarcoC

  • head() by default gets top 5 values. Commented Oct 31, 2016 at 18:26
  • Thank you @Jarad. Indeed, that is a mistake. Commented Nov 1, 2016 at 14:52

2 Answers

The cause is the .groupby('A').head() at the end of your statement. GroupBy.head(n) returns the first n rows of each group, and n defaults to 5, so only the first 5 of the 7 rows in group 'id2' are kept. Drop the .groupby('A').head() and you will get all the rows:

df2 = df.sort_values(by=['A','C'], ascending=True)  # Note: no .groupby('A').head()
print(len(df.index), len(df2.index))
12 12
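
To see where the rows go, here is a minimal sketch that rebuilds the question's frame and compares the result with and without the trailing .groupby('A').head(). The names sorted_df and truncated are only illustrative and do not appear in the question.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'A': ['id1', 'id1', 'id2', 'id2', 'id2', 'id2',
          'id2', 'id2', 'id2', 'id3', 'id3', 'id3'],
    'B': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
    'C': [101, 32, 10, 9, 15, 15, 15, 15, 15, 40, 36, 36],
})

# Sorting alone keeps every row.
sorted_df = df.sort_values(by=['A', 'C'], ascending=True)

# head() with no argument means head(5): at most 5 rows per group survive.
truncated = sorted_df.groupby('A').head()

# Rows per group before and after the head() call.
print(sorted_df.groupby('A').size())
print(truncated.groupby('A').size())

print(len(sorted_df), len(truncated))  # 12 10
```

Only id2 has more than 5 rows, so it is the only group that gets cut.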

I think you need only DataFrame.sort_values:

df2=df.sort_values(by=['A','C'], ascending=True)
print (df2)
      A   B    C
1   id1  11   32
0   id1  10  101
3   id2  13    9
2   id2  12   10
4   id2  14   15
5   id2  15   15
6   id2  16   15
7   id2  17   15
8   id2  18   15
10  id3  20   36
11  id3  21   36
9   id3  19   40

You lost rows because head() defaults to head(5), which returns only the first 5 rows of each group.
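
If you want to convince yourself that sort_values alone already orders C within each id, a quick check along these lines should do it. This is just a sketch; the name is_sorted is made up for the example.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'A': ['id1', 'id1', 'id2', 'id2', 'id2', 'id2',
          'id2', 'id2', 'id2', 'id3', 'id3', 'id3'],
    'B': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
    'C': [101, 32, 10, 9, 15, 15, 15, 15, 15, 40, 36, 36],
})

df2 = df.sort_values(by=['A', 'C'], ascending=True)

# For each group in A, check that C never decreases from one row to the next.
is_sorted = df2.groupby('A')['C'].apply(lambda s: s.is_monotonic_increasing)
print(is_sorted)

# No rows were dropped along the way.
print(len(df) == len(df2))  # True
```

If you also want a clean 0..n-1 index afterwards, you can chain .reset_index(drop=True) onto the sort.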

1 Comment

Thank you @jezrael. The sorting command was wrong in the first place, did not need to groupby, and did not need head().
