Pandas sorting within a group results in duplicated index

Question

I have a pandas DataFrame in the following format:

           C1        C2
A 0  1.764052  0.400157
  1  0.978738  2.240893
  2  1.867558 -0.977278
  3  0.950088 -0.151357
  4 -0.103219  0.410599
C 0  0.144044  1.454274
  1  0.761038  0.121675
  2  0.443863  0.333674
  3  1.494079 -0.205158
  4  0.313068 -0.854096

It is generated from this code:

import pandas as pd
import numpy as np
idx = pd.MultiIndex.from_product([['A','C'],range(5)])
np.random.seed(0)
df = pd.DataFrame(index=idx, data = np.random.randn(len(idx), 2), columns=['C1','C2'])

I would like to sort each group (of the first index level) separately by the column C1 so I use the groupby function:

df.groupby(level=0).apply(lambda x: x.sort_values('C1'))

and I get the result:

             C1        C2
A A 4 -0.103219  0.410599
    3  0.950088 -0.151357
    1  0.978738  2.240893
    0  1.764052  0.400157
    2  1.867558 -0.977278
C C 0  0.144044  1.454274
    4  0.313068 -0.854096
    2  0.443863  0.333674
    1  0.761038  0.121675
    3  1.494079 -0.205158

Why is the first level duplicated? Is there a better way to sort data within each group?

jezrael · Accepted Answer · 2016-11-03 18:37:24Z

4

You need add parameter group_keys=False to groupby:

a = df.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('C1'))
print (a)
           C1        C2
A 4 -0.103219  0.410599
  3  0.950088 -0.151357
  1  0.978738  2.240893
  0  1.764052  0.400157
  2  1.867558 -0.977278
C 0  0.144044  1.454274
  4  0.313068 -0.854096
  2  0.443863  0.333674
  1  0.761038  0.121675
  3  1.494079 -0.205158

answered Nov 3, 2016 at 18:37

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Charles Naccio Over a year ago

This def solved my problem, but I'm not quite sure why. In some circumstances pandas didn't do any grouping, but when trying a larger dataset it did. Anyhow; thanks for the solution!

akuiper · Accepted Answer · 2016-11-03 18:37:04Z

3

You can sort by column C1 and then sort index by level 0. You have to specify sort_remaining parameter to False in order for this to work for multiIndex data frame:

df.sort_values('C1').sort_index(level = 0, sort_remaining=False)

answered Nov 3, 2016 at 18:37

akuiper

216k33 gold badges362 silver badges379 bronze badges

Collectives™ on Stack Overflow

Pandas sorting within a group results in duplicated index

2 Answers 2

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related