I have the following dataset:

import numpy as np
from pandas import DataFrame
import numpy.random as random

random.seed(12)

df = DataFrame (
    {
        "fac1" : ["a","a","a","a","b","b","b","b"] ,
        "val" : random.choice(np.arange(0,20), 8, replace=False)
    }
)
df2 = df.set_index(["fac1"])
df2

enter image description here

What I want is to sort by val within each fac1 group, to produce this:

enter image description here

I have combed the documentation and cannot find a straightforward way. The best I could do was the following hack:

df3 = df2.reset_index()
df4 = df3.sort_values(["fac1","val"],ascending=[True,True],axis=0)
df5 = df4.set_index(["fac1"])
df5
# Produces the picture above

(I realize the above could benefit from multiple inplace options, just doing it this way to make intermediate products clear).
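For readers who prefer a single expression, the same round trip can be written as one method chain. This is a minimal sketch on a small hand-written frame (the values below are illustrative, not the random data from the question); the name `chained` is made up for the example.

```python
import pandas as pd

# Illustrative data: two groups in the index, unsorted values.
df2 = pd.DataFrame(
    {"fac1": ["a", "a", "a", "b", "b", "b"], "val": [7, 2, 5, 9, 1, 4]}
).set_index("fac1")

# Move the index back to a column, sort on (group, value), restore the index.
chained = (
    df2.reset_index()
       .sort_values(["fac1", "val"])
       .set_index("fac1")
)
print(chained)
```

No intermediate names are needed, and `df2` itself is left untouched.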

I did find this SO post, which uses grouping and a sorting function. However the following code, adapted from that post, produced an incorrect result:

df2.groupby("fac1",axis=1).apply(lambda x : x.sort_values("val"))

(Output removed for space considerations)

Is there another way to approach this?

Update: Solution

The accepted solution is:

df2.sort_values(by='val').sort_index(kind='mergesort')

The sorting algorithm must be mergesort and it must be explicitly specified as it is not the default. As the sort_index documentation points out, "mergesort is the only stable algorithm." Here's another sample dataset that will not sort properly if you don't specify mergesort for kind:

random.seed(12)

len = 32 

df = DataFrame (
    {
        "fac1" : ["a" for i in range(int(len/2))] + ["b" for i in range(int(len/2))] ,
        "val" : random.choice(np.arange(0,100), len, replace=False)
    }
)
df2 = df.set_index(["fac1"])
df2.sort_values(by='val').sort_index()

(I am omitting all outputs for space considerations.)
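One way to check a candidate sort, rather than eyeballing 32 rows, is to compare it against the reset/sort/set_index round trip from the question. The sketch below assumes a recent pandas and NumPy; it uses `np.random.default_rng` instead of the legacy seeding above, so the actual values differ from the original dataset, and `n`, `stable`, `expected`, and `matches` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12)
n = 32  # named n here to avoid shadowing the built-in len

df2 = pd.DataFrame(
    {
        "fac1": ["a"] * (n // 2) + ["b"] * (n // 2),
        "val": rng.choice(np.arange(100), n, replace=False),
    }
).set_index("fac1")

# Stable variant: sort by value first, then stably by the index.
stable = df2.sort_values(by="val").sort_index(kind="mergesort")

# Reference result: explicit two-key sort on (fac1, val).
expected = df2.reset_index().sort_values(["fac1", "val"]).set_index("fac1")

matches = stable.equals(expected)
print(matches)
```

The same comparison can be run against `sort_index()` without `kind` to see whether the default happens to preserve order for a given size.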

I've been trying to narrow down the point at which the failure occurs, and it's related to len: with everything else in the code equal, the proposed solution works for len <= 16 and fails for larger values. Commented Mar 31, 2016 at 20:53

1 Answer

EDIT: I looked into the documentation, and the default sorting algorithm for sort_index is quicksort. Quicksort is NOT a "stable" algorithm: it does not preserve "the input order of equal elements in the sorted output" (from Wikipedia). However, sort_index lets you choose "mergesort", which IS a stable sorting algorithm. So my original answer,

df2.sort_values(by='val').sort_index()

worked only by happenstance. This code should work every time, since it uses a stable sorting algorithm:

df2.sort_values(by='val').sort_index(kind = 'mergesort')

4 Comments

I just came upon the same thing while you were writing that last edit. Yup that's the answer. That being said, I'm kinda surprised that mergesort isn't the default for kind. Well I guess a case could be made either way. Anyway I think this resolves it
I'm also surprised, but I think that it can take much longer in a worst-case scenario.
Agreed - judging by the documentation, it all goes back to the underlying numpy ndarray implementation. That library is really built for speed; at the same time, up at the pandas layer, my use case is a common one (I'm new to pandas but have done data science/stats for many years, mostly on SAS, where this is easy to do.) I think the API could be improved by having a boolean stable parameter instead of algorithm choice - people would be more likely to notice it. Anyway, now we know! Thx again (and enjoy the bounty :-) ).
If, like me, you need to sort first by a column and then by the index, reverse the order of the two sorts, e.g. .sort_index().sort_values('A', kind='mergesort')
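The reversed order from the last comment can be illustrated on a tiny frame. This is a sketch with made-up data: column `A` is the primary key and the index breaks ties, which works because the second sort is stable and so preserves the index order established by the first.

```python
import pandas as pd

# Made-up data: ties in column A, index deliberately out of order.
df = pd.DataFrame({"A": [2, 1, 2, 1]}, index=[3, 2, 0, 1])

# Sort by index first, then stably by A; rows with equal A keep index order.
out = df.sort_index().sort_values("A", kind="mergesort")
print(out)
```

The rule in both directions is that the last sort applied determines the primary key, and it must be the stable one.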
