2

I have a data frame in pandas and I want to sort it by column. If I use .sort_values() like in the below code:

df.sort_values(by='id')

I get the output in the 'id' column as:

1075_2016-06-01_0_1
1075_2016-06-01_10_1
1075_2016-06-01_10_2
1075_2016-06-01_11_1
1075_2016-06-01_11_2
1075_2016-06-01_1_1
1075_2016-06-01_1_2

I expected:

1075_2016-06-01_0_1
1075_2016-06-01_1_1
1075_2016-06-01_1_2
1075_2016-06-01_10_1
1075_2016-06-01_10_2
1075_2016-06-01_11_1
1075_2016-06-01_11_2

What is the best way to do this in pandas?

3 Answers 3

1

One possible solution with natsort for get indices of sorting values and change of original DataFrame by loc:

from natsort import index_natsorted, order_by_index

df2 = df.loc[order_by_index(df.index, index_natsorted(df['id']))]

Or split all values by _, then convert columns to integer and optionaly to datetimes, sorting for indices and last use loc with original DataFrame:

df1 = df['id'].str.split('_', expand=True)
df1[[0,2,3]] = df1[[0,2,3]].astype(int)
df1[1] = pd.to_datetime(df1[1])

df2 = df.loc[df1.sort_values([0,1,2,3]).index]
print (df2)
                     id
0   1075_2016-06-01_0_1
5   1075_2016-06-01_1_1
6   1075_2016-06-01_1_2
1  1075_2016-06-01_10_1
2  1075_2016-06-01_10_2
3  1075_2016-06-01_11_1
4  1075_2016-06-01_11_2

Next solution with argsort for sorting and change order by indexing by positions with iloc for working with any index values:

f = lambda x: [int(x[0]), pd.to_datetime(x[1]), int(x[2]), int(x[3])]
df2 = df.iloc[df['id'].str.split('_').map(f).argsort()]
print (df2)
                     id
0   1075_2016-06-01_0_1
5   1075_2016-06-01_1_1
6   1075_2016-06-01_1_2
1  1075_2016-06-01_10_1
2  1075_2016-06-01_10_2
3  1075_2016-06-01_11_1
4  1075_2016-06-01_11_2
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for the answer! 'id' is not a pandas index it's just a column
@hyper - Yes, both solution working with id column.
1

Guys thank you very much! The combination of the two solutions worked:

df['sort_val']= df['id'].str.split('_') 
f = lambda x: [int(x[0]), pd.to_datetime(x[1]), int(x[2]), int(x[3])]
df['sort_val'] = df['sort_val'].map(f)
df.sort_values(by='sort_val').drop('sort_val',1)

Comments

0

You can first split the values per underscores, and then sort these, like:

df['sort_val'] = df['id'].str.split('_')
df = df.sort_values('sort_val').drop('sort_val', 1)

The above should work on an arbitrary number of underscore separated values.

This gives us:

>>> df
                     id
0   1075_2016-06-01_0_1
1  1075_2016-06-01_10_1
2  1075_2016-06-01_10_2
3  1075_2016-06-01_11_1
4  1075_2016-06-01_11_2
5   1075_2016-06-01_1_1
6   1075_2016-06-01_1_2
>>> df['sort_val'] = df['id'].str.split('_')
>>> df = df.sort_values('sort_val').drop('sort_val', 1)
>>> df
                     id
0   1075_2016-06-01_0_1
5   1075_2016-06-01_1_1
6   1075_2016-06-01_1_2
1  1075_2016-06-01_10_1
2  1075_2016-06-01_10_2
3  1075_2016-06-01_11_1
4  1075_2016-06-01_11_2

2 Comments

for some reason this does not work if there are also other elements, for example: 1075_2016-06-01_1_3 1075_2016-06-01_2_0 1075_2016-06-01_6_0
@hyper: for the given sample data, I get the items sorted in the order you provide these, which seems to be the correct sorting?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.