24

I have sorted a dataframe by a specific column but the answer that pandas spits out is not exactly correct.

Here is the code I have used:

league_dataframe.sort_values('overall_league_position')

The result that the sort method yields values in column 'overall league position' are not sorted in ascending order which is the default for the method.

enter image description here

What am I doing wrong?

2
  • 1
    Please paste your code directly into the question, not as images. You can use the {} button to format it correctly. You should do this for the output too. Commented Dec 20, 2017 at 21:01
  • 2
    It's a column of strings, that's why. Commented Dec 20, 2017 at 21:01

2 Answers 2

36

For whatever reason, you seem to be working with a column of strings, and sort_values is returning you a lexsorted result.

Here's an example.

df = pd.DataFrame({"Col": ['1', '2', '3', '10', '20', '19']})
df

  Col
0   1
1   2
2   3
3  10
4  20
5  19

df.sort_values('Col')

  Col
0   1
3  10
5  19
1   2
4  20
2   3

The remedy is to convert it to numeric, either using .astype or pd.to_numeric.

df.Col = df.Col.astype(float)

Or,

df.Col = pd.to_numeric(df.Col, errors='coerce')
df.sort_values('Col')

   Col
0    1
1    2
2    3
3   10
5   19
4   20

The only difference b/w astype and pd.to_numeric is that the latter is more robust at handling non-numeric strings (they're coerced to NaN), and will attempt to preserve integers if a coercion to float is not necessary (as is seen in this case).

Sign up to request clarification or add additional context in comments.

5 Comments

What if I need to actually handle string values and keep them as such? For example strings "1%", "2%", "10%", "25%", ...? Is there a way of sorting the values by a custom comparator without having to transform the data back and forth?
@AdamBajger You can lookup "pandas natural sorting column" online.
@cs95 I just found a comprehensive answer here, thanks for natsorted, though, helped to.
@AdamBajger oh awesome, I think I know the chap who owns that answer...
Saved my day, man! The .astype(float) worked.
1

If the dtype of the column should be numeric, then converting to a numeric dtype before sorting is the correct approach. However, if you want to preserve the string/object dtype but sort naturally anyway, you can pass a pd.to_numeric as a key to sort_values().

df = pd.DataFrame({"Col": ['1', '2', '3', '10', '20', '19']})
df = df.sort_values('Col', key=pd.to_numeric)

This is similar to sorted(mylist, key=float) in vanilla Python where the type of the actual data is not modified but that of the keys change. However, unlike sorted(), the function to pass as a key to sort_values() must be vectorized (as in, it must return the entire column of sorting keys); so float cannot work but pd.to_numeric can.

A more illustrative example is to sort a column of percentages naturally. In that case, we can pass a lambda function that strips the percentage symbols and converts the column into a numeric one as the sorting key.

df = pd.DataFrame({"Col": ['1%', '10%', '3%', '2%', '20%', '19%']})
df = df.sort_values('Col', key=lambda x: pd.to_numeric(x.str.rstrip('%'), errors='coerce'))
# if the data is clean, can use `astype` as well
df = df.sort_values('Col', key=lambda x: x.str.rstrip('%').astype(float))

result

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.