10

I have a dataframe a pandas dataframe with the following columns:

df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]], 
  columns=['one','two'])

Which I am hoping to sort primarily by column 'two', then by column 'one'. For the secondary sort, I would like to use a custom sorting rule that will sort column 'one' by the alphabetic character [A-Z] and then the trailing numeric number [0-100]. So, the outcome of the sort would be:

one two
 A1   1
 B1   1
 A2   1
 A1   2
 B1   2
 A2   2

I have sorted a list of strings similar to column 'one' before using a sorting rule like so:

def custom_sort(value):
    return (value[0], int(value[1:]))

my_list.sort(key=custom_sort)

If I try to apply this rule via a pandas sort, I run into a number of issues including:

  1. The pandas DataFrame.sort_values() function accepts a key for sorting like the sort() function, but the key function should be vectorized (per the pandas documentation). If I try to apply the sorting key to only column 'one', I get the error "TypeError: cannot convert the series to <class 'int'>"
  2. When you use the pandas DataFrame.sort_values() method, it applies the sort key to all columns you pass in. This will not work since I want to sort first by the column 'two' using a native numerical sort.

How would I go about sorting the DataFrame as mentioned above?

0

4 Answers 4

4

You can split column one into its constituent parts, add them as columns to the dataframe and then sort on them with column two. Finally, remove the temporary columns.

>>> (df.assign(lhs=df['one'].str[0], rhs=df['one'].str[1:].astype(int))
       .sort_values(['two', 'rhs', 'lhs'])
       .drop(columns=['lhs', 'rhs']))
  one  two
5  A1    1
1  B1    1
3  A2    1
2  A1    2
4  B1    2
0  A2    2
Sign up to request clarification or add additional context in comments.

3 Comments

Thanks for your help. This answer worked, you just need to switch the order of the lhs and rhs in the sort_values() call to get the expected sort order (looks like my wording in the question was a bit off though).
Is there a way to sort the columns using "natural sort" (e.g. via the natsort package) and avoid creating temporary columns?
@Agostino to do with natsort, see answer here or there
3

With pandas >= 1.1.0 and natsort, you can also do this now:

import natsort

sorted_df = df.sort_values(["one", "two"], key=natsort.natsort_keygen())

1 Comment

This is the best solution actually, in fact, it is enough with sorted_df = df.sort_values(["one", "two"]) no need for natsort. Thanks!
1

use str.extract to create some temp columns that are based off 1) alphabet (a-zA-Z]+) and 2) Number (\d+) and then drop them:

df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]], 
  columns=['one','two'])

df['one-letter'] = df['one'].str.extract('([a-zA-Z]+)')
df['one-number'] = df['one'].str.extract('(\d+)')
df = df.sort_values(['two', 'one-number', 'one-letter']).drop(['one-letter', 'one-number'], axis=1)
df
Out[38]: 
  one  two
5  A1    1
1  B1    1
3  A2    1
2  A1    2
4  B1    2

Comments

1

One of the solutions is to make both columns pd.Categorical and pass the expected order as an argument "categories".

But I have some requirements where I cannot coerce unknown\unexpected values and unfortunately that is what pd.Categorical is doing. Also None is not supported as a category and coerced automatically.

So my solution was to use a key to sort on multiple columns with a custom sorting order:

import pandas as pd


df = pd.DataFrame([
    [A2, 2],
    [B1, 1],
    [A1, 2],
    [A2, 1],
    [B1, 2],
    [A1, 1]], 
  columns=['one','two'])


def custom_sorting(col: pd.Series) -> pd.Series:
    """Series is input and ordered series is expected as output"""
    to_ret = col
    # apply custom sorting only to column one:
    if col.name == "one":
        custom_dict = {}
        # for example ensure that A2 is first, pass items in sorted order here:
        def custom_sort(value):
            return (value[0], int(value[1:]))

        ordered_items = list(col.unique())
        ordered_items.sort(key=custom_sort)
        # apply custom order first:
        for index, item in enumerate(ordered_items):
            custom_dict[item] = index
        to_ret = col.map(custom_dict)
    # default text sorting is about to be applied
    return to_ret


# pass two columns to be sorted
df.sort_values(
    by=["two", "one"],
    ascending=True,
    inplace=True,
    key=custom_sorting,
)

print(df)

Output:

5  A1    1
3  A2    1
1  B1    1
2  A1    2
0  A2    2
4  B1    2

Be aware that this solution can be slow.

1 Comment

The suggestion to use a pd.Categorical column was useful in my case. Thanks.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.