Sort a Pandas Dataframe by Multiple Columns Using Key Argument

Question

I have a dataframe a pandas dataframe with the following columns:

df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]], 
  columns=['one','two'])

Which I am hoping to sort primarily by column 'two', then by column 'one'. For the secondary sort, I would like to use a custom sorting rule that will sort column 'one' by the alphabetic character [A-Z] and then the trailing numeric number [0-100]. So, the outcome of the sort would be:

one two
 A1   1
 B1   1
 A2   1
 A1   2
 B1   2
 A2   2

I have sorted a list of strings similar to column 'one' before using a sorting rule like so:

def custom_sort(value):
    return (value[0], int(value[1:]))

my_list.sort(key=custom_sort)

If I try to apply this rule via a pandas sort, I run into a number of issues including:

The pandas DataFrame.sort_values() function accepts a key for sorting like the sort() function, but the key function should be vectorized (per the pandas documentation). If I try to apply the sorting key to only column 'one', I get the error "TypeError: cannot convert the series to <class 'int'>"
When you use the pandas DataFrame.sort_values() method, it applies the sort key to all columns you pass in. This will not work since I want to sort first by the column 'two' using a native numerical sort.

How would I go about sorting the DataFrame as mentioned above?

Alexander · Accepted Answer · 2020-10-14 16:25:11Z

4

You can split column one into its constituent parts, add them as columns to the dataframe and then sort on them with column two. Finally, remove the temporary columns.

>>> (df.assign(lhs=df['one'].str[0], rhs=df['one'].str[1:].astype(int))
       .sort_values(['two', 'rhs', 'lhs'])
       .drop(columns=['lhs', 'rhs']))
  one  two
5  A1    1
1  B1    1
3  A2    1
2  A1    2
4  B1    2
0  A2    2

edited Oct 14, 2020 at 16:25

answered Oct 14, 2020 at 2:45

Alexander

111k32 gold badges212 silver badges208 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

user11058068 Over a year ago

Thanks for your help. This answer worked, you just need to switch the order of the lhs and rhs in the sort_values() call to get the expected sort order (looks like my wording in the question was a bit off though).

Agostino Over a year ago

Is there a way to sort the columns using "natural sort" (e.g. via the natsort package) and avoid creating temporary columns?

Akaisteph7 Over a year ago

@Agostino to do with natsort, see answer here or there

Akaisteph7 · Accepted Answer · 2022-07-08 21:41:31Z

3

With pandas >= 1.1.0 and natsort, you can also do this now:

import natsort

sorted_df = df.sort_values(["one", "two"], key=natsort.natsort_keygen())

answered Jul 8, 2022 at 21:41

Akaisteph7

6,9222 gold badges39 silver badges53 bronze badges

1 Comment

Moisés Briseño Estrello Over a year ago

This is the best solution actually, in fact, it is enough with sorted_df = df.sort_values(["one", "two"]) no need for natsort. Thanks!

David Erickson · Accepted Answer · 2020-10-14 02:48:19Z

1

use str.extract to create some temp columns that are based off 1) alphabet (a-zA-Z]+) and 2) Number (\d+) and then drop them:

df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]], 
  columns=['one','two'])

df['one-letter'] = df['one'].str.extract('([a-zA-Z]+)')
df['one-number'] = df['one'].str.extract('(\d+)')
df = df.sort_values(['two', 'one-number', 'one-letter']).drop(['one-letter', 'one-number'], axis=1)
df
Out[38]: 
  one  two
5  A1    1
1  B1    1
3  A2    1
2  A1    2
4  B1    2

answered Oct 14, 2020 at 2:48

David Erickson

16.7k2 gold badges21 silver badges37 bronze badges

Comments

Ievgen · Accepted Answer · 2021-09-02 07:47:07Z

One of the solutions is to make both columns pd.Categorical and pass the expected order as an argument "categories".

But I have some requirements where I cannot coerce unknown\unexpected values and unfortunately that is what pd.Categorical is doing. Also None is not supported as a category and coerced automatically.

So my solution was to use a key to sort on multiple columns with a custom sorting order:

import pandas as pd


df = pd.DataFrame([
    [A2, 2],
    [B1, 1],
    [A1, 2],
    [A2, 1],
    [B1, 2],
    [A1, 1]], 
  columns=['one','two'])


def custom_sorting(col: pd.Series) -> pd.Series:
    """Series is input and ordered series is expected as output"""
    to_ret = col
    # apply custom sorting only to column one:
    if col.name == "one":
        custom_dict = {}
        # for example ensure that A2 is first, pass items in sorted order here:
        def custom_sort(value):
            return (value[0], int(value[1:]))

        ordered_items = list(col.unique())
        ordered_items.sort(key=custom_sort)
        # apply custom order first:
        for index, item in enumerate(ordered_items):
            custom_dict[item] = index
        to_ret = col.map(custom_dict)
    # default text sorting is about to be applied
    return to_ret


# pass two columns to be sorted
df.sort_values(
    by=["two", "one"],
    ascending=True,
    inplace=True,
    key=custom_sorting,
)

print(df)

Output:

Be aware that this solution can be slow.

The suggestion to use a pd.Categorical column was useful in my case. Thanks.

Collectives™ on Stack Overflow

Sort a Pandas Dataframe by Multiple Columns Using Key Argument

4 Answers 4

3 Comments

1 Comment

Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

3 Comments

1 Comment

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related