Pandas Dataframe - sort a list of values in each row of a column

Question

Here is a pandas dataframe:

dt          name  type                                         City                            
05-10-2021  MK    [PQRRC, MNXYZ, AYPIC, KLUYT, GFTBE, BYPAC]   NYC
05-10-2021  MK    [GFTBE, AYPIC, MNXYZ, BYPAC, KLUYT, PQRRC]   NYC
05-12-2021  MK    [KLUYT, PQRRC, BYPAC, AYPIC, GFTBE, MNXYZ]   NYC
05-12-2021  MK    [BYPAC, KLUYT, GFTBE, AYPIC, MNXYZ, PQRRC]   NYC
05-13-2021  PS    [XYDFE, QRTSL, CPQLE, VXWUT, ORSHC, LTRDX]   BAL
05-13-2021  PS    [VXWUT, ORSHC, QRTSL, XYDFE, LTRDX, CPQLE]   BAL

....

Please note that, for each value in column name, the lists in column type contain the same values but are not sorted alphabetically.

I want the output below: sort each list in column type, then keep the distinct combinations of dt, name, type, City.

dt          name  type                                         City                            
05-10-2021  MK    [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC]   NYC
05-12-2021  MK    [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC]   NYC
05-13-2021  PS    [CPQLE, LTRDX, ORSHC, QRTSL, VXWUT, XYDFE]   BAL

I tried using sort_values, sorted, and drop_duplicates, but it is not working. Maybe I made some mistakes. It's dropping some names altogether when I use drop_duplicates(). Can someone help me? Thank you.

Are the lists guaranteed to have the same values, or does there need to be logic merging the lists together?
– Henry Ecker ♦, Commented Jun 7, 2021 at 15:44
The sample data seems to have a problem. The first two lists are different: the second one has AYPIC twice.
– SeaBean, Commented Jun 7, 2021 at 16:03
Do we need to check for duplicates in column type as well? It seems so, right?
– SeaBean, Commented Jun 7, 2021 at 16:05
For each value of column 'name', the list of values in column 'type' is the same, but not sorted. Thank you.
– Murali, Commented Jun 7, 2021 at 16:06
Sorry, corrected the sample data. No need to check for duplicates in the list of values in column 'type'. Just sort it and select the distinct values, as shown in the sample output.
– Murali, Commented Jun 7, 2021 at 16:07

SeaBean · Accepted Answer · 2021-06-07 22:51:30Z

3

If you want to sort the lists in column type and then drop rows that are duplicates on the other columns, you can sort each list with numpy.sort() and then call .drop_duplicates() on the other columns.

numpy.sort() runs in compiled code and can be faster than pure-Python sorting on large arrays. For short lists of strings like these, the built-in sorted() gives the same result, and the cost of converting each list to an array may outweigh any gain.

import ast
import numpy as np

# If column "type" holds strings rather than real lists, convert it first.
# Run ONE of the following lines, depending on the string layout (skip all three if the cells are already lists):
# layout "['GFTBE', 'AYPIC', 'MNXYZ', 'BYPAC', 'KLUYT', 'PQRRC']"
df['type'] = df['type'].str.strip("[]").str.replace("'", "").str.split(', ')
#df['type'] = df['type'].map(ast.literal_eval)    # general case: parse a string written as a Python list literal (safer than eval)
#df['type'] = df['type'].str.strip('[]').str.split(',')  # layout with no spaces after the commas and no quotes

# Sort each list; .tolist() turns the sorted numpy array back into a list of plain Python strings,
# which avoids incorrect formatting (e.g. missing commas) when writing to CSV
df['type'] = df['type'].map(lambda x: np.sort(x).tolist())
df = df.drop_duplicates(subset=['dt', 'name', 'City'])

Result:

print(df)

           dt name                                        type City
0  05-10-2021   MK  [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC]  NYC
2  05-12-2021   MK  [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC]  NYC
4  05-13-2021   PS  [CPQLE, LTRDX, ORSHC, QRTSL, VXWUT, XYDFE]  BAL
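
As a self-contained check, the same two steps can be run on the sample rows from the question. This is a minimal sketch that assumes column type already holds real Python lists. It uses the built-in sorted() in place of numpy.sort(), and the names result, type_key, is_dup and by_all_columns are illustrative only, not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question; "type" holds real Python lists here.
df = pd.DataFrame({
    "dt": ["05-10-2021", "05-10-2021", "05-12-2021",
           "05-12-2021", "05-13-2021", "05-13-2021"],
    "name": ["MK", "MK", "MK", "MK", "PS", "PS"],
    "type": [
        ["PQRRC", "MNXYZ", "AYPIC", "KLUYT", "GFTBE", "BYPAC"],
        ["GFTBE", "AYPIC", "MNXYZ", "BYPAC", "KLUYT", "PQRRC"],
        ["KLUYT", "PQRRC", "BYPAC", "AYPIC", "GFTBE", "MNXYZ"],
        ["BYPAC", "KLUYT", "GFTBE", "AYPIC", "MNXYZ", "PQRRC"],
        ["XYDFE", "QRTSL", "CPQLE", "VXWUT", "ORSHC", "LTRDX"],
        ["VXWUT", "ORSHC", "QRTSL", "XYDFE", "LTRDX", "CPQLE"],
    ],
    "City": ["NYC", "NYC", "NYC", "NYC", "BAL", "BAL"],
})

# Replace each list with a sorted copy (sorted() returns a new list).
df["type"] = df["type"].map(sorted)

# Lists are unhashable, so deduplicate on the other columns only.
result = df.drop_duplicates(subset=["dt", "name", "City"])

# Optional variant: include "type" in the comparison via a hashable tuple key.
type_key = df["type"].map(tuple)
is_dup = df.assign(type_key=type_key).duplicated(
    subset=["dt", "name", "type_key", "City"]
)
by_all_columns = df[~is_dup]

print(result)
```

The tuple-key variant is only needed if rows that share dt, name and City could still carry different lists; for the sample data both approaches keep the same three rows.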

edited Jun 7, 2021 at 22:51

answered Jun 7, 2021 at 16:11


19 Comments

Murali Over a year ago

Getting ValueError: axis(=-1) out of bounds.

SeaBean Over a year ago

@Murali Which line raised the error? The first line or the second? Is your list in column type really a list, or just a string written like a list?

Murali Over a year ago

Getting ValueError: axis(=-1) out of bounds for the first line: df['type'] = df['type'].map(np.sort)

SeaBean Over a year ago

@Murali Add the line df['type'] = df['type'].str.strip('[]').str.split(',') before the two lines and try again. Thanks!

Murali Over a year ago

In the final output, after using your code, I see an additional double quote + an empty space in the list of values of the column 'type'. Like for example, [" 'AYPIC'", " 'BYPAC'", " 'GFTBE'", " 'KLUYT'", " 'MNXYZ'", " 'PQRRC'"]. How to remove that?
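
The stray quotes and leading spaces described in this comment are what splitting a string such as "['AYPIC', 'BYPAC']" on ',' alone would produce, since each piece keeps its own quote characters and the space after the comma. One way to handle both layouts mentioned in the thread is sketched below. The helper parse_list_cell is hypothetical (it is not a pandas function), and the sketch assumes every cell is either a real list or a string written like one.

```python
import ast

import pandas as pd


def parse_list_cell(cell):
    """Hypothetical helper: return the cell as a list of clean strings."""
    # Already a real list (or tuple): nothing to parse.
    if isinstance(cell, (list, tuple)):
        return [str(v) for v in cell]

    text = str(cell).strip()
    try:
        # Handles quoted layouts such as "['GFTBE', 'AYPIC']".
        parsed = ast.literal_eval(text)
        if isinstance(parsed, (list, tuple)):
            return [str(v) for v in parsed]
    except (ValueError, SyntaxError):
        # Unquoted layouts such as "[GFTBE, AYPIC]" are not valid literals.
        pass

    # Fallback: drop the brackets, split on commas, trim spaces and quotes.
    pieces = (item.strip(" '\"") for item in text.strip("[]").split(","))
    return [p for p in pieces if p]


raw = pd.Series(["['GFTBE', 'AYPIC', 'MNXYZ']", "[KLUYT, BYPAC]"])
cleaned = raw.map(parse_list_cell).map(sorted)
print(cleaned.tolist())
```

ast.literal_eval only evaluates Python literals, so unlike eval it does not execute arbitrary code found in a cell.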


AmineBTG · Accepted Answer · 2021-06-07 15:35:56Z

0

Try the following:

df["type"] = df["type"].apply(lambda x: sorted(list(x)))

This assumes that every value in column 'type' is a list.
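
Whether that assumption holds can be checked before sorting. The sketch below is illustrative only: the Series s is made-up data, not the asker's DataFrame. It shows how to inspect the Python type of each cell, and what sorted(list(x)) does when x is a string rather than a list.

```python
import pandas as pd

# Made-up data: one cell is a string that looks like a list, one is a real list.
s = pd.Series(["[B, A]", ["B", "A"]])

# Report the Python type of each cell.
kinds = s.map(lambda v: type(v).__name__)
print(kinds.tolist())  # ['str', 'list']

# On a string, list() splits into single characters, so sorted() orders
# brackets, commas and spaces along with the letters.
chars = sorted(list("[B, A]"))
print(chars)  # [' ', ',', 'A', 'B', '[', ']']
```

If any cell reports str, the column needs to be parsed into lists before it is sorted.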

answered Jun 7, 2021 at 15:35


2 Comments

Murali Over a year ago

Hi Amine, thanks...I tried your code, I am getting the result as: df.type.head() 0 [, , , , ,]...something is going wrong. The values are missing.

Murali Over a year ago

df["type"] = df["type"].map(lambda x: sorted(list(x))) worked the same way as df['type'] = df['type'].map(np.sort).map(list). Thanks.
