
Here is a pandas dataframe:

dt          name  type                                         City                            
05-10-2021  MK    [PQRRC, MNXYZ, AYPIC, KLUYT, GFTBE, BYPAC]   NYC
05-10-2021  MK    [GFTBE, AYPIC, MNXYZ, BYPAC, KLUYT, PQRRC]   NYC
05-12-2021  MK    [KLUYT, PQRRC, BYPAC, AYPIC, GFTBE, MNXYZ]   NYC
05-12-2021  MK    [BYPAC, KLUYT, GFTBE, AYPIC, MNXYZ, PQRRC]   NYC
05-13-2021  PS    [XYDFE, QRTSL, CPQLE, VXWUT, ORSHC, LTRDX]   BAL
05-13-2021  PS    [VXWUT, ORSHC, QRTSL, XYDFE, LTRDX, CPQLE]   BAL

Note that for each name, the lists in the type column contain the same values, just not in alphabetical order.

I want the output below: sort each list in the type column, then keep only the distinct combinations of dt, name, type, and City.

dt          name  type                                         City                            
05-10-2021  MK    [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC]   NYC
05-12-2021  MK    [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC]   NYC
05-13-2021  PS    [CPQLE, LTRDX, ORSHC, QRTSL, VXWUT, XYDFE]   BAL

I tried sort_values, sorted, and drop_duplicates, but none of them worked. Maybe I made a mistake: drop_duplicates() removes some names altogether, so they are missing from the result. Can someone help me? Thank you.

  • Are the lists guaranteed to have the same values, or does there need to be logic merging the lists together? Commented Jun 7, 2021 at 15:44
  • The sample data seems to have a problem. The first two lists are different: the second one has AYPIC twice. Commented Jun 7, 2021 at 16:03
  • Do we also need to check for duplicates in the type column? It seems so, right? Commented Jun 7, 2021 at 16:05
  • For each 'name', the list of values in the 'type' column is the same, but not sorted. Thank you. Commented Jun 7, 2021 at 16:06
  • Sorry, I corrected the sample data. There is no need to check for duplicates within the list of values in the 'type' column. Just sort it and select the distinct rows, as shown in the sample output. Commented Jun 7, 2021 at 16:07

2 Answers


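If you want to sort the lists in the type column and drop rows that duplicate each other on the remaining columns, sort each list with numpy.sort() and then call .drop_duplicates() on those other columns. Restricting the check with subset= also sidesteps the fact that list values are unhashable, so drop_duplicates() cannot compare the type column directly.

numpy.sort() runs in compiled code and can be faster than equivalent pure-Python processing on large arrays. For short lists of strings like these, the built-in sorted() is likely to perform similarly, so treat the choice as a preference rather than a requirement: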

import numpy as np

# If the "type" column holds strings rather than real lists, convert it first.
# Run the one line that matches your string layout:
# layout "['GFTBE', 'AYPIC', 'MNXYZ', 'BYPAC', 'KLUYT', 'PQRRC']" (quoted items, ", " separators)
df['type'] = df['type'].str.strip("[]").str.replace("'", "").str.split(', ')
#df['type'] = df['type'].map(ast.literal_eval)    # general case: parse a list-like string into a real list (needs `import ast`; safer than eval)
#df['type'] = df['type'].str.strip('[]').str.split(',')  # layout with no extra spaces and no quotes around the items


# Sort each list, then convert the sorted numpy array back to a Python list
# so it is formatted correctly when written to CSV (an array prints without commas)
df['type'] = df['type'].map(np.sort).map(list)
df = df.drop_duplicates(subset=['dt', 'name', 'City'])

Result:

print(df)

           dt name                                        type City
0  05-10-2021   MK  [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC]  NYC
2  05-12-2021   MK  [AYPIC, BYPAC, GFTBE, KLUYT, MNXYZ, PQRRC]  NYC
4  05-13-2021   PS  [CPQLE, LTRDX, ORSHC, QRTSL, VXWUT, XYDFE]  BAL
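
If the type column should take part in the duplicate check as well, one option is to store each sorted list as a tuple, since tuples are hashable and lists are not. The following is only a sketch built on the sample rows from the question; the name deduped is mine, and the data is assumed to contain real Python lists.

```python
import pandas as pd

# Sample rows copied from the question; "type" holds real Python lists here.
df = pd.DataFrame({
    "dt": ["05-10-2021", "05-10-2021", "05-12-2021",
           "05-12-2021", "05-13-2021", "05-13-2021"],
    "name": ["MK", "MK", "MK", "MK", "PS", "PS"],
    "type": [
        ["PQRRC", "MNXYZ", "AYPIC", "KLUYT", "GFTBE", "BYPAC"],
        ["GFTBE", "AYPIC", "MNXYZ", "BYPAC", "KLUYT", "PQRRC"],
        ["KLUYT", "PQRRC", "BYPAC", "AYPIC", "GFTBE", "MNXYZ"],
        ["BYPAC", "KLUYT", "GFTBE", "AYPIC", "MNXYZ", "PQRRC"],
        ["XYDFE", "QRTSL", "CPQLE", "VXWUT", "ORSHC", "LTRDX"],
        ["VXWUT", "ORSHC", "QRTSL", "XYDFE", "LTRDX", "CPQLE"],
    ],
    "City": ["NYC", "NYC", "NYC", "NYC", "BAL", "BAL"],
})

# Sort each list and store it as a tuple so the column becomes hashable.
df["type"] = df["type"].map(lambda values: tuple(sorted(values)))

# With tuples in place, drop_duplicates can compare all four columns.
deduped = df.drop_duplicates().copy()

# Convert back to lists if later steps expect lists.
deduped["type"] = deduped["type"].map(list)
print(deduped)
```

On this sample the surviving rows are the same ones shown above (index 0, 2 and 4). The difference only matters when two rows share dt, name and City but carry different type values.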


19 Comments

Getting ValueError: axis(=-1) out of bounds.
@Murali Which line raised the error? The first or the second? Is the value in your type column really a list, or just a string written like a list?
Getting ValueError: axis(=-1) out of bounds for the first line: df['type'] = df['type'].map(np.sort)
@Murali Add the line df['type'] = df['type'].str.strip('[]').str.split(',') in front of the 2 lines and try again. Thanks!
In the final output, after using your code, I see an additional double quote + an empty space in the list of values of the column 'type'. Like for example, [" 'AYPIC'", " 'BYPAC'", " 'GFTBE'", " 'KLUYT'", " 'MNXYZ'", " 'PQRRC'"]. How to remove that?
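
The leftover quotes and spaces described in the last comment are what a plain split on "," produces when the cell is a string such as "['GFTBE', 'AYPIC']". A small sketch of the difference, assuming that quoted layout (the Series below is made-up illustration data, not the asker's file):

```python
import ast
import pandas as pd

# Hypothetical column where each cell is the string form of a list.
s = pd.Series(["['GFTBE', 'AYPIC', 'MNXYZ']", "['KLUYT', 'PQRRC', 'BYPAC']"])

# Splitting on "," alone keeps the quote characters and the leading spaces.
naive = s.str.strip("[]").str.split(",")
print(naive[0])

# ast.literal_eval parses the text as a Python literal and returns a real list,
# which can then be sorted item by item.
parsed = s.map(ast.literal_eval).map(sorted)
print(parsed[0])
```

ast.literal_eval only accepts Python literals, so it fails on an unquoted layout such as "[GFTBE, AYPIC]". In that case a strip and split approach is still needed.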

Try the below:

df["type"] = df["type"].apply(lambda x: sorted(list(x)))

This assumes that all the values in the 'type' column are lists.
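
If some cells are strings instead, sorted(list(x)) sorts the individual characters of the string, which may explain output that looks like punctuation only. A hedged sketch of a guard for that case follows; to_sorted_list is a hypothetical helper, and the unquoted "[A, B]" layout is an assumption about how the strings look.

```python
import pandas as pd


def to_sorted_list(cell):
    """Return a sorted list for a real list or for an unquoted list-like string."""
    if isinstance(cell, str):
        # Assumed layout: "[GFTBE, AYPIC, MNXYZ]" (no quotes around the items).
        cell = [item.strip() for item in cell.strip("[]").split(",")]
    return sorted(cell)


# Applying sorted() to a string sorts its characters, not its items.
chars = sorted(list("[B, A]"))
print(chars)

# One real list and one string cell, to show both branches of the helper.
mixed = pd.Series([["GFTBE", "AYPIC"], "[MNXYZ, BYPAC, KLUYT]"])
result = mixed.map(to_sorted_list).tolist()
print(result)
```

Checking df["type"].map(type).value_counts() first is a quick way to see whether the column really contains lists.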

2 Comments

Hi Amine, thanks. I tried your code and got this result from df.type.head(): 0 [, , , , ,] ... Something is going wrong; the values are missing.
df["type"] = df["type"].map(lambda x: sorted(list(x))) worked the same way as df['type'] = df['type'].map(np.sort).map(list). Thanks.
