Pandas Delete rows from dataframe based on condition

Question

Consider this code:

from io import StringIO  # Python 3 location of StringIO
import pandas as pd

txt = """a, RR
10, 1asas
20, 1asasas
30,
40, asas
50, ayty
60, 2asas
80, 3asas"""
frame = pd.read_csv(StringIO(txt), skipinitialspace=True)

print(frame, "\n\n\n")

l = []
# Visit every RR that neither starts with "1" nor is NaN (na=True excludes NaN here)
for i, j in frame[~frame['RR'].str.startswith("1", na=True)]['RR'].items():
    if j.startswith(('2', '3')):
        # Leading 2 or 3: compare "1" + the rest against the strings starting with "1"
        if frame[frame['RR'].str.startswith("1", na=False)]['RR'].str.match("1" + j[1:]).any():
            l.append(i)
    else:
        # Any other string: compare "1" + the whole string
        if frame[frame['RR'].str.startswith("1", na=False)]['RR'].str.match("1" + j).any():
            l.append(i)
frame = frame.drop(frame.index[l])
print(frame)

What I am doing here is:

Loop over the dataframe and drop any RR for which 1 + RR is already in the dataframe.
If RR starts with 2 or 3, drop it if 1 + RR[1:] is in the dataframe.
If RR starts with 1 or is NaN, do not touch it.
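
The three rules above can be restated as a minimal pure-Python sketch over a plain list. It assumes exact string matches (the loop in the question uses str.match, which also accepts prefix matches), and rows_to_drop is a hypothetical helper name, not a pandas function.

```python
import math


def rows_to_drop(values):
    """Return the positions of entries that the three rules would drop."""

    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    # Every string that starts with "1"; these are never dropped themselves.
    ones = {v for v in values if not is_missing(v) and v.startswith("1")}

    drop = []
    for pos, v in enumerate(values):
        if is_missing(v) or v.startswith("1"):
            continue  # rule 3: leave NaN and 1-prefixed values alone
        if v.startswith(("2", "3")):
            candidate = "1" + v[1:]  # rule 2: swap the leading digit for 1
        else:
            candidate = "1" + v  # rule 1: prepend 1
        if candidate in ones:
            drop.append(pos)
    return drop


sample = ["1asas", "1asasas", float("nan"), "asas", "ayty", "2asas", "3asas"]
print(rows_to_drop(sample))  # [3, 5, 6]
```

Because the lookup goes through a set, each row costs one hash lookup instead of one scan over all 1-prefixed strings.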

The code works, but the dataframe will have up to 1 million entries and I don't think this code is optimised. I have just started with pandas, so my knowledge is limited. Is there a way to achieve this without iteration? Does pandas have a built-in utility for this?

@IanS the string is 1 + RR. For example, we have asas here and also 1asas, so asas will be dropped.
– vks, Commented Oct 14, 2016 at 9:37
I don't have time for a fully-fledged answer, but this code could point you in the right direction: series1 = frame.loc[frame['RR'].str.startswith("1", na=False), 'RR']; frame.loc[(frame['RR'].str.startswith("2")) | (frame['RR'].str.startswith("3")), 'RR'].str.slice(1).isin(series1.str.slice(1)) (deals with your second condition).
– IanS, Commented Oct 14, 2016 at 9:56
@IanS Great, thanks! I will look into the first condition. But this will return a whole new frame, right?
– vks, Commented Oct 14, 2016 at 10:18

IanS · Accepted Answer · 2016-10-14 14:45:10Z

First, keep all rows where RR starts with 1 or is NaN:

keep = frame['RR'].str.startswith("1", na=True)
keep1 = keep[keep]  # will be used at the end

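Second, among the strings starting with 2 or 3, keep those whose remainder (everything after the first character) does not match the remainder of any string in rr1, the series of strings starting with 1: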

rr1 = frame.loc[frame['RR'].str.startswith("1", na=False), 'RR']
keep2 = ~frame.loc[
            (frame['RR'].str.startswith("2")) | (frame['RR'].str.startswith("3")), 'RR'
        ].str.slice(1).isin(rr1.str.slice(1))

Third, keep other strings that are not in rr1 after adding a leading 1:

import numpy as np
keep3 = ~("1" + frame.loc[
            ~frame['RR'].str.slice(0,1).isin([np.nan, "1", "2", "3"]), 'RR'
        ]).isin(rr1)

Finally, put everything together:

frame[pd.concat([keep1, keep2, keep3]).sort_index()]
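
The three masks can also be folded into a single boolean mask. The following is only a sketch of that idea, not the code from the steps above: filter_rr is a hypothetical helper name, NaN is temporarily replaced by an empty string so the string operations never see missing values, and matching is exact (via isin), as in the steps above.

```python
import numpy as np
import pandas as pd


def filter_rr(frame, col="RR"):
    """Return the rows of frame that survive the three rules (sketch)."""
    s = frame[col]
    is_na = s.isna()
    text = s.fillna("")  # placeholder; NaN rows are protected by is_na below

    starts_1 = text.str.startswith("1")
    starts_23 = text.str.startswith("2") | text.str.startswith("3")

    # Strip the leading 2 or 3 where present, then put a "1" in front.
    body = text.where(~starts_23, text.str.slice(1))
    key = "1" + body

    # Drop a row if it is not NaN, does not start with 1,
    # and its key already exists among the 1-prefixed strings.
    drop = ~is_na & ~starts_1 & key.isin(text[starts_1])
    return frame[~drop]


frame = pd.DataFrame({
    "a": [10, 20, 30, 40, 50, 60, 80],
    "RR": ["1asas", "1asasas", np.nan, "asas", "ayty", "2asas", "3asas"],
})
result = filter_rr(frame)
print(result)
```

On the sample data this keeps the rows with a equal to 10, 20, 30 and 50, the same rows that the concatenated keep1, keep2, keep3 mask selects.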

2 Comments

vks Over a year ago

There are two approaches: the one you use, or dropping rows from the existing frame. Which of the two would be better here?

IanS Over a year ago

I don't think there is any practical difference between the two.
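
To illustrate the two approaches discussed in these comments, here is a small sketch on a made-up three-row frame. The to_drop mask is hypothetical and hard-coded for the example. Both boolean indexing and drop return a new frame, and on this data the results compare equal.

```python
import pandas as pd

frame = pd.DataFrame({"a": [10, 20, 30], "RR": ["1asas", "asas", "ayty"]})

# Hypothetical mask marking the rows to remove (here: the "asas" row).
to_drop = pd.Series([False, True, False], index=frame.index)

# Approach 1: boolean indexing keeps the rows that are not marked.
by_mask = frame[~to_drop]

# Approach 2: look up the labels of the marked rows and drop them.
by_drop = frame.drop(frame.index[to_drop.to_numpy()])

print(by_mask.equals(by_drop))
```

If the original frame should be updated, either result can be assigned back to frame.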
