Consider this code:

from StringIO import StringIO
import pandas as pd

txt = """a, RR
10, 1asas
20, 1asasas
30,
40, asas
50, ayty
60, 2asas
80, 3asas"""
frame = pd.read_csv(StringIO(txt), skipinitialspace=True)

print frame,"\n\n\n"

l=[]
for i,j in frame[~ frame['RR'].str.startswith("1", na=True)]['RR'].iteritems():
    if j.startswith(('2','3')):
         if frame[frame['RR'].str.startswith("1", na=False)]['RR'].str.match("1"+j[1:], as_indexer = True).any():
            l.append(i)
    else:
        if frame[frame['RR'].str.startswith("1", na=False)]['RR'].str.match("1"+j, as_indexer = True).any():
            l.append(i)
frame = frame.drop(frame.index[l])
print frame

What I am doing here is:

  1. Loop through the dataframe and drop any RR value for which "1" + RR already exists in the dataframe (for example, asas is dropped because 1asas is present).

  2. If RR starts with 2 or 3, drop it if "1" + RR[1:] exists in the dataframe.

  3. If RR starts with 1 or is NaN, leave it untouched.

The code works, but this dataframe will have up to 1 million entries and I don't think the code is optimised. I have just started with pandas, so my knowledge is limited. Is there any way to achieve this without iteration? Does pandas have a built-in utility for this?

  • What do you mean by 1RR? Commented Oct 14, 2016 at 9:36
  • @IanS the string 1 + RR like we have here asas and also 1asas so that asas will be dropped Commented Oct 14, 2016 at 9:37
  • I don't have time for a fully-fledged answer, but this code could point you in the right direction: series1 = frame.loc[frame['RR'].str.startswith("1", na=False), 'RR']; frame.loc[(frame['RR'].str.startswith("2")) | (frame['RR'].str.startswith("3")), 'RR'].str.slice(1).isin(series1.str.slice(1)) (deals with your second condition). Commented Oct 14, 2016 at 9:56
  • @IanS gr8!!!! thanx....will look for 1st condition.....but this will return the whole new frame rt? Commented Oct 14, 2016 at 10:18

1 Answer

First, keep all strings starting with 1, as well as NaN values:

keep = frame['RR'].str.startswith("1", na=True)
keep1 = keep[keep]  # will be used at the end

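Second, among strings starting with 2 or 3, keep those whose remainder (after the leading digit) does not match the remainder of any string in rr1, the series of strings starting with 1: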

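# rr1: every RR value that starts with "1" (NaN rows excluded)
rr1 = frame.loc[frame['RR'].str.startswith("1", na=False), 'RR']
# na=False keeps the mask strictly boolean when RR contains NaN
keep2 = ~frame.loc[
            (frame['RR'].str.startswith("2", na=False)) | (frame['RR'].str.startswith("3", na=False)), 'RR'
        ].str.slice(1).isin(rr1.str.slice(1))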

Third, keep other strings that are not in rr1 after adding a leading 1:

import numpy as np
keep3 = ~("1" + frame.loc[
            ~frame['RR'].str.slice(0,1).isin([np.nan, "1", "2", "3"]), 'RR'
        ]).isin(rr1)

Finally, put everything together:

frame[pd.concat([keep1, keep2, keep3]).sort_index()]
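
For reference, the same three rules can also be folded into a single boolean mask. The following is only a sketch in Python 3 on the sample data from the question; the names protected, one_suffixes, starts_23, candidate and drop are made up for illustration. Like the steps above, it assumes exact matching via isin, whereas the str.match call in the question matches "1" + RR only as a prefix, so results could differ on data where one string starting with 1 is a prefix of another.

```python
from io import StringIO

import pandas as pd

txt = """a, RR
10, 1asas
20, 1asasas
30,
40, asas
50, ayty
60, 2asas
80, 3asas"""
frame = pd.read_csv(StringIO(txt), skipinitialspace=True)

rr = frame["RR"]

# Rows that are never dropped: NaN, or strings starting with "1".
protected = rr.str.startswith("1", na=True)

# Text after the leading "1" for every string that starts with "1".
one_suffixes = rr[rr.str.startswith("1", na=False)].str.slice(1)

# Value to look up for each row: strip a leading "2" or "3",
# otherwise use the whole string (NaN stays NaN).
starts_23 = rr.str.startswith("2", na=False) | rr.str.startswith("3", na=False)
candidate = rr.where(~starts_23, rr.str.slice(1))

# Drop unprotected rows whose candidate equals some suffix exactly.
drop = ~protected & candidate.isin(one_suffixes)
result = frame[~drop]
print(result)
```

On the sample data this keeps the rows with a equal to 10, 20, 30 and 50.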

2 Comments

We can take two approaches: the one you use, or dropping rows from the existing frame. Which of the two would be better here?
I don't think there is any practical difference between the two.
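
To illustrate the two approaches side by side, here is a small sketch in Python 3. The frame and the keep mask are hypothetical stand-ins, not the data from the question. Both forms return a new DataFrame with the same rows.

```python
import pandas as pd

# Hypothetical frame and mask, for illustration only.
frame = pd.DataFrame({"a": [10, 20, 30], "RR": ["1asas", "asas", "ayty"]})
keep = pd.Series([True, False, True], index=frame.index)

# Approach 1: select the rows to keep with a boolean mask.
selected = frame[keep]

# Approach 2: drop the complementary rows by index label.
dropped = frame.drop(frame.index[(~keep).to_numpy()])

print(selected.equals(dropped))
```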
