8

I have a DataFrame with columns A, B, C and a list of tuples like [(x1, y1), (x2, y2), ...]. I would like to delete all rows that meet the following condition: (B == x1 and C == y1) or (B == x2 and C == y2) or ... How can I do that in pandas? I wanted to use the isin function, but I'm not sure it is possible since my list contains tuples. I could do something like this:

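for x, y in tuples:
    # Python has no && operator; combine element-wise masks with & and parenthesize each comparison
    df = df.drop(df[(df.B == x) & (df.C == y)].index)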

Maybe there is an easier way.

3 Answers

7

Use pandas indexing

df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
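
As a minimal sketch of what that one-liner does, here it is on a small made-up frame (the data and the tuple list are assumptions for illustration, not taken from the question). Note that B and C come back as the leading columns and the original row labels are replaced by a fresh default index.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "A": [6, 2, 8, 7, 6],
    "B": [4, 0, 3, 8, 7],
    "C": [4, 3, 4, 3, 8],
})
tuples = [(3, 4), (7, 8), (1, 6)]  # (1, 6) matches no row

result = (
    df.set_index(["B", "C"])          # (B, C) pairs become MultiIndex labels
      .drop(tuples, errors="ignore")  # skip tuples that are not present
      .reset_index()                  # turn B and C back into columns
)
print(result)
```

Without errors='ignore', a tuple that matches no row (such as (1, 6) here) would raise a KeyError.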

Timing

import numpy as np
import pandas as pd

def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B','C']].values
    shp = np.maximum(BC_arr.max(0)+1,idx.max(0)+1)
    BC_IDs = np.ravel_multi_index(BC_arr.T,shp)
    idx_IDs = np.ravel_multi_index(idx.T,shp)
    return df[~np.in1d(BC_IDs,idx_IDs)]

def divakar(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

def pirsquared(df, tuples):
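    # errors='ignore' matches the answer above; without it, a tuple absent from df raises KeyError
    return df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()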

10 rows, 1 tuple

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (1, 2))]


10,000 rows, 500 tuples

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10000, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (500, 2))]



5 Comments

Hahah that's slick!
Thanks! That idx = np.array(tuples) could be added to approach #2 I guess for fairness.
one sec. I missed that
This only works if tuples is an array/list of tuples right? Is that because the main dataframe becomes multi-indexed and needs tuples? For example, if the "tuples" is actually 2 columns in a second DataFrame, the only way I could think of was to extract them with something like ((x, y) for x, y in df2[['A', 'B']].values) in the place of tuples which seems messy. Is there a better way to multi-index with the contents on a DataFrame?
That is messy and, under certain circumstances, incorrect. You'd want list(zip(df2['A'], df2['B'])) or, equivalently, [*zip(*map(df2.get, ['A', 'B']))]
4

Approach #1

Here's a vectorized approach using NumPy's broadcasting -

def broadcasting_based(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

Sample run -

In [224]: df
Out[224]: 
   A  B  C
0  6  4  4
1  2  0  3
2  8  3  4
3  7  8  3
4  6  7  8
5  3  3  2
6  5  4  2
7  2  4  7
8  6  1  6
9  1  1  1

In [225]: tuples = [(3,4),(7,8),(1,6)]

In [226]: broadcasting_based(df,tuples)
Out[226]: 
   A  B  C
0  6  4  4
1  2  0  3
3  7  8  3
5  3  3  2
6  5  4  2
7  2  4  7
9  1  1  1
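
To make the broadcasting step easier to follow, here is a rough sketch that splits the mask into its intermediate pieces. The small frame and the names b_match, c_match and drop_row are made up for illustration; the logic is the same as in broadcasting_based.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"A": [6, 8, 6, 3], "B": [4, 3, 7, 3], "C": [4, 4, 8, 2]})
tuples = [(3, 4), (7, 8), (1, 6)]

idx = np.array(tuples)                     # shape (3, 2): one row per tuple
b_match = df.B.values == idx[:, None, 0]   # (4,) against (3, 1) broadcasts to (3, 4)
c_match = df.C.values == idx[:, None, 1]   # same shape, compared on column C
mask = b_match & c_match                   # mask[i, j]: does row j equal tuple i?
drop_row = mask.any(axis=0)                # collapse over tuples: one flag per row

print(mask.shape)
print(drop_row)
print(df[~drop_row])
```

The intermediate mask holds one boolean per (tuple, row) pair, so its memory use grows with the product of the two lengths.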

Approach #2 : To cover a generic number of columns

For a case like this, one can collapse the values from the different columns into a single scalar per row that uniquely identifies the combination. This is done by treating each row of (B, C) as an indexing tuple into a grid large enough to hold every value, so that each row maps to one linear index. Each tuple in the list to be matched is reduced to a scalar in the same way. Finally, np.in1d finds the rows whose scalar appears among the tuple scalars, and inverting that mask gives the DataFrame with the matching rows removed. Note that this requires the column values to be non-negative integers. The implementation would be -

def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B','C']].values
    shp = np.maximum(BC_arr.max(0)+1,idx.max(0)+1)
    BC_IDs = np.ravel_multi_index(BC_arr.T,shp)
    idx_IDs = np.ravel_multi_index(idx.T,shp)
    return df[~np.in1d(BC_IDs,idx_IDs)]
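
As a small worked illustration of the linear-indexing idea, the sketch below prints the scalar IDs for a handful of made-up (B, C) pairs. The data is an assumption, and np.isin is swapped in for np.in1d here because newer NumPy releases deprecate np.in1d; the result is the same for 1D inputs.

```python
import numpy as np

# Hypothetical (B, C) pairs and tuples to remove; all values must be
# non-negative integers for np.ravel_multi_index to accept them.
BC_arr = np.array([[4, 4], [3, 4], [7, 8], [3, 2]])
idx = np.array([(3, 4), (7, 8), (1, 6)])

# Grid just large enough to hold every value in either array: (8, 9) here.
shp = tuple(np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1))

# Each pair (b, c) becomes the scalar b * shp[1] + c.
BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
idx_IDs = np.ravel_multi_index(idx.T, shp)

# True for the rows whose ID is not among the tuple IDs.
keep = ~np.isin(BC_IDs, idx_IDs)

print(BC_IDs)   # e.g. (3, 4) -> 3 * 9 + 4 = 31
print(idx_IDs)
print(keep)
```

Because two pairs share an ID only when they are equal, comparing the 1D ID arrays is equivalent to comparing the pairs themselves.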


0

It will probably be more efficient to use boolean indexing than repeated calls to DataFrame.drop, because pandas then builds a single new DataFrame at the end instead of reallocating memory on every loop iteration.

m = pd.Series(False, index=df.index)
for x,y in tuples:
    m |= (df.B == x) & (df.C == y)
df = df[~m]
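
For a quick check of that loop, here is a sketch that wraps it in a helper and runs it on made-up data. The function name drop_matching_rows and the sample frame are hypothetical, introduced only for illustration.

```python
import pandas as pd

def drop_matching_rows(df, tuples):
    """Hypothetical helper: remove rows whose (B, C) pair appears in tuples."""
    m = pd.Series(False, index=df.index)   # start with "drop nothing"
    for x, y in tuples:
        m |= (df.B == x) & (df.C == y)     # accumulate rows matching any tuple
    return df[~m]                          # build the filtered frame once

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"A": [6, 8, 6, 3], "B": [4, 3, 7, 3], "C": [4, 4, 8, 2]})
out = drop_matching_rows(df, [(3, 4), (7, 8), (1, 6)])
print(out)
```

Unlike the set_index approach, this keeps the original column order and row labels.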

1 Comment

@root haha you're absolutely right. Should've read more closely. Updating
