Pandas: Delete rows based on multiple columns values

Question

I have a dataframe with columns A, B, C and a list of tuples like [(x1,y1), (x2,y2), ...]. I would like to delete all rows that meet the following condition: (B == x1 and C == y1) or (B == x2 and C == y2) or ... How can I do that in pandas? I wanted to use the isin function, but I'm not sure whether that is possible since my list contains tuples. I could do something like this:

for x, y in tuples:
    df = df.drop(df[(df.B == x) & (df.C == y)].index)

Maybe there is an easier way.

piRSquared · Accepted Answer · 2016-07-22 23:42:01Z

7

Use pandas indexing

df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()
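
As a hedged illustration (the sample frame is borrowed from another answer on this page, and the variable names exist only for this sketch), here is the one-liner next to an isin-based variant built on pd.MultiIndex.from_frame. The set_index route moves B and C to the front and renumbers the rows, while the isin route keeps the original column order and index labels.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [6, 2, 8, 7, 6, 3, 5, 2, 6, 1],
    "B": [4, 0, 3, 8, 7, 3, 4, 4, 1, 1],
    "C": [4, 3, 4, 3, 8, 2, 2, 7, 6, 1],
})
tuples = [(3, 4), (7, 8), (1, 6)]

# Index-based drop: (B, C) becomes a MultiIndex, matching keys are dropped,
# and reset_index() turns B and C back into columns (now placed first).
via_drop = df.set_index(["B", "C"]).drop(tuples, errors="ignore").reset_index()

# isin-based variant: build a MultiIndex from the two columns only to test
# membership, then filter the original frame with the boolean mask.
in_list = pd.MultiIndex.from_frame(df[["B", "C"]]).isin(tuples)
via_isin = df[~in_list]

print(via_drop)
print(via_isin)
```

Which one fits depends on whether the original index and column order matter downstream.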

Timing

import numpy as np
import pandas as pd

def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B','C']].values
    shp = np.maximum(BC_arr.max(0)+1,idx.max(0)+1)
    BC_IDs = np.ravel_multi_index(BC_arr.T,shp)
    idx_IDs = np.ravel_multi_index(idx.T,shp)
    return df[~np.in1d(BC_IDs,idx_IDs)]

def divakar(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

def pirsquared(df, tuples):
    return df.set_index(list('BC')).drop(tuples, errors='ignore').reset_index()

10 rows, 1 tuple

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (1, 2))]

10,000 rows, 500 tuples

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(range(10), (10000, 3)), columns=list('ABC'))
tuples = [tuple(row) for row in np.random.choice(range(10), (500, 2))]

edited Jul 22, 2016 at 23:42

answered Jul 22, 2016 at 22:32

piRSquared


5 Comments

Alex Over a year ago

Hahah that's slick!

Divakar Over a year ago

Thanks! That idx = np.array(tuples) could be added to approach #2 I guess for fairness.

piRSquared Over a year ago

one sec. I missed that

Marses Over a year ago

This only works if tuples is an array/list of tuples right? Is that because the main dataframe becomes multi-indexed and needs tuples? For example, if the "tuples" is actually 2 columns in a second DataFrame, the only way I could think of was to extract them with something like ((x, y) for x, y in df2[['A', 'B']].values) in the place of tuples which seems messy. Is there a better way to multi-index with the contents on a DataFrame?

piRSquared Over a year ago

That is messy and under certain circumstances incorrect. You'd want list(zip(df['A'], df['B'])) OR the same thing but different [*zip(*map(df.get, ['A', 'B']))]
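
For the case raised in these comments, where the pairs live in two columns of a second DataFrame, one possible alternative is a left merge with indicator=True (an anti-join). This is only a sketch and was not proposed in the thread; df2 and its contents are hypothetical, and the zip call follows the suggestion in the comment above.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [6, 2, 8, 7, 6, 3, 5, 2, 6, 1],
    "B": [4, 0, 3, 8, 7, 3, 4, 4, 1, 1],
    "C": [4, 3, 4, 3, 8, 2, 2, 7, 6, 1],
})
# Hypothetical second frame holding the (B, C) pairs to remove.
df2 = pd.DataFrame({"B": [3, 7, 1, 3], "C": [4, 8, 6, 4]})

# Option 1: turn the two columns into a list of tuples.
tuples = list(zip(df2["B"], df2["C"]))

# Option 2: anti-join. Deduplicate the keys first so the left merge cannot
# multiply rows of df; the merge then has exactly one row per row of df.
keys = df2[["B", "C"]].drop_duplicates()
merged = df.merge(keys, on=["B", "C"], how="left", indicator=True)

# "left_only" marks rows of df whose (B, C) pair is absent from df2.
keep = (merged["_merge"] == "left_only").to_numpy()
result = df[keep]
print(result)
```

The mask is applied positionally, so the original index labels of df are preserved.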

Divakar · Accepted Answer · 2016-07-22 23:37:29Z

Approach #1

Here's a vectorized approach using NumPy's broadcasting -

import numpy as np

def broadcasting_based(df, tuples):
    idx = np.array(tuples)
    mask = (df.B.values == idx[:, None, 0]) & (df.C.values == idx[:, None, 1])
    return df[~mask.any(0)]

Sample run -

In [224]: df
Out[224]: 
   A  B  C
0  6  4  4
1  2  0  3
2  8  3  4
3  7  8  3
4  6  7  8
5  3  3  2
6  5  4  2
7  2  4  7
8  6  1  6
9  1  1  1

In [225]: tuples = [(3,4),(7,8),(1,6)]

In [226]: broadcasting_based(df,tuples)
Out[226]: 
   A  B  C
0  6  4  4
1  2  0  3
3  7  8  3
5  3  3  2
6  5  4  2
7  2  4  7
9  1  1  1
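
To make the broadcasting step more concrete, here is a small sketch that rebuilds the sample frame and inspects the intermediate arrays. The names b_match, c_match and hit_rows are introduced only for this illustration. Note that the mask holds one boolean per (tuple, row) combination, so its memory use grows with len(tuples) * len(df).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [6, 2, 8, 7, 6, 3, 5, 2, 6, 1],
    "B": [4, 0, 3, 8, 7, 3, 4, 4, 1, 1],
    "C": [4, 3, 4, 3, 8, 2, 2, 7, 6, 1],
})
tuples = [(3, 4), (7, 8), (1, 6)]
idx = np.array(tuples)                    # shape (3, 2)

# idx[:, None, 0] has shape (3, 1); comparing it with the length-10 column
# broadcasts to shape (3, 10): one row of comparisons per tuple.
b_match = df.B.values == idx[:, None, 0]
c_match = df.C.values == idx[:, None, 1]
mask = b_match & c_match                  # True where both columns match

# A dataframe row is removed if it matches any of the tuples.
hit_rows = mask.any(0)
print(mask.shape)
print(np.flatnonzero(hit_rows))
```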

Approach #2 : To cover a generic number of columns

For a case like this, one could collapse the values from the different columns into a single scalar that uniquely identifies each combination. This is done by treating each row as an indexing tuple into an N-dimensional grid and converting it to its linear index, so each row becomes one entry. Each tuple in the list to be matched is reduced to a scalar in the same way, giving a 1D array of IDs. Finally, np.in1d looks for matches between the two 1D arrays, and the resulting mask is used to remove the matching rows from the dataframe. Note that this relies on the values being non-negative integers. The implementation would be -

def linear_indexing_based(df, tuples):
    idx = np.array(tuples)
    BC_arr = df[['B','C']].values
    shp = np.maximum(BC_arr.max(0)+1,idx.max(0)+1)
    BC_IDs = np.ravel_multi_index(BC_arr.T,shp)
    idx_IDs = np.ravel_multi_index(idx.T,shp)
    return df[~np.in1d(BC_IDs,idx_IDs)]
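
As a worked example of the linear-index idea, the sketch below runs the same steps on the sample frame from Approach #1 and prints the intermediate IDs. It assumes non-negative integer values, which np.ravel_multi_index requires, and it uses np.isin in place of np.in1d, since np.in1d is deprecated in NumPy 2.0; the variable names keep and result are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [6, 2, 8, 7, 6, 3, 5, 2, 6, 1],
    "B": [4, 0, 3, 8, 7, 3, 4, 4, 1, 1],
    "C": [4, 3, 4, 3, 8, 2, 2, 7, 6, 1],
})
tuples = [(3, 4), (7, 8), (1, 6)]

idx = np.array(tuples)
BC_arr = df[["B", "C"]].values

# Grid shape large enough to hold every value seen in either array.
shp = tuple(int(n) for n in np.maximum(BC_arr.max(0) + 1, idx.max(0) + 1))

# With two columns, each (B, C) pair maps to the scalar B * shp[1] + C.
BC_IDs = np.ravel_multi_index(BC_arr.T, shp)
idx_IDs = np.ravel_multi_index(idx.T, shp)

# Keep the rows whose ID is not among the IDs to remove.
keep = ~np.isin(BC_IDs, idx_IDs)
result = df[keep]

print(shp)
print(BC_IDs)
print(idx_IDs)
```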

Alex · Accepted Answer · 2016-07-22 22:54:55Z

0

It will probably be more efficient to build one boolean mask and index once than to make repeated calls to DataFrame.drop, because pandas then doesn't have to reallocate memory in each loop iteration.

m = pd.Series(False, index=df.index)
for x,y in tuples:
    m |= (df.B == x) & (df.C == y)
df = df[~m]
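
A self-contained version of the loop above might look like the sketch below. The helper name drop_matching_pairs and the sample frame are assumptions made for illustration. The parentheses around each comparison matter, because & binds more tightly than == in Python.

```python
import pandas as pd

def drop_matching_pairs(frame, pairs):
    """Return the rows of frame whose (B, C) values match none of the pairs."""
    m = pd.Series(False, index=frame.index)
    for x, y in pairs:
        # Accumulate matches into one mask instead of dropping per pair.
        m |= (frame["B"] == x) & (frame["C"] == y)
    return frame[~m]

df = pd.DataFrame({
    "A": [6, 2, 8, 7, 6, 3, 5, 2, 6, 1],
    "B": [4, 0, 3, 8, 7, 3, 4, 4, 1, 1],
    "C": [4, 3, 4, 3, 8, 2, 2, 7, 6, 1],
})
tuples = [(3, 4), (7, 8), (1, 6)]

result = drop_matching_pairs(df, tuples)
untouched = drop_matching_pairs(df, [])   # an empty list removes nothing
print(result)
```

Unlike the set_index approach, this keeps the original index labels and column order.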

edited Jul 22, 2016 at 22:54

answered Jul 22, 2016 at 22:18

Alex


1 Comment

Alex Over a year ago

@root haha you're absolutely right. Should've read more closely. Updating
