Select rows of a dataframe based on another dataframe in Python

Question

I have the following dataframe:

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df1)

    A      B   C   D
0  foo    one  0   0
1  bar    one  1   2
2  foo    two  2   4
3  bar  three  3   6
4  foo    two  4   8
5  bar    two  5  10
6  foo    one  6  12
7  foo  three  7  14

I want to select the rows of df1 whose (A, B) pairs appear in df2, defined as follows:

df2 = pd.DataFrame({'A': 'foo bar'.split(),
                   'B': 'one two'.split()
                   })
print(df2)

     A    B
0  foo  one
1  bar  two

Here is what I have tried, but I wonder whether there is another method. Thanks.

df = df1.merge(df2, on=['A','B'])
print(df)

This is the expected output:

    A      B   C   D
0  foo    one  0   0
1  bar    two  5  10
2  foo    one  6  12

Using pandas to select rows using two different columns from dataframe?

Select Columns of a DataFrame based on another DataFrame

I think we could try the loc, iloc, or isin functions. What do you think?
– ah bon, Commented Jan 2, 2019 at 12:25

jezrael · Accepted Answer · 2019-01-02 12:34:59Z

21

The simplest solution is to use merge with an inner join.
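A minimal sketch of the inner-join approach, reusing df1 and df2 from the question. The names merged and kept are illustrative. Calling drop_duplicates() on df2 is an added precaution, not part of the answer, against repeated key pairs multiplying rows. The second variant assumes you want to keep df1's original row labels, which a plain merge replaces with a fresh 0..n-1 index.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8), 'D': np.arange(8) * 2})
df2 = pd.DataFrame({'A': 'foo bar'.split(), 'B': 'one two'.split()})

# Inner join on the key columns; only (A, B) pairs present in both frames survive.
merged = df1.merge(df2.drop_duplicates(), on=['A', 'B'], how='inner')

# Variant that carries df1's original row labels through the merge.
kept = (df1.reset_index()
           .merge(df2.drop_duplicates(), on=['A', 'B'], how='inner')
           .set_index('index'))

print(merged)
print(kept)
```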

Another solution with filtering:

arr = [np.array([df1[k] == v for k, v in x.items()]).all(axis=0) for x in df2.to_dict('records')]
df = df1[np.array(arr).any(axis=0)]
print(df)
     A    B  C   D
0  foo  one  0   0
5  bar  two  5  10
6  foo  one  6  12
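The one-liner above can be easier to follow when split into steps. The sketch below is an equivalent formulation, not the answerer's exact code: it builds one boolean mask per row of df2 (all key columns must match) and then keeps any row of df1 that matches at least one of them. The names row_masks, mask and result are illustrative, and the sketch assumes df2 has at least one row.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8), 'D': np.arange(8) * 2})
df2 = pd.DataFrame({'A': 'foo bar'.split(), 'B': 'one two'.split()})

# One mask per row of df2: True where every key column equals that row's value.
row_masks = [
    np.logical_and.reduce([df1[col] == val for col, val in rec.items()])
    for rec in df2.to_dict('records')
]

# A row of df1 is selected if it matches any row of df2.
mask = np.logical_or.reduce(row_masks)
result = df1[mask]
print(result)
```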

Or create a MultiIndex and filter with Index.isin:

df = df1[df1.set_index(['A','B']).index.isin(df2.set_index(['A','B']).index)]
print(df)
     A    B  C   D
0  foo  one  0   0
5  bar  two  5  10
6  foo  one  6  12
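As a variation on the MultiIndex idea, the index can also be built directly from the key columns with pd.MultiIndex.from_frame, which avoids reindexing the whole frame. This is a sketch under the assumption that both frames share the key column names; keys, idx1, idx2, selected and not_selected are illustrative names. Negating the mask gives the complementary rows.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8), 'D': np.arange(8) * 2})
df2 = pd.DataFrame({'A': 'foo bar'.split(), 'B': 'one two'.split()})

keys = ['A', 'B']
idx1 = pd.MultiIndex.from_frame(df1[keys])
idx2 = pd.MultiIndex.from_frame(df2[keys])

# Boolean array: True where df1's (A, B) pair occurs in df2.
mask = idx1.isin(idx2)

selected = df1[mask]        # rows whose key pair is in df2
not_selected = df1[~mask]   # rows whose key pair is not in df2
print(selected)
```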

answered Jan 2, 2019 at 12:34

jezrael


4 Comments

ah bon Over a year ago

Sincere thanks and great respect to the Python masters who comment and help here!! ;)

jezrael Over a year ago

@ahbon - yes, and the great thing is that there are a lot of pandas masters, so a lot of questions get answered ;) Happy coding!

ah bon Over a year ago

Your second solution is awesome!! It solves the problem in one line and gives the expected answer.

jezrael Over a year ago

@ahbon - yes, it is better :)

Dmitriy Work · Answer · 2022-06-08 07:03:33Z

3

Method #4. .apply + key function:

>>> key = lambda row: (row.A, row.B)
>>> df1[df1.apply(key, axis=1).isin(df2.apply(key, axis=1))]
     A    B  C   D
0  foo  one  0   0
5  bar  two  5  10
6  foo  one  6  12
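A related sketch of the same key-tuple idea that avoids row-wise .apply: the key pairs are built with zip and looked up in a Python set. This is a variant, not the answerer's code; wanted, mask and result are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8), 'D': np.arange(8) * 2})
df2 = pd.DataFrame({'A': 'foo bar'.split(), 'B': 'one two'.split()})

# Set of (A, B) pairs to look for.
wanted = set(zip(df2['A'], df2['B']))

# One boolean per row of df1: is its (A, B) pair in the set?
mask = [pair in wanted for pair in zip(df1['A'], df1['B'])]

result = df1[mask]
print(result)
```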

Method #5. .join:

>>> df1.join(df2.set_index(['A', 'B']), on=['A', 'B'], how='right')
     A    B  C   D
0  foo  one  0   0
6  foo  one  6  12
5  bar  two  5  10
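Note that a right join follows df2's key order and, by definition, keeps every key of df2 even when df1 has no matching row. If only matching rows are wanted in df1's original order, one option is how='inner' followed by sort_index(). The sketch below is an assumption about what is wanted, not part of the answer; selected is an illustrative name.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8), 'D': np.arange(8) * 2})
df2 = pd.DataFrame({'A': 'foo bar'.split(), 'B': 'one two'.split()})

# Join df1's key columns against df2's MultiIndex; inner keeps matches only.
selected = (df1.join(df2.set_index(['A', 'B']), on=['A', 'B'], how='inner')
               .sort_index())  # restore df1's row order
print(selected)
```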

Methods already mentioned:

.merge by @ahbon
Filtering with .to_dict('records') by @jezrael (fastest)
Index.isin by @jezrael

Performance comparison on the small df1 and df2 from the question (fastest to slowest):

>>> %%timeit
>>> df1[np.array([np.array([df1[k] == v for k, v in x.items()]).all(axis=0) for x in df2.to_dict('records')]).any(axis=0)]
1.62 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> key = lambda row: (row.A, row.B)
>>> %%timeit
>>> df1[df1.apply(key, axis=1).isin(df2.apply(key, axis=1))]
2.96 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %%timeit
>>> df1.merge(df2, on=['A','B'])
3.15 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %%timeit
>>> df1.join(df2.set_index(['A', 'B']), on=['A', 'B'], how='right')
3.97 ms ± 341 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %%timeit
>>> df1[df1.set_index(['A','B']).index.isin(df2.set_index(['A','B']).index)]
6.55 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
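Timings on an 8-row frame are dominated by fixed overhead, so the ranking may not carry over to larger data. The sketch below shows one way to repeat the comparison on a bigger frame using the standard timeit module. The frame size n, the random data, and the names big1, big2, via_merge and via_isin are all hypothetical, and no particular result is implied.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000  # hypothetical size; adjust to resemble your own data

big1 = pd.DataFrame({'A': rng.choice(['foo', 'bar', 'baz'], n),
                     'B': rng.choice(['one', 'two', 'three'], n),
                     'C': np.arange(n)})
big2 = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two']})


def via_merge():
    # Inner merge on the key columns.
    return big1.merge(big2, on=['A', 'B'])


def via_isin():
    # MultiIndex membership test on the key columns.
    idx1 = pd.MultiIndex.from_frame(big1[['A', 'B']])
    idx2 = pd.MultiIndex.from_frame(big2[['A', 'B']])
    return big1[idx1.isin(idx2)]


repeats = 5
for fn in (via_merge, via_isin):
    seconds = timeit.timeit(fn, number=repeats)
    print(f'{fn.__name__}: {seconds / repeats * 1e3:.2f} ms per call')
```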

edited Jun 8, 2022 at 7:03

answered Jun 8, 2022 at 6:58

Dmitriy Work


1 Comment

pauljohn32 Over a year ago

The error I see with the simple approach is "ValueError: Can only compare identically-labeled Series objects", so it is necessary to set the index. This approach handles that and it works. It is the best.
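The comment does not say which code raised the error. One plausible way to hit it, shown here as an assumption, is comparing columns of the two frames directly with ==, which requires identically labeled Series. The small frames below are hypothetical stand-ins, and the index-based filter afterwards does not have this restriction.

```python
import pandas as pd

# Hypothetical small frames of different lengths.
df1 = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'B': ['one', 'two', 'two']})
df2 = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two']})

# Element-wise == needs both Series to carry the same labels.
try:
    df1['A'] == df2['A']
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Membership test on a MultiIndex built from the key columns works instead.
mask = df1.set_index(['A', 'B']).index.isin(df2.set_index(['A', 'B']).index)
print(df1[mask])
```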

Oscar Rangel · Answer · 2022-06-12 12:31:46Z

# The .merge method performs an inner join by default. 
# The resulting dataframe will only have rows where the 
# merge column value exists in both dataframes

x = df_only_english.merge(train_orders.assign(id=train_orders.id))
x

   Unnamed: 0 language     score              id iso_language_name    is_en  cell_order
0           0       en  0.999998  00015c83e2717b           English  English  2e94bd7a 3e99dee9 b5e286ea da4f7550 c417225b 51e3cd89 2600b4eb 75b65993 cf195f8b 25699d02 72b3201a f2c750d3 de148b56...
1           1       en  0.999999  0001bdd4021779           English  English  3fdc37be 073782ca 8ea7263c 80543cd8 38310c80 073e27e5 015d52a4 ad7679ef 7fde4f04 07c52510 0a1a7a39 0bcd3fef 58bf360b
2           2       en  0.999996  0001daf4c2c76d           English  English  97266564 a898e555 86605076 76cc2642 ef279279 df6c939f 2476da96 00f87d0a ae93e8e6 58aadb1d d20b0094 986fd4f1 b4ff1015...
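Applied to the question's frames, the same idea looks like the sketch below. When on is omitted, merge joins on the columns the two frames have in common, so for df1 and df2 it should be equivalent to passing on=['A', 'B'] explicitly. The names shared, default_on and explicit_on are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8), 'D': np.arange(8) * 2})
df2 = pd.DataFrame({'A': 'foo bar'.split(), 'B': 'one two'.split()})

# Columns present in both frames; these become the join keys by default.
shared = df1.columns.intersection(df2.columns).tolist()
print(shared)

default_on = df1.merge(df2)                              # keys inferred
explicit_on = df1.merge(df2, on=['A', 'B'], how='inner')  # keys spelled out
print(default_on.equals(explicit_on))
```

Spelling out on is usually the safer choice, since an unrelated column that happens to share a name would otherwise silently become part of the join key.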
