19

I have the following dataframe:

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df1)

    A      B   C   D
0  foo    one  0   0
1  bar    one  1   2
2  foo    two  2   4
3  bar  three  3   6
4  foo    two  4   8
5  bar    two  5  10
6  foo    one  6  12
7  foo  three  7  14

I hope to select rows in df1 by the df2 as follows:

df2 = pd.DataFrame({'A': 'foo bar'.split(),
                   'B': 'one two'.split()
                   })
print(df2)

     A    B
0  foo  one
1  bar  two

Here is what I have tried in Python, but I just wonder if there is another method. Thanks.

df = df1.merge(df2, on=['A','B'])
print(df)

This is the output expected.

    A      B   C   D
0  foo    one  0   0
1  bar    two  5  10
2  foo    one  6  12

Using pandas to select rows using two different columns from dataframe?

Select Columns of a DataFrame based on another DataFrame

7
  • what's wrong with current method? Commented Jan 2, 2019 at 12:22
  • 3
    In my opinion it is good solution. Commented Jan 2, 2019 at 12:23
  • I think maybe we can try with loc, iloc or isin function, what do you think? Commented Jan 2, 2019 at 12:25
  • stackoverflow.com/questions/13937022/… Commented Jan 2, 2019 at 12:25
  • 1
    Sorry maybe i ask a silly question. Commented Jan 2, 2019 at 12:26

3 Answers 3

21

Simpliest is use merge with inner join.

Another solution with filtering:

arr = [np.array([df1[k] == v for k, v in x.items()]).all(axis=0) for x in df2.to_dict('r')]
df = df1[np.array(arr).any(axis=0)]
print(df)
     A    B  C   D
0  foo  one  0   0
5  bar  two  5  10
6  foo  one  6  12

Or create MultiIndex and filter with Index.isin:

df = df1[df1.set_index(['A','B']).index.isin(df2.set_index(['A','B']).index)]
print(df)
     A    B  C   D
0  foo  one  0   0
5  bar  two  5  10
6  foo  one  6  12
Sign up to request clarification or add additional context in comments.

4 Comments

Sincere thanks and great respect to Python masters who give comments and helps here!! ;)
@ahbon - yes, and super thing is there is a lot of masters of pandas, so a lot of answers are answered ;) Happy coding!
Your second solution is awesome!! Solve problem on one line and it's the answer expected.
@ahbon - yes, it is better :)
3

Method #4. .apply + key function:

>>> key = lambda row: (row.A, row.B)
>>> df1[df1.apply(key, axis=1).isin(df2.apply(key, axis=1))]
     A    B  C   D
0  foo  one  0   0
5  bar  two  5  10
6  foo  one  6  12

Method #5. .join:

>>> df1.join(df2.set_index(['A', 'B']), on=['A', 'B'], how='right')
     A    B  C   D
0  foo  one  0   0
6  foo  one  6  12
5  bar  two  5  10

Methods already mentioned:

  1. .merge by @ahbon
  2. Filtering with .to_dict('records') by @jezrael (fastest)
  3. Index.isin by @jezrael

Performance comparison (fastest to slowest):

>>> %%timeit
>>> df1[np.array([np.array([df1[k] == v for k, v in x.items()]).all(axis=0) for x in df2.to_dict('records')]).any(axis=0)]
1.62 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> key = lambda row: (row.A, row.B)
>>> %%timeit
>>> df1[df1.apply(key, axis=1).isin(df2.apply(key, axis=1))]
2.96 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %%timeit
>>> df1.merge(df2, on=['A','B'])
3.15 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %%timeit
>>> df1.join(df2.set_index(['A', 'B']), on=['A', 'B'], how='right')
3.97 ms ± 341 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %%timeit
>>> df1[df1.set_index(['A','B']).index.isin(df2.set_index(['A','B']).index)]
6.55 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

1 Comment

The error I see with simple approach is "ValueError: Can only compare identically-labeled Series objects" So it is necessary to set index. This approach handles that and it works. It the best.
1
# The .merge method performs an inner join by default. 
# The resulting dataframe will only have rows where the 
# merge column value exists in both dataframes

x = df_only_english.merge(train_orders.assign(id=train_orders.id))
x

Unnamed: 0  language    score   id  iso_language_name   is_en   cell_order
0   0   en  0.999998    00015c83e2717b  English English 2e94bd7a 3e99dee9 b5e286ea da4f7550 c417225b 51e3cd89 2600b4eb 75b65993 cf195f8b 25699d02 72b3201a f2c750d3 de148b56...
1   1   en  0.999999    0001bdd4021779  English English 3fdc37be 073782ca 8ea7263c 80543cd8 38310c80 073e27e5 015d52a4 ad7679ef 7fde4f04 07c52510 0a1a7a39 0bcd3fef 58bf360b
2   2   en  0.999996    0001daf4c2c76d  English English 97266564 a898e555 86605076 76cc2642 ef279279 df6c939f 2476da96 00f87d0a ae93e8e6 58aadb1d d20b0094 986fd4f1 b4ff1015...

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.