How to check if a substring in a pandas dataframe column exists in a substring of another column in the same dataframe?

Question

I have a dataframe with columns like this:

   A                            B
0  - 5923FoxRd                  5923 Fox Rd
1  631 Newhaven Ave             Modesto
2  Saratoga Street, Suite 200   Saratoga Street, Suite 200

I want to create a list of the values from A that match the value of B in the same row. The list should look like ['- 5923FoxRd', 'Saratoga Street, Suite 200', ...]. What is the easiest way to do this?

Why is row 0 a match?

– Bill Huang, Commented Oct 1, 2020 at 1:09
That's because the address is the same in both columns.

– Soham, Commented Oct 1, 2020 at 1:15

David Erickson · Accepted Answer · 2020-10-01 01:32:28Z

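A little normalization goes a long way here. Do the following:

1. Create a new series for each column, passing the regex pattern \W+ to str.replace() to strip every non-word character (spaces, hyphens, commas).
2. Lowercase the result with str.lower().
3. Optionally, create replace lists to normalize drive to dr, avenue to ave, etc.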

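# regex=True is needed on pandas >= 2.0, where str.replace treats the pattern literally by default
s1 = df['A'].str.replace(r'\W+', '', regex=True).str.lower()
s2 = df['B'].str.replace(r'\W+', '', regex=True).str.lower()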
lst = [*df[s1==s2]['A']]
lst
Out[1]: ['- 5923FoxRd', 'Saratoga Street, Suite 200']

This is what s1 and s2 look like:

print(s1,s2)

0                 5923foxrd
1            631newhavenave
2    saratogastreetsuite200
Name: A, dtype: object

0                 5923foxrd
1                   modesto
2    saratogastreetsuite200
Name: B, dtype: object
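
For anyone who wants to run this end to end, here is a minimal, self-contained sketch of the same idea. The frame is rebuilt from the sample in the question, and the helper name `normalize` is made up for illustration. `regex=True` is passed explicitly because pandas 2.0 changed the default of `str.replace` to literal matching.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "A": ["- 5923FoxRd", "631 Newhaven Ave", "Saratoga Street, Suite 200"],
    "B": ["5923 Fox Rd", "Modesto", "Saratoga Street, Suite 200"],
})

def normalize(series):
    """Hypothetical helper: drop non-word characters, then lowercase."""
    return series.str.replace(r"\W+", "", regex=True).str.lower()

s1 = normalize(df["A"])
s2 = normalize(df["B"])

# Keep the original values of A on rows where the normalized strings are equal
lst = df.loc[s1 == s2, "A"].tolist()
print(lst)
```

`df.loc[s1 == s2, 'A'].tolist()` gives the same list as `[*df[s1==s2]['A']]`; either form should work.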

From there, you might want to add replacement values to normalize your data even further, for example:

to_replace = ['drive', 'avenue', 'street']
replaced = ['dr', 'ave', 'str']
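# regex=True on str.replace is needed on pandas >= 2.0; Series.replace then swaps the long forms for abbreviations
s1 = df['A'].str.replace(r'\W+', '', regex=True).str.lower().replace(to_replace, replaced, regex=True)
s2 = df['B'].str.replace(r'\W+', '', regex=True).str.lower().replace(to_replace, replaced, regex=True)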
lst = [*df[s1==s2]['A']]
lst
print(s1,s2)
0              5923foxrd
1         631newhavenave
2    saratogastrsuite200
Name: A, dtype: object

0              5923foxrd
1                modesto
2    saratogastrsuite200
Name: B, dtype: object
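
The comparison above only finds rows where the two normalized strings are identical. Since the question title asks about substrings, one possible variation is to test whether either normalized value contains the other. This is only a sketch: the fourth row and the names `normalize` and `contains_either_way` are hypothetical additions, not part of the original data or answer. Containment on fully stripped strings can also produce false positives (for example, `12oakdr` is contained in `112oakdr`), so it is probably best to treat the result as candidates to review.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["- 5923FoxRd", "631 Newhaven Ave", "Saratoga Street, Suite 200",
          "12 Oak Drive"],                   # last row is hypothetical
    "B": ["5923 Fox Rd", "Modesto", "Saratoga Street, Suite 200",
          "Unit 4, 12 Oak Drive, Modesto"],  # last row is hypothetical
})

def normalize(series):
    """Hypothetical helper: treat missing values as empty, strip non-word characters, lowercase."""
    return series.fillna("").str.replace(r"\W+", "", regex=True).str.lower()

def contains_either_way(a, b):
    """True if one non-empty string occurs inside the other."""
    return bool(a) and bool(b) and (a in b or b in a)

s1 = normalize(df["A"])
s2 = normalize(df["B"])

# Row-wise check: pair up the normalized values and test containment in both directions
mask = [contains_either_way(a, b) for a, b in zip(s1, s2)]
lst = df.loc[mask, "A"].tolist()
print(lst)
```

The empty-string guard is there because `"" in x` is always true in Python, which would otherwise mark every row with a blank address as a match.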
