Python Dataframe: String in DF Column Contains Substring from Different DF and Substring Values Returned When Match

Question

Colleagues,

Maybe you can help me with what appears to be a simple task, but I am not yet experienced enough to figure it out.

Let's say we have two dataframes:

df1 contains substrings;
df2 contains longer blocks of text, some of them contain substrings from df1.

df1 = {'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE', 'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO', 'LEBRON JAMES']}

df2 = {'strng': ['LEBRON JAMES SCORED 20', 'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK', 'TRUE IS NOT WHAT YOU THINK', 'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']}

df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)

Here is what I need:

I need to iterate through the rows and check whether any substring in df1['subst'] is present anywhere in df2['strng'].
If one is present, I want a new column ['match_df1'] in df2 that contains the matching substring value from df1.

The final output in df2 would look something like this:

strng	match_df1
LEBRON JAMES SCORED 20	LEBRON JAMES
THREE TIMES DEAD JOHNY WAS HELL OF THE COOK	THREE TIMES DEAD
TRUE IS NOT WHAT YOU THINK	TRUE IS NOT
FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED	FIVE TIMES FIVE

Does this answer your question? Searching substring of a dataframe if it exists in another dataframe column
– Chris, Commented Sep 21, 2021 at 13:39

tlentali · Accepted Answer · 2021-09-23 06:33:15Z

As @Chris noted, the answer to the linked question might do the job.
Then just filter out the rows where the joined result is an empty string, like so:

>>> for ind1 in df1.index:
...    df1.loc[ind1, 'strng'] = ', '.join(list(df2[df2['strng'].str.contains(df1['subst'][ind1])]['strng']))
>>> df1[df1['strng'].str.len() > 0]
    subst                strng
2   FIVE TIMES FIVE      FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED
4   TRUE IS NOT          TRUE IS NOT WHAT YOU THINK
6   LEBRON JAMES         LEBRON JAMES SCORED 20

Full code:

import pandas as pd

df1 = {'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE', 'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO', 'LEBRON JAMES']}
df2 = {'strng': ['LEBRON JAMES SCORED 20', 'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK', 'TRUE IS NOT WHAT YOU THINK', 'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']}

df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)

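# For each substring in df1, join every df2 string that contains it
# (the result is an empty string when nothing matches).
for ind1 in df1.index:
   df1.loc[ind1, 'strng'] = ', '.join(list(df2[df2['strng'].str.contains(df1['subst'][ind1])]['strng']))
# Keep only the substrings that matched at least one string.
df1[df1['strng'].str.len() > 0]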
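
The loop above stores the matches in df1, whereas the question asks for a match_df1 column in df2. A minimal sketch of one way to get that layout, assuming literal (non-regex) substrings and that the first match in each string is enough; the `subs` and `pattern` names are only illustrative:

```python
import re

import pandas as pd

df1 = pd.DataFrame({'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE',
                              'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO',
                              'LEBRON JAMES']})
df2 = pd.DataFrame({'strng': ['LEBRON JAMES SCORED 20',
                              'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK',
                              'TRUE IS NOT WHAT YOU THINK',
                              'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']})

# Assumption: try longer substrings first, so that if one substring is a
# prefix of another, the longer one is preferred at the same position.
subs = sorted(df1['subst'], key=len, reverse=True)

# Escape each substring so it is matched literally, then build one
# alternation pattern with a single capture group.
pattern = '(' + '|'.join(re.escape(s) for s in subs) + ')'

# str.extract returns the first match per row, or NaN when nothing matches.
df2['match_df1'] = df2['strng'].str.extract(pattern, expand=False)
```

With the sample data, the second row stays NaN because df1 holds 'THREE TIME DEAD' while the text reads 'THREE TIMES DEAD'; this agrees with the three matches shown in the output above.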

edited Sep 23, 2021 at 6:33

answered Sep 21, 2021 at 13:50

tlentali


3 Comments

Raganosis Over a year ago

Thanks a lot! Though when trying to execute this on my data (as well as on the example), I am getting a "'list' object is not callable" error.

tlentali Over a year ago

Hi @Raganosis, I just reran your 4 lines of code followed by mine and I get exactly the same result without any error on my side. I added the exact code I just ran at the end of the answer; can you copy it and try it on your side to see if you still get an error?

Raganosis Over a year ago

Hi @tlentali, thanks again. Interestingly, it works only after I restarted Python. Anyway, thanks a lot, it solves the issue!
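
A possible explanation for the error above, offered as an assumption since the thread does not confirm it: if the name `list` had been assigned to an actual list earlier in the same session, the `list(...)` call in the answer would fail with exactly this message, and restarting Python would clear the rebinding. A small sketch that reproduces the effect (the values are made up):

```python
import builtins

# Hypothetical earlier statement that shadows the built-in type.
list = ['a', 'b']

try:
    list(range(3))  # now calls a list instance, not the built-in type
except TypeError as exc:
    message = str(exc)

# Rebind the name to the built-in instead of restarting the interpreter.
list = builtins.list
restored = list(range(3))
```

Here `message` holds the text of the TypeError, and `restored` shows that the call works again once the name points back at the built-in.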
