Colleagues,

Maybe you can help me with what appears to be a simple task, but I am not yet experienced enough to figure it out.

Let's say we have two dataframes:

  1. df1 contains substrings;
  2. df2 contains longer blocks of text, some of them contain substrings from df1.

import pandas as pd

df1 = {'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE', 'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO', 'LEBRON JAMES']}

df2 = {'strng': ['LEBRON JAMES SCORED 20', 'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK', 'TRUE IS NOT WHAT YOU THINK', 'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']}

df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)

Here is what I need:

  1. I need to iterate through the rows to check whether each substring in df1['subst'] is present anywhere in df2['strng'].
  2. If a substring is present in df2, I want a new column ['match_df1'] in df2 that contains the matching substring value from df1.

The final output in df2 would look something like this:

strng                                          match_df1
LEBRON JAMES SCORED 20                         LEBRON JAMES
THREE TIMES DEAD JOHNY WAS HELL OF THE COOK    THREE TIMES DEAD
TRUE IS NOT WHAT YOU THINK                     TRUE IS NOT
FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED      FIVE TIMES FIVE
1 Answer

As @Chris noted, this answer might do the job.
Then just filter out the empty strings like so:

>>> for ind1 in df1.index:
...    df1.loc[ind1, 'strng'] = ', '.join(list(df2[df2['strng'].str.contains(df1['subst'][ind1])]['strng']))
>>> df1[df1['strng'].str.len() > 0]
    subst                strng
2   FIVE TIMES FIVE      FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED
4   TRUE IS NOT          TRUE IS NOT WHAT YOU THINK
6   LEBRON JAMES         LEBRON JAMES SCORED 20

Full code:

import pandas as pd

df1 = {'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE', 'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO', 'LEBRON JAMES']}
df2 = {'strng': ['LEBRON JAMES SCORED 20', 'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK', 'TRUE IS NOT WHAT YOU THINK', 'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']}

df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)

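# For each substring in df1, collect every df2 text that contains it,
# joined with ', ' (an empty string means nothing matched).
for ind1 in df1.index:
    df1.loc[ind1, 'strng'] = ', '.join(list(df2[df2['strng'].str.contains(df1['subst'][ind1])]['strng']))
# Keep only the substrings that matched at least one text.
df1[df1['strng'].str.len() > 0]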
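
The loop above puts the matches in df1, whereas the question asks for a match_df1 column in df2. A minimal sketch of one way to get that, assuming each text should carry only the first substring found in it and that the substrings are literal text rather than regular expressions: build a single alternation pattern and use Series.str.extract. This is a different technique from the loop above, and the names subs and pattern are only illustrative.

```python
import re

import pandas as pd

df1 = pd.DataFrame({'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE',
                              'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO',
                              'LEBRON JAMES']})
df2 = pd.DataFrame({'strng': ['LEBRON JAMES SCORED 20',
                              'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK',
                              'TRUE IS NOT WHAT YOU THINK',
                              'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']})

# Escape each substring so it is matched literally, and list longer
# substrings first so they are tried before shorter ones at the same position.
subs = sorted(df1['subst'], key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(s) for s in subs) + ')'

# str.extract returns the first match in each row, or NaN if nothing matches.
df2['match_df1'] = df2['strng'].str.extract(pattern, expand=False)
print(df2)
```

With the sample data exactly as posted, the second row stays NaN, because df1 holds 'THREE TIME DEAD' while the text reads 'THREE TIMES DEAD'. That is the same reason the filtered output above shows only three rows.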
3 Comments

Thanks a lot! Though when I try to execute this on my data (as well as on the example), I get a "'list' object is not callable" error.
Hi @Raganosis, I just reran your 4 lines of code followed by mine and got exactly the same result, without any error on my side. I added the exact code I just ran at the end of the answer; can you copy it and try it on your side to see if you still get an error?
Hi @tlentali, thanks again. Interestingly, it worked only after I restarted Python. Anyway, thanks a lot, it solves the issue!