I am trying to join two DataFrames by matching a substring of a column in one against a column in the other.

My data looks like this:

Dataframe 1

prod_name, prod_id, prod_category
prod_1, cate_1000101, category_1
prod_2, cate_123001, category_2
prod_3, cate_900, category_3
prod_4, cate_808, category_4

Dataframe 2

bill_id, bill_date, prod_ref
101, 2021-01-01, 3001
102, 2021-01-01, 5001
103, 2021-01-01, 8080

I want to check whether any part of prod_id from Dataframe 1 appears in prod_ref in Dataframe 2.

Expected output:

prod_name, prod_id, bill_id, bill_date, prod_ref
prod_2, cate_123001, 101, 2021-01-01, 3001
prod_4, cate_808, 103, 2021-01-01, 8080
  • Is prod_ref a string column? Commented Apr 16, 2021 at 3:30
  • @DerekO, yes it is of type string Commented Apr 16, 2021 at 3:33
  • Is there a limit on how short the substring match should be? Because it seems prod_ref=5001 could also get matched with any prod_id containing 1, e.g. prod_id=cate_1000101 Commented Apr 16, 2021 at 4:10
  • When you say any part of prod_id, is there a minimum number of digits that you are willing to compare? 808 is in 8080, but the '01' from the end of '123001' is in '3001' and '5001' Commented Apr 16, 2021 at 4:17
  • @KevinNash oh nice! when you are able to, you should accept your own answer so that people who have the same question get directed to the right answer. Cheers! Commented Apr 19, 2021 at 15:13

1 Answer

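I was able to get the required output with the merge below, which joins on the first run of digits extracted from each column. Note that left_on must come from df1 and right_on from df2, and that the regex is written as a raw string:

df1.merge(df2, left_on = df1.prod_id.str.extract(r'(\d+)', expand = False), right_on = df2.prod_ref.str.extract(r'(\d+)', expand = False), how = 'left')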
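A merge on extracted digits only pairs rows whose digit runs are exactly equal, so it may not cover partial overlaps such as 808 inside 8080. As a minimal sketch of a containment-based alternative, the example below assumes that "any part" means the digits of prod_id contain prod_ref or the reverse. That rule is an assumption: the question does not define it, but it does reproduce the expected output for the sample data. The names id_digits, pairs and result are illustrative only, and how="cross" needs pandas 1.2 or later.

```python
import pandas as pd

# Sample data copied from the question; prod_ref is a string column.
df1 = pd.DataFrame({
    "prod_name": ["prod_1", "prod_2", "prod_3", "prod_4"],
    "prod_id": ["cate_1000101", "cate_123001", "cate_900", "cate_808"],
    "prod_category": ["category_1", "category_2", "category_3", "category_4"],
})
df2 = pd.DataFrame({
    "bill_id": [101, 102, 103],
    "bill_date": ["2021-01-01", "2021-01-01", "2021-01-01"],
    "prod_ref": ["3001", "5001", "8080"],
})

# Pull the first run of digits out of prod_id (illustrative helper column).
df1_keys = df1.assign(
    id_digits=df1["prod_id"].str.extract(r"(\d+)", expand=False)
)

# Build every (product, bill) pair, then keep pairs where one digit string
# contains the other. Assumption: this two-way containment is the intended
# rule, and neither column has missing values.
pairs = df1_keys.merge(df2, how="cross")
mask = [
    ref in digits or digits in ref
    for digits, ref in zip(pairs["id_digits"], pairs["prod_ref"])
]

result = pairs.loc[
    mask, ["prod_name", "prod_id", "bill_id", "bill_date", "prod_ref"]
].reset_index(drop=True)
print(result)
```

The cross join produces len(df1) * len(df2) rows, so this is only practical for modest table sizes. If short accidental matches are a concern, as raised in the comments, a minimum length check on the shorter string could be added to the mask.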