
I have a DataFrame df_1 like this:

A                      

apple, iphone, android
facebook, apple
macbook, laptop
firestick, hulu, netflix
android, laptop
laptop

And df_2 like this:

A           B

apple       1
macbook     2
facebook    3
firestick   4
hulu        5
netflix     6
android     7
laptop      8
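
For reproducibility, the two frames can be built with something like this (assuming plain string and integer columns, exactly as shown above):

import pandas as pd

df_1 = pd.DataFrame({'A': ['apple, iphone, android', 'facebook, apple',
                           'macbook, laptop', 'firestick, hulu, netflix',
                           'android, laptop', 'laptop']})
df_2 = pd.DataFrame({'A': ['apple', 'macbook', 'facebook', 'firestick',
                           'hulu', 'netflix', 'android', 'laptop'],
                     'B': [1, 2, 3, 4, 5, 6, 7, 8]})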

I am trying to extract, for each row, the single word from column A of df_1 that has the lowest value in column B of df_2, like so:

A                               B_new

apple, iphone, android          apple
facebook, apple                 apple
macbook, laptop                 macbook
firestick, hulu, netflix        firestick
android, laptop                 android
laptop                          laptop

I assume I could sort each value of df_1's column A based on the values of B in df_2, or create a function that takes a single A value from df_1 and returns the string with the smallest number in B from df_2. But since the data is quite big, I assume using apply is not very efficient. Is there a neat Pandas way of doing such a task?
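For concreteness, the sorting idea can be expressed without a row-wise apply; the sketch below (assuming the setup above, and Series.explode, available since pandas 0.25) explodes the comma-separated words, maps each word to its B value, sorts by B, and keeps the first word per original row:

words = df_1['A'].str.split(', ').explode().rename('word').to_frame()
words['B'] = words['word'].map(df_2.set_index('A')['B'])
# drop words with no B value (e.g. 'iphone'), sort by B, keep the smallest-B word per row
df_1['B_new'] = (words.dropna(subset=['B'])
                      .sort_values('B')
                      .groupby(level=0)['word']
                      .first())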

1 Comment
  • So neither of your data frames are indexed? Commented Dec 8, 2020 at 10:32

3 Answers


You can create a dictionary and match the values that exist, then take the key with the minimal value, otherwise return a missing value:

import numpy as np

d = df_2.set_index('A')['B'].to_dict()

def f(x):
    # keep only the words that have an entry in df_2
    d1 = {y: d[y] for y in x.split(', ') if y in d}
    # word with the smallest B value, or NaN if nothing matched
    return min(d1, key=d1.get) if d1 else np.nan
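
For example, calling it on a single row's string (the word 'iphone' has no entry in df_2 and is simply ignored):

f('apple, iphone, android')   # -> 'apple'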

Or:

import operator

def f(x):
    d1 = {y: d[y] for y in x.split(', ') if y in d}
    # pick the (word, value) pair with the smallest value and return the word
    return min(d1.items(), key=operator.itemgetter(1))[0] if d1 else np.nan

df_1['new'] = df_1['A'].apply(f)
print (df_1)
                          A        new
0    apple, iphone, android      apple
1           facebook, apple      apple
2           macbook, laptop    macbook
3  firestick, hulu, netflix  firestick
4           android, laptop    android
5                    laptop     laptop

2 Comments

Thank you. I ran timeit: the first function runs in 138 ms ± 2.74 ms per loop, while the other runs in 156 ms ± 2.29 ms per loop. Any recommendations on when I should use the operator module?
@JonasPalačionis - Hard question, I use this solution.

@jezrael's solution is cleaner, and should be faster, since we are dealing with strings; the solution below is an alternative:

Iterate to get the individual entries:

# split each entry on commas and strip the whitespace around every word
values = [[value.strip() for value in entry.split(",")]
          for entry in df1.A]
values

[['apple', 'iphone', 'android'],
 ['facebook', 'apple'],
 ['macbook', 'laptop'],
 ['firestick', 'hulu', 'netflix'],
 ['android', 'laptop'],
 ['laptop']]

For each list, get the index of the row in df2 with the minimum B value among the matches:

# for each list of words, index of the df2 row with the smallest B among the matches
values = [df2.loc[df2.A.isin(value), "B"].idxmin()
          for value in values]
values
[0, 0, 1, 3, 6, 7]

Select the values and assign to the new column:

# .to_numpy() makes the assignment positional, avoiding alignment on df2's duplicated index labels
df1.loc[:, 'B_new'] = df2.iloc[values, 0].to_numpy()


                          A      B_new
0    apple, iphone, android      apple
1           facebook, apple      apple
2           macbook, laptop    macbook
3  firestick, hulu, netflix  firestick
4           android, laptop    android
5                    laptop     laptop

3 Comments

df1['B_new'] = df2.iloc[values, 0] ?
Yes, it takes a while but does the job, thanks for another perspective on this problem.
Yeah, more often than not, string functions are faster in plain Python than in Pandas.

Not the fastest either, but I like this other way of thinking about this.

You could use the str.get_dummies method, perform column-wise multiplications, and take the idxmin:

import numpy as np

ab_dict = df_2.set_index("A")["B"].to_dict()
df_1["B_new"] = (df_1["A"].str.get_dummies(", ")
                 .apply(lambda c: c.replace(0, np.nan) * ab_dict.get(c.name, np.nan))
                 .idxmin(axis=1))
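
Step by step, this is the same logic as the chained version above, broken out for clarity:

dummies = df_1["A"].str.get_dummies(", ")
# one 0/1 indicator column per distinct word ('android', 'apple', ..., 'netflix')

weighted = dummies.apply(lambda c: c.replace(0, np.nan) * ab_dict.get(c.name, np.nan))
# every 1 becomes that word's B value; zeros, and words that are not in df_2
# (such as 'iphone'), become NaN

df_1["B_new"] = weighted.idxmin(axis=1)
# per row, the column name (i.e. the word) holding the smallest value; NaNs are skipped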

