
I have a DataFrame df_1 like this:

A                      

apple, iphone, android
facebook, apple
macbook, laptop
firestick, hulu, netflix
android, laptop
laptop

And df_2 like this:

A           B

apple       1
macbook     2
facebook    3
firestick   4
hulu        5
netflix     6
android     7
laptop      8
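
For reproducibility, the two frames can be built with something like this (assuming plain string and integer columns, exactly as shown above):

import pandas as pd

df_1 = pd.DataFrame({'A': ['apple, iphone, android', 'facebook, apple',
                           'macbook, laptop', 'firestick, hulu, netflix',
                           'android, laptop', 'laptop']})
df_2 = pd.DataFrame({'A': ['apple', 'macbook', 'facebook', 'firestick',
                           'hulu', 'netflix', 'android', 'laptop'],
                     'B': [1, 2, 3, 4, 5, 6, 7, 8]})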

I am trying to extract, for each row, the single word from column A of df_1 that has the lowest value in column B of df_2, like so:

A                               B_new

apple, iphone, android          apple
facebook, apple                 apple
macbook, laptop                 macbook
firestick, hulu, netflix        firestick
android, laptop                 android
laptop                          laptop

I assume I could sort each value of df_1's column A based on the values of B in df_2, or create a function that takes a single A value from df_1 and returns the string with the smallest number in B from df_2. But since the data is quite big, I assume using apply is not very efficient. Is there a neat Pandas way of doing such a task?
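For concreteness, the sorting idea can be expressed without a row-wise apply; the sketch below (assuming the setup above, and Series.explode, available since pandas 0.25) explodes the comma-separated words, maps each word to its B value, sorts by B, and keeps the first word per original row:

words = df_1['A'].str.split(', ').explode().rename('word').to_frame()
words['B'] = words['word'].map(df_2.set_index('A')['B'])
# drop words with no B value (e.g. 'iphone'), sort by B, keep the smallest-B word per row
df_1['B_new'] = (words.dropna(subset=['B'])
                      .sort_values('B')
                      .groupby(level=0)['word']
                      .first())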

1 Comment
  • So neither of your data frames are indexed? Commented Dec 8, 2020 at 10:32

3 Answers


You can create a dictionary and match the values that exist, then take the key with the minimal value, otherwise return a missing value:

import numpy as np

d = df_2.set_index('A')['B'].to_dict()

def f(x):
    # keep only the words that have an entry in df_2
    d1 = {y: d[y] for y in x.split(', ') if y in d}
    # word with the smallest B value, or NaN if nothing matched
    return min(d1, key=d1.get) if d1 else np.nan
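
For example, calling it on a single row's string (the word 'iphone' has no entry in df_2 and is simply ignored):

f('apple, iphone, android')   # -> 'apple'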

Or:

import operator

def f(x):
    d1 = {y: d[y] for y in x.split(', ') if y in d}
    # pick the (word, value) pair with the smallest value and return the word
    return min(d1.items(), key=operator.itemgetter(1))[0] if d1 else np.nan

df_1['new'] = df_1['A'].apply(f)
print (df_1)
                          A        new
0    apple, iphone, android      apple
1           facebook, apple      apple
2           macbook, laptop    macbook
3  firestick, hulu, netflix  firestick
4           android, laptop    android
5                    laptop     laptop

2 Comments

Thank you. I ran timeit: the first function runs in 138 ms ± 2.74 ms per loop, while the other runs in 156 ms ± 2.29 ms per loop. Any recommendations on when I should use the operator module?
@JonasPalačionis - Hard question, I use this solution.

@jezrael's solution is cleaner, and should be faster, since we are dealing with strings; the solution below is an alternative:

Iterate to get the individual entries:

# split each entry on commas and strip the whitespace around every word
values = [[value.strip() for value in entry.split(",")]
          for entry in df1.A]
values

[['apple', 'iphone', 'android'],
 ['facebook', 'apple'],
 ['macbook', 'laptop'],
 ['firestick', 'hulu', 'netflix'],
 ['android', 'laptop'],
 ['laptop']]

For each list, get the index of the row in df2 with the minimum B value among the matches:

# for each list of words, index of the df2 row with the smallest B among the matches
values = [df2.loc[df2.A.isin(value), "B"].idxmin()
          for value in values]
values
[0, 0, 1, 3, 6, 7]

Select the values and assign to the new column:

# .to_numpy() makes the assignment positional, avoiding alignment on df2's duplicated index labels
df1.loc[:, 'B_new'] = df2.iloc[values, 0].to_numpy()


                          A      B_new
0    apple, iphone, android      apple
1           facebook, apple      apple
2           macbook, laptop    macbook
3  firestick, hulu, netflix  firestick
4           android, laptop    android
5                    laptop     laptop

3 Comments

df1['B_new'] = df2.iloc[values, 0] ?
Yes, it takes a while but does the job, thanks for another perspective on this problem.
Yeah, more often than not, string functions are faster in plain Python than in Pandas.

Not the fastest either, but I like this other way of thinking about this.

You could use the str.get_dummies method, perform column-wise multiplications, and take the idxmin:

import numpy as np

ab_dict = df_2.set_index("A")["B"].to_dict()
df_1["B_new"] = (df_1["A"].str.get_dummies(", ")
                 .apply(lambda c: c.replace(0, np.nan) * ab_dict.get(c.name, np.nan))
                 .idxmin(axis=1))
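
Step by step, this is the same logic as the chained version above, broken out for clarity:

dummies = df_1["A"].str.get_dummies(", ")
# one 0/1 indicator column per distinct word ('android', 'apple', ..., 'netflix')

weighted = dummies.apply(lambda c: c.replace(0, np.nan) * ab_dict.get(c.name, np.nan))
# every 1 becomes that word's B value; zeros, and words that are not in df_2
# (such as 'iphone'), become NaN

df_1["B_new"] = weighted.idxmin(axis=1)
# per row, the column name (i.e. the word) holding the smallest value; NaNs are skipped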

