
I have the following table.

id  col1  col2  col3  col4  target
 1     A     B     A   101       1
 2     B     B     A   191       1
 3     A     B     A    81       0
 4     C     B     C    67       1
 5     B     C     C     3       0

I want to target encode every column except col4.

Expected Output:

e1    e2     e3     target
0.5   0.75   0.667    1
0.5   0.75   0.667    1
0.5   0.75   0.667    0
1.0   0.75   0.5      1
0.5   0.00   0.5      0

EDIT: For each of the columns col1, col2, col3 I want to get the target encodings.

For example, in col3, A appears 3 times and 2 of those 3 rows have a target of 1, so the encoding for A in col3 is 0.667. Similarly, C in col3 encodes to 0.5.
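As a quick sanity check of that arithmetic (a minimal sketch, assuming the table above is already loaded as a DataFrame named df):

# mean of target over the rows where col3 == 'A' (2 of 3 rows have target 1)
df.loc[df['col3'] == 'A', 'target'].mean()   # -> 0.666...

# and over the rows where col3 == 'C' (1 of 2 rows has target 1)
df.loc[df['col3'] == 'C', 'target'].mean()   # -> 0.5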

I've tried something like this one for one column:

# rename the group mean so it doesn't collide with the existing 'target' column on merge
encodings = df.groupby('col1')['target'].mean().rename('e1').reset_index()
df = df.merge(encodings, how='left', on='col1')
df.drop('col1', axis=1, inplace=True)
  • Apologies - I've updated the output. Hopefully it makes more sense. Commented Sep 2, 2022 at 13:27
  • For col3, A appears 3/5 times so it will calculate to 0.6 for e3. C appears 2/5 times, so it will calculate to 0.4 for e3. Same logic applies for col2 and col1. Commented Sep 2, 2022 at 13:30
  • Note that the calculation is fully independent from target ;) Commented Sep 2, 2022 at 13:36
  • I apologize I asked the question incorrectly and updated the example output. It is actually dependent on the target. Commented Sep 2, 2022 at 13:42
  • Instead of pandas you could use TargetEncoder from category_encoders or MeanEncoder from feature-engine. They do the math under the hood. Commented Jun 11 at 2:33
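
For reference, a minimal sketch of the library route mentioned in the last comment, assuming category_encoders is installed. Note that TargetEncoder applies smoothing by default, so its values will not exactly match the plain group means in the expected output:

import category_encoders as ce

# target-encode the three categorical columns against the binary target
encoder = ce.TargetEncoder(cols=['col1', 'col2', 'col3'])
encoded = encoder.fit_transform(df[['col1', 'col2', 'col3']], df['target'])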

4 Answers

Update after clarification:

You need to use the same approach as in your original attempt, but with map:

df.update(df[['col1', 'col2', 'col3']]
          .apply(lambda s: s.map(df['target'].groupby(s).mean()))
          )

output:

   id col1  col2      col3  col4  target
0   1  0.5  0.75  0.666667   101       1
1   2  0.5  0.75  0.666667   191       1
2   3  0.5  0.75  0.666667    81       0
3   4  1.0  0.75       0.5    67       1
4   5  0.5   0.0       0.5     3       0
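
One caveat: update overwrites values in place, so col1-col3 keep their original object dtype (which is why 0.5 and 0.666667 print with different widths above). If you want proper numeric columns, cast afterwards:

# cast the encoded columns from object to float after the in-place update
df[['col1', 'col2', 'col3']] = df[['col1', 'col2', 'col3']].astype(float)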

Older answer, prior to the OP's clarification:

IIUC, you want to map the normalized value_counts:

df[['col1', 'col2', 'col3']].apply(lambda s: s.map(s.value_counts(normalize=True)))

output:

   col1  col2  col3
0   0.4   0.8   0.6
1   0.4   0.8   0.6
2   0.4   0.8   0.6
3   0.2   0.8   0.4
4   0.4   0.2   0.4

Updating the data in place:

df.update(df[['col1', 'col2', 'col3']]
          .apply(lambda s: s.map(s.value_counts(normalize=True)))
          )

updated DataFrame:

   id col1 col2 col3  col4  target
0   1  0.4  0.8  0.6   101       1
1   2  0.4  0.8  0.6   191       1
2   3  0.4  0.8  0.6    81       0
3   4  0.2  0.8  0.4    67       1
4   5  0.4  0.2  0.4     3       0

You can try transform('mean') with a loop over the columns:

l = [df.groupby(col)['target'].transform('mean') for col in ['col1','col2','col3']]
out = pd.concat(l + [df.target],keys = ['e1','e2','e3','target'],axis=1)
out
Out[247]: 
    e1    e2        e3  target
0  0.5  0.75  0.666667       1
1  0.5  0.75  0.666667       1
2  0.5  0.75  0.666667       0
3  1.0  0.75  0.500000       1
4  0.5  0.00  0.500000       0


Use .apply: for each column, calculate the average of target grouped by that column:

df[['col1', 'col2', 'col3']].apply(lambda s: s.map(df['target'].groupby(s).mean()))
   col1  col2      col3
0   0.5  0.75  0.666667
1   0.5  0.75  0.666667
2   0.5  0.75  0.666667
3   1.0  0.75  0.500000
4   0.5  0.00  0.500000

If you also want to have a target column, you can just use .assign() at the end:

df[['col1', 'col2', 'col3']].apply(lambda s: s.map(df['target'].groupby(s).mean())).assign(target=df['target'])
   col1  col2      col3  target
0   0.5  0.75  0.666667       1
1   0.5  0.75  0.666667       1
2   0.5  0.75  0.666667       0
3   1.0  0.75  0.500000       1
4   0.5  0.00  0.500000       0

Note: .apply() and .transform() give identical results here. You can replace one with the other.
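
For completeness, the .transform() spelling of the same idea (same output as above):

# DataFrame.transform applies the mapping column by column, just like .apply here
df[['col1', 'col2', 'col3']].transform(lambda s: s.map(df['target'].groupby(s).mean()))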

This maps each column to the proportion of target == 1 taken from a normalized crosstab (df1 here is the input DataFrame):

pd.concat([df1[col].map(pd.crosstab(df1[col], df1.target, normalize='index')[1])
           for col in ['col1', 'col2', 'col3']], axis=1).join(df1.target)

   col1  col2      col3  target
0   0.5  0.75  0.666667       1
1   0.5  0.75  0.666667       1
2   0.5  0.75  0.666667       0
3   1.0  0.75  0.500000       1
4   0.5  0.00  0.500000       0
