
I have the following table.

id  col1  col2  col3  col4  target
 1     A     B     A   101       1
 2     B     B     A   191       1
 3     A     B     A    81       0
 4     C     B     C    67       1
 5     B     C     C     3       0

I want to target encode every column except col4.

Expected Output:

e1    e2     e3     target
0.5   0.75   0.667    1
0.5   0.75   0.667    1
0.5   0.75   0.667    0
1.0   0.75   0.5      1
0.5   0.00   0.5      0

EDIT: For each of the columns col1, col2, col3 I want to get the target encodings.

For example, in col3, A appears 3 times and 2 of those 3 rows have a target of 1, so the encoding for A in col3 is 0.667. Similarly, C in col3 encodes to 0.5.
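As a quick sanity check of that arithmetic (a minimal sketch, assuming the table above is already loaded as a DataFrame named df):

# mean of target over the rows where col3 == 'A' (2 of 3 rows have target 1)
df.loc[df['col3'] == 'A', 'target'].mean()   # -> 0.666...

# and over the rows where col3 == 'C' (1 of 2 rows has target 1)
df.loc[df['col3'] == 'C', 'target'].mean()   # -> 0.5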

I've tried something like this one for one column:

# rename the group mean so it doesn't collide with the existing 'target' column on merge
encodings = df.groupby('col1')['target'].mean().rename('e1').reset_index()
df = df.merge(encodings, how='left', on='col1')
df.drop('col1', axis=1, inplace=True)
  • Apologies - I've updated the output. Hopefully it makes more sense. Commented Sep 2, 2022 at 13:27
  • For col3, A appears 3/5 times so it will calculate to 0.6 for e3. C appears 2/5 times, so it will calculate to 0.4 for e3. Same logic applies for col2 and col1. Commented Sep 2, 2022 at 13:30
  • Note that the calculation is fully independent from target ;) Commented Sep 2, 2022 at 13:36
  • I apologize I asked the question incorrectly and updated the example output. It is actually dependent on the target. Commented Sep 2, 2022 at 13:42
  • Instead of pandas you could use TargetEncoder from category_encoders or MeanEncoder from feature-engine. They do the math under the hood. Commented Jun 11 at 2:33
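
For reference, a minimal sketch of the library route mentioned in the last comment, assuming category_encoders is installed. Note that TargetEncoder applies smoothing by default, so its values will not exactly match the plain group means in the expected output:

import category_encoders as ce

# target-encode the three categorical columns against the binary target
encoder = ce.TargetEncoder(cols=['col1', 'col2', 'col3'])
encoded = encoder.fit_transform(df[['col1', 'col2', 'col3']], df['target'])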

4 Answers

Update after clarification:

You need to use the same approach as in your original attempt, but with map:

df.update(df[['col1', 'col2', 'col3']]
          .apply(lambda s: s.map(df['target'].groupby(s).mean()))
          )

output:

   id col1  col2      col3  col4  target
0   1  0.5  0.75  0.666667   101       1
1   2  0.5  0.75  0.666667   191       1
2   3  0.5  0.75  0.666667    81       0
3   4  1.0  0.75       0.5    67       1
4   5  0.5   0.0       0.5     3       0
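
One caveat: update overwrites values in place, so col1-col3 keep their original object dtype (which is why 0.5 and 0.666667 print with different widths above). If you want proper numeric columns, cast afterwards:

# cast the encoded columns from object to float after the in-place update
df[['col1', 'col2', 'col3']] = df[['col1', 'col2', 'col3']].astype(float)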

Older answer, prior to the OP's clarification:

IIUC, you want to map the normalized value_counts:

df[['col1', 'col2', 'col3']].apply(lambda s: s.map(s.value_counts(normalize=True)))

output:

   col1  col2  col3
0   0.4   0.8   0.6
1   0.4   0.8   0.6
2   0.4   0.8   0.6
3   0.2   0.8   0.4
4   0.4   0.2   0.4

Updating the data in place:

df.update(df[['col1', 'col2', 'col3']]
          .apply(lambda s: s.map(s.value_counts(normalize=True)))
          )

updated DataFrame:

   id col1 col2 col3  col4  target
0   1  0.4  0.8  0.6   101       1
1   2  0.4  0.8  0.6   191       1
2   3  0.4  0.8  0.6    81       0
3   4  0.2  0.8  0.4    67       1
4   5  0.4  0.2  0.4     3       0

You can try transform('mean') with a loop over the columns:

l = [df.groupby(col)['target'].transform('mean') for col in ['col1','col2','col3']]
out = pd.concat(l + [df.target],keys = ['e1','e2','e3','target'],axis=1)
out
Out[247]: 
    e1    e2        e3  target
0  0.5  0.75  0.666667       1
1  0.5  0.75  0.666667       1
2  0.5  0.75  0.666667       0
3  1.0  0.75  0.500000       1
4  0.5  0.00  0.500000       0


Use .apply: for each column, calculate the average of target grouped by that column:

df[['col1', 'col2', 'col3']].apply(lambda s: s.map(df['target'].groupby(s).mean()))
   col1  col2      col3
0   0.5  0.75  0.666667
1   0.5  0.75  0.666667
2   0.5  0.75  0.666667
3   1.0  0.75  0.500000
4   0.5  0.00  0.500000

If you also want to have a target column, you can just use .assign() at the end:

df[['col1', 'col2', 'col3']].apply(lambda s: s.map(df['target'].groupby(s).mean())).assign(target=df['target'])
   col1  col2      col3  target
0   0.5  0.75  0.666667       1
1   0.5  0.75  0.666667       1
2   0.5  0.75  0.666667       0
3   1.0  0.75  0.500000       1
4   0.5  0.00  0.500000       0

Note: .apply() and .transform() give identical results here. You can replace one with the other.
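
For completeness, the .transform() spelling of the same idea (same output as above):

# DataFrame.transform applies the mapping column by column, just like .apply here
df[['col1', 'col2', 'col3']].transform(lambda s: s.map(df['target'].groupby(s).mean()))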

This maps each column to the proportion of target == 1 taken from a normalized crosstab (df1 here is the input DataFrame):

pd.concat([df1[col].map(pd.crosstab(df1[col], df1.target, normalize='index')[1])
           for col in ['col1', 'col2', 'col3']], axis=1).join(df1.target)

   col1  col2      col3  target
0   0.5  0.75  0.666667       1
1   0.5  0.75  0.666667       1
2   0.5  0.75  0.666667       0
3   1.0  0.75  0.500000       1
4   0.5  0.00  0.500000       0
