
I'm writing a script that uses the Local Outlier Factor algorithm for novelty detection. In this case, we need to fit a clean/training dataframe before making predictions. For the algorithm to work, we need to encode the values in the dataframe, for example 'vrrp' to '0' and 'udp' to '2', and so on. For this purpose I use sklearn's LabelEncoder(), which lets me pass the encoded dataframe into the algorithm.

encoder = LabelEncoder()
dataEnc = dataEnc.apply(encoder.fit_transform)

...

dataframeEnc = dataframeEnc.apply(encoder.fit_transform)

Where 'dataEnc' is the training dataset and 'dataframeEnc' is the dataset for making the predictions.

The problem arises when I try to make predictions with a new dataframe: the encoded values of the 'training' are not the same as the encoded values of the 'predict' dataframe for the same original value.

My objective is to keep the resulting encoded values with reference to the original values when encoding a new dataframe.

When encoding the training dataframe, the value '10.67.21.254', for example, always encodes to '23'. However, when encoding a new (validation) dataframe, the same value results in a different encoded value, in my case '1'.
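The behaviour can be reproduced in isolation (a minimal sketch with made-up protocol strings): fit_transform refits the encoder on whatever it is given, so the codes depend only on that dataset's sorted unique values:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
# Codes come from the sorted unique values of whatever is passed in:
print(enc.fit_transform(['udp', 'tcp', 'vrrp']))  # [1 0 2]
# Refitting on the validation data assigns codes from scratch:
print(enc.fit_transform(['vrrp', 'udp']))         # [1 0]
```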

As an example of what I'm expecting, this row:

10.67.21.254       234.1.2.88      0      0     udp  3.472 KB       62

Which encodes to this:

23     153      0      0         4  1254       61          0

For the same original values, I expect the same encoded values; however, what I get after encoding the row again is:

1       1      0      0         1     2        2          0

I believe it is assigning new codes to each value based only on the other values present in the same dataset.

My question, then, is: how can I make sure that, when encoding the values of the new (predict) dataset, I get the same encoded values as in the previous (training) dataset?

  • You ought to use only .transform() on the test data, not .fit_transform(). Commented Nov 7, 2019 at 20:00
  • @KRKirov When using dataframeEnc = dataframeEnc.apply(encoder.transform) on the test data, it might contain a new, unseen value, which then results in ValueError: ("y contains previously unseen labels: '11.31.77.119'",) :( Commented Nov 8, 2019 at 10:28
  • Yes, this is the usual pain. Introducing an 'unk' category in your training data and setting any previously unseen values to 'unk' can help with this. Alternatively, you can use scikit-learn's OneHotEncoder with handle_unknown='ignore' to handle this automatically. Posting ten rows of training data and a couple of rows of validation data would also help. Commented Nov 8, 2019 at 11:49
  • @KRKirov That would avoid the error, yes. However, it would defeat the purpose of my objective: detecting novelties. What I want to happen is: there are 200 different IPs, each encoded from 0 to 199. When a new IP is seen, it should encode to 200 instead of 0 or None. With handle_unknown='ignore', every unknown value becomes 0, and inverting it makes it None (null), which is useless for my use case :\ TL;DR: new unknown values should be "added" to the encoding instead of all becoming 0 (ignored). Commented Nov 8, 2019 at 12:16
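For the middle ground discussed in these comments: since scikit-learn 0.24, OrdinalEncoder accepts handle_unknown='use_encoded_value' with an explicit unknown_value, which keeps the training codes stable. Note it maps every unseen value to the single sentinel rather than giving each new value its own code, so it does not fully cover the asker's novelty requirement (a sketch using the IPs from the question):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Fit on the training values only; the codes are then frozen
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train = pd.DataFrame({'ip': ['10.67.21.254', '234.1.2.88']})
enc.fit(train)

# A known value keeps its training code; an unseen one maps to -1
test = pd.DataFrame({'ip': ['10.67.21.254', '11.31.77.119']})
print(enc.transform(test))  # [[ 0.] [-1.]]
```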

1 Answer

The custom transformer below should help. To transform a whole data frame, you would create a loop and keep a dictionary of encoders, one per column.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin


class TTLabelEncoder(BaseEstimator, TransformerMixin):
    """Transform data frame columns with different categorical values
    in training and test data. TT stands for Train-Test

    Pass individual data frame columns to the class instance"""

    def __init__(self):
        self.code_dict = None
        self.max_code = None
        self.fitted = False

    def fit(self, df):
        self.code_dict = dict(zip(df.unique(),
                                  np.arange(len(df.unique()))))
        self.__max_code__()
        self.fitted = True
        return self

    def transform(self, df):
        assert self.fitted, 'Fit the data before transforming.'
        new_cat = set(df.unique()).difference(set(self.code_dict.keys()))
        if new_cat:
            new_codes = dict(zip(new_cat, 
                     np.arange(len(new_cat)) + self.max_code + 1))
            self.code_dict.update(new_codes)
            self.__max_code__()
        return df.map(self.code_dict)

    def __max_code__(self):
        self.max_code = max(self.code_dict.values())
        return self

    def fit_transform(self, df):
        if not self.fitted:
            self.fit(df)
        df = self.transform(df)
        return df

df_1 = pd.DataFrame({'IP': np.random.choice(list('ABCD'), size=5),
                   'Counts': np.random.randint(10, 20, size=5)})

df_2 = pd.DataFrame({'IP': np.random.choice(list('DEF'), size=5),
                     'Counts': np.random.randint(10, 20, size=5)})

df_3 = pd.DataFrame({'IP': np.random.choice(list('XYZ'), size=5),
                     'Counts': np.random.randint(10, 20, size=5)})

ip_encoder = TTLabelEncoder()
ip_encoder.fit(df_1['IP'])
ip_encoder.code_dict

df_1['IP'] = ip_encoder.transform(df_1['IP'])
df_2['IP'] = ip_encoder.transform(df_2['IP'])
df_3['IP'] = ip_encoder.fit_transform(df_3['IP'])

Output:

 df_1 #Before transformation
Out[54]: 
  IP  Counts
0  D      11
1  C      16
2  B      14
3  A      15
4  D      14

df_1 #After transformation
Out[58]: 
   IP  Counts
0   0      11
1   1      16
2   2      14
3   3      15
4   0      14

df_2 #Before transformation
Out[62]: 
  IP  Counts
0  F      15
1  D      10
2  E      19
3  F      18
4  F      14

df_2 #After transformation
Out[64]: 
   IP  Counts
0   4      15
1   0      10
2   5      19
3   4      18
4   4      14

df_3 #Before transformation
Out[66]: 
  IP  Counts
0  X      19
1  Z      18
2  X      12
3  X      13
4  Y      18

df_3
Out[68]: #After transformation
   IP  Counts
0   7      19
1   6      18
2   7      12
3   7      13
4   8      18

ip_encoder.code_dict
Out[69]: {'D': 0, 'C': 1, 'B': 2, 'A': 3, 'F': 4, 'E': 5, 'Z': 6, 'X': 7, 'Y': 8}
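The loop-plus-dictionary idea mentioned at the top can also be sketched with plain per-column code maps (column names and values here are made up for illustration):

```python
import pandas as pd

train = pd.DataFrame({'src_ip': ['a', 'b', 'a'], 'proto': ['udp', 'vrrp', 'udp']})
test = pd.DataFrame({'src_ip': ['b', 'c'], 'proto': ['udp', 'icmp']})

# One code dictionary per categorical column, built from the training data
code_dicts = {col: {v: i for i, v in enumerate(train[col].unique())}
              for col in ['src_ip', 'proto']}

def encode(df, code_dicts):
    out = df.copy()
    for col, codes in code_dicts.items():
        # Assign fresh codes to values unseen during training
        for v in out[col].unique():
            if v not in codes:
                codes[v] = max(codes.values()) + 1
        out[col] = out[col].map(codes)
    return out

train_enc = encode(train, code_dicts)
test_enc = encode(test, code_dicts)  # 'b' keeps its code; 'c' and 'icmp' get new ones
```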

3 Comments

I apologize for giving late feedback. Thank you for the answer, I will try that as soon as possible! Also, is there a limit on how many entries the dictionary can hold? I'm guessing I'll have to account for scalability
Tested it and it works as intended! With a few adjustments it serves the intended purpose in my use case! Thank you :) Hopefully someone with the same problem will find this
Great! I am very glad to hear it.
