I'm writing a script that uses the 'Local Outlier Factor' algorithm for 'novelty detection'. In this case we need to 'fit' a 'clean/training' dataframe before making predictions. For the algorithm to work, we need to encode the values in the dataframe, for example 'vrrp' to '0' and 'udp' to '2', and so on. For this purpose I use sklearn's LabelEncoder(), which enables me to pass the encoded dataframe to the algorithm.
encoder = LabelEncoder()
dataEnc = dataEnc.apply(encoder.fit_transform)
...
dataframeEnc = dataframeEnc.apply(encoder.fit_transform)
Where 'dataEnc' is the training dataset and 'dataframeEnc' is the dataset for making the predictions.
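To illustrate what I mean, here is a minimal sketch of the behavior (the protocol strings are made-up examples, not my actual data): LabelEncoder assigns codes from the sorted unique values it was fit on, so refitting on a different dataset reassigns the codes.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
# Codes come from the sorted unique values seen by fit:
# ['tcp', 'udp', 'vrrp'] -> tcp=0, udp=1, vrrp=2
train_codes = enc.fit_transform(["udp", "vrrp", "tcp"])  # [1, 2, 0]
# Refitting on a different dataset reassigns the codes:
# ['udp', 'vrrp'] -> udp=0, vrrp=1, so 'udp' no longer maps to 1
new_codes = enc.fit_transform(["vrrp", "udp"])  # [1, 0]
```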
The problem arises when I try to make predictions with a new dataframe: the encoded values of the 'training' are not the same as the encoded values of the 'predict' dataframe for the same original value.
My objective is to keep the resulting encoded values with reference to the original values when encoding a new dataframe.
When encoding the "training" dataframe, the value '10.67.21.254', for example, always encodes to '23'. However, when encoding a new (validation) dataframe, the same value results in a different encoded value, in my case '1'.
As an example of what I'm expecting, this row:
10.67.21.254 234.1.2.88 0 0 udp 3.472 KB 62
Which encodes to this:
23 153 0 0 4 1254 61 0
I expect the same original values to encode to the same encoded values; however, what I actually get when encoding it again is:
1 1 0 0 1 2 2 0
What I believe is happening is that the encoder assigns new codes based only on the unique values present in whichever dataset it is currently fitting.
My question then is: how can I make sure that when encoding the values of the new (predict) dataset, I get the same encoded values as in the previous (training) dataset?
I have also tried dataframeEnc = dataframeEnc.apply(encoder.transform) on the test data, but it may contain a new, unseen value, which then results in ValueError: ("y contains previously unseen labels: '11.31.77.119'",) :(
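One approach I'm considering is to fit one LabelEncoder per column on the training data only, and reuse those fitted encoders when transforming new data, mapping unseen labels to a sentinel. This is just a sketch: the column names, example values, and the -1 sentinel are assumptions for illustration, not my real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and prediction frames (column names are made up).
train = pd.DataFrame({"src_ip": ["10.67.21.254", "10.0.0.1"],
                      "proto":  ["udp", "vrrp"]})
new = pd.DataFrame({"src_ip": ["10.67.21.254", "11.31.77.119"],
                    "proto":  ["udp", "udp"]})

# Fit one encoder per column, on the training data only.
encoders = {col: LabelEncoder().fit(train[col]) for col in train.columns}
train_enc = train.apply(lambda col: encoders[col.name].transform(col))

def encode_with(enc, col):
    """Transform with a fitted encoder; map unseen labels to -1."""
    known = set(enc.classes_)
    seen = col.isin(known)
    # Temporarily substitute a known label so transform() doesn't raise,
    # then overwrite those positions with the -1 sentinel.
    safe = col.where(seen, other=enc.classes_[0])
    codes = pd.Series(enc.transform(safe), index=col.index)
    return codes.where(seen, other=-1)

# Reuse the fitted encoders (transform only, no refit) on the new data.
new_enc = new.apply(lambda col: encode_with(encoders[col.name], col))
```

With this, '10.67.21.254' gets the same code in both frames, and the unseen '11.31.77.119' becomes -1 instead of raising a ValueError.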