16

I am trying to encode a number of columns containing categorical data ("Yes" and "No") in a large pandas dataframe. The complete dataframe contains over 400 columns, so I am looking for a way to encode all desired columns without having to encode them one by one. I am using scikit-learn's LabelEncoder to encode the categorical data.

The first part of the dataframe does not have to be encoded; however, I am looking for a method to encode all the desired columns containing categorical data directly, without splitting and concatenating the dataframe.

To demonstrate my question, I first tried to solve it on a small part of the dataframe. However, I get stuck at the last part, where the data is fitted and transformed, and I get a ValueError: bad input shape (4, 3). The code as I ran it:

import pandas as pd

# Create a simple dataframe resembling the large dataframe
data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})


# Import required module
from sklearn.preprocessing import LabelEncoder

# Create an object of the label encoder class
labelencoder = LabelEncoder()

# Apply labelencoder object on columns
labelencoder.fit_transform(data.ix[:, 1:])   # First column does not need to be encoded

Complete error report:

labelencoder.fit_transform(data.ix[:, 1:])
Traceback (most recent call last):

  File "<ipython-input-47-b4986a719976>", line 1, in <module>
    labelencoder.fit_transform(data.ix[:, 1:])

  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 129, in fit_transform
    y = column_or_1d(y, warn=True)

  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))

ValueError: bad input shape (4, 3)

Does anyone know how to do this?

2
  • 3
LabelEncoder only supports single columns. You need to iterate over your columns in order to encode them. Commented Jun 10, 2017 at 15:10
  • Thanks! I will look into this and write a follow-up on the post Commented Jun 11, 2017 at 15:07

7 Answers

22

As shown in the following code, you can encode multiple columns by applying LabelEncoder across the DataFrame. However, please note that you cannot obtain the classes_ information for all columns: the same encoder is refitted on every column, so it only remembers the classes of the last column it was fitted on.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})
print(df)
#    A    B    C    D
# 0  1  Yes  Yes   No
# 1  2   No   No  Yes
# 2  3  Yes   No   No
# 3  4  Yes  Yes  Yes

# LabelEncoder
le = LabelEncoder()

# apply "le.fit_transform"
df_encoded = df.apply(le.fit_transform)
print(df_encoded)
#    A  B  C  D
# 0  0  1  1  0
# 1  1  0  0  1
# 2  2  1  0  0
# 3  3  1  1  1

# Note: le only keeps the classes_ of the last column it was fitted on,
# so we cannot obtain the classes information for all columns.
print(le.classes_)
# ['No' 'Yes']
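
If you do need the classes_ for each column, one option (a sketch, not part of the original answer) is to keep a separate encoder per column in a dictionary:

from sklearn.preprocessing import LabelEncoder

# One encoder per categorical column, so each column's classes_ stays available
encoders = {}
df_encoded = df.copy()
for col in ['B', 'C', 'D']:
    enc = LabelEncoder()
    df_encoded[col] = enc.fit_transform(df[col])
    encoders[col] = enc

print(encoders['B'].classes_)
# ['No' 'Yes']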

7 Comments

I get TypeError: '<' not supported between instances of 'str' and 'float', and during handling of that exception: TypeError: ('argument must be a string or number', 'occurred at index workclass'). Why is this considered a valid solution?
@VaidøtasIvøška I do not know your situation in detail, but please check if your data contains np.nan.
@Keiku is this the game-breaker? The np.nan? You mean if my dataset contains any nan's then this error could occur? I will try this out asap.
My problem: the traceback ends with TypeError: '<' not supported between instances of 'str' and 'float', and then: raise TypeError("argument must be a string or number"). I don't get it... My example: x = df.dropna(); catfm = x.dtypes==object; cat_cols = adult_train.columns[catfm].tolist(); df[categorical_cols] = adult_train[cat_cols].apply(lambda col: le.fit_transform(col))
EDIT2: Now I get this: Must pass DataFrame with boolean values only. So this .apply works only with booleans?
5

First, find out all the features with type object:

objList = df.select_dtypes(include="object").columns
print(objList)

Now, to convert the above objList features to numeric, you can use a for loop as given below:

#Label Encoding for object to numeric conversion
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for feat in objList:
    df[feat] = le.fit_transform(df[feat].astype(str))

print (df.info())

Note that we explicitly cast to type string in the for loop, because removing that cast throws an error when a column mixes strings with non-string values such as NaN.
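
To illustrate why the cast matters (a minimal sketch with a hypothetical column, not part of the original answer): a column containing NaN mixes strings with a float, which LabelEncoder cannot sort, whereas astype(str) turns the NaN into the literal string 'nan', which simply becomes one more encoded category:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["Yes", np.nan, "No"])   # hypothetical column with a missing value
le = LabelEncoder()

# le.fit_transform(s) would raise:
# TypeError: '<' not supported between instances of 'str' and 'float'
print(le.fit_transform(s.astype(str)))
# [1 2 0]
print(le.classes_)
# ['No' 'Yes' 'nan']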


4

If you know the names of the columns and don't want to use all of them, you can do something like this (you also get rid of a for loop):

from sklearn.preprocessing import LabelEncoder

categ = ['Pclass', 'Cabin_Group', 'Ticket', 'Embarked']

# Encode categorical columns
le = LabelEncoder()
df[categ] = df[categ].apply(le.fit_transform)


3

Scikit-learn has something for this now: OrdinalEncoder

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})

oe = OrdinalEncoder()

t_data = oe.fit_transform(data)
print(t_data)
# [[0. 1. 1. 0.]
# [1. 0. 0. 1.]
# [2. 1. 0. 0.]
# [3. 1. 1. 1.]]

Works straight out of the box.
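
If, as in the question, the first column should stay untouched, one possibility (a sketch, not part of the original answer) is to wrap OrdinalEncoder in a ColumnTransformer and pass the remaining columns through:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

ct = ColumnTransformer(
    [("cat", OrdinalEncoder(), ["B", "C", "D"])],  # encode only the categorical columns
    remainder="passthrough")                       # keep column A as-is

t_data = ct.fit_transform(data)
print(t_data)
# the encoded B, C and D columns come first, followed by the untouched A column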


1
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

# df is the pandas dataframe
class preprocessing(BaseEstimator, TransformerMixin):
    """Finds categorical columns in a dataframe and one-hot encodes them.
    You can replace LabelBinarizer with LabelEncoder if you require only
    label encoding. transform returns the encoded categorical data and the
    numerical data as a dataframe."""

    def __init__(self, df):
        self.datatypes = df.dtypes.astype(str)
        self.catcolumns = []
        self.cat_encoders = []
        self.encoded_df = []

    def fit(self, df, y=None):
        # collect the names of the object-dtype (categorical) columns
        for ix, val in zip(self.datatypes.index.values, self.datatypes.values):
            if val == 'object':
                self.catcolumns.append(ix)
        # fit one LabelBinarizer per categorical column
        for name in self.catcolumns:
            encs = LabelBinarizer()
            encs.fit(df[name])
            self.cat_encoders.append((name, encs))
        return self

    def transform(self, df, y=None):
        # encode each categorical column and collect the results
        for name, encs in self.cat_encoders:
            df_c = encs.transform(df[name])
            self.encoded_df.append(pd.DataFrame(df_c))
        self.encoded_df = pd.concat(self.encoded_df, axis=1, ignore_index=True)
        # keep the numerical columns and append the encoded ones
        self.df_num = df.drop(self.catcolumns, axis=1)
        y = pd.concat([self.df_num, self.encoded_df], axis=1, ignore_index=True)
        return y
        # use "return y.values" to use it in a scikit-learn pipeline
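
A minimal usage sketch for the class above, assuming a dataframe like the one in the question:

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})

prep = preprocessing(df)              # detect the object-dtype columns
encoded = prep.fit(df).transform(df)  # numeric column A plus binarized B, C, D
print(encoded)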

2 Comments

Please avoid giving "code-only" answers, but instead explain your changes/methods and how they solve the OP's problem.
How do I include a description? I don't seem to understand the interface.
1

You can also loop through the different columns you want to apply the encoding to. This method might not be the most efficient, but it works fine.

from sklearn import preprocessing

LE = preprocessing.LabelEncoder()
for col in df.columns:
    LE.fit(df[col])
    df[col] = LE.transform(df[col])
    test_data[col] = LE.transform(test_data[col])
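
One caveat with this loop (and a hedged workaround, not part of the original answer): transform raises a ValueError if test_data contains a label that was never seen in df. If that can happen, you can fit each column's encoder on the combined values of both frames so they share one mapping:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

for col in df.columns:
    le = LabelEncoder()
    le.fit(pd.concat([df[col], test_data[col]]))  # learn labels from both frames
    df[col] = le.transform(df[col])
    test_data[col] = le.transform(test_data[col])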


1

Here is the simplest version I could write:

Step 1: Get all categorical columns:

categorical_columns = train.select_dtypes(['object']).columns

This will store all categorical columns.

Step 2: Write a for loop to transform, as fit_transform only takes one column at a time. Here is the trick:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
for col in categorical_columns:
    train[col] = label_encoder.fit_transform(train[col])

Step 3: Vote up lol :)

Hope you find this useful.

Comments
