sklearn OneHotEncoder outputs non-array object error

Question

I am working through a Python Machine Learning Course on Udemy on the following dataset (showing the first few rows only)

   R&D Spend  Administration  Marketing Spend       State  Profit
0     165349          136898           471784    New York  192262
1     162598          151378           443899  California  191792
2     153442          101146           407935     Florida  191050
3     144372          118672           383200    New York  182902

The course was made in 2016, so some of the modules have since changed, and I have updated my code accordingly (e.g. using ColumnTransformer / make_column_transformer). The output of this code should be a float array (and it is in the Udemy tutorial). However, after the code updates, my variable x is reported as an ndarray object once the preprocessing has run. I am not sure why, because when I print x it prints an array of floats.

The original data file can be found at this link (a zip folder) in the file 50_startups.csv.

I tried adding .toarray() but this broke the code.

Thanks

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

dataset = pd.read_csv("Startups (multiple linear regression).csv")
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1]

# Encode categorical variables (New York, California, Florida)
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder
preprocess = make_column_transformer((OneHotEncoder(), [-1]), remainder="passthrough")
x = preprocess.fit_transform(x)

So the code runs without errors. The problem is with the output variable x. It should be an array that can be viewed in Spyder's array editor/viewer, but x comes out as an ndarray object and Spyder cannot open it in the array editor. Do you have any idea why? Thanks
– MrJoe, Commented Jun 4, 2019 at 12:14

Grr · Accepted Answer · 2019-06-04 12:51:15Z

In this case I think this is just a result of the mixed data types in your input, which carry over to the output. For example, if you examine x before the transformation:

x
array([[165349, 136898, 471784, 'New York'],
       [162598, 151378, 443899, 'California'],
       [153442, 101146, 407935, 'Florida'],
       [144372, 118672, 383200, 'New York']], dtype=object)

You will see that it has dtype=object. This is because the array mixes integers and strings. As a result, the passthrough columns (R&D Spend, Administration, and Marketing Spend) keep that same object dtype. Within fit_transform these columns are then stacked with the result of your OneHotEncoder transformation to produce the output, so the output dtype ends up the same as that of the input you provided.
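The dtype behaviour described above can be reproduced with NumPy and pandas alone. This is only a rough sketch: the two-row DataFrame is a cut-down stand-in for the dataset, `one_hot` is a hand-written placeholder for the encoder output, and `np.hstack` is used to mimic the stacking step rather than to reproduce what scikit-learn does internally.

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for the dataset: one numeric column, one string column.
df = pd.DataFrame({
    "R&D Spend": [165349, 162598],
    "State": ["New York", "California"],
})

# .values on mixed numeric/string columns has to fall back to dtype=object.
mixed = df.values
# Selecting only the numeric column keeps an integer dtype.
numeric_only = df[["R&D Spend"]].values

# Hand-written placeholder for a one-hot encoded State column (floats).
one_hot = np.array([[0.0, 1.0], [1.0, 0.0]])

# Stacking a float array with an object array yields an object array,
# even though every value in the result is a number.
stacked = np.hstack([one_hot, mixed[:, :1]])

print(mixed.dtype, numeric_only.dtype, stacked.dtype)
```

Every value in `stacked` is numeric, which is why printing it looks like an array of floats even though its dtype is still object.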

If you want to change the dtype you can always just use .astype(float) on the result. Note that astype returns a new array, so assign it back, e.g. x = x.astype(float).
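As a minimal sketch of that conversion, the following rebuilds the four rows shown above by hand instead of reading the CSV, and selects the State column by its positive index 3 (the same column as -1 in the question). The names `encoded` and `as_float` are only illustrative.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# The four rows shown above, typed in by hand (Profit column left out).
x = np.array(
    [
        [165349, 136898, 471784, "New York"],
        [162598, 151378, 443899, "California"],
        [153442, 101146, 407935, "Florida"],
        [144372, 118672, 383200, "New York"],
    ],
    dtype=object,
)

# One-hot encode the State column (index 3) and pass the rest through.
preprocess = make_column_transformer(
    (OneHotEncoder(), [3]),
    remainder="passthrough",
)
encoded = preprocess.fit_transform(x)
print(encoded.dtype)   # object dtype, inherited from the passthrough columns

# astype returns a new array; the original is left unchanged.
as_float = encoded.astype(float)
print(as_float.dtype)  # float64
print(as_float[0])     # one-hot columns first, then the passthrough columns
```

The conversion only works here because every value left after encoding is numeric; if a string column were still passed through, astype(float) would raise a ValueError.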
