I am working through a Python machine learning course on Udemy that uses the following dataset (only the first few rows are shown):

   R&D Spend  Administration  Marketing Spend       State  Profit
0     165349          136898           471784    New York  192262
1     162598          151378           443899  California  191792
2     153442          101146           407935     Florida  191050
3     144372          118672           383200    New York  182902

The course was made in 2016, so some of the modules have since been updated and I have changed my code accordingly (e.g. using ColumnTransformer and make_column_transformer). The output of this code should be a float array (and it is in the Udemy tutorial). However, after the code updates, my variable x is treated as an ndarray of objects once the processing has been carried out. I am not sure why, because when I print x it prints out an array of floats.

The original data file can be found at this link (a zip folder) in the file 50_startups.csv.

I tried adding .toarray() but this broke the code.

Thanks

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

dataset = pd.read_csv("Startups (multiple linear regression).csv")
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1]

# Encode categorical variables (New York, California, Florida)
preprocess = make_column_transformer((OneHotEncoder(), [-1]), remainder="passthrough")
x = preprocess.fit_transform(x)

  • Where exactly is the error, and what is it? Commented Jun 4, 2019 at 12:00
  • The code runs perfectly. The problem is with the output variable x. It should be an array that can be viewed in Spyder's array editor/viewer, but x is an ndarray of objects and Spyder cannot open it in its array editor. Do you have any idea why? Thanks. Commented Jun 4, 2019 at 12:14

1 Answer

I think this is simply a result of the mixed data types in your input. For example, if you examine x before the transformation:

x
array([[165349, 136898, 471784, 'New York'],
       [162598, 151378, 443899, 'California'],
       [153442, 101146, 407935, 'Florida'],
       [144372, 118672, 383200, 'New York']], dtype=object)

You will see that it has dtype=object. This is because the array mixes integers and strings. As a result, the passthrough columns (R&D Spend, Administration, and Marketing Spend) keep that same dtype. Within fit_transform, these columns are then stacked with the result of your OneHotEncoder transformation to produce the output. The output dtype is therefore the same as that of the input you provided.
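To illustrate the stacking step, here is a minimal NumPy-only sketch. It is not scikit-learn's actual code: the one_hot block is written by hand (assuming the categories are ordered alphabetically), and the input is built with dtype=object to mimic what dataset.iloc[:, :-1].values returns for mixed columns.

```python
import numpy as np

# Mixed integers and strings, built with dtype=object to mimic the
# array that pandas returns from .values in the question.
x = np.array(
    [[165349, 136898, 471784, "New York"],
     [162598, 151378, 443899, "California"],
     [153442, 101146, 407935, "Florida"]],
    dtype=object,
)

# Hand-written stand-in for the encoder output (assumed column order:
# California, Florida, New York). This block is a plain float array.
one_hot = np.array(
    [[0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
)

# The passthrough columns are a slice of x, so they are still dtype=object.
passthrough = x[:, :3]

# Stacking a float array with an object array promotes the result to object.
stacked = np.hstack([one_hot, passthrough])

print(one_hot.dtype)      # float64
print(passthrough.dtype)  # object
print(stacked.dtype)      # object
```

The values in stacked are all numbers, which is why printing the array looks like an array of floats even though its dtype is object.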

If you want to change the dtype, you can call .astype(float) on the transformed array.
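As a rough sketch of how that could look with the code from the question, assuming a scikit-learn version that provides make_column_transformer with (transformer, columns) tuples: the four rows are typed in by hand instead of being read from the CSV, and the State column is selected with index 3 rather than -1 purely for clarity.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Object-dtype input standing in for dataset.iloc[:, :-1].values.
x = np.array(
    [[165349, 136898, 471784, "New York"],
     [162598, 151378, 443899, "California"],
     [153442, 101146, 407935, "Florida"],
     [144372, 118672, 383200, "New York"]],
    dtype=object,
)

# One-hot encode the State column (index 3) and pass the rest through.
preprocess = make_column_transformer(
    (OneHotEncoder(), [3]),
    remainder="passthrough",
)

x_encoded = preprocess.fit_transform(x)

# Convert the result to a regular float array.
x_float = x_encoded.astype(float)

print(x_encoded.dtype)
print(x_float.dtype)
print(x_float.shape)
```

For this data fit_transform already returns a dense ndarray, not a sparse matrix, which would explain why adding .toarray() raised an error: ndarray has no such method.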
