I am currently working through the book "Hands-On Machine Learning..." by Aurélien Géron, and I am getting the error message below. It is somewhat cumbersome to reproduce because two CSV downloads are required: one from the OECD and one from the IMF.

Error message:

File "C:\Users\xxx\Miniconda3\lib\site-packages\pandas\core\frame.py", line 4548, in set_index
    raise KeyError(f"None of {missing} are in the columns")

KeyError: "None of ['Country'] are in the columns"

The code:

import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd 
import sklearn.linear_model

oecd_bli = pd.read_csv("BLI_24092020220751169.csv", thousands =',')

gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands =',', delimiter ='\t', encoding =' latin1', na_values="n/a")

def prepare_country_stats(oecd_bli, gdp_per_capita):
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    gdp_per_capita.rename(columns={"2015":"GDP per capita"}, inplace=True)
    gdp_per_capita.set_index("Country", inplace=True)
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))
    return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]

country_stats = prepare_country_stats(oecd_bli, gdp_per_capita) 
X = np.c_[country_stats["GDP per capita"]] 
y = np.c_[country_stats["Life satisfaction"]] 

# Visualize the data 
country_stats.plot( kind ='scatter', X ="GDP per capita", y ='Life satisfaction') 
plt.show() 

# Select a linear model 
model = sklearn.linear_model.LinearRegression() 

# Train the model 
model.fit(X, y) 

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus's GDP per capita

print(model.predict(X_new))

However, I already get stuck inside the function. The error seems to come from the set_index call, which I thought was a very reliable function. The Country column is definitely present in my CSV file.

Here is a screenshot of the gdp_per_capita CSV.

[screenshot: gdp_per_capita]

If anyone takes the time to reproduce, it would be highly appreciated.

3 Comments

  • Are you sure you are setting the delimiter correctly? In the screenshot, the header suggests that ; is your delimiter. Commented Sep 26, 2020 at 13:34
  • Jeez, how could I have overlooked that. I wish I could blame my German PC settings for not paying attention. Many thanks! Commented Sep 26, 2020 at 13:42
  • @รยקคгรђשค Thanks so much for clearing this up. This comment helped me out as well. Commented Oct 18, 2020 at 7:41
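
The delimiter diagnosis in these comments can be checked without the original downloads. The following is a minimal sketch on a small in-memory, semicolon-separated sample; the sample rows and the helper name load_gdp are made up for illustration and are not the asker's data. Reading the sample with a delimiter that does not match produces a single column whose name is the whole header line, which is why set_index("Country") then raises the KeyError.

```python
import io

import pandas as pd

# Hypothetical sample standing in for gdp_per_capita.csv; the values are made up.
SAMPLE = "Country;2015\nAlpha;1000.5\nBeta;2000.25\n"


def load_gdp(text, delimiter):
    """Read the sample text with the given delimiter and return a DataFrame."""
    return pd.read_csv(io.StringIO(text), delimiter=delimiter, na_values="n/a")


wrong = load_gdp(SAMPLE, "\t")  # delimiter does not match the file
right = load_gdp(SAMPLE, ";")   # delimiter matches the file

print(wrong.columns.tolist())   # a single column holding the whole header line
print(right.columns.tolist())   # separate 'Country' and '2015' columns

try:
    wrong.set_index("Country")
except KeyError as err:
    print("KeyError:", err)

indexed = right.set_index("Country")
```

Printing gdp_per_capita.columns.tolist() right after read_csv is a quick way to spot this kind of mismatch before calling set_index.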

2 Answers

A couple of changes are required in your code:

Change this line:

gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands =',', delimiter ='\t', encoding =' latin1', na_values="n/a")

to this (remove the encoding='latin1'):

gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands =',', delimiter ='\t', na_values="n/a")

And change this:

country_stats.plot(kind='scatter', X="GDP per capita", y='Life satisfaction')

To this (Capital X to x):

country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')

I was able to get a scatter plot after these 2 changes:

[image: scatter plot]
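
Python keyword arguments are case-sensitive, so DataFrame.plot only treats lowercase x and y as the column selectors. Here is a minimal sketch of the corrected call on made-up numbers; the values and the non-interactive Agg backend are assumptions chosen so the snippet runs without opening a window.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption so no display is needed

import pandas as pd

# Made-up values, only to have something to plot.
country_stats = pd.DataFrame(
    {
        "GDP per capita": [10000.0, 20000.0, 30000.0],
        "Life satisfaction": [5.0, 6.0, 7.0],
    }
)

# Lowercase x and y name the columns to put on each axis.
ax = country_stats.plot(kind="scatter", x="GDP per capita", y="Life satisfaction")

print(ax.get_xlabel(), "|", ax.get_ylabel())
```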

5 Comments

Many thanks. I did notice that the X was capitalized, but since it was capitalized throughout, I thought it wouldn't matter.
Sorry for the follow-up. I noticed that the regression is plotted in the graph above. Is this caused by your IDE? I use Spyder and it did not plot the regression.
I use Pycharm/VS Code. A separate window opens up when you run matplotlib.
Why did you accept the other answer? Just wondering.
I think I did accept your answer mate. Not sure what happened :) thanks again
  • Reading the gdp_per_capita CSV with 'latin1' encoding mangles the name of the Country column, so it is no longer read as plain Country and set_index("Country") cannot find it. Therefore, I suggest 'utf-8' encoding, which resolves this issue.

  • You had a typo in the scatterplot, which @NYC coder has already pointed out.
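
One way a header name can become unrecognizable under the wrong codec is a UTF-8 byte-order mark (BOM) at the start of the file. Whether that is what happened here is an assumption, since the answer does not show what the mangled name looked like. The sketch below uses plain Python on a made-up header line to illustrate the decoding step only; pandas may treat a BOM differently depending on version and parser.

```python
# Made-up header line, encoded the way a UTF-8 file with a BOM would start.
raw_header = "Country\t2015".encode("utf-8-sig")

as_latin1 = raw_header.decode("latin1")       # BOM bytes become three visible characters
as_utf8 = raw_header.decode("utf-8")          # BOM survives as an invisible U+FEFF
as_utf8_sig = raw_header.decode("utf-8-sig")  # BOM is stripped

for label, text in [("latin1", as_latin1), ("utf-8", as_utf8), ("utf-8-sig", as_utf8_sig)]:
    first_column = text.split("\t")[0]
    # Show the decoded name and whether it still equals the expected column name.
    print(label, repr(first_column), first_column == "Country")
```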

Try this:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

oecd_bli = pd.read_csv("BLI_26092020152902439.csv", thousands=',')

gdp_per_capita = pd.read_csv("gdp_per_capita.csv", delimiter='\t', thousands=',', encoding='utf-8', na_values="n/a")

def prepare_country_stats(oecd_bli, gdp_per_capita):
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    gdp_per_capita.rename(columns={"2015":"GDP per capita"}, inplace=True)
    print(gdp_per_capita)
    gdp_per_capita.set_index("Country", inplace=True)
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))
    return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]

country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus's GDP per capita

print(model.predict(X_new))

Output:

[image]

2 Comments

Many thanks! Weirdly, "Latin1" works for me. When I use "utf-8" I get the following: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 235638: invalid start byte
Glad it works. Please accept and upvote the answer if it helps you, so that people know the issue is resolved.
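
The UnicodeDecodeError quoted in the comment above is consistent with a file that is not UTF-8 encoded: 0xa0 is a non-breaking space in Latin-1, but on its own it is not a valid UTF-8 start byte. Here is a minimal sketch on a made-up byte string, not the asker's file:

```python
# Made-up bytes: "22", a Latin-1 non-breaking space (0xa0), then "587".
raw = b"22\xa0587"

# Latin-1 maps every byte to a character, so this decode cannot fail.
latin1_text = raw.decode("latin1")
print(repr(latin1_text))

try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as err:
    utf8_failed = True
    print(err.reason)  # the codec's own explanation of the failure
```

This also means that successfully decoding with Latin-1 does not prove the file is Latin-1; it only shows that no byte was rejected.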
