Using Python 2.7.6, matplotlib, and scikit-learn 0.17.0: when I draw polynomial regression lines on a scatter plot, the polynomial curves come out really messy, like this:
The script is below. It reads two columns of floating-point data, makes a scatter plot, and fits the regressions:
import pandas as pd
import scipy.stats as stats
import pylab
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
import sklearn
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
df=pd.read_csv("boston_real_estate_market_clean.csv")
LSTAT = df['LSTAT'].as_matrix()
LSTAT=LSTAT.reshape(LSTAT.shape[0], 1)
MEDV=df['MEDV'].as_matrix()
MEDV=MEDV.reshape(MEDV.shape[0], 1)
# Train test set split
X_train1, X_test1, y_train1, y_test1 = train_test_split(LSTAT,MEDV,test_size=0.3,random_state=1)
# Polynomial regression, nth order
plt.scatter(X_test1, y_test1, s=10, alpha=0.3)
for degree in [1,2,3,4,5]:
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(X_train1,y_train1)
    y_plot = model.predict(X_test1)
    plt.plot(X_test1, y_plot, label="degree %d" % degree
             +'; $q^2$: %.2f' % model.score(X_train1, y_train1)
             +'; $R^2$: %.2f' % model.score(X_test1, y_test1))
plt.legend(loc='upper right')
plt.show()
I guess the reason is that X_test1 and y_plot are not sorted properly?
X_test1 is a numpy array like this:
[[ 5.49]
[ 16.65]
[ 17.09]
....
[ 25.68]
[ 24.39]]
y_plot is a numpy array like this:
[[ 29.78517812]
[ 17.16759833]
[ 16.86462359]
[ 23.18680265]
...
[ 37.7631725 ]]
I tried to sort them with this:
[X_test1, y_plot] = zip(*sorted(zip(X_test1, y_plot), key=lambda y_plot: y_plot[0]))
plt.plot(X_test1, y_plot, label="degree %d" % degree
+'; $q^2$: %.2f' % model.score(X_train1, y_train1)
+'; $R^2$: %.2f' % model.score(X_test1, y_test1))
The curve looks normal now, but the result is weird: R^2 comes out negative.
Could any guru show me what the real issue is, or how to sort properly here? Thank you!


reverse=True as an argument of sorted? No idea if it will work, but worth a try.
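
A likely cause of the negative R^2: the line starting with [X_test1, y_plot] = zip(*sorted(...)) rebinds X_test1 to the sorted values, while y_test1 keeps its original order, so model.score(X_test1, y_test1) then pairs each x with the wrong y. Below is a minimal sketch of one way around this, assuming the goal is only to draw each curve from left to right: compute the sort order once with np.argsort and apply it to copies used for plotting, leaving X_test1 and y_test1 untouched for scoring. The helper name sort_for_plot and the numbers in the example are made up for illustration; they are not taken from the original data.

```python
import numpy as np


def sort_for_plot(x, y_pred):
    """Return 1-D copies of x and y_pred ordered by increasing x.

    Hypothetical helper: use the result for drawing the line only,
    and keep the original arrays for model.score().
    """
    x = np.asarray(x).ravel()
    y_pred = np.asarray(y_pred).ravel()
    order = np.argsort(x)          # indices that sort x ascending
    return x[order], y_pred[order]  # fancy indexing returns new arrays


# Made-up values with the same (n, 1) shape as X_test1 and y_plot.
X_test1 = np.array([[5.49], [16.65], [17.09], [25.68], [24.39]])
y_plot = np.array([[29.8], [17.2], [16.9], [9.1], [10.3]])

x_line, y_line = sort_for_plot(X_test1, y_plot)
# x_line is now increasing, each y_line value still belongs to its x,
# and X_test1 itself has not been reordered.
```

Inside the loop this would be used as plt.plot(x_line, y_line, label=...), while both model.score(...) calls in the label keep receiving the original X_train1, y_train1, X_test1, and y_test1. Using new names for the sorted copies also avoids carrying a reordered X_test1 into the next degree of the loop.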