Passing pandas NumPy arrays as feature vectors in scikit learn?

Question

I have a vector of 5 different values that I use as my sample value, and the label is a single integer of 0, 1, or 3. The machine learning algorithms work when I pass an array as a sample, but I get this warning. How do I pass feature vectors without getting this warning?

import numpy as np
from numpy import random

from sklearn import neighbors
from sklearn.model_selection import train_test_split
import pandas as pd

filepath = 'test.csv'

# example label values
index = [0,1,3,1,1,1,0,0]

# example sample arrays
data = []
for i in range(len(index)):
    d = []
    for i in range(6):
        d.append(random.randint(50,200))
    data.append(d)

feat1 = 'brightness'
feat2, feat3, feat4 = ['h', 's', 'v']
feat5 = 'median hue'
feat6 = 'median value'

features = [feat1, feat2, feat3, feat4, feat5, feat6]

df = pd.DataFrame(data, columns=features, index=index)
df.index.name = 'state'

with open(filepath, 'a') as f:
    df.to_csv(f, header=f.tell() == 0)

states = pd.read_csv(filepath, usecols=['state'])

df_partial = pd.read_csv(filepath, usecols=features)

states = states.astype(np.float32)
states = states.values
labels = states

samples = np.array([])
for i, row in df_partial.iterrows():
    r = row.values
    samples = np.vstack((samples, r)) if samples.size else r

n_neighbors = 5

test_size = .3
labels, test_labels, samples, test_samples = train_test_split(labels, samples, test_size=test_size)
clf1 = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
clf1 = clf1.fit(samples, labels)

score1 = clf1.score(test_samples, test_labels)

print("Here's how the models performed \nknn: %d %%" %(score1 * 100))

Warning:

"DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). clf1 = clf1.fit(samples, labels)"

sklearn documentation for fit(self, X, Y)

Honey Gourami · Accepted Answer · 2019-07-23 01:59:36Z

2

Try replacing

states = states.values with states = states.values.flatten()

OR

clf1 = clf1.fit(samples, labels) with clf1 = clf1.fit(samples, labels.flatten()).

states = states.values does hold the correct labels from your pandas DataFrame, but as a 2-D column vector of shape (n_samples, 1), with each label on its own row. .flatten() returns a 1-D array of shape (n_samples,), which is what fit() expects for y. (https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.ndarray.flatten.html)

The example in scikit-learn's KNeighborsClassifier documentation (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) passes the labels as a flat, one-dimensional sequence: y = [0, 0, 1, 1].
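A minimal sketch of the shape difference, assuming a toy feature matrix invented for illustration (only the label values and the `state` column name come from the question). It promotes DataConversionWarning to an error so that a clean fit confirms the 1-D labels no longer trigger it:

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.exceptions import DataConversionWarning
from sklearn.neighbors import KNeighborsClassifier

# Toy data, invented for illustration: 8 samples with 6 features each.
rng = np.random.default_rng(0)
X = rng.uniform(50, 200, size=(8, 6))
states = pd.DataFrame({"state": [0, 1, 3, 1, 1, 1, 0, 0]})

y_column = states.values       # 2-D column vector, shape (8, 1)
y_flat = y_column.flatten()    # 1-D array, shape (8,)
print(y_column.shape, y_flat.shape)   # (8, 1) (8,)

clf = KNeighborsClassifier(n_neighbors=3, weights="distance")

# Treat the warning as an error; fitting with 1-D labels passes cleanly.
with warnings.catch_warnings():
    warnings.simplefilter("error", DataConversionWarning)
    clf.fit(X, y_flat)
```

Passing `y_column` instead of `y_flat` inside the `with` block would be expected to raise, since that is the shape the warning complains about.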

edited Jul 23, 2019 at 1:59

answered Jul 23, 2019 at 1:32

Honey Gourami

2 Comments

Ev C Over a year ago

Also I'm really confused by the documentation. What does it mean about the x input? That it is a 2d array of n_samples x n_samples or something else? or is it literally just a list [n_samples, n_samples] like it says? I put a screenshot of the sklearn documentation in my question above ^^^

Honey Gourami Over a year ago

You are very welcome! Regarding the X input: it is a 2-D array (matrix) with one row per sample and one column per feature. Your points have 6 features (brightness, h, s, v, median hue, and median value), so n_features = 6. Your X therefore holds 28 points with 6 features each, so its shape [n_samples, n_features] is [28, 6]. Try adding print(samples) right before clf1 = clf1.fit(samples, labels) in your code. It will help you visualize it better.
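To make the [n_samples, n_features] layout concrete, here is a small sketch with made-up numbers (8 rows rather than the 28 in the asker's file). It reuses the six feature names from the question and assumes the data is already in a DataFrame; `DataFrame.to_numpy()` yields the 2-D array directly, so the row-by-row `np.vstack` loop is not strictly needed.

```python
import numpy as np
import pandas as pd

features = ["brightness", "h", "s", "v", "median hue", "median value"]

# Made-up values for illustration: 8 samples, 6 features.
rng = np.random.default_rng(0)
df_partial = pd.DataFrame(rng.integers(50, 200, size=(8, 6)), columns=features)

samples = df_partial.to_numpy()   # one row per sample, one column per feature
n_samples, n_features = samples.shape
print(samples.shape)              # (8, 6)
```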

Paul Dawson · Accepted Answer · 2019-07-23 13:13:18Z

0

When you take .values from the states DataFrame, you get a column vector of shape (n_samples, 1), with one label per row, whereas fit() expects a 1-D array of shape (n_samples,).

You can also use the ravel() function, which returns a contiguous flattened array.

numpy.ravel(a, order='C'): returns a contiguous flattened array (a 1-D array containing all the elements of the input, with the same dtype)

Try:

states = states.values.ravel() in place of states = states.values
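A short sketch of the same fix with ravel(), using an in-memory CSV as a stand-in for test.csv (the rows are invented for illustration). Selecting the single column as a Series is another option, since its values are already 1-D:

```python
import io

import numpy as np
import pandas as pd

# Stand-in for test.csv; the rows are invented for illustration.
csv_text = "state,brightness\n0,120\n1,87\n3,164\n1,99\n"

states = pd.read_csv(io.StringIO(csv_text), usecols=["state"])

labels_2d = states.values                    # shape (4, 1): the column vector
labels_1d = states.values.ravel()            # shape (4,): what fit() expects
labels_series = states["state"].to_numpy()   # also shape (4,)

print(labels_2d.shape, labels_1d.shape, labels_series.shape)   # (4, 1) (4,) (4,)
```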

edited Jul 23, 2019 at 13:13

Paul Dawson

answered Jul 23, 2019 at 12:08

SUN

3 Comments

Ev C Over a year ago

So ravel() and flatten() are the same thing essentially?

SUN Over a year ago

I did a little research on this. Although ravel() and flatten() both convert an ndarray to a 1-D array, they differ in one respect. ravel() returns a view of the original array whenever possible, so changes made through it show up in the original; it only copies when the memory layout forces it to. flatten() always returns a copy, so changes to the result do not affect the original. Because ravel() usually avoids copying the data, it is typically faster than flatten().

SUN Over a year ago

References : geeksforgeeks.org/differences-flatten-ravel-numpy stackoverflow.com/questions/28930465/…
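The view-versus-copy difference discussed in these comments can be checked directly with `np.shares_memory`. This is a small sketch on a made-up array, not the asker's data:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # C-contiguous 2-D array

r = a.ravel()     # view of a's data when the memory layout allows it
f = a.flatten()   # always a fresh copy

print(np.shares_memory(a, r))   # True
print(np.shares_memory(a, f))   # False

r[0] = 99         # writes through to a
f[1] = -1         # leaves a untouched
print(a[0, 0], a[0, 1])         # 99 1

# ravel() has to copy when the data cannot be viewed as 1-D, e.g. a transpose.
t = a.T.ravel()
print(np.shares_memory(a, t))   # False
```

For the labels in this question either call works, because the result is only read by fit() and never modified.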
