Convert NumPy arrays to Pandas Dataframe with columns

Question

I want to normalize both my categorical and my numeric values.

cols = df.columns.values.tolist()
df_num = df.drop(CAT_COLUMNS, axis=1)
df_num = df_num.as_matrix()
df_num = preprocessing.StandardScaler().fit_transform(df_num)

df.fillna('NA', inplace=True)
df_cat = df.T.to_dict().values()

vec_cat = DictVectorizer( sparse=False )
df_cat = vec_cat.fit_transform(df_cat)

After that I need to combine the 2 NumPy arrays back into a pandas DataFrame, but the approach below doesn't work for me.

mas = np.hstack((df_num, df_cat))
df = pd.DataFrame(data=mas, columns=cols)

Error Message: ValueError: Shape of passed values is (475, 243), indices imply (83, 243)
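The mismatch appears to arise because cols still holds the original 83 column names, while the stacked array has one column per numeric feature plus one per encoded category value (475 in total). A minimal sketch of building a matching list of names is shown here. The toy data and the names num_cols, dummies and new_cols are hypothetical, and pd.get_dummies stands in for DictVectorizer so that the encoded column names are easy to read off.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the asker's frame
df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "qty": [10.0, 20.0, 30.0, 40.0],
    "color": ["red", "blue", "red", "green"],
})
CAT_COLUMNS = ["color"]

num_cols = [c for c in df.columns if c not in CAT_COLUMNS]

# Standardize the numeric block (population std, ddof=0, as StandardScaler does)
num = df[num_cols].to_numpy(dtype=float)
num = (num - num.mean(axis=0)) / num.std(axis=0)

# One-hot encode the categorical block and keep the generated column names
dummies = pd.get_dummies(df[CAT_COLUMNS], dtype=float)
cat = dummies.to_numpy()

mas = np.hstack((num, cat))

# The name list must have exactly one entry per column of the stacked array
new_cols = num_cols + dummies.columns.tolist()
assert mas.shape[1] == len(new_cols)

result = pd.DataFrame(mas, columns=new_cols, index=df.index)
```

The key point is that the column list passed to pd.DataFrame is rebuilt from the encoded output rather than reused from the original frame.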

One more approach:

columns = df.columns.values.tolist()
for col in columns:
    try:
        if col in CAT_COLUMNS:
            df[col] = pd.get_dummies(df[col])
        else:
            df[col] = df[col].apply(preprocessing.StandardScaler().fit)
    except Exception, err:
        print 'Column: %s and msg=%s' % (col, err.message)

Error Message:

Column: DATE and msg=Singleton array array(1444424400.0) cannot be considered a valid collection.
Column: QTR_HR_START and msg=Singleton array array(21600000L, dtype=int64) cannot be considered a valid collection.
...
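The "Singleton array" messages are consistent with Series.apply calling the function once per element, so that the scaler's fit only ever sees a single scalar instead of a whole column. The sketch below illustrates this with a hypothetical toy Series and shows a column-wide standardization in plain pandas as one possible alternative. It is not the asker's data, and it does not use scikit-learn.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column
s = pd.Series([10.0, 20.0, 30.0, 40.0], name="QTR_HR_START")

# Series.apply hands the function one element at a time:
# every value it receives is a 0-dimensional scalar
dims = s.apply(lambda value: np.ndim(value))

# Column-wide standardization works on the whole Series at once
# (ddof=0 gives the population std, matching StandardScaler)
standardized = (s - s.mean()) / s.std(ddof=0)
```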

PS. Is there any way to avoid NumPy et al.? For example, I would like to use the pandas_ml library.

"Doesn't work" does not explain why it failed. Why doesn't it work? Does it raise an error, or does it not give the expected output?
– EdChum, Commented Dec 25, 2015 at 15:10
I added an example of how to do this in pure pandas. Although, if your goal is machine learning, it might be better to go the pure NumPy route and not convert back to pandas.
– David Maust, Commented Dec 25, 2015 at 16:56
Agreed, but I am investigating the very convenient pandas_ml library, where all calculations are based on pandas.
– SpanishBoy, Commented Dec 25, 2015 at 17:00

David Maust · Accepted Answer · 2015-12-25 17:05:34Z

2

What you are looking for is pandas.get_dummies(). It performs one-hot encoding on categorical columns and returns a DataFrame. From there you can use pandas.concat([existing_df, new_df], axis=1) to add the new columns to your existing DataFrame. This avoids the use of a NumPy array.

An example of how it could be used:

for cat_column in CAT_COLUMNS:
    dummy_df = pd.get_dummies(df[cat_column])

    # Optionally rename columns to indicate the categorical feature name
    dummy_df.columns = ["%s_%s" % (cat_column, col) for col in dummy_df.columns]
    df = pd.concat([df, dummy_df], axis=1)
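A fuller pure-pandas sketch might also drop the original categorical columns and standardize the numeric ones. The toy data and the names num_cols, scaled and result are hypothetical. The fillna("NA") step mirrors the question, and ddof=0 is chosen to match what StandardScaler computes.

```python
import pandas as pd

# Hypothetical toy data with one missing category value
df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "qty": [10.0, 20.0, 30.0, 40.0],
    "color": ["red", None, "red", "green"],
})
CAT_COLUMNS = ["color"]

num_cols = [c for c in df.columns if c not in CAT_COLUMNS]

# Standardize each numeric column: subtract the mean, divide by the population std
scaled = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std(ddof=0)

# One-hot encode; by default the new columns are prefixed with the source column name
dummies = pd.get_dummies(df[CAT_COLUMNS].fillna("NA"))

# Combine side by side; the original categorical columns are left out
result = pd.concat([scaled, dummies], axis=1)
```

Because both pieces keep the original index, concat with axis=1 lines the rows up without any conversion to NumPy.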

edited Dec 25, 2015 at 17:05

answered Dec 25, 2015 at 16:55

David Maust


1 Comment

SpanishBoy Over a year ago

Any advice on: 1. how to correctly replace the categorical columns? 2. how to properly normalize the numeric columns in this case?

SpanishBoy · Accepted Answer · 2015-12-28 18:03:14Z

0

What about the following fairly simple approach?

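import pandas as pd
from sklearn import preprocessing

def normalize_dataframe(df):
    # One-hot encode the categorical columns and standardize the rest.
    # CAT_COLUMNS is the list of categorical column names from the question.
    for col in df.columns.tolist():
        try:
            if col in CAT_COLUMNS:
                # get_dummies returns one column per category, so swap the
                # original column for the new ones instead of assigning to it
                dummies = pd.get_dummies(df[col], prefix=col)
                df = pd.concat([df.drop(col, axis=1), dummies], axis=1)
            else:
                # StandardScaler expects 2-D input, hence df[[col]]
                scaled = preprocessing.StandardScaler().fit_transform(df[[col]])
                df[col] = scaled.ravel()
        except Exception as err:
            print('Column: %s and msg=%s' % (col, err))
    return df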

answered Dec 28, 2015 at 18:03

SpanishBoy

