
I have a pandas DataFrame with mixed data types (float64 and strings). To use it in an sklearn Pipeline, I need to convert it to a NumPy array. At the end of the Pipeline, I want to turn it back into a DataFrame.

The problem is that when a NumPy array is created from mixed types, all data is converted to dtype "object". As a result, when I create a new DataFrame at the end, every column has dtype object and the numeric types are lost.

Example:

Dataframe with mixed data

>>> import pandas as pd
>>> dataframe = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "c"]})

>>> dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     3 non-null      int64 
 1   cat     3 non-null      object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes

To numpy array

>>> array = dataframe.to_numpy()
>>> array

array([[1, 'a'],
       [2, 'b'],
       [3, 'c']], dtype=object)

Back to dataframe

>>> new_df = pd.DataFrame(array, columns = ["num", "cat"])

>>> new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     3 non-null      object
 1   cat     3 non-null      object
dtypes: object(2)
memory usage: 176.0+ bytes

Now both columns have dtype object, and the integer type of "num" is lost.

Is there a way to make pandas recognize the true data types inside the numpy array?

2 Answers

If you are using pandas >= 1.0, there's convert_dtypes:

>>> new_df = pd.DataFrame(array, columns = ["num", "cat"]).convert_dtypes()
>>> new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     3 non-null      Int64 
 1   cat     3 non-null      string
dtypes: Int64(1), string(1)
memory usage: 179.0 bytes
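
Note that convert_dtypes returns pandas' nullable extension types (Int64, string) rather than the NumPy-backed dtypes of the original frame. If the original DataFrame is still available at the point where the result is rebuilt, one option is to record its dtypes before the conversion and reapply them with astype. A minimal sketch, assuming the pipeline leaves the number, order, and meaning of the columns unchanged (the name original_dtypes is only illustrative):

```python
import pandas as pd

dataframe = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "c"]})

# Remember the per-column dtypes before leaving pandas.
original_dtypes = dataframe.dtypes.to_dict()

# Mixed columns collapse into a single object-dtype array.
array = dataframe.to_numpy()

# Rebuild the frame and cast each column back to its recorded dtype.
restored = pd.DataFrame(array, columns=dataframe.columns).astype(original_dtypes)
print(restored.dtypes)
```

astype raises an error if a step changed a column so that its values no longer fit the recorded dtype, so this suits pipelines that pass columns through without changing their type.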

You can use infer_objects() as well:

new_df = pd.DataFrame(array, columns = ["num", "cat"]).infer_objects()
print(new_df,'\n\n',new_df.dtypes)

   num cat
0    1   a
1    2   b
2    3   c 

num     int64
cat    object
dtype: object
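
Keep in mind that infer_objects only performs a soft conversion: it picks a better dtype for object columns whose values are already numbers, but it does not parse text. A small sketch of the difference, using a made-up column "digits" that holds numbers stored as strings; pd.to_numeric is shown as an explicit parsing step for that case:

```python
import numpy as np
import pandas as pd

# Object array: real integers in column 0, digits stored as text in column 1.
array = np.array([[1, "10"], [2, "20"], [3, "30"]], dtype=object)

inferred = pd.DataFrame(array, columns=["num", "digits"]).infer_objects()
print(inferred.dtypes)  # "num" becomes int64, "digits" is still text

# Text has to be parsed explicitly if numbers are wanted.
parsed = pd.to_numeric(inferred["digits"])
print(parsed.dtype)
```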
