Python pandas - detect and convert numpy.ndarray columns to list columns

Question

We have the following dtypes in our pandas dataframe:

>>> results_df.dtypes
_id                              int64
playerId                         int64
leagueId                         int64
firstName                       object
lastName                        object
fullName                        object
shortName                       object
gender                          object
nickName                        object
height                         float64
jerseyNum                       object
position                        object
teamId                           int64
updated            datetime64[ns, UTC]
teamMarket                      object
conferenceId                     int64
teamName                        object
updatedDate                     object
competitionIds                  object
dtype: object

The object dtype is not helpful in the .dtypes output here, since some of these columns hold ordinary strings (e.g. firstName, lastName), whereas others hold more complex values (each element of competitionIds is a numpy.ndarray of int64s).

We'd like to convert competitionIds, and any other columns whose elements are numpy.ndarray objects, into list columns without naming the columns explicitly, since it's not always known in advance which columns hold arrays. So although results_df['competitionIds'] = results_df['competitionIds'].apply(list) works, it doesn't solve the problem: it hard-codes competitionIds, whereas we need to detect the numpy.ndarray columns automatically.

Something like all(isinstance(x, np.ndarray) for x in column_that's_object) or so?
– Mad Physicist, Commented Nov 28, 2020 at 19:59
Or, if the contents of a column are known to be consistent, just check the first element?
– Mad Physicist, Commented Nov 28, 2020 at 20:00
column_that's_object here would be a list of column names?
– Canovice, Commented Nov 28, 2020 at 20:01
It might be one of the dtype scalar instances then, not np.ndarray. What does type(...) for the first element of the column give?
– Mad Physicist, Commented Nov 28, 2020 at 20:02
The columns should be consistent but there is some missing data in these tables. competitionIds in particular has empty / missing values.
– Canovice, Commented Nov 28, 2020 at 20:02

webelo · Accepted Answer · 2020-11-28 20:09:09Z

4

Pandas stores just about anything that isn't a numeric, boolean, datetime or categorical value as an "object" (including lists and arrays!). So the best way to go about this is to look at the type of an actual element of the column:

import pandas as pd
import numpy as np

df = pd.DataFrame([{'str': 'a', 'arr': np.random.randint(0, 4, (4))} for _ in range(3)])

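# Convert each ndarray element to a list; leave all other columns unchanged.
# c.iloc[0] is the first element by position, whatever the index labels are.
df = df.apply(lambda c: c.map(list) if isinstance(c.iloc[0], np.ndarray) else c)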

This will prevent you from converting other types that you may want to keep in place (e.g. sets) as well.
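
The comments on the question mention that competitionIds contains missing values, in which case the first element of a column may not be an array at all. The following is a minimal sketch of one way to handle that, not a definitive implementation: it checks the first non-missing element of each object column and converts only the cells that really are arrays. The helper names ndarray_columns and arrays_to_lists and the example data are made up for illustration.

```python
import numpy as np
import pandas as pd


def ndarray_columns(df):
    """Return names of object columns whose first non-missing value is an ndarray."""
    found = []
    for name in df.select_dtypes(include="object").columns:
        non_null = df[name].dropna()
        if not non_null.empty and isinstance(non_null.iloc[0], np.ndarray):
            found.append(name)
    return found


def arrays_to_lists(df):
    """Return a copy of df with ndarray cells in the detected columns turned into lists."""
    out = df.copy()
    for name in ndarray_columns(out):
        # Missing values (None / NaN) are passed through unchanged.
        out[name] = out[name].map(
            lambda v: list(v) if isinstance(v, np.ndarray) else v
        )
    return out


# Hypothetical data: a string column and an array column with a missing first row.
example = pd.DataFrame({
    "firstName": ["Ann", "Bob", "Cy"],
    "competitionIds": [None, np.array([1, 2]), np.array([3])],
})
converted = arrays_to_lists(example)
print(ndarray_columns(example))
print(converted)
```

This assumes each column is otherwise consistent, as stated in the comments; a column mixing arrays with other non-missing types would be classified by whichever value comes first.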

willwrighteng · Accepted Answer · 2020-11-28 20:02:54Z

1

Here is a toy example of what I'm thinking:

import numpy as np

data = {'col1': np.nan, 'col2': np.array([])}

for col in data:
    print(isinstance(data[col],np.ndarray))

resulting in:

#False
#True
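
The same isinstance check can be carried over from a dict of single values to the columns of a DataFrame. This is only a sketch under the assumption that a column counts as an array column if any of its elements is an ndarray; the frame and the names flags and array_cols are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one string column, one array column with a missing value.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "ids": [np.nan, np.array([1, 2]), np.array([3])],
})

# For each column, test every element and record whether any is an ndarray.
flags = df.apply(lambda col: col.map(lambda v: isinstance(v, np.ndarray)).any())

# Keep only the column names whose flag is True.
array_cols = flags[flags].index.tolist()
print(array_cols)
```

Testing every element is slower than looking at a single one, but it does not depend on where the missing values happen to sit.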
