We have the following dtypes in our pandas dataframe:

>>> results_df.dtypes
_id                              int64
playerId                         int64
leagueId                         int64
firstName                       object
lastName                        object
fullName                        object
shortName                       object
gender                          object
nickName                        object
height                         float64
jerseyNum                       object
position                        object
teamId                           int64
updated            datetime64[ns, UTC]
teamMarket                      object
conferenceId                     int64
teamName                        object
updatedDate                     object
competitionIds                  object
dtype: object

The object dtype is not helpful in the .dtypes output here, since some object columns hold ordinary strings (e.g. firstName, lastName), whereas others hold more complex values (each element of competitionIds is a numpy.ndarray of int64s).

We'd like to convert competitionIds, and any other columns that hold numpy.ndarray values, into list columns without naming them explicitly, since it isn't always known in advance which columns hold arrays. This works: results_df['competitionIds'] = results_df['competitionIds'].apply(list), but it doesn't solve the problem because competitionIds is passed explicitly, whereas we need to detect the numpy.ndarray columns automatically.

  • Something like all(isinstance(x, np.ndarray) for x in column_that's_object) or so? Commented Nov 28, 2020 at 19:59
  • Or, if the contents of a column are known to be consistent, just check the first element? Commented Nov 28, 2020 at 20:00
  • column_that's_object here would be a list of column names? Commented Nov 28, 2020 at 20:01
  • It might be one of the dtype scalar instances then, not np.ndarray. what does type(...) for the first element of the column give? Commented Nov 28, 2020 at 20:02
  • The columns should be consistent but there is some missing data in these tables. competitionIds in particular has empty / missing values. Commented Nov 28, 2020 at 20:02

2 Answers

4

Pandas stores just about anything that isn't a numeric, boolean, datetime or categorical value under the "object" dtype (including strings and lists!), so the dtype alone cannot tell you what a column holds. The best way to go about this is to look at the type of an actual element of the column:

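import numpy as np
import pandas as pd

# Toy frame: one plain string column and one column holding ndarrays.
df = pd.DataFrame([{'str': 'a', 'arr': np.random.randint(0, 4, 4)} for _ in range(3)])

# Convert each element of an ndarray column to a list; leave other columns untouched.
df = df.apply(lambda c: c.apply(list) if isinstance(c.iloc[0], np.ndarray) else c)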

Checking for np.ndarray specifically also means that other container types you may want to keep in place (e.g. sets) are not converted.
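
The comments on the question note that competitionIds has missing values, so the first element of a column may be NaN or None rather than an array. A minimal sketch that decides from the first non-missing element instead, assuming each column is otherwise consistent; the helper name arrays_to_lists and the sample data are hypothetical:

```python
import numpy as np
import pandas as pd


def arrays_to_lists(df):
    """Return a copy of df in which ndarray-valued object columns hold lists."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype != object:
            continue  # only object columns can hold ndarrays
        non_null = out[col].dropna()
        # Decide from the first non-missing element (assumes a consistent column).
        if not non_null.empty and isinstance(non_null.iloc[0], np.ndarray):
            # Convert arrays element by element; missing values pass through unchanged.
            out[col] = out[col].apply(
                lambda x: x.tolist() if isinstance(x, np.ndarray) else x
            )
    return out


# Hypothetical sample data with a missing value in the array column.
df = pd.DataFrame({
    "firstName": ["Ann", "Bob", "Cy"],
    "competitionIds": [None, np.array([1, 2]), np.array([3])],
})
converted = arrays_to_lists(df)
print(converted["competitionIds"].tolist())
```

Using x.tolist() rather than list(x) also turns the NumPy integers inside each array into plain Python ints, which may or may not be what you want.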

1

Here is a toy example of what I'm thinking:

import numpy as np

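# One missing value and one (empty) array, keyed by column name.
data = {'col1': np.nan, 'col2': np.empty(0)}

for col in data:
    # True only when the stored value is a NumPy array.
    print(isinstance(data[col], np.ndarray))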

resulting in:

#False
#True
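
The same isinstance test can be applied to a DataFrame by checking every non-missing element of each column, along the lines of the all(isinstance(...)) suggestion in the comments on the question. This is only a sketch; is_ndarray_column is a hypothetical helper name and the frame is made-up data:

```python
import numpy as np
import pandas as pd


def is_ndarray_column(series):
    """True if the column has at least one non-missing value and all of them are ndarrays."""
    values = series.dropna()
    return len(values) > 0 and all(isinstance(v, np.ndarray) for v in values)


# Hypothetical frame: floats with a NaN, arrays, and strings with a None.
df = pd.DataFrame({
    "col1": [np.nan, 1.5],
    "col2": [np.empty(0), np.array([1, 2])],
    "col3": ["a", None],
})
ndarray_cols = [c for c in df.columns if is_ndarray_column(df[c])]
print(ndarray_cols)
```

Checking every element is slower than checking one, but it does not depend on the column being consistent.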
