groupby rows from several columns in list in python pandas

Question

I have a pandas DataFrame that looks like this:

    id     a    b    c       col
1   a      1    2    Null    'aa'
2   a      2    2    3       'aa'
3   b      4    3    1       'bb'
4   c      1    Null 3       'gg'
5   c      Null 2    Null    'gg'

I want to group by id and collect the non-null values of columns a, b and c into one list per group, like this:

    id     new_col           col
1   a      [1, 2, 2, 2, 3]   'aa'
2   b      [4, 3, 1]         'bb'
3   c      [1, 3, 2]         'gg'

Is it possible to do this with groupby?

Thanks

anky · Accepted Answer · 2020-02-12 11:18:58Z

3

You can use df.melt with groupby+agg:

import numpy as np
final = (df.replace('Null', np.nan).melt(['id', 'col'], value_name='new_col')
           .groupby('id', as_index=False)
           .agg({'new_col': lambda x: x.dropna().tolist(), 'col': 'first'}))

Or stack first with set_index, then groupby+agg. Because stack reads row by row, group c comes out as [1, 3, 2] here, and id becomes the index:

final1 = (df.replace('Null', np.nan).set_index(['id', 'col']).stack().rename('new_col')
            .reset_index('col').groupby(level=0).agg({'new_col': list, 'col': 'first'}))

print(final)
  id          new_col   col
0  a  [1, 2, 2, 2, 3]  'aa'
1  b        [4, 3, 1]  'bb'
2  c        [1, 2, 3]  'gg'
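
For anyone who wants to run the melt version end to end, here is a minimal, self-contained sketch. The sample frame is rebuilt from the question, and the names value_cols, clean, long_form and result are illustrative, not from the answer. It also swaps replace('Null', np.nan) for pd.to_numeric(..., errors='coerce'), so the values are numeric before they are collected. Since melt goes column by column (all of a, then b, then c), group c comes out as [1, 2, 3] rather than the [1, 3, 2] shown in the question.

```python
import pandas as pd

# Sample data rebuilt from the question; missing cells are the string 'Null'.
df = pd.DataFrame({
    "id":  ["a", "a", "b", "c", "c"],
    "a":   [1, 2, 4, 1, "Null"],
    "b":   [2, 2, 3, "Null", 2],
    "c":   ["Null", 3, 1, 3, "Null"],
    "col": ["aa", "aa", "bb", "gg", "gg"],
})

value_cols = ["a", "b", "c"]  # assumption: these are the columns to collect

# Coerce 'Null' (and anything else non-numeric) to NaN.
clean = df.assign(**{c: pd.to_numeric(df[c], errors="coerce") for c in value_cols})

# Wide -> long, drop the missing cells, and restore integer values.
long_form = (
    clean.melt(id_vars=["id", "col"], value_vars=value_cols, value_name="new_col")
         .dropna(subset=["new_col"])
         .astype({"new_col": int})
)

# One list per id; 'col' is constant within an id, so 'first' is enough.
result = long_form.groupby("id", as_index=False).agg(
    new_col=("new_col", list),
    col=("col", "first"),
)
print(result)  # group 'c' holds [1, 2, 3] (column-major order)
```

If the row-by-row order matters, a stack-based variant is the better fit, because it walks each row before moving to the next.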

edited Feb 12, 2020 at 11:18

answered Feb 12, 2020 at 11:13

jezrael · Accepted Answer · 2020-02-12 11:18:44Z

2

Use GroupBy.apply with DataFrame.stack on every column except those in the list, selected with Index.difference:

df = df.replace('Null', np.nan)

c = df.columns.difference(['id','col'])
f = lambda x: x.stack().tolist()
df = df.groupby(['id','col'])[c].apply(f).reset_index(name='new_col')[['id','new_col','col']]
print(df)
  id          new_col   col
0  a  [1, 2, 2, 2, 3]  'aa'
1  b        [4, 3, 1]  'bb'
2  c        [1, 3, 2]  'gg'
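
A self-contained sketch of the same idea, with the assumptions spelled out: the sample frame is rebuilt from the question, the names value_cols, clean and result are illustrative, and an explicit dropna() is added after stack(). Older pandas releases drop missing values in stack() by default, but as far as I know newer releases are changing that behaviour, so being explicit seems safer. 'Null' is coerced with pd.to_numeric here instead of replace.

```python
import pandas as pd

# Sample data rebuilt from the question; missing cells are the string 'Null'.
df = pd.DataFrame({
    "id":  ["a", "a", "b", "c", "c"],
    "a":   [1, 2, 4, 1, "Null"],
    "b":   [2, 2, 3, "Null", 2],
    "c":   ["Null", 3, 1, 3, "Null"],
    "col": ["aa", "aa", "bb", "gg", "gg"],
})

value_cols = ["a", "b", "c"]  # assumption: these are the columns to collect

# Coerce 'Null' (and anything else non-numeric) to NaN.
clean = df.assign(**{c: pd.to_numeric(df[c], errors="coerce") for c in value_cols})

# For each (id, col) group, stack the value columns row by row,
# drop missing cells explicitly, and collect what is left into a list.
result = (
    clean.groupby(["id", "col"])[value_cols]
         .apply(lambda block: block.stack().dropna().astype(int).tolist())
         .reset_index(name="new_col")[["id", "new_col", "col"]]
)
print(result)  # group 'c' holds [1, 3, 2] (row-major order)
```

Grouping on both id and col keeps col in the result without needing a separate 'first' aggregation, which works as long as col is constant within each id.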

edited Feb 12, 2020 at 11:18

answered Feb 12, 2020 at 11:12

Mo7art · Accepted Answer · 2020-02-12 11:56:37Z

1

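# Collect each row's non-null values into a list, then concatenate the lists per group with sum.
df["d"] = df[['a', 'b', 'c']].replace('Null', np.nan).apply(lambda r: r.dropna().tolist(), axis=1)
dup = df.groupby(['id', 'col'])['d'].sum().reset_index(name='new_col')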

answered Feb 12, 2020 at 11:56
