Given a dataframe, I want to group by the first column and get the second column as lists in rows, so that a dataframe like:
a b
A 1
A 2
B 5
B 5
B 4
C 6
becomes
A [1,2]
B [5,5,4]
C [6]
How do I do this?
You can do this using groupby to group on the column of interest and then apply list to every group:
In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
df
Out[1]:
a b
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
In [2]: df.groupby('a')['b'].apply(list)
Out[2]:
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
df1
Out[3]:
a new
0 A [1, 2]
1 B [5, 5, 4]
2 C [6]
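On pandas 0.25 or newer, named aggregation gives the same reset-index result in one call (a sketch I'm adding; new matches the column name used above):
df.groupby('a', as_index=False).agg(new=('b', list))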
A handy way to achieve this would be:
df.groupby('a').agg({'b':lambda x: list(x)})
Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py
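As a quick illustration of a custom aggregation (my example, not from the linked notebook), you can mix several functions per column:
# Combine built-in reducers with a list aggregation on the same column
df.groupby('a').agg({'b': ['sum', 'mean', list]})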
You can also do df.groupby('a').apply(list), or use list with agg as part of a dict: df.groupby('a').agg({'b': list}). You could also use agg with a lambda (which I recommend), since you can do so much more with it. For example, df.groupby('a').agg({'c': 'first', 'b': lambda x: x.unique().tolist()}) applies the Series function first to column c and a unique-then-list aggregation to column b.
If performance is important, go down to the NumPy level:
import numpy as np
df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6] * 100})

def f(df):
    # Sort by the key column so equal keys become contiguous
    keys, values = df.sort_values('a').values.T
    # Positions where each new key first appears
    ukeys, index = np.unique(keys, True)
    # Split the value column at those boundaries, one subarray per key
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2
In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop
In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
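One caveat (my addition): sort_values defaults to an unstable quicksort, so the order of values inside each list may differ from the groupby result. Passing kind='stable' preserves the original within-group row order:
def f_stable(df):
    # A stable sort keeps rows with equal keys in their original order,
    # matching df.groupby('a')['b'].apply(list)
    keys, values = df.sort_values('a', kind='stable').values.T
    ukeys, index = np.unique(keys, True)
    arrays = np.split(values, index[1:])
    return pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})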
To solve this for several columns of a dataframe:
In [5]: df = pd.DataFrame({'a': ['A','A','B','B','B','C'], 'b': [1,2,5,5,4,6], 'c': [3,3,3,4,4,4]})
In [6]: df
Out[6]:
a b c
0 A 1 3
1 A 2 3
2 B 5 3
3 B 5 4
4 B 4 4
5 C 6 4
In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]:
b c
a
A [1, 2] [3, 3]
B [5, 5, 4] [3, 4, 4]
C [6] [4]
This answer was inspired by Anamika Modi's answer. Thank you!
Use any of the following groupby and agg recipes.
# Setup
df = pd.DataFrame({
'a': ['A', 'A', 'B', 'B', 'B', 'C'],
'b': [1, 2, 5, 5, 4, 6],
'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df
a b c
0 A 1 x
1 A 2 y
2 B 5 z
3 B 5 x
4 B 4 y
5 C 6 z
To aggregate multiple columns as lists, use any of the following:
df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)
b c
a
A [1, 2] [x, y]
B [5, 5, 4] [z, x, y]
C [6] [z]
To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg:
df.groupby('a').agg({'b': list}) # 4.42 ms
df.groupby('a')['b'].agg(list) # 2.76 ms - faster
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
You may also consider groupby(..., sort=False) for a potential speedup; here, it'd make no difference, since I'm grouping on column a, which is already sorted (see the short sketch below). Commenters also suggest df.groupby('a')['b'].agg(lambda x: list(set(x))) for unique values, or df.groupby('a').agg(lambda x: x.to_numpy().ravel().tolist()).
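A minimal sketch of the sort=False variant (groups then follow order of first appearance rather than sorted key order):
df.groupby('a', sort=False)['b'].agg(list)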
It is time to use agg instead of apply. Given:
df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})
If you want multiple columns stacked into lists, the result is a pd.DataFrame:
df.groupby('a')[['b', 'c']].agg(list)
# or
df.groupby('a').agg(list)
If you want a single column as lists, the result is a pd.Series:
df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)
Note that producing a pd.DataFrame is about 10x slower than producing a pd.Series when you only aggregate a single column; use the DataFrame form only in the multi-column case.
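A quick way to see the type difference (a small illustration I've added):
out_df = df.groupby('a')[['b']].agg(list)   # returns a pd.DataFrame
out_ser = df.groupby('a')['b'].agg(list)    # returns a pd.Series
print(type(out_df).__name__, type(out_ser).__name__)  # DataFrame Series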
As you were saying, the groupby method of a pd.DataFrame object can do the job.
Example
L = ['A','A','B','B','B','C']
N = [1,2,5,5,4,6]
import pandas as pd
df = pd.DataFrame(zip(L,N),columns = list('LN'))
groups = df.groupby(df.L)
groups.groups
{'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}
which gives an index-wise description of the groups.
To get elements of single groups, you can do, for instance
groups.get_group('A')
L N
0 A 1
1 A 2
groups.get_group('B')
L N
2 B 5
3 B 5
4 B 4
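If you want the listified output directly from this GroupBy object, one possibility (my sketch, not part of the original answer) is to iterate over it, since iteration yields (key, sub-DataFrame) pairs:
{key: grp['N'].tolist() for key, grp in groups}
# {'A': [1, 2], 'B': [5, 5, 4], 'C': [6]}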
Just a supplement. pandas.pivot_table is more universal and seems more convenient:
"""data"""
df = pd.DataFrame({'a': ['A','A','B','B','B','C'],
                   'b': [1,2,5,5,4,6],
                   'c': [1,2,1,1,1,6]})
print(df)
a b c
0 A 1 1
1 A 2 2
2 B 5 1
3 B 5 1
4 B 4 1
5 C 6 6
"""pivot_table"""
pt = pd.pivot_table(df,
                    values=['b', 'c'],
                    index='a',
                    aggfunc={'b': list, 'c': set})
print(pt)
b c
a
A [1, 2] {1, 2}
B [5, 5, 4] {1}
C [6] {6}
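For comparison, the same result can be obtained with groupby plus a per-column aggregation mapping (my addition):
df.groupby('a').agg({'b': list, 'c': set})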
The easiest way I have found to achieve the same thing, at least for one column, is similar to Anamika's answer, just with the tuple syntax for the aggregate function.
df.groupby('a').agg(b=('b','unique'), c=('c','unique'))
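Note that 'unique' returns NumPy arrays rather than Python lists; if you need genuine lists, a small variation (my addition) converts them:
df.groupby('a').agg(b=('b', lambda s: s.unique().tolist()),
                    c=('c', lambda s: s.unique().tolist()))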
Building upon @B.M's answer, here is a more general version, updated to work with newer library versions (NumPy 1.19.2, pandas 1.2.1). This solution can also deal with multi-indices. However, it is not heavily tested, so use with caution.
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 10, 90), 'b': [1,2,3]*30, 'c':list('abcefghij')*10, 'd': list('hij')*30})
def f_multi(df, col_names):
    if not isinstance(col_names, list):
        col_names = [col_names]

    values = df.sort_values(col_names).values.T

    col_idcs = [df.columns.get_loc(cn) for cn in col_names]
    other_col_names = [name for idx, name in enumerate(df.columns) if idx not in col_idcs]
    other_col_idcs = [df.columns.get_loc(cn) for cn in other_col_names]

    # split df into indexing columns (= keys) and data columns (= vals)
    keys = values[col_idcs, :]
    vals = values[other_col_idcs, :]

    # list of tuples of key pairs
    multikeys = list(zip(*keys))

    # remember unique key pairs and their indices
    ukeys, index = np.unique(multikeys, return_index=True, axis=0)

    # split data columns according to those indices
    arrays = np.split(vals, index[1:], axis=1)

    # resulting list has as many subarrays as there are unique key pairs;
    # each subarray has the following shape:
    #   rows = number of non-grouped data columns
    #   cols = number of data points grouped into that unique key pair

    # prepare multi-index
    idx = pd.MultiIndex.from_arrays(ukeys.T, names=col_names)

    list_agg_vals = dict()
    for tup in zip(*arrays, other_col_names):
        col_vals = tup[:-1]  # first entries are the subarrays from above
        col_name = tup[-1]   # last entry is the data-column name
        list_agg_vals[col_name] = col_vals

    df2 = pd.DataFrame(data=list_agg_vals, index=idx)
    return df2
In [227]: %timeit f_multi(df, ['a','d'])
2.54 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %timeit df.groupby(['a','d']).agg(list)
4.56 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For random seed 0, both calls produce the same grouped frame; a usage sketch follows.
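To inspect the result yourself, a minimal usage sketch:
res = f_multi(df, ['a', 'd'])
print(res.head())  # MultiIndex (a, d); one array-valued cell per remaining column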
Sorting takes O(n log n) time, which is the most time-consuming operation in the solutions suggested above.
For a simple case (a single column), pd.Series.to_list works and can be considered more efficient, unless you are considering other frameworks.
e.g.
import pandas as pd
from string import ascii_lowercase
import random
def generate_string(case=4):
    return ''.join([random.choice(ascii_lowercase) for _ in range(case)])
df = pd.DataFrame({'num_val':[random.randint(0,100) for _ in range(20000000)],'string_val':[generate_string() for _ in range(20000000)]})
%timeit df.groupby('string_val').agg({'num_val':pd.Series.to_list})
For 20 million records, it takes about 17.2 seconds, compared to apply(list), which takes about 19.2 s, and a lambda function, which takes about 20.6 s.
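For reference, the two alternatives being compared (timed the same way):
%timeit df.groupby('string_val')['num_val'].apply(list)               # ~19.2 s
%timeit df.groupby('string_val').agg({'num_val': lambda x: list(x)})  # ~20.6 s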
Just to add to the previous answers: in my case, I wanted the list as well as other functions like min and max. The way to do that is:
df = pd.DataFrame({
    'a': ['A','A','B','B','B','C'],
    'b': [1,2,5,5,4,6]
})

df = df.groupby('a').agg({
    'b': ['min', 'max', lambda x: list(x)]
})
#then flattening and renaming if necessary
df.columns = df.columns.to_flat_index()
df.rename(columns={('b', 'min'): 'b_min', ('b', 'max'): 'b_max', ('b', '<lambda_0>'): 'b_list'},inplace=True)
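On pandas 0.25+, named aggregation avoids the flatten-and-rename step entirely (an alternative sketch, not from the original answer):
df.groupby('a').agg(
    b_min=('b', 'min'),
    b_max=('b', 'max'),
    b_list=('b', list),
)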
Here I have joined the grouped elements with "|" as a separator:
import pandas as pd
df = pd.read_csv('input.csv')
df
Out[1]:
Area Keywords
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
df.dropna(inplace=True)
df['Area'] = df['Area'].apply(lambda x: x.lower().strip())
print(df.columns)
# astype(str) so join works even when Keywords is read as a numeric column
df_op = df.groupby('Area').agg({"Keywords": lambda x: "|".join(x.astype(str))})
df_op.to_csv('output.csv')
df_op
Out[2]:
     Keywords
Area
a         1|2
b       5|5|4
c           6
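If you later need lists again, the joined strings split back easily (my addition):
df_op['Keywords'].str.split('|')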
This answer is based on @EdChum's comment on his answer. The comment is this:
"groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think"
Let's first create a dataframe with 500k categories in the first column and a total shape of 20 million rows, as mentioned in the question.
df = pd.DataFrame(columns=['a', 'b'])
df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()
# Sort data by first column
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))
# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))
# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': [np.min, np.max]})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']
# Now create final list_b column, using min and max indexes for each category of a and filtering list of b.
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x[0]:x[1]+1], axis=1)
print(gp_df.shape)
gp_df.head()
The above code takes 2 minutes for 20 million rows and 500k categories in the first column.
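Here is the same idea compacted onto a small frame so the mechanics are visible (my sketch, with the sizes shrunk):
small = pd.DataFrame({'a': ['B', 'A', 'B', 'A', 'C'], 'b': [5, 1, 4, 2, 6]})
small = small.sort_values('a', kind='stable').reset_index(drop=True)
small['temp_idx'] = range(len(small))
# After sorting, the min/max positions of each key delimit a contiguous slice
bounds = small.groupby('a').agg(lo=('temp_idx', 'min'), hi=('temp_idx', 'max'))
all_b = small['b'].tolist()
bounds['list_b'] = [all_b[lo:hi + 1] for lo, hi in zip(bounds['lo'], bounds['hi'])]
print(bounds['list_b'])  # A -> [1, 2], B -> [5, 4], C -> [6]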