pandas: split string, and count values? [duplicate]

Question

I've got a pandas DataFrame with a column that holds a comma-separated string, e.g. 1,2,3,10:

import pandas as pd

data = [
  { 'id': 1, 'score': 9, 'topics': '11,22,30' },
  { 'id': 2, 'score': 7, 'topics': '11,18,30' },
  { 'id': 3, 'score': 6, 'topics': '1,12,30' },
  { 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)

I'd like to get a count and a mean score for each value in topics. So:

topic_id,count,mean
1,2,5
11,2,8
12,1,6

et cetera. How can I do this?

I've got as far as:

df['topic_ids'] = df.topics.str.split()

But now I guess I want to explode topic_ids out, so there's a column for each unique value in the entire set of values...?

By mean score, do you mean df.topics.str.split(',',expand=True).astype(int).mean(axis=1)?
– juanpa.arrivillaga, Commented Jan 29, 2018 at 19:31

BENY · Accepted Answer · 2018-01-29 19:32:04Z

5

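Unnest the split lists into one row per topic, then groupby and agg:

import numpy as np

# Split each comma-separated string into a list of topic ids
df.topics = df.topics.str.split(',')
# Repeat id and score once per topic so they line up with the flattened topics
n = df.topics.apply(len)
New_df = pd.DataFrame({'topics': np.concatenate(df.topics.values),
                       'id': df.id.repeat(n),
                       'score': df.score.repeat(n)})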

New_df.groupby('topics').score.agg(['count','mean'])

Out[1256]: 
        count  mean
topics             
1           2   5.0
11          2   8.0
12          1   6.0
18          2   5.5
22          1   9.0
30          4   6.5
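On pandas 0.25 or later, the unnest step can probably be written more directly with DataFrame.explode instead of np.concatenate and repeat. The following is only a sketch of that alternative, not the method used in this answer; the column name topic_id and the variable names exploded and result are chosen here for illustration.

```python
import pandas as pd

data = [
    {'id': 1, 'score': 9, 'topics': '11,22,30'},
    {'id': 2, 'score': 7, 'topics': '11,18,30'},
    {'id': 3, 'score': 6, 'topics': '1,12,30'},
    {'id': 4, 'score': 4, 'topics': '1,18,30'},
]
df = pd.DataFrame(data)

# One row per (id, topic): split the string, then explode the resulting lists.
exploded = (
    df.assign(topic_id=df['topics'].str.split(','))
      .explode('topic_id')
      .astype({'topic_id': int})  # integer ids sort numerically rather than as text
)

# Count and mean score per topic id.
result = (
    exploded.groupby('topic_id')['score']
            .agg(['count', 'mean'])
            .reset_index()
)
print(result)
```

Casting topic_id to int is optional; without it the groups are keyed by strings and sorted as text.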

4 Comments

Richard Over a year ago

Thanks! Unfortunately, with my real data I get an error on 'topics': np.concatenate(df.topics.values). The error is ValueError: all the input arrays must have same number of dimensions. I think this is because the split arrays are of variable length. How do I deal with that?

BENY Over a year ago

@Richard did you assign it back after split?

Richard Over a year ago

It was because I had some NaN values in my data - replacing those fixed the problem. Thanks!

BENY Over a year ago

@Richard aha, np.nan will cause the problem; you can replace the NaN values :-)
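As a rough illustration of the NaN issue discussed in these comments: a missing value stays a scalar NaN after str.split, so np.concatenate presumably sees a mix of lists and a zero-dimensional value. One possible workaround, sketched below on made-up data (the three rows and the names clean, flat and summary are assumptions, not from the thread), is to drop the rows without topics before splitting.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third row has no topics at all.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'score': [9, 7, 6],
    'topics': ['11,22', '11', np.nan],
})

# Drop rows without topics, then split the remaining strings into lists.
clean = df.dropna(subset=['topics']).copy()
clean['topics'] = clean['topics'].str.split(',')

# Flatten the lists and repeat each score once per topic in its row.
lengths = clean['topics'].str.len()
flat = pd.DataFrame({
    'topics': np.concatenate(clean['topics'].values),
    'score': clean['score'].repeat(lengths).values,
})

summary = flat.groupby('topics')['score'].agg(['count', 'mean'])
print(summary)
```

Dropping the rows means they contribute to no topic; whether that is acceptable depends on the data.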

MaxU - stand with Ukraine · Answer · 2018-01-29 20:03:30Z

3

In [111]: def mean1(x): return np.array(x).astype(int).mean()

In [112]: df.topics.str.split(',', expand=False).agg([mean1, len])
Out[112]:
       mean1  len
0  21.000000    3
1  19.666667    3
2  14.333333    3
3  16.333333    3

jpp · Answer · 2018-01-29 19:36:33Z

1

This is one way. Reindex & stack, then groupby & agg.

import pandas as pd

data = [
  { 'id': 1, 'score': 9, 'topics': '11,22,30' },
  { 'id': 2, 'score': 7, 'topics': '11,18,30' },
  { 'id': 3, 'score': 6, 'topics': '1,12,30' },
  { 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)
df.topics = df.topics.str.split(',')
df2 = pd.DataFrame(df.topics.tolist(), index=[df.id, df.score])\
                   .stack()\
                   .reset_index(name='topics')\
                   .drop(columns='level_2')

df2.groupby('topics').score.agg(['count', 'mean']).reset_index()

2 Comments

Scott Boston Over a year ago

All in one line.

(df.set_index(['id','score']).topics.str.split(',', expand=True)
   .stack().reset_index(name='Topic')
   .groupby('Topic').agg({'id':'size','score':'mean'}))

jpp Over a year ago

@ScottBoston That might work too. More than one way!
