3

I've got a pandas dataset with a column that's a comma-separated string, e.g. 1,2,3,10:

data = [
  { 'id': 1, 'score': 9, 'topics': '11,22,30' },
  { 'id': 2, 'score': 7, 'topics': '11,18,30' },
  { 'id': 3, 'score': 6, 'topics': '1,12,30' },
  { 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)

I'd like to get a count and a mean score for each value in topics. So:

topic_id,count,mean
1,2,5
11,2,8
12,1,6

et cetera. How can I do this?

I've got as far as:

df['topic_ids'] = df.topics.str.split()

But now I guess I want to explode topic_ids out, so there's a column for each unique value in the entire set of values...?

1
  • 2
    By mean score, do you mean df.topics.str.split(',',expand=True).astype(int).mean(axis=1)? Commented Jan 29, 2018 at 19:31

3 Answers 3

5

unnest then groupby and agg

df.topics=df.topics.str.split(',')
New_df=pd.DataFrame({'topics':np.concatenate(df.topics.values),'id':df.id.repeat(df.topics.apply(len)),'score':df.score.repeat(df.topics.apply(len))})

New_df.groupby('topics').score.agg(['count','mean'])

Out[1256]: 
        count  mean
topics             
1           2   5.0
11          2   8.0
12          1   6.0
18          2   5.5
22          1   9.0
30          4   6.5
Sign up to request clarification or add additional context in comments.

4 Comments

Thanks! Unfortunately with my real data I get an error on 'topics': np.concatenate(df.topics.values) - the error is ValueError: all the input arrays must have same number of dimensions. I think this is because the split arrays are of variable length - how to deal with?
@Richard did you assign it back after split?
It was because I had some NaN values in my data - replacing those fixed the problem. Thanks!
@Richard aha , np.nan will cause the problem , you can replace the nan :-)
3
In [111]: def mean1(x): return np.array(x).astype(int).mean()

In [112]: df.topics.str.split(',', expand=False).agg([mean1, len])
Out[112]:
       mean1  len
0  21.000000       3
1  19.666667       3
2  14.333333       3
3  16.333333       3

Comments

1

This is one way. Reindex & stack, then groupby & agg.

import pandas as pd

data = [
  { 'id': 1, 'score': 9, 'topics': '11,22,30' },
  { 'id': 2, 'score': 7, 'topics': '11,18,30' },
  { 'id': 3, 'score': 6, 'topics': '1,12,30' },
  { 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)
df.topics = df.topics.str.split(',')
df2 = pd.DataFrame(df.topics.tolist(), index=[df.id, df.score])\
                   .stack()\
                   .reset_index(name='topics')\
                   .drop('level_2', 1)

df2.groupby('topics').score.agg(['count', 'mean']).reset_index()

2 Comments

All in one line. (df.set_index(['id','score']).topics.str.split(',', expand=True) .stack().reset_index(name='Topic') .groupby('Topic').agg({'id':'size','score':'mean'}))
@ScottBoston That might work too. More than one way!

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.