5

I have a large pandas dataframe that can be represented structurally as:

      id          date    status
0     12    2015-05-01         0
1     12    2015-05-22         1
2     12    2015-05-14         1
3     12    2015-05-06         0
4     45    2015-05-03         1
5     45    2015-05-12         1
6     45    2015-05-02         0
7     51    2015-05-05         1
8     51    2015-05-01         0
9     51    2015-05-23         1
10    51    2015-05-17         1
11    51    2015-05-03         0
12    51    2015-05-05         0
13    76    2015-05-04         1
14    76    2015-05-22         1
15    76    2015-05-08         0

And can be created in Python 3.4 using:

tempDF = pd.DataFrame({ 'id': [12,12,12,12,45,45,45,51,51,51,51,51,51,76,76,76],
                        'date': ['2015-05-01','2015-05-22','2015-05-14','2015-05-06','2015-05-03','2015-05-12','2015-05-02','2015-05-05','2015-05-01','2015-05-23','2015-05-17','2015-05-03','2015-05-05','2015-05-04','2015-05-22','2015-05-08'],
                        'status': [0,1,1,0,1,1,0,1,0,1,1,0,0,1,1,0]})
tempDF['date'] = pd.to_datetime(tempDF['date'])

I would like to divide the dataframe into groups based on variable 'id', sort within groups based on 'date' and then get the last 'status' value within each group.

So far, I have:

tempGrouped = tempDF.groupby('id')
tempGrouped['status'].last()

which produces:

id
12    0
45    0
51    0
76    0

However, the status should be 1 in each case (the value associated with the latest date). I can't work out how to sort the groups by date before selecting the last value. It's likely I'm a little snow-blind after trying to work this out for a while, so I apologise in advance if the solution is obvious.

1 Answer 1

14

you can sort and group like this :

tempDF.sort(['id','date']).groupby('id')['status'].last()
Sign up to request clarification or add additional context in comments.

3 Comments

Of course! I was stuck in the mindset of groupby-ing first and then sorting within groups. Thanks for heads-up!
Why is the opposite approach (sort, then group) any different than (group, then sort)? I wonder why the operations is not commutative.
DataFrame.sort is deprecated, use: tempDF.sort_values(by=['id', 'date'])

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.