Pandas index is sorting on its own

Question

I have a df sorted by person and time. The index has no duplicates, but it is not a continuous range starting from 0. For each row, I check the time difference from the row above against a threshold that depends on the previous row's product.

   person   time_bought  product    
42 abby     2:21         fruit        
12 abby     2:55         fruit      
10 abby     10:35        other
3  barry    12:00        fruit      
...

thresh = {'fruit': pd.Timedelta('10min'), 'other': pd.Timedelta('2min')}
# map custom threshold based on previous row product
ref = df.groupby('person')['product'].shift().map(thresh)

I don't understand why m1 does not retain df's index: I get a sorted index from the lowest index value onwards.

# compare each delta to the custom threshold. 
m1 = df.loc[df.product=="fruit", 'time_bought'].groupby(df['person']).diff().gt(ref)

3 False
4 False

If I remove .gt(ref), I only see the filtered rows and original index is retained.

df.loc[df.product=="fruit", 'time_bought'].groupby(df['person']).diff()

42 NaT
12 0 days 00:34:00
...

A sorted index messes up my next line: m1.cumsum()

jezrael · Accepted Answer · 2022-07-08 11:51:54Z


Use:

df['time_bought'] = pd.to_timedelta('00:' + df['time_bought'])

The idea is not to filter rows, but to set the unmatched rows to NaT:

print (df['time_bought'].where(df['product']=="fruit", None))
42   0 days 00:02:21
12   0 days 00:02:55
10               NaT
3    0 days 00:12:00
Name: time_bought, dtype: timedelta64[ns]

So the original order is not changed:

m1 = (df['time_bought'].where(df['product']=="fruit", None)
                       .groupby(df['person']).diff().gt(ref))
print (m1)
42    False
12    False
10    False
3     False
dtype: bool
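
The masking approach above can be reproduced end to end with a minimal sketch. It assumes the frame holds only the four rows shown in the question; the name `is_fruit` is introduced here for illustration and is not part of the answer.

```python
import pandas as pd

# Sample rows reconstructed from the question (only the four rows shown there).
df = pd.DataFrame(
    {
        "person": ["abby", "abby", "abby", "barry"],
        "time_bought": ["2:21", "2:55", "10:35", "12:00"],
        "product": ["fruit", "fruit", "other", "fruit"],
    },
    index=[42, 12, 10, 3],
)
df["time_bought"] = pd.to_timedelta("00:" + df["time_bought"])

thresh = {"fruit": pd.Timedelta("10min"), "other": pd.Timedelta("2min")}
# Threshold for each row, taken from the previous row's product per person.
ref = df.groupby("person")["product"].shift().map(thresh)

# Hypothetical helper name: boolean mask of the rows to keep.
is_fruit = df["product"] == "fruit"

# where() blanks out non-fruit rows instead of dropping them, so the result
# has exactly the same index as ref and nothing needs to be realigned.
m1 = df["time_bought"].where(is_fruit).groupby(df["person"]).diff().gt(ref)

print(m1.index.tolist())
```

Because `m1` and `ref` share one index, a following `m1.cumsum()` should run over the rows in their original order.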


2 Comments

asd Over a year ago

Thanks. So was it confused because I am comparing a filtered Series to an unfiltered one? Why does the index not change if I stop at the .diff() step? If I add m1 to the dataframe as df["m1"], it also works.

jezrael Over a year ago

@asd - I guess it is because the index of df.loc[df.product=="fruit", 'time_bought'] differs from df.index.
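
The index mismatch mentioned in this comment can be seen in isolation with a small sketch. The toy Series `left` and `right` are hypothetical stand-ins for the filtered data and `ref`, and the `reindex` step is one possible way to restore the order, not something proposed in the thread.

```python
import pandas as pd

# Hypothetical toy data: `left` lacks label 10, as the filtered Series does.
left = pd.Series([5, 1, 7], index=[42, 12, 3])
right = pd.Series([2, 2, 2, 2], index=[42, 12, 10, 3])

# Flexible comparisons such as .gt() first align both operands on the union
# of their indexes; when the indexes differ, that union comes back sorted.
aligned = left.gt(right)
print(aligned.index.tolist())

# Assumption, not from the thread: reindexing against the full index
# puts the rows back in their original order.
restored = aligned.reindex(right.index)
print(restored.index.tolist())
```

Stopping at `.diff()` involves no second Series, so no alignment happens and the filtered index is kept. Assigning the result to `df["m1"]` aligns it back to `df.index`, which is presumably why that also appears to work.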

sherdim · Accepted Answer · 2022-07-07 15:48:24Z

Introductory remarks:

Relying on the previous row is an unreliable assumption. Are you relying on time order, even though the index labels run in reverse?
df.product is not the column you expect, but a method of the DataFrame object; use df['product'] instead.
I would convert the time column to datetime:

df['time_bought'] = pd.to_datetime(df['time_bought'], format='%H:%M')

In general you want a per-person analysis, so do not repeat the grouping throughout your code; operate on one group at a time.

Iterating over a GroupBy object just yields a sequence of subtables, one for each person. Do your operations on a "personal" table. To begin with, practice on a subsample...

_df = df[df['person']=='abby']

ref = _df['product'].shift().map(thresh).fillna(thresh['other'])
ref

42   0 days 00:02:00
12   0 days 00:10:00
10   0 days 00:10:00

m1 = _df.loc[_df['product']=="fruit", 'time_bought'].diff().gt(ref)
m1.sum()

1

Is it all right?

So the concise solution is:

def count_purchases(_df, product='other'):
    ref = _df['product'].shift().map(thresh).fillna(thresh['other'])
    m1 = _df.loc[_df['product']==product, 'time_bought'].diff().gt(ref)
    return m1.sum()

df.groupby('person').apply(count_purchases, 'fruit')

person
abby     1
barry    0
dtype: int64
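
The remark that a GroupBy object is a sequence of per-person subtables can be checked by looping over it directly. This is a rough sketch, assuming only the four sample rows from the question; the `counts` dictionary is a name introduced here for illustration.

```python
import pandas as pd

# Sample rows reconstructed from the question (only the four rows shown there).
df = pd.DataFrame(
    {
        "person": ["abby", "abby", "abby", "barry"],
        "time_bought": ["2:21", "2:55", "10:35", "12:00"],
        "product": ["fruit", "fruit", "other", "fruit"],
    },
    index=[42, 12, 10, 3],
)
df["time_bought"] = pd.to_datetime(df["time_bought"], format="%H:%M")

thresh = {"fruit": pd.Timedelta("10min"), "other": pd.Timedelta("2min")}


def count_purchases(_df, product="other"):
    # Threshold per row, based on the previous row's product in this subtable.
    ref = _df["product"].shift().map(thresh).fillna(thresh["other"])
    # Time gaps between consecutive purchases of `product`, compared to ref.
    m1 = _df.loc[_df["product"] == product, "time_bought"].diff().gt(ref)
    return int(m1.sum())


# Each iteration yields (group key, sub-DataFrame for that person).
counts = {person: count_purchases(sub, "fruit") for person, sub in df.groupby("person")}
print(counts)
```

On this sample the loop gives the same per-person counts as the `apply` call above.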
