0

I have a df sorted by person and time. The index is not duplicated, nor is it continuous from 0. I check the difference in time against a threshold depending on row above

   person   time_bought  product    
42 abby     2:21         fruit        
12 abby     2:55         fruit      
10 abby     10:35        other
3  barry    12:00        fruit      
...
thresh = {'fruit': pd.Timedelta('10min'), 'other': pd.Timedelta('2min')}
# map custom threshold based on previous row product
ref = df.groupby('person')['product'].shift().map(thresh)

I don't understand why m1 does not retain df's index: I get a sorted index from the lowest index value onwards.

# compare each delta to the custom threshold. 
m1 = df.loc[df.product=="fruit", 'time_bought'].groupby(df['person']).diff().gt(ref)

3 False
4 False

If I remove .gt(ref), I only see the filtered rows and original index is retained.

df.loc[df.product=="fruit", 'time_bought'].groupby(df['person']).diff()

42 NaT
12 0 Days 00:34:00
...

A sorted index messes up my next line:m1.cumsum()

2 Answers 2

1

Use:

df['time_bought'] = pd.to_timedelta('00:' + df['time_bought'])

Idea is not filter rows, but set NaT to unmatched rows:

print (df['time_bought'].where(df['product']=="fruit", None))
42   0 days 00:02:21
12   0 days 00:02:55
10               NaT
3    0 days 00:12:00
Name: time_bought, dtype: timedelta64[ns]

So original order is not changed:

m1 = (df['time_bought'].where(df['product']=="fruit", None)
                       .groupby(df['person']).diff().gt(ref))
print (m1)
42    False
12    False
10    False
3     False
dtype: bool
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks, so was it confused because I am comparing a filtered database to an unfiltered database? Why does the index not change if I end at the .diff() bit. If I add m1 to the dataframe, df["m1"].. it also works.
@asd - I guess because different index of df.loc[df.product=="fruit", 'time_bought'] vs df.index.
0

Introductory remarks:

  1. Unreliable assumption about the previous row. Are you relying on time, despite the indexes in reverse order?

  2. df.product is not column as you hope, but a method of DataFrame object.

  3. I guess converting to Datetime for a time column:

    df['time_bought'] = pd.to_datetime(df['time_bought'], format='%H:%M')

In general you want to do per-person analysis, so do not repeat grouping in your code, but operate on a single group a time.

GroupBy iterator is just a sequence of subtables for an every person. Do your operations on a "personal" table. To begin with, practice on a subsample...

_df = df[df['person']=='abby']

ref = _df['product'].shift().map(thresh).fillna(thresh['other'])
ref
42   0 days 00:02:00
12   0 days 00:10:00
10   0 days 00:10:00
m1 = _df.loc[_df['product']=="fruit", 'time_bought'].diff().gt(ref)
m1.sum()

1

Is it all right?

So the concise solution is:

def count_purchases(_df, product='other'):
    ref = _df['product'].shift().map(thresh).fillna(thresh['other'])
    m1 = _df.loc[_df['product']==product, 'time_bought'].diff().gt(ref)
    return m1.sum()

df.groupby('person').apply(count_purchases, 'fruit')
person
abby     1
barry    0
dtype: int64

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.