
I created a data frame with three columns: Country, metric_count, and channel.

It looks like this:

    Country     metric_count    channel
0   Country1    123472          c1
1   Country1    159392          c2
2   Country2    14599           c3
3   Country2    17382           c4

I indexed according to Country and channel using the command

df2 = df.set_index(["Country", "channel"])

This creates the following dataframe.

            metric_count
Country     channel     
Country1    category1   12347
            category2   159392
            category3   14599
            category4   17382

Country2    category1   1234

Here's what I want to do: keep this structure the same and sort by metric count within each country. In other words, for each country I'd like to display the top 3 channels, ordered by descending metric_count.

For instance:

Country2    top category1   12355555
            top category2   159393
            top category3   16759

I've tried sorting first, then indexing, but the resulting data frame no longer partitions based on country. Any tips would be greatly appreciated. Thanks!

2 Answers


After some taxing experimentation, I was able to get exactly what I wanted. I outline my steps below.

  1. Groupby Country

    group = df.groupby("Country")
    

    At a high level, this tells pandas to treat each country separately. Our goal is then to find the top 3 metric counts in each group and report the corresponding channels. To do this, we define a function that sorts a group by metric_count in descending order and returns only its first 3 rows, then pass that function to the groupby object's apply method. This tells pandas: "apply this function to each of our groups and return the top 3 results for each group".

  2. Sort and return top 3

    # sort_values replaces the removed DataFrame.sort method
    sort_function = lambda x: x.sort_values("metric_count", ascending=False).head(3)
    desired_df = group.apply(sort_function)
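The two steps above can also be sketched end to end without apply. This is only a minimal sketch under stated assumptions: the sample values are made up for illustration (loosely modeled on the question), and it swaps in a sort followed by `groupby(...).head(3)` rather than the apply-based approach described above.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "Country": ["Country1"] * 4 + ["Country2"] * 4,
    "channel": ["c1", "c2", "c3", "c4", "c1", "c2", "c3", "c4"],
    "metric_count": [123472, 159392, 14599, 17382, 1234, 98765, 4321, 555],
})

top3 = (
    # Order rows by country, then by metric_count from largest to smallest
    df.sort_values(["Country", "metric_count"], ascending=[True, False])
      # Keep the first three rows of each country in that sorted order
      .groupby("Country")
      .head(3)
      # Restore the (Country, channel) index layout from the question
      .set_index(["Country", "channel"])
)
print(top3)
```

Sorting once up front and then taking `head(3)` per group keeps the rows partitioned by country, which is the part that went missing when sorting before indexing.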
    

Use groupby/apply to sort each group individually, and pick off just the top three rows:

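def top_three(grp):
    # sort_values returns a new sorted frame (it does not sort in place),
    # so return its first three rows directly
    return grp.sort_values('metric_count', ascending=False).head(3)

df = df.set_index(['channel'])
result = df.groupby('Country', group_keys=False).apply(top_three)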

For example,

import numpy as np
import pandas as pd
np.random.seed(2015)
N = 100
df = pd.DataFrame({
    'Country': np.random.choice(['Country{}'.format(i) for i in range(3)], size=N),
    'channel': np.random.choice(['channel{}'.format(i) for i in range(4)], size=N),
    'metric_count': np.random.randint(100, size=N)
})

def top_three(grp):
    # sort_values returns a new sorted frame (it does not sort in place),
    # so return its first three rows directly
    return grp.sort_values('metric_count', ascending=False).head(3)

df = df.set_index(['channel'])
result = df.groupby('Country', group_keys=False).apply(top_three)
# Move Country into the index alongside channel, then put it first
result = result.set_index(['Country'], append=True)
result = result.reorder_levels(['Country', 'channel'], axis=0)
print(result)

prints, for each country, the three rows with the largest metric_count in descending order, indexed by Country and channel.
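As an alternative to groupby/apply, here is a minimal sketch using `nlargest` on the grouped `metric_count` column. The sample values are hypothetical, and this is a swapped-in technique rather than the approach shown above; it assumes that only the metric column is needed in the result.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "Country": ["Country1", "Country1", "Country1", "Country1",
                "Country2", "Country2"],
    "channel": ["c1", "c2", "c3", "c4", "c1", "c2"],
    "metric_count": [123472, 159392, 14599, 17382, 1234, 98765],
})

top3 = (
    # Make channel the index so it is carried into the result's index
    df.set_index("channel")
      # Select the metric column within each country
      .groupby("Country")["metric_count"]
      # Keep the three largest values per country, largest first
      .nlargest(3)
      # Convert the resulting Series back into a one-column DataFrame
      .to_frame()
)
print(top3)
```

A country with fewer than three rows (Country2 here) simply returns all of its rows.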

1 Comment

Thank you for the help. I didn't get the exact right answer with your approach, but it provided the necessary insight for me to make some tweaks and ultimately get the right answer.
