I am trying to sort each row of a pandas DataFrame and store the column labels, in sorted order, in a new DataFrame. My current approach works but is slow. Can anyone suggest improvements using parallelization or vectorized code? An example is below.

import pandas as pd

data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'

# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)

# drop the categorical columns
gapminder.drop(['country', 'continent'], axis=1, inplace=True)

# print the first three rows
print(gapminder.head(n=3))

   year         pop  lifeExp   gdpPercap
0  1952   8425333.0   28.801  779.445314
1  1957   9240934.0   30.332  820.853030
2  1962  10267083.0   31.997  853.100710

The result I am looking for is this:

  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp

In this case, pop is always larger than the other columns, so it always comes first.

I can get the required output with the following code, but it becomes slow when the DataFrame has many rows or columns. Can anyone suggest an improvement?

def sort_df(df):
    sorted_tags = pd.DataFrame(index = df.index, columns = ['tag_{}'.format(i) for i in range(df.shape[1])])
    for i in range(df.shape[0]):
        sorted_tags.iloc[i,:] = list( df.iloc[i, :].sort_values(ascending=False).index)
    return sorted_tags

sort_df(gapminder)

1 Answer

This is probably as fast as it gets with numpy:

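import numpy as np
import pandas as pd

def sort_df(df):
    # argsort on the negated values gives, per row, the column positions in descending order
    return pd.DataFrame(
        data=df.columns.values[np.argsort(-df.values, axis=1)],
        index=df.index,  # keep the original row labels, as the loop version does
        columns=['tag_{}'.format(i) for i in range(df.shape[1])]
    )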

print(sort_df(gapminder.head(3)))

  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp

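Explanation: np.argsort sorts along each row (axis=1) but returns the indices that would sort the array rather than the sorted values, so the same indices can be used to reorder another array. Negating the values makes the order descending. Here the indices are used to pick column names: indexing the 1-D array df.columns.values with the 2-D integer index array returns an array of labels with the same shape as the index array. Strictly speaking this is NumPy integer-array ("fancy") indexing rather than broadcasting.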
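
The indexing step described above can be seen on a small array. This is only an illustrative sketch: the names columns, values, order and labels and the toy numbers are made up for the example and are not part of the gapminder data.

```python
import numpy as np

# Hypothetical toy data: two rows, three named columns.
columns = np.array(["a", "b", "c"])
values = np.array([[10.0, 30.0, 20.0],
                   [3.0, 1.0, 2.0]])

# Column positions that sort each row in descending order.
order = np.argsort(-values, axis=1)

# Indexing the 1-D label array with the 2-D integer array returns
# an array of labels with the same shape as `order`.
labels = columns[order]

print(order)
print(labels)
```

Each row of labels lists the column names from the largest value to the smallest in that row.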

Runtime is around 3 ms for your example, versus 2.5 s with your function.
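
If you want to check the speed-up and the equivalence of the two approaches on your own machine, something like the following could be used. It is a rough sketch under stated assumptions: it uses synthetic random data instead of the gapminder file, the function names sort_df_loop and sort_df_vectorized are invented for the comparison, and the timings will differ from the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd


def sort_df_loop(df):
    # Row-by-row version, equivalent to the function in the question.
    tags = ['tag_{}'.format(i) for i in range(df.shape[1])]
    rows = [list(df.iloc[i, :].sort_values(ascending=False).index)
            for i in range(df.shape[0])]
    return pd.DataFrame(rows, index=df.index, columns=tags)


def sort_df_vectorized(df):
    # argsort-based version from this answer.
    tags = ['tag_{}'.format(i) for i in range(df.shape[1])]
    order = np.argsort(-df.values, axis=1)
    return pd.DataFrame(df.columns.values[order], index=df.index, columns=tags)


# Synthetic stand-in for the real data (assumption: all columns numeric, no ties).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 4)),
                  columns=['year', 'pop', 'lifeExp', 'gdpPercap'])

loop_result = sort_df_loop(df)
vec_result = sort_df_vectorized(df)
same = bool((loop_result.values == vec_result.values).all())

loop_time = timeit.timeit(lambda: sort_df_loop(df), number=3)
vec_time = timeit.timeit(lambda: sort_df_vectorized(df), number=3)
print('identical results:', same)
print('loop: {:.4f} s, vectorized: {:.4f} s'.format(loop_time, vec_time))
```

Note that the two versions are only guaranteed to agree when a row has no tied values, since neither sort is stable by default.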

1 Comment

Thanks a lot. argsort reduced the time considerably. I didn't know that a single array (df.columns) can be indexed repeatedly by passing a 2D index such as np.argsort(-df.values, axis=1).