I am trying to sort each row of pandas dataframe and get the index of sorted values in a new dataframe. I could do it in a slow way. Can anyone suggest improvements using parallelization or vectorized code for this. I have posted an example below.
data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
# drop categorical column
gapminder.drop(['country', 'continent'], axis=1, inplace=True)
# print the first three rows
print(gapminder.head(n=3))
year pop lifeExp gdpPercap
0 1952 8425333.0 28.801 779.445314
1 1957 9240934.0 30.332 820.853030
2 1962 10267083.0 31.997 853.100710
The result I am looking for is this
tag_0 tag_1 tag_2 tag_3
0 pop year gdpPercap lifeExp
1 pop year gdpPercap lifeExp
2 pop year gdpPercap lifeExp
In this case, since pop is always higher than gdpPercap and lifeExp, it always comes first.
I could achieve the required output by using the following code. But the computation takes longer time if the df has lot of rows/columns.
Can anyone suggest an improvement over this
def sort_df(df):
sorted_tags = pd.DataFrame(index = df.index, columns = ['tag_{}'.format(i) for i in range(df.shape[1])])
for i in range(df.shape[0]):
sorted_tags.iloc[i,:] = list( df.iloc[i, :].sort_values(ascending=False).index)
return sorted_tags
sort_df(gapminder)