I have a (very large) table using pandas.DataFrame. It contains wordcounts from texts; the index is the wordlist:

          one.txt  third.txt  two.txt
a               1          1        0
i               0          0        1
is              1          1        1
no              0          0        1
not             0          1        0
really          1          0        0
sentence        1          1        1
short           2          0        0
think           0          0        1 

I want to sort the wordlist by the total frequency of each word across all texts. I can easily create a Series that contains the frequency sum for each word (using the words as index), but how can I sort the DataFrame by this Series?
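
To make this concrete, the frequency Series could be built roughly as follows. This is a minimal sketch on a three-word stand-in for the real table; the name freq is only illustrative.

```python
import pandas as pd

# Small stand-in for the real table: words as index, one column per text.
df = pd.DataFrame(
    {"one.txt": [1, 0, 1], "third.txt": [1, 0, 1], "two.txt": [0, 1, 1]},
    index=["a", "i", "is"],
)

# Row-wise sum: total occurrences of each word across all texts.
# The result is a Series that keeps the words as its index.
freq = df.sum(axis=1)
print(freq)
```

The Series shares its index with the DataFrame, which is what makes it usable as a sort key for the rows.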

One easy way would be to add the list to the DataFrame as a column, sort on it, and then delete it. For performance reasons I would like to avoid this.
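
For reference, the temporary-column approach would look something like the sketch below. The column name "_total" is a hypothetical helper name, picked only so that it does not clash with a file name, and the data is a small stand-in.

```python
import pandas as pd

# Small stand-in table with distinct totals so the sorted order is unambiguous.
df = pd.DataFrame(
    {"one.txt": [1, 0, 2], "two.txt": [2, 1, 0]},
    index=["a", "i", "short"],
)

# Add the totals as a helper column ("_total" is a made-up name).
df["_total"] = df.sum(axis=1)

# Sort by the helper column, largest first, then drop it again.
df = df.sort_values("_total", ascending=False).drop(columns="_total")
print(df)
```

Note that sort_values returns a new DataFrame here, so this route still copies the data in addition to adding and removing a column.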

Two other ways are described here, but one duplicates the DataFrame, which is a problem because of its size, and the other creates a new index, whereas I need the information about the words further down the line.

1 Answer

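You could compute the total frequency of each word and use the Series sort_values method to find the desired order of the index (sort_values requires pandas 0.17 or later; the older Series.sort method has been removed from later pandas releases). Then use df.loc[order.index] to reorder the original DataFrame:

# Sum across columns (one total per word), then sort the totals ascending.
order = df.sum(axis=1).sort_values()
# Reindex the rows of the original DataFrame in the sorted order.
result = df.loc[order.index]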

For example,

import pandas as pd

df = pd.DataFrame({
    'one.txt': [1, 0, 1, 0, 0, 1, 1, 2, 0],
    'third.txt': [1, 0, 1, 0, 1, 0, 1, 0, 0],
    'two.txt': [0, 1, 1, 1, 0, 0, 1, 0, 1]}, 
    index=['a', 'i', 'is', 'no', 'not', 'really', 'sentence', 'short', 'think'])

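# Total count per word, sorted from most to least frequent.
order = df.sum(axis=1).sort_values(ascending=False)
print(df.loc[order.index])

yields (the order of rows with equal totals is not guaranteed)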

          one.txt  third.txt  two.txt
sentence        1          1        1
is              1          1        1
short           2          0        0
a               1          1        0
think           0          0        1
really          1          0        0
not             0          1        0
no              0          0        1
i               0          0        1
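
If a reproducible order among tied words matters, one possible variant is to compute integer positions with NumPy's stable argsort and select rows with iloc. This is a sketch, not part of the original answer; it assumes NumPy 1.15 or later (for kind="stable") and pandas 0.24 or later (for to_numpy), and the names totals and positions are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one.txt': [1, 0, 1, 0, 0, 1, 1, 2, 0],
    'third.txt': [1, 0, 1, 0, 1, 0, 1, 0, 0],
    'two.txt': [0, 1, 1, 1, 0, 0, 1, 0, 1]},
    index=['a', 'i', 'is', 'no', 'not', 'really', 'sentence', 'short', 'think'])

# Total count per word as a plain NumPy array.
totals = df.sum(axis=1).to_numpy()

# Negate so that ascending argsort puts the largest totals first;
# kind="stable" keeps tied words in their original index order.
positions = np.argsort(-totals, kind="stable")

# Select rows by position; the word index travels along with the rows.
result = df.iloc[positions]
print(result)
```

With a stable sort, words that share a total stay in the order they had in the original index, so the result does not depend on the sorting algorithm's handling of ties.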
2 Comments

This solution does not work with the current version of pandas (0.16.2). I tested it with the same data on an earlier version, so I gather some recent change in pandas broke it. It produces a KeyError.
@fotisj: Thanks for the warning. I've modified the answer to work with pandas 0.16.2.
