3

I have a matrix stored in a pandas DataFrame, built in this way:

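from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tfidf = TfidfVectorizer()
x = tfidf.fit_transform(corpus)  # corpus: an iterable of document strings
df_tfidf = pd.DataFrame(x.toarray(), columns=tfidf.get_feature_names())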

It seems like the matrix below:

enter image description here

My matrix has more columns and more rows: 7180 rows and 10390 columns. Is there a way to print each column header together with its values, but only the values greater than 0? Something like this: and: 0.511859, document: 0.46,0.68 ..

I tried it this way, but it takes a lot of time:

for col in df_tfidf.columns:
    for row in df_tfidf.index:
        if df_tfidf[col][row] > 0:
            print str(df_tfidf[col][row]) + ' ' + col.encode('utf8')

Is there a faster way to do this?

4
  • What's your expected output? Commented Sep 6, 2020 at 16:17
  • I want to iterate over the matrix to obtain just each word and its tfidf value, but it is too big. Commented Sep 6, 2020 at 17:02
  • By tfidf values you mean the positive values in each column, right? And do you need to store these values, maybe in some dictionary, or do you just want to print them? Commented Sep 6, 2020 at 17:04
  • Yes, they are all positive values. I want to save them in a sort of dictionary, word: value. I'm trying to work from the answer given by chris. Commented Sep 6, 2020 at 17:07

2 Answers

1

You can use boolean masking on a NumPy array to filter the positive values inside a dict comprehension:

r = {c: s[s > 0] for c, s in zip(df, df.T.to_numpy())}

EDIT: DataFrame.to_numpy() is available in pandas version >= 0.24. If you are using a pandas version below 0.24, use:

r = {c: s[s > 0] for c, s in zip(df, df.T.values)}
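
The two variants above can be folded into one version-tolerant helper. This is only a minimal sketch: the name positive_by_column and the tiny sample frame are hypothetical, and the hasattr check is one assumed way to pick between to_numpy() and .values.

```python
import pandas as pd


def positive_by_column(df):
    """Map each column name to a NumPy array of its positive values."""
    # DataFrame.to_numpy() exists from pandas 0.24 on; fall back to .values otherwise.
    arr = df.T.to_numpy() if hasattr(df, "to_numpy") else df.T.values
    # Iterating a DataFrame yields its column names, in the same order as the rows of arr.
    return {c: s[s > 0] for c, s in zip(df, arr)}


# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({"a": [0.1, 0.0], "b": [0.0, 0.0]})
r = positive_by_column(df)
```

A column with no positive entry still appears in the result, mapped to an empty array.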

Example:

# Sample dataframe
       col0      col1      col2
0  0.392938 -0.427721 -0.546297
1  0.102630  0.438938 -0.153787
2  0.961528  0.369659 -0.038136
3 -0.215765 -0.313644  0.458099
4 -0.122856 -0.880644 -0.203911

# Result
{'col0': array([0.39293837, 0.10262954, 0.9615284 ]),
 'col1': array([0.43893794, 0.36965948]),
 'col2': array([0.45809941])}
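
To get from that dictionary to the "word: values" lines asked for in the question, something like the following might work. It is a sketch under assumptions: the small frame is a hypothetical stand-in for df_tfidf, and the output format is a guess based on the sample line in the question.

```python
import pandas as pd

# Hypothetical stand-in for df_tfidf: rows are documents, columns are terms.
df = pd.DataFrame(
    {"and": [0.5, 0.0, 0.0], "document": [0.46, 0.68, 0.0], "first": [0.0, 0.0, 0.0]}
)

# Boolean masking per column, as in the answer above.
r = {c: s[s > 0] for c, s in zip(df, df.T.to_numpy())}

# Build one "term: v1,v2,..." line per term, skipping terms with no positive value.
lines = [
    "{}: {}".format(c, ",".join(str(v) for v in vals))
    for c, vals in r.items()
    if len(vals) > 0
]
print("\n".join(lines))
```

Building all lines first and printing once avoids one print call per cell, which tends to matter on a matrix of this size.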

6 Comments

I tried your method but I get: 'DataFrame' object has no attribute 'to_numpy'
@Lx2pwn What's your pandas version? If it is lower than 0.24, you can replace to_numpy() with values.
It is 0.22.0 on Python 2.7.
to_numpy() was introduced in pandas version 0.24. Try with .values
It works perfectly with r = {c: s[s > 0] for c, s in zip(df, df.T.values)}, in 13 seconds on a 7180 x 10390 DataFrame.
1
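import pandas as pd

data = [[0.85, 0.0], [0.2, 0.7], [0.0, 14]]
df = pd.DataFrame(data, columns=['and', 'document'])
# Keep only the positive values of each column (dropna() would not drop zeros)
output = {col: s[s > 0].tolist() for col, s in df.items()}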

for k,v in output.items():
    print(f'{k}: {v}')

Output

and: [0.85, 0.2]
document: [0.7, 14.0]
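
If flat (word, value) pairs are more convenient than one list per word, a vectorized lookup of the positive cells is one possible alternative. This is a sketch that swaps in np.nonzero instead of the per-column approach above; the name pairs is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

data = [[0.85, 0.0], [0.2, 0.7], [0.0, 14]]
df = pd.DataFrame(data, columns=["and", "document"])

arr = df.values  # .values is available on old and new pandas versions alike
rows, cols = np.nonzero(arr > 0)  # positions of the positive cells, row by row

# One (word, value) pair per positive cell.
pairs = [(df.columns[c], float(arr[r, c])) for r, c in zip(rows, cols)]

for word, value in pairs:
    print("{} {}".format(value, word))
```

The Python-level loop only visits the positive cells, so on a mostly-zero tf-idf matrix it should do far less work than testing every cell one by one.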

1 Comment

Your method works. I needed to modify the loop a bit, in this way: for word, dict in output.items(): for (dict, values) in dict.items(): if values > 0: print str(values) + ' ' + str(word.encode('utf8')). It works, but it takes a lot of time: 8 minutes.
