print header-value from dataframe matrix in python

Question

I have a matrix built as a pandas DataFrame in this way:

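import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
x = tfidf.fit_transform(corpus)
# On scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
df_tfidf = pd.DataFrame(x.toarray(), columns=tfidf.get_feature_names())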

It seems like the matrix below:

My actual matrix is larger: it has 7180 rows and 10390 columns. Is there a way to print each column header together with its values whenever a value is greater than 0? Something like this: and: 0.511859, document: 0.46, 0.68 ...

I tried it this way, but it takes a lot of time:

for col in df_tfidf.columns:
   for row in df_tfidf.index:
     if df_tfidf[col][row] > 0:
        print str(df_tfidf[col][row]) + ' ' + col.encode('utf8')

Is there a faster way to do this?

I want to iterate over the matrix to obtain just each word and its tf-idf value, but it is too big.
– Lx2pwn, Commented Sep 6, 2020 at 17:02
By tf-idf values, do you mean the positive values in each column? And do you need to store these values, perhaps in a dictionary, or do you just want to print them?
– Shubham Sharma, Commented Sep 6, 2020 at 17:04
Yes, they are all positive values. I want to save them in a kind of dictionary, word: value. I'm trying to work from the answer given by Chris.
– Lx2pwn, Commented Sep 6, 2020 at 17:07

Shubham Sharma · Accepted Answer · 2020-09-06 17:59:24Z

1

You can use boolean masking on NumPy arrays to filter the positive values inside a dict comprehension:

r = {c: s[s > 0] for c, s in zip(df, df.T.to_numpy())}

EDIT: DataFrame.to_numpy() is available in pandas >= 0.24. If you are using a pandas version below 0.24, use this instead:

r = {c: s[s > 0] for c, s in zip(df, df.T.values)}

Example:

# Sample dataframe
       col0      col1      col2
0  0.392938 -0.427721 -0.546297
1  0.102630  0.438938 -0.153787
2  0.961528  0.369659 -0.038136
3 -0.215765 -0.313644  0.458099
4 -0.122856 -0.880644 -0.203911

# Result
{'col0': array([0.39293837, 0.10262954, 0.9615284 ]),
 'col1': array([0.43893794, 0.36965948]),
 'col2': array([0.45809941])}
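
For anyone who wants to run the example end to end, here is a minimal sketch. It assumes pandas >= 0.24 and rebuilds the sample frame from the rounded values printed above; the `word_values` name is only illustrative and is not part of the original answer.

```python
import pandas as pd

# Sample frame mirroring the example above (values rounded to six decimals)
df = pd.DataFrame({
    "col0": [0.392938, 0.102630, 0.961528, -0.215765, -0.122856],
    "col1": [-0.427721, 0.438938, 0.369659, -0.313644, -0.880644],
    "col2": [-0.546297, -0.153787, -0.038136, 0.458099, -0.203911],
})

# Iterating a DataFrame yields its column labels, and df.T.to_numpy()
# yields one 1-D array per column, so zip pairs each label with its values.
r = {c: s[s > 0] for c, s in zip(df, df.T.to_numpy())}

# Convert the arrays to plain lists for a simple word -> values dictionary
word_values = {c: s.tolist() for c, s in r.items()}
print(word_values)
```

Transposing once and masking each row with NumPy avoids the per-cell `df[col][row]` lookups of the nested loop in the question, which is probably where most of the time went.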

edited Sep 6, 2020 at 17:59

answered Sep 6, 2020 at 17:10

Shubham Sharma


6 Comments

Lx2pwn Over a year ago

I tried your method, but I get: 'DataFrame' object has no attribute 'to_numpy'

Shubham Sharma Over a year ago

@Lx2pwn What's your pandas version? If it is lower than 0.24, you can replace to_numpy() with values.

Lx2pwn Over a year ago

It is 0.22.0 on Python 2.7.

Shubham Sharma Over a year ago

to_numpy() was introduced in pandas version 0.24. Try .values instead.

Lx2pwn Over a year ago

It works perfectly with r = {c: s[s > 0] for c, s in zip(df, df.T.values)}: 13 seconds on a 7180 x 10390 DataFrame.


Chris · Accepted Answer · 2020-09-06 15:46:22Z

1

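import pandas as pd

data = [[0.85, 0.0], [0.2, 0.7], [0.0, 14]]
df = pd.DataFrame(data, columns=['and', 'document'])
# Keep only the positive values of each column (dropna() alone would not remove the zeros)
output = {col: df.loc[df[col] > 0, col].tolist() for col in df.columns}

for k, v in output.items():
    print(f'{k}: {v}')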

Output

and: [0.85, 0.2]
document: [0.7, 14.0]
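
If the goal is one word/value pair per line, as in the loop from the question, a flat variant may be easier to work with. The sketch below is an alternative rather than the method of either answer: it assumes pandas >= 0.24 and uses `np.nonzero` on the transposed mask, and the `pairs` name is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"and": [0.85, 0.2, 0.0], "document": [0.0, 0.7, 14.0]})

values = df.to_numpy()
# np.nonzero on the transposed mask returns (column, row) indices,
# ordered by column first and then by row.
col_idx, row_idx = np.nonzero(values.T > 0)

# One (word, value) pair per positive cell
pairs = [(df.columns[c], float(values[r, c])) for c, r in zip(col_idx, row_idx)]

for word, value in pairs:
    print(f"{word}: {value}")
```

This still works on the dense array, so memory use is the same as with the DataFrame; it only replaces the Python-level double loop with a single vectorized lookup of the positive cells.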

answered Sep 6, 2020 at 15:46

Chris


1 Comment

Lx2pwn Over a year ago

Your method works. I needed to modify the loop a bit, like this:

for word, dict in output.items():
    for (dict, values) in dict.items():
        if values > 0:
            print str(values) + ' ' + str(word.encode('utf8'))

It works, but it takes a lot of time: 8 minutes.
