I have a pandas dataframe with a column of vectors that I would like to perform matrix arithmetic on. However, on closer inspection, the vectors are all stored as strings, seemingly with newline characters embedded in them:

[screenshot of the dataframe column, not reproduced here]

How do I convert each vector in this column into a numpy array? I've tried

df['Word Vector'].as_matrix

and

np.array(df['Word Vector'])

as well as

df['Word Vector'] = df['Word Vector'].astype(np.array)

but none produced the desired result. Any pointers would be appreciated!

  • Please provide an example of your data that we can experiment with. Commented Aug 16, 2017 at 8:38
  • @MedAli What would be the best way to do so? I'm not sure what process generated this format. How can I upload a sample of the dataframe to Stack Overflow? Commented Aug 16, 2017 at 17:26

3 Answers

15

Hopefully the following does what you expect:

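import pandas as pd
import numpy as np

# Build sample data: the string form of a numpy array, as in the question
x = str(np.arange(1, 100))
df = pd.DataFrame([x, x, x, x])
df.columns = ['words']
print('sample')
print(df.head())
# Strip newlines and brackets, collapse double spaces, then parse the numbers
result = df['words'].apply(lambda s:
                           np.fromstring(
                               s.replace('\n', '')
                                .replace('[', '')
                                .replace(']', '')
                                .replace('  ', ' '), sep=' '))
print('result')
print(result)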

The output is as follows:

    sample
                                               words
0  [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 ...
1  [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 ...
2  [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 ...
3  [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 ...
result
0    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...
1    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...
2    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...
3    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...

Calling replace this many times is not elegant, but I did not find a better approach. In any case, it should let you convert the strings to vectors.

A side note: since your data is shown only as a picture, you should check whether the values are separated by spaces or by tabs. If they are tab-separated, change sep=' ' to sep='\t'.
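
One way to avoid both the chained replace calls and the space-versus-tab question is to split on any whitespace. The following is a minimal sketch, assuming each cell holds a bracketed, whitespace-separated list of numbers like the sample above; `parse_vector` is a hypothetical helper name, not part of pandas or numpy.

```python
import numpy as np
import pandas as pd

def parse_vector(text):
    """Parse a string such as '[ 1  2\n 3]' into a 1-D float array."""
    # str.split() with no argument splits on any run of whitespace,
    # so spaces, tabs, and embedded newlines are all handled alike.
    tokens = text.strip().strip('[]').split()
    return np.array(tokens, dtype=float)

# Hypothetical sample data mixing spaces, a newline, and a tab
df = pd.DataFrame({'Word Vector': ['[ 1  2  3\n  4  5]',
                                   '[0.5\t1.5 2.5 3.5 4.5]']})
df['Word Vector'] = df['Word Vector'].apply(parse_vector)
```

Each cell of `df['Word Vector']` then holds a numpy array rather than a string.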


2

This worked for me for string lists in a Pandas column:

df['Numpy Word Vector'] = df['Word Vector'].apply(eval).apply(np.array)
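
Note that `eval` executes whatever code the string contains, so it is risky on data you do not control. A possible alternative, sketched here on the assumption that the strings are comma-separated Python list literals (the sample values are made up for illustration), is `ast.literal_eval` from the standard library, which only accepts literals:

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical sample: comma-separated list literals stored as strings
df = pd.DataFrame({'Word Vector': ['[0., 1., 2., 3.]', '[1., 2., 3., 4.]']})

# literal_eval parses Python literals only and raises on anything else
df['Numpy Word Vector'] = (df['Word Vector']
                           .apply(ast.literal_eval)
                           .apply(np.array))
```

Neither `eval` nor `ast.literal_eval` will parse whitespace-separated strings such as '[ 1  2  3]', which is how numpy prints arrays; those need a split-based or `np.fromstring` approach.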


0

The solution below is shorter:

df[col_name] = df[col_name].apply(lambda x: np.array(eval(x)))

Example:

df = pd.DataFrame(['[0., 1., 2., 3.]', '[1., 2., 3., 4.]'], columns=['Word Vector'])
df['Word Vector'][0] # '[0., 1., 2., 3.]'

df['Word Vector'] = df['Word Vector'].apply(lambda x: np.array(eval(x)))
df['Word Vector'][0] # array([0., 1., 2., 3.])
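
Since the question's goal is matrix arithmetic, it may help to stack the per-row arrays into a single 2-D array once they are parsed. This is a rough sketch, assuming every vector has the same length; the sample data is the same made-up example as above, and `ast.literal_eval` is swapped in for `eval`.

```python
import ast

import numpy as np
import pandas as pd

df = pd.DataFrame(['[0., 1., 2., 3.]', '[1., 2., 3., 4.]'],
                  columns=['Word Vector'])
df['Word Vector'] = df['Word Vector'].apply(
    lambda x: np.array(ast.literal_eval(x)))

# Stack the row arrays into one (n_rows, vector_length) matrix.
# np.stack raises ValueError if the vectors differ in length.
matrix = np.stack(df['Word Vector'].tolist())

# Ordinary matrix arithmetic now works, e.g. pairwise dot products
gram = matrix @ matrix.T
```

Here `matrix` has shape (2, 4) and `gram` has shape (2, 2).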

