Trouble converting matrix stored as string in excel to numpy array when using pandas dataframe

Question

I am having a hard time reading an Excel file into a pandas DataFrame and converting a stored matrix to a numpy array. I think part of the issue is that the matrix is improperly stored. I have no control over the spreadsheet, however; this is how it was sent to me.

For instance, this is the string stored in a cell:

[[[ 0.        0.        0.107851]
  [ 0.        0.       -0.862809]]]

I read in the row with DataFrame and save each cell to a variable. I then try to convert this particular variable to a np.array, since those numbers represent two sets of x, y, z coordinates.

I have tried np.fromstring and np.asarray to no avail. It will convert the string to a numpy array but it will be a terrible mess with the brackets still inside as characters. I have tried using np.squeeze to get rid of the brackets but it says dimension is not 1.

If I use np.asarray(item._coord, dtype=float), it fails, saying it cannot convert the string to float:

ValueError: could not convert string to float: '[[[ 0. 0. 0.107851] [ 0. 0. -0.862809]]]'

There is a '\n' that shows up in the middle of it, between the two lists. I use df = df.replace(r'\n', ' ', regex=True) to clean out the '\n's prior to the data conversion attempts.
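For illustration, a minimal sketch of why the float conversion fails and one way around it: the brackets are the only characters that float parsing cannot handle, so turning them into spaces and splitting on whitespace leaves plain number tokens. The variable names here are hypothetical, and the reshape assumes every point is an x, y, z triple, so any original nesting such as (1, 2, 3) is flattened to N rows of 3.

```python
import numpy as np

# Hypothetical cell value, copied from the example string in the question.
cell = "[[[ 0.        0.        0.107851]\n  [ 0.        0.       -0.862809]]]"

# Turn brackets into spaces; str.split() with no argument splits on any
# run of whitespace, so the embedded '\n' needs no separate clean-up.
tokens = cell.replace("[", " ").replace("]", " ").split()

# Assumption: each point is an (x, y, z) triple, so reshape to N rows of 3.
coords = np.array(tokens, dtype=float).reshape(-1, 3)
print(coords)
```

This discards the bracket structure on purpose. If the nesting matters, the string has to be parsed rather than stripped.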

I am stuck

jezrael · Accepted Answer · 2020-05-20 06:23:22Z

4

Use a custom function to convert the strings to numpy arrays after read_excel:

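import numpy as np
import pandas as pd

# Build a sample array and write it to Excel to reproduce the problem
a = np.array([[[ 0.,        0.,        0.107851],
               [ 0.,        0.,       -0.862809]]])
print (a)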
[[[ 0.        0.        0.107851]
  [ 0.        0.       -0.862809]]]

df = pd.DataFrame({'col':[a,a,a]})
print (df)
                                               col
0  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
1  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
2  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]

df.to_excel('test.xlsx', index=False)

import re
import ast
import numpy as np

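#https://stackoverflow.com/a/44323021
def str2array(s):
    # Remove spaces after an opening bracket, e.g. '[ 0.' -> '[0.'
    s = re.sub(r'\[ +', '[', s.strip())
    # Replace every run of commas/whitespace (including '\n') with ', '
    s = re.sub(r'[,\s]+', ', ', s)
    # The string is now a valid Python list literal, so parse it safely
    return np.array(ast.literal_eval(s))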

df = pd.read_excel('test.xlsx')

df['col'] = df['col'].apply(str2array)
print (df)
                                               col
0  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
1  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
2  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
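As a quick check without going through Excel, the same function can be applied directly to the string from the question, including its embedded newline. This is a sketch: `cell`, `coords` and `xyz` are hypothetical names, and the final reshape assumes the data are x, y, z triples, as the question states.

```python
import ast
import re

import numpy as np


def str2array(s):
    # Same logic as the answer's function, with raw-string patterns.
    s = re.sub(r'\[ +', '[', s.strip())
    s = re.sub(r'[,\s]+', ', ', s)
    return np.array(ast.literal_eval(s))


# Hypothetical cell value, copied from the question (note the '\n').
cell = "[[[ 0.        0.        0.107851]\n  [ 0.        0.       -0.862809]]]"

coords = str2array(cell)
print(coords.shape)  # nesting is preserved: (1, 2, 3)

# Optional: drop the outer dimension to get one row per point.
xyz = coords.reshape(-1, 3)
print(xyz)
```

Because the whitespace pattern also matches '\n', the separate df.replace step from the question should not be needed before calling str2array.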

edited May 20, 2020 at 6:23

answered May 20, 2020 at 6:12

jezrael


4 Comments

Charlie Crown Over a year ago

Thanks, just trying to get this working. In making a MWE I had to leave out some things. I think this is going to work though. It will be a few more minutes before my incompetent self gets to a conclusion.

Charlie Crown Over a year ago

Awwwww Yeahhhhh, it runs like a well oiled machine!

Charlie Crown Over a year ago

Final comment - I used the very last option df['col'] = df['col'].apply(str2array)

jezrael Over a year ago

@CharlieCrown - Mea culpa, I later realised it is not csv, but excel, so converter cannot be used.
