Convert list from dataframe into numpy arrays

Question

I am accessing some vectors stored as arrays in Google BigQuery, using the Python client:

df = client.query(sql).to_dataframe()

The resulting dataframe has a single column which looks like the following:

    page_vector
0   [0.11585406959056854, 0.4495273232460022, -0.0...
1   [0.3589635491371155, 0.529633104801178, 0.3646...
2   [0.05760370194911957, 0.10355205088853836, 0.7...
3   [0.12493933737277985, 0.7082784175872803, 0.26...
4   [-0.660869300365448, -0.5055545568466187, -0.2...

Now I would like to do various calculations on these vectors, such as the mean, cosine similarity, etc.

My issue is that the values are stored as an array of lists (I believe), and I am not able to convert them into regular NumPy arrays.

df.values

array([[list([0.11585406959056854, 0.4495273232460022, -0.06741087883710861, 0.009115549735724926, 0.03358231857419014, 0.3813880980014801, 0.5367750525474548, 0.1125958263874054, -0.04873140528798103, -0.15494178235530853])],
       [list([0.3589635491371155, 0.529633104801178, 0.3646768629550934, -0.5236702561378479, -0.20803043246269226, -0.40205657482147217, 0.9097139835357666, 0.3311547636985779, -0.10366004705429077, -0.31620144844055176])],
       [list([0.05760370194911957, 0.10355205088853836, 0.7606179118156433, -0.40389031171798706, -0.4287498891353607, -0.5946164727210999, 1.470175862312317, 0.12346278876066208, -0.13954032957553864, -0.4611101448535919])],
       [list([0.12493933737277985, 0.7082784175872803, 0.26176416873931885, 0.04834984615445137, -0.1890079379081726, -0.2270711362361908, 0.8319875597953796, 0.39853358268737793, -0.11916585266590118, -0.5312120318412781])],
       [list([-0.660869300365448, -0.5055545568466187, -0.260611891746521, 0.6198488473892212, 0.07465806603431702, 0.6059150099754333, -0.548044741153717, 0.38490045070648193, -0.49995312094688416, 0.1975364089012146])]],
      dtype=object)

How should I manipulate the results from BigQuery into something I can then use for various calculations?

I have tried many avenues, such as: df.apply(lambda x: np.asarray(x, dtype=float))

hpaulj · Accepted Answer · 2019-09-29 16:56:15Z

1

Check your df info or dtypes. That column is object dtype.

df.values produces a 2d array, in this case (n,1) shape, rows by columns.

df.values[:,0] should be a (n,) shape array. You could also select the column before using values. Series.values produces a 1d array (still object dtype).

np.stack(df.values[:,0]) should produce a 2d array, provided the lists are all the same size. This joins the lists along a new first axis, one row per list.

Compare this with the tolist approach, and look at the resulting list of lists.

If you look at the pandas documentation, you'll see that while Series has a tolist method, DataFrame does not.

In [60]: df1                                                                    
Out[60]: 
           1
0  [1, 2, 3]
1  [2, 3, 4]
2  [3, 4, 5]

In [62]: df1.values                                                             
Out[62]: 
array([[list([1, 2, 3])],
       [list([2, 3, 4])],
       [list([3, 4, 5])]], dtype=object)

In [63]: df1.values.shape                                                       
Out[63]: (3, 1)

In [64]: df1.values[:,0]
Out[64]: array([list([1, 2, 3]), list([2, 3, 4]), list([3, 4, 5])], dtype=object)

In [65]: np.stack(df1.values[:,0])
Out[65]: 
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])

tolist doesn't work for the dataframe, just for a Series:

In [69]: df1.tolist()                                                           
AttributeError: 'DataFrame' object has no attribute 'tolist'

In [70]: df1[1].tolist()                                                        
Out[70]: [[1, 2, 3], [2, 3, 4], [3, 4, 5]]

values from the Series:

In [72]: df1[1].values                                                          
Out[72]: array([list([1, 2, 3]), list([2, 3, 4]), list([3, 4, 5])], dtype=object)
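To connect this back to the calculations mentioned in the question, here is a minimal sketch of using the stacked array for a mean vector and pairwise cosine similarity. The small `df` below is a hypothetical stand-in for the BigQuery result, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the BigQuery result:
# a single object-dtype column holding one list per row.
df = pd.DataFrame(
    {"page_vector": [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 0.0, 0.0]]}
)

# Stack the per-row lists into one (n_rows, n_dims) numeric array.
# This assumes every list has the same length.
vectors = np.stack(df["page_vector"].to_numpy())

# Mean vector across all rows.
mean_vector = vectors.mean(axis=0)

# Pairwise cosine similarity: scale each row to unit length,
# then take all row-by-row dot products.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / norms
cosine = unit @ unit.T
```

Once the data is a regular 2d array, these operations are ordinary vectorised NumPy calls. A row of all zeros would need separate handling, since its norm is zero.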

edited Sep 29, 2019 at 16:56

answered Sep 29, 2019 at 16:25


3 Comments

dendog Over a year ago

Thanks @hpaulj, would you recommend converting them in situ or into a new col in the df?

hpaulj Over a year ago

I don't follow. That column is object dtype with one list per row. You could replace each list with an equivalent 1d array. But that still doesn't give a 2d array. You could make a new dataframe from the 2d array, one df column per column of the array.

dendog Over a year ago

I would like to operate within the df if possible, as there are other columns which describe the data.
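The two options raised in these comments could be sketched roughly as follows. The `page_id` column and the `dim_*` column names are hypothetical, chosen only to show how other descriptive columns can stay alongside the vectors.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a vector column plus a column describing the data.
df = pd.DataFrame(
    {
        "page_id": ["a", "b", "c"],
        "page_vector": [[1, 2, 3], [2, 3, 4], [3, 4, 5]],
    }
)

# Option 1: replace each list with an equivalent 1d array, in place.
# The column is still object dtype; it now holds arrays instead of lists.
df["page_vector"] = df["page_vector"].map(np.asarray)

# Option 2: build the 2d array and expand it into one numeric
# column per vector dimension, keeping the original index.
matrix = np.stack(df["page_vector"].to_numpy())
dims = pd.DataFrame(
    matrix,
    index=df.index,
    columns=[f"dim_{i}" for i in range(matrix.shape[1])],
)

# Join the expanded columns back onto the descriptive columns.
wide = df.drop(columns="page_vector").join(dims)
```

Option 1 keeps a single column but does not give pandas a numeric dtype to work with. Option 2 does, at the cost of one column per dimension.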

BENY · Accepted Answer · 2019-09-29 16:30:26Z

0

We can convert the column to a list first, then turn it into a NumPy array:

np.array(df.page_vector.tolist())
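As a little more explanation, the one-liner can be split into its two steps. This is a sketch on a hypothetical three-row `df`, not the asker's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the query result.
df = pd.DataFrame({"page_vector": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]})

# Step 1: Series.tolist() returns a plain Python list of the cell values,
# which here is a list of equal-length lists.
nested = df.page_vector.tolist()

# Step 2: np.array turns a rectangular list of lists into a 2d numeric array.
arr = np.array(nested)

print(type(nested).__name__, len(nested))
print(arr.shape, arr.dtype)
```

This relies on every row holding a list of the same length. If the lengths differ, the lists do not form a rectangular array and the conversion will not give a 2d numeric result.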

edited Sep 29, 2019 at 16:30

answered Sep 29, 2019 at 15:42


1 Comment

dendog Over a year ago

This is great, thanks. Could you add a little more explanation for others who are reading this? Also, is it possible to do this on the df itself?
