I am accessing some vectors stored as arrays in Google BigQuery, using the python client:

df = client.query(sql).to_dataframe()

The resulting dataframe has a single column which looks like the following:

    page_vector
0   [0.11585406959056854, 0.4495273232460022, -0.0...
1   [0.3589635491371155, 0.529633104801178, 0.3646...
2   [0.05760370194911957, 0.10355205088853836, 0.7...
3   [0.12493933737277985, 0.7082784175872803, 0.26...
4   [-0.660869300365448, -0.5055545568466187, -0.2...

Now I would like to do various calculations on these vector values, such as computing the mean, cosine similarity, etc.

My issue is that the values are stored as an array of lists (I believe), and I am not able to convert them into regular numpy arrays.

df.values

array([[list([0.11585406959056854, 0.4495273232460022, -0.06741087883710861, 0.009115549735724926, 0.03358231857419014, 0.3813880980014801, 0.5367750525474548, 0.1125958263874054, -0.04873140528798103, -0.15494178235530853])],
       [list([0.3589635491371155, 0.529633104801178, 0.3646768629550934, -0.5236702561378479, -0.20803043246269226, -0.40205657482147217, 0.9097139835357666, 0.3311547636985779, -0.10366004705429077, -0.31620144844055176])],
       [list([0.05760370194911957, 0.10355205088853836, 0.7606179118156433, -0.40389031171798706, -0.4287498891353607, -0.5946164727210999, 1.470175862312317, 0.12346278876066208, -0.13954032957553864, -0.4611101448535919])],
       [list([0.12493933737277985, 0.7082784175872803, 0.26176416873931885, 0.04834984615445137, -0.1890079379081726, -0.2270711362361908, 0.8319875597953796, 0.39853358268737793, -0.11916585266590118, -0.5312120318412781])],
       [list([-0.660869300365448, -0.5055545568466187, -0.260611891746521, 0.6198488473892212, 0.07465806603431702, 0.6059150099754333, -0.548044741153717, 0.38490045070648193, -0.49995312094688416, 0.1975364089012146])]],
      dtype=object)

How should I manipulate the results from BigQuery into something I can then use for various calculations?

I have tried many avenues, such as: df.apply(lambda x: np.asarray(x, dtype=float))

2 Answers

1

Check df.info() or df.dtypes. That column is object dtype.

df.values produces a 2d array, in this case with shape (n, 1): rows by columns.

df.values[:,0] should be a (n,) shape array. You could also select the column before using values. Series.values produces a 1d array (still object dtype).

np.stack(df.values[:,0]) should produce a 2d array, provided the lists are all the same size. It joins the lists along a new first axis, so each list becomes one row.

Compare this with the tolist approach, and look at the resulting list of lists.

If you look at the pandas documentation, you'll see that while Series has a tolist method, DataFrame does not.

In [60]: df1                                                                    
Out[60]: 
           1
0  [1, 2, 3]
1  [2, 3, 4]
2  [3, 4, 5]

In [62]: df1.values                                                             
Out[62]: 
array([[list([1, 2, 3])],
       [list([2, 3, 4])],
       [list([3, 4, 5])]], dtype=object)

In [63]: df1.values.shape                                                       
Out[63]: (3, 1)

In [64]: df1.values[:,0]                                                        
Out[64]: array([list([1, 2, 3]), list([2, 3, 4]), list([3, 4, 5])], dtype=object)

In [65]: np.stack(df1.values[:,0])
Out[65]: 
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])

tolist doesn't work for the dataframe, just for a Series:

In [69]: df1.tolist()                                                           
AttributeError: 'DataFrame' object has no attribute 'tolist'

In [70]: df1[1].tolist()                                                        
Out[70]: [[1, 2, 3], [2, 3, 4], [3, 4, 5]]

Values from the Series:

In [72]: df1[1].values                                                          
Out[72]: array([list([1, 2, 3]), list([2, 3, 4]), list([3, 4, 5])], dtype=object)
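
Once the column is stacked into a 2d float array, the calculations mentioned in the question become ordinary NumPy operations. The following is only a minimal sketch: the small frame stands in for the BigQuery result, the values are made up, and `cosine_similarity` is an illustrative helper rather than a library function. Only the column name `page_vector` comes from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the BigQuery result: one list of floats per row.
df = pd.DataFrame({"page_vector": [[1.0, 0.0, 0.0],
                                   [0.0, 2.0, 0.0],
                                   [3.0, 0.0, 0.0]]})

# Stack the per-row lists into a single (n_rows, n_dims) float array.
vectors = np.stack(df["page_vector"].to_numpy())

# Mean vector across all rows.
mean_vector = vectors.mean(axis=0)


def cosine_similarity(a, b):
    """Cosine similarity between two 1d vectors (illustrative helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_parallel = cosine_similarity(vectors[0], vectors[2])    # same direction
sim_orthogonal = cosine_similarity(vectors[0], vectors[1])  # perpendicular
```

This assumes all vectors have the same length and that none of them is all zeros, since a zero vector would make the cosine denominator zero.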

3 Comments

Thanks @hpaulj, would you recommend converting them in situ or into a new column in the df?
I don't follow. That column is object dtype with one list per row. You could replace each list with an equivalent 1d array. But that still doesn't give a 2d array. You could make a new dataframe from the 2d array, one df column per column of the array.
I would like to operate within the df if possible, as there are other columns which describe the data.
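
To stay inside the dataframe, as the last comment asks, the two options from the reply above can be sketched roughly as follows: replace each list with a 1d array in the same column, or expand the stacked 2d array into one column per dimension and join it back to the descriptive columns. The `page_id` column, the `dim_*` column names, and the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a descriptive column plus the list-valued vector column.
df = pd.DataFrame({
    "page_id": ["a", "b", "c"],
    "page_vector": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
})

# Option 1: replace each list with a 1d float array, in place.
# The column stays object dtype, but each cell now supports NumPy arithmetic.
df["page_vector"] = df["page_vector"].apply(lambda v: np.asarray(v, dtype=float))
norms = df["page_vector"].apply(np.linalg.norm)

# Option 2: expand the vectors into one numeric column per dimension,
# keeping the original index so the rows line up with the other columns.
stacked = np.stack(df["page_vector"].to_numpy())
dim_cols = [f"dim_{i}" for i in range(stacked.shape[1])]
expanded = pd.DataFrame(stacked, index=df.index, columns=dim_cols)
wide = df.drop(columns="page_vector").join(expanded)
```

Option 1 keeps one cell per vector, which is convenient for row-wise work. Option 2 gives real numeric columns, so column-wise pandas operations such as `mean()` apply directly.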
0

We can convert the column to a list first, then turn that list into a numpy array:

np.array(df.page_vector.tolist())
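
As a rough explanation of why this works: `tolist()` turns the column into a plain Python list of lists, and `np.array` converts a rectangular list of lists into a 2d numeric array rather than an object array. A minimal sketch with made-up values, which also checks that the result matches `np.stack` on the same column:

```python
import numpy as np
import pandas as pd

# Made-up data in the same layout as the question: one list per row.
df = pd.DataFrame({"page_vector": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]})

nested = df.page_vector.tolist()   # plain Python list of lists
arr = np.array(nested)             # rectangular input -> 2d float array

# np.stack on the column builds the same array.
same = np.array_equal(arr, np.stack(df.page_vector.to_numpy()))
```

This assumes every list has the same length. With lists of different lengths, NumPy cannot build a regular 2d array.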

1 Comment

This is great, thanks. Could you add a little more explanation for others who are reading this? Also, is it possible to do this on the df itself?
