Storing a Sparse Numpy Array

Question

I have a 20,000 x 20,000 NumPy matrix that I wish to store to a file, where the average column has only 12 values in it.

What would be the most efficient way to store only the values in the format of

if array[i][j] == 1:
    file.write("{} {} {{}}\n".format(i, j))

where (i, j) are the indices for the array?

Does the sparse matrix implementation you're using have its own serialization code (e.g. for use with pickle)? That might be easier to learn and use than learning enough of its implementation to write your own.
– Blckknght, Commented Aug 19, 2020 at 19:34
To be clear: you're willing to sacrifice memory for performance (hence loading the values into a normal NumPy array), but wish to conserve disk space?
– Karl Knechtel, Commented Aug 19, 2020 at 19:41
@Blckknght Right now it's just a NumPy array, so I actually don't know, sorry!
– TheAkashain, Commented Aug 19, 2020 at 19:46
@KarlKnechtel Exactly! I can sacrifice as much memory as necessary to get maximum performance here. It takes only 1 second to generate the array, but a full minute to store it.
– TheAkashain, Commented Aug 19, 2020 at 19:47

Akshay Sehgal · Accepted Answer · 2020-08-20 23:39:39Z

You can use SciPy to create a sparse matrix from a dense NumPy array. A sparse matrix stores only the nonzero entries along with their indices.

import numpy as np
import scipy.sparse
import pickle

I = np.eye(10000)  # has 10000 nonzero values along the diagonal
S = scipy.sparse.csr_matrix(I)
S

<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
    with 10000 stored elements in Compressed Sparse Row format>

This is highly memory efficient and you can use pickle to dump / load this sparse matrix when you need it.

# Pickle dump
with open("S.pickle", "wb") as file:  # 160 kB on disk
    pickle.dump(S, file)

# Pickle load
with open("S.pickle", "rb") as file:
    S = pickle.load(file)
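
As an alternative to pickle, SciPy also provides `scipy.sparse.save_npz` and `scipy.sparse.load_npz`, which write the matrix to a compressed `.npz` file. The following is a minimal sketch rather than part of the original answer. It assumes a small identity matrix as stand-in data, and the helper name `roundtrip_npz` and the file name `S.npz` are hypothetical:

```python
import os
import tempfile

import numpy as np
import scipy.sparse


def roundtrip_npz(dense, path):
    """Save `dense` as a CSR matrix in .npz format and load it back."""
    S = scipy.sparse.csr_matrix(dense)
    scipy.sparse.save_npz(path, S)  # compressed by default
    return scipy.sparse.load_npz(path)


# Stand-in data: a small identity matrix (the question's matrix is 20,000 x 20,000).
I = np.eye(1000)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "S.npz")  # hypothetical file name
    loaded = roundtrip_npz(I, path)

print(loaded.shape, loaded.nnz)
```

Unlike a pickle, the `.npz` file does not need to be unpickled to be read, which can be a safer choice if the file comes from somewhere else.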

To get back a dense representation, use .toarray() for a NumPy array or .todense() for a numpy.matrix object.

S.toarray()

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
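
To get a rough sense of the memory savings, one can compare the bytes held by the dense array with the bytes held by the three arrays that back a CSR matrix. This is a sketch under the assumption of a small identity matrix as test data. The exact sparse size depends on the index dtype SciPy picks, so no figure is quoted for it:

```python
import numpy as np
import scipy.sparse

I = np.eye(1000)                  # dense: every entry is stored
S = scipy.sparse.csr_matrix(I)    # sparse: only nonzero entries are stored

dense_bytes = I.nbytes
# A CSR matrix keeps three arrays: the values, their column indices, and row pointers.
sparse_bytes = S.data.nbytes + S.indices.nbytes + S.indptr.nbytes

print(dense_bytes, sparse_bytes)

back = S.toarray()                # plain numpy.ndarray
assert isinstance(back, np.ndarray)
assert np.array_equal(back, I)    # the round trip is lossless
```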

TheAkashain (edited by darthbith) · Accepted Answer · 2023-10-06 21:30:13Z

For those reading after the fact: @hpaulj's comment of using np.nonzero effectively solves the problem!

Edit: Here is the code I used to solve it!

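# `array` is the dense NumPy matrix from the question; `file` is an open, writable file
array1, array2 = np.nonzero(array)  # row and column indices of the nonzero entries
for i in range(array1.size):
    file.write("{} {} {{}}\n".format(array1[i], array2[i]))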
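
Since the question mentions that storing takes far longer than generating the array, one option is to build all the lines first and hand them to the file in one `writelines` call instead of calling `file.write` per entry. The sketch below keeps the same `i j {}` line format. The function names `write_nonzero` and `read_nonzero` and the file name are hypothetical, the reader assumes every stored entry equals 1, and no timings were measured:

```python
import os
import tempfile

import numpy as np


def write_nonzero(array, path):
    """Write one "i j {}" line per nonzero entry of a 2-D array."""
    rows, cols = np.nonzero(array)
    # tolist() yields plain Python ints, so no NumPy scalars are formatted.
    lines = ["{} {} {{}}\n".format(i, j)
             for i, j in zip(rows.tolist(), cols.tolist())]
    with open(path, "w") as f:
        f.writelines(lines)


def read_nonzero(path, shape):
    """Rebuild a dense 0/1 array from a file written by write_nonzero."""
    array = np.zeros(shape, dtype=np.int8)
    with open(path) as f:
        for line in f:
            i, j = line.split()[:2]  # ignore the trailing "{}"
            array[int(i), int(j)] = 1
    return array


# Small stand-in for the 20,000 x 20,000 matrix in the question.
example = np.zeros((5, 5), dtype=np.int8)
example[0, 3] = example[2, 1] = example[4, 4] = 1

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nonzero.txt")  # hypothetical file name
    write_nonzero(example, path)
    with open(path) as f:
        text = f.read()
    restored = read_nonzero(path, example.shape)

print(text)
```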

answered Aug 19, 2020 at 20:29 by TheAkashain; edited Oct 6, 2023 at 21:30 by darthbith

@Gad this is the author himself posting an answer based on one of the comments.
– Akshay Sehgal, Commented over a year ago
