I have a 20,000 x 20,000 NumPy matrix that I wish to store in a file, where the average column has only 12 values in it.

What would be the most efficient way to store only the values in the format of

if array[i][j] == 1:
    file.write("{} {} {{}}\n".format(i, j))

where (i, j) are the indices for the array?

  • Does the sparse matrix implementation you're using have its own serialization code (e.g. for use with pickle)? That might be easier to learn and use than learning enough of its implementation to write your own. Commented Aug 19, 2020 at 19:34
  • To be clear: you're willing to sacrifice memory for performance (hence loading the values into a normal Numpy array), but wish to conserve disk space? Commented Aug 19, 2020 at 19:41
  • @Blckknght Right now it's just a NumPy array, so I actually don't know, sorry! Commented Aug 19, 2020 at 19:46
  • @KarlKnechtel Exactly! I can sacrifice as much memory as necessary to get maximum performance here. It takes only 1 second to generate the array, but a full minute to store it. Commented Aug 19, 2020 at 19:47
  • np.nonzero gives the indices of nonzero elements. Commented Aug 19, 2020 at 20:07

2 Answers

You can use SciPy to build a sparse matrix from a dense NumPy array. The sparse matrix stores only the nonzero entries together with their indices.

import numpy as np
import scipy.sparse
import pickle

I = np.eye(10000)  # Has 10000 nonzero values along the diagonal
S = scipy.sparse.csr_matrix(I)
S
<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
    with 10000 stored elements in Compressed Sparse Row format>

This is highly memory efficient and you can use pickle to dump / load this sparse matrix when you need it.

# Pickle dump (the resulting file is about 160kb)
with open("S.pickle", "wb") as file:
    pickle.dump(S, file)

# Pickle load
with open("S.pickle", "rb") as file:
    S = pickle.load(file)
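
As an alternative to pickle, SciPy also provides its own file format for sparse matrices through scipy.sparse.save_npz and scipy.sparse.load_npz. The following is only a minimal sketch: the 1000 x 1000 identity matrix is a small stand-in for the real data, and the file name S.npz and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Small stand-in for the real matrix: a 1000 x 1000 identity.
dense = np.eye(1000)
sparse = scipy.sparse.csr_matrix(dense)

# "S.npz" is a hypothetical file name; a temp directory keeps the example tidy.
path = os.path.join(tempfile.mkdtemp(), "S.npz")

# save_npz writes the sparse matrix to disk (compressed by default).
scipy.sparse.save_npz(path, sparse)

# load_npz reads it back as a sparse matrix.
restored = scipy.sparse.load_npz(path)

# Check that the round trip preserved every entry.
roundtrip_ok = np.array_equal(restored.toarray(), dense)
```

Whether this or pickle is the better fit likely depends on who needs to read the file later; the .npz route avoids unpickling arbitrary objects.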

To get back a dense representation, use .toarray() to get a NumPy array or .todense() to get a matrix-type object.

S.toarray()
array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

For those reading after the fact: @hpaulj's comment of using np.nonzero effectively solves the problem!

Edit: Here is the code I used to solve it!

array1, array2 = np.nonzero(array)
for i in range(array1.size):
    file.write("{} {} {{}}\n".format(array1[i], array2[i]))
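
A possible variation on the loop above is to build all the lines first and write them in a single call, which may cut down on per-line overhead (not measured here). This is a rough sketch under assumptions: the 5 x 5 array is a tiny stand-in for the real matrix, and an in-memory io.StringIO buffer stands in for the real output file.

```python
import io

import numpy as np

# Tiny stand-in for the real 20,000 x 20,000 array.
array = np.zeros((5, 5), dtype=int)
array[0, 1] = 1
array[2, 3] = 1
array[4, 0] = 1

# np.nonzero returns the row and column indices of the nonzero entries.
rows, cols = np.nonzero(array)

# Build every "i j {}" line up front; "{{}}" renders as a literal "{}".
text = "".join(
    f"{i} {j} {{}}\n" for i, j in zip(rows.tolist(), cols.tolist())
)

# An in-memory buffer stands in for the output file (assumption).
buffer = io.StringIO()
buffer.write(text)
```

With a real file, the same text could be written inside a `with open(...)` block instead of the buffer.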

1 Comment

@Gad this is the author himself posting an answer based on one of the comments.
