1

I have several different 'columns' I need to save to a CSV. Currently I do this:

import csv
import numpy as np

# The with block closes the file even if writing fails
with open(out_csv, 'w', newline='') as f:
    w = csv.writer(f, delimiter=',')
    w.writerow(['id_a', 'id_b',
                'lat_a', 'lon_a',
                'lat_b', 'lon_b',
                'proj_metres'])
    w.writerows(np.column_stack((
                id_labels[udist.row],
                id_labels[udist.col],
                points[udist.row],
                points[udist.col],
                udist.data)))

Perhaps not important but for completeness:

tree_dist = tree.sparse_distance_matrix(tree)
udist = sparse.tril(tree_dist, k=-1)

The output is around 30 million rows by 7 columns (two of which are strings: id_labels), so this takes a while (around 8 minutes) and uses a lot of RAM. I think Python creates a new temporary object when I call np.column_stack, so at one point it holds double the data it needs.

Is there a better way to create the CSV I need?

3 Answers

2

You can open the file in append mode and then call np.savetxt once per array:

import numpy as np
array1 = np.arange(12).reshape((2, 6))
array2 = np.ones(18).reshape((3, 6))
with open('outputfile.csv', 'ab') as f:
    np.savetxt(f, array1, delimiter=',')
    np.savetxt(f, array2, delimiter=',')
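
As a rough sketch of how the same idea could apply to the question's mixed string and float columns: write the rows in blocks, so that np.column_stack only ever copies one block at a time. The function name write_csv_in_chunks and the chunk_rows parameter are made up for illustration, and the sketch assumes every column array has already been built and has the same number of rows.

```python
import numpy as np

def write_csv_in_chunks(path, header, columns, chunk_rows=100_000):
    """Write equally long column arrays to a CSV file, one block of rows at a time.

    `columns` is a list of 1-D or 2-D arrays (hypothetical helper, not a NumPy API).
    """
    n_rows = len(columns[0])
    with open(path, "w") as f:
        f.write(",".join(header) + "\n")
        for start in range(0, n_rows, chunk_rows):
            stop = min(start + chunk_rows, n_rows)
            # Only this slice is converted and stacked, so the temporary copy
            # is at most chunk_rows rows long.
            block = np.column_stack(
                [np.asarray(c[start:stop], dtype=str) for c in columns]
            )
            # Everything is a string by now, so a single '%s' format is enough.
            np.savetxt(f, block, delimiter=",", fmt="%s")
```

With the names from the question this might be called as write_csv_in_chunks(out_csv, header, [id_labels[udist.row], id_labels[udist.col], points[udist.row], points[udist.col], udist.data]). Note that the fancy-indexed arrays passed in are still full-size copies; only the stacked block is kept small.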

7 Comments

Thanks, this looks great. I'm having a small problem with savetxt: TypeError: Mismatch between array dtype ('float64') and format specifier ('%f,%f'). I load my array with dtype=(float, float), but I can't save it as such.
I don't understand your input array. Does it have 2 columns? Can you post it?
Sure, points[:3] gives: [[ 5.15317040e+01 -3.31830000e-02] [ 5.10514740e+01 -4.04532300e+00] [ 5.38018130e+01 -1.77162300e+00]]
Try fmt='%f %f' instead of fmt=('%f', '%f').
Unfortunately no difference. If it helps, this is how I load it: points = np.genfromtxt(path_to_csv, delimiter=',', skip_header=1, usecols=(0,1), dtype=(float, float))
1

First, save the arrays one at a time to avoid memory problems.

Let's consider three solutions:

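import csv
import pickle
import numpy as np

a = np.random.rand(10000, 7)  # np.rand does not exist; the generator lives in np.random

def testfile():
    with open('test.csv', 'w', newline='') as f:
        w = csv.writer(f)
        w.writerows(a)

def testsavetxt():
    # savetxt separates values with spaces unless a delimiter is given
    np.savetxt('test.csv', a, delimiter=',')

def testpickle():
    with open('test.pickle', 'wb') as f:
        pickle.dump(a, f)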

Some tests:

In [43]: %timeit testfile()
1 loops, best of 3: 576 ms per loop

In [44]: %timeit testsavetxt()
1 loops, best of 3: 442 ms per loop

In [45]: %timeit testpickle()
100 loops, best of 3: 12.3 ms per loop

So savetxt is slightly faster.

If CSV is not a requirement, pickle offers a binary protocol, which is about 40x faster.
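
If the binary route is an option, reading the data back is the other half of the round trip. A minimal sketch, with dump_array and load_array as made-up helper names:

```python
import pickle
import numpy as np

def dump_array(path, arr):
    """Write one array to disk with pickle's binary protocol."""
    with open(path, "wb") as f:
        pickle.dump(arr, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_array(path):
    """Read back an array written by dump_array."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Keep in mind that a pickle file can only be read back from Python and should only be loaded from a trusted source, so it does not replace a CSV that other tools need to open.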

3 Comments

I like the np.savetxt option. What would you suggest I do to handle several arrays: use the same np.column_stack((...))? I'm having a bit of trouble using 'ab' mode.
What you do is not very readable, and you will have to do the same when loading. Perhaps pandas can help you collect your data?
That would mean converting the data to pandas, however. I think I will go with: np.savetxt(out_csv, np.column_stack((id_labels[udist.row], id_labels[udist.col], points[udist.row], points[udist.col], udist.data))) Thanks
0

Not necessarily fast, but:

import numpy as np
arr1 = np.array([1,2,3,4])
arr2 = np.array([11,12,13,14])
arr3 = np.array([21,22,23,24])
numpy_arrays = [arr1, arr2, arr3]

with open(out_csv, "w") as f:
    for values in zip(*numpy_arrays): # or just zip(arr1, arr2, arr3)
        # join avoids leaving a trailing comma at the end of each row
        f.write(",".join(str(value) for value in values) + "\n")

This won't use up much more memory than just the memory needed for your separate arrays.
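
For the layout in the question, where the two points arrays have two columns each, the same row-by-row idea might look like the sketch below. It hands csv.writer a generator, so no combined array is built. The function name stream_rows and its parameter names are hypothetical.

```python
import csv
import numpy as np

def stream_rows(path, header, ids_a, ids_b, pts_a, pts_b, dist):
    """Write one CSV row per element without stacking the inputs (hypothetical helper)."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(header)
        # zip yields one row at a time; float() turns NumPy scalars into
        # plain Python floats before they are written.
        w.writerows(
            (a, b, float(pa[0]), float(pa[1]), float(pb[0]), float(pb[1]), float(d))
            for a, b, pa, pb, d in zip(ids_a, ids_b, pts_a, pts_b, dist)
        )
```

This trades speed for memory: the loop runs in Python for every row, so on 30 million rows it is likely to be slower than a vectorised write.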
