I have a NumPy array with shape (8000000, 7).
I want to keep the first 6 columns as float32 and the last column as int8.
In the end, I want to save it as a CSV file.
How can I do this?
You could construct a structured array, but I wonder if you need to, especially if all you want is a csv file. The fmt parameter controls how savetxt writes the columns.
First with the default fmt and column_stack:
In [1484]: a=np.random.rand(5,3)
In [1485]: b=np.arange(5,dtype=np.int8)
In [1486]: np.savetxt('test.txt',np.column_stack((a,b)))
In [1487]: cat test.txt
3.513972543477327237e-01 8.468274950931957701e-01 6.587019305719005180e-01 0.000000000000000000e+00
...
With a simpler float format:
In [1492]: np.savetxt('test.txt',np.column_stack((a,b)),fmt='%f')
In [1493]: cat test.txt
0.351397 0.846827 0.658702 0.000000
0.566257 0.419570 0.183939 1.000000
0.276351 0.341277 0.706639 2.000000
0.515183 0.296801 0.321054 3.000000
0.305349 0.407097 0.328825 4.000000
Or by specifying format for each column:
In [1496]: np.savetxt('test.txt',np.column_stack((a,b)),fmt=['%f']*3+['%d'])
In [1497]: cat test.txt
0.351397 0.846827 0.658702 0
0.566257 0.419570 0.183939 1
0.276351 0.341277 0.706639 2
0.515183 0.296801 0.321054 3
0.305349 0.407097 0.328825 4
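Applied to the layout in the question (six float columns plus one integer column) and written with commas instead of spaces, the per-column fmt approach might look like the sketch below. The random data, the reduced row count, the '%.6f' precision and the file name data.csv are assumptions made for illustration only.

```python
import numpy as np

# Stand-in for the real (8000000, 7) array; 1000 rows keeps the demo small.
rng = np.random.default_rng(0)
data = rng.random((1000, 7)).astype(np.float32)
data[:, 6] = rng.integers(-128, 128, size=1000)  # last column holds int8-range values

# Six float columns followed by one integer column, comma separated.
fmt = ['%.6f'] * 6 + ['%d']
np.savetxt('data.csv', data, fmt=fmt, delimiter=',')

# Read the first line back to check the layout.
with open('data.csv') as fh:
    first = fh.readline().strip().split(',')
print(first)
```

The array stays a plain float32 array in memory; only the text written to the file distinguishes the integer column.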
==============================
A nice way of constructing a structured array with data like this is to define 2 fields, and make the first an array. (Note that 'i8' below is an 8-byte integer, i.e. int64; the code for int8 would be 'i1'.)
In [1503]: dt=np.dtype('(3)f,i8')
In [1504]: A=np.empty((5,),dtype=dt)
In [1505]: A['f0']=a
In [1506]: A['f1']=b
In [1507]: A
Out[1507]:
array([([0.35139724612236023, 0.846827507019043, 0.6587019562721252], 0),
([0.566256582736969, 0.41956955194473267, 0.18393920361995697], 1),
([0.27635079622268677, 0.3412773013114929, 0.706638514995575], 2),
([0.5151825547218323, 0.29680076241493225, 0.32105395197868347], 3),
([0.30534881353378296, 0.4070965051651001, 0.3288247585296631], 4)],
dtype=[('f0', '<f4', (3,)), ('f1', '<i8')])
Unfortunately savetxt can't handle that kind of 'nested' dtype. The best I can do is format the first field with %s, which writes it as a string with its surrounding []:
In [1509]: np.savetxt('test.txt',A,fmt=['%s','%d'])
In [1511]: cat test.txt
[ 0.35139725 0.84682751 0.65870196] 0
[ 0.56625658 0.41956955 0.1839392 ] 1
[ 0.2763508 0.3412773 0.70663851] 2
[ 0.51518255 0.29680076 0.32105395] 3
[ 0.30534881 0.40709651 0.32882476] 4
Instead I need a flat dtype. Since it has the same byte layout, I can apply it with a view (or construct the array from scratch):
In [1512]: dt1=np.dtype('f,f,f,i8')
In [1514]: A.view(dt1)
Out[1514]:
array([(0.35139724612236023, 0.846827507019043, 0.6587019562721252, 0),
(0.566256582736969, 0.41956955194473267, 0.18393920361995697, 1),
(0.27635079622268677, 0.3412773013114929, 0.706638514995575, 2),
(0.5151825547218323, 0.29680076241493225, 0.32105395197868347, 3),
(0.30534881353378296, 0.4070965051651001, 0.3288247585296631, 4)],
dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<i8')])
Now I can write it with the same fmt as before:
In [1515]: np.savetxt('test.txt',A.view(dt1),fmt=['%f']*3+['%d'])
In [1516]: cat test.txt
0.351397 0.846828 0.658702 0
0.566257 0.419570 0.183939 1
0.276351 0.341277 0.706639 2
0.515183 0.296801 0.321054 3
0.305349 0.407097 0.328825 4
If one or more of your columns were strings, then you would need to use a structured array. But as long as all the columns are numbers, you can get by with an all-float array and control the output with fmt.
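If the separate dtypes do need to be kept in memory, a flat structured array for the question's layout could be built along these lines. This is only a sketch: the field names c0..c5 and label, the small random data and the file name rec.csv are all made up for the example.

```python
import numpy as np

# Hypothetical small stand-ins for the real columns.
floats = np.random.rand(5, 6).astype(np.float32)
labels = np.arange(5, dtype=np.int8)

# Flat dtype: six float32 fields (c0..c5) and one int8 field.
dt = np.dtype([('c%d' % i, np.float32) for i in range(6)] + [('label', np.int8)])
rec = np.empty(len(floats), dtype=dt)
for i in range(6):
    rec['c%d' % i] = floats[:, i]
rec['label'] = labels

print(rec.dtype.itemsize)  # 25 bytes per row: 6*4 for the floats + 1 for the int8

# One format per field, as with the plain 2-D array.
np.savetxt('rec.csv', rec, fmt=['%f'] * 6 + ['%d'], delimiter=',')
```

Compared with an all-float32 array (28 bytes per row), this saves a few bytes per row, which may or may not matter at 8 million rows.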
I thought it would be relatively easy to break up the array into floats and ints and then use a combination of zip and np.savetxt to put it all back together in the csv. But "Support zip input in savetxt in Python 3" suggests that way lies madness.
However, being stuck on the zip idea, I just moved the work to the standard csv module. Since numpy data needs to be converted to Python types, it may be a bit slower. But we're talking about csv writing here, so hopefully it's just lost in the noise.
First, generate the test array
>>> import numpy as np
>>> array = np.arange(0., 18.*5, 5., dtype=np.float32).reshape((3,6))
>>> array
array([[ 0., 5., 10., 15., 20., 25.],
[ 30., 35., 40., 45., 50., 55.],
[ 60., 65., 70., 75., 80., 85.]], dtype=float32)
Split out the final column and recast as uint8
>>> floats, ints, _after = np.hsplit(array, (5,6))
>>> ints=ints.astype(np.uint8)
>>> floats
array([[ 0., 5., 10., 15., 20.],
[ 30., 35., 40., 45., 50.],
[ 60., 65., 70., 75., 80.]], dtype=float32)
>>> ints
array([[25],
[55],
[85]], dtype=uint8)
Use the python csv module to do the writes. You need to cast the zipped array rows to tuples and add them together to go from np.array to python data types.
>>> import csv
>>> with open('test.csv', 'w', newline='') as fh:
...     csv.writer(fh).writerows(tuple(f)+tuple(i) for f,i in zip(floats, ints))
...
>>> print(open('test.csv').read())
0.0,5.0,10.0,15.0,20.0,25
30.0,35.0,40.0,45.0,50.0,55
60.0,65.0,70.0,75.0,80.0,85
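A variation on the same idea, offered as a sketch rather than a benchmarked improvement: tolist() converts each array to nested lists of plain Python floats and ints in one call, so the rows can be concatenated directly. The slicing used in place of hsplit and the file name test_tolist.csv are choices made for this example.

```python
import csv
import numpy as np

array = np.arange(0., 18. * 5, 5., dtype=np.float32).reshape((3, 6))
floats = array[:, :5]                  # first five columns stay float32
ints = array[:, 5:].astype(np.uint8)   # last column, kept 2-D so rows concatenate

# newline='' is what the csv module documentation recommends for csv.writer files.
with open('test_tolist.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    # tolist() yields nested lists of Python floats and ints.
    writer.writerows(f + i for f, i in zip(floats.tolist(), ints.tolist()))

# Read the file back to inspect what was written.
with open('test_tolist.csv', newline='') as fh:
    rows = list(csv.reader(fh))
print(rows)
```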
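Well, you can construct the dtype and then use zeros or empty to get an empty shell ready for data. Hopefully this will give you a few ideas. (The example below creates seven float fields plus one int8 field; use range(6) for the layout in the question.)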
>>> import numpy as np
>>>
>>> flds = ["f{:0>{}}".format(i,2) for i in range(7)]
>>> dt = [(fld, 'float32') for fld in flds]
>>> dt.append(('i01', 'int8'))
>>> a = np.zeros((10,), dtype=dt)
>>> a
array([(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0)],
dtype=[('f00', '<f4'), ('f01', '<f4'), ('f02', '<f4'), ('f03', '<f4'), ('f04', '<f4'), ('f05', '<f4'), ('f06', '<f4'), ('i01', 'i1')])
>>>
Experiment with this example function:
def num_45():
    """(num_45)...
    """
    import numpy as np
    flds = ["f{:0>{}}".format(i,2) for i in range(7)]
    dt = [(fld, 'float32') for fld in flds]
    dt.append(('i01', 'int8'))
    a = np.zeros((10,), dtype=dt)
    b = np.arange(10*8).reshape(10,8)
    c = np.copy(a)
    names = a.dtype.names
    N = len(names)
    for i in range(N):
        c[names[i]] = b[:,i]
    return a, b, c
Result
>>> a
array([(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0)],
dtype=[('f00', '<f4'), ('f01', '<f4'), ('f02', '<f4'), ('f03', '<f4'), ('f04', '<f4'), ('f05', '<f4'), ('f06', '<f4'), ('i01', 'i1')])
>>> b
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55],
[56, 57, 58, 59, 60, 61, 62, 63],
[64, 65, 66, 67, 68, 69, 70, 71],
[72, 73, 74, 75, 76, 77, 78, 79]])
>>> c
array([(0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7),
(8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15),
(16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23),
(24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31),
(32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39),
(40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47),
(48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55),
(56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63),
(64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71),
(72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79)],
dtype=[('f00', '<f4'), ('f01', '<f4'), ('f02', '<f4'), ('f03', '<f4'), ('f04', '<f4'), ('f05', '<f4'), ('f06', '<f4'), ('i01', 'i1')])
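The field-by-field copy loop can also be replaced by a helper from numpy.lib.recfunctions. The sketch below assumes NumPy 1.16 or later, where unstructured_to_structured is available; the name c2 is just for this example.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

flds = ["f{:0>{}}".format(i, 2) for i in range(7)]
dt = np.dtype([(fld, 'float32') for fld in flds] + [('i01', 'int8')])
b = np.arange(10 * 8).reshape(10, 8)

# One column of b per field of dt; values are cast to each field's type.
c2 = rfn.unstructured_to_structured(b, dt)
print(c2.dtype.names)
print(c2[1])
```

The result has one record per row of b, with the same field names and types as c above.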
Another example with a few lines of manual code to see the construction
n = ['It', 'is', 'easy']
dt = [(n[0], '<f8'), (n[1], '<i8'), (n[2], 'U5')]
d = np.zeros((10,), dtype=dt)
for i in range(len(n)):
    d[n[i]] = b[:, i]
yields
>>> d.dtype.names
('It', 'is', 'easy')
>>> d.reshape(10,-1)
array([[(0.0, 1, '2')],
[(8.0, 9, '10')],
[(16.0, 17, '18')],
[(24.0, 25, '26')],
[(32.0, 33, '34')],
[(40.0, 41, '42')],
[(48.0, 49, '50')],
[(56.0, 57, '58')],
[(64.0, 65, '66')],
[(72.0, 73, '74')]],
dtype=[('It', '<f8'), ('is', '<i8'), ('easy', '<U5')])
Do you need uint8 for calculations, or do you just want to restrict the final csv to 0..255? If the latter, what happens to larger numbers? How do you convert from float to int?
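One possible answer to the conversion question, given here as an assumption about what is wanted rather than something stated in the thread: round to the nearest integer and clamp to the target range before casting, so out-of-range values saturate instead of producing an unpredictable result. The sample values are invented, and int8 is used to match the question.

```python
import numpy as np

col = np.array([-300.7, -1.5, 0.4, 126.6, 999.0], dtype=np.float32)

# Round to nearest (ties go to the even integer), then clamp to the
# int8 range before casting so nothing falls outside -128..127.
info = np.iinfo(np.int8)
safe = np.clip(np.rint(col), info.min, info.max).astype(np.int8)
print(safe)
```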