Assigning different data types for different columns in a numpy array

Question

I have a numpy array, with (8000000, 7) shape.

I want to keep the first 6 columns of the numpy array as float32 data type, and last column as int8 type.

And at the end, I want to save it as a csv file.

How can I manage this?

Do you need uint8 for calculations, or do you just want to restrict the final csv to 0..255? If the latter, what happens to larger numbers? How do you convert from float to int?
– tdelaney, Commented Oct 25, 2016 at 20:19
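
One possible answer to the conversion question raised in the comment, as a minimal sketch: round to the nearest integer, clamp to the target range, then cast. The sample values and the names `col`, `info`, and `safe` are made up for illustration, and int8 (range -128..127) is assumed because that is what the question asks for.

```python
import numpy as np

# Hypothetical last-column values stored as floats; two lie outside int8's range.
col = np.array([0.4, 1.6, -3.2, 200.0, -500.0], dtype=np.float32)

# Round to the nearest integer, clamp to [-128, 127], then cast to int8.
info = np.iinfo(np.int8)
safe = np.clip(np.rint(col), info.min, info.max).astype(np.int8)

print(safe)
```

Clamping before the cast avoids relying on whatever the platform does with out-of-range float-to-integer conversions.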

hpaulj · Accepted Answer · 2016-10-25 20:13:11Z

You could construct a structured array, but I wonder if you need to, especially if all you want is a csv file. The fmt parameter controls how savetxt writes the columns.

First with the default fmt and column_stack:

In [1484]: a=np.random.rand(5,3)
In [1485]: b=np.arange(5,dtype=np.int8)

In [1486]: np.savetxt('test.txt',np.column_stack((a,b)))
In [1487]: cat test.txt
3.513972543477327237e-01 8.468274950931957701e-01 6.587019305719005180e-01 0.000000000000000000e+00
...

With a simpler float format:

In [1492]: np.savetxt('test.txt',np.column_stack((a,b)),fmt='%f')
In [1493]: cat test.txt
0.351397 0.846827 0.658702 0.000000
0.566257 0.419570 0.183939 1.000000
0.276351 0.341277 0.706639 2.000000
0.515183 0.296801 0.321054 3.000000
0.305349 0.407097 0.328825 4.000000

Or by specifying format for each column:

In [1496]: np.savetxt('test.txt',np.column_stack((a,b)),fmt=['%f']*3+['%d'])
In [1497]: cat test.txt
0.351397 0.846827 0.658702 0
0.566257 0.419570 0.183939 1
0.276351 0.341277 0.706639 2
0.515183 0.296801 0.321054 3
0.305349 0.407097 0.328825 4
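
Applied to the layout in the question (6 float columns plus 1 integer column, comma separated), the per-column fmt approach might look like the sketch below. The array contents, the names `floats`, `labels`, and `stacked`, and the in-memory buffer used instead of a file are assumptions for illustration only.

```python
import io
import numpy as np

rng = np.random.default_rng(0)
floats = rng.random((4, 6)).astype(np.float32)   # stand-in for the 6 float32 columns
labels = np.arange(4, dtype=np.int8)             # stand-in for the int8 column

# column_stack produces one all-float array; fmt decides how each column is printed.
stacked = np.column_stack((floats, labels))

buf = io.StringIO()                              # a real run would pass a filename
np.savetxt(buf, stacked, fmt=['%.6f'] * 6 + ['%d'], delimiter=',')
text = buf.getvalue()
print(text)
```

Without `delimiter=','` the columns are separated by spaces, as in the outputs above.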

==============================

A nice way of constructing a structured array with data like this is to define 2 fields, and make the first an array:

In [1503]: dt=np.dtype('(3)f,i8')
In [1504]: A=np.empty((5,),dtype=dt)
In [1505]: A['f0']=a
In [1506]: A['f1']=b
In [1507]: A
Out[1507]: 
array([([0.35139724612236023, 0.846827507019043, 0.6587019562721252], 0),
       ([0.566256582736969, 0.41956955194473267, 0.18393920361995697], 1),
       ([0.27635079622268677, 0.3412773013114929, 0.706638514995575], 2),
       ([0.5151825547218323, 0.29680076241493225, 0.32105395197868347], 3),
       ([0.30534881353378296, 0.4070965051651001, 0.3288247585296631], 4)], 
      dtype=[('f0', '<f4', (3,)), ('f1', '<i8')])

Unfortunately savetxt can't handle that kind of 'nested' dtype. The best I can do is format the first field as a string, which keeps the [] brackets:

In [1509]: np.savetxt('test.txt',A,fmt=['%s','%d'])
In [1511]: cat test.txt
[ 0.35139725  0.84682751  0.65870196] 0
[ 0.56625658  0.41956955  0.1839392 ] 1
[ 0.2763508   0.3412773   0.70663851] 2
[ 0.51518255  0.29680076  0.32105395] 3
[ 0.30534881  0.40709651  0.32882476] 4

Instead I need a flat dtype. Because it has the same byte layout, I can apply it with a view (or construct the array from scratch):

In [1512]: dt1=np.dtype('f,f,f,i8')
In [1514]: A.view(dt1)
Out[1514]: 
array([(0.35139724612236023, 0.846827507019043, 0.6587019562721252, 0),
       (0.566256582736969, 0.41956955194473267, 0.18393920361995697, 1),
       (0.27635079622268677, 0.3412773013114929, 0.706638514995575, 2),
       (0.5151825547218323, 0.29680076241493225, 0.32105395197868347, 3),
       (0.30534881353378296, 0.4070965051651001, 0.3288247585296631, 4)], 
      dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<i8')])

Now I can write it with the same fmt as before:

In [1515]: np.savetxt('test.txt',A.view(dt1),fmt=['%f']*3+['%d'])
In [1516]: cat test.txt
0.351397 0.846828 0.658702 0
0.566257 0.419570 0.183939 1
0.276351 0.341277 0.706639 2
0.515183 0.296801 0.321054 3
0.305349 0.407097 0.328825 4
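
For the "construct from scratch" route, a minimal sketch follows. Note that in a dtype string 'i8' means an 8-byte integer (int64); the 1-byte int8 requested in the question is 'i1'. The field names, the name `label`, the small stand-in array `data`, and the in-memory buffer are assumptions, not part of the answer above.

```python
import io
import numpy as np

# Small stand-in for the (8000000, 7) array in the question.
data = np.arange(21, dtype=np.float32).reshape(3, 7)

# Flat structured dtype: six 4-byte floats followed by one 1-byte integer.
dt = np.dtype([('f%d' % i, 'f4') for i in range(6)] + [('label', 'i1')])

out = np.empty(data.shape[0], dtype=dt)
for i, name in enumerate(dt.names):
    out[name] = data[:, i]        # the last assignment casts float32 -> int8

buf = io.StringIO()               # a real run would pass a filename
np.savetxt(buf, out, fmt=['%f'] * 6 + ['%d'], delimiter=',')
print(buf.getvalue())
```

Each record takes 25 bytes here (6 × 4 + 1), compared with 28 bytes per row for an all-float32 array.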

If one or more of your columns were strings, then you would need to use a structured array. But as long as all the columns are numbers, you can get by with an all-float array and control the output with fmt.

The example: In [1496]: np.savetxt('test.txt',np.column_stack((a,b)),fmt=['%f']*3+['%d']) was great. Thanks!

tdelaney · Answer · 2016-10-25 21:53:11Z

I thought it would be relatively easy to break up the array into floats and ints and then use a combination of zip and np.savetxt to put it all back together in the csv. But the thread titled "Support zip input in savetxt in Python 3" suggests that way lies madness.

However, being stuck on the zip idea, I just moved the work to the standard csv module. Since numpy data needs to be converted to Python types, it may be a bit slower. But we're talking about csv writing here, so hopefully it's just lost in the noise.

First, generate the test array

>>> import numpy as np
>>> array = np.arange(0., 18.*5, 5., dtype=np.float32).reshape((3,6))
>>> array
array([[  0.,   5.,  10.,  15.,  20.,  25.],
       [ 30.,  35.,  40.,  45.,  50.,  55.],
       [ 60.,  65.,  70.,  75.,  80.,  85.]], dtype=float32)

Split out the final column and recast as uint8

>>> floats, ints, _after = np.hsplit(array, (5,6))
>>> ints=ints.astype(np.uint8)
>>> floats
array([[  0.,   5.,  10.,  15.,  20.],
       [ 30.,  35.,  40.,  45.,  50.],
       [ 60.,  65.,  70.,  75.,  80.]], dtype=float32)
>>> ints
array([[25],
       [55],
       [85]], dtype=uint8)

Use the python csv module to do the writes. You need to cast the zipped array rows to tuples and add them together to go from np.array to python data types.

>>> import csv
>>> # newline='' is what the csv module expects; the with block closes the file
>>> with open('test.csv', 'w', newline='') as fh:
...     csv.writer(fh).writerows(tuple(f)+tuple(i) for f,i in zip(floats, ints))
...
>>> print(open('test.csv').read())
0.0,5.0,10.0,15.0,20.0,25
30.0,35.0,40.0,45.0,50.0,55
60.0,65.0,70.0,75.0,80.0,85
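
A variation on the same idea, offered as a sketch rather than a benchmarked improvement: `.tolist()` converts whole arrays to plain Python floats and ints in one step before they reach the csv writer. The slicing used instead of hsplit, the variable names `f_row` and `i_val`, and the in-memory buffer are choices made for this illustration.

```python
import csv
import io
import numpy as np

array = np.arange(0., 18. * 5, 5., dtype=np.float32).reshape((3, 6))
floats = array[:, :5]                    # first five columns stay float32
ints = array[:, 5].astype(np.uint8)      # last column recast, 1-D this time

buf = io.StringIO(newline='')            # a real run would open a file instead
writer = csv.writer(buf)

# tolist() yields nested Python lists of float / int, so rows can be concatenated.
for f_row, i_val in zip(floats.tolist(), ints.tolist()):
    writer.writerow(f_row + [i_val])

print(buf.getvalue())
```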

NaN · Answer · 2016-10-25 20:02:25Z

Well, you can construct the dtype and then use zeros or empty to get an empty shell ready for data. Hopefully this will give you a few ideas:

>>> import numpy as np
>>> 
>>> flds = ["f{:0>{}}".format(i,2) for i in range(7)]
>>> dt = [(fld, 'float32') for fld in flds]
>>> dt.append(('i01', 'int8'))
>>> a = np.zeros((10,), dtype=dt)
>>> a
array([(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0)], 
      dtype=[('f00', '<f4'), ('f01', '<f4'), ('f02', '<f4'), ('f03', '<f4'), ('f04', '<f4'), ('f05', '<f4'), ('f06', '<f4'), ('i01', 'i1')])
>>>

Experiment with this example function:

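def num_45():
    """Build a structured array and fill it column by column.

    Returns the zero-filled shell `a`, the plain 2-D integer array `b`,
    and `c`, a copy of `a` whose fields are filled from the columns of `b`.
    """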
    import numpy as np
    flds = ["f{:0>{}}".format(i,2) for i in range(7)]
    dt = [(fld, 'float32') for fld in flds]
    dt.append(('i01', 'int8'))
    a = np.zeros((10,), dtype=dt)
    b = np.arange(10*8).reshape(10,8)
    c = np.copy(a)
    names = a.dtype.names
    N = len(names)
    for i in range(N):
        c[names[i]] = b[:,i]
    return a, b, c

Result

>>> a
array([(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0)], 
      dtype=[('f00', '<f4'), ('f01', '<f4'), ('f02', '<f4'), ('f03', '<f4'), ('f04', '<f4'), ('f05', '<f4'), ('f06', '<f4'), ('i01', 'i1')])
>>> b
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31],
       [32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47],
       [48, 49, 50, 51, 52, 53, 54, 55],
       [56, 57, 58, 59, 60, 61, 62, 63],
       [64, 65, 66, 67, 68, 69, 70, 71],
       [72, 73, 74, 75, 76, 77, 78, 79]])
>>> c
array([(0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7),
       (8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15),
       (16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23),
       (24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31),
       (32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39),
       (40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47),
       (48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55),
       (56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63),
       (64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71),
       (72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79)], 
      dtype=[('f00', '<f4'), ('f01', '<f4'), ('f02', '<f4'), ('f03', '<f4'), ('f04', '<f4'), ('f05', '<f4'), ('f06', '<f4'), ('i01', 'i1')])

Another example with a few lines of manual code to see the construction

n = ['It', 'is', 'easy']
dt = [(n[0], '<f8'), (n[1], '<i8'), (n[2], 'U5')]
d = np.zeros((10,), dtype=dt)
for i in range(len(n)):
    d[n[i]] = b[:, i]

yields

>>> d.dtype.names
('It', 'is', 'easy')
>>> d.reshape(10,-1)
array([[(0.0, 1, '2')],
       [(8.0, 9, '10')],
       [(16.0, 17, '18')],
       [(24.0, 25, '26')],
       [(32.0, 33, '34')],
       [(40.0, 41, '42')],
       [(48.0, 49, '50')],
       [(56.0, 57, '58')],
       [(64.0, 65, '66')],
       [(72.0, 73, '74')]], 
      dtype=[('It', '<f8'), ('is', '<i8'), ('easy', '<U5')])
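
To apply such a dtype to an array that already exists, one option in newer NumPy releases is `numpy.lib.recfunctions.unstructured_to_structured`. The sketch below assumes the 6 float32 + 1 int8 layout from the question; the array `existing`, the field names, and the name `label` are made up for illustration.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Small stand-in for an existing plain 2-D array with 7 columns.
existing = np.arange(14, dtype=np.float64).reshape(2, 7)

# Six float32 fields followed by one int8 field.
dt = np.dtype([('f%d' % i, np.float32) for i in range(6)] + [('label', np.int8)])

# Each column of `existing` becomes one field, cast to that field's type.
structured = rfn.unstructured_to_structured(existing, dtype=dt)

print(structured)
print(structured['label'])
```

The result has one record per row of the original array, so the shape goes from (2, 7) to (2,).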

Can you show me how to implement it to an existing array, which involves the datatype I have specified? Because your solution doesn't work.
