
I'm getting data from sensors and storing it in HDF5 files using h5py. The sensor data comes in as a bytes object, which I convert into a structured array with numpy. Then I write the structured array into the HDF5 file. This all works as intended.

Now I want to read the data back from the HDF5 file, and I'm only interested in some parts of it, for example a single column. The problem is that writing the structured numpy array directly to the HDF5 file stores all the data as a single block of records, for example with shape (10,). The default chunk size is set to (256,), which means each chunk holds 256 rows with all of their columns. Reading one column therefore still reads every column, and this gets really slow as the number of columns increases.

Is there a way to modify the data or change the chunking parameters so I can read a single column of data instead of a whole block in each chunk?

A minimal example of what I'm using is shown below:

import h5py
import ctypes
import numpy as np

class SensorStruct(ctypes.Structure):
    _pack_ = 4
    _fields_ = [('tc_time',ctypes.c_int64),
                ('pc_time',ctypes.c_double),
                ('nSample', ctypes.c_ushort),
                ('fMean', ctypes.c_float),
                ('fLowerbound', ctypes.c_float),
                ('fUpperbound', ctypes.c_float)]

def CreateFile(filename):
    #Create a new HDF5 file
    with h5py.File(filename, 'w', libver='latest') as f:
        f.swmr_mode = True

def AddDataset(filename, dsetname, struct):
    #Add a dataset to an existing HDF5 file
    with h5py.File(filename, 'r+', libver='latest', swmr=True) as f:
        f.create_dataset(dsetname, 
                         dtype = struct, 
                         shape = (0,), #Shape will update each time data is added
                         maxshape = (60480000,),
                         chunks = True, #Need to modify this somehow
                         compression = 'gzip')

def WriteData(filename, data):
    #Append new data to an existing dataset
    #Note: dsetname is read from the module-level variable defined below
    with h5py.File(filename, 'r+', libver='latest', swmr=True) as f:
        dset = f[dsetname]

        length = dset.shape[0]
        maxlength = dset.maxshape[0]

        newlength = length + len(data)
        if newlength <= maxlength: #Data that exactly fills the dataset still fits
            dset.resize((newlength,))
            dset[length:newlength] = data

filename = 'TESTFILE.h5'
dsetname = 'Sensor1'
struct_dt = np.dtype(SensorStruct)

#Rawdata comes in from a sensor every few seconds, returns as bytes object
rawdata1 = b"\x15\xcd[\x07\x00\x00\x00\x00 x\x81BA\x02\xd7A\x00\x00\x00\x00'u\x1fA\xf4Q\x01AbY?A\x16\xcd[\x07\x00\x00\x00\x00 x\x81BA\x02\xd7A\x01\x00\x00\x00\x1c\xe4&A[\x85\x0bA\x97\x96=A\x17\xcd[\x07\x00\x00\x00\x00 x\x81BA\x02\xd7A\x02\x00\x00\x00\xf6\x8b\x02A\xe5\xd5\xd5@\xc1Y\x1bA\x18\xcd[\x07\x00\x00\x00\x00 x\x81BA\x02\xd7A\x03\x00\x00\x00 \xec9A?W\x17A\xd0vRA\x19\xcd[\x07\x00\x00\x00\x00 x\x81BA\x02\xd7A\x04\x00\x00\x00\xf2\t/A\x83U\x19A\r&[A\x1a\xcd[\x07\x00\x00\x00\x00 x\x81BA\x02\xd7A\x05\x00\x00\x00s\x8a\x18A\xb0\x19\x04A\xc6\xb51A\x1b\xcd[\x07\x00\x00\x00\x00 x\x81BA\x02\xd7A\x06\x00\x00\x00P\xb6>A6\xc5 A)erA\x1c\xcd[\x07\x00\x00\x00\x00 x\x81BA\x02\xd7A\x07\x00\x00\x00\xe5e\x11A\x17\x9c\xff@^\xbf5A\x1d\xcd[\x07\x00\x00\x00\x00 x\x81BA\x02\xd7A\x08\x00\x00\x00*\xbe\x19At\xd5\x04AN\x919A\x1e\xcd[\x07\x00\x00\x00\x00 x\x81BA\x02\xd7A\t\x00\x00\x00\xa2* A(-\x03AE\xedFA"
rawdata2 = b'\x1f\xcd[\x07\x00\x00\x00\x00 x\x01EA\x02\xd7A\n\x00\x00\x00\xb6\x89&A\xd7\x8f\x07A\xe9\x00SA \xcd[\x07\x00\x00\x00\x00 x\x01EA\x02\xd7A\x0b\x00\x00\x00\x91I\xfd@*\\\xdc@<\x17!A!\xcd[\x07\x00\x00\x00\x00 x\x01EA\x02\xd7A\x0c\x00\x00\x00,q\x12A\x81\x1f\xfa@\x81\xfe(A"\xcd[\x07\x00\x00\x00\x00 x\x01EA\x02\xd7A\r\x00\x00\x00\x04@\x1cA\x03p\x05A\x05\xb03A#\xcd[\x07\x00\x00\x00\x00 x\x01EA\x02\xd7A\x0e\x00\x00\x00\xad\x89:A8h#A\xab\x88SA$\xcd[\x07\x00\x00\x00\x00 x\x01EA\x02\xd7A\x0f\x00\x00\x00I\x0f\xf5@\x15\xaa\xca@Rk\x0cA%\xcd[\x07\x00\x00\x00\x00 x\x01EA\x02\xd7A\x10\x00\x00\x00\xab\xeb\x1dA\x86 \x05A\x1807A&\xcd[\x07\x00\x00\x00\x00 x\x01EA\x02\xd7A\x11\x00\x00\x00Q\xda3A\xdc\xa6\x1cAT)ZA\'\xcd[\x07\x00\x00\x00\x00 x\x01EA\x02\xd7A\x12\x00\x00\x00U\xb3=A\xae\xcb\x1aA\xebmQA(\xcd[\x07\x00\x00\x00\x00 x\x01EA\x02\xd7A\x13\x00\x00\x00\x8f\x82\x0cA\x11\x15\xf3@$]&A'
data1 = np.frombuffer(rawdata1, dtype=SensorStruct)
data2 = np.frombuffer(rawdata2, dtype=SensorStruct)

CreateFile(filename)
AddDataset(filename, dsetname, SensorStruct)
WriteData(filename, data1)
WriteData(filename, data2)

Here I'm trying to read a single column of data:

import time
t0 = time.time()

with h5py.File(filename, 'r', libver='latest', swmr=True) as f:
    dset = f[dsetname]

    #Optimize chunking so I can read one column
    #My real dataset contains hundreds of columns and millions of rows
    #So this minimal example may look slightly trivial
    print('Chunksize: {}'.format(dset.chunks))
    t = dset['pc_time']

print('Reading the time column took {} seconds'.format(time.time()-t0))
  • I thought chunking was set at dataset creation, not at read time, but I haven't focused on it. For faster access to fields you should probably put them in separate datasets. Commented Dec 6, 2018 at 14:09
  • Yes, chunking is defined at dataset creation; that's why I also included the part where I create the file. I can still edit it now. Commented Dec 6, 2018 at 14:21

1 Answer

In [551]: dt = np.dtype([('a',int),('b','uint8'),('c','float32'),('d','float64')])
In [552]: x = np.ones(10, dt)
In [553]: x.dtype
Out[553]: dtype([('a', '<i8'), ('b', 'u1'), ('c', '<f4'), ('d', '<f8')])
In [554]: x.itemsize
Out[554]: 21
In [555]: x.__array_interface__
Out[555]: 
{'data': (40185408, False),
 'strides': None,
 'descr': [('a', '<i8'), ('b', '|u1'), ('c', '<f4'), ('d', '<f8')],
 'typestr': '|V21',
 'shape': (10,),
 'version': 3}

Each record of this array takes up 21 bytes, hence the 'V21' typestr.

In [557]: f = h5py.File('vtype.h5','w')
In [558]: ds = f.create_dataset('data', data=x)
In [559]: ds
Out[559]: <HDF5 dataset "data": shape (10,), type "|V21">
In [560]: ds.dtype
Out[560]: dtype([('a', '<i8'), ('b', 'u1'), ('c', '<f4'), ('d', '<f8')])

In h5dump this dataset displays as

  DATATYPE  H5T_COMPOUND {
     H5T_STD_I64LE "a";
     H5T_STD_U8LE "b";
     H5T_IEEE_F32LE "c";
     H5T_IEEE_F64LE "d";
  }

The docs on chunking show a chunking tuple with the same number of elements as the array's shape.

http://docs.h5py.org/en/stable/high/dataset.html#chunked-storage

Here I created a 1d array, so chunking, if specified, only applies to that dimension, not to the Compound datatype.
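As a minimal sketch of that point (the file name, the chunk length of 1024 and the in-memory `core` driver are my own choices for illustration), the chunk shape of a 1d compound dataset has a single element, and every chunk holds whole records:

```python
import h5py
import numpy as np

dt = np.dtype([('a', 'i8'), ('b', 'u1'), ('c', 'f4'), ('d', 'f8')])

# In-memory file (core driver without a backing store), so nothing hits the disk.
with h5py.File('chunk_demo.h5', 'w', driver='core', backing_store=False) as f:
    ds = f.create_dataset('data', shape=(0,), maxshape=(None,),
                          dtype=dt, chunks=(1024,))
    chunk_shape = ds.chunks                        # one entry per dataset dimension
    bytes_per_chunk = chunk_shape[0] * dt.itemsize # rows per chunk * bytes per record

print('chunk shape:', chunk_shape)
print('uncompressed bytes per chunk:', bytes_per_chunk)
```

Because the chunk size in bytes grows with the record size, a dtype with hundreds of fields makes each chunk correspondingly larger for the same number of rows.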

For numpy arrays, accessing a single field of a structured array is relatively fast, comparable to accessing a column of a 2d array or a standalone 1d array. It is a view.
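A small numpy-only check of the view behaviour, using the same field layout as above (the variable names are only for illustration):

```python
import numpy as np

dt = np.dtype([('a', 'i8'), ('b', 'u1'), ('c', 'f4'), ('d', 'f8')])
x = np.ones(10, dt)

col = x['a']                       # field access returns a view, nothing is copied
shares = np.shares_memory(col, x)  # the view points into x's buffer
strides = col.strides              # the view steps over whole 21-byte records

col[0] = 42                        # writing through the view changes the parent array
first = int(x[0]['a'])

print(shares, strides, first)
```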

But loading from an HDF5 dataset always makes a copy. With this small example, loading ds[:] is faster than ds['a'], and ds[:n]['a'] is faster than ds['a'][:n].
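A rough way to repeat that comparison yourself is sketched below. The array length, `n`, the repeat count and the in-memory file are arbitrary assumptions on my part, and the timings will depend on the machine, the dtype and the chunking, so no numbers are quoted here:

```python
import timeit

import h5py
import numpy as np

dt = np.dtype([('a', 'i8'), ('b', 'u1'), ('c', 'f4'), ('d', 'f8')])
x = np.ones(100_000, dt)
n = 1000

# In-memory file so the comparison does not depend on disk speed.
with h5py.File('timing_demo.h5', 'w', driver='core', backing_store=False) as f:
    ds = f.create_dataset('data', data=x, chunks=True)

    # Both expressions return the same values; they differ in how much is read.
    same = np.array_equal(ds[:n]['a'], ds['a'][:n])

    # Read n whole records, then pick the field in numpy.
    t_rows_first = timeit.timeit(lambda: ds[:n]['a'], number=20)
    # Read the field for every row, then slice in numpy.
    t_field_first = timeit.timeit(lambda: ds['a'][:n], number=20)

print('identical results:', same)
print('ds[:n]["a"]: {:.4f} s'.format(t_rows_first))
print('ds["a"][:n]: {:.4f} s'.format(t_field_first))
```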

I don't have a sense of how these timings compare with column access of a simple 2d array. And I don't know if the timings depend on the size of the dtype.
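If column reads dominate, one option is the layout suggested in the comments on the question: one dataset per field. The sketch below is only an illustration of that idea, not something I have timed. The group name `Sensor1` is taken from the question, while the file name and the use of `chunks=True` are my assumptions.

```python
import h5py
import numpy as np

dt = np.dtype([('a', 'i8'), ('b', 'u1'), ('c', 'f4'), ('d', 'f8')])
x = np.ones(10, dt)
x['a'] = np.arange(10)

with h5py.File('columns_demo.h5', 'w', driver='core', backing_store=False) as f:
    grp = f.create_group('Sensor1')
    # One resizable 1d dataset per field, each with its own chunks.
    for name in x.dtype.names:
        grp.create_dataset(name, data=x[name], maxshape=(None,), chunks=True)

    names = sorted(grp.keys())
    a = grp['a'][:]   # reads only the chunks belonging to field 'a'

print(names)
print(a)
```

The trade-off is that appending a batch of records now means resizing and writing every per-field dataset, and reading a full row touches all of them.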
