I am storing a large text file (10 GB, N rows and 4 columns) in an HDF5 file using the h5py package, primarily because I do not want to load it all into RAM.
I would like to sort the rows in the file based on the second column. Any suggestions on how to do that?
I also heard that it can be done in chunks. Any help on that, please?
Thanks!
Instead of h5py, use PyTables (aka tables). It has optimized sort and search algorithms. Both packages can create and operate on an HDF5 file. (Obviously, you will have to read your text data into the HDF5 file first. There are other SO posts that show how to do that.)
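As a rough sketch of what that could look like: the snippet below indexes one column of a PyTables table and writes a sorted copy into the same file. The column names (`c1` to `c4`), the table name `data`, and the helper names `write_demo_table` and `sort_table_by_column` are made up for illustration; your real layout depends on how you loaded the text file. It assumes the data is stored as a PyTables `Table`, not as a plain h5py dataset.

```python
import numpy as np
import tables

# Hypothetical 4-column row layout; the real column names and types
# depend on the contents of the text file.
ROW_DTYPE = np.dtype([("c1", "f8"), ("c2", "f8"), ("c3", "f8"), ("c4", "f8")])


def write_demo_table(path, rows, table_name="data"):
    """Create a small table so the sort below has something to work on."""
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", table_name, description=ROW_DTYPE)
        table.append(rows)


def sort_table_by_column(path, table_name="data", sort_col="c2"):
    """Write a copy of /<table_name>, sorted by sort_col, into the same file.

    Returns the number of rows in the sorted copy.
    """
    with tables.open_file(path, mode="a") as h5:
        table = h5.get_node("/", table_name)
        col = getattr(table.cols, sort_col)
        # sortby needs a completely sorted index (CSI) on the column.
        if not col.is_indexed:
            col.create_csindex()
        # copy() with sortby writes the rows out in index order.
        sorted_table = table.copy(
            newname=table_name + "_sorted",
            sortby=sort_col,
            checkCSI=True,
            overwrite=True,
        )
        return sorted_table.nrows
```

If you only need to read the rows in order and not store a sorted copy, `table.itersorted(sort_col)` iterates over the same index once it has been created.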