
I have a dataset that is too large to read into memory directly, and I don't want to upgrade the machine. From my reading, HDF5 may be a suitable solution for my problem, but I am not sure how to write the data to an HDF5 file iteratively, since I cannot load the CSV file as a single dataframe object.

So my question is: how do I write a large CSV file into an HDF5 file with Python pandas?

1 Answer


You can read the CSV file in chunks using the chunksize parameter and append each chunk to the HDF5 store:

import pandas as pd

hdf_key = 'hdf_key'
df_cols_to_index = [...]  # list of columns (labels) that should be indexed

store = pd.HDFStore(hdf_filename)

for chunk in pd.read_csv(csv_filename, chunksize=500000):
    # don't index data columns in each iteration - we'll do it later ...
    store.append(hdf_key, chunk, data_columns=df_cols_to_index, index=False)

# index data columns in the HDFStore once, after all chunks have been appended
store.create_table_index(hdf_key, columns=df_cols_to_index, optlevel=9, kind='full')
store.close()
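
Once the store has been written and indexed this way, it can also be read back in chunks, so the data never has to fit in memory at once. A minimal sketch, assuming the same hdf_filename and hdf_key as above, and that 'some_col' is one of the indexed data columns (both the column name and the process() helper are placeholders):

import pandas as pd

with pd.HDFStore(hdf_filename, mode='r') as store:
    # select() streams the table back in chunks; the where clause works
    # because 'some_col' was passed via data_columns and then indexed
    for chunk in store.select(hdf_key, where='some_col > 0', chunksize=500000):
        process(chunk)  # placeholder for whatever per-chunk processing is needed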

7 Comments

Thanks for the answer. I am not familiar with the pytables package. Is it possible to use h5py?
Pandas implements its own HDF API on top of PyTables - we should use that API for compatibility reasons...
@YanSong, but frankly speaking I don't understand what's wrong with using the internal Pandas methods that are based on PyTables - you don't need to know anything about PyTables in order to use the Pandas HDF methods...
If the number of columns exceeds 2000, this approach will fail.
@G_KOBELIEF please describe how the failure presents itself. Thanks!