
I have a dataset that is too large to read into memory directly, and I don't want to upgrade the machine. From my reading, HDF5 may be a suitable solution for my problem, but I am not sure how to write the data to an HDF5 file iteratively, since I cannot load the CSV file as a single dataframe object.

So my question is: how do I write a large CSV file into an HDF5 file with Python pandas?

1 Answer


You can read the CSV file in chunks using the chunksize parameter and append each chunk to the HDF5 store:

import pandas as pd

hdf_key = 'hdf_key'
df_cols_to_index = [...]  # list of columns (labels) that should be indexed

store = pd.HDFStore(hdf_filename)

for chunk in pd.read_csv(csv_filename, chunksize=500000):
    # don't index data columns in each iteration - we'll do it later ...
    store.append(hdf_key, chunk, data_columns=df_cols_to_index, index=False)

# index data columns in the HDFStore once, after all chunks have been appended
store.create_table_index(hdf_key, columns=df_cols_to_index, optlevel=9, kind='full')
store.close()
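
Once the store has been written and indexed this way, it can also be read back in chunks, so the data never has to fit in memory at once. A minimal sketch, assuming the same hdf_filename and hdf_key as above, and that 'some_col' is one of the indexed data columns (both the column name and the process() helper are placeholders):

import pandas as pd

with pd.HDFStore(hdf_filename, mode='r') as store:
    # select() streams the table back in chunks; the where clause works
    # because 'some_col' was passed via data_columns and then indexed
    for chunk in store.select(hdf_key, where='some_col > 0', chunksize=500000):
        process(chunk)  # placeholder for whatever per-chunk processing is needed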

7 Comments

Thanks for the answer. I am not familiar with the pytables package. Is it possible to use h5py?
Pandas implements its own HDF API on top of PyTables - we should use that API for compatibility reasons...
@YanSong, but frankly speaking I don't understand what's wrong with using the internal Pandas methods that are based on PyTables - you don't need to know anything about PyTables in order to use the Pandas HDF methods...
If the number of columns exceeds 2000, this approach will fail.
@G_KOBELIEF please describe how the failure presents itself. Thanks!