
I have a simple question, I cannot help but feel like I am missing something obvious.

I have read data from a source table (SQL Server) and have created an HDF5 file to store the data via the following:

output.to_hdf('h5name', 'df', format='table', data_columns=True, append=True, complib='blosc', min_itemsize=10)

The dataset is ~50 million rows and 11 columns.

If I read the entire HDF5 file back into a dataframe (via HDFStore.select or read_hdf), it consumes about 24GB of RAM. If I pass specific columns to the read statement (e.g. selecting 2 or 3 columns), the dataframe returns only those columns, yet the same amount of memory (24GB) is consumed.
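For reference, a minimal sketch of the column-selecting read described above, using a tiny stand-in table (the column names and file path here are hypothetical, not from the original data):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical small stand-in for the real 50M-row, 11-column table.
path = os.path.join(tempfile.mkdtemp(), 'h5name')
df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5) * 2.0, 'c': list('vwxyz')})
df.to_hdf(path, 'df', format='table', data_columns=True)

# Requesting only two columns: the returned frame holds just those
# columns, but the query still reads whole rows under the hood.
subset = pd.read_hdf(path, 'df', columns=['a', 'b'])
print(list(subset.columns))
```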

This is running on Python 2.7 with Pandas 0.14.

Am I missing something obvious?

EDIT: I think I answered my own question. While I did a ton of searching before posting, of course once I posted I found a useful link: https://github.com/pydata/pandas/issues/6379

Any suggestions on how to optimize this process would be great; due to memory limitations, I cannot hit the peak memory required to release via gc.

  • "Python 2.4" you should definitely consider updating, this is not supported (or do you mean 3.4??). Commented Sep 18, 2014 at 0:40
  • Sorry, 2.7. Tired eyes. Commented Sep 18, 2014 at 13:11

1 Answer


HDFStore in table format is a row-oriented store. When selecting, the query indexes on the rows, but for each row you get every column; selecting a subset of columns does a reindex at the end.

There are several ways to approach this:

  • use a column store, like bcolz; this is currently not implemented by PyTables, so it would involve quite a bit of work
  • chunk through the table, see here, and concat at the end - this will use constant memory
  • store as a fixed format - this is a more efficient storage format, so it will use less memory (but it cannot be appended to)
  • create your own column-store-like layout by storing to multiple sub-tables and using select_as_multiple, see here
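The chunking option above can be sketched as follows; the file path, key, and column names are hypothetical, and in practice chunksize would be much larger (e.g. millions of rows):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical small table standing in for the real one.
path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
df = pd.DataFrame({'x': np.arange(100), 'y': np.arange(100) * 0.5})
df.to_hdf(path, 'df', format='table')

# Iterate in fixed-size chunks so only ~chunksize rows are materialized
# at a time; peak memory stays roughly constant until the final concat.
with pd.HDFStore(path, mode='r') as store:
    chunks = [chunk for chunk in store.select('df', chunksize=25)]
result = pd.concat(chunks)
print(len(result))
```

Note that the final pd.concat still needs room for the full result, so this helps most when you reduce or filter each chunk before concatenating.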

Which option you choose depends on the nature of your data access.

Note: you may not want to make all of the columns data_columns unless you are really going to query on all of them (you can only query on a data_column or an index); keeping fewer data_columns will make storing/querying faster.
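A small sketch of the data_columns point (column names and file path hypothetical): only columns declared as data_columns can be used in a where clause.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Only 'a' is a data_column, so only 'a' (and the index) can appear in
# a where clause; 'b' is stored but not individually queryable.
path = os.path.join(tempfile.mkdtemp(), 'q.h5')
df = pd.DataFrame({'a': np.arange(10), 'b': np.linspace(0.0, 1.0, 10)})
df.to_hdf(path, 'df', format='table', data_columns=['a'])

hits = pd.read_hdf(path, 'df', where='a > 6')
print(len(hits))  # rows where a is 7, 8, 9
```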

