
I use pandas and HDF5 files to handle large amounts of data (e.g. 10GB and more). I would like to use the table format in order to be able to query the data efficiently when reading it. However, when I write my data to an HDF store (using DataFrame.to_hdf()), it incurs an enormous memory overhead. Consider the following example:

import pandas as pd
import numpy as np
from random import sample

nrows   = 1000000
ncols   = 500

# create a big dataframe
df = pd.DataFrame(np.random.rand(nrows,ncols)) 

# 1) Desired table format: huge memory overhead
df.to_hdf('test.hdf', 'random', format='table')  # uses lots of additional memory

# 2) Fixed format: uses no additional memory
df.to_hdf('test2.hdf', 'random')

When I call df.info() I see that the DataFrame has a size of 3.7GB. When executing version one, with the table format, the memory usage of my system suddenly goes up by approximately 8.5GB, which is more than twice the size of my DataFrame. On the other hand, for version two with the fixed format there is no additional memory overhead.
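For reference, one way to reproduce the measurement outside of IPython is sketched below. It assumes the memory_profiler package is installed (pip install memory_profiler); the write_table/write_fixed helper names are just for illustration.

import numpy as np
import pandas as pd
from memory_profiler import memory_usage

df = pd.DataFrame(np.random.rand(1000000, 500))

def write_table():
    df.to_hdf('test.hdf', 'random', format='table', mode='w')

def write_fixed():
    df.to_hdf('test2.hdf', 'random', mode='w')

# memory_usage samples the process RSS while the callable runs; max() gives the peak in MiB
print('table peak (MiB):', max(memory_usage((write_table,))))
print('fixed peak (MiB):', max(memory_usage((write_fixed,))))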

Does anyone know what the issue is or how I can prevent this enormous memory overhead? If I have larger DataFrames of around 10GB I always run out of memory because of this overhead. I know that in terms of speed and other things the performance of the table format is not that great, but I don't see why it should use so much additional memory (enough for more than two full copies of the DataFrame).

It would be great if anyone has an explanation and/or a solution to this problem.

Thanks, Markus

1 Answer


In <= 0.15.2

In [1]: df = pd.DataFrame(np.random.rand(1000000,500))

In [2]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Columns: 500 entries, 0 to 499
dtypes: float64(500)
memory usage: 3.7 GB
In [3]: %memit -r 1 df.to_hdf('test.h5','df',format='table',mode='w')
peak memory: 11029.49 MiB, increment: 7130.57 MiB

There is an inefficiency in <= 0.15.2 that ends up copying the data 1-2 times. You can use the following as a workaround until 0.16.0 (where this is fixed):

In [9]: %cpaste
def f(df):
    # write in chunks of 10,000 rows; floor division keeps the group keys integral
    g = df.groupby(np.arange(len(df)) // 10000)
    store = pd.HDFStore('test.h5', mode='w')
    for _, grp in g:
        # index=False skips building the PyTables index on every append
        store.append('df', grp, index=False)
    # build the index once, after all chunks are written
    store.get_storer('df').create_index()
    store.close()

In [11]: %memit -r 1 f(df)
peak memory: 7977.26 MiB, increment: 4079.32 MiB
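The workaround helps because only one ~10,000-row chunk is copied per append, and index=False defers building the PyTables index until the single create_index() call at the end. The same idea can be written without the groupby trick by slicing with iloc; this is just a sketch and the chunk size is arbitrary:

chunksize = 100000
with pd.HDFStore('test.h5', mode='w') as store:
    for start in range(0, len(df), chunksize):
        # append one slice at a time so only a chunk-sized copy is ever made
        store.append('df', df.iloc[start:start + chunksize], index=False)
    # build the PyTables index once, after all chunks are written
    store.get_storer('df').create_index()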

In 0.16.0 (coming 3rd week of March 2015), after this PR

In [2]: %memit -r 1 df.to_hdf('test.h5','df',format='table',mode='w')
peak memory: 4669.21 MiB, increment: 794.57 MiB  
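For completeness, the reason to accept any overhead for the table format is that it can be queried on read. Something along these lines works against the file written above; the where condition and column list are just illustrative:

# select a slice of rows and a handful of columns without loading the whole file
subset = pd.read_hdf('test.h5', 'df',
                     where='index >= 10000 & index < 20000',
                     columns=[0, 1, 2])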