I use pandas and HDF5 files to handle large amounts of data (e.g. 10 GB and more). I would like to use the table format so that I can query the data efficiently when reading it. However, when I write my data to an HDF store (using DataFrame.to_hdf()), I get an enormous memory overhead. Consider the following example:
import pandas as pd
import numpy as np
from random import sample
nrows = 1000000
ncols = 500
# create a big dataframe
df = pd.DataFrame(np.random.rand(nrows, ncols))
# 1) Desired table format: uses huge memory overhead
df.to_hdf('test.hdf', 'random', format='table')  # uses lots of additional memory
# 2) Fixed format, uses no additional memory
df.to_hdf('test2.hdf', 'random')
When I call df.info(), it reports that the DataFrame takes 3.7 GB in memory. When executing version one, with the table format, the memory usage of my system suddenly goes up by approximately 8.5 GB, which is more than twice the size of my DataFrame. On the other hand, for version two with the fixed format there is no additional memory overhead.
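For reference, this is how I measured the 3.7 GB (all columns are float64 here, so summing memory_usage() should be exact):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(1000000, 500))
# total in-memory size in GiB: 1e6 rows * 500 float64 columns * 8 bytes
print(df.memory_usage().sum() / 1024**3)
df.info()  # reports the same total at the bottom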
Does anyone know what the issue is or how I can prevent this enormous memory overhead? With larger DataFrames of around 10 GB, I always run out of memory because of it. I know that the table format is not great in terms of speed and other respects, but I don't see why it should need so much additional memory (enough for more than two full copies of the DataFrame).
It would be great if anyone had an explanation and/or a solution to this problem.
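One workaround I have been considering is to append the DataFrame to the store in chunks via pd.HDFStore, in the hope that the conversion overhead is then limited to one chunk at a time. A rough sketch of what I mean (the chunk size of 100,000 rows is just a guess and would need tuning):

import pandas as pd
import numpy as np

nrows, ncols = 1000000, 500
chunksize = 100000  # arbitrary; would need tuning to the available memory

df = pd.DataFrame(np.random.rand(nrows, ncols))

# Append slice by slice so that only one chunk is converted and written at a time
with pd.HDFStore('test_chunked.hdf', mode='w') as store:
    for start in range(0, nrows, chunksize):
        store.append('random', df.iloc[start:start + chunksize], format='table')

But I'm not sure whether this actually caps the peak memory or just moves the problem around, so I would still like to understand where the overhead comes from.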
Thanks, Markus