
When querying a subset of data from a CSV into memory, I always do it this way:

import pandas as pd

df = pd.read_csv('data.csv', chunksize=10**3)

# filter the first chunk, then append the filtered remainder
chunk1 = df.get_chunk()
chunk1 = chunk1[chunk1['Col1'] > someval]

for chunk in df:
    chunk1 = chunk1.append(chunk[chunk['Col1'] > someval])

I recently started playing around with HDF5, and am not able to do the same because the TableIterator object has no get_chunk() method and does not support next().

df = pd.read_hdf('data.h5', chunksize=10**3)
df.get_chunk()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-xxxxxxxx> in <module>()
----> 1 df.get_chunk()

AttributeError: 'TableIterator' object has no attribute 'get_chunk'

Any ideas for a workaround? (I know that I can query the HDF5 file on disk using pandas, but for this purpose I would like to try it this way.)

2 Answers


It really does make sense to use HDF indexing in this case as it's much more efficient.

Here is a small demo:

generate test DataFrame (10M rows, 3 columns):

In [1]: df = pd.DataFrame(np.random.randint(0,10**7,(10**7,3)),columns=list('abc'))

In [2]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
a    int32
b    int32
c    int32
dtypes: int32(3)
memory usage: 114.4 MB

In [3]: df.shape
Out[3]: (10000000, 3)

save the DF to an HDF file. Make sure column a is indexed (data_columns=['a', ...], or data_columns=True to index all columns):

fn = r'c:/tmp/test.h5'
store = pd.HDFStore(fn)
store.append('test', df, data_columns=['a'])
store.close()
del df
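As an aside, the same write can also be expressed with a context manager, or as a single DataFrame.to_hdf call. This is a minimal sketch reusing fn and df from above (use either form, not both); format='table' is spelled out because only the table format supports where=<terms> queries:

# equivalent write using HDFStore as a context manager
with pd.HDFStore(fn) as store:
    store.append('test', df, data_columns=['a'])

# or in one call; format='table' keeps the data queryable with where=<terms>
df.to_hdf(fn, key='test', format='table', append=True, data_columns=['a'])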

test reading from HDF file:

fn = r'c:/tmp/test.h5'
chunksize = 10**6
someval = 100

Timing:

read HDF in chunks and concatenate the filtered chunks into the resulting DF:

In [18]: %%timeit
    ...: df = pd.DataFrame()
    ...: for chunk in pd.read_hdf(fn, 'test', chunksize=chunksize):
    ...:     df = pd.concat([df, chunk.loc[chunk.a < someval]], ignore_index=True)
    ...:
1 loop, best of 3: 2min 22s per loop

read HDF in chunks (conditionally - filtering data by HDF index) and concatenate chunks into resulting DF:

In [19]: %%timeit
    ...: df = pd.DataFrame()
    ...: for chunk in pd.read_hdf(fn, 'test', chunksize=chunksize, where='a < someval'):
    ...:     df = pd.concat([df, chunk], ignore_index=True)
    ...:
10 loops, best of 3: 79.1 ms per loop

Conclusion: searching HDF by index (using where=<terms>) is roughly 1795 times faster than reading everything and filtering in memory:

In [20]: (2*60+22)*1000/79.1
Out[20]: 1795.19595448799
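
For completeness: if the filtered result fits in memory, the chunk loop can be dropped altogether and the same where condition passed in a single call. A minimal sketch, reusing fn and someval from above:

# one-shot indexed query - only the matching rows are read from disk
result = pd.read_hdf(fn, 'test', where='a < someval')

# equivalent form via HDFStore.select
with pd.HDFStore(fn) as store:
    result = store.select('test', where='a < someval')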

4 Comments

How did you make this work? When I try the code above I am getting errors like: ValueError: Shape of passed values is (1, 100000), indices imply (1, 1551440685) - I tried every possible combination of chunksize, iterator, etc, and it is failing all the time...
@DejanLekic, did you try to execute the code from my answer? what is your Pandas version?
pandas 0.20.1, tables 3.4.2 ... Getting this weird error... If I omit chunksize it runs out of memory trying to load everything into memory... If you could give me your exact versions that would be great so I can pin my requirements to those and try again... BTW, I can't use where. I need to load ALL records, because I am trying to convert HDF5 files to Parquet table-by-table...
@DejanLekic, just tested it against Pandas: 0.20.1, tables: 3.2.2, numpy: 1.12.1

Simply:

chunk1 = pd.concat([chunk[chunk['Col1'] > someval] for chunk in df])
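
The same pattern applies to the HDF iterator from the question, since the TableIterator returned by read_hdf with a chunksize is itself iterable. A minimal sketch, assuming 'data.h5' holds a single table-format object and Col1/someval are as in the question:

# iterate the TableIterator directly and concatenate the filtered chunks
it = pd.read_hdf('data.h5', chunksize=10**3)  # requires the data to be stored in table format
chunk1 = pd.concat([chunk[chunk['Col1'] > someval] for chunk in it], ignore_index=True)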
