
I want to read and write data to an HDF5 file incrementally because I can't fit the data into memory.

The data to read/write is sets of integers. I only need to read/write the sets sequentially; no random access is needed. For example, I read set1, then set2, then set3, and so on.

The problem is that I can't retrieve the sets by index.

import pandas as pd

# Append two Series to the same key; each append is one "set".
x = pd.HDFStore('test.hf', 'w', append=True)
a = pd.Series([1])
x.append('dframe', a, index=True)
b = pd.Series([10, 2])
x.append('dframe', b, index=True)
x.close()

# Reopen read-only and try to get the first set back by position.
x = pd.HDFStore('test.hf', 'r')
print(x['dframe'])
y = x.select('dframe', start=0, stop=1)
print("selected:", y)
x.close()

Output:

0     1
0    10
1     2
dtype: int64
selected: 0    1
dtype: int64

It doesn't select my 0th set, which is {1, 10}.
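Note that start and stop in HDFStore.select count physical table rows, not index labels, so slicing out a whole set this way only works if its length is already known. A minimal sketch against the file written above:

import pandas as pd

x = pd.HDFStore('test.hf', 'r')
# Rows 0..1 happen to be the two members of set 0, but only because
# we already know that set 0 has exactly two elements.
y = x.select('dframe', start=0, stop=2)
print(y)
# 0     1
# 0    10
# dtype: int64
x.close()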

4 Comments
  • index=False stackoverflow.com/questions/25714549/… Commented Mar 25, 2017 at 14:15
  • you can simply do this: y=x.select('dframe',start=0,stop=1+1) Commented Mar 25, 2017 at 14:19
  • @MaxU. But that means I know that the set has two elements before I read from the file, which is not the case. I don't know the size of the set when I read the file. Commented Mar 25, 2017 at 14:22
  • in this case you should use store.select('dframe', where="...") as you did in your answer Commented Mar 25, 2017 at 14:23

1 Answer


This way works, but I really don't know how fast it is.

Does it scan the whole file to find the rows with a given index? That would be quite a waste of time.

import pandas as pd

# Write the same two Series, compressed, with a PyTables index on the
# index column (index=True). Note that append() always stores the data
# in table format.
x = pd.HDFStore('test.hf', 'w', append=True, format="table", complevel=9)
a = pd.Series([1])
x.append('dframe', a, index=True)
b = pd.Series([10, 2])
x.append('dframe', b, index=True)
x.close()

# Read one whole set back by querying the index value instead of
# slicing by row position.
x = pd.HDFStore('test.hf', 'r')
print(x['dframe'])
y = x.select('dframe', 'index == 0')
print('selected:')
for i in y:
    print(i)
x.close()

Output:

0     1
0    10
1     2
dtype: int64
selected:
1
10
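On the scan question: append(..., index=True) asks PyTables to build an on-disk index on the index column, so a query like 'index == 0' should be answered from that index rather than by reading every row. A minimal sketch for checking (and, if needed, rebuilding) the index; the optlevel/kind values here are just illustrative choices:

import pandas as pd

x = pd.HDFStore('test.hf', mode='a')

# colindexed maps each column of the underlying PyTables table to a flag
# saying whether an on-disk index exists for it.
print(x.get_storer('dframe').table.colindexed)

# Rebuild a full-quality index on the 'index' column explicitly
# (append(..., index=True) normally creates one already).
x.create_table_index('dframe', columns=['index'], optlevel=9, kind='full')

# With the index in place, the where query is resolved via the index.
print(x.select('dframe', 'index == 0').tolist())   # [1, 10]
x.close()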

5 Comments

Using data_columns=True is a correct approach, but you should also create your HDF store with table format: pd.HDFStore('test.hf', mode='w', format='table', append=True)
you may want to check this answer for some performance testing...
@MaxU 755ms per cycle is just too slow... I'd have to do something like 759997 cycles, and I only need to read the sets sequentially, not with random access. If I write my own code for saving/reading sequentially, it can be faster (see the chunked-read sketch after these comments).
I'd suggest you open a new question, provide a reproducible sample data set (are you working with Series in real life or with DataFrames?), and explain what you are trying to do. What cycles are you talking about? Are you sure you need cycles at all?
yup. writing the binary io code now. probably will take half a day. oh, but then it will get moved to code review because it is "code that works as intended"
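For the sequential-read use case discussed above, one way to avoid a where query per set is to stream the table in chunks and split on changes of the index value. A minimal sketch, assuming rows of one set are stored consecutively under a shared index value (as in the code above); the chunk size is an arbitrary choice:

import pandas as pd

def iter_sets(path, key, chunksize=100000):
    """Yield each stored set as a Python set, reading the table chunk by chunk."""
    leftover = None  # rows of a set that may continue into the next chunk
    with pd.HDFStore(path, mode='r') as store:
        for chunk in store.select(key, chunksize=chunksize):
            if leftover is not None:
                chunk = pd.concat([leftover, chunk])
            # Rows belonging to one set share the same index value; the last
            # value in this chunk may continue in the next one, so hold it back.
            last_label = chunk.index[-1]
            complete = chunk[chunk.index != last_label]
            leftover = chunk[chunk.index == last_label]
            for _, group in complete.groupby(level=0, sort=False):
                yield set(group.tolist())
        if leftover is not None and len(leftover):
            yield set(leftover.tolist())

for s in iter_sets('test.hf', 'dframe'):
    print(s)   # {1, 10} then {2}

This reads the file front to back exactly once, so the cost of reading a set does not depend on how many sets precede it.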
