
I want to read and write data to an HDF5 file incrementally because I can't fit the data into memory.

The data to read/write is sets of integers. I only need to read/write the sets sequentially; no random access is needed. For example, I read set1, then set2, then set3, and so on.

The problem is that I can't retrieve the sets by index.

import pandas as pd

# Append two Series to the same key; each append is one "set".
x = pd.HDFStore('test.hf', 'w', append=True)
a = pd.Series([1])
x.append('dframe', a, index=True)
b = pd.Series([10, 2])
x.append('dframe', b, index=True)
x.close()

# Reopen read-only and try to get the first set back by position.
x = pd.HDFStore('test.hf', 'r')
print(x['dframe'])
y = x.select('dframe', start=0, stop=1)
print("selected:", y)
x.close()

Output:

0     1
0    10
1     2
dtype: int64
selected: 0    1
dtype: int64

It doesn't select my 0th set, which is {1, 10}.
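Note that start and stop in HDFStore.select count physical table rows, not index labels, so slicing out a whole set this way only works if its length is already known. A minimal sketch against the file written above:

import pandas as pd

x = pd.HDFStore('test.hf', 'r')
# Rows 0..1 happen to be the two members of set 0, but only because
# we already know that set 0 has exactly two elements.
y = x.select('dframe', start=0, stop=2)
print(y)
# 0     1
# 0    10
# dtype: int64
x.close()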

4 Comments
  • index=False stackoverflow.com/questions/25714549/… Commented Mar 25, 2017 at 14:15
  • you can simply do this: y=x.select('dframe',start=0,stop=1+1) Commented Mar 25, 2017 at 14:19
  • @MaxU. But that means I know that the set has two elements before I read from the file, which is not the case. I don't know the size of the set when I read the file. Commented Mar 25, 2017 at 14:22
  • in this case you should use store.select('dframe', where="...") as you did in your answer Commented Mar 25, 2017 at 14:23

1 Answer


This way works, but I really don't know how fast it is.

Does it scan the whole file to find the rows with a given index? That would be quite a waste of time.

import pandas as pd

# Write the same two Series, compressed, with a PyTables index on the
# index column (index=True). Note that append() always stores the data
# in table format.
x = pd.HDFStore('test.hf', 'w', append=True, format="table", complevel=9)
a = pd.Series([1])
x.append('dframe', a, index=True)
b = pd.Series([10, 2])
x.append('dframe', b, index=True)
x.close()

# Read one whole set back by querying the index value instead of
# slicing by row position.
x = pd.HDFStore('test.hf', 'r')
print(x['dframe'])
y = x.select('dframe', 'index == 0')
print('selected:')
for i in y:
    print(i)
x.close()

Output:

0     1
0    10
1     2
dtype: int64
selected:
1
10
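On the scan question: append(..., index=True) asks PyTables to build an on-disk index on the index column, so a query like 'index == 0' should be answered from that index rather than by reading every row. A minimal sketch for checking (and, if needed, rebuilding) the index; the optlevel/kind values here are just illustrative choices:

import pandas as pd

x = pd.HDFStore('test.hf', mode='a')

# colindexed maps each column of the underlying PyTables table to a flag
# saying whether an on-disk index exists for it.
print(x.get_storer('dframe').table.colindexed)

# Rebuild a full-quality index on the 'index' column explicitly
# (append(..., index=True) normally creates one already).
x.create_table_index('dframe', columns=['index'], optlevel=9, kind='full')

# With the index in place, the where query is resolved via the index.
print(x.select('dframe', 'index == 0').tolist())   # [1, 10]
x.close()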

5 Comments

Using data_columns=True is a correct approach, but you should also create your HDF store with table format: pd.HDFStore('test.hf', mode='w', format='table', append=True)
you may want to check this answer for some performance testing...
@MaxU 755ms per cycle is just too slow... I'd have to do something like 759997 cycles, and I only need to read the sets sequentially, not with random access. If I write my own code for saving/reading sequentially, it can be faster (see the chunked-read sketch after these comments).
I'd suggest you open a new question, provide a reproducible sample data set (are you working with Series in real life or with DataFrames?), and explain what you are trying to do. What cycles are you talking about? Are you sure you need cycles at all?
yup. writing the binary io code now. probably will take half a day. oh, but then it will get moved to code review because it is "code that works as intended"
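For the sequential-read use case discussed above, one way to avoid a where query per set is to stream the table in chunks and split on changes of the index value. A minimal sketch, assuming rows of one set are stored consecutively under a shared index value (as in the code above); the chunk size is an arbitrary choice:

import pandas as pd

def iter_sets(path, key, chunksize=100000):
    """Yield each stored set as a Python set, reading the table chunk by chunk."""
    leftover = None  # rows of a set that may continue into the next chunk
    with pd.HDFStore(path, mode='r') as store:
        for chunk in store.select(key, chunksize=chunksize):
            if leftover is not None:
                chunk = pd.concat([leftover, chunk])
            # Rows belonging to one set share the same index value; the last
            # value in this chunk may continue in the next one, so hold it back.
            last_label = chunk.index[-1]
            complete = chunk[chunk.index != last_label]
            leftover = chunk[chunk.index == last_label]
            for _, group in complete.groupby(level=0, sort=False):
                yield set(group.tolist())
        if leftover is not None and len(leftover):
            yield set(leftover.tolist())

for s in iter_sets('test.hf', 'dframe'):
    print(s)   # {1, 10} then {2}

This reads the file front to back exactly once, so the cost of reading a set does not depend on how many sets precede it.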
