Pandas: in memory sorting hdf5 files

I have the following problem:

I have a set of several HDF5 files containing similar data frames, which I want to sort globally on multiple columns.

My input is the file names and an ordered list of columns I want to use for sorting. The output should be a single hdf5 file containing all the sorted data.

Each file can contain millions of rows. I can afford loading a single file in memory but not the entire dataset.

Naively, I would first copy all the data into a single HDF5 file (which is not difficult) and then find a way to do in-memory sorting of this huge file.

Is there a quick way to sort, in memory, a pandas data structure stored in an HDF5 file based on multiple columns?

I have already looked at ptrepack, but it seems to allow sorting on only a single column.
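To make the constraint concrete, here is a minimal sketch of one possible approach, an external merge sort: sort each file on its own (each fits in memory), then do a k-way merge of the sorted streams. This is not an existing pandas or PyTables feature. The helper names `rows_from_chunks` and `merge_sorted_streams` are hypothetical, and the small in-memory frames below merely stand in for per-file data that would be streamed in chunks (for table-format stores, something like `pd.read_hdf(path, key, chunksize=...)`).

```python
import heapq
from itertools import islice
from operator import itemgetter

import pandas as pd


def rows_from_chunks(chunks):
    """Flatten an iterable of DataFrame chunks into plain row tuples."""
    for chunk in chunks:
        yield from chunk.itertuples(index=False, name=None)


def merge_sorted_streams(streams, columns, sort_cols, out_rows=100_000):
    """Yield DataFrames that, concatenated, are globally sorted by sort_cols.

    Assumes every stream is already sorted by sort_cols and that all
    chunks share the same columns in the same order.
    """
    positions = [columns.index(c) for c in sort_cols]
    key = itemgetter(*positions)
    # heapq.merge is lazy: it keeps only one pending row per stream.
    merged = heapq.merge(*(rows_from_chunks(s) for s in streams), key=key)
    while True:
        block = list(islice(merged, out_rows))
        if not block:
            return
        yield pd.DataFrame(block, columns=columns)


# Stand-ins for two files, each already sorted by ("x", "y").
a = pd.DataFrame({"x": [1, 1, 3], "y": [2, 5, 0], "v": ["a", "b", "c"]})
b = pd.DataFrame({"x": [1, 2, 3], "y": [3, 1, 0], "v": ["d", "e", "f"]})
streams = [[a.iloc[:2], a.iloc[2:]], [b]]  # the first "file" arrives in two chunks

chunks = list(merge_sorted_streams(streams, ["x", "y", "v"], ["x", "y"], out_rows=4))
result = pd.concat(chunks, ignore_index=True)
print(result)
```

In a real run, each yielded chunk could be appended to the output store (for example with `HDFStore.append`) instead of being collected in a list. The merge happens row by row in Python, so it is likely to be slow for millions of rows, and missing values in the sort columns would need separate handling.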

asked Jul 2, 2014 at 8:22

Luca Fiaschi


Why do you actually need to sort the resulting file by more than one column? You can simply index in a query.

– Jeff
Commented Jul 2, 2014 at 11:35

Hi @Jeff, what exactly do you mean by "index in a query"? Thanks.

– Luca Fiaschi
Commented Aug 5, 2014 at 6:41

Why don't you show some code for what you are doing?

– Jeff
Commented Aug 5, 2014 at 9:21

I am currently using a completely different strategy: I push the files one by one into a database table, and then retrieve the data I want, sorted on the fly, using the pandas SQL interface and an SQL query.

– Luca Fiaschi
Commented Aug 6, 2014 at 10:46

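The database strategy described in this comment might look roughly like the sketch below. The comment does not say which database is used, so SQLite (via the standard-library `sqlite3` module), the table name `measurements`, and the helper names are all assumptions made for illustration. The small frames stand in for data read from each HDF5 file.

```python
import sqlite3

import pandas as pd


def load_frames_into_table(frames, conn, table="measurements"):
    """Append each frame to one table, one frame (one file) at a time."""
    for frame in frames:
        frame.to_sql(table, conn, if_exists="append", index=False)


def read_sorted(conn, table, sort_cols, chunksize=None):
    """Let the database sort; return a DataFrame, or an iterator if chunksize is set."""
    # Identifiers are interpolated into the SQL text, so they must be trusted names.
    order_by = ", ".join(f'"{c}"' for c in sort_cols)
    query = f'SELECT * FROM "{table}" ORDER BY {order_by}'
    return pd.read_sql_query(query, conn, chunksize=chunksize)


conn = sqlite3.connect(":memory:")  # a file path would be used for real data

# Stand-ins for the contents of two unsorted input files.
file_a = pd.DataFrame({"x": [3, 1], "y": [0, 5], "v": ["c", "b"]})
file_b = pd.DataFrame({"x": [1, 2], "y": [3, 1], "v": ["d", "e"]})

load_frames_into_table([file_a, file_b], conn)
sorted_df = read_sorted(conn, "measurements", ["x", "y"])
print(sorted_df)
```

Passing `chunksize` to `read_sorted` returns the sorted rows in pieces, which keeps memory bounded when the result is written back out.
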
If that works for you, great.

– Jeff
Commented Aug 6, 2014 at 11:24
