I have the following problem:

I have a set of several HDF5 files containing similar data frames, which I want to sort globally on multiple columns.

My input is the file names and an ordered list of columns I want to use for sorting. The output should be a single hdf5 file containing all the sorted data.

Each file can contain millions of rows. I can afford loading a single file in memory but not the entire dataset.

Naively, I would first copy all the data into a single HDF5 file (which is not difficult) and then find a way to sort this huge file without loading all of it into memory.

Is there a quick way to sort a pandas data structure stored in an HDF5 file on multiple columns without loading it all into memory?

I have already looked at ptrepack, but it seems to allow sorting on only a single column.
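For reference, one common way to approach this kind of problem is an external merge sort: sort each file on its own (each fits in memory), then do a k-way merge of the sorted runs. The sketch below shows only the merge step, using the standard library on plain row tuples. The toy data, column positions, and helper names (`merge_sorted_runs`, `in_batches`) are hypothetical. In a real setup, each run would presumably be sorted with something like `DataFrame.sort_values`, written back to disk, and streamed in chunks, and each output batch would be appended to the result file, for example with `HDFStore.append`.

```python
import heapq
from operator import itemgetter


def merge_sorted_runs(runs, key_positions):
    """Lazily merge already-sorted iterables of row tuples into one sorted stream.

    Only one pending row per run is held by the merge itself, so the full
    dataset never needs to be in memory, provided the runs are streamed.
    """
    key = itemgetter(*key_positions)
    return heapq.merge(*runs, key=key)


def in_batches(rows, size):
    """Group a row stream into lists of at most `size` rows for bulk writes."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hypothetical toy data: each list stands in for the rows of one input file,
# with columns (a, b, value). Sorting is on (a, b).
file_1 = [(2, 1, "x"), (1, 2, "y"), (1, 1, "z")]
file_2 = [(1, 3, "p"), (2, 0, "q")]

key_positions = (0, 1)

# Step 1: sort every file independently (each one fits in memory).
runs = [sorted(rows, key=itemgetter(*key_positions)) for rows in (file_1, file_2)]

# Step 2: k-way merge of the sorted runs, consumed in write-sized batches.
merged = list(merge_sorted_runs(runs, key_positions))
batches = list(in_batches(merged, 2))
```

Here the runs are small in-memory lists for illustration only; the memory benefit appears when each run is an iterator that reads its file chunk by chunk.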

  • Why do you actually need to sort the resulting file by more than one column? You can simply index in a query. Commented Jul 2, 2014 at 11:35
  • Hi @Jeff, what do you mean exactly with index in a query? Thanks Commented Aug 5, 2014 at 6:41
  • why don't u show some code of what you are doing Commented Aug 5, 2014 at 9:21
  • I am currently using a completely different strategy; I push all files one by one in a database table and then I retrieve the data I want sorted on the fly with pandas sql interface and an sql query. Commented Aug 6, 2014 at 10:46
  • if that works for you gr8 Commented Aug 6, 2014 at 11:24
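The database strategy described in the comments above can be sketched roughly as follows. This is a minimal illustration with the standard-library `sqlite3` module rather than the asker's actual code: the table name, column names, and toy rows are assumptions. With pandas, `DataFrame.to_sql` and `pandas.read_sql_query` would presumably take the place of the insert and select steps.

```python
import sqlite3

# Hypothetical toy data: each list stands in for the rows of one input file,
# with columns (a, b, value).
file_1 = [(2, 1, "x"), (1, 2, "y"), (1, 1, "z")]
file_2 = [(1, 3, "p"), (2, 0, "q")]

# An in-memory database keeps the example self-contained; a real run would
# use an on-disk database so the data does not have to fit in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (a INTEGER, b INTEGER, value TEXT)")

# Push the files into the table one by one.
for rows in (file_1, file_2):
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)
conn.commit()

# Let the database do the multi-column sort and read the result in chunks.
cursor = conn.execute("SELECT a, b, value FROM measurements ORDER BY a, b")
first_chunk = cursor.fetchmany(3)
rest = cursor.fetchall()
conn.close()
```

The database engine performs the sort, so only the chunk currently being fetched has to be held on the Python side.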
