
I'm about to try out PyTables for the first time and I need to write my data to the HDF file one time step at a time. I'll have over 100,000 time steps. When I'm done, I would like to sort my 100,000+ x 6 array by column 2, i.e., I currently have everything sorted by time but now I need to sort the array in order of decreasing rain rates (col 2). I'm unsure how to even begin here. I know that having the entire array in memory is unwise. Any ideas how to do this fast and efficiently?

Appreciate any advice.

2 Comments
  • What's wrong with Table.readSorted() or Table.iterSorted()? And by the way, 100k rows with 6 fields each is not much, as long as your fields are numeric (about 5 MB). (See the sketch just below these comments.) Commented Jan 21, 2013 at 12:30
  • I was not aware of these sorting functions. Thanks for the tip. I'll give it a try. Commented Jan 21, 2013 at 14:18
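Picking up on the first comment: below is a minimal sketch of the index-and-iterate approach, written against the modern PyTables 3.x snake_case API (the comment uses the older camelCase names). The file name rain.h5, the /obs node, the time and rain_rate fields, and the dummy data are all assumptions for illustration, not anything from the question.

import numpy as np
import tables as tb

# Hypothetical record layout: one row per time step, with 'rain_rate'
# standing in for column 2 of the 100,000 x 6 array.
class Obs(tb.IsDescription):
    time = tb.Float64Col()
    rain_rate = tb.Float64Col()

with tb.open_file("rain.h5", mode="w") as h5:
    table = h5.create_table("/", "obs", Obs)

    # Append one row per time step (dummy values here).
    row = table.row
    for t in range(100000):
        row['time'] = float(t)
        row['rain_rate'] = np.random.rand()
        row.append()
    table.flush()

    # Build a completely sorted index (CSI) on the sort column, then let
    # PyTables hand back rows in decreasing rain-rate order on disk.
    table.cols.rain_rate.create_csindex()
    for r in table.itersorted('rain_rate', step=-1):  # negative step = descending
        pass  # process r['time'], r['rain_rate'] here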

2 Answers


I know that having the entire array in memory is unwise.

You might be overthinking it. A 100K x 6 array of float64 takes just ~5MB of RAM. On my computer, sorting such an array takes about 27ms:

In [36]: import numpy as np

In [37]: a = np.random.rand(100000, 6)

In [38]: %timeit a[a[:,1].argsort()]
10 loops, best of 3: 27.2 ms per loop
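If the data already sits in a PyTables file, one way to apply this (a sketch, assuming a hypothetical table at /obs with a 'rain_rate' field and the modern open_file API) is to read the whole table into a structured array and sort it with argsort:

import numpy as np
import tables as tb

# A sketch: pull the whole table into memory as a structured array (~a few MB).
with tb.open_file("rain.h5", mode="r") as h5:
    data = h5.root.obs.read()

# Indices that sort by rain rate, reversed for decreasing order.
order = np.argsort(data['rain_rate'])[::-1]
data_sorted = data[order]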


Unless you have a very old computer, you can comfortably put the entire array in memory. Assuming single-precision floats, it will only take 100000*6*4./2**20 = 2.29 MB (twice as much for doubles). You can use NumPy's sort or argsort for sorting. For example, you can get the sorting indices from your second column:

import numpy as np
a = np.random.normal(0, 1, size=(100000,6))
idx = a[:, 1].argsort()

And then use these to index the columns you want, or the whole array:

b = a[idx]
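Since the question asks for decreasing rain rates, reversing the index array gives descending order:

b_desc = a[idx[::-1]]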

You can even use different types of sort and check their speed:

In [33]: %timeit idx = a[:, 1].argsort(kind='quicksort')
100 loops, best of 3: 12.6 ms per loop

In [34]: %timeit idx = a[:, 1].argsort(kind='mergesort')
100 loops, best of 3: 14.4 ms per loop

In [35]: %timeit idx = a[:, 1].argsort(kind='heapsort')
10 loops, best of 3: 21.4 ms per loop

So you see that for an array of this size the choice of sorting algorithm doesn't really matter.

1 Comment

Thanks. I think I may have under-estimated the array size. I have done this in numpy before, when the time index was over 330 million. I forgot to multiply it by the lat and lon dimensions. Also the size will increase as I process data with higher and higher horizontal resolution. Appreciate the tip. I'm in the process of writing the code and adapting my old numpy scripts.
