
I'm about to try out PyTables for the first time and I need to write my data to the HDF file one time step at a time. I'll have over 100,000 time steps. When I'm done, I would like to sort my 100,000+ x 6 array by column 2, i.e., I currently have everything sorted by time but now I need to sort the array in order of decreasing rain rates (col 2). I'm unsure how to even begin here. I know that having the entire array in memory is unwise. Any ideas how to do this fast and efficiently?

Appreciate any advice.

2 Comments
  • What's wrong with Table.readSorted() or Table.iterSorted()? And by the way, 100k rows with 6 fields each is not much, as long as your fields are numeric (about 5 MB). (See the sketch just below these comments.) Commented Jan 21, 2013 at 12:30
  • I was not aware of these sorting functions. Thanks for the tip. I'll give it a try. Commented Jan 21, 2013 at 14:18
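Picking up on the first comment: below is a minimal sketch of the index-and-iterate approach, written against the modern PyTables 3.x snake_case API (the comment uses the older camelCase names). The file name rain.h5, the /obs node, the time and rain_rate fields, and the dummy data are all assumptions for illustration, not anything from the question.

import numpy as np
import tables as tb

# Hypothetical record layout: one row per time step, with 'rain_rate'
# standing in for column 2 of the 100,000 x 6 array.
class Obs(tb.IsDescription):
    time = tb.Float64Col()
    rain_rate = tb.Float64Col()

with tb.open_file("rain.h5", mode="w") as h5:
    table = h5.create_table("/", "obs", Obs)

    # Append one row per time step (dummy values here).
    row = table.row
    for t in range(100000):
        row['time'] = float(t)
        row['rain_rate'] = np.random.rand()
        row.append()
    table.flush()

    # Build a completely sorted index (CSI) on the sort column, then let
    # PyTables hand back rows in decreasing rain-rate order on disk.
    table.cols.rain_rate.create_csindex()
    for r in table.itersorted('rain_rate', step=-1):  # negative step = descending
        pass  # process r['time'], r['rain_rate'] here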

2 Answers


I know that having the entire array in memory is unwise.

You might be overthinking it. A 100K x 6 array of float64 takes just ~5MB of RAM. On my computer, sorting such an array takes about 27ms:

In [36]: import numpy as np

In [37]: a = np.random.rand(100000, 6)

In [38]: %timeit a[a[:,1].argsort()]
10 loops, best of 3: 27.2 ms per loop
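If the data already sits in a PyTables file, one way to apply this (a sketch, assuming a hypothetical table at /obs with a 'rain_rate' field and the modern open_file API) is to read the whole table into a structured array and sort it with argsort:

import numpy as np
import tables as tb

# A sketch: pull the whole table into memory as a structured array (~a few MB).
with tb.open_file("rain.h5", mode="r") as h5:
    data = h5.root.obs.read()

# Indices that sort by rain rate, reversed for decreasing order.
order = np.argsort(data['rain_rate'])[::-1]
data_sorted = data[order]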


Unless you have a very old computer, you can comfortably put the entire array in memory. Assuming single-precision floats, it will only take 100000*6*4./2**20 = 2.29 MB (twice as much for doubles). You can use NumPy's sort or argsort for sorting. For example, you can get the sorting indices from your second column:

import numpy as np
a = np.random.normal(0, 1, size=(100000,6))
idx = a[:, 1].argsort()

And then use these to index the columns you want, or the whole array:

b = a[idx]
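Since the question asks for decreasing rain rates, reversing the index array gives descending order:

b_desc = a[idx[::-1]]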

You can even use different types of sort and check their speed:

In [33]: %timeit idx = a[:, 1].argsort(kind='quicksort')
100 loops, best of 3: 12.6 ms per loop

In [34]: %timeit idx = a[:, 1].argsort(kind='mergesort')
100 loops, best of 3: 14.4 ms per loop

In [35]: %timeit idx = a[:, 1].argsort(kind='heapsort')
10 loops, best of 3: 21.4 ms per loop

So you see that for an array of this size the choice of sorting algorithm doesn't really matter.

1 Comment

Thanks. I think I may have under-estimated the array size. I have done this in numpy before, when the time index was over 330 million. I forgot to multiply it by the lat and lon dimensions. Also the size will increase as I process data with higher and higher horizontal resolution. Appreciate the tip. I'm in the process of writing the code and adapting my old numpy scripts.
