
I have a numpy array of floats/ints and want to map each element to its rank.

If the array has no duplicates, the problem can be solved with the following code:

In [49]: a1
Out[49]: array([ 0.1,  5.1,  2.1,  3.1,  4.1,  1.1,  6.1,  8.1,  7.1,  9.1])

In [50]: a1.argsort()
Out[50]: array([0, 5, 2, 3, 4, 1, 6, 8, 7, 9])

...or a1.argsort().argsort() in the general case (for this particular a1 the single argsort happens to coincide with the ranks); the same double argsort, applied along an axis, also works for a 2D array.
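For concreteness, a minimal sketch of the 2D case, assuming each row should be ranked independently:

import numpy as np

b = np.array([[0.5, 0.1, 0.3],
              [2.0, 1.0, 3.0]])

b.argsort(axis=1).argsort(axis=1)
# -> array([[2, 0, 1],
#           [1, 0, 2]])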

Now I want to extend this method to arrays with possible duplicates, so that duplicates are mapped to the same value. For example, I want this array:

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])

to be mapped to any of the following three:

0 1 4 5 6 1 7 8 8 1       # a) minimum rank
0 3 4 5 6 3 7 9 9 3       # b) maximum rank
0 2 4 5 6 2 7 8.5 8.5 2   # c) average rank

In cases a) and b), duplicates are mapped to the minimum/maximum of the ranks they would receive from a plain a2.argsort()-based ranking; case c) is just the average of the ranks from cases a) and b).
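For illustration, a plain double argsort on a2 breaks ties arbitrarily by position, which is exactly what I want to avoid (using a stable mergesort here so the tie order is reproducible):

import numpy as np

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
a2.argsort(kind='mergesort').argsort(kind='mergesort')
# -> array([0, 1, 4, 5, 6, 2, 7, 8, 9, 3])
# the three 1.1s receive distinct ranks 1, 2, 3 instead of one shared value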

Any suggestions?

EDIT (efficiency requirements)

In the initial description I forgot to mention my speed requirement. I'm looking for a solution using pure numpy/scipy functions, avoiding the overhead of native Python loops. For example, consider the solution proposed by Richard, which solves the problem correctly but is quite slow:

def argsortdup(a1):
  sorted = np.sort(a1)
  ranked = []
  for item in a1:
    ranked.append(sorted.searchsorted(item))
  return np.array(ranked)

In [86]: a2 = np.array([ 0.1,  1.1,  2.1,  3.1,  4.1,  1.1,  6.1,  7.1,  7.1,  1.1])

In [87]: %timeit a2.argsort().argsort()
1000000 loops, best of 3: 1.55 us per loop

In [88]: %timeit argsortdup(a2)
10000 loops, best of 3: 25.6 us per loop

In [89]: a = np.arange(0.1, 1000.1)

In [90]: %timeit a.argsort().argsort()
10000 loops, best of 3: 24.5 us per loop

In [91]: %timeit argsortdup(a)
1000 loops, best of 3: 1.14 ms per loop

In [92]: a = np.arange(0.1, 10000.1)

In [93]: %timeit a.argsort().argsort()
1000 loops, best of 3: 303 us per loop

In [94]: %timeit argsortdup(a)
100 loops, best of 3: 11.9 ms per loop

It is clear from the timings above that argsortdup is 30-50 times slower than a.argsort().argsort(). The main reason is the use of Python loops and lists.

5 Comments

  • Why are you comparing with a.argsort().argsort()? That doesn't give you the answer. Commented Feb 3, 2013 at 10:51
  • Correct, this doesn't give me the answer in the presence of duplicates. However, I wanted to emphasize the performance difference between pure numpy functions and numpy plus Python loops, and the difference is huge. Since I don't know of a solution faster than the one provided by Richard, I'm using a.argsort().argsort() for now :) Commented Feb 3, 2013 at 11:03
  • Richard's answer is right on track, but needs a bit more vectorizing. See my comment on his answer. That should give you the answer in a fairly reasonable time. Commented Feb 3, 2013 at 11:08
  • Check out scipy.stats.rankdata. Also, take a look at the ranking functions in the pandas package (pandas.pydata.org/pandas-docs/stable/…). Commented Feb 3, 2013 at 12:50
  • (For this particular a1 the second .argsort() doesn't change the result; in general it is needed to turn the sorting permutation into ranks.) Commented Apr 19, 2024 at 0:10

3 Answers


You can do reasonably well using unique and bincount:

>>> u, v = np.unique(a2, return_inverse=True)
>>> (np.cumsum(np.bincount(v)) - 1)[v]
array([0, 3, 4, 5, 6, 3, 7, 9, 9, 3])

Or, for the minimum rank:

>>> (np.cumsum(np.concatenate(([0], np.bincount(v)))))[v]
array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])

There's a minor speedup to be had by telling bincount how many bins to produce:

(np.cumsum(np.bincount(v, minlength=u.size)) - 1)[v]
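For case c), one possible sketch (building on the same idea, not part of the original answer) is to average the minimum and maximum ranks:

import numpy as np

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
u, v = np.unique(a2, return_inverse=True)
counts = np.bincount(v, minlength=u.size)

max_rank = (np.cumsum(counts) - 1)[v]                   # case b) maximum rank
min_rank = np.cumsum(np.concatenate(([0], counts)))[v]  # case a) minimum rank
avg_rank = (min_rank + max_rank) / 2.0                  # case c) average rank
# avg_rank -> array([0. , 2. , 4. , 5. , 6. , 2. , 7. , 8.5, 8.5, 2. ])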

5 Comments

Very beautiful solution, works correctly and very fast on large arrays. Thank you a lot!
@Merlin sure, using scipy.stats.rankdata it's just "ordinal" - "min". My code will also work if you can't use scipy for some reason. Answered on your question.
@ecatmur How could I use your code for my Question? scipy works, looking at yours to learn something..
@Merlin sure, just use my (second) solution to replace rankdata(..., "min") and the argsort().argsort() technique for rankdata(..., "ordinal"). Updated my answer.

After upgrading to the latest version of scipy, as suggested by @WarrenWeckesser in the comments, scipy.stats.rankdata is faster than both scipy.stats.mstats.rankdata and np.searchsorted, making it the fastest way to do this on larger arrays.

In [1]: import numpy as np

In [2]: from scipy.stats import rankdata as rd
   ...: from scipy.stats.mstats import rankdata as rd2
   ...: 

In [3]: array = np.arange(0.1, 1000000.1)

In [4]: %timeit np.searchsorted(np.sort(array), array)
1 loops, best of 3: 385 ms per loop

In [5]: %timeit rd(array)
10 loops, best of 3: 109 ms per loop

In [6]: %timeit rd2(array)
1 loops, best of 3: 205 ms per loop
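For reference, in newer scipy versions rankdata also takes a method argument that maps directly onto the three variants in the question; it returns 1-based ranks, so subtract 1 to match the 0-based outputs above (a sketch):

import numpy as np
from scipy.stats import rankdata

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])

rankdata(a2, method='min') - 1      # a) -> [0. 1. 4. 5. 6. 1. 7. 8. 8. 1.]
rankdata(a2, method='max') - 1      # b) -> [0. 3. 4. 5. 6. 3. 7. 9. 9. 3.]
rankdata(a2, method='average') - 1  # c) -> [0.  2.  4.  5.  6.  2.  7.  8.5 8.5 2. ]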

6 Comments

Thanks, your answer is useful. However, as far as I can see it implements the logic internally in pure Python (very similar to what Richard proposed, with a little extra logic to handle ties). As a result it is even slower than Richard's solution. But anyway, it's good to have a ready-to-use solution.
@MikhailShevelev -- the speed depends on the size of the array, try something big, and it will outperform both of the others.
You could use scipy.stats.rankdata, which is faster than scipy.stats.mstats.rankdata.
@WarrenWeckesser -- I actually checked it, but it seems to do (significantly) worse on large arrays, which is somewhat confusing...
@root: What version of scipy are you using? In a recent version of scipy, stats.rankdata was rewritten in cython, so it will be much faster than the older python+numpy version.

Here is a function that returns the output you desire (in the first case):

import numpy as np

def argsortdup(a1):
  # searchsorted returns the leftmost matching index in the sorted copy,
  # so duplicates all receive the minimum rank.
  sorted = np.sort(a1)
  ranked = []
  for item in a1:
    ranked.append(sorted.searchsorted(item))
  return np.array(ranked)

Basically, you sort the array and then search for the index at which each item appears. With duplicates, the index of the first occurrence is returned. I tested it with your a2 example; doing something like

a3 = argsortdup(a2)

Yields

array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])

"Test with a2":

>>> a2
array([ 0.1,  1.1,  2.1,  3.1,  4.1,  1.1,  6.1,  7.1,  7.1,  1.1])
>>> def argsortdup(a1):
...   sorted = sort(a1)
...   ranked = []
...   for item in a1:
...     ranked.append(sorted.searchsorted(item))
...   return array(ranked)
...
>>> a3 = argsortdup(a2)
>>> a2
array([ 0.1,  1.1,  2.1,  3.1,  4.1,  1.1,  6.1,  7.1,  7.1,  1.1])
>>> a3
array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])
>>>
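As suggested in the comments below, the same logic can be vectorized; a sketch (the argsortdup_vec name is just illustrative):

import numpy as np

def argsortdup_vec(a1):
    # One vectorized searchsorted call replaces the Python loop and
    # likewise maps duplicates to the minimum rank.
    return np.searchsorted(np.sort(a1), a1)

argsortdup_vec(np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1]))
# -> array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])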

2 Comments

Richard, thank you for the quick answer. Your function solves the problem, but I'm looking for a more efficient solution in terms of execution time. I forgot to mention this in the initial description -- that's my fault and I'm sorry. Please see the updated description for more details. Once again, thank you for your response!
Correct approach but you could do better if you used numpy better. That function is just: np.searchsorted(np.sort(a1), a1)
