
I have a numpy array of floats/ints and want to map each element to its rank.

If the array has no duplicates, the problem can be solved with the following code:

In [49]: a1
Out[49]: array([ 0.1,  5.1,  2.1,  3.1,  4.1,  1.1,  6.1,  8.1,  7.1,  9.1])

In [50]: a1.argsort()
Out[50]: array([0, 5, 2, 3, 4, 1, 6, 8, 7, 9])

...or a1.argsort().argsort() in the general case (for this particular a1 the single argsort happens to coincide with the ranks); the same double argsort, applied along an axis, also works for a 2D array.
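For concreteness, a minimal sketch of the 2D case, assuming each row should be ranked independently:

import numpy as np

b = np.array([[0.5, 0.1, 0.3],
              [2.0, 1.0, 3.0]])

b.argsort(axis=1).argsort(axis=1)
# -> array([[2, 0, 1],
#           [1, 0, 2]])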

Now I want to extend this method to arrays with possible duplicates, so that duplicates are mapped to the same value. For example, I want this array:

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])

to be mapped to any of the following three:

0 1 4 5 6 1 7 8 8 1       # a) minimum rank
0 3 4 5 6 3 7 9 9 3       # b) maximum rank
0 2 4 5 6 2 7 8.5 8.5 2   # c) average rank

In cases a) and b), duplicates are mapped to the minimum/maximum of the ranks they would receive from a plain a2.argsort()-based ranking; case c) is just the average of the ranks from cases a) and b).
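For illustration, a plain double argsort on a2 breaks ties arbitrarily by position, which is exactly what I want to avoid (using a stable mergesort here so the tie order is reproducible):

import numpy as np

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
a2.argsort(kind='mergesort').argsort(kind='mergesort')
# -> array([0, 1, 4, 5, 6, 2, 7, 8, 9, 3])
# the three 1.1s receive distinct ranks 1, 2, 3 instead of one shared value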

Any suggestions?

EDIT (efficiency requirements)

In the initial description I forgot to mention my speed requirement. I'm looking for a solution using pure numpy/scipy functions, avoiding the overhead of native Python loops. For example, consider the solution proposed by Richard, which solves the problem correctly but is quite slow:

def argsortdup(a1):
  sorted = np.sort(a1)
  ranked = []
  for item in a1:
    ranked.append(sorted.searchsorted(item))
  return np.array(ranked)

In [86]: a2 = np.array([ 0.1,  1.1,  2.1,  3.1,  4.1,  1.1,  6.1,  7.1,  7.1,  1.1])

In [87]: %timeit a2.argsort().argsort()
1000000 loops, best of 3: 1.55 us per loop

In [88]: %timeit argsortdup(a2)
10000 loops, best of 3: 25.6 us per loop

In [89]: a = np.arange(0.1, 1000.1)

In [90]: %timeit a.argsort().argsort()
10000 loops, best of 3: 24.5 us per loop

In [91]: %timeit argsortdup(a)
1000 loops, best of 3: 1.14 ms per loop

In [92]: a = np.arange(0.1, 10000.1)

In [93]: %timeit a.argsort().argsort()
1000 loops, best of 3: 303 us per loop

In [94]: %timeit argsortdup(a)
100 loops, best of 3: 11.9 ms per loop

It is clear from the timings above that argsortdup is 30-50 times slower than a.argsort().argsort(). The main reason is the use of Python loops and lists.

5 Comments

  • Why are you comparing with a.argsort().argsort()? That doesn't give you the answer. Commented Feb 3, 2013 at 10:51
  • Correct, this doesn't give me the answer in the presence of duplicates. However, I wanted to emphasize the performance difference between pure numpy functions and numpy plus Python loops, and the difference is huge. Since I don't know of a solution faster than the one provided by Richard, I'm using a.argsort().argsort() for now :) Commented Feb 3, 2013 at 11:03
  • Richard's answer is right on track, but needs a bit more vectorizing. See my comment on his answer. That should give you the answer in a fairly reasonable time. Commented Feb 3, 2013 at 11:08
  • Check out scipy.stats.rankdata. Also, take a look at the ranking functions in the pandas package (pandas.pydata.org/pandas-docs/stable/…). Commented Feb 3, 2013 at 12:50
  • (For this particular a1 the second .argsort() doesn't change the result; in general it is needed to turn the sorting permutation into ranks.) Commented Apr 19, 2024 at 0:10

3 Answers


You can do reasonably well using unique and bincount:

>>> u, v = np.unique(a2, return_inverse=True)
>>> (np.cumsum(np.bincount(v)) - 1)[v]
array([0, 3, 4, 5, 6, 3, 7, 9, 9, 3])

Or, for the minimum rank:

>>> (np.cumsum(np.concatenate(([0], np.bincount(v)))))[v]
array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])

There's a minor speedup to be had by telling bincount how many bins to produce:

(np.cumsum(np.bincount(v, minlength=u.size)) - 1)[v]
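For case c), one possible sketch (building on the same idea, not part of the original answer) is to average the minimum and maximum ranks:

import numpy as np

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
u, v = np.unique(a2, return_inverse=True)
counts = np.bincount(v, minlength=u.size)

max_rank = (np.cumsum(counts) - 1)[v]                   # case b) maximum rank
min_rank = np.cumsum(np.concatenate(([0], counts)))[v]  # case a) minimum rank
avg_rank = (min_rank + max_rank) / 2.0                  # case c) average rank
# avg_rank -> array([0. , 2. , 4. , 5. , 6. , 2. , 7. , 8.5, 8.5, 2. ])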

5 Comments

Very beautiful solution, works correctly and very fast on large arrays. Thank you a lot!
@Merlin sure, using scipy.stats.rankdata it's just "ordinal" - "min". My code will also work if you can't use scipy for some reason. Answered on your question.
@ecatmur How could I use your code for my Question? scipy works, looking at yours to learn something..
@Merlin sure, just use my (second) solution to replace rankdata(..., "min") and the argsort().argsort() technique for rankdata(..., "ordinal"). Updated my answer.

After upgrading to the latest version of scipy, as suggested by @WarrenWeckesser in the comments, scipy.stats.rankdata is faster than both scipy.stats.mstats.rankdata and np.searchsorted, making it the fastest way to do this on larger arrays.

In [1]: import numpy as np

In [2]: from scipy.stats import rankdata as rd
   ...: from scipy.stats.mstats import rankdata as rd2
   ...: 

In [3]: array = np.arange(0.1, 1000000.1)

In [4]: %timeit np.searchsorted(np.sort(array), array)
1 loops, best of 3: 385 ms per loop

In [5]: %timeit rd(array)
10 loops, best of 3: 109 ms per loop

In [6]: %timeit rd2(array)
1 loops, best of 3: 205 ms per loop
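For reference, in newer scipy versions rankdata also takes a method argument that maps directly onto the three variants in the question; it returns 1-based ranks, so subtract 1 to match the 0-based outputs above (a sketch):

import numpy as np
from scipy.stats import rankdata

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])

rankdata(a2, method='min') - 1      # a) -> [0. 1. 4. 5. 6. 1. 7. 8. 8. 1.]
rankdata(a2, method='max') - 1      # b) -> [0. 3. 4. 5. 6. 3. 7. 9. 9. 3.]
rankdata(a2, method='average') - 1  # c) -> [0.  2.  4.  5.  6.  2.  7.  8.5 8.5 2. ]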

6 Comments

Thanks, your answer is useful. However, as far as I can see it implements the logic internally in pure Python (very similar to what Richard proposed, with a little extra logic to handle ties). As a result it is even slower than Richard's solution. But anyway, it's good to have a ready-to-use solution.
@MikhailShevelev -- the speed depends on the size of the array, try something big, and it will outperform both of the others.
You could use scipy.stats.rankdata, which is faster than scipy.stats.mstats.rankdata.
@WarrenWeckesser -- I actually checked it, but it seems to do (significantly) worse on large arrays, which is somewhat confusing...
@root: What version of scipy are you using? In a recent version of scipy, stats.rankdata was rewritten in cython, so it will be much faster than the older python+numpy version.

Here is a function that returns the output you desire (in the first case):

import numpy as np

def argsortdup(a1):
  # searchsorted returns the leftmost matching index in the sorted copy,
  # so duplicates all receive the minimum rank.
  sorted = np.sort(a1)
  ranked = []
  for item in a1:
    ranked.append(sorted.searchsorted(item))
  return np.array(ranked)

Basically, you sort the array and then search for the index at which each item appears. With duplicates, the index of the first occurrence is returned. I tested it with your a2 example; doing something like

a3 = argsortdup(a2)

Yields

array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])

"Test with a2":

>>> a2
array([ 0.1,  1.1,  2.1,  3.1,  4.1,  1.1,  6.1,  7.1,  7.1,  1.1])
>>> def argsortdup(a1):
...   sorted = sort(a1)
...   ranked = []
...   for item in a1:
...     ranked.append(sorted.searchsorted(item))
...   return array(ranked)
...
>>> a3 = argsortdup(a2)
>>> a2
array([ 0.1,  1.1,  2.1,  3.1,  4.1,  1.1,  6.1,  7.1,  7.1,  1.1])
>>> a3
array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])
>>>
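As suggested in the comments below, the same logic can be vectorized; a sketch (the argsortdup_vec name is just illustrative):

import numpy as np

def argsortdup_vec(a1):
    # One vectorized searchsorted call replaces the Python loop and
    # likewise maps duplicates to the minimum rank.
    return np.searchsorted(np.sort(a1), a1)

argsortdup_vec(np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1]))
# -> array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])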

2 Comments

Richard, thank you for the quick answer. Your function solves the problem, but I'm looking for a more efficient solution in terms of execution time. I forgot to mention this in the initial description -- that's my fault and I'm sorry. Please see the updated description for more details. Once again, thank you for your response!
Correct approach but you could do better if you used numpy better. That function is just: np.searchsorted(np.sort(a1), a1)
