I have a pretty simple operation involving two not-so-large arrays:

  1. For each element at position i in the first (larger) array
  2. Check whether it exists in the second (smaller) array
  3. If it does, find its index j in the second array
  4. Store the float at position i of a third array (same length as the first) into position j of a fourth array (same length as the second)

The for block below works, but gets very slow even for moderately sized arrays (>10000 elements).

Can this implementation be made faster?

import numpy as np
import random

##############################################
# Generate some random data.
# 'Nb' is always smaller than 'Na'
Na, Nb = 50000, 40000

# List of IDs (could be any string, I use integers here for simplicity)
ids_a = random.sample(range(1, Na * 10), Na)
ids_a = [str(_) for _ in ids_a]
random.shuffle(ids_a)
# Some floats associated to these IDs
vals_in_a = np.random.uniform(0., 1., Na)

# Smaller list of repeated IDs from 'ids_a'
ids_b = random.sample(ids_a, Nb)
# Array to be filled
vals_in_b = np.zeros(Nb)
##############################################

# This block needs to be *a lot* more efficient
#
# For each string in 'ids_a'
for i, id_a in enumerate(ids_a):
    # if it exists in 'ids_b'
    if id_a in ids_b:
        # find where in 'ids_b' this element is located
        j = ids_b.index(id_a)
        # store the value taken from 'vals_in_a' in that position
        vals_in_b[j] = vals_in_a[i]
  • yes, this is a polynomial-time algorithm: it's going to get slower fast as you increase the size of the inputs. Use a better algorithm, likely using different data structures that support constant-time (instead of linear-time) membership testing. Commented Sep 26, 2019 at 0:31
  • Any suggestions regarding a better algorithm, Juan? Commented Sep 26, 2019 at 1:06
  • Create a dict mapping the items in ids_b to their indices, then just use that dictionary instead of checking membership on an array. Commented Sep 26, 2019 at 1:07
  • You can 1) concatenate ids_a and ids_b, 2) apply np.unique with keyword return_inverse=True, 3) split the inverse into inv_a and inv_b, 4) map the values to match the order of the uniques: vals[inv_a] = vals_in_a, and 5) use inv_b to pick the correct values: result = vals[inv_b]. Commented Sep 26, 2019 at 1:30
  • I'll try both approaches tomorrow morning and see what comes out of it. Thank you. Commented Sep 26, 2019 at 2:09

2 Answers


In defense of my approach, here is the authoritative implementation:

import itertools as it

def pp():
    la, lb = len(ids_a), len(ids_b)
    # Pack both ID lists into one fixed-width byte-string array
    ids = np.fromiter(it.chain(ids_a, ids_b), '<S6', la + lb)
    # 'inv' gives, for every position in 'ids', the index of that ID
    # in the sorted array of uniques 'unq'
    unq, inv = np.unique(ids, return_inverse=True)
    # 'la' slots suffice because 'ids_a' has no duplicates and every
    # element of 'ids_b' also occurs in 'ids_a'
    vals = np.empty(la, vals_in_a.dtype)
    # Scatter the values into unique order, then gather them for 'ids_b'
    vals[inv[:la]] = vals_in_a
    return vals[inv[la:]]

from timeit import timeit

(juanpa() == pp()).all()
# True

timeit(juanpa, number=100)
# 3.1373191522434354
timeit(pp, number=100)
# 2.5256317732855678
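
A note on the hard-coded '<S6': it fits this test data (integer IDs below 500000, so at most six ASCII characters). For arbitrary ID strings one could derive the width from the data instead; a minimal sketch of a drop-in replacement for the np.fromiter line in pp(), assuming the IDs are plain ASCII:

# Sketch only: size the byte-string dtype from the data instead of
# hard-coding '<S6'. Checking 'ids_a' suffices here, since 'ids_b'
# is drawn from it.
width = max(map(len, ids_a))
ids = np.fromiter(it.chain(ids_a, ids_b), '<S%d' % width, la + lb)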

That said, @juanpa.arrivillaga's suggestion can also be implemented better:

import operator as op

def ja():
    # Build an id -> value mapping once, then fetch every id in
    # 'ids_b', in order, with a single itemgetter call
    return op.itemgetter(*ids_b)(dict(zip(ids_a, vals_in_a)))

(ja() == pp()).all()
# True
timeit(ja, number=100)
# 2.015202699229121
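
One caveat not spelled out above: op.itemgetter(*ids_b) returns a plain tuple, not a NumPy array. If downstream code expects an ndarray like vals_in_b, the result can be wrapped; a trivial sketch:

# ja() yields a tuple of floats; convert when an ndarray is needed
vals_b = np.asarray(ja())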

1 Comment

Indeed, this implementation of your answer works as expected. Also, your implementation of Juan's answer is the fastest. Thank you very much Paul!

I tried the approaches suggested by juanpa.arrivillaga and Paul Panzer. The first one is the fastest by far, and it is also the simplest. The second one is faster than my original approach, but considerably slower than the first. It also has the drawback that the line vals[inv_a] = vals_in_a stores floats into a U5 array, thus converting them into strings. They can be converted back to floats at the end, but digits are lost (unless I'm missing something obvious, of course).

Here are the implementations:

def juanpa():
    # Map each ID in 'ids_b' to its index, so lookups are O(1)
    dict_ids_b = {_: i for i, _ in enumerate(ids_b)}
    for i, id_a in enumerate(ids_a):
        try:
            vals_in_b[dict_ids_b[id_a]] = vals_in_a[i]
        except KeyError:
            pass

    return vals_in_b


def Paul():
    # 1) concatenate ids_a and ids_b
    ids_ab = ids_a + ids_b
    # 2) apply np.unique with keyword return_inverse=True
    vals, idxs = np.unique(ids_ab, return_inverse=True)
    # 3) split the inverse into inv_a and inv_b
    inv_a, inv_b = idxs[:len(ids_a)], idxs[len(ids_a):]
    # 4) map the values to match the order of uniques: vals[inv_a] = vals_in_a
    vals[inv_a] = vals_in_a
    # 5) use inv_b to pick the correct values: result = vals[inv_b]
    vals_in_b = vals[inv_b].astype(float)

    return vals_in_b
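
The float-to-string conversion noted above comes from reusing vals, the string array returned by np.unique, as the scatter target. A minimal sketch of a possible fix (my variant, essentially what pp() in the other answer does): scatter into a separate float array instead.

def Paul_float():
    # Hypothetical variant of Paul(): same unique/inverse trick, but
    # the values go into a float array, so no string conversion occurs
    ids_ab = ids_a + ids_b
    uniq, idxs = np.unique(ids_ab, return_inverse=True)
    inv_a, inv_b = idxs[:len(ids_a)], idxs[len(ids_a):]
    out = np.empty(len(uniq), dtype=vals_in_a.dtype)
    out[inv_a] = vals_in_a
    return out[inv_b]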

3 Comments

try ... except has a cost; you should probably replace it with a simpler if id_a in dict_ids_b.
I've always been under the impression that a try..except block is cheaper than an if..else block. You mean this is not true?
@OneLyner, Gabriel, in cases like this an if...else expression costs one extra dictionary lookup; since these are not cheap, it typically tips the balance.
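
For what it's worth, a single-lookup middle ground between the two styles discussed here is dict.get with a sentinel; a sketch of how the inner loop of juanpa() could look (my variant, not from the comments above):

# One dict lookup per element, and no exception machinery
for i, id_a in enumerate(ids_a):
    j = dict_ids_b.get(id_a)  # None when id_a is absent from ids_b
    if j is not None:
        vals_in_b[j] = vals_in_a[i]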
