I have a pretty simple operation involving two not-so-large arrays:

  1. For each element at position i in the first (larger) array
  2. Check whether it exists in the second (smaller) array
  3. If it does, find its index j in the second array
  4. Store the float at position i of a third array (same length as the first) into position j of a fourth array (same length as the second)

The for block below works, but gets very slow even for moderately sized arrays (>10000 elements).

Can this implementation be made faster?

import numpy as np
import random

##############################################
# Generate some random data.
# 'Nb' is always smaller than 'Na'
Na, Nb = 50000, 40000

# List of IDs (could be any string, I use integers here for simplicity)
ids_a = random.sample(range(1, Na * 10), Na)
ids_a = [str(_) for _ in ids_a]
random.shuffle(ids_a)
# Some floats associated to these IDs
vals_in_a = np.random.uniform(0., 1., Na)

# Smaller list of repeated IDs from 'ids_a'
ids_b = random.sample(ids_a, Nb)
# Array to be filled
vals_in_b = np.zeros(Nb)
##############################################

# This block needs to be *a lot* more efficient
#
# For each string in 'ids_a'
for i, id_a in enumerate(ids_a):
    # if it exists in 'ids_b'
    if id_a in ids_b:
        # find where in 'ids_b' this element is located
        j = ids_b.index(id_a)
        # store the value taken from 'vals_in_a' in that position
        vals_in_b[j] = vals_in_a[i]
  • yes, this is a polynomial-time algorithm: it's going to get slower fast as you increase the size of the inputs. Use a better algorithm, likely using different data structures that support constant-time (instead of linear-time) membership testing. Commented Sep 26, 2019 at 0:31
  • Any suggestions regarding a better algorithm, Juan? Commented Sep 26, 2019 at 1:06
  • Create a dict mapping the items in ids_b to their indices, then just use that dictionary instead of checking membership on an array. Commented Sep 26, 2019 at 1:07
  • You can 1) concatenate ids_a and ids_b, 2) apply np.unique with keyword return_inverse=True, 3) split the inverse into inv_a and inv_b, 4) map the values to match the order of the uniques: vals[inv_a] = vals_in_a, and 5) use inv_b to pick the correct values: result = vals[inv_b]. Commented Sep 26, 2019 at 1:30
  • I'll try both approaches tomorrow morning and see what comes out of it. Thank you. Commented Sep 26, 2019 at 2:09

2 Answers


In defense of my approach, here is the authoritative implementation:

import itertools as it

def pp():
    la, lb = len(ids_a), len(ids_b)
    # Pack both ID lists into one fixed-width byte-string array
    ids = np.fromiter(it.chain(ids_a, ids_b), '<S6', la + lb)
    # 'inv' gives, for every position in 'ids', the index of that ID
    # in the sorted array of uniques 'unq'
    unq, inv = np.unique(ids, return_inverse=True)
    # 'la' slots suffice because 'ids_a' has no duplicates and every
    # element of 'ids_b' also occurs in 'ids_a'
    vals = np.empty(la, vals_in_a.dtype)
    # Scatter the values into unique order, then gather them for 'ids_b'
    vals[inv[:la]] = vals_in_a
    return vals[inv[la:]]

from timeit import timeit

(juanpa() == pp()).all()
# True

timeit(juanpa, number=100)
# 3.1373191522434354
timeit(pp, number=100)
# 2.5256317732855678
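
A note on the hard-coded '<S6': it fits this test data (integer IDs below 500000, so at most six ASCII characters). For arbitrary ID strings one could derive the width from the data instead; a minimal sketch of a drop-in replacement for the np.fromiter line in pp(), assuming the IDs are plain ASCII:

# Sketch only: size the byte-string dtype from the data instead of
# hard-coding '<S6'. Checking 'ids_a' suffices here, since 'ids_b'
# is drawn from it.
width = max(map(len, ids_a))
ids = np.fromiter(it.chain(ids_a, ids_b), '<S%d' % width, la + lb)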

That said, @juanpa.arrivillaga's suggestion can also be implemented better:

import operator as op

def ja():
    # Build an id -> value mapping once, then fetch every id in
    # 'ids_b', in order, with a single itemgetter call
    return op.itemgetter(*ids_b)(dict(zip(ids_a, vals_in_a)))

(ja() == pp()).all()
# True
timeit(ja, number=100)
# 2.015202699229121
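
One caveat not spelled out above: op.itemgetter(*ids_b) returns a plain tuple, not a NumPy array. If downstream code expects an ndarray like vals_in_b, the result can be wrapped; a trivial sketch:

# ja() yields a tuple of floats; convert when an ndarray is needed
vals_b = np.asarray(ja())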

1 Comment

Indeed, this implementation of your answer works as expected. Also, your implementation of Juan's answer is the fastest. Thank you very much Paul!

I tried the approaches suggested by juanpa.arrivillaga and Paul Panzer. The first one is the fastest by far, and it is also the simplest. The second one is faster than my original approach, but considerably slower than the first. It also has the drawback that the line vals[inv_a] = vals_in_a stores floats into a U5 array, thus converting them into strings. They can be converted back to floats at the end, but digits are lost (unless I'm missing something obvious, of course).

Here are the implementations:

def juanpa():
    # Map each ID in 'ids_b' to its index, so lookups are O(1)
    dict_ids_b = {_: i for i, _ in enumerate(ids_b)}
    for i, id_a in enumerate(ids_a):
        try:
            vals_in_b[dict_ids_b[id_a]] = vals_in_a[i]
        except KeyError:
            pass

    return vals_in_b


def Paul():
    # 1) concatenate ids_a and ids_b
    ids_ab = ids_a + ids_b
    # 2) apply np.unique with keyword return_inverse=True
    vals, idxs = np.unique(ids_ab, return_inverse=True)
    # 3) split the inverse into inv_a and inv_b
    inv_a, inv_b = idxs[:len(ids_a)], idxs[len(ids_a):]
    # 4) map the values to match the order of uniques: vals[inv_a] = vals_in_a
    vals[inv_a] = vals_in_a
    # 5) use inv_b to pick the correct values: result = vals[inv_b]
    vals_in_b = vals[inv_b].astype(float)

    return vals_in_b
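
The float-to-string conversion noted above comes from reusing vals, the string array returned by np.unique, as the scatter target. A minimal sketch of a possible fix (my variant, essentially what pp() in the other answer does): scatter into a separate float array instead.

def Paul_float():
    # Hypothetical variant of Paul(): same unique/inverse trick, but
    # the values go into a float array, so no string conversion occurs
    ids_ab = ids_a + ids_b
    uniq, idxs = np.unique(ids_ab, return_inverse=True)
    inv_a, inv_b = idxs[:len(ids_a)], idxs[len(ids_a):]
    out = np.empty(len(uniq), dtype=vals_in_a.dtype)
    out[inv_a] = vals_in_a
    return out[inv_b]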

3 Comments

try ... except has a cost; you should probably replace it with a simpler if id_a in dict_ids_b.
I've always been under the impression that a try..except block is cheaper than an if..else block. You mean this is not true?
@OneLyner, Gabriel, in cases like this an if...else expression costs one extra dictionary lookup; since these are not cheap, it typically tips the balance.
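
For what it's worth, a single-lookup middle ground between the two styles discussed here is dict.get with a sentinel; a sketch of how the inner loop of juanpa() could look (my variant, not from the comments above):

# One dict lookup per element, and no exception machinery
for i, id_a in enumerate(ids_a):
    j = dict_ids_b.get(id_a)  # None when id_a is absent from ids_b
    if j is not None:
        vals_in_b[j] = vals_in_a[i]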
