
We have to perform the following operation around 400,000 times, so I'm looking for the most efficient solution. I have tried several approaches, but I'm curious whether there are even better ones :)


Data example

We can use the following code to generate an example test set:
import random
import numpy as np
import numba as nb

random.seed(10)
np.random.seed(10)
def test_str():
    n = 10000000
    arr  = np.random.randint(10000, size=n)
    sign = np.random.choice(['+','-'], size=n)
    return 'ID1' + '\t' + ' '.join(["{}{}".format(a,b) for a,b in zip(arr, sign)])

The output looks like: ID1\t7688+ 737+ 677+ 1508- 9251-......
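To make the intended result concrete, here is a minimal pure-Python reference parser for that string format. It is only a sketch for checking the faster variants against: the name `parse_reference` is hypothetical, and the column order (value, sign, index, identifier) is an assumption taken from what `process_number` returns.

```python
def parse_reference(data):
    """Parse 'ID<n>\\t<num><sign> <num><sign> ...' into a list of rows.

    Each row is (value, sign, index, identifier); this column order is an
    assumption based on the question's process_number function.
    """
    identifier, payload = data.split('\t')
    ident = int(identifier[-1])  # last character of 'ID1' -> 1
    rows = []
    for i, token in enumerate(payload.split()):
        sign = 1 if token[-1] == '+' else -1  # trailing '+' or '-'
        rows.append((int(token[:-1]), sign, i, ident))
    return rows


sample = 'ID1\t7688+ 737+ 677+ 1508- 9251-'
for row in parse_reference(sample):
    print(row)
# (7688, 1, 0, 1)
# (737, 1, 1, 1)
# (677, 1, 2, 1)
# (1508, -1, 3, 1)
# (9251, -1, 4, 1)
```

Comparing each optimized function against a slow reference like this on a short string is a cheap way to catch mismatched columns or signs before timing anything.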

The code in question :)

You can copy the code from Google Colab (P.S. running it there gave me a TypingError, whereas it ran fine on my machine), or just see the functions below.

General function
Taken from this Numba issue. Based on @armamut's answer, however, this function may introduce a lot of overhead under Numba, which apparently makes plain NumPy faster.

@nb.jit(nopython=True)
def str_to_int(s):
    # Convert a string of ASCII digits to an int (48 is ord('0'))
    final_index, result = len(s) - 1, 0
    for i,v in enumerate(s):
        result += (ord(v) - 48) * (10 ** (final_index - i))
    return result
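The function above raises 10 to a power for every digit. A common alternative is Horner's rule, which multiplies a running total by 10 instead. The sketch below is plain Python and has not been benchmarked here, under Numba or otherwise; the name `str_to_int_horner` is hypothetical, and like the original it assumes the input contains only ASCII digits.

```python
def str_to_int_horner(s):
    """Convert a string of ASCII digits to an int using Horner's rule."""
    result = 0
    for ch in s:
        # Shift the digits seen so far one decimal place left, then add the new one.
        result = result * 10 + (ord(ch) - 48)  # 48 is ord('0')
    return result


print(str_to_int_horner("7688"))  # 7688
```

This avoids the exponentiation and the need to know the string length up front, at the cost of no validation of the input characters.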

Approach 1

@nb.jit(nopython=True)
def process_number(numb, identifier, i):
    sign = 1 if numb[-1] == '+' else -1
    return str_to_int(numb[:-1]), sign, i, identifier
    
@nb.jit(nopython=True)
def expand1(data):
    identifier, l = data.split('\t')
    identifier = str_to_int(identifier[-1])
    numbers = l.split()
    # Initialize an empty NumPy array
    arr = np.empty(shape = (len(numbers), 4), dtype = np.int64)
    # Fill the array row by row
    for i, numb in enumerate(numbers):
        arr[i,:] = process_number(numb, identifier, i)
    return arr

Approach 2

@nb.jit(nopython=True)
def expand2(data):
    identifier, l = data.split('\t')
    
    identifier = str_to_int(identifier[-1])
    numbers = l.split()
    size = len(numbers)
    
    numbs = [ str_to_int(numb[:-1]) for numb in numbers ]
    signs = [ 1 if numb[-1] == '+' else -1 for numb in numbers ]
    
    arr = np.empty(shape = (size, 4), dtype = np.int64)
    arr[:,0] = numbs
    arr[:,1] = signs
    arr[:,2] = np.arange(0, size)
    arr[:,3] = np.repeat(identifier, size)
    return arr

Approach 3

@nb.jit(nopython=True)
def expand3(data):
    identifier, l = data.split('\t')
    identifier = str_to_int(identifier[-1])
    numbers = l.split()
    arr = np.empty(shape = (len(numbers), 4), dtype = np.int64)
    for i, numb in enumerate(numbers):
        arr[i,:] = str_to_int(numb[:-1]), 1 if numb[-1] == '+' else -1, i, identifier
    return arr

Answer approach

def expand4(t):
    identifier, l = t.split('\t')
    # np.int was removed in NumPy 1.24; the builtin int is equivalent
    identifier = int(identifier[-1])
    numbers = np.array([int(k[:-1]) for k in l.split(' ')])
    signs = np.array([(k[-1] == '+') for k in l.split(' ')]) * 2 - 1

    N = len(numbers)
    arr = np.empty(shape = (N, 4), dtype = np.int64)
    arr[:, 0] = numbers
    arr[:, 1] = signs
    arr[:, 2] = identifier
    arr[:, 3] = np.arange(N)
    return arr

Test results:

Expand 1
72.7 ms ± 177 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
Expand 2
27.9 ms ± 67.1 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
Expand 3
8.81 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
Expand 4 (Answer 1)
429 µs ± 63.4 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

  • In your code expand1 does not yield the same results as expand2 and expand3. Also you need numpy's seed (and to use the same test_str() in all experiments) to get deterministic and identical results. Commented Jan 17, 2021 at 17:30
  • @DavidM. Thanks! I switched identifier and sign in the process_number function and added the seeds Commented Jan 17, 2021 at 17:42
  • In your Google Colab code, you also need to set s = test_str() and then pass it to your expand functions, otherwise each will process different data. Commented Jan 17, 2021 at 17:44
  • @DavidM. whoops.. thanks again! Commented Jan 17, 2021 at 17:47
  • It would have been nice if you'd shown a sample string, such as: 'ID1\t7688+ 737+ 677+ 1508- 9251-' Commented Jan 17, 2021 at 17:49

1 Answer


I could not run your code, as I also got an error from Numba saying that "ord" is not implemented.

But why are you using Numba? Your str_to_int operation seems very expensive and is not suited to vectorized operations. Why not do it without Numba:

def expand(t):
    identifier, l = t.split('\t')
    # np.int was removed in NumPy 1.24; the builtin int is equivalent
    identifier = int(identifier[-1])
    numbers = np.array([int(k[:-1]) for k in l.split(' ')])
    signs = np.array([(k[-1] == '+') for k in l.split(' ')]) * 2 - 1

    N = len(numbers)
    arr = np.empty(shape = (N, 4), dtype = np.int64)
    arr[:, 0] = numbers
    arr[:, 1] = signs
    arr[:, 2] = identifier
    arr[:, 3] = np.arange(N)
    return arr

t = test_str()
%timeit expand(t)

>>>

1.01 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
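The expand function above splits the payload twice and maps the sign with a boolean array: multiplying by 2 and subtracting 1 turns True into 1 and False into -1. As a small variation, the sketch below splits once and reuses the tokens. It is an untimed illustration, not a claim that it is faster; the name `expand_single_split` is hypothetical, and the column order follows the answer's function (value, sign, identifier, index).

```python
import numpy as np


def expand_single_split(t):
    """Variant of the answer's expand that splits the payload only once."""
    identifier, payload = t.split('\t')
    tokens = payload.split(' ')
    n = len(tokens)

    arr = np.empty(shape=(n, 4), dtype=np.int64)
    arr[:, 0] = [int(tok[:-1]) for tok in tokens]
    # Boolean array * 2 - 1 maps True -> 1 ('+') and False -> -1 ('-').
    arr[:, 1] = np.array([tok[-1] == '+' for tok in tokens]) * 2 - 1
    arr[:, 2] = int(identifier[-1])  # scalar is broadcast down the column
    arr[:, 3] = np.arange(n)
    return arr


print(expand_single_split('ID1\t7688+ 737+ 1508-'))
```

Note that this column order differs from expand1 to expand3 in the question, which put the index before the identifier, so the outputs need reordering before they can be compared directly.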


4 Comments

Nice! And cool trick with the boolean * 2 - 1 :). This is indeed faster than using Numba, presumably because of the str_to_int function. [But why are you using numba?] Normally I prefer Numba because it skips compilation after the first run, which speeds things up quite a lot.
Yes, you're right. I just wanted to know whether there was a special reason to use Numba. If it's not strictly necessary, it's OK :)
btw, I'd be happy if you accept this solution, thx!
Yeah I will wait for a little to see if others have ideas
