Fastest way to parse this string to a numpy array

Question

We have to perform the following operation around 400,000 times, so I'm looking for the most efficient solution. I have tried several approaches, but I'm curious whether there are better ones :)

Data example

We can use the following code to generate an example test set:

import random
import numpy as np

random.seed(10)
np.random.seed(10)
def test_str():
    n = 10000000
    # Random non-negative integers, each followed by a random '+' or '-' sign
    arr  = np.random.randint(10000, size=n)
    sign = np.random.choice(['+','-'], size=n)
    return 'ID1' + '\t' + ' '.join(["{}{}".format(a,b) for a,b in zip(arr, sign)])

Which looks like ID1\t7688+ 737+ 677+ 1508- 9251-......
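To make the target format concrete, here is a minimal sketch that parses the short sample above in plain Python. It assumes the column order used by expand1 further down (value, sign, position, identifier); the variable names are only illustrative.

```python
import numpy as np

sample = "ID1\t7688+ 737+ 677+ 1508- 9251-"

identifier, payload = sample.split("\t")
id_number = int(identifier[-1])  # last character of "ID1" -> 1

# One row per token: (value, sign, position in the string, identifier)
rows = [
    (int(token[:-1]), 1 if token[-1] == "+" else -1, position, id_number)
    for position, token in enumerate(payload.split())
]
arr = np.array(rows, dtype=np.int64)
print(arr)
```

Each token such as 1508- becomes one row, here (1508, -1, 3, 1).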

The code in question :)

Copy the code from Google Colab (P.S. running it there gave me a TypingError, whereas it ran fine on my machine), or just see the functions below.

General function
Taken from this Numba issue. However, based on @armamut's answer, this may introduce a lot of overhead with Numba, apparently making native NumPy faster.

import numba as nb

@nb.jit(nopython=True)
def str_to_int(s):
    # Sum each digit times its power of ten; ord('0') == 48
    final_index, result = len(s) - 1, 0
    for i, v in enumerate(s):
        result += (ord(v) - 48) * (10 ** (final_index - i))
    return result
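The positional sum in str_to_int recomputes a power of ten for every digit. A minimal sketch of the same conversion in Horner form, which multiplies the running total by 10 instead, is shown here. The name str_to_int_horner is hypothetical, the function is written in plain Python without the Numba decorator, and it assumes the input contains only ASCII digits. Whether it is faster under Numba has not been measured here.

```python
def str_to_int_horner(s):
    """Convert a string of ASCII digits to an int, one digit at a time."""
    result = 0
    for ch in s:
        # Shift the running total one decimal place and add the new digit.
        result = result * 10 + (ord(ch) - 48)
    return result

print(str_to_int_horner("7688"))  # 7688
```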

Approach 1

@nb.jit(nopython=True)
def process_number(numb, identifier, i):
    sign = 1 if numb[-1] == '+' else -1
    return str_to_int(numb[:-1]), sign, i, identifier
    
@nb.jit(nopython=True)
def expand1(data):
    identifier, l = data.split('\t')
    identifier = str_to_int(identifier[-1])
    numbers = l.split()
    # init empty numpy array
    arr = np.empty(shape = (len(numbers), 4), dtype = np.int64)
    # Fill array    
    for i, numb in enumerate(numbers):
        arr[i,:] = process_number(numb, identifier, i)
    return arr

Approach 2

@nb.jit(nopython=True)
def expand2(data):
    identifier, l = data.split('\t')
    
    identifier = str_to_int(identifier[-1])
    numbers = l.split()
    size = len(numbers)
    
    numbs = [ str_to_int(numb[:-1]) for numb in numbers ]
    # The sign is the last character, so index with [-1] rather than slice with [:-1]
    signs = [ 1 if numb[-1] == '+' else -1 for numb in numbers ]
    
    arr = np.empty(shape = (size, 4), dtype = np.int64)
    arr[:,0] = numbs
    arr[:,1] = signs
    arr[:,2] = np.arange(0, size)
    arr[:,3] = np.repeat(identifier, size)
    return arr

Approach 3

@nb.jit(nopython=True)
def expand3(data):
    identifier, l = data.split('\t')
    identifier = str_to_int(identifier[-1])
    numbers = l.split()
    arr = np.empty(shape = (len(numbers), 4), dtype = np.int64)
    for i, numb in enumerate(numbers):
        # The sign is the last character, so index with [-1] rather than slice with [:-1]
        arr[i,:] = str_to_int(numb[:-1]), 1 if numb[-1] == '+' else -1, i, identifier
    return arr

Answer approach

def expand4(t):
    identifier, l = t.split('\t')
    # np.int was removed in NumPy 1.24; the built-in int behaves the same here
    identifier = int(identifier[-1])
    numbers = np.array([int(k[:-1]) for k in l.split(' ')])
    signs = np.array([(k[-1] == '+') for k in l.split(' ')]) * 2 - 1

    N = len(numbers)
    arr = np.empty(shape = (N, 4), dtype = np.int64)
    arr[:, 0] = numbers
    arr[:, 1] = signs
    arr[:, 2] = identifier
    arr[:, 3] = np.arange(N)
    return arr

Test results:

Expand 1
72.7 ms ± 177 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
Expand 2
27.9 ms ± 67.1 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
Expand 3
8.81 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
Expand 4 ANSWER 1
429 µs ± 63.4 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

In your code expand1 does not yield the same results as expand2 and expand3. Also you need numpy's seed (and to use the same test_str() in all experiments) to get deterministic and identical results.
– David M., Commented Jan 17, 2021 at 17:30
@DavidM. Thanks! I switched identifier and sign in the process_number function and added the seeds.
– CodeNoob, Commented Jan 17, 2021 at 17:42
In your Google Colab code, you also need to set s = test_str() and then pass it to your expand functions, otherwise each will process different data.
– David M., Commented Jan 17, 2021 at 17:44
It would have been nice if you'd shown a sample string, such as: 'ID1\t7688+ 737+ 677+ 1508- 9251-'
– hpaulj, Commented Jan 17, 2021 at 17:49
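The point raised in the comments (generate the input once, check that all variants agree, then time them on that same input) can be sketched as follows. Everything here is an illustrative assumption: make_test_str is a smaller stand-in for test_str() with its own seeded generator, and parse_loop and parse_columns are plain-Python stand-ins for the expand functions, not the original Numba code.

```python
import timeit
import numpy as np

def make_test_str(n=1000, seed=10):
    # Hypothetical, smaller stand-in for test_str() with its own seeded
    # generator, so every call returns the same string.
    rng = np.random.RandomState(seed)
    values = rng.randint(10000, size=n)
    signs = rng.choice(['+', '-'], size=n)
    return 'ID1\t' + ' '.join("{}{}".format(v, s) for v, s in zip(values, signs))

def parse_loop(data):
    # Stand-in parser 1: fill the array row by row.
    identifier, payload = data.split('\t')
    id_number = int(identifier[-1])
    tokens = payload.split()
    arr = np.empty((len(tokens), 4), dtype=np.int64)
    for i, tok in enumerate(tokens):
        arr[i, :] = int(tok[:-1]), 1 if tok[-1] == '+' else -1, i, id_number
    return arr

def parse_columns(data):
    # Stand-in parser 2: build each column separately.
    identifier, payload = data.split('\t')
    tokens = payload.split()
    arr = np.empty((len(tokens), 4), dtype=np.int64)
    arr[:, 0] = [int(tok[:-1]) for tok in tokens]
    arr[:, 1] = [1 if tok[-1] == '+' else -1 for tok in tokens]
    arr[:, 2] = np.arange(len(tokens))
    arr[:, 3] = int(identifier[-1])
    return arr

s = make_test_str()          # generate the input once
reference = parse_loop(s)
for func in (parse_loop, parse_columns):
    # Check correctness first, then time every variant on the same string.
    assert np.array_equal(func(s), reference)
    seconds = timeit.timeit(lambda: func(s), number=5)
    print(func.__name__, seconds / 5)
```

Comparing against a reference result before timing catches variants that are quick only because they compute something different.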

armamut · Accepted Answer · 2021-01-17 17:13:19Z

1

I cannot reproduce your results, as I also got an '"ord" is not implemented' error from Numba.

But why are you using Numba? Your str_to_int operation seems very expensive and is not suited to vectorized operations. Why not do it without Numba:

def expand(t):
    identifier, l = t.split('\t')
    # np.int was removed in NumPy 1.24; the built-in int behaves the same here
    identifier = int(identifier[-1])
    numbers = np.array([int(k[:-1]) for k in l.split(' ')])
    signs = np.array([(k[-1] == '+') for k in l.split(' ')]) * 2 - 1

    N = len(numbers)
    arr = np.empty(shape = (N, 4), dtype = np.int64)
    arr[:, 0] = numbers
    arr[:, 1] = signs
    arr[:, 2] = identifier
    arr[:, 3] = np.arange(N)
    return arr

t = test_str()
%timeit expand(t)

>>>

1.01 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
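One further variation, offered only as a sketch and not benchmarked here: rewrite the signs as separate integer tokens so the whole payload can be converted in a single NumPy cast. The name expand_replace is hypothetical, the column order follows expand() above, and the approach assumes every token ends in exactly one '+' or '-' and contains no other sign characters.

```python
import numpy as np

def expand_replace(t):
    # Hypothetical name; same column order as expand() above.
    identifier, payload = t.split('\t')
    # Turn "7688+ 1508-" into "7688 1 1508 -1" so every token is an integer.
    tokens = payload.replace('+', ' 1').replace('-', ' -1').split()
    # Cast the string array to integers and pair each value with its sign.
    pairs = np.array(tokens).astype(np.int64).reshape(-1, 2)

    n = len(pairs)
    arr = np.empty(shape=(n, 4), dtype=np.int64)
    arr[:, :2] = pairs
    arr[:, 2] = int(identifier[-1])
    arr[:, 3] = np.arange(n)
    return arr

print(expand_replace("ID1\t7688+ 737+ 677+ 1508- 9251-"))
```

Time it with %timeit on the same t before drawing any conclusion about speed.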


4 Comments

CodeNoob Over a year ago

Nice! And a cool trick with the boolean * 2 - 1 :). This is indeed faster than using Numba, presumably because of the str_to_int function. Regarding "But why are you using numba?": normally I prefer Numba because it skips the compilation after the first run, which speeds things up quite a lot.

armamut Over a year ago

Yes, you're right. I wanted to know whether there was a special reason to use Numba. If it's not strictly necessary, that's OK :)

armamut Over a year ago

btw, I'd be happy if you accept this solution, thx!

CodeNoob Over a year ago

Yeah I will wait for a little to see if others have ideas
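For readers unfamiliar with the boolean * 2 - 1 trick mentioned in the comments, here is a minimal illustration. NumPy treats True as 1 and False as 0 in arithmetic, so the expression maps True to 1 and False to -1. The token list is made up for the example.

```python
import numpy as np

tokens = ["7688+", "737+", "1508-"]
is_plus = np.array([tok[-1] == '+' for tok in tokens])  # boolean array
signs = is_plus * 2 - 1  # True -> 1, False -> -1
print(signs)  # [ 1  1 -1]
```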
