
My pandas/numpy is rusty, and the code I have written feels inefficient.

I'm initializing a numpy array of zeros in Python 3.x, length 1000. For my purposes, these are simply integers:

import numpy as np
array_of_zeros = np.zeros(1000)

I also have the following DataFrame (which is much smaller than my actual data):

import pandas as pd
dict1 = {'start' : [100, 200, 300], 'end':[400, 500, 600]}
df = pd.DataFrame(dict1)
print(df)
##
##    start     end
## 0    100     400
## 1    200     500
## 2    300     600

The DataFrame has two columns, start and end. These values represent a range of values, i.e. start will always be a smaller integer than end. Above, we see the first row has the range 100-400, next is 200-500, and then 300-600.

My goal is to iterate through the pandas DataFrame row by row and increment array_of_zeros over each row's range of index positions. So, if a row holds the range 10 to 20, I would like to add +1 at indices 10 through 20 (inclusive).

Here is the code which does what I would like:

import numpy as np
array_of_zeros = np.zeros(1000)

import pandas as pd
dict1 = {'start' : [100, 200, 300], 'end':[400, 500, 600]}
df = pd.DataFrame(dict1)
print(df)

for idx, row in df.iterrows():
    for i in range(int(row.start), int(row.end)+1):
        array_of_zeros[i]+=1

And it works!

print(array_of_zeros[15])
## output: 0.0
print(array_of_zeros[600])
## output: 1.0
print(array_of_zeros[400])
## output: 3.0
print(array_of_zeros[100])
## output: 1.0
print(array_of_zeros[200])
## output: 2.0

My question: this is very clumsy code! I shouldn't need so many for-loops with numpy arrays, and this solution will be very inefficient if the input dataframe is large.

Is there a more efficient (i.e. more numpy-based) method to avoid this for-loop?

for i in range(int(row.start), int(row.end)+1):
    array_of_zeros[i]+=1

Perhaps there is a pandas-oriented solution?


3 Answers


You can use NumPy advanced indexing to avoid the inner loop, i.e. res[np.arange(A[i][0], A[i][1]+1)] += 1, but this isn't efficient, as it creates a temporary index array for every row and then uses advanced indexing to scatter the increments.
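For illustration, here is a minimal sketch of that advanced-indexing variant, using the toy DataFrame from the question. It produces the correct counts but allocates one temporary np.arange per row, which is why it doesn't beat the original by much:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})
A = df[['start', 'end']].values

res = np.zeros(1000)
for i in range(A.shape[0]):
    # advanced indexing: builds a temporary index array for each row
    res[np.arange(A[i][0], A[i][1] + 1)] += 1

print(res[400])   # 3.0 -- index 400 is covered by all three ranges
```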

Instead, you can use numba¹ to optimize your algorithm exactly as it stands. The example below shows a massive performance improvement from moving the performance-critical logic to JIT-compiled code.

import numpy as np
from numba import jit

@jit(nopython=True)
def jpp(A):
    res = np.zeros(1000)
    for i in range(A.shape[0]):
        for j in range(A[i][0], A[i][1]+1):
            res[j] += 1
    return res
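A hedged usage sketch, runnable even where numba isn't installed (the try/except fallback to a no-op decorator is my addition, not part of the original answer; without numba you lose the speedup but keep the same results):

```python
import numpy as np
import pandas as pd

try:
    from numba import jit
except ImportError:
    # fallback: a no-op decorator so the code still runs without numba
    def jit(**kwargs):
        return lambda f: f

@jit(nopython=True)
def jpp(A):
    res = np.zeros(1000)
    for i in range(A.shape[0]):
        for j in range(A[i][0], A[i][1] + 1):
            res[j] += 1
    return res

df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})
out = jpp(df[['start', 'end']].values)
print(out[400], out[200], out[15])   # 3.0 2.0 0.0
```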

Some benchmarking results:

# Python 3.6.0, NumPy 1.11.3

# check result the same
assert (jpp(df[['start', 'end']].values) == original(df)).all()
assert (pir(df) == original(df)).all()
assert (pir2(df) == original(df)).all()

# time results
df = pd.concat([df]*10000)

%timeit jpp(df[['start', 'end']].values)  # 64.6 µs per loop
%timeit original(df)                      # 8.25 s per loop
%timeit pir(df)                           # 208 ms per loop
%timeit pir2(df)                          # 1.43 s per loop

Code used for benchmarking:

def original(df):
    array_of_zeros = np.zeros(1000)
    for idx, row in df.iterrows():
        for i in range(int(row.start), int(row.end)+1):
            array_of_zeros[i]+=1   
    return array_of_zeros

def pir(df):
    return np.bincount(np.concatenate([np.arange(a, b + 1) for a, b in \
                       zip(df.start, df.end)]), minlength=1000)

def pir2(df):
    a = np.zeros((1000,), np.int64)
    for b, c in zip(df.start, df.end):
        np.add.at(a, np.arange(b, c + 1), 1)
    return a
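An aside on why pir2 uses np.add.at rather than plain fancy-index +=: with repeated indices, buffered fancy indexing applies each duplicate only once, whereas np.add.at accumulates every occurrence. (Within a single arange here the indices are unique, so either would work per row, but add.at is the safe general form.) A small demonstration:

```python
import numpy as np

a = np.zeros(5, dtype=np.int64)
idx = np.array([1, 1, 2])

# buffered fancy indexing: the duplicated index 1 is applied only once
b = a.copy()
b[idx] += 1            # b -> [0, 1, 1, 0, 0]

# np.add.at is unbuffered: repeated indices accumulate
c = a.copy()
np.add.at(c, idx, 1)   # c -> [0, 2, 1, 0, 0]
```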

¹ For posterity, I'm including @piRSquared's excellent comment on why numba helps here:

numba's advantage is looping very efficiently. Though it can understand much of NumPy's API, it is often better to avoid creating NumPy objects within a loop. My code is creating a NumPy array for every row in the dataframe. Then concatenating them prior to using bincount. @jpp's numba code creates very little extra objects and utilizes much of what is already there. The difference between my NumPy solution and @jpp's numba solution is about 4-5 times. Both are linear and should be pretty quick.


Comments

You know I like numba. Also, from numba import njit: njit is jit(nopython=True).
@piRSquared, thanks! I hope you don't mind my adding timings from your solutions. Yeah, I know about njit, but I think sometimes it's more explicit for those new to numba.
Thanks! The jpp implementation is very impressive! My remaining question: is it possible to optimize the @piRSquared code with numba?
Why have you used res[np.arange(A[i][0], A[i][1]+1)] += 1 (allocating a temporary array, filling it in a loop with values, then using it to index the array res within another loop)? Is there a misleading tutorial which should be corrected? A simple nested loop is approx. 2 times faster, for quite obvious reasons (mentioned above). Nevertheless, your timings are quite slow, even for this approach. Did you call the function once before using timeit? If not, you measured a constant compilation overhead of approx. 0.2 s on the first call, which could also be minimized with cache=True.

numpy.bincount

np.bincount(np.concatenate(
    [np.arange(a, b + 1) for a, b in zip(df.start, df.end)]
), minlength=1000)
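A runnable version of the bincount approach on the question's toy DataFrame (the sanity checks at the end are my addition): all in-range indices are concatenated into one flat array, and bincount tallies how often each index appears.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})

counts = np.bincount(np.concatenate(
    [np.arange(a, b + 1) for a, b in zip(df.start, df.end)]
), minlength=1000)

print(counts[400], counts[600], counts[15])   # 3 1 0
```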

numpy.add.at

a = np.zeros((1000,), np.int64)
for b, c in zip(df.start, df.end):
    np.add.at(a, np.arange(b, c + 1), 1)

Comments


My solution:

for x, y in zip(df.start, df.end):
    array_of_zeros[x:y+1]+=1
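A self-contained sketch of this slice-based version: slice assignment replaces the inner Python loop with a single vectorized in-place add per row, matching the outputs shown in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})
array_of_zeros = np.zeros(1000)

for x, y in zip(df.start, df.end):
    # slice assignment: one vectorized increment per row, end inclusive
    array_of_zeros[x:y + 1] += 1

print(array_of_zeros[400], array_of_zeros[200])   # 3.0 2.0
```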

Comments

array_of_zeros[x:y] += 1?
@piRSquared I tried it before and it seemed not to work... let me test it again
Should you not include y as well? i.e. x:y excludes y; should it not be x:(y+1)?
@Onyambu hi, fixed :-)
Using a loop is one way to do it, sure, but it is not efficient; it takes a lot more time. See this blog post for an analysis of different methods of iterating over a DataFrame and their efficiency: towardsdatascience.com/…
