
My pandas/numpy is rusty, and the code I have written feels inefficient.

I'm initializing a numpy array of zeros in Python 3.x, length 1000. For my purposes, these are simply integers:

import numpy as np
array_of_zeros = np.zeros(1000)

I also have the following DataFrame (which is much smaller than my actual data):

import pandas as pd
dict1 = {'start' : [100, 200, 300], 'end':[400, 500, 600]}
df = pd.DataFrame(dict1)
print(df)
##
##    start     end
## 0    100     400
## 1    200     500
## 2    300     600

The DataFrame has two columns, start and end. These values represent a range of values, i.e. start will always be a smaller integer than end. Above, we see the first row has the range 100-400, next is 200-500, and then 300-600.

My goal is to iterate through the pandas DataFrame row by row and increment array_of_zeros over each row's range of index positions. So, if a row holds the range 10 to 20, I would like to add +1 at indices 10 through 20 (inclusive).

Here is the code which does what I would like:

import numpy as np
array_of_zeros = np.zeros(1000)

import pandas as pd
dict1 = {'start' : [100, 200, 300], 'end':[400, 500, 600]}
df = pd.DataFrame(dict1)
print(df)

for idx, row in df.iterrows():
    for i in range(int(row.start), int(row.end)+1):
        array_of_zeros[i]+=1

And it works!

print(array_of_zeros[15])
## output: 0.0
print(array_of_zeros[600])
## output: 1.0
print(array_of_zeros[400])
## output: 3.0
print(array_of_zeros[100])
## output: 1.0
print(array_of_zeros[200])
## output: 2.0

My question: this is very clumsy code! I shouldn't need so many for-loops with numpy arrays, and this solution will be very inefficient if the input dataframe is large.

Is there a more efficient (i.e. more numpy-based) method to avoid this for-loop?

for i in range(int(row.start), int(row.end)+1):
    array_of_zeros[i]+=1

Perhaps there is a pandas-oriented solution?


3 Answers


You can use NumPy advanced indexing to avoid the inner loop, i.e. res[np.arange(A[i][0], A[i][1]+1)] += 1, but this isn't efficient, as it creates a temporary index array for every row and then uses advanced indexing to scatter the increments.
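For illustration, here is a minimal sketch of that advanced-indexing variant, using the toy DataFrame from the question. It produces the correct counts but allocates one temporary np.arange per row, which is why it doesn't beat the original by much:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})
A = df[['start', 'end']].values

res = np.zeros(1000)
for i in range(A.shape[0]):
    # advanced indexing: builds a temporary index array for each row
    res[np.arange(A[i][0], A[i][1] + 1)] += 1

print(res[400])   # 3.0 -- index 400 is covered by all three ranges
```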

Instead, you can use numba¹ to optimize your algorithm exactly as it stands. The example below shows a massive performance improvement from moving the performance-critical logic to JIT-compiled code.

import numpy as np
from numba import jit

@jit(nopython=True)
def jpp(A):
    res = np.zeros(1000)
    for i in range(A.shape[0]):
        for j in range(A[i][0], A[i][1]+1):
            res[j] += 1
    return res
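A hedged usage sketch, runnable even where numba isn't installed (the try/except fallback to a no-op decorator is my addition, not part of the original answer; without numba you lose the speedup but keep the same results):

```python
import numpy as np
import pandas as pd

try:
    from numba import jit
except ImportError:
    # fallback: a no-op decorator so the code still runs without numba
    def jit(**kwargs):
        return lambda f: f

@jit(nopython=True)
def jpp(A):
    res = np.zeros(1000)
    for i in range(A.shape[0]):
        for j in range(A[i][0], A[i][1] + 1):
            res[j] += 1
    return res

df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})
out = jpp(df[['start', 'end']].values)
print(out[400], out[200], out[15])   # 3.0 2.0 0.0
```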

Some benchmarking results:

# Python 3.6.0, NumPy 1.11.3

# check result the same
assert (jpp(df[['start', 'end']].values) == original(df)).all()
assert (pir(df) == original(df)).all()
assert (pir2(df) == original(df)).all()

# time results
df = pd.concat([df]*10000)

%timeit jpp(df[['start', 'end']].values)  # 64.6 µs per loop
%timeit original(df)                      # 8.25 s per loop
%timeit pir(df)                           # 208 ms per loop
%timeit pir2(df)                          # 1.43 s per loop

Code used for benchmarking:

def original(df):
    array_of_zeros = np.zeros(1000)
    for idx, row in df.iterrows():
        for i in range(int(row.start), int(row.end)+1):
            array_of_zeros[i]+=1   
    return array_of_zeros

def pir(df):
    return np.bincount(np.concatenate([np.arange(a, b + 1) for a, b in \
                       zip(df.start, df.end)]), minlength=1000)

def pir2(df):
    a = np.zeros((1000,), np.int64)
    for b, c in zip(df.start, df.end):
        np.add.at(a, np.arange(b, c + 1), 1)
    return a
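An aside on why pir2 uses np.add.at rather than plain fancy-index +=: with repeated indices, buffered fancy indexing applies each duplicate only once, whereas np.add.at accumulates every occurrence. (Within a single arange here the indices are unique, so either would work per row, but add.at is the safe general form.) A small demonstration:

```python
import numpy as np

a = np.zeros(5, dtype=np.int64)
idx = np.array([1, 1, 2])

# buffered fancy indexing: the duplicated index 1 is applied only once
b = a.copy()
b[idx] += 1            # b -> [0, 1, 1, 0, 0]

# np.add.at is unbuffered: repeated indices accumulate
c = a.copy()
np.add.at(c, idx, 1)   # c -> [0, 2, 1, 0, 0]
```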

¹ For posterity, I'm including @piRSquared's excellent comment on why numba helps here:

numba's advantage is looping very efficiently. Though it can understand much of NumPy's API, it is often better to avoid creating NumPy objects within a loop. My code is creating a NumPy array for every row in the dataframe. Then concatenating them prior to using bincount. @jpp's numba code creates very little extra objects and utilizes much of what is already there. The difference between my NumPy solution and @jpp's numba solution is about 4-5 times. Both are linear and should be pretty quick.


Comments

You know I like numba. Also, from numba import njit: njit is jit(nopython=True).
@piRSquared, thanks! I hope you don't mind my adding timings from your solutions. Yeah, I know about njit, but I think sometimes it's more explicit for those new to numba.
Thanks! The jpp implementation is very impressive! My remaining question: is it possible to optimize the @piRSquared code with numba?
Why have you used res[np.arange(A[i][0], A[i][1]+1)] += 1 (allocating a temporary array, filling it in a loop with values, then using it to index the array res within another loop)? Is there a misleading tutorial which should be corrected? A simple nested loop is approx. 2 times faster, for quite obvious reasons (mentioned above). Nevertheless, your timings are quite slow, even for this approach. Did you call the function once before using timeit? If not, you measured a constant compilation overhead of approx. 0.2 s on the first call, which could also be minimized with cache=True.

numpy.bincount

np.bincount(np.concatenate(
    [np.arange(a, b + 1) for a, b in zip(df.start, df.end)]
), minlength=1000)
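A runnable version of the bincount approach on the question's toy DataFrame (the sanity checks at the end are my addition): all in-range indices are concatenated into one flat array, and bincount tallies how often each index appears.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})

counts = np.bincount(np.concatenate(
    [np.arange(a, b + 1) for a, b in zip(df.start, df.end)]
), minlength=1000)

print(counts[400], counts[600], counts[15])   # 3 1 0
```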

numpy.add.at

a = np.zeros((1000,), np.int64)
for b, c in zip(df.start, df.end):
    np.add.at(a, np.arange(b, c + 1), 1)

Comments


My solution:

for x, y in zip(df.start, df.end):
    array_of_zeros[x:y+1]+=1
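A self-contained sketch of this slice-based version: slice assignment replaces the inner Python loop with a single vectorized in-place add per row, matching the outputs shown in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [100, 200, 300], 'end': [400, 500, 600]})
array_of_zeros = np.zeros(1000)

for x, y in zip(df.start, df.end):
    # slice assignment: one vectorized increment per row, end inclusive
    array_of_zeros[x:y + 1] += 1

print(array_of_zeros[400], array_of_zeros[200])   # 3.0 2.0
```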

Comments

array_of_zeros[x:y] += 1?
@piRSquared I tried it before and it seemed not to work... let me test it again
Should you not include y as well? i.e. x:y excludes y; should it not be x:(y+1)?
@Onyambu hi, fixed :-)
Using a loop is one way to do it, sure, but it is not efficient; it takes a lot more time. See this blog post for an analysis of different methods of iterating over a DataFrame and their efficiency: towardsdatascience.com/…
