Pandas df.iterrows() parallelization

Question

I would like to parallelize the following code:

for row in df.iterrows():
    idx = row[0]
    k = row[1]['Chromosome']
    start,end = row[1]['Bin'].split('-')

    sequence = sequence_from_coordinates(k,1,start,end) #slow download from http

    df.set_value(idx,'GC%',gc_content(sequence,percent=False,verbose=False))
    df.set_value(idx,'G4 repeats', sum([len(list(i)) for i in g4_scanner(sequence)]))
    df.set_value(idx,'max flexibility',max([item[1] for item in dna_flex(sequence,verbose=False)]))

I have tried to use multiprocessing.Pool() since each row can be processed independently, but I can't figure out how to share the DataFrame. I am also not sure that this is the best approach to do parallelization with pandas. Any help?

Your row-wise iteration is slow by default. You can either try to find a way to vectorize your operations and do it without iteration, or split your dataframe into a few large chunks and iterate over each chunk in parallel.
– Khris, Commented Nov 3, 2016 at 12:06
Sure, that's a way to do it. But I am still looking for a better way, if one exists.
– alec_djinn, Commented Nov 4, 2016 at 10:20
Have you considered using dask? It would do most of the parallelization for you.
– Zeugma, Commented Nov 7, 2016 at 2:03

Jinhua Wang · Accepted Answer · 2021-09-18 15:55:00Z

76

+50

As @Khris said in his comment, you should split up your dataframe into a few large chunks and iterate over each chunk in parallel. You could arbitrarily split the dataframe into randomly sized chunks, but it makes more sense to divide the dataframe into equally sized chunks based on the number of processes you plan on using. Luckily someone else has already figured out how to do that part for us:

# don't forget to import
import pandas as pd
import multiprocessing

# create as many processes as there are CPUs on your machine
num_processes = multiprocessing.cpu_count()

# calculate the chunk size as an integer (at least 1, so range() below gets a valid step)
chunk_size = max(1, df.shape[0] // num_processes)

# this solution was reworked from the above link.
# will work even if the length of the dataframe is not evenly divisible by num_processes
# slicing by position with iloc works for any index, not only the default RangeIndex
chunks = [df.iloc[i:i + chunk_size] for i in range(0, df.shape[0], chunk_size)]

This creates a list that contains our dataframe in chunks. Now we need to pass it into our pool along with a function that will manipulate the data.

def func(d):
   # let's create a function that squares every value in the dataframe
   return d * d

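# create our pool with `num_processes` processes
# (on platforms that spawn workers, e.g. Windows, run this under `if __name__ == "__main__":`)
pool = multiprocessing.Pool(processes=num_processes)

# apply our function to each chunk in the list
result = pool.map(func, chunks)

# release the worker processes once we are done
pool.close()
pool.join()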

At this point, result will be a list holding each chunk after it has been manipulated. In this case, all values have been squared. The issue now is that the original dataframe has not been modified, so we have to replace all of its existing values with the results from our pool.

for chunk in result:
   # since each chunk is just a dataframe,
   # we can reassign the original dataframe based on the index labels of each chunk
   df.loc[chunk.index] = chunk

Now, my function to manipulate my dataframe is vectorized and would likely have been faster if I had simply applied it to the entirety of my dataframe instead of splitting into chunks. However, in your case, your function would iterate over each row of each chunk and then return the chunk. This allows you to process num_processes rows at a time.

def func(d):
   d = d.copy()  # work on a copy so we don't write into a slice of the original
   for idx, row in d.iterrows():
      k = row['Chromosome']
      start,end = row['Bin'].split('-')

      sequence = sequence_from_coordinates(k,1,start,end) #slow download from http
      # DataFrame.set_value was removed in pandas 1.0; .at is the replacement
      d.at[idx,'GC%'] = gc_content(sequence,percent=False,verbose=False)
      d.at[idx,'G4 repeats'] = sum([len(list(i)) for i in g4_scanner(sequence)])
      d.at[idx,'max flexibility'] = max([item[1] for item in dna_flex(sequence,verbose=False)])
   # return the chunk!
   return d

Then you reassign the values in the original dataframe, and you have successfully parallelized this process.
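Putting the steps above together, here is a minimal end-to-end sketch. It rests on several assumptions that are not in the original answer: `fake_sequence_stats` is a hypothetical stand-in for the slow `sequence_from_coordinates` download and the analysis functions, the toy dataframe and its `score` column are invented, and `multiprocessing.pool.ThreadPool` is used in place of `multiprocessing.Pool` so the sketch runs unchanged in any environment. `ThreadPool` offers the same `map` interface; for CPU-bound work you would swap in `multiprocessing.Pool` and run it under an `if __name__ == "__main__":` guard.

```python
import time
from multiprocessing.pool import ThreadPool

import pandas as pd


def fake_sequence_stats(chromosome, start, end):
    """Hypothetical stand-in for the slow download plus analysis."""
    time.sleep(0.01)  # pretend to wait on the network
    return (end - start) / 100.0


def process_chunk(chunk):
    """Process one chunk row by row and return a modified copy."""
    chunk = chunk.copy()  # avoid writing into a slice of the original
    chunk["score"] = float("nan")  # create the output column up front
    for idx, row in chunk.iterrows():
        start, end = (int(x) for x in row["Bin"].split("-"))
        chunk.at[idx, "score"] = fake_sequence_stats(row["Chromosome"], start, end)
    return chunk


def split_into_chunks(df, n_chunks):
    """Split df by row position into at most n_chunks pieces."""
    chunk_size = max(1, -(-len(df) // n_chunks))  # ceiling division
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]


# toy data with a non-default index to show that the labels survive
df = pd.DataFrame(
    {
        "Chromosome": ["chr1"] * 10,
        "Bin": [f"{i * 100}-{i * 100 + 50 + i}" for i in range(10)],
    },
    index=[f"row{i}" for i in range(10)],
)

chunks = split_into_chunks(df, n_chunks=4)
with ThreadPool(4) as pool:
    results = pool.map(process_chunk, chunks)  # results keep the input order

out = pd.concat(results)
```

Here `pd.concat` rebuilds a new dataframe instead of writing back into `df`, which leaves the original untouched and avoids any dependence on the index being unique.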

How Many Processes Should I Use?

Your optimal performance is going to depend on the answer to this question. While "ALL OF THE PROCESSES!!!!" is one answer, a better answer is much more nuanced. After a certain point, throwing more processes at a problem actually creates more overhead than it's worth. A related limit is Amdahl's Law: the part of the work that cannot be parallelized caps the speedup you can get, no matter how many processes you add. Again, we are fortunate that others have already tackled this question for us:

A good default is to use multiprocessing.cpu_count(), which is the default behavior of multiprocessing.Pool. According to the documentation "If processes is None then the number returned by cpu_count() is used." That's why I set num_processes at the beginning to multiprocessing.cpu_count(). This way, if you move to a beefier machine, you get the benefits from it without having to change the num_processes variable directly.
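To make the diminishing returns concrete, here is a small sketch of Amdahl's Law, S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of workers. The function name and the example value p = 0.9 are illustrative assumptions. The formula ignores process start-up and data-transfer costs, so real speedups tend to be lower still.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Theoretical speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# With 90% parallel work, the speedup stays below 1 / 0.1 = 10,
# no matter how many workers are added.
for n in (1, 2, 4, 8, 16, 64):
    print(n, round(amdahl_speedup(0.9, n), 2))
```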

edited Sep 18, 2021 at 15:55

Jinhua Wang


answered Nov 4, 2016 at 17:15

TheF1rstPancake


4 Comments

Dat Over a year ago

Use chunks = [df.iloc[i:i + chunk_size,:] for i in range(0, df.shape[0], chunk_size)] if pandas shows warnings.

N4v Over a year ago

np.array_split() may be a better option than chunks = [df.ix[df.index[i:i + chunk_size]] for i in range(0, df.shape[0], chunk_size)]. The former will automatically handle the case where the number of rows is not evenly divisible, and the syntax is a little easier.

MrKingsley Over a year ago

np.array_split() worked for me as well

RF1991 Over a year ago

What should we do if we have a function with three parameters that needs three columns of the dataframe?

ic_fl2 · Answer · 2017-08-28 11:51:44Z

34

A faster way (about 10% in my case):

Main differences to accepted answer: use pd.concat and np.array_split to split and join the dataframe.

import multiprocessing
import numpy as np
import pandas as pd


def parallelize_dataframe(df, func):
    # leave one core free to not freeze the machine, but always use at least one
    num_cores = max(1, multiprocessing.cpu_count() - 1)
    num_partitions = num_cores  # number of partitions to split dataframe
    df_split = np.array_split(df, num_partitions)
    pool = multiprocessing.Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

where func is the function you want to apply to df. Use functools.partial(func, arg=arg_val) for more than one argument.
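As a sketch of the `partial` suggestion, and of the case where the function needs several columns: the chunk function below reads three columns and takes one extra argument bound with `functools.partial`. The column names, `weighted_sum`, and `scale` are hypothetical. `ThreadPool` stands in for `multiprocessing.Pool` (same `map` interface) only so the sketch runs anywhere without an `if __name__ == "__main__":` guard, and the split is done on row positions rather than on the dataframe itself.

```python
from functools import partial
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd


def weighted_sum(chunk, scale):
    """Hypothetical chunk function: combines three columns, scaled by `scale`."""
    chunk = chunk.copy()
    chunk["result"] = scale * (chunk["a"] + chunk["b"] * chunk["c"])
    return chunk


def parallelize_dataframe(df, func, num_partitions=4):
    # split the row positions; np.array_split copes with uneven sizes
    positions = np.array_split(np.arange(len(df)), num_partitions)
    pieces = [df.iloc[pos] for pos in positions]
    with ThreadPool(num_partitions) as pool:
        return pd.concat(pool.map(func, pieces))


df = pd.DataFrame({"a": range(10), "b": range(10, 20), "c": range(20, 30)})
out = parallelize_dataframe(df, partial(weighted_sum, scale=2))
```

Because each chunk is a whole dataframe, the function can use as many columns as it needs; `partial` is only required for arguments that are not columns.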

edited Aug 28, 2017 at 11:51

answered Aug 9, 2017 at 14:54

ic_fl2


6 Comments

TheF1rstPancake Over a year ago

Just curious, does pool.map maintain the order of the dataframe? In other words, is the output from pool.map in the same order as the chunks that were passed in? If not, then pd.concat may not rebuild the dataframe in the original order. I didn't know about np.array_split, but I'm not surprised it's faster. pd.concat is also likely faster than reassigning with df.ix.

ic_fl2 Over a year ago

@Jalepeno112 Yes, as far as I can tell the dataframe gets put back together in the correct order. I don't know if there is a way of enforcing it, but I have time series data and it has not caused problems yet. Though as my index consists of timestamps, it shouldn't be a problem to sort them again if the order got jumbled. Another trick I found was to use itertuples(), which is another 30% faster.

ak3191 Over a year ago

can you please help me in answering this :- stackoverflow.com/questions/53561794/…

Jinhua Wang Over a year ago

This is a really nice answer!

RF1991 Over a year ago

What should we do if we have a function with three parameters that needs three columns of the dataframe?
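On the ordering question raised in the comments above: `Pool.map` is documented as a parallel equivalent of the built-in `map`, and it returns results in the order of its inputs even when later tasks finish first. A minimal sketch, assuming a hypothetical `slow_identity` task and using `ThreadPool` (which shares the `map` interface of `multiprocessing.Pool`) so that it runs without a `__main__` guard:

```python
import time
from multiprocessing.pool import ThreadPool


def slow_identity(x):
    """Hypothetical task: earlier inputs sleep longer, so they finish last."""
    time.sleep(0.05 * (4 - x))
    return x


with ThreadPool(4) as pool:
    # map collects the results in input order
    ordered = pool.map(slow_identity, range(4))
    # imap_unordered, by contrast, yields results as they complete
    unordered = list(pool.imap_unordered(slow_identity, range(4)))
```

Because `map` keeps the input order, `pd.concat` on the mapped chunks rebuilds the rows in their original order without an extra sort.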


Robert · Answer · 2019-01-15 13:09:22Z

14

Consider using dask.dataframe, as e.g. shown in this example for a similar question: https://stackoverflow.com/a/53923034/4340584

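import dask.dataframe as ddf
df_dask = ddf.from_pandas(df, npartitions=4)   # where the number of partitions is the number of cores you want to use
# dask only supports row-wise apply (axis=1); meta declares the name and dtype of the result,
# so adjust 'object' to whatever your_function returns
df['output'] = df_dask.apply(your_function, axis=1, meta=('output', 'object')).compute(scheduler='multiprocessing')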

answered Jan 15, 2019 at 13:09

Robert


1 Comment

sophros Over a year ago

The dask solution looks much less cumbersome than manual parallelization of the calculation in pandas!

DSH · Answer · 2021-08-09 18:02:30Z

1

To use Dask over whole partitions of a dataframe (instead of apply, which works row by row along an axis), you could use map_partitions:

import multiprocessing
import dask.dataframe as ddf

# get num cpu cores
num_partitions = multiprocessing.cpu_count()

# create dask DF
df_dask = ddf.from_pandas(your_dataframe, npartitions=num_partitions)

# apply func to every partition in parallel
output = df_dask.map_partitions(func, meta=('output_col1_type','output_col2_type')).compute(scheduler='multiprocessing')

edited Aug 9, 2021 at 18:02

answered Aug 9, 2021 at 16:13

DSH
