I need some help getting started with running some parallel code in Python. I don't think I can share executable code for my problem, but you can still help me solve my issue conceptually.

I have written a function that takes a pandas DataFrame row as input. That function makes some calculations and returns a row of a pandas DataFrame that has different column names than the input.

So far I have been using it in a for loop, passing in the rows one by one and appending each returned row to a new DataFrame:

new_df = pd.DataFrame(columns=['1','2','unique','occurence','timediff','ueid'], dtype='float')

for i in range(0, small_pd.shape[0]):  # small_pd is the input dataframe
    new_df = new_df.append(SequencesExtractTime(small_pd.loc[i]))
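For reference, the same loop can be written without growing the DataFrame on every iteration, by collecting the per-row results in a list and concatenating them once at the end. A minimal sketch, with a made-up stand-in for SequencesExtractTime since I cannot share the real code:

```python
import pandas as pd

# Made-up stand-in for SequencesExtractTime: takes a row (a Series)
# and returns a one-row DataFrame with different column names.
def sequences_extract_time(row):
    return pd.DataFrame([{"unique": row["a"] * 2, "timediff": row["b"] - row["a"]}])

small_pd = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Collect the per-row outputs in a list and merge them with one concat call.
parts = [sequences_extract_time(small_pd.loc[i]) for i in range(small_pd.shape[0])]
new_df = pd.concat(parts, ignore_index=True)
```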

Now I want to run this code in parallel. I have found the multiprocessing package.

from joblib import Parallel, delayed
import multiprocessing

num_cores = multiprocessing.cpu_count()

results = Parallel(n_jobs=num_cores)(SequencesExtractTime(small_pd.loc)(i) for i in range(0,small_pd.shape[0]))

but unfortunately this does not execute, since I do not know how to declare that the input is the separate rows of this dataframe.

Can you please help me achieve this kind of parallelization in Python? The inputs are the rows of a DataFrame, and the outputs are rows of a DataFrame that need to be merged together.

Thanks a lot

Regards

Alex

  • Why are you choosing multiprocessing? What are you doing with SequencesExtractTime and what is in small_pd? Perhaps there is another avenue than multiprocessing to solve your problem if you could share this information? Commented Jul 9, 2019 at 8:23

1 Answer

You can use the Pool object from Python's multiprocessing module:

import multiprocessing as mp

num_workers = mp.cpu_count()
pool = mp.Pool(num_workers)
results_pool = []
for i in range(0, small_pd.shape[0]):
    # args must be a tuple, hence the trailing comma
    results_pool.append(pool.apply_async(SequencesExtractTime, args=(i,)))
pool.close()
pool.join()
multi_results = [r.get() for r in results_pool]
print(multi_results)

5 Comments

Thanks a lot. I am currently trying it on a small subset of the DataFrame to get accustomed to it. What is not clear is how to get the output. This is what I get from the code: pool gives <multiprocessing.pool.Pool at 0x149549d30> and results_pool gives [<multiprocessing.pool.ApplyResult at 0x14154b160>, <multiprocessing.pool.ApplyResult at 0x14154bbe0>, <multiprocessing.pool.ApplyResult at 0x14154bef0>]
I have edited the answer: multi_results = [r.get() for r in results_pool] will contain all the results.
Saw it, thanks. I get TypeError: 'int' object is not subscriptable from the r.get(). It might be how I call my function, which in reality takes two parameters: for i in range(0, small_pd.shape[0]): results_pool.append(pool.apply_async(SequencesExtractTime, args=(i, listOfUePatterns))) then pool.close(), pool.join(), multi_results = [r.get() for r in results_pool]
I think the issue is due to the return type of the function. Could you tell me the result of this: for r in results_pool: print(r.to_string())
I was able to run the code like this: results_pool.append(pool.apply_async(SequencesExtractTime, kwds={'seqInput': small_pd.iloc[i]}))
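For completeness, the original joblib attempt from the question can also be made to work by wrapping the function with delayed and passing the row itself. A sketch with a placeholder function standing in for SequencesExtractTime:

```python
from joblib import Parallel, delayed
import multiprocessing
import pandas as pd

# Placeholder for SequencesExtractTime: row in, one-row DataFrame out.
def sequences_extract_time(row):
    return pd.DataFrame([{"unique": row["a"], "timediff": row["b"]}])

small_pd = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

num_cores = multiprocessing.cpu_count()
# delayed(...) defers the call; each task receives its own row via iloc.
results = Parallel(n_jobs=num_cores)(
    delayed(sequences_extract_time)(small_pd.iloc[i]) for i in range(small_pd.shape[0])
)
new_df = pd.concat(results, ignore_index=True)
```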
