I need some help getting started with running some parallel code in Python. I don't think I can share executable code for my problem, but you can still help me solve my issue conceptually.

I have written a function that takes a pandas DataFrame row as input. That function makes some calculations and returns a row of a pandas DataFrame that has different column names than the input.

So far I have been using it in a for loop, passing in the rows one by one and appending each returned row to a new DataFrame:

new_df = pd.DataFrame(columns=['1','2','unique','occurence','timediff','ueid'], dtype='float')

for i in range(0, small_pd.shape[0]):  # small_pd is the input dataframe
    new_df = new_df.append(SequencesExtractTime(small_pd.loc[i]))
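For reference, the same loop can be written without growing the DataFrame on every iteration, by collecting the per-row results in a list and concatenating them once at the end. A minimal sketch, with a made-up stand-in for SequencesExtractTime since I cannot share the real code:

```python
import pandas as pd

# Made-up stand-in for SequencesExtractTime: takes a row (a Series)
# and returns a one-row DataFrame with different column names.
def sequences_extract_time(row):
    return pd.DataFrame([{"unique": row["a"] * 2, "timediff": row["b"] - row["a"]}])

small_pd = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Collect the per-row outputs in a list and merge them with one concat call.
parts = [sequences_extract_time(small_pd.loc[i]) for i in range(small_pd.shape[0])]
new_df = pd.concat(parts, ignore_index=True)
```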

Now I want to run this code in parallel. I have found the multiprocessing package.

from joblib import Parallel, delayed
import multiprocessing

num_cores = multiprocessing.cpu_count()

results = Parallel(n_jobs=num_cores)(SequencesExtractTime(small_pd.loc)(i) for i in range(0,small_pd.shape[0]))

but unfortunately this does not execute, since I do not know how to declare that the input is the separate rows of this dataframe.

Can you please help me achieve this kind of parallelization in Python? The inputs are the rows of a DataFrame, and the outputs are rows of a DataFrame that need to be merged together.

Thanks a lot

Regards

Alex

  • Why are you choosing multiprocessing? What are you doing with SequencesExtractTime and what is in small_pd? Perhaps there is another avenue than multiprocessing to solve your problem if you could share this information? Commented Jul 9, 2019 at 8:23

1 Answer

You can use the Pool object from Python's multiprocessing module:

import multiprocessing as mp

num_workers = mp.cpu_count()
pool = mp.Pool(num_workers)
results_pool = []
for i in range(0, small_pd.shape[0]):
    # args must be a tuple, hence the trailing comma
    results_pool.append(pool.apply_async(SequencesExtractTime, args=(i,)))
pool.close()
pool.join()
multi_results = [r.get() for r in results_pool]
print(multi_results)

5 Comments

Thanks a lot. I am currently trying it on a small subset of the DataFrame to get accustomed to it. What is not clear is how to get the output. This is what I get from the code: pool gives <multiprocessing.pool.Pool at 0x149549d30> and results_pool gives [<multiprocessing.pool.ApplyResult at 0x14154b160>, <multiprocessing.pool.ApplyResult at 0x14154bbe0>, <multiprocessing.pool.ApplyResult at 0x14154bef0>]
I have edited the answer: multi_results = [r.get() for r in results_pool] will contain all the results.
Saw it, thanks. I get TypeError: 'int' object is not subscriptable from the r.get(). It might be how I call my function, which in reality takes two parameters: for i in range(0, small_pd.shape[0]): results_pool.append(pool.apply_async(SequencesExtractTime, args=(i, listOfUePatterns))) then pool.close(), pool.join(), multi_results = [r.get() for r in results_pool]
I think the issue is due to the return type of the function. Could you tell me the result of this: for r in results_pool: print(r.to_string())
I was able to run the code like this: results_pool.append(pool.apply_async(SequencesExtractTime, kwds={'seqInput': small_pd.iloc[i]}))
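For completeness, the original joblib attempt from the question can also be made to work by wrapping the function with delayed and passing the row itself. A sketch with a placeholder function standing in for SequencesExtractTime:

```python
from joblib import Parallel, delayed
import multiprocessing
import pandas as pd

# Placeholder for SequencesExtractTime: row in, one-row DataFrame out.
def sequences_extract_time(row):
    return pd.DataFrame([{"unique": row["a"], "timediff": row["b"]}])

small_pd = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

num_cores = multiprocessing.cpu_count()
# delayed(...) defers the call; each task receives its own row via iloc.
results = Parallel(n_jobs=num_cores)(
    delayed(sequences_extract_time)(small_pd.iloc[i]) for i in range(small_pd.shape[0])
)
new_df = pd.concat(results, ignore_index=True)
```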
