
I have written a function that takes a list of peptides (biological sequences as strings) and returns a Pandas DataFrame (one sample per row, one descriptor per column). "my_function(pep_list)" takes pep_list as a parameter, iterates over each peptide sequence in pep_list, calculates its descriptors, combines all of the results into a single DataFrame, and returns that df:

example:

    pep_list = ["DAAAAEF", "DAAAREF", "DAAANEF", "DAAADEF", "DAAACEF", "DAAAEEF", "DAAAQEF", "DAAAGEF", "DAAAHEF", "DAAAIEF", "DAAALEF", "DAAAKEF"]

I want to parallelise this code with the algorithm below:

1. Get the number of available processors:

     n = multiprocessing.cpu_count()

2. Split pep_list into n sub-lists:

     sub_list_of_pep_list = [["DAAAAEF", "DAAAREF", "DAAANEF"], ["DAAADEF", "DAAACEF", "DAAAEEF"], ["DAAAQEF", "DAAAGEF", "DAAAHEF"], ["DAAAIEF", "DAAALEF", "DAAAKEF"]]

3. Run "my_function()" on each sub-list, one per core (example for 4 cores):

     df0 = my_function(sub_list_of_pep_list[0])
     df1 = my_function(sub_list_of_pep_list[1])
     df2 = my_function(sub_list_of_pep_list[2])
     df3 = my_function(sub_list_of_pep_list[3])

4. Join all the results: df = pd.concat([df0, df1, df2, df3])

5. Return df, ideally with close to an n-fold speedup (see the sketch after this list for one way to put these steps together).
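A minimal runnable sketch of how these steps could fit together (the my_function body here is only a stand-in that records each sequence and its length, and the chunking is plain ceiling division; the real descriptor calculation from the question would go in its place):

import multiprocessing
import pandas as pd

# Stand-in for the descriptor function from the question: it just records
# each sequence and its length so the sketch runs end to end.
def my_function(peptides):
    return pd.DataFrame({"peptide": peptides,
                         "length": [len(p) for p in peptides]})

pep_list = ["DAAAAEF", "DAAAREF", "DAAANEF", "DAAADEF", "DAAACEF", "DAAAEEF",
            "DAAAQEF", "DAAAGEF", "DAAAHEF", "DAAAIEF", "DAAALEF", "DAAAKEF"]

if __name__ == "__main__":
    # 1. number of available processors
    n = multiprocessing.cpu_count()

    # 2. split pep_list into roughly equal sub-lists
    chunk_size = -(-len(pep_list) // n)  # ceiling division
    sub_list_of_pep_list = [pep_list[i:i + chunk_size]
                            for i in range(0, len(pep_list), chunk_size)]

    # 3. run my_function on every sub-list, one worker process per core
    with multiprocessing.Pool(processes=n) as pool:
        partial_dfs = pool.map(my_function, sub_list_of_pep_list)

    # 4./5. join the partial results into one DataFrame
    df = pd.concat(partial_dfs, ignore_index=True)
    print(df)

This mirrors the Pool-based approach suggested in the answer below.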

Please suggest the most suitable library to implement this method.

thanks and regards.

Updated 

With some reading, I was able to write code that works as per my expectation:

1. without parallelising, it takes ~10 seconds for 10 peptide sequences
2. with two processes, it takes ~6 seconds for 12 peptides
3. with four processes, it takes ~4 seconds for 12 peptides

from multiprocessing import Process

def func1():
    structure_gen(pep_seq=["DAAAAEF", "DAAAREF", "DAAANEF"])

def func2():
    structure_gen(pep_seq=["DAAAQEF", "DAAAGEF", "DAAAHEF"])

def func3():
    structure_gen(pep_seq=["DAAADEF", "DAAALEF"])

def func4():
    structure_gen(pep_seq=["DAAAIEF", "DAAALEF"])

if __name__ == '__main__':
    # start one process per hard-coded chunk of peptides
    p1 = Process(target=func1)
    p1.start()
    p2 = Process(target=func2)
    p2.start()
    p3 = Process(target=func3)
    p3.start()
    p4 = Process(target=func4)
    p4.start()
    # wait for all four processes to finish
    p1.join()
    p2.join()
    p3.join()
    p4.join()

But while this code works easily with 10 peptides, I am not able to implement it for a pep_list containing 1 million peptides.
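For reference, a minimal sketch of how the same hard-coded pattern could be generalised to any number of peptides without writing one funcN per chunk (structure_gen here is a stand-in with an empty body, and the short pep_list is only illustrative):

from multiprocessing import Process, cpu_count

# Stand-in for the real structure_gen from the question; the body is only
# here so the sketch is self-contained.
def structure_gen(pep_seq):
    for seq in pep_seq:
        pass  # per-peptide work goes here

if __name__ == '__main__':
    # In practice pep_list could hold 1 million peptides.
    pep_list = ["DAAAAEF", "DAAAREF", "DAAANEF", "DAAAQEF", "DAAAGEF",
                "DAAAHEF", "DAAADEF", "DAAALEF", "DAAAIEF", "DAAAKEF"]

    n = cpu_count()
    chunk_size = -(-len(pep_list) // n)  # ceiling division
    chunks = [pep_list[i:i + chunk_size]
              for i in range(0, len(pep_list), chunk_size)]

    # One Process per chunk, so the number of processes stays at n
    # no matter how long pep_list is.
    processes = [Process(target=structure_gen, kwargs={"pep_seq": chunk})
                 for chunk in chunks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

Note that Process does not hand structure_gen's return value back to the parent; to collect the resulting DataFrames, Pool.map (as in the answer below) is more convenient.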

thanks

2 Comments
  • Process(target=my_function, args=(each_item_in_sub_list,)).start() will work; you can spawn more Processes than the number of CPUs.
  • Please explain in detail if possible. Thanks.

1 Answer


multiprocessing.Pool.map is what you're looking for.
Try this:

from multiprocessing import Pool

import numpy as np
from pandas import concat

# I recommend using more partitions than processes,
# this way the work can be balanced.
# Of course this only makes sense if pep_list is bigger than
# the one you provide. If not, change this to 8 or so.
n = 50

# create indices for the partitions
ix = np.linspace(0, len(pep_list), n+1, endpoint=True, dtype=int)

# create partitions using the indices
sub_lists = [pep_list[i1:i2] for i1, i2 in zip(ix[:-1], ix[1:])]

p = Pool()
try:
    # p.map will return a list of dataframes which are to be
    # concatenated
    df = concat(p.map(my_function, sub_lists))
finally:
    p.close()

The pool will automatically contain as many worker processes as there are available cores, but you can override this number if you want to; have a look at the docs.
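For instance, a short sketch of passing an explicit process count, reusing concat, my_function, and sub_lists from the snippet above (the value 4 is only an illustration):

from multiprocessing import Pool

# Request exactly 4 worker processes instead of the default cpu_count().
p = Pool(processes=4)
try:
    df = concat(p.map(my_function, sub_lists))
finally:
    p.close()
    p.join()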


6 Comments

It stops with the error "TypeError: linspace() got an unexpected keyword argument 'dtype'".
@jax Which numpy version do you have? This argument is new in 1.9, but there is a workaround: try np.linspace(0, len(pep_list), n+1, endpoint=True).astype(int)
Thanks for your comment. I had to adapt your code to my case, and it is working faster: my normal code takes ~12.0 sec to compute 12 peptide sequences, while your code takes ~10.0 sec, even if I increase the number "n". So your code is faster, but not much faster (or maybe I am not implementing it properly). I have written code that works as per my expectation; I have updated the question above, please review it.
@jax Without the complete code I can't really tell what the problem is. Usually, if there is no large speedup after parallelization, it means there is a lot of shared-memory access or data exchange. I would rule out the former, since there should be no shared memory by default, and I think it would also affect your hardcoded approach. The latter should be mitigated by the chunking and also contradicts your results... unless... do you actually collect the returned dataframes in the end? Also, more chunks is not always better, since it increases the exchange rate; sometimes you just have to play around.
Thanks for your explanation. I think I should first learn the basics and core concepts of parallelization. Can you suggest a basic reference that would help me understand the basics of parallelization, not very mathematical or full of computational terminology, but simply understandable for a biologist like me? Thanks.
