
I have written a function that takes a list of peptides (biological sequences as strings) and returns a Pandas DataFrame (one sample per row, one descriptor per column). "my_function(pep_list)" takes pep_list as a parameter, iterates over each peptide sequence in pep_list, calculates its descriptors, combines all of the results into a single DataFrame, and returns that df:

example:

    pep_list = ["DAAAAEF", "DAAAREF", "DAAANEF", "DAAADEF", "DAAACEF", "DAAAEEF", "DAAAQEF", "DAAAGEF", "DAAAHEF", "DAAAIEF", "DAAALEF", "DAAAKEF"]

I want to parallelise this code with the algorithm below:

1. Get the number of available processors:

     n = multiprocessing.cpu_count()

2. Split pep_list into n sub-lists:

     sub_list_of_pep_list = [["DAAAAEF", "DAAAREF", "DAAANEF"], ["DAAADEF", "DAAACEF", "DAAAEEF"], ["DAAAQEF", "DAAAGEF", "DAAAHEF"], ["DAAAIEF", "DAAALEF", "DAAAKEF"]]

3. Run "my_function()" on each sub-list, one per core (example for 4 cores):

     df0 = my_function(sub_list_of_pep_list[0])
     df1 = my_function(sub_list_of_pep_list[1])
     df2 = my_function(sub_list_of_pep_list[2])
     df3 = my_function(sub_list_of_pep_list[3])

4. Join all the results: df = pd.concat([df0, df1, df2, df3])

5. Return df, ideally with close to an n-fold speedup (see the sketch after this list for one way to put these steps together).
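A minimal runnable sketch of how these steps could fit together (the my_function body here is only a stand-in that records each sequence and its length, and the chunking is plain ceiling division; the real descriptor calculation from the question would go in its place):

import multiprocessing
import pandas as pd

# Stand-in for the descriptor function from the question: it just records
# each sequence and its length so the sketch runs end to end.
def my_function(peptides):
    return pd.DataFrame({"peptide": peptides,
                         "length": [len(p) for p in peptides]})

pep_list = ["DAAAAEF", "DAAAREF", "DAAANEF", "DAAADEF", "DAAACEF", "DAAAEEF",
            "DAAAQEF", "DAAAGEF", "DAAAHEF", "DAAAIEF", "DAAALEF", "DAAAKEF"]

if __name__ == "__main__":
    # 1. number of available processors
    n = multiprocessing.cpu_count()

    # 2. split pep_list into roughly equal sub-lists
    chunk_size = -(-len(pep_list) // n)  # ceiling division
    sub_list_of_pep_list = [pep_list[i:i + chunk_size]
                            for i in range(0, len(pep_list), chunk_size)]

    # 3. run my_function on every sub-list, one worker process per core
    with multiprocessing.Pool(processes=n) as pool:
        partial_dfs = pool.map(my_function, sub_list_of_pep_list)

    # 4./5. join the partial results into one DataFrame
    df = pd.concat(partial_dfs, ignore_index=True)
    print(df)

This mirrors the Pool-based approach suggested in the answer below.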

Please suggest the most suitable library to implement this method.

thanks and regards.

Updated 

With some reading, I was able to write code that works as per my expectation:

1. without parallelising, it takes ~10 seconds for 10 peptide sequences
2. with two processes, it takes ~6 seconds for 12 peptides
3. with four processes, it takes ~4 seconds for 12 peptides

from multiprocessing import Process

def func1():
    structure_gen(pep_seq=["DAAAAEF", "DAAAREF", "DAAANEF"])

def func2():
    structure_gen(pep_seq=["DAAAQEF", "DAAAGEF", "DAAAHEF"])

def func3():
    structure_gen(pep_seq=["DAAADEF", "DAAALEF"])

def func4():
    structure_gen(pep_seq=["DAAAIEF", "DAAALEF"])

if __name__ == '__main__':
    # start one process per hard-coded chunk of peptides
    p1 = Process(target=func1)
    p1.start()
    p2 = Process(target=func2)
    p2.start()
    p3 = Process(target=func3)
    p3.start()
    p4 = Process(target=func4)
    p4.start()
    # wait for all four processes to finish
    p1.join()
    p2.join()
    p3.join()
    p4.join()

But while this code works easily with 10 peptides, I am not able to implement it for a pep_list containing 1 million peptides.
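For reference, a minimal sketch of how the same hard-coded pattern could be generalised to any number of peptides without writing one funcN per chunk (structure_gen here is a stand-in with an empty body, and the short pep_list is only illustrative):

from multiprocessing import Process, cpu_count

# Stand-in for the real structure_gen from the question; the body is only
# here so the sketch is self-contained.
def structure_gen(pep_seq):
    for seq in pep_seq:
        pass  # per-peptide work goes here

if __name__ == '__main__':
    # In practice pep_list could hold 1 million peptides.
    pep_list = ["DAAAAEF", "DAAAREF", "DAAANEF", "DAAAQEF", "DAAAGEF",
                "DAAAHEF", "DAAADEF", "DAAALEF", "DAAAIEF", "DAAAKEF"]

    n = cpu_count()
    chunk_size = -(-len(pep_list) // n)  # ceiling division
    chunks = [pep_list[i:i + chunk_size]
              for i in range(0, len(pep_list), chunk_size)]

    # One Process per chunk, so the number of processes stays at n
    # no matter how long pep_list is.
    processes = [Process(target=structure_gen, kwargs={"pep_seq": chunk})
                 for chunk in chunks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

Note that Process does not hand structure_gen's return value back to the parent; to collect the resulting DataFrames, Pool.map (as in the answer below) is more convenient.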

thanks

2 Comments
  • Process(target=my_function, args=(each_item_in_sub_list,)).start() will work; you can spawn more Processes than the number of CPUs.
  • Please explain in detail if possible. Thanks.

1 Answer


multiprocessing.Pool.map is what you're looking for.
Try this:

from multiprocessing import Pool

import numpy as np
from pandas import concat

# I recommend using more partitions than processes,
# this way the work can be balanced.
# Of course this only makes sense if pep_list is bigger than
# the one you provide. If not, change this to 8 or so.
n = 50

# create indices for the partitions
ix = np.linspace(0, len(pep_list), n+1, endpoint=True, dtype=int)

# create partitions using the indices
sub_lists = [pep_list[i1:i2] for i1, i2 in zip(ix[:-1], ix[1:])]

p = Pool()
try:
    # p.map will return a list of dataframes which are to be
    # concatenated
    df = concat(p.map(my_function, sub_lists))
finally:
    p.close()

The pool will automatically contain as many worker processes as there are available cores, but you can override this number if you want to; have a look at the docs.
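For instance, a short sketch of passing an explicit process count, reusing concat, my_function, and sub_lists from the snippet above (the value 4 is only an illustration):

from multiprocessing import Pool

# Request exactly 4 worker processes instead of the default cpu_count().
p = Pool(processes=4)
try:
    df = concat(p.map(my_function, sub_lists))
finally:
    p.close()
    p.join()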


6 Comments

It stops with the error "TypeError: linspace() got an unexpected keyword argument 'dtype'".
@jax Which numpy version do you have? This argument is new in 1.9, but there is a workaround: try np.linspace(0, len(pep_list), n+1, endpoint=True).astype(int)
Thanks for your comment. I had to adapt your code to my case, and it is working faster: my normal code takes ~12.0 sec to compute 12 peptide sequences, while your code takes ~10.0 sec, even if I increase the number "n". So your code is faster, but not much faster (or maybe I am not implementing it properly). I have written code that works as per my expectation; I have updated the question above, please review it.
@jax Without the complete code I can't really tell what the problem is. Usually, if there is no large speedup after parallelization, it means there is a lot of shared-memory access or data exchange. I would rule out the former, since there should be no shared memory by default, and I think it would also affect your hardcoded approach. The latter should be mitigated by the chunking and also contradicts your results... unless... do you actually collect the returned dataframes in the end? Also, more chunks is not always better, since it increases the exchange rate; sometimes you just have to play around.
Thanks for your explanation. I think I should first learn the basics and core concepts of parallelization. Can you suggest a basic reference that would help me understand the basics of parallelization, not very mathematical or full of computational terminology, but simply understandable for a biologist like me? Thanks.
