I hope this isn't a duplicate, but I couldn't find a fully satisfying answer for this specific problem.
Given a function that takes several list arguments and one index argument (here called iterable), e.g. with two lists:
def function(list1, list2, iterable):
    # Each call touches one even entry of list1 and one odd entry of list2.
    i1 = 2 * iterable
    i2 = 2 * iterable + 1
    list1[i1] *= 2
    list2[i2] += 2
    return list1, list2
Each list is accessed at different entries, so the operations are separate and can be parallelized. What is the best way to do this with Python's multiprocessing?
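For reference, calling the function serially for both index values gives the combined result I am after (just a sanity check, nothing parallel yet):

list1, list2 = [1, 1, 1, 1, 1], [2, 2, 2, 2, 2]
for k in [0, 1]:
    function(list1, list2, k)
print(list1, list2)  # prints [2, 1, 2, 1, 1] [2, 4, 2, 4, 2]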
One easy way to parallelize this would be to use the map function:
import multiprocessing as mp
from functools import partial

if __name__ == "__main__":
    list1, list2 = [1, 1, 1, 1, 1], [2, 2, 2, 2, 2]
    func = partial(function, list1, list2)
    with mp.Pool() as pool:
        result = pool.map(func, [0, 1])
The problem is that, if I understand the map function correctly, every process gets its own copy of the lists and then works at a different position in that copy. After both values of iterable, [0, 1], have been processed, the result of pool.map is
[([2, 1, 1, 1, 1], [2, 4, 2, 2, 2]), ([1, 1, 2, 1, 1], [2, 2, 2, 4, 2])]
but I want
[([2, 1, 2, 1, 1], [2, 4, 2, 4, 2])].
How can I achieve this? Should one split the lists by the index beforehand, run the specific operations in parallel, and then merge them again?
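One idea I can think of is to keep pool.map but have each worker return only the (index, value) pairs it changed, so the parent process can merge them back. The function below (function_updates is my own name for this sketch, not the original function) illustrates the idea:

import multiprocessing as mp
from functools import partial

def function_updates(list1, list2, iterable):
    i1 = 2 * iterable
    i2 = 2 * iterable + 1
    # Return only the changed entries instead of whole list copies.
    return (i1, list1[i1] * 2), (i2, list2[i2] + 2)

if __name__ == "__main__":
    list1, list2 = [1, 1, 1, 1, 1], [2, 2, 2, 2, 2]
    func = partial(function_updates, list1, list2)
    with mp.Pool() as pool:
        updates = pool.map(func, [0, 1])
    # The tasks touch disjoint indices, so merging is a plain assignment.
    for (i1, v1), (i2, v2) in updates:
        list1[i1] = v1
        list2[i2] = v2
    print(list1, list2)  # [2, 1, 2, 1, 1] [2, 4, 2, 4, 2]

This keeps the cheap copies in the workers small (only the updates travel back), and the merge is trivial because no two tasks write to the same entry.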
Thanks in advance, and please excuse me if I mixed something up; I have just started using the multiprocessing library.
EDIT: Operations on different parts of a list can be parallelized without synchronization; operations on the whole list cannot (without synchronization). Therefore, a solution to my specific problem is to split the lists (and the function) into the separate operations and the parts of the lists they touch. Afterwards, the parts are merged to get the whole lists back.
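A minimal sketch of that split/merge idea (the helper names double and add_two are mine, just for illustration): pull out the affected entries of each list, run the per-entry operations in parallel, and write the results back.

import multiprocessing as mp

def double(x):
    return 2 * x

def add_two(x):
    return x + 2

if __name__ == "__main__":
    list1, list2 = [1, 1, 1, 1, 1], [2, 2, 2, 2, 2]
    ks = [0, 1]
    idx1 = [2 * k for k in ks]       # entries of list1 the tasks touch
    idx2 = [2 * k + 1 for k in ks]   # entries of list2 the tasks touch
    with mp.Pool() as pool:
        # Split: run each operation only on the part of its list it needs.
        new1 = pool.map(double, [list1[i] for i in idx1])
        new2 = pool.map(add_two, [list2[i] for i in idx2])
    # Merge: write the computed parts back into the full lists.
    for i, v in zip(idx1, new1):
        list1[i] = v
    for i, v in zip(idx2, new2):
        list2[i] = v
    print(list1, list2)  # [2, 1, 2, 1, 1] [2, 4, 2, 4, 2]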