I have a nested for loop in Python that I'd like to parallelize as much as possible.
Suppose I have some arbitrary function func(a, b) that accepts two arguments, and I'd like to evaluate it on every combination of elements from two sequences M and N.
What I've done so far is 'flatten' the indices into a dictionary:
idx_map = {}
count = 0
for i in range(n):
    for j in range(m):
        idx_map[count] = (i, j)
        count += 1
Now that my nested loop is flattened, I can use it like so:
arr = []
for idx in range(n * m):
    i, j = idx_map[idx]
    arr.append(func(M[i], N[j]))
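As an aside, this flattening appears to be equivalent to iterating over itertools.product directly, which skips the bookkeeping dict (a sketch using the same n, m, M, N, and func as above):

import itertools

# (i, j) pairs in the same row-major order as idx_map
arr = [func(M[i], N[j]) for i, j in itertools.product(range(n), range(m))]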
Can I use Python's built-in multiprocessing to parallelize this? Race conditions should not be an issue, because I do not need to aggregate the func calls; I just want to arrive at a final array that holds the result of func(a, b) for every combination across M and N. (So async behavior and its complexity should not be relevant here.)
What's the best way to accomplish this?
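To make the question concrete, here is a minimal sketch of the kind of thing I have in mind, using multiprocessing.Pool.starmap (the func here is a hypothetical stand-in, and it must be picklable, i.e. defined at module top level):

import itertools
from multiprocessing import Pool

def func(a, b):
    # hypothetical stand-in for the real two-argument function
    return a * b

if __name__ == "__main__":
    M = range(4)   # stand-ins for the real inputs
    N = range(3)
    pairs = list(itertools.product(M, N))  # every (a, b) combination
    with Pool() as pool:
        # starmap unpacks each (a, b) tuple into func(a, b) and
        # returns the results in the same order as `pairs`
        arr = pool.starmap(func, pairs)
    print(arr)

starmap blocks until all results are ready and preserves the input order, which seems to match the final-array requirement above.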
I found this related question, but I don't understand what the author was trying to illustrate:
import multiprocessing as mp
import time

if 1:  # multi-threaded
    pool = mp.Pool(28)  # try 2X num procs and inc/dec until cpu maxed
    st = time.time()
    # worker and data_Y come from the linked question's context
    for x in pool.imap_unordered(worker, range(data_Y)):
        pass
    print('Multiprocess total time is %4.3f seconds' % (time.time() - st))
    print()
func). This is because multiprocessing uses processes (heavy) and not threads (light). Note that Python code is generally interpreted using CPython, so it is typically 10-100x slower than native code unless func spends most of its time in optimized native modules. If func mostly uses pure-Python code, then it is generally more efficient to vectorize it first (i.e. use native code); that avoids wasting 28 additional cores. The # multi-threaded comment is thus not correct.

The "best way" is not really well defined (does it mean fastest, most Pythonic, shortest, etc.?). It is also likely dependent on the content of func: if func is IO-bound (or mostly uses NumPy on large arrays), then using multiple threads can be faster and more convenient. Besides, what specifically did you not understand from the code (and the many multiprocessing examples, this post, as well as the standard documentation)? map just calls the Python function N times, once per item, but func itself is still certainly not vectorized (hard to know without seeing the code). I recommend rewriting func so that it amounts to one native call, or only a few of them (depending on your performance needs), and using efficient modules in the first place.
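For instance, if func boiled down to elementwise arithmetic, a vectorized sketch with NumPy (the multiplication below is a hypothetical stand-in for func, not the actual code) could replace both the loop and the process pool:

import numpy as np

M = np.arange(4)   # hypothetical numeric inputs
N = np.arange(3)

# Broadcasting evaluates the stand-in operation for every (i, j) pair
# at once in native code, with no Python-level loop.
arr = (M[:, None] * N[None, :]).ravel()  # same order as the flat loop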