
I am reading in hundreds of HDF files and processing the data of each one separately. However, this takes an awful lot of time, since it works on one HDF file at a time. I just stumbled upon http://docs.python.org/library/multiprocessing.html and am now wondering how I can speed things up using multiprocessing.

So far, I came up with this:

import numpy as np
from multiprocessing import Pool

def myhdf(date):
    ii      = dates.index(date)
    year    = date[0:4]
    month   = date[4:6]
    day     = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track'+year+month+day
    records = read_my_hdf(rootdir,filename)
    if records.size:
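        # note: this assignment happens in a separate worker process,
        # so the parent process's results array is never updated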
        results[ii] = np.mean(records)

dates = ['20080105','20080106','20080107','20080108','20080109']
results = np.zeros(len(dates))

pool = Pool(len(dates))
pool.map(myhdf,dates)

However, this is obviously not correct. Can you follow my chain of thought and see what I want to do? What do I need to change?

2 Answers


Try joblib for a friendlier multiprocessing wrapper:

import numpy as np
from joblib import Parallel, delayed

def myhdf(date):
    # read the day's HDF file (as in the question) and return its mean
    records = read_my_hdf('data/mydata/', 'no2track' + date)
    if records.size:
        return np.mean(records)

# n_jobs=-1 starts one worker per available core
results = Parallel(n_jobs=-1)(delayed(myhdf)(d) for d in dates)

3 Comments

Other great libraries that achieve similar things to joblib and multiprocessing (including distributed job execution) are IPython Parallel and Celery.
@AndrewWalker: except that those are distributed, right? Joblib is just a wrapper around multiprocessing with some smart pickling.
Yep, distributed and local modes for both. But I think it's worth noting that there are alternative modules that scale up.
2

The Pool class's map function is like the standard Python map function: you're guaranteed to get your results back in the order you put them in. Knowing that, the only other trick is that you need to return results in a consistent manner, and then filter them afterwards.

import numpy as np
from multiprocessing import Pool

def myhdf(date):
    year    = date[0:4]
    month   = date[4:6]
    day     = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track'+year+month+day
    records = read_my_hdf(rootdir,filename)
    if records.size:
        return np.mean(records)

dates = ['20080105','20080106','20080107','20080108','20080109']

pool = Pool(len(dates))
results = pool.map(myhdf,dates)
results = [result for result in results if result is not None]  # drop dates whose files had no records
results = np.array(results)

If you really do want results as soon as they are available, you can use imap_unordered:
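Here is a minimal sketch of that (the myhdf_tagged wrapper is an illustration, not part of the code above): because imap_unordered yields results in completion order rather than input order, tag each result with its date so you can still match them up:

def myhdf_tagged(date):
    # wrap myhdf so each result carries its date; imap_unordered
    # yields results in completion order, not input order
    return date, myhdf(date)

pool = Pool(len(dates))
for date, result in pool.imap_unordered(myhdf_tagged, dates):
    if result is not None:
        print(date, result)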

1 Comment

Thank you! That makes sense. One more question: with Pool(number), am I specifying the number of parallel processes? So in case I have a lot of HDF files (>1000), should I keep the pool value at around 4-5?
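As a rule of thumb, yes: size the pool to the number of CPU cores rather than the number of files, since Pool.map queues the remaining inputs itself. A minimal sketch using multiprocessing.cpu_count():

from multiprocessing import Pool, cpu_count

# one worker per core; map() feeds the >1000 dates to them as workers free up
pool = Pool(processes=cpu_count())
results = pool.map(myhdf, dates)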
