
I'm trying to multiprocess an action inside a for x in y loop. The idea of the script is to make a request to a site that returns a JSON file containing a list of URLs. Once that is fetched, another function is called to parse each URL individually. What I've been trying to do is multiprocess this task with multiprocessing.Process() in order to speed things up, since there are lots of URLs to parse. However, my approach doesn't speed things up at all; it runs at the same speed as with no multiprocessing. It seems to get blocked when using proc.join().

This is the code I've been working on:

import json
import requests
import multiprocessing

def ExtractData(id):
    print("Processing ", id)
    # Fetch the newline-delimited JSON index for this id
    result = requests.get('http://example-index.com/' + id)
    result = result.text.split('\n')[:-1]
    for entry in result:
        data = json.loads(entry)['url']
        print("data is:", data)

def ParseJsonAndCall():
    url = "https://example-site.com/info.json"
    data = json.loads(requests.get(url).text)
    t = []
    for results in data:
        print("Processing ", results['url'])
        # One Process per URL; start() returns immediately
        p = multiprocessing.Process(target=ExtractData, args=(results['id'],))
        t.append(p)
        p.start()
    # Wait for every spawned process to finish
    for proc in t:
        proc.join()

ParseJsonAndCall()

Any help would be greatly appreciated!


1 Answer


A Pool may help.

import json
import multiprocessing as mp

import requests

def ParseJsonAndCall():
    url = "https://example-site.com/info.json"
    data = json.loads(requests.get(url).text)
    collect_results = []
    # Reuse a fixed pool of workers instead of one Process per URL
    with mp.Pool(processes=mp.cpu_count()) as pool:
        for results in data:
            res = pool.apply_async(ExtractData, (results['id'],))
            collect_results.append(res)
        # get() blocks until the corresponding job has finished
        for res in collect_results:
            res.get()
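
Each apply_async call returns an AsyncResult immediately, and res.get() blocks until that particular job has finished, so the jobs run in parallel while the parent still waits for everything before the pool closes. For this single-argument case, here's a sketch of the same idea with Pool.map, which blocks until every job is done; ParseJsonAndCallMap is a hypothetical name, it assumes the ExtractData and placeholder URLs from the question, and on platforms that use the spawn start method you would also want the usual if __name__ == '__main__': guard:

import json
import multiprocessing as mp

import requests

def ParseJsonAndCallMap():
    url = "https://example-site.com/info.json"
    data = json.loads(requests.get(url).text)
    ids = [results['id'] for results in data]
    with mp.Pool(processes=mp.cpu_count()) as pool:
        # map() spreads the ids across the worker processes and
        # blocks until every ExtractData call has returned
        pool.map(ExtractData, ids)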

One caveat: the print statements in ExtractData() may produce interleaved output, since all of the worker processes write to the same stdout concurrently.
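
One way around that, as a sketch: have the worker return the parsed URLs instead of printing them, and do all printing in the parent. ExtractDataQuiet and ParseJsonAndCallQuiet are hypothetical names, and the index URL is the placeholder from the question:

import json
import multiprocessing as mp

import requests

def ExtractDataQuiet(id):
    # Variant of ExtractData that returns the parsed URLs
    # rather than printing from inside the worker process
    result = requests.get('http://example-index.com/' + id)
    lines = result.text.split('\n')[:-1]
    return [json.loads(entry)['url'] for entry in lines]

def ParseJsonAndCallQuiet():
    url = "https://example-site.com/info.json"
    data = json.loads(requests.get(url).text)
    with mp.Pool(processes=mp.cpu_count()) as pool:
        jobs = [pool.apply_async(ExtractDataQuiet, (r['id'],)) for r in data]
        # Only the parent prints, so output lines can't interleave
        for job in jobs:
            for parsed_url in job.get():
                print("data is:", parsed_url)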


3 Comments

Thanks. I'm not sure how to implement this with the code I provided. Also, I use Process(); would that be compatible with it?
With Pool.apply_async the call returns immediately instead of waiting for the result, so ExtractData never completes (add a print('Job completed') at the end of the function). Maybe a lock is needed?
Ah, my mistake, I forgot to call get().
