
Honestly, I was not even sure what to title this question. I am trying to loop through a large list of URLs, but only process 20 URLs at a time (20 being the number of proxies I have). I also need to keep cycling through the proxy list as I process the URLs. So, for example, it would start with the 1st URL and the 1st proxy, and once it hits the 21st URL it would use the 1st proxy again. Here is my rough example below; if anyone can point me in the right direction, it would be much appreciated.

import pymysql.cursors
from multiprocessing import Pool
from fake_useragent import UserAgent

def worker(args):
    var_a, id, name, content, proxy, headers, connection = args
    print (var_a)
    print (id)
    print (name)
    print (content)
    print (proxy)
    print (headers)
    print (connection)
    print ('---------------------------')

if __name__ == '__main__':
    connection = pymysql.connect(
        host = 'host ',
        user = 'user',
        password = 'password',
        db = 'db',
        charset='utf8mb4',
        cursorclass=pymysql.cursors.DictCursor
    )

    ua = UserAgent()
    user_agent = ua.chrome
    headers = {'User-Agent' : user_agent}

    proxies = [
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx'
    ]

    with connection.cursor() as cursor:
        sql = "SELECT id,name,content FROM table"
        cursor.execute(sql)
        urls = cursor.fetchall()

    var_a = 'static'

    data = ((var_a, url['id'], url['name'], url['content'], proxies[i % len(proxies)], headers, connection) for i, url in enumerate(urls))
    proc_num = 20
    p = Pool(processes=proc_num)
    results = p.imap(worker, data)
    p.close() 
    p.join()

2 Answers


You can use a list to store new processes. When you reach a certain number of items, call join for each process in the list. This should give you some control over the number of active processes.

from multiprocessing import Process

if __name__ == '__main__':
    proc_num = 20
    proc_list = []
    for i, url in enumerate(urls):
        proxy = proxies[i % len(proxies)]  # cycle through the proxy list
        p = Process(target=worker, args=(url, proxy))
        p.start()
        proc_list.append(p)
        # when a full batch has been started (or this is the last URL),
        # wait for every process in the batch before starting the next one
        if (i + 1) % proc_num == 0 or i == len(urls) - 1:
            for proc in proc_list:
                proc.join()
            proc_list = []


If you want a constant number of active processes, you can try the Pool class from multiprocessing. Just modify the worker definition to receive a tuple.

from multiprocessing import Pool

if __name__ == '__main__':
    # pair each URL with a proxy, cycling through the proxy list
    data = ((url, proxies[i % len(proxies)]) for i, url in enumerate(urls))
    proc_num = 20
    p = Pool(processes=proc_num)
    results = p.imap(worker, data)
    p.close()
    p.join()

Just to clarify things, the worker function should receive a tuple and then unpack it.

def worker(args):
    var_a, id, name, content, proxy, headers, connection = args
    print (var_a)
    ... etc ...

15 Comments

I have been testing the code you gave me, and it works to a degree. But I have a while loop around the request that does not break until the request goes through (sometimes the backconnect proxy is bad and needs to wait for a new one). When that happens, it seems to wait for the while loop to complete before any of the other links are requested. I thought the whole point of multiprocessing was being able to call the same function multiple times at once? Maybe I am misunderstanding how it works.
You could use multiprocessing.Pool; it should be much smoother. Also consider using a reasonable timeout (5-30 seconds) in requests.get; see the sketch after these comments for one way to combine the two.
That looks a lot smoother. I see you are passing data into imap, but what if I have more variables I need to pass into the function? I need to access url["name"], url["id"], etc. from urls, so I am a little confused about how to add those variables to the imap call.
Can you be more specific? url is a string; it doesn't have any keys. However, you can modify the definition of worker to accept an arbitrary number of arguments: def worker(*args):, or build a "helper" function to unpack the arguments to worker, e.g. def helper(args): return worker(*args)
Yes, sorry, urls was just an example. urls is really the result of a MySQL SELECT query, so I need to be able to take the columns from that and pass them into the function, along with the proxies (which are set up exactly as in my example above). Hope that is a little clearer.
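Pulling the comment thread together, here is a minimal sketch of how the Pool approach could carry the extra row columns plus a proxy and a request timeout into the worker. It assumes the requests library, a 15-second timeout, that url['content'] holds the address to fetch, and small placeholder values for var_a, headers, proxies and urls; the database connection from the question is deliberately left out of the tuple, since connections generally cannot be pickled and sent to worker processes.

import requests
from multiprocessing import Pool

def worker(args):
    # unpack the tuple built from the query row plus the proxy and headers
    var_a, row_id, name, content, proxy, headers = args
    try:
        # the timeout keeps one bad backconnect proxy from blocking forever
        resp = requests.get(content,  # assumes 'content' holds the URL to fetch
                            proxies={'http': proxy, 'https': proxy},
                            headers=headers,
                            timeout=15)
        return (row_id, name, resp.status_code)
    except requests.RequestException as exc:
        return (row_id, name, str(exc))

if __name__ == '__main__':
    var_a = 'static'
    headers = {'User-Agent': 'Mozilla/5.0'}                    # placeholder headers
    proxies = ['xxx.xxx.xxx.xxx:xxxxx'] * 20                   # placeholder proxies
    urls = [{'id': 1, 'name': 'example',
             'content': 'http://example.com'}]                 # stand-in for cursor.fetchall()

    data = ((var_a, url['id'], url['name'], url['content'],
             proxies[i % len(proxies)], headers)
            for i, url in enumerate(urls))

    with Pool(processes=20) as pool:
        for result in pool.imap(worker, data):
            print(result)

Iterating over pool.imap yields results in submission order as the workers finish, and the pool itself never runs more than its 20 processes at once, so the proxy/URL pairing stays aligned with the 20-at-a-time requirement.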

Try the code below:

for i in range(len(urls)):
    url = urls[i] # Current URL
    proxy = proxies[i % len(proxies)] # Current proxy
    # ...

4 Comments

What about only spawning 20 processes (or however many proxies there are in the list) at a time?
When each process starts, increment a counter; decrement it when the process ends. In the for loop, check the counter before starting another one (see the sketch after these comments).
I guess I am just confused. Won't the for loop just make all the processes start at once? So if I have 1000 links, won't it try to start 1000 processes? How do I have it create only 20 processes at a time?
I think I need something like this: stackoverflow.com/questions/20190668/… (first answer), but how do I pass the proxies into the function? In that answer there is no loop used; he just passes the array to map.
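As a rough illustration of the counter idea from the comments (not code from either answer), a multiprocessing.BoundedSemaphore can cap how many processes are alive at once while still cycling through the proxies. The limit of 20, the placeholder data and the trivial worker body are all assumptions for this sketch.

from multiprocessing import Process, BoundedSemaphore

def worker(url, proxy, sem):
    try:
        print(url, 'via', proxy)   # placeholder for the real request logic
    finally:
        sem.release()              # free a slot when this process is done

if __name__ == '__main__':
    urls = ['http://example.com/%d' % n for n in range(100)]   # stand-in data
    proxies = ['xxx.xxx.xxx.xxx:xxxxx'] * 20                   # placeholder proxies
    max_procs = 20                                             # assumed limit
    sem = BoundedSemaphore(max_procs)

    procs = []
    for i, url in enumerate(urls):
        sem.acquire()              # blocks while 20 processes are already running
        p = Process(target=worker, args=(url, proxies[i % len(proxies)], sem))
        p.start()
        procs.append(p)

    for p in procs:
        p.join()

Pool does the same bookkeeping for you, so the Pool.imap example in the other answer is usually the simpler choice; the semaphore version just makes the "counter" explicit.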
