
Honestly, I was not even sure what to title this question. I am trying to loop through a large list of URLs, but only process 20 URLs at a time (20 being the number of proxies I have). I also need to keep cycling through the proxy list as I process the URLs. So, for example, it would start with the 1st URL and the 1st proxy, and once it hits the 21st URL it would use the 1st proxy again. Here is my rough example below; if anyone can point me in the right direction, it would be much appreciated.

import pymysql.cursors
from multiprocessing import Pool
from fake_useragent import UserAgent

def worker(args):
    var_a, id, name, content, proxy, headers, connection = args
    print (var_a)
    print (id)
    print (name)
    print (content)
    print (proxy)
    print (headers)
    print (connection)
    print ('---------------------------')

if __name__ == '__main__':
    connection = pymysql.connect(
        host = 'host ',
        user = 'user',
        password = 'password',
        db = 'db',
        charset='utf8mb4',
        cursorclass=pymysql.cursors.DictCursor
    )

    ua = UserAgent()
    user_agent = ua.chrome
    headers = {'User-Agent' : user_agent}

    proxies = [
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx'
    ]

    with connection.cursor() as cursor:
        sql = "SELECT id,name,content FROM table"
        cursor.execute(sql)
        urls = cursor.fetchall()

    var_a = 'static'

    data = ((var_a, url['id'], url['name'], url['content'], proxies[i % len(proxies)], headers, connection) for i, url in enumerate(urls))
    proc_num = 20
    p = Pool(processes=proc_num)
    results = p.imap(worker, data)
    p.close() 
    p.join()

2 Answers


You can use a list to store new processes. When you reach a certain number of items, call join for each process in the list. This should give you some control over the number of active processes.

from multiprocessing import Process

if __name__ == '__main__':
    proc_num = 20
    proc_list = []
    for i, url in enumerate(urls):
        proxy = proxies[i % len(proxies)]  # cycle through the proxy list
        p = Process(target=worker, args=(url, proxy))
        p.start()
        proc_list.append(p)
        # when a full batch has been started (or this is the last URL),
        # wait for every process in the batch before starting the next one
        if (i + 1) % proc_num == 0 or i == len(urls) - 1:
            for proc in proc_list:
                proc.join()
            proc_list = []


If you want a constant number of active processes, you can try the Pool class from multiprocessing. Just modify the worker definition to receive a tuple.

from multiprocessing import Pool

if __name__ == '__main__':
    # pair each URL with a proxy, cycling through the proxy list
    data = ((url, proxies[i % len(proxies)]) for i, url in enumerate(urls))
    proc_num = 20
    p = Pool(processes=proc_num)
    results = p.imap(worker, data)
    p.close()
    p.join()

Just to clarify things, the worker function should receive a tuple and then unpack it.

def worker(args):
    var_a, id, name, content, proxy, headers, connection = args
    print (var_a)
    ... etc ...

15 Comments

I have been testing the code you gave me, and it works to a degree. But I have a while loop around the request that does not break until the request goes through (sometimes the backconnect proxy is bad and needs to wait for a new one). When that happens, it seems to wait for the while loop to complete before any of the other links are requested. I thought the whole point of multiprocessing was being able to call the same function multiple times at once? Maybe I am misunderstanding how it works.
You could use multiprocessing.Pool; it should be much smoother. Also consider using a reasonable timeout (5-30 seconds) in requests.get; see the sketch after these comments for one way to combine the two.
That looks a lot smoother. I see you are passing data into imap, but what if I have more variables I need to pass into the function? I need to access url["name"], url["id"], etc. from urls, so I am a little confused about how to add those variables to the imap call.
Can you be more specific? url is a string; it doesn't have any keys. However, you can modify the definition of worker to accept an arbitrary number of arguments: def worker(*args):, or build a "helper" function to unpack the arguments to worker, e.g. def helper(args): return worker(*args)
Yes, sorry, urls was just an example. urls is really the result of a MySQL SELECT query, so I need to be able to take the columns from that and pass them into the function, along with the proxies (which are set up exactly as in my example above). Hope that is a little clearer.
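Pulling the comment thread together, here is a minimal sketch of how the Pool approach could carry the extra row columns plus a proxy and a request timeout into the worker. It assumes the requests library, a 15-second timeout, that url['content'] holds the address to fetch, and small placeholder values for var_a, headers, proxies and urls; the database connection from the question is deliberately left out of the tuple, since connections generally cannot be pickled and sent to worker processes.

import requests
from multiprocessing import Pool

def worker(args):
    # unpack the tuple built from the query row plus the proxy and headers
    var_a, row_id, name, content, proxy, headers = args
    try:
        # the timeout keeps one bad backconnect proxy from blocking forever
        resp = requests.get(content,  # assumes 'content' holds the URL to fetch
                            proxies={'http': proxy, 'https': proxy},
                            headers=headers,
                            timeout=15)
        return (row_id, name, resp.status_code)
    except requests.RequestException as exc:
        return (row_id, name, str(exc))

if __name__ == '__main__':
    var_a = 'static'
    headers = {'User-Agent': 'Mozilla/5.0'}                    # placeholder headers
    proxies = ['xxx.xxx.xxx.xxx:xxxxx'] * 20                   # placeholder proxies
    urls = [{'id': 1, 'name': 'example',
             'content': 'http://example.com'}]                 # stand-in for cursor.fetchall()

    data = ((var_a, url['id'], url['name'], url['content'],
             proxies[i % len(proxies)], headers)
            for i, url in enumerate(urls))

    with Pool(processes=20) as pool:
        for result in pool.imap(worker, data):
            print(result)

Iterating over pool.imap yields results in submission order as the workers finish, and the pool itself never runs more than its 20 processes at once, so the proxy/URL pairing stays aligned with the 20-at-a-time requirement.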

Try the code below:

for i in range(len(urls)):
    url = urls[i] # Current URL
    proxy = proxies[i % len(proxies)] # Current proxy
    # ...

4 Comments

What about only spawning 20 processes (or however many proxies there are in the list) at a time?
When each process starts, increment a counter; decrement it when the process ends. In the for loop, check the counter before starting another one (see the sketch after these comments).
I guess I am just confused. Won't the for loop just make all the processes start at once? So if I have 1000 links, won't it try to start 1000 processes? How do I have it create only 20 processes at a time?
I think I need something like this: stackoverflow.com/questions/20190668/… (first answer), but how do I pass the proxies into the function? In that answer there is no loop used; he just passes the array to map.
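As a rough illustration of the counter idea from the comments (not code from either answer), a multiprocessing.BoundedSemaphore can cap how many processes are alive at once while still cycling through the proxies. The limit of 20, the placeholder data and the trivial worker body are all assumptions for this sketch.

from multiprocessing import Process, BoundedSemaphore

def worker(url, proxy, sem):
    try:
        print(url, 'via', proxy)   # placeholder for the real request logic
    finally:
        sem.release()              # free a slot when this process is done

if __name__ == '__main__':
    urls = ['http://example.com/%d' % n for n in range(100)]   # stand-in data
    proxies = ['xxx.xxx.xxx.xxx:xxxxx'] * 20                   # placeholder proxies
    max_procs = 20                                             # assumed limit
    sem = BoundedSemaphore(max_procs)

    procs = []
    for i, url in enumerate(urls):
        sem.acquire()              # blocks while 20 processes are already running
        p = Process(target=worker, args=(url, proxies[i % len(proxies)], sem))
        p.start()
        procs.append(p)

    for p in procs:
        p.join()

Pool does the same bookkeeping for you, so the Pool.imap example in the other answer is usually the simpler choice; the semaphore version just makes the "counter" explicit.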
