
I'm populating a matrix using a conditional lookup from a file. The file is extremely large (2,500,000 records) and is stored as a dataframe ('file'). Each matrix row operation (lookup) is independent of the others. Is there any way I could parallelize this process?

I'm working in pandas and Python. My current approach is quite naive.

for r in row:
    for c in column:
        # Select the rows matching this (inventor, year) pair
        num = file[(file['Unique_Inventor_Number'] == r) & (file['AppYearStr'] == c)]['Citation'].tolist()
        # Count the distinct citations
        num = len(set(num))
        d.set_value(r, c, num)
  • Please provide sample input and desired output data sets. Please read how to make good reproducible pandas examples. Commented Feb 15, 2017 at 22:30
  • Have you done any profiling to determine which part of your code is slow? Commented Feb 15, 2017 at 22:44
  • @RolandSmith, let me guess - nested loops? ;-) Commented Feb 15, 2017 at 22:51
  • @MaxU I would have said loading the pandas dataframe from disk... Commented Feb 16, 2017 at 6:59

2 Answers


For 2.5 million records you should be able to do

res = file.groupby(['Unique_Inventor_Number', 'AppYearStr']).Citation.nunique()

The matrix should be available in

res.unstack(level=1).fillna(0).values

I'm not sure if this is the fastest, but it should be significantly faster than your implementation.
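
To see why this works, here is a minimal, self-contained sketch with made-up toy data (the values below are my own illustration, not the asker's file): groupby collects all rows for each (inventor, year) pair in a single pass, nunique counts the distinct citations per group, and unstack pivots the year level into columns.

import pandas as pd

# Toy stand-in for the asker's 'file' dataframe (hypothetical values)
file = pd.DataFrame({
    'Unique_Inventor_Number': ['I1', 'I1', 'I1', 'I2', 'I2'],
    'AppYearStr':             ['2001', '2001', '2002', '2001', '2001'],
    'Citation':               ['C1', 'C2', 'C2', 'C3', 'C3'],
})

# One pass over the data: distinct citations per (inventor, year) pair
res = file.groupby(['Unique_Inventor_Number', 'AppYearStr']).Citation.nunique()

# Pivot years into columns; pairs with no rows become 0
print(res.unstack(level=1).fillna(0))
# Roughly (dtypes may vary):
# AppYearStr              2001  2002
# Unique_Inventor_Number
# I1                       2.0   1.0
# I2                       1.0   0.0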


2 Comments

The groupby operation indeed speeds up the process. Can you please briefly explain how this works? However, res.unstack(level=1).fillna(0).values does not result in a matrix.
I think to create a matrix, just res.unstack(level=1).fillna(0) would suffice.

[EDIT] As Roland mentioned in a comment, in the standard Python implementation this post does not offer any solution to improve CPU performance.

In the standard Python implementation, threads do not really improve performance on CPU-bound tasks. There is a "Global Interpreter Lock" that enforces that only one thread at a time can be executing Python bytecode. This was done to keep the complexity of memory management down.
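
If the per-cell work really is CPU-bound, one common way around the GIL (a sketch of my own, not part of the original answer) is to use processes instead of threads, since each worker process runs its own interpreter with its own GIL. Below is a minimal sketch using the standard multiprocessing module; count_negatives is a hypothetical per-column function, and the data must be picklable to be sent to the workers.

from multiprocessing import Pool

import numpy as np
import pandas as pd

def count_negatives(col):
    # Hypothetical CPU-bound per-column computation
    return int((col < 0).sum())

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randn(100000, 4), columns=['A', 'B', 'C', 'D'])
    # Each column is handled by a separate worker process, so the
    # main interpreter's GIL is not a bottleneck (at the cost of
    # pickling each column to ship it to the workers)
    with Pool(processes=4) as pool:
        results = pool.map(count_negatives, [df[c] for c in df.columns])
    print(dict(zip(df.columns, results)))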

Have you tried using different threads for the different functions?

Let's say you split your dataframe into columns and create multiple threads. Then you assign each thread to apply a function to a column. If you have enough processing power, you might be able to save a lot of time:

from threading import Thread
import pandas as pd
import numpy as np
from queue import Queue
from time import time

# These will be used afterwards
N_THREAD = 8
q = Queue()
df2 = pd.DataFrame()  # The output of the script 

# You create the job that each thread will do
def apply(series, func):
    df2[series.name] = series.map(func)


# You define the context of the jobs
def threader():
    while True:
        worker = q.get()
        apply(*worker)
        q.task_done()

def main():

    # You import your data to a pandas dataframe
    df = pd.DataFrame(np.random.randn(100000,4), columns=['A', 'B', 'C', 'D'])

    # You create the functions you will apply to your columns
    func1 = lambda x: x<10
    func2 = lambda x: x==0
    func3 = lambda x: x>=0
    func4 = lambda x: x<0
    func_rep = [func1, func2, func3, func4]

    for x in range(N_THREAD):  # You create your threads
        # daemon=True lets the script exit once main() returns,
        # even though threader() loops forever
        t = Thread(target=threader, daemon=True)
        t.start()

    # Now is the tricky part: you enclose the arguments that
    # will be passed to the function in a tuple which you
    # put into a queue. Then you start the job by "joining"
    # the queue
    t0 = time()
    for i, func in enumerate(func_rep):
        worker = (df.iloc[:, i], func)
        q.put(worker)

    q.join()
    print("Entire job took: {:.3} s.".format(time() - t0))

if __name__ == '__main__':
    main()

3 Comments

In the standard Python implementation, threads do not really improve performance on CPU-bound tasks. There is a "Global Interpreter Lock" that enforces that only one thread at a time can be executing Python bytecode. This was done to keep the complexity of memory management down.
Thank you for this explanation, I did not know this fact!
In cases where functions block, such as on I/O (not so likely with a pandas apply function, but still plausible), using threads in Python can still lead to very effective speedups.
