
I'm populating a matrix using a conditional lookup from a file. The file is extremely large (2,500,000 records) and is stored as a dataframe ('file'). Each matrix row operation (lookup) is independent of the others. Is there any way I could parallelize this process?

I'm working in pandas and Python. My current approach is quite naive.

for r in row:
    for c in column:
        # Select the rows matching this (inventor, year) pair
        num = file[(file['Unique_Inventor_Number'] == r) & (file['AppYearStr'] == c)]['Citation'].tolist()
        # Count the distinct citations
        num = len(set(num))
        d.set_value(r, c, num)
  • Please provide sample input and desired output data sets. Please read how to make good reproducible pandas examples. Commented Feb 15, 2017 at 22:30
  • Have you done any profiling to determine which part of your code is slow? Commented Feb 15, 2017 at 22:44
  • @RolandSmith, let me guess - nested loops? ;-) Commented Feb 15, 2017 at 22:51
  • @MaxU I would have said loading the pandas dataframe from disk... Commented Feb 16, 2017 at 6:59

2 Answers


For 2.5 million records you should be able to do

res = file.groupby(['Unique_Inventor_Number', 'AppYearStr']).Citation.nunique()

The matrix should be available in

res.unstack(level=1).fillna(0).values

I'm not sure if this is the fastest, but it should be significantly faster than your implementation.
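
To see why this works, here is a minimal, self-contained sketch with made-up toy data (the values below are my own illustration, not the asker's file): groupby collects all rows for each (inventor, year) pair in a single pass, nunique counts the distinct citations per group, and unstack pivots the year level into columns.

import pandas as pd

# Toy stand-in for the asker's 'file' dataframe (hypothetical values)
file = pd.DataFrame({
    'Unique_Inventor_Number': ['I1', 'I1', 'I1', 'I2', 'I2'],
    'AppYearStr':             ['2001', '2001', '2002', '2001', '2001'],
    'Citation':               ['C1', 'C2', 'C2', 'C3', 'C3'],
})

# One pass over the data: distinct citations per (inventor, year) pair
res = file.groupby(['Unique_Inventor_Number', 'AppYearStr']).Citation.nunique()

# Pivot years into columns; pairs with no rows become 0
print(res.unstack(level=1).fillna(0))
# Roughly (dtypes may vary):
# AppYearStr              2001  2002
# Unique_Inventor_Number
# I1                       2.0   1.0
# I2                       1.0   0.0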


2 Comments

The groupby operation indeed speeds up the process. Can you please briefly explain how this works? However, res.unstack(level=1).fillna(0).values does not result in a matrix.
I think to create a matrix, just res.unstack(level=1).fillna(0) would suffice.

[EDIT] As Roland mentioned in a comment, in the standard Python implementation this post does not offer any solution to improve CPU performance.

In the standard Python implementation, threads do not really improve performance on CPU-bound tasks. There is a "Global Interpreter Lock" that enforces that only one thread at a time can be executing Python bytecode. This was done to keep the complexity of memory management down.
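
If the per-cell work really is CPU-bound, one common way around the GIL (a sketch of my own, not part of the original answer) is to use processes instead of threads, since each worker process runs its own interpreter with its own GIL. Below is a minimal sketch using the standard multiprocessing module; count_negatives is a hypothetical per-column function, and the data must be picklable to be sent to the workers.

from multiprocessing import Pool

import numpy as np
import pandas as pd

def count_negatives(col):
    # Hypothetical CPU-bound per-column computation
    return int((col < 0).sum())

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randn(100000, 4), columns=['A', 'B', 'C', 'D'])
    # Each column is handled by a separate worker process, so the
    # main interpreter's GIL is not a bottleneck (at the cost of
    # pickling each column to ship it to the workers)
    with Pool(processes=4) as pool:
        results = pool.map(count_negatives, [df[c] for c in df.columns])
    print(dict(zip(df.columns, results)))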

Have you tried using different threads for the different functions?

Let's say you split your dataframe into columns and create multiple threads. Then you assign each thread to apply a function to a column. If you have enough processing power, you might be able to save a lot of time:

from threading import Thread
import pandas as pd
import numpy as np
from queue import Queue
from time import time

# These will be used afterwards
N_THREAD = 8
q = Queue()
df2 = pd.DataFrame()  # The output of the script 

# You create the job that each thread will do
def apply(series, func):
    df2[series.name] = series.map(func)


# You define the context of the jobs
def threader():
    while True:
        worker = q.get()
        apply(*worker)
        q.task_done()

def main():

    # You import your data to a pandas dataframe
    df = pd.DataFrame(np.random.randn(100000,4), columns=['A', 'B', 'C', 'D'])

    # You create the functions you will apply to your columns
    func1 = lambda x: x<10
    func2 = lambda x: x==0
    func3 = lambda x: x>=0
    func4 = lambda x: x<0
    func_rep = [func1, func2, func3, func4]

    for x in range(N_THREAD):  # You create your threads
        # daemon=True lets the script exit once main() returns,
        # even though threader() loops forever
        t = Thread(target=threader, daemon=True)
        t.start()

    # Now is the tricky part: you enclose the arguments that
    # will be passed to the function in a tuple which you
    # put into a queue. Then you start the job by "joining"
    # the queue
    t0 = time()
    for i, func in enumerate(func_rep):
        worker = (df.iloc[:, i], func)
        q.put(worker)

    q.join()
    print("Entire job took: {:.3} s.".format(time() - t0))

if __name__ == '__main__':
    main()

3 Comments

In the standard Python implementation, threads do not really improve performance on CPU-bound tasks. There is a "Global Interpreter Lock" that enforces that only one thread at a time can be executing Python bytecode. This was done to keep the complexity of memory management down.
Thank you for this explanation, I did not know this fact!
In cases where functions block, such as on I/O (not so likely with a pandas apply function, but still plausible), using threads in Python can still lead to very effective speedups.
