I have a general question about whether Spark is appropriate for a type of problem I frequently encounter in Python: performing the same task on the same set of data with different parameter settings, which I currently parallelise with the multiprocessing package.
Consider the following toy example (note that this is just one way of doing it in Python; you might well use a different approach):
import multiprocessing as mp
import pandas as pd
import numpy as np

mydf = pd.DataFrame({'a': np.random.random(100)})
output = mp.Queue()

def count_number_of_rows_above_k(indf, k, output):
    # count the rows of indf whose 'a' value exceeds the threshold k
    answer = sum(indf.a > k)
    output.put(answer)

processes = [mp.Process(target=count_number_of_rows_above_k, args=(mydf, k, output))
             for k in np.random.random(10)]
for p in processes:
    p.start()
for p in processes:
    p.join()

results = [output.get() for item in processes]
print(results)
The point is that I have a blob of data, in this case a pandas dataframe, and I apply a standard function to it with a number of different parameter values. I do this in parallel and collect the results at the end. This is what I would like to do in Spark, in the belief that I could scale more easily and benefit from the built-in fault tolerance. In real life, the function would of course be significantly more complex and the data would be much larger.
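(For what it is worth, I would normally write this pattern a bit more compactly with multiprocessing.Pool; I include a self-contained version below only to make clear what I am trying to reproduce in Spark.)

from functools import partial
import multiprocessing as mp
import numpy as np
import pandas as pd

mydf = pd.DataFrame({'a': np.random.random(100)})   # same toy data as above

def count_number_of_rows_above_k(indf, k):
    return int((indf.a > k).sum())

if __name__ == '__main__':
    ks = np.random.random(10)
    # each worker receives a pickled copy of mydf plus one value of k
    with mp.Pool() as pool:
        results = pool.map(partial(count_number_of_rows_above_k, mydf), ks)
    print(results)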
In my reading on Spark, all the examples I have seen use built-in routines on Spark dataframes: counting the number of columns, summing a column, filtering, and so on. What I want is to apply my own custom function to the data.
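For illustration, the examples I have come across look roughly like the following (sdf here is just my own toy Spark dataframe built from mydf, not code from any particular tutorial):

from pyspark.sql import SparkSession
import pandas as pd
import numpy as np

spark = SparkSession.builder.appName("toy").getOrCreate()
mydf = pd.DataFrame({'a': np.random.random(100)})   # same toy data as above
sdf = spark.createDataFrame(mydf)

len(sdf.columns)                   # count the number of columns
sdf.agg({'a': 'sum'}).show()       # sum column 'a'
sdf.filter(sdf.a > 0.5).count()    # filter rows, then count them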
Is Spark appropriate for my problem? If so, how do I implement this? Do I need to push the dataframe to all the worker nodes beforehand?
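To make the question concrete, here is roughly what I imagine the Spark version might look like, parallelising over the parameter values with the RDD API and broadcasting the dataframe to the workers. I genuinely do not know whether this is the intended pattern, so please correct me if it is not:

from pyspark import SparkContext
import pandas as pd
import numpy as np

sc = SparkContext(appName="parameter_sweep")
mydf = pd.DataFrame({'a': np.random.random(100)})   # same toy data as above

# ship one read-only copy of the data to every worker
bc_df = sc.broadcast(mydf)

def count_number_of_rows_above_k(k):
    # runs on the workers against the broadcast copy of the data
    return int((bc_df.value.a > k).sum())

ks = np.random.random(10)
results = sc.parallelize(list(ks)).map(count_number_of_rows_above_k).collect()
print(results)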
I am just asking for a few pointers. There must be documentation on this out there that I haven't found yet. Thanks.