
I have a general question about whether Spark is appropriate for a type of problem I frequently handle in Python with the multiprocessing package: performing the same task on the same set of data with different parameter settings.

Consider the following toy example (note that this is just an example of how I do it in Python; you might have taken another approach):

import multiprocessing as mp
import pandas as pd
import numpy as np

def count_number_of_rows_above_k(indf, k, output):
    # count the rows of indf whose 'a' value exceeds the threshold k
    answer = sum(indf.a > k)
    output.put(answer)

if __name__ == '__main__':
    mydf = pd.DataFrame({'a': np.random.random(100)})

    # queue used to collect one result per worker process
    output = mp.Queue()

    # one worker process per parameter value k
    processes = [mp.Process(target=count_number_of_rows_above_k, args=(mydf, k, output))
                 for k in np.random.random(10)]

    for p in processes:
        p.start()
    for p in processes:
        p.join()

    results = [output.get() for _ in processes]
    print(results)

The point is that I have a blob of data, in this case a pandas DataFrame, and I apply a standard function to it with different parameter values. I do this in parallel and collect the results at the end. This is what I would like to do in Spark, in the belief that I could scale more easily and benefit from the built-in fault tolerance. In real life, the function would of course be significantly more complex and the data would be much larger.

In my reading on Spark, all the examples I have seen feature built-in routines on Spark DataFrames, for example counting columns, summing a column, filtering, and so on. I want to apply a custom function to my data.

Is Spark appropriate for my problem? If so, how do I implement this? Do I need to push the dataframe to all the worker nodes beforehand?

I am just asking for a few pointers. There must be documentation on this out there that I haven't found yet. Thanks.

3 Answers


Koalas is a library that can do this kind of thing while maintaining a pandas-like API.
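For instance, here is a rough sketch of the question's example with that API (Koalas has since been merged into Spark itself as pyspark.pandas, which is what the import below assumes). The syntax stays pandas-like and each sum is executed as a distributed Spark job, although the loop over thresholds itself runs sequentially on the driver.

import numpy as np
import pyspark.pandas as ps  # the pandas-on-Spark API that grew out of Koalas

# the data lives in a distributed pandas-on-Spark DataFrame
psdf = ps.DataFrame({'a': np.random.random(100)})

# same computation as in the question, written as if psdf were a pandas DataFrame
results = [int((psdf.a > k).sum()) for k in np.random.random(10)]
print(results)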



Spark will parallelize for you as long as you use an RDD or a Spark DataFrame, not a pandas DataFrame; pandas will be single-threaded. Just define your function as a UDF.
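For illustration, here is a minimal sketch of that approach (untested, and the variable names are mine): the pandas DataFrame is broadcast to the executors once, the parameter values go into a small Spark DataFrame, and a Python UDF runs the custom function for each parameter row.

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

mydf = pd.DataFrame({'a': np.random.random(100)})
bc_mydf = spark.sparkContext.broadcast(mydf)  # ship the data to every executor once

# one row per parameter value; this is the axis Spark parallelizes over
params = spark.createDataFrame([(float(k),) for k in np.random.random(10)], ['k'])

@udf(returnType=LongType())
def count_rows_above_k(k):
    # the custom logic runs on the executors against the broadcast copy of the data
    return int((bc_mydf.value.a > k).sum())

results = params.withColumn('n_above_k', count_rows_above_k('k')).collect()
print(results)

Each row of params is evaluated independently on the executors, so the work over parameter values is what gets parallelized, much like the pool of processes in the question.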

1 Comment

Will my function be able to return any data type? I am specifically thinking of a row of values. So the input to the function would be a dataframe and a set of parameter values, and the output would be a row. I would then vertically concatenate all the rows into a single dataframe representing the results over all parameter combinations.
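A Spark UDF can declare a struct return type, so one way this might look, continuing the broadcast/params sketch from the answer above (untested, with hypothetical field names), is:

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

row_schema = StructType([
    StructField('n_above_k', LongType()),
    StructField('mean_above_k', DoubleType()),
])

@udf(returnType=row_schema)
def summarize_for_k(k):
    # return one "row" of values per parameter as a struct
    subset = bc_mydf.value[bc_mydf.value.a > k]
    return (int(len(subset)), float(subset.a.mean()) if len(subset) else None)

# one result row per parameter, expanded back into ordinary columns
summary = params.withColumn('stats', summarize_for_k('k')).select('k', 'stats.*')
summary.show()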

Actually, you have to understand what each tool is meant for. Spark is meant for humongous data processing and heavy lifting. On the other hand, a pandas DataFrame is for ML and DL: most ML and DL libraries take a pandas DataFrame, Series, or NumPy array as direct input, so for ML it is essential. But do you build an ML model on all the data? Ideally not. Hence for ETL-type operations a Spark DataFrame or Dataset is essential; for ML a pandas DataFrame is essential.

