I have a general question about whether Spark is appropriate for a type of problem I frequently encounter in Python: performing the same task on the same set of data with different parameter settings, which I currently parallelise with the multiprocessing package.
Consider the following toy example (note that this is just one way of doing it in Python; you might well use a different approach):
import multiprocessing as mp
import pandas as pd
import numpy as np

mydf = pd.DataFrame({'a': np.random.random(100)})
output = mp.Queue()

def count_number_of_rows_above_k(indf, k, output):
    # count the rows of indf whose 'a' value exceeds the threshold k
    answer = sum(indf.a > k)
    output.put(answer)

processes = [mp.Process(target=count_number_of_rows_above_k, args=(mydf, k, output))
             for k in np.random.random(10)]
for p in processes:
    p.start()
for p in processes:
    p.join()

results = [output.get() for item in processes]
print(results)
The point is that I have a blob of data, in this case a pandas dataframe, and I apply a standard function to it with a number of different parameter values. I do this in parallel and collect the results at the end. This is what I would like to do in Spark, in the belief that I could scale more easily and benefit from the built-in fault tolerance. In real life, the function would of course be significantly more complex and the data would be much larger.
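(For what it is worth, I would normally write this pattern a bit more compactly with multiprocessing.Pool; I include a self-contained version below only to make clear what I am trying to reproduce in Spark.)

from functools import partial
import multiprocessing as mp
import numpy as np
import pandas as pd

mydf = pd.DataFrame({'a': np.random.random(100)})   # same toy data as above

def count_number_of_rows_above_k(indf, k):
    return int((indf.a > k).sum())

if __name__ == '__main__':
    ks = np.random.random(10)
    # each worker receives a pickled copy of mydf plus one value of k
    with mp.Pool() as pool:
        results = pool.map(partial(count_number_of_rows_above_k, mydf), ks)
    print(results)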
In my reading on Spark, all the examples I have seen use built-in routines on Spark dataframes: counting the number of columns, summing a column, filtering, and so on. What I want is to apply my own custom function to the data.
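For illustration, the examples I have come across look roughly like the following (sdf here is just my own toy Spark dataframe built from mydf, not code from any particular tutorial):

from pyspark.sql import SparkSession
import pandas as pd
import numpy as np

spark = SparkSession.builder.appName("toy").getOrCreate()
mydf = pd.DataFrame({'a': np.random.random(100)})   # same toy data as above
sdf = spark.createDataFrame(mydf)

len(sdf.columns)                   # count the number of columns
sdf.agg({'a': 'sum'}).show()       # sum column 'a'
sdf.filter(sdf.a > 0.5).count()    # filter rows, then count them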
Is Spark appropriate for my problem? If so, how do I implement this? Do I need to push the dataframe to all the worker nodes beforehand?
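To make the question concrete, here is roughly what I imagine the Spark version might look like, parallelising over the parameter values with the RDD API and broadcasting the dataframe to the workers. I genuinely do not know whether this is the intended pattern, so please correct me if it is not:

from pyspark import SparkContext
import pandas as pd
import numpy as np

sc = SparkContext(appName="parameter_sweep")
mydf = pd.DataFrame({'a': np.random.random(100)})   # same toy data as above

# ship one read-only copy of the data to every worker
bc_df = sc.broadcast(mydf)

def count_number_of_rows_above_k(k):
    # runs on the workers against the broadcast copy of the data
    return int((bc_df.value.a > k).sum())

ks = np.random.random(10)
results = sc.parallelize(list(ks)).map(count_number_of_rows_above_k).collect()
print(results)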
I am just asking for a few pointers. There must be documentation on this out there that I haven't found yet. Thanks.