
I am running a Jupyter notebook on a Spark cluster (with YARN). I use the "findspark" package to set up the notebook, and it works fine (I connect to the cluster master through an SSH tunnel). When I write a "self-contained" notebook, everything works; e.g. the following code runs with no problem:

import findspark
findspark.init()

import pyspark

sc = pyspark.SparkContext(appName='myApp')
a = sc.range(1000,numSlices=10)
a.take(10)
sc.stop()

The Spark job is correctly distributed across the workers. However, when I want to use a Python package that I wrote, its files are missing on the workers.

When I don't use the Jupyter notebook and instead run spark-submit --master yarn --py-files myPackageSrcFiles.zip, the Spark job works fine; e.g. the following code runs correctly:

main.py

import pyspark
from myPackage import myFunc

sc = pyspark.SparkContext(appName='myApp')
a = sc.range(1000,numSlices=10)
b = a.map(lambda x: myFunc(x)) 
b.take(10)
sc.stop()

Then

spark-submit --master yarn --py-files myPackageSrcFiles.zip main.py

The question is: how do I run main.py from a Jupyter notebook? I tried specifying the .zip package in the SparkContext with the pyfiles keyword, but I got an error...

1 Answer


I tried specifying the .zip package in the SparkContext with the pyfiles keyword but I got an error

The keyword argument is camel case (pyFiles):

sc = pyspark.SparkContext(appName='myApp', pyFiles=["myPackageSrcFiles.zip"])

Or you can use addPyFile:

sc.addPyFile("myPackageSrcFiles.zip")
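Putting the two together, a minimal notebook cell might look like the sketch below (untested here; it reuses myPackage, myFunc and myPackageSrcFiles.zip from the question and assumes the zip sits in the notebook's working directory):

import findspark
findspark.init()

import pyspark

# ship the package zip to the executors (addPyFile is the equivalent after creation)
sc = pyspark.SparkContext(appName='myApp', pyFiles=['myPackageSrcFiles.zip'])
# sc.addPyFile('myPackageSrcFiles.zip')

from myPackage import myFunc  # the zip is also added to the driver's sys.path

a = sc.range(1000, numSlices=10)
b = a.map(lambda x: myFunc(x))
b.take(10)
sc.stop()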

1 Comment

Is it possible to do this with spark session builder?
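For the comment above, a hedged sketch of the SparkSession route (untested; the spark.submit.pyFiles config key may only be honored when set before the context is launched, so calling addPyFile on the session's underlying SparkContext is the more reliable option):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('myApp')
         .config('spark.submit.pyFiles', 'myPackageSrcFiles.zip')  # may not take effect for an in-process context
         .getOrCreate())

# reliable fallback once the session exists
spark.sparkContext.addPyFile('myPackageSrcFiles.zip')

from myPackage import myFunc

rdd = spark.sparkContext.range(1000, numSlices=10).map(lambda x: myFunc(x))
rdd.take(10)
spark.stop()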
