
I am running a Jupyter notebook public server, set up following this tutorial: http://jupyter-notebook.readthedocs.io/en/stable/public_server.html

I want to use pyspark-2.2.1 with this server. I pip-installed py4j and downloaded spark-2.2.1 from the repository.

Locally, I added the following lines to my .bashrc:

export SPARK_HOME='/home/ubuntu/spark-2.2.1-bin-hadoop2.7'  
export PATH=$SPARK_HOME:$PATH  
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH

and everything works fine when I run Python locally.

However, when using the notebook server, I cannot import pyspark, because the above commands are not executed when the Jupyter notebook starts up.
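
For illustration, a quick check from a notebook cell (roughly like this) shows that the variables set in .bashrc are not visible to the server process:

import os
print(os.environ.get('SPARK_HOME'))   # None on the notebook server
print(os.environ.get('PYTHONPATH'))   # the $SPARK_HOME/python entry is missing here too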

I partly (and inelegantly) solved the issue by typing

import sys
sys.path.append("/home/ubuntu/spark-2.2.1-bin-hadoop2.7/python")

in the first cell of my notebook. But

from pyspark import SparkContext
sc = SparkContext()
myrdd = sc.textFile('exemple.txt')
myrdd.collect()  # Everything works fine until here
words = myrdd.map(lambda x:x.split())
words.collect()

returns the error

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Cannot run program "python": error=2, No such file or directory

Any idea how I can set the correct paths (either manually or at startup)?
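
For example, I wonder whether something like the following in the first cell would be the right way to do it. This is only a sketch of what I have in mind; the PYSPARK_PYTHON line and the /usr/bin/python3 path are just guesses on my part about what the worker processes might need:

import os
os.environ['SPARK_HOME'] = '/home/ubuntu/spark-2.2.1-bin-hadoop2.7'
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'  # guess: maybe the executors need an explicit Python executable?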

Thanks
