18

I have installed PySpark with Python 3.6 and I am using a Jupyter notebook to initialize a Spark session.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()

which runs without any errors.

But when I write

df = spark.range(10)
df.show()

it throws this error:

Py4JError: An error occurred while calling o54.showString. Trace:
py4j.Py4JException: Method showString([class java.lang.Integer, class java.lang.Integer, class java.lang.Boolean]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)

I don't know why I am facing this issue.

If I do,

from pyspark import SparkContext
sc = SparkContext()
print(sc.version)

'2.1.0'

11 Answers

17

I am happy now, because I have been having exactly the same issue with my pyspark and I found "the solution". In my case, I am running on Windows 10. After many searches via Google, I found the correct way of setting the required environment variable:

PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip

The version of the Py4J source package changes between Spark versions, so check what ships with your Spark and replace the placeholder accordingly. For a complete reference to the process, look at this site: how to install spark locally
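If you prefer to set this up from inside the notebook rather than in the Windows settings dialog, here is a minimal sketch. The install path is a hypothetical example, and the py4j zip is located with a glob so its version does not have to be hard-coded:

import glob
import os
import sys

# Hypothetical install location; point this at your own unpacked Spark folder.
os.environ["SPARK_HOME"] = r"C:\spark\spark-2.1.0-bin-hadoop2.7"

spark_python = os.path.join(os.environ["SPARK_HOME"], "python")
# Find the py4j source zip bundled with this particular Spark build.
py4j_src = glob.glob(os.path.join(spark_python, "lib", "py4j-*-src.zip"))[0]

# Equivalent of the PYTHONPATH setting above, applied to the current process.
sys.path[:0] = [spark_python, py4j_src]

import pyspark  # now resolves against %SPARK_HOME%\python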


10

For me

import findspark
findspark.init()

import pyspark

solved the problem
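As a quick sanity check (my own sketch, not part of the original answer), you can also ask findspark which Spark installation it picked up and compare that with the pyspark package version; findspark.find() returns the resolved Spark home in recent findspark releases:

import findspark

findspark.init()            # resolves Spark via SPARK_HOME and patches sys.path
print(findspark.find())     # the Spark home that init() located

import pyspark
print(pyspark.__version__)  # should correspond to the Spark build found above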


4

If you are using pyspark in Anaconda, add the code below to set SPARK_HOME before running your code:

import os
import sys

spark_path = r"spark-2.3.2-bin-hadoop2.7"  # folder where Spark is installed
os.environ['SPARK_HOME'] = spark_path

# Make the pyspark package and its bundled py4j importable.
sys.path.insert(0, spark_path + "/python")
sys.path.insert(0, spark_path + "/python/lib/pyspark.zip")
sys.path.insert(0, spark_path + "/python/lib/py4j-0.10.7-src.zip")


3

I just needed to set the SPARK_HOME environment variable to the location of spark. I added the following lines to my ~/.bashrc file.

# SPARK_HOME
export SPARK_HOME="/home/pyuser/anaconda3/lib/python3.6/site-packages/pyspark/"

Since I am using different versions of Spark in different environments, I followed this tutorial (link) to create environment variables for each conda environment.
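If you are not sure where the pip- or conda-installed pyspark package lives inside a given environment, a small sketch like the following (my addition, not from the answer) derives the same path programmatically:

import os
import pyspark

# The pip/conda package bundles its own Spark jars, so SPARK_HOME can
# point directly at the package directory, as in the ~/.bashrc line above.
spark_home = os.path.dirname(pyspark.__file__)
os.environ["SPARK_HOME"] = spark_home
print(spark_home)  # e.g. .../anaconda3/lib/python3.6/site-packages/pyspark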


3

I had a similar Constructor [...] does not exist problem. Then I found that the version of the PySpark package was not the same as the Spark version (2.4.4) installed on the server. Finally, I solved the problem by reinstalling PySpark with the matching version:

pip install pyspark==2.4.4
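To confirm whether you are in this situation, a small check (a sketch using ordinary PySpark attributes) compares the Python package version with the version reported by the JVM side:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version-check").getOrCreate()

print("pyspark package:", pyspark.__version__)  # Python side
print("Spark (JVM):    ", spark.version)        # JVM side
# "Method ... does not exist" errors such as showString typically mean these two differ.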

1 Comment

I had the same issue and this worked for me. Oddly enough, it worked with different versions of Spark and PySpark, but after a restart of JupyterLab, it stopped working, until I ensured that PySpark had the same version as Spark.
1

Here are the steps and the combination of tools that worked for me using Jupyter:

1) Install Java 1.8

2) Set the JAVA_HOME environment variable and add Java to PATH, e.g. JAVA_HOME = C:\Program Files\Java\javasdk_1.8.241

3) Install PySpark 2.7 using conda install (3.0 did not work for me; it gave an error asking me to match the PySpark and Spark versions). Search for the conda install command for PySpark 2.7.

4) Install Spark 2.4 (3.0 did not work for me)

5) Set SPARK_HOME in Environment Variable to the Spark download folder, e.g. SPARK_HOME = C:\Users\Spark

6) Set HADOOP_HOME in Environment Variable to the Spark download folder, e.g. HADOOP_HOME = C:\Users\Spark

7) Download winutils.exe and place it inside the bin folder of the Spark download folder (after unzipping Spark.tgz).

8) Install findspark with conda (search for it on the Anaconda.org website) and use it in the Jupyter notebook. (This was one of the most important steps for avoiding errors.)

9) Restart computer to make sure Environment Variables are applied

10) You can validate that the environment variables are applied by typing the following in a Windows command prompt:

C:\> echo %SPARK_HOME% 

This should echo back the value you added in the Windows 10 Advanced Settings environment-variable dialog. A quick smoke test of the whole setup is sketched below.
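The smoke test (my own sketch, mirroring the code from the question) can then be run in a fresh notebook:

import findspark
findspark.init()          # picks up the SPARK_HOME set in step 5

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke-test").getOrCreate()
spark.range(10).show()    # the call that originally raised the Py4JError
spark.stop()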


1
# Point PYTHONPATH at Spark's python sources and the bundled py4j
# (replace <version> with the py4j version that ships with your Spark).
%env PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip;%PYTHONPATH%

!pip install findspark
!pip install pyspark==2.4.4

import findspark
findspark.init()  # must run before importing pyspark

import pyspark
from pyspark import SparkConf, SparkContext
sc = pyspark.SparkContext.getOrCreate()

You have to add the paths and the necessary libraries for Apache Spark before importing pyspark.

1

Try changing the PySpark version. That worked for me: I was using 3.2.1 and getting this error; after switching to 3.2.2 it worked perfectly fine.

0

I think spark.range is supposed to return an RDD object. Therefore, show is not a method you can use. Please use collect or take instead.

You can also replace spark.range with sc.range if you want to use show.

1 Comment

I am still facing the error. I also printed the type of "df" and it shows a DataFrame.
0
import findspark
findspark.init("/path/to/spark")  # path to your Spark (or Hadoop) installation
from pyspark import SparkContext

You need to call findspark.init() first; only then can you import pyspark.


0

I had the same error when using PyCharm and executing code in the Python Console on Windows 10; however, I was able to run the same code without error when launching pyspark from the terminal. After trying solutions from many searches, the fix for the PyCharm Python Console error was a combination of all of the environment variables (I set them up for both User and System) and the PyCharm settings steps in the following two blog posts: setup pyspark locally and spark & pycharm.

