18

I have installed PySpark with Python 3.6 and I am using a Jupyter notebook to initialize a Spark session.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()

which runs without any errors.

But when I write

df = spark.range(10)
df.show()

it throws this error:

Py4JError: An error occurred while calling o54.showString. Trace:
py4j.Py4JException: Method showString([class java.lang.Integer, class java.lang.Integer, class java.lang.Boolean]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)

I don't know why I am facing this issue.

If I do,

from pyspark import SparkContext
sc = SparkContext()
print(sc.version)

'2.1.0'

11 Answers

17

I am happy now, because I have been having exactly the same issue with my pyspark and I found "the solution". In my case, I am running on Windows 10. After many searches via Google, I found the correct way of setting the required environment variable:

PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip

The version of the Py4J source package changes between Spark versions, so check what ships with your Spark and replace the placeholder accordingly. For a complete reference to the process, look at this site: how to install spark locally
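If you prefer to set this up from inside the notebook rather than in the Windows settings dialog, here is a minimal sketch. The install path is a hypothetical example, and the py4j zip is located with a glob so its version does not have to be hard-coded:

import glob
import os
import sys

# Hypothetical install location; point this at your own unpacked Spark folder.
os.environ["SPARK_HOME"] = r"C:\spark\spark-2.1.0-bin-hadoop2.7"

spark_python = os.path.join(os.environ["SPARK_HOME"], "python")
# Find the py4j source zip bundled with this particular Spark build.
py4j_src = glob.glob(os.path.join(spark_python, "lib", "py4j-*-src.zip"))[0]

# Equivalent of the PYTHONPATH setting above, applied to the current process.
sys.path[:0] = [spark_python, py4j_src]

import pyspark  # now resolves against %SPARK_HOME%\python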


10

For me

import findspark
findspark.init()

import pyspark

solved the problem
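As a quick sanity check (my own sketch, not part of the original answer), you can also ask findspark which Spark installation it picked up and compare that with the pyspark package version; findspark.find() returns the resolved Spark home in recent findspark releases:

import findspark

findspark.init()            # resolves Spark via SPARK_HOME and patches sys.path
print(findspark.find())     # the Spark home that init() located

import pyspark
print(pyspark.__version__)  # should correspond to the Spark build found above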


4

If you are using pyspark in Anaconda, add the code below to set SPARK_HOME before running your code:

import os
import sys

spark_path = r"spark-2.3.2-bin-hadoop2.7"  # folder where Spark is installed
os.environ['SPARK_HOME'] = spark_path

# Make the pyspark package and its bundled py4j importable.
sys.path.insert(0, spark_path + "/python")
sys.path.insert(0, spark_path + "/python/lib/pyspark.zip")
sys.path.insert(0, spark_path + "/python/lib/py4j-0.10.7-src.zip")


3

I just needed to set the SPARK_HOME environment variable to the location of spark. I added the following lines to my ~/.bashrc file.

# SPARK_HOME
export SPARK_HOME="/home/pyuser/anaconda3/lib/python3.6/site-packages/pyspark/"

Since I am using different versions of Spark in different environments, I followed this tutorial (link) to create environment variables for each conda environment.
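If you are not sure where the pip- or conda-installed pyspark package lives inside a given environment, a small sketch like the following (my addition, not from the answer) derives the same path programmatically:

import os
import pyspark

# The pip/conda package bundles its own Spark jars, so SPARK_HOME can
# point directly at the package directory, as in the ~/.bashrc line above.
spark_home = os.path.dirname(pyspark.__file__)
os.environ["SPARK_HOME"] = spark_home
print(spark_home)  # e.g. .../anaconda3/lib/python3.6/site-packages/pyspark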


3

I had a similar Constructor [...] does not exist problem. Then I found that the version of the PySpark package was not the same as the Spark version (2.4.4) installed on the server. Finally, I solved the problem by reinstalling PySpark with the matching version:

pip install pyspark==2.4.4
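To confirm whether you are in this situation, a small check (a sketch using ordinary PySpark attributes) compares the Python package version with the version reported by the JVM side:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version-check").getOrCreate()

print("pyspark package:", pyspark.__version__)  # Python side
print("Spark (JVM):    ", spark.version)        # JVM side
# "Method ... does not exist" errors such as showString typically mean these two differ.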

1 Comment

I had the same issue and this worked for me. Oddly enough, it worked with different versions of Spark and PySpark, but after a restart of JupyterLab, it stopped working, until I ensured that PySpark had the same version as Spark.
1

Here are the steps and the combination of tools that worked for me using Jupyter:

1) Install Java 1.8

2) Set the JAVA_HOME environment variable and add Java to PATH, e.g. JAVA_HOME = C:\Program Files\Java\javasdk_1.8.241

3) Install PySpark 2.7 using conda install (3.0 did not work for me; it gave an error asking me to match the PySpark and Spark versions). Search for the conda install command for PySpark 2.7.

4) Install Spark 2.4 (3.0 did not work for me)

5) Set SPARK_HOME in Environment Variable to the Spark download folder, e.g. SPARK_HOME = C:\Users\Spark

6) Set HADOOP_HOME in Environment Variable to the Spark download folder, e.g. HADOOP_HOME = C:\Users\Spark

7) Download winutils.exe and place it inside the bin folder of the Spark download folder (after unzipping Spark.tgz).

8) Install findspark with conda (search for it on the Anaconda.org website) and use it in the Jupyter notebook. (This was one of the most important steps for avoiding errors.)

9) Restart computer to make sure Environment Variables are applied

10) You can validate that the environment variables are applied by typing the following in a Windows command prompt:

C:\> echo %SPARK_HOME% 

This should echo back the value you added in the Windows 10 Advanced Settings environment-variable dialog. A quick smoke test of the whole setup is sketched below.
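The smoke test (my own sketch, mirroring the code from the question) can then be run in a fresh notebook:

import findspark
findspark.init()          # picks up the SPARK_HOME set in step 5

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke-test").getOrCreate()
spark.range(10).show()    # the call that originally raised the Py4JError
spark.stop()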


1
# Point PYTHONPATH at Spark's python sources and the bundled py4j
# (replace <version> with the py4j version that ships with your Spark).
%env PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip;%PYTHONPATH%

!pip install findspark
!pip install pyspark==2.4.4

import findspark
findspark.init()  # must run before importing pyspark

import pyspark
from pyspark import SparkConf, SparkContext
sc = pyspark.SparkContext.getOrCreate()

You have to add the paths and the necessary libraries for Apache Spark before importing pyspark.

1

Try changing the PySpark version. That worked for me: I was using 3.2.1 and getting this error; after switching to 3.2.2 it worked perfectly fine.

0

I think spark.range is supposed to return an RDD object. Therefore, show is not a method you can use. Please use collect or take instead.

You can also replace spark.range with sc.range if you want to use show.

1 Comment

I am still facing the error. I also printed the type of "df" and it shows a DataFrame.
0
import findspark
findspark.init("/path/to/spark")  # path to your Spark (or Hadoop) installation
from pyspark import SparkContext

You need to call findspark.init() first; only then can you import pyspark.


0

I had the same error when using PyCharm and executing code in the Python Console on Windows 10; however, I was able to run the same code without error when launching pyspark from the terminal. After trying solutions from many searches, the fix for the PyCharm Python Console error was a combination of all of the environment variables (I set them up for both User and System) and the PyCharm settings steps in the following two blog posts: setup pyspark locally and spark & pycharm.

