
I am getting the below error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.

on this line:

result = df.select('student_age').rdd.flatMap(lambda x: x).collect()

'student_age' is a column name. It was running fine until last week, but now I get this error.

Does anyone have any insights on that?

5 Comments
  • Could you share more about the error log and your code? Commented Aug 6, 2022 at 10:55
  • Is this the full stack trace? Also, "running fine until last week" -- have you updated anything recently, like the Java version? Also, where are you running this? Commented Aug 7, 2022 at 6:43
  • No, nothing has changed. Commented Aug 7, 2022 at 22:36
  • All logs: ctxt.io/2/AADge2-UFg Commented Aug 8, 2022 at 0:39
  • @Slickmind could you show the code of your data_percentage and count_percentage functions? Commented Aug 8, 2022 at 4:40

2 Answers


Using collect is dangerous for this very reason: it is prone to Out Of Memory (OOM) errors. I suggest removing it. You also do not need an RDD for this; you can do it with a DataFrame:

from pyspark.sql.functions import explode

result = df.select(explode(df['student_age']))  # returns a DataFrame
# Write your downstream code against the DataFrame instead of a Python list.
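For illustration, here is a minimal sketch of staying entirely in the DataFrame API; the demo data and the percentage aggregation are assumptions standing in for your real logic:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# Hypothetical demo data standing in for your real df.
df = spark.createDataFrame([(15,), (16,), (16,), (17,)], ['student_age'])

total = df.count()  # a plain Python int on the driver
# Aggregate inside Spark instead of collecting values into a Python list.
age_counts = (df.groupBy('student_age')
                .count()
                .withColumn('percentage', F.col('count') * 100.0 / total))
age_counts.show()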

If nothing else changed, the data likely did, and it finally outgrew the memory available to the driver.

It's also possible that you have new 'bad' data that is throwing an error.

Either way, you can likely prove this by finding the OOM in the logs, or prove the data is bad by printing it:

def f(row):
    print(row.student_age)

result.foreach(f) # used for simple stuff that doesn't require heavy initialization.

If that works, you may want to break your code down to use foreachPartition. This lets you do math on each value in the memory of each executor. The only trick is that inside f below, because the code runs on the executors, you cannot reference anything that uses the SparkContext (plain Python only, no PySpark):

def f(rows):
    # initialize a database connection here
    for row in rows:
        print(row.student_age)  # do stuff with student_age
    # close database connection here

result.foreachPartition(f) # used for things that need heavy initialization
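As a rough sketch (not your actual calculation), the per-partition function can do its math in plain Python; note that print output from executors shows up in the executor logs unless you are running in local mode:

def summarize_partition(rows):
    # Plain Python only in here -- no SparkSession or SparkContext references.
    ages = [row.student_age for row in rows]  # assumes the column is named student_age
    if ages:
        print(f"partition size={len(ages)}, min={min(ages)}, max={max(ages)}")

df.select('student_age').foreachPartition(summarize_partition)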

See also: Spark foreachPartition vs foreach | what to use?


1 Comment

Thanks Matt! Let me try it!

This issue is solved; here is the answer:

result = [i[0] for i in df.select('student_age').toLocalIterator()]
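For context, a small self-contained sketch of this approach next to the DataFrame-native alternative suggested in the comments below (the demo data is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(15,), (16,), (16,), (17,)], ['student_age'])

# toLocalIterator() streams one partition at a time to the driver,
# so peak driver memory stays lower than with collect().
result = [row[0] for row in df.select('student_age').toLocalIterator()]

# Staying in Spark until the last step: distinct() for unique values.
unique_ages = [row[0] for row in df.select('student_age').distinct().collect()]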

3 Comments

This may work for your use case, as it does decrease memory pressure by using an iterator instead of collect(). But you are not doing PySpark at this point; you are doing Python coding. That is to say, you are using a small-data approach instead of a big-data approach. I'd encourage you to think bigger and use DataFrames over Python arrays; it will scale better.
I am getting this error: File "C:\Users\Jarvis\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\sql\column.py", line 470, in __iter__: raise TypeError("Column is not iterable") -- from this code: result = df.select(explode(df['student_age'])); unique_result = list(set([j for i in result for j in xs]))
You are still using Python instead of PySpark. To make the column unique, just call distinct() on it. You might benefit from studying PySpark more. Clearly you know Python.
