Using collect is dangerous for this very reason: it's prone to Out Of Memory errors, so I suggest removing it. You also do not need an RDD for this; you can do it with a DataFrame:
from pyspark.sql.functions import explode

result = df.select(explode(df['student_age']).alias('student_age'))  # returns a DataFrame, one age per row
# Work with this DataFrame directly instead of collecting it into a Python array.
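For context, here is a minimal, self-contained sketch of that approach; the schema and sample values are assumptions since your real data isn't shown, and the aggregation is just one example of keeping the work on the executors:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, avg, count

spark = SparkSession.builder.getOrCreate()

# Illustrative data mirroring the assumed schema: one array of ages per row.
df = spark.createDataFrame(
    [("class_1", [13, 14, 15]), ("class_2", [16, 17])],
    ["class_id", "student_age"],
)

result = df.select(explode(df["student_age"]).alias("student_age"))

# Aggregations run on the executors; only the small summary comes back to the driver.
result.agg(
    avg("student_age").alias("avg_age"),
    count("student_age").alias("n_students"),
).show()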
If nothing else changed, the data likely did, and it has finally outgrown what fits in the driver's memory.
It's also possible that you have new 'bad' data that is throwing an error.
Either way, you can likely confirm the OOM, or prove the data is bad, by printing the rows on the executors instead of collecting them:
def f(row):
    print(row.student_age)  # in cluster mode this output lands in the executor logs

result.foreach(f)  # good for simple per-row work that doesn't require heavy initialization
If that works, you may want to break your code down to use foreachPartition. That lets you do the math on each value in the memory of each executor, while doing any expensive setup only once per partition. The only trick is that f below runs on the executors, so you cannot reference anything that uses the SparkContext inside it: plain Python only, no PySpark.
def f(rows):
    # initialize a database connection here, once per partition
    for row in rows:
        print(row.student_age)  # do stuff with student_age
    # close the database connection here

result.foreachPartition(f)  # good for work that needs heavy per-partition initialization
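As a concrete, hedged sketch of that pattern, the running total below stands in for whatever expensive resource (a database connection, an API client) you would actually open and close once per partition:

def handle_partition(rows):
    total, n = 0, 0                    # "open the connection" once per partition
    for row in rows:
        total += row.student_age       # plain Python only; no SparkContext here
        n += 1
    if n:
        print("partition average age:", total / n)  # shows up in the executor log
    # "close the connection" here, once per partition

result.foreachPartition(handle_partition)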
See also: Spark foreachPartition vs foreach | what to use?