I want to read data from PostgreSQL over JDBC and store it in a PySpark DataFrame. When I try to preview the data with methods like df.show() or df.take(), they fail with an error saying Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver, but df.printSchema() returns the DB table's schema just fine. Here is my code:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("spark://spark-master:7077")
    .appName("read-postgres-jdbc")
    .config("spark.driver.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar")
    .config("spark.executor.memory", "1g")
    .getOrCreate()
)
sc = spark.sparkContext

df = (
    spark.read.format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", "jdbc:postgresql://postgres/postgres")
    .option("table", 'public."ASSET_DATA"')
    .option("dbtable", _select_sql)
    .option("user", "airflow")
    .option("password", "airflow")
    .load()
)

df.show(1)

Error log:

Py4JJavaError: An error occurred while calling o44.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.21.0.6, executor 1): java.lang.ClassNotFoundException: org.postgresql.Driver

Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver

Edited 7/24/2021: The script was executed in JupyterLab, in a separate Docker container from the standalone Spark cluster.

5 Answers


You are not using the proper option. If you read the doc, you will see this:

Extra classpath entries to prepend to the classpath of the driver. Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.

This option is for the driver only. That is why retrieving the schema works: it is an action performed on the driver side. But when you run a Spark command, that command is executed by the workers (or executors), and they also need the .jar to access Postgres.

If your Postgres driver ("/opt/workspace/postgresql-42.2.18.jar") does not need any dependencies, then you can add it to the workers using spark.jars. I know MySQL does not require dependencies, for example, but I never tried Postgres. If it does need dependencies, then it is better to pull the package directly from Maven using the spark.jars.packages option (see the doc for help).
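
For example, here is a minimal sketch of the spark.jars.packages approach, reusing the master URL and app name from the question; the Maven coordinates org.postgresql:postgresql:42.2.18 are assumed to match the jar version mentioned above:

from pyspark.sql import SparkSession

# Sketch: pull the Postgres JDBC driver from Maven so that both the driver
# and the executors get it on their classpath.
spark = (
    SparkSession.builder.master("spark://spark-master:7077")
    .appName("read-postgres-jdbc")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.18")
    .getOrCreate()
)

# Alternatively, if the local jar has no extra dependencies,
# ship it to the workers directly:
# .config("spark.jars", "/opt/workspace/postgresql-42.2.18.jar")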


I attempted various methods, but unfortunately none of them helped; I kept encountering the same error. I then tried the solution below. The Postgres package is downloaded automatically into the environment and made accessible.

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
    .appName('learn.com') \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0") \
    .getOrCreate()

To read the database table from Postgres, use the command below:

     jdbcDF = spark.read.format("jdbc") \
         .options(url='jdbc:postgresql://localhost:5432/postgres',  # jdbc:postgresql://<host>:<port>/<database>
                  dbtable='company',
                  user='postgres',
                  password='admin',
                  driver='org.postgresql.Driver') \
         .load()


You can also try adding:

.config("spark.executor.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar"

So that the jar is included for your executors as well.
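
As a sketch under the question's setup, the builder would then look like this. Note that extraClassPath does not distribute the file, so the jar is assumed to already exist at the same path on every worker node:

from pyspark.sql import SparkSession

# Sketch: both the driver and the executors point at the same local jar path.
# The file is NOT shipped automatically; it must be present at this path on
# each worker node as well.
spark = (
    SparkSession.builder.master("spark://spark-master:7077")
    .appName("read-postgres-jdbc")
    .config("spark.driver.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar")
    .config("spark.executor.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar")
    .getOrCreate()
)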

1 Comment

This worked for me! I think it's important to say that the value of this argument has to be the exact path to the .jar. I was getting that error because I just copied the path from Jupyter without checking the location with the pwd command.

Scenario: connecting JupyterLab to a local/server host (with PostgreSQL installed) and writing JSON to a PostgreSQL DB.

  1. No need to add .config("spark.jars", "postgresql-42.7.0.jar") while creating the SparkSession, and we also can't set spark.conf.set("spark.jars", "postgresql-42.7.0.jar") at runtime.
  2. Just use the command below (and at the end of df.write do not forget to add .save(), otherwise nothing will be saved to the DB).
  3. Do not use a regular write statement like df.write.jdbc(url="url", table="table_name", mode="append", properties="properties"); you will get a URL parsing error.
  4. Download the jar (latest/required version) from https://jdbc.postgresql.org/download/postgresql-42.7.0.jar
  5. Add the downloaded jar file to Spark\spark-3.2.0-bin-hadoop3.2\jars
  6. Restart Jupyter/the server and it should work.

df.write.format("jdbc").mode("append") \
        .option("driver","org.postgresql.Driver") \
        .option("url","jdbc:postgresql://localhost:5432/postgres") \
        .option("dbtable","TABLENAME") \
        .option("user","postgres") \
        .option("password","PASSWORD") \
        .save()

It worked for me.


If you are using the spark-submit command to run your Spark job, do not forget to add the two parameters --driver-class-path and --jars.

Example: spark-submit --driver-class-path /path/toPostgresJar/postgresql-42.6.1.jar --jars postgresql-42.6.1.jar --master spark://localhost:7077 yourSparkJob.py
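
If you prefer not to manage the jar file yourself, spark-submit can also resolve the driver from Maven with --packages. A sketch, assuming the same 42.6.1 version and script name as the example above:

spark-submit --packages org.postgresql:postgresql:42.6.1 --master spark://localhost:7077 yourSparkJob.py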
