
I've read previous posts on this, but I still cannot pinpoint why I am unable to connect my ipython notebook to a Postgres db.

I am able to launch pyspark in an ipython notebook, and SparkContext is loaded as 'sc'.

I have the following in my .bash_profile for finding the Postgres driver:

export SPARK_CLASSPATH=/path/to/downloaded/jar

Here's what I am doing in the ipython notebook to connect to the db (based on this post):

from pyspark.sql import SQLContext, DataFrameReader as dfr

sqlContext = SQLContext(sc)  # sc is the SparkContext the notebook already provides

table = 'some query'
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}

df = dfr(sqlContext).jdbc(
    url='jdbc:%s' % url, table=table, properties=properties
)

The error:

Py4JJavaError: An error occurred while calling o156.jdbc.
: java.sql.SQLException: No suitable driver

I understand it's an error with finding the driver I've downloaded, but I don't understand why I am getting this error when I've added the path to it in my .bash_profile.

I also tried to set the driver via pyspark --jars, but I get a "no such file or directory" error.

This blogpost also shows how to connect to Postgres data sources, but the following also gives me a "no such directory" error:

 ./bin/spark-shell --packages org.postgresql:postgresql:42.1.4

Additional info:

spark version: 2.2.0
python version: 3.6
java: 1.8.0_25
postgres driver: 42.1.4

3 Answers


I am not sure why the above answer did not work for me, but I thought I would also share what actually worked for me when running pyspark from a jupyter notebook (Spark 2.3.1, Python 3.6.3):

from pyspark.sql import SparkSession

# Point the driver's classpath at the Postgres JDBC jar when the session is built
spark = SparkSession.builder \
    .config('spark.driver.extraClassPath', '/path/to/postgresql.jar') \
    .getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
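
If the driver is picked up, the returned DataFrame behaves like any other; a quick way to confirm the connection actually works (a small sketch, not part of the original answer) is:

df.printSchema()  # lists the Postgres table's columns if the JDBC driver was found
df.show(5)        # pulls a few rows from the database over JDBC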



They've changed how this works several times in Apache Spark. Looking at my setup, this is what I have in my .bashrc (aka .bash_profile on Mac), so you could try it:

export SPARK_CLASSPATH=$SPARK_CLASSPATH:/absolute/path/to/your/driver.jar

Edit: I'm using Spark 1.6.1.

And, as always, make sure you use a new shell or source the script so you have the updated envvar (verify with echo $SPARK_CLASSPATH in your shell before you run ipython notebook).
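
If you want to check from inside the notebook (rather than the shell) whether the variable actually reached the kernel's environment, a quick sketch:

import os
# Should print the jar path exported in .bash_profile; None means the notebook
# was launched from a shell that never sourced the updated profile.
print(os.environ.get('SPARK_CLASSPATH'))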

4 Comments

What do you mean by absolute path? I got the real path of the driver and used that. I changed to how it's shown above and have the same problem.
It would appear that spark classpath is deprecated: github.com/elastic/elasticsearch-hadoop/pull/580
I've connected. I'll post my solution for the sake of documentation once I have some time, but basically I think it's because SPARK_CLASSPATH is deprecated, so you have to use --driver-class-path.
@cocanut By absolute path, I mean not using any ~ shortcuts. I think you did it right. Yeah, this is one of my gripes with Spark: there are many deprecated ways to do everything. SPARK_CLASSPATH works for me despite being deprecated, but I'm on 1.6.1.

I followed directions in this post. SparkContext is already set as sc for me, so all I had to do was remove the SPARK_CLASSPATH setting from my .bash_profile, and use the following in my ipython notebook:

import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql-42.1.4.jar --jars /path/to/postgresql-42.1.4.jar pyspark-shell'

I added a 'driver' setting to properties as well, and it worked. As stated elsewhere in this post, this is likely because SPARK_CLASSPATH is deprecated, and it is preferable to use --driver-class-path.
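
For reference, a minimal sketch of how those pieces fit together in one notebook, assuming the environment variable is in place before the SparkContext is created; the jar path, database name, table name, and credentials are placeholders, and 'org.postgresql.Driver' is the class name shipped in the Postgres JDBC jar:

import os

# Point spark-submit at the Postgres JDBC jar; this must be in place before the
# JVM backing the SparkContext starts.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-class-path /path/to/postgresql-42.1.4.jar '
    '--jars /path/to/postgresql-42.1.4.jar pyspark-shell'
)

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is the SparkContext the notebook provides

url = 'jdbc:postgresql://localhost:5432/dbname'
properties = {
    'user': 'username',
    'password': 'password',
    'driver': 'org.postgresql.Driver',  # the explicit driver setting mentioned above
}
df = sqlContext.read.jdbc(url=url, table='tablename', properties=properties)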

