
I've read previous posts on this, but I still cannot pinpoint why I am unable to connect my ipython notebook to a Postgres db.

I am able to launch pyspark in an ipython notebook, and SparkContext is loaded as 'sc'.

I have the following in my .bash_profile for finding the Postgres driver:

export SPARK_CLASSPATH=/path/to/downloaded/jar

Here's what I am doing in the ipython notebook to connect to the db (based on this post):

from pyspark.sql import SQLContext, DataFrameReader as dfr

sqlContext = SQLContext(sc)  # sc is the SparkContext the notebook already provides

table = 'some query'
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}

df = dfr(sqlContext).jdbc(
    url='jdbc:%s' % url, table=table, properties=properties
)

The error:

Py4JJavaError: An error occurred while calling o156.jdbc.
: java.sql.SQLException: No suitable driver

I understand it's an error with finding the driver I've downloaded, but I don't understand why I am getting this error when I've added the path to it in my .bash_profile.

I also tried to set the driver via pyspark --jars, but I get a "no such file or directory" error.

This blogpost also shows how to connect to Postgres data sources, but the following also gives me a "no such directory" error:

 ./bin/spark-shell --packages org.postgresql:postgresql:42.1.4

Additional info:

spark version: 2.2.0
python version: 3.6
java: 1.8.0_25
postgres driver: 42.1.4

3 Answers


I am not sure why the above answer did not work for me, but I thought I would also share what actually worked for me when running pyspark from a jupyter notebook (Spark 2.3.1, Python 3.6.3):

from pyspark.sql import SparkSession

# Point the driver's classpath at the Postgres JDBC jar when the session is built
spark = SparkSession.builder \
    .config('spark.driver.extraClassPath', '/path/to/postgresql.jar') \
    .getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
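
If the driver is picked up, the returned DataFrame behaves like any other; a quick way to confirm the connection actually works (a small sketch, not part of the original answer) is:

df.printSchema()  # lists the Postgres table's columns if the JDBC driver was found
df.show(5)        # pulls a few rows from the database over JDBC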



They've changed how this works several times in Apache Spark. Looking at my setup, this is what I have in my .bashrc (aka .bash_profile on Mac), so you could try it:

export SPARK_CLASSPATH=$SPARK_CLASSPATH:/absolute/path/to/your/driver.jar

Edit: I'm using Spark 1.6.1.

And, as always, make sure you use a new shell or source the script so you have the updated envvar (verify with echo $SPARK_CLASSPATH in your shell before you run ipython notebook).
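
If you want to check from inside the notebook (rather than the shell) whether the variable actually reached the kernel's environment, a quick sketch:

import os
# Should print the jar path exported in .bash_profile; None means the notebook
# was launched from a shell that never sourced the updated profile.
print(os.environ.get('SPARK_CLASSPATH'))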

4 Comments

What do you mean by absolute path? I got the real path of the driver and used that. I changed to how it's shown above and have the same problem.
It would appear that spark classpath is deprecated: github.com/elastic/elasticsearch-hadoop/pull/580
I've connected. I'll post my solution for the sake of documentation once I have some time, but basically I think it's because SPARK_CLASSPATH is deprecated, so you have to use --driver-class-path.
@cocanut By absolute path, I mean not using any ~ shortcuts. I think you did it right. Yeah, this is one of my gripes with Spark: there are many deprecated ways to do everything. SPARK_CLASSPATH works for me despite being deprecated, but I'm on 1.6.1.

I followed directions in this post. SparkContext is already set as sc for me, so all I had to do was remove the SPARK_CLASSPATH setting from my .bash_profile, and use the following in my ipython notebook:

import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql-42.1.4.jar --jars /path/to/postgresql-42.1.4.jar pyspark-shell'

I added a 'driver' setting to properties as well, and it worked. As stated elsewhere in this post, this is likely because SPARK_CLASSPATH is deprecated, and it is preferable to use --driver-class-path.
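
For reference, a minimal sketch of how those pieces fit together in one notebook, assuming the environment variable is in place before the SparkContext is created; the jar path, database name, table name, and credentials are placeholders, and 'org.postgresql.Driver' is the class name shipped in the Postgres JDBC jar:

import os

# Point spark-submit at the Postgres JDBC jar; this must be in place before the
# JVM backing the SparkContext starts.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-class-path /path/to/postgresql-42.1.4.jar '
    '--jars /path/to/postgresql-42.1.4.jar pyspark-shell'
)

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is the SparkContext the notebook provides

url = 'jdbc:postgresql://localhost:5432/dbname'
properties = {
    'user': 'username',
    'password': 'password',
    'driver': 'org.postgresql.Driver',  # the explicit driver setting mentioned above
}
df = sqlContext.read.jdbc(url=url, table='tablename', properties=properties)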

