
I'm trying to load streaming data from Kafka into SQL Server Big Data Clusters Data Pools. I'm using Spark 2.4.5 (Bitnami 2.4.5 spark image).

If I want to load data into regular tables, I use this statement and it works well:

logs_df.write.format('jdbc').mode('append') \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .option('url', 'jdbc:sqlserver://XXX.XXX.XXX.XXXX:31433;databaseName=sales;') \
    .option('user', user) \
    .option('password', password) \
    .option('dbtable', 'SYSLOG_TEST_TABLE') \
    .save()

But the same statement used to load data into a SQL Data Pool gives me this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o93.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 3, localhost, executor driver): java.sql.BatchUpdateException: External Data Pool Table DML statement cannot be used inside a user transaction.

I found that the way to load data into a SQL Data Pool is to use the 'com.microsoft.sqlserver.jdbc.spark' format, like this:

logs_df.write.format('com.microsoft.sqlserver.jdbc.spark').mode('append') \
    .option('url', url) \
    .option('dbtable', datapool_table) \
    .option('user', user) \
    .option('password', password) \
    .option('dataPoolDataSource', datasource_name) \
    .save()

But it's giving me this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o93.save.
: java.lang.ClassNotFoundException: Failed to find data source: com.microsoft.sqlserver.jdbc.spark. Please find packages at http://spark.apache.org/third-party-projects.html

I'm running the script with spark-submit like this:

docker exec spark245_spark_1 /opt/bitnami/spark/bin/spark-submit --driver-class-path /opt/bitnami/spark/jars/mssql-jdbc-8.2.2.jre8.jar --jars /opt/bitnami/spark/jars/mssql-jdbc-8.2.2.jre8.jar --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 /storage/scripts/some_script.py

Is there any other package I should include or some special import I'm missing?

Thanks in advance

Edit: I've tried in Scala with the same results.

3 Answers


You need to build the repository into a jar file first using SBT, then include it in your Spark cluster.

I know a lot of people will have trouble building this jar file (myself included, a few hours ago), so I will guide you through building it, step by step:

  1. Go to https://www.scala-sbt.org/download.html to download SBT, then install it.

  2. Go to https://github.com/microsoft/sql-spark-connector and download the zip file.

  3. Open the folder of the repository you just downloaded, right-click in the blank space and click "Open PowerShell window here". https://i.sstatic.net/Fq7NX.png

  4. In the shell window, type "sbt" and press Enter. It may require you to download the Java Development Kit. If so, go to https://www.oracle.com/java/technologies/javase-downloads.html to download and install it. You may need to close and reopen the shell window after installing.

If things go right, you should see this screen: https://i.sstatic.net/fMxVr.png

  5. After the above step finishes, type "package". The shell will show you something like this, and it may take a long time to finish. https://i.sstatic.net/hr2hw.png

  6. After the build is done, go to the "target" folder, then the "scala-2.11" folder, to get the jar file. https://i.sstatic.net/Aziqy.png

  7. After you have the jar file, include it in the Spark cluster.
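Once you have the jar, a minimal sketch of attaching it when building the session (the jar path below is hypothetical; substitute wherever you copied the file in your cluster):

```python
# Hypothetical path to the jar produced by `sbt package` above; adjust
# to wherever you copied it.
CONNECTOR_JAR = "/opt/bitnami/spark/jars/spark-mssql-connector_2.11-1.0.2.jar"

def build_session(jar_path=CONNECTOR_JAR):
    # Lazy import so the module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession
    # spark.jars ships the connector to the driver and executors, so the
    # 'com.microsoft.sqlserver.jdbc.spark' data source can be resolved.
    return (SparkSession.builder
            .appName("datapool-write")
            .config("spark.jars", jar_path)
            .getOrCreate())
```

Passing the same jar with `--jars` on spark-submit works as well.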

OR, if you don't want to do the troublesome procedures above....

UPDATE MAY 26, 2021: The connector is now available in Maven, so you can just go there and do the rest.

https://mvnrepository.com/artifact/com.microsoft.azure/spark-mssql-connector
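With the connector on Maven you can pull it by coordinate instead of building locally. A sketch, assuming Spark 2.4 / Scala 2.11 (check the Maven page for the version matching your Spark; the `spark-mssql-connector_2.12` artifacts target Spark 3.x):

```python
# Maven coordinate of the connector; this unsuffixed artifact is the
# Spark 2.4 / Scala 2.11 line.
CONNECTOR_COORD = "com.microsoft.azure:spark-mssql-connector:1.0.2"

def build_session(coord=CONNECTOR_COORD):
    from pyspark.sql import SparkSession  # lazy import
    # spark.jars.packages downloads the artifact (and its dependencies)
    # from Maven when the session starts.
    return (SparkSession.builder
            .appName("datapool-write")
            .config("spark.jars.packages", coord)
            .getOrCreate())
```

The command-line equivalent is adding `--packages com.microsoft.azure:spark-mssql-connector:1.0.2` to spark-submit.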

If you need more information, just comment. I will try my best to help.


1 Comment

I found version 1.2 in the Maven list; it installed fine in Databricks and solved the problem. Thanks.

According to the documentation: "To include the connector in your projects, download this repository and build the jar using SBT."

So you need to build the connector JAR file using the build.sbt in the repository, then put the JAR file into Spark's jars folder: your_path\spark\jars

To do this, download SBT here: https://www.scala-sbt.org/download.html. Open SBT in the directory where you saved the build.sbt, then run sbt package. A target folder will be created in the same directory, and the JAR file will be in target\scala-2.11.
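With the jar in place, the data pool write from the question should resolve. A sketch (the parameter names here just mirror the question's variables):

```python
def write_to_datapool(df, url, datapool_table, user, password, datasource_name):
    # With the connector jar on the classpath, this format string resolves
    # instead of raising ClassNotFoundException.
    (df.write
       .format("com.microsoft.sqlserver.jdbc.spark")
       .mode("append")
       .option("url", url)
       .option("dbtable", datapool_table)
       .option("user", user)
       .option("password", password)
       .option("dataPoolDataSource", datasource_name)
       .save())
```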



I was facing the same issue writing to SQL Server from Spark, so I tried the approach in this thread:

https://sqlrelease.com/read-and-write-data-to-sql-server-from-spark-using-pyspark

Steps mentioned in the above thread:

  1. Download the driver file.

  2. Unzip it and get the "sqljdbc42.jar" file from the "sqljdbc_6.0\enu\jre8" location (if you are using Java 8).

  3. Copy it to Spark's jars folder. In our case it is C:\Spark\spark-2.4.3-bin-hadoop2.7\jars.

  4. Start a new SparkSession if required.

Note: I am using Spark 3.5, compared to your 2.4.5.

Also, make sure you stop the Spark session, start a new one, and then try again.
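Restarting the session can be sketched like this (the app name is just an example):

```python
def restart_session(spark, app_name="kafka-streaming-app"):
    from pyspark.sql import SparkSession  # lazy import
    # Stop the current session so newly added jars/packages are picked up,
    # then build a fresh one.
    spark.stop()
    return SparkSession.builder.appName(app_name).getOrCreate()
```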

Code I implemented:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-streaming-app") \
        .config("spark.streaming.stopGracefullyOnShutdown", True) \
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0") \
        .config("spark.sql.shuffle.partitions", 4) \
        .master("local[2]").getOrCreate()

kafka_df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9093") \
    .option("subscribe", "device-data") \
    .option("startingOffsets", "earliest") \
    .load()

url = 'jdbc:sqlserver://localhost:1433;database=mydb;'

# A streaming DataFrame cannot use .write directly, so write each
# micro-batch to SQL Server through JDBC inside foreachBatch.
def write_batch(batch_df, batch_id):
    batch_df.write.mode("append") \
        .format("jdbc") \
        .option("url", url) \
        .option("dbtable", "event") \
        .option("user", "demo") \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .option("password", "PassTest123@") \
        .save()

kafka_df.writeStream.foreachBatch(write_batch).start().awaitTermination()
