Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn
Spark SQL is a module of Apache Spark designed for handling structured and semi-structured data, improving upon the limitations of Apache Hive by offering better performance and fault tolerance. It features a robust architecture with support for multiple programming languages and data sources, leveraging DataFrames and a Catalyst optimizer for efficient query execution. Users can run SQL queries and process large datasets seamlessly using Spark's integrated capabilities.
Spark SQL is Apache Spark's module for structured and semi-structured data. It overcomes Hive's limitations by improving performance and allowing failed jobs to be resumed.
Spark SQL features include high compatibility, integration within Spark, scalability, and support for JDBC/ODBC connectivity.
Spark SQL architecture consists of three layers supporting various data sources and programming languages, enabling structured data manipulation.
The DataFrame API facilitates working with structured/semi-structured data. Inspired by DataFrames in R and Pandas in Python, it can process data from kilobytes to petabytes on a single-node cluster.
Spark SQL supports various data sources (CSV, Avro, etc.) via the DataFrame interface, lazily evaluated and integrating with Big Data tools.
Catalyst Optimizer is a key feature of Spark SQL, enhancing query optimization through a multi-phase process leveraging Scala.
SQLContext initializes Spark SQL functionalities, requiring SparkContext, while SparkSession serves as the entry point for Spark applications.
Applications can create DataFrames from RDDs or data sources, utilizing a domain-specific language for structured data manipulation.
Spark SQL allows running SQL queries through the sql function on a SparkSession, returning results as DataFrames.
What's in it for you?
• What is Spark SQL?
• Spark SQL Features
• Spark SQL Architecture
• Spark SQL – DataFrame API
• Spark SQL – Data Source API
• Spark SQL – Catalyst Optimizer
• Running SQL Queries
• Spark SQL Demo
What is Spark SQL?
Spark SQL is Apache Spark's module for working with structured and semi-structured data. It originated to overcome the limitations of Apache Hive.
Limitations of Apache Hive:
• Hive lags in performance as it uses MapReduce jobs to execute ad-hoc queries
• Hive does not allow you to resume a job if it fails in the middle of processing
Spark performs better than Hive in most scenarios.
[Chart: Hive vs. Spark query performance comparison. Source: https://engineering.fb.com/]
Spark SQL Features
Below are some essential features of Spark SQL that make it a compelling framework for data processing and analysis:

Integrated: You can integrate Spark SQL and query structured data inside Spark programs.

High Compatibility: You can run unmodified Hive queries on existing warehouses in Spark SQL. Spark SQL offers full compatibility with existing Hive data, queries, and UDFs.
Scalability: Spark SQL leverages the RDD model, supporting large jobs and mid-query fault tolerance. It uses the same engine for both interactive and long queries.

Standard Connectivity: You can easily connect Spark SQL with JDBC or ODBC; both have become industry norms for connectivity with business intelligence tools.
Spark SQL Architecture
Spark SQL has three main layers:

Language API: Spark is very compatible as it supports languages like Python, HiveQL, Scala, and Java.

SchemaRDD: As Spark SQL works on schemas, tables, and records, you can use a SchemaRDD or DataFrame as a temporary table.

Data Sources: Spark SQL supports multiple data sources such as JSON, the Cassandra database, and Hive tables.
Spark SQL – DataFrame API
A DataFrame provides a domain-specific language (DSL) for working with structured and semi-structured data, i.e., datasets with a schema. The DataFrame API in Spark was designed taking inspiration from the DataFrame in R and Pandas in Python.

DataFrame features:
• Can process data ranging from kilobytes to petabytes in size on a single-node cluster
• Can be easily integrated with all Big Data tools and frameworks via Spark-Core
• Provides APIs for Python, Java, Scala, and R
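As a quick illustration of the DSL, here is a minimal sketch built in a spark-shell session, where a SparkSession named spark is in scope; the Person case class and sample rows are illustrative:

import spark.implicits._

case class Person(name: String, age: Long)

// Build a DataFrame from local data; the schema is inferred from the case class
val people = Seq(Person("Alice", 29), Person("Bob", 31)).toDF()

people.printSchema()               // name: string, age: long
people.filter($"age" > 30).show()  // DSL-style filtering, no SQL string needed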
Spark SQL – Data Source API
Spark SQL supports operating on a variety of data sources through the DataFrame interface:
• It supports different file formats such as CSV, Hive, Avro, JSON, and Parquet
• It is lazily evaluated, like Apache Spark transformations, and can be accessed through SQLContext and HiveContext
• It can be easily integrated with all Big Data tools and frameworks via Spark-Core
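A minimal sketch of reading and writing through the Data Source API, assuming a SparkSession named spark and illustrative file paths:

// Read from different sources through one uniform interface (paths are illustrative)
val jsonDF = spark.read.json("data/people.json")
val csvDF  = spark.read.option("header", "true").csv("data/people.csv")

// Transformations over these DataFrames are lazy; an action such as show() triggers evaluation
csvDF.show()

// Write back out in a different format via the same interface
csvDF.write.mode("overwrite").parquet("data/people.parquet")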
Spark SQL – Catalyst Optimizer
The Catalyst optimizer leverages advanced programming language features (such as Scala's pattern matching and quasiquotes) in a novel way to build an extensible query optimizer.

It works in four phases:
1. Analyzing a logical plan to resolve references
2. Logical plan optimization
3. Physical planning
4. Code generation to compile parts of the query to Java bytecode

[Diagram] SQL Query → Unresolved Logical Plan → Analysis (consults the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs
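You can watch these phases at work on any query by asking Spark to print its plans. A minimal sketch, assuming a SparkSession named spark and an illustrative JSON file:

val df = spark.read.json("data/people.json")
df.createOrReplaceTempView("people")

// explain(true) prints Catalyst's parsed (unresolved) logical plan, analyzed
// logical plan, optimized logical plan, and the selected physical plan
spark.sql("SELECT name FROM people WHERE age > 21").explain(true)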
Spark SQLContext
SQLContext is a class used for initializing the functionalities of Spark SQL. A SparkContext class object (sc) is required for initializing a SQLContext class object.

The following command initializes SparkContext through spark-shell:
$ spark-shell

The following command creates a SQLContext (note the full package name, org.apache.spark.sql):
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
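Once created, the SQLContext exposes the same read entry point shown in the later sections; a one-line sketch with an illustrative file path:

scala> val df = sqlcontext.read.json("data/people.json")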
SparkSession
It is the entry point to any functionality in Spark. To create a basic SparkSession, use SparkSession.builder().
Source: https://spark.apache.org/
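A minimal SparkSession built with the builder pattern, in the style of the Spark documentation; the application name and config entry are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark SQL basic example")               // illustrative app name
  .config("spark.some.config.option", "some-value") // illustrative config entry
  .getOrCreate()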
Creating DataFrames
Applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
The following creates a DataFrame based on the content of a JSON file:
Source: https://spark.apache.org/
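A minimal sketch following the Spark documentation's example; the file path is illustrative:

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()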
DataFrame Operations
Structured data can be manipulated using the domain-specific language provided by DataFrames. Below are some examples of structured data processing:
Source: https://spark.apache.org/
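A minimal sketch in the style of the Spark documentation, assuming the df created above from people.json:

import spark.implicits._

df.printSchema()                      // print the schema in a tree format
df.select("name").show()              // select only the "name" column
df.select($"name", $"age" + 1).show() // select everybody, incrementing age by 1
df.filter($"age" > 21).show()         // select people older than 21
df.groupBy("age").count().show()      // count people by age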
Running SQL Queries
The sql function on a SparkSession allows applications to run SQL queries programmatically and returns the result in the form of a DataFrame.
Source: https://spark.apache.org/
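A minimal sketch, again following the Spark documentation; the view name is illustrative:

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()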