What's in it for you?
What is Spark SQL?
Spark SQL Features
Spark SQL Architecture
Spark SQL – DataFrame API
Spark SQL – Data Source API
Spark SQL – Catalyst Optimizer
Running SQL Queries
Spark SQL Demo
What is Spark SQL?
Spark SQL is Apache Spark's module for working with structured and semi-structured data. It originated to overcome the limitations of Apache Hive:
Hive lags in performance because it uses MapReduce jobs to execute ad-hoc queries
Hive does not allow you to resume a job if it fails in the middle of processing
Spark performs better than Hive in most scenarios (Source: https://engineering.fb.com/).
Spark SQL Features
Below are some essential features of Spark SQL that make it a compelling framework for data processing and analysis:

Integrated: You can integrate Spark SQL with Spark programs and query structured data inside them.

High Compatibility: You can run unmodified Hive queries on existing warehouses in Spark SQL. Spark SQL offers full compatibility with existing Hive data, queries, and UDFs.

Scalability: Spark SQL leverages the RDD model, which supports large jobs and mid-query fault tolerance. It uses the same engine for interactive and long-running queries.

Standard Connectivity: You can easily connect to Spark SQL through JDBC or ODBC, both of which are industry standards for connecting business intelligence tools.
Spark SQL Architecture
Spark SQL has three main layers:

Language API: Spark is highly compatible, supporting languages like Python, HiveQL, Scala, and Java.

SchemaRDD: Because Spark SQL works on schemas, tables, and records, you can use a SchemaRDD (DataFrame) as a temporary table.

Data Sources: Spark SQL supports multiple data sources such as JSON, the Cassandra database, and Hive tables.

In the architecture diagram, Spark SQL/HQL and the DataFrame DSL sit on top of the DataFrame API, which in turn uses the Data Source API to reach formats such as CSV, JSON, and JDBC.
Spark SQL – DataFrame API
The DataFrame API is a domain-specific language (DSL) for working with structured and semi-structured data, i.e., datasets with a schema. The DataFrame API in Spark was designed taking inspiration from the DataFrame in R and Pandas in Python.

DataFrame features:
Can process data ranging in size from kilobytes to petabytes, even on a single-node cluster
Can be easily integrated with all Big Data tools and frameworks via Spark Core
Provides APIs for Python, Java, Scala, and R
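A minimal sketch of the DataFrame DSL, assuming the spark-shell, where a SparkSession named spark is predefined:

import spark.implicits._

// Build a small DataFrame from a local collection
val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

df.printSchema()   // name: string, age: int
df.show()          // prints the two rows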
Spark SQL – Data Source API
Spark SQL supports operating on a variety of data sources through the DataFrame interface:

It supports different file formats such as CSV, Hive, Avro, JSON, and Parquet
It is lazily evaluated, like Apache Spark transformations, and can be accessed through SQLContext and HiveContext
It can be easily integrated with all Big Data tools and frameworks via Spark Core
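A brief sketch of reading and writing through the Data Source API; the file paths here are hypothetical placeholders:

// Read JSON and CSV sources into DataFrames
val people = spark.read.json("data/people.json")
val sales  = spark.read.option("header", "true").csv("data/sales.csv")

// Write a DataFrame back out in Parquet format
people.write.parquet("output/people.parquet")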
Spark SQL – Catalyst Optimizer
The Catalyst optimizer leverages advanced programming language features (such as Scala's pattern matching and quasiquotes) in a novel way to build an extensible query optimizer.

It works in 4 phases:
1. Analyzing a logical plan to resolve references
2. Logical plan optimization
3. Physical planning
4. Code generation to compile parts of the query to Java bytecode

A query moves through the pipeline as follows: a SQL Query is parsed into an Unresolved Logical Plan; Analysis (using the Catalog) resolves it into a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning generates candidate Physical Plans, from which a Cost Model selects the Selected Physical Plan; finally, Code Generation turns it into RDD operations.
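To see these phases for a concrete query, call explain(true) on a DataFrame; it prints the parsed, analyzed, and optimized logical plans along with the physical plan. A minimal sketch in the spark-shell:

val nums = spark.range(100).toDF("id")
nums.createOrReplaceTempView("numbers")

// Prints all Catalyst stages: parsed logical plan, analyzed logical
// plan, optimized logical plan, and physical plan
spark.sql("SELECT id FROM numbers WHERE id > 50").explain(true)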
Spark SQLContext
SQLContext is a class used for initializing the functionalities of Spark SQL. A SparkContext class object (sc) is required for initializing a SQLContext class object. (In Spark 2.x and later, SQLContext is superseded by SparkSession, described next.)

The following command initializes SparkContext through the spark-shell:
$ spark-shell

The following command creates a SQLContext:
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
SparkSession
SparkSession is the entry point to any functionality in Spark. To create a basic SparkSession, use SparkSession.builder() (Source: https://spark.apache.org/).
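The original slide shows a screenshot from the Spark documentation; the equivalent Scala snippet, adapted from that documentation, is:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")  // optional config, placeholder key
  .getOrCreate()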
Creating DataFrames
Applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources. The following creates a DataFrame based on the content of a JSON file (Source: https://spark.apache.org/):
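Adapted from the Spark documentation; the JSON file is the sample shipped with the Spark distribution:

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()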
DataFrame Operations
Structured data can be manipulated using the domain-specific language provided by DataFrames. Below are some examples of structured data processing (Source: https://spark.apache.org/):
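The slide's screenshots are not reproduced here; adapted from the Spark documentation, typical operations on the people DataFrame look like this:

import org.apache.spark.sql.functions.col

df.printSchema()                               // print the schema in a tree format
df.select("name").show()                       // select only the "name" column
df.select(col("name"), col("age") + 1).show()  // increment everybody's age by 1
df.filter(col("age") > 21).show()              // select people older than 21
df.groupBy("age").count().show()               // count people by age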
Running SQL Queries
The sql function on a SparkSession allows applications to run SQL queries programmatically and returns the result in the form of a DataFrame (Source: https://spark.apache.org/).
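A minimal sketch, adapted from the Spark documentation:

// Register the DataFrame as a SQL temporary view, then query it
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()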
Demo on Spark SQL
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn