Introduction to Apache Spark
Olalekan Fuad Elesin, Data Engineer
https://twitter.com/elesinOlalekan
https://github.com/OElesin
https://www.linkedin.com/in/elesinolalekan
00: Getting Started
Introduction
Necessary downloads and installations
Intro: Achievements
By the end of this session, you will be able to:
• open a Spark shell
• explore data sets loaded from HDFS, etc.
• review Spark SQL and Spark Streaming
• use the Spark Notebook
• find developer community resources
• return to your workplace and demo the use of Spark!
Intro: Preliminaries
I believe we all have
basic Scala programming skills
01: Getting Started
Installations
hands-on: 5 mins (max)
Installation:
Step 1: Install JDK 7/8 on macOS, Windows, or Linux
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads
Step 2: Download Spark 2.0.1 from http://spark.apache.org/downloads.html
(for this session, please copy all installers from the USB disk or hard drive)
Step 3: Run Spark Shell
We’ll run Spark’s interactive shell…
./bin/spark-shell
Let’s create some data from the Scala REPL prompt
val data = 1 to 100000
Step 4: Now, let’s create an RDD
val dataRDD = sc.parallelize(data)
then filter out some elements
dataRDD.filter(_ < 35).collect()
Check point 1
What was your result?
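If everything is set up correctly, the REPL should print the integers below 35, along the lines of:
res0: Array[Int] = Array(1, 2, 3, ..., 34)
(the exact res number depends on your session). Note that collect() ships the filtered elements from the executors back to the driver, so it is only safe for small results.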
02: Why Spark
Why Spark
Talk time: 6 mins (max)
Why Spark
• Most machine learning algorithms are iterative
• A large number of computations on data are also iterative
• With the disk-based approach in Hadoop MapReduce, each iteration is written to disk, which makes the process very slow
Hadoop execution flow: Input Data on Disk → Tuples (On Disk) → Tuples (On Disk) → Tuples (On Disk) → Output Data on Disk
Spark execution flow: Input Data on Disk → RDD1 (in memory) → RDD2 (in memory) → RDD3 (in memory) → Output Data on Disk
Diagram: http://www.wiziq.com/blog/hype-around-apache-spark/
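To make the difference concrete, here is a minimal sketch of an iterative computation in the spark-shell (using its built-in SparkContext sc, and a hypothetical data.txt file of numbers):

// Load once, then cache in memory so later passes skip the disk read.
val numbers = sc.textFile("data.txt").map(_.toDouble).cache()

// Each pass below re-reads the cached RDD from memory, not from disk.
for (cutoff <- 1 to 5) {
  println(s"values above $cutoff: " + numbers.filter(_ > cutoff).count())
}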
03: About Apache Spark
About Apache Spark
Talk time: 4 mins (max)
About Apache Spark
• Initially started at UC Berkeley in 2009 as a PhD thesis project by Matei Zaharia
• Fast and general-purpose cluster computing system
• 10x (on disk) to 100x (in memory) faster than MapReduce
• Popular for running iterative machine learning algorithms, for batch and streaming computations on data, and for its SQL interface and data frames
• Can also be used for graph processing
• Provides high-level APIs in:
• Scala
• Java
• Python
• R
• Seamless integration with Hadoop and its ecosystem; can also read data from a number of existing data sources
• More info: http://spark.apache.org/
04: Spark Stack
Spark Built-in Libraries
Talk time: 4 mins (max)
Spark Stack
• Spark SQL lets you query
structured data inside Spark
programs, using either SQL or a
familiar DataFrame API.
• Spark Streaming lets you write
streaming jobs the same way you
write batch jobs.
• Spark MLlib & ML: machine learning algorithms
• GraphX unifies ETL, exploratory analysis, and iterative graph computation within a single system
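As a quick taste of the stack, a minimal Spark SQL sketch for the spark-shell (using its built-in SparkSession spark in Spark 2.x, and a hypothetical people.json file):

// Read semi-structured data into a DataFrame.
val people = spark.read.json("people.json")

// Query it with SQL...
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 21").show()

// ...or with the equivalent DataFrame API.
people.filter("age > 21").select("name").show()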
05: Spark Execution Flow
Spark Execution Flow
Talk time: 4 mins (max)
Execution Flow
The driver program creates a SparkContext, which connects to a cluster manager (standalone, Mesos, or YARN), acquires executors on worker nodes, and then sends application code and tasks to those executors to run.
Diagram: http://spark.apache.org/docs/latest/cluster-overview.html (image courtesy of the Apache Spark website)
06: Terminology
BuzzWords !!!
Talk time: 6 mins (max)
Term: Meaning
• Application: user program built on Spark. Consists of a driver program and executors on the cluster.
• Application jar: a jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; these will be added at runtime.
• Driver program: the process running the main() function of the application and creating the SparkContext.
• Cluster manager: an external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
• Deploy mode: distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside the cluster. In "client" mode, the submitter launches the driver outside the cluster.
• Worker node: any node that can run application code in the cluster.
• Executor: a process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
• Task: a unit of work that will be sent to one executor.
• Job: a parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
• Stage: each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
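To connect these terms to code, consider this small spark-shell session (using the built-in sc):

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

// Transformations are lazy; nothing runs yet.
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// collect() is an action: it spawns one job, which is split into two
// stages around the shuffle caused by reduceByKey, and each stage runs
// as parallel tasks, one per partition.
counts.collect()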
07: Resilient Distributed Dataset
RDD !!!
Talk time: 4 mins (max)
• Resilient Distributed Dataset (RDD) is the basic abstraction in Spark
• Immutable, partitioned collection of elements that can be operated on in parallel
• Basic operations (sketched in the example below)
– map
– filter
– persist
• Multiple implementations
– PairRDDFunctions: operations on RDDs of key-value pairs, e.g. groupByKey, join
– DoubleRDDFunctions: operations on RDDs of doubles
– SequenceFileRDDFunctions: operations related to SequenceFiles
• RDD main characteristics:
– a list of partitions
– a function for computing each split
– a list of dependencies on other RDDs
– optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash partitioned)
– optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
• Custom RDDs can also be implemented (by overriding these functions)
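A minimal sketch of these basic operations in the spark-shell (using its built-in SparkContext sc):

import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 1000)

// map and filter are lazy transformations producing new, immutable RDDs.
val squares = nums.map(n => n * n)
val small = squares.filter(_ < 100)

// persist keeps the computed RDD around for reuse (MEMORY_ONLY is the default level).
small.persist(StorageLevel.MEMORY_ONLY)

// Key-value RDDs pick up PairRDDFunctions such as groupByKey and join.
val pairs = nums.map(n => (n % 10, n))
pairs.groupByKey().mapValues(_.size).collect()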
08: Cluster Deployment
Cluster Deployment
Talk time: 3 mins (max)
• Standalone deploy mode
– the simplest way to deploy Spark on a private cluster
• Amazon EC2
– EC2 launch scripts are available
– very quick to launch a new cluster
• Apache Mesos
• Hadoop YARN
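Whichever cluster manager you choose, applications are typically submitted with the spark-submit script. A minimal sketch (the class and jar names here are hypothetical):

./bin/spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  myapp-assembly-1.0.jar

With --deploy-mode cluster the driver runs inside the cluster; with --deploy-mode client it runs on the submitting machine (see the Deploy mode definition above).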
09: Monitoring
Monitoring Spark with
WebUI
Talk time: 2 mins (max)
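While an application is running, each SparkContext serves a web UI, by default at http://localhost:4040 on the driver machine, with tabs for jobs, stages, storage, environment, and executors. If port 4040 is already taken, Spark binds to 4041, 4042, and so on.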
10: Hands-On Time
Hands-on with spark-shell
and
Spark Notebook
Talk time: ~ mins (max)
