Apache Spark
Lightning-Fast Cluster Computing
Eric Mizell – Director, Solutions Engineering
What is Apache Spark?
Apache Open Source Project
Distributed Compute Engine
for fast and expressive data processing
Designed for Iterative, In-Memory
computations and interactive data mining
Expressive Multi-Language APIs
for Java, Scala, Python, and R
Powerful Abstractions
Enable data workers to rapidly iterate over
data for:
• ETL, Machine Learning, SQL, Stream Processing,
and Graph Processing
[Diagram: Scala, Java, and Python APIs on top of the Spark Core Engine, which hosts Spark SQL, Spark Streaming, MLlib, and GraphX]
Why Spark?
Elegant Developer APIs
• DataFrames/SQL, machine learning, graph algorithms, and streaming
• Scala, Python, Java and R
• Single environment for pre-processing and Machine Learning
In-memory computation model
• Effective for iterative computations and machine learning
Machine Learning on Hadoop
• Implementations of distributed ML algorithms
• Pipeline API (Spark ML)
Runs on Hadoop via YARN, on Mesos, or standalone
Interactions with Spark
Command Line
• Scala shell – Scala/Java (./bin/spark-shell)
• Python - (./bin/pyspark)
Notebooks
• Apache Zeppelin Notebook
• Jupyter/IPython Notebook
• IRuby Notebook
ODBC/JDBC (Spark SQL only via Thrift)
• Simba driver
• DataDirect driver
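For a first taste, a minimal shell session might look like this (a sketch assuming a local install; the shell creates the SparkContext variable sc automatically):

$ ./bin/pyspark
>>> sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()   # distribute, transform, act
50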
Introducing Apache Zeppelin
A web-based notebook for interactive analytics
Features
Ad-hoc experimentation
Deeply integrated with Spark + Hadoop
Supports multiple language backends
Incubating at Apache
Use Case
Data exploration and discovery
Visualization
Interactive snippet-at-a-time experience
“Modern Data Science Studio”
Fundamental Abstraction: Resilient Distributed Datasets
RDD
Work with distributed collections as
primitives
RDD Properties
• Immutable collections of objects spread across
a cluster
• Built through parallel transformations (map,
filter, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM)
Multiple Languages
Broad developer, partner, and customer engagement
[Diagram: the developer writes a driver program (sc = new SparkContext; rdd = sc.textFile("hdfs://…"); rdd.filter(…); rdd.cache(); rdd.count(); rdd.map(…)) against one logical RDD; physically, the Spark driver distributes it as partitions 1–3 across worker nodes]
RDDs are collections of objects distributed across a cluster, cached in RAM or on disk. They are built through parallel transformations, are automatically rebuilt on failure, and are immutable (each transformation creates a new RDD).
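Controllable persistence in PySpark might look like the following sketch (the HDFS path is left elided, as in the diagram above):

from pyspark import StorageLevel

rdd = sc.textFile("hdfs://...")
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # keep in RAM, spill to disk if memory is tight
rdd.count()                                 # first action materializes and caches the partitions
rdd.count()                                 # served from cache; any lost partition is rebuilt from lineage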
What can developers do with RDDs?
RDD Operations
Transformations
• e.g. map, filter, groupBy, join
• Lazy operations to build RDDs from other
RDDs
Actions
• e.g. count, collect, save
• Return a result or write it to storage
Other primitives
• Accumulators
• Broadcast Variables
[Diagram: the developer writes RDD operations – transformations and actions – and can also use accumulators and broadcast variables]
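As a sketch of the two extra primitives in PySpark – an accumulator counting malformed lines and a broadcast variable shipping a small lookup table to each executor (the tab-separated log format here is a hypothetical example):

bad_lines = sc.accumulator(0)                       # executors add to it; only the driver reads it
codes = sc.broadcast({"E": "ERROR", "W": "WARN"})   # small read-only table, shipped once per executor

def parse(line):
    fields = line.split("\t")
    if len(fields) < 2:
        bad_lines.add(1)        # note: updates inside transformations may be recounted on task retries
        return None
    return (codes.value.get(fields[0], "UNKNOWN"), fields[1])

parsed = sc.textFile("hdfs://...").map(parse).filter(lambda x: x is not None)
parsed.count()                  # the action triggers the lazy transformations
print(bad_lines.value)          # number of malformed lines seen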
Example: Mining Console Logs
Load error messages from a log into memory, then interactively search for patterns

lines = sc.textFile("hdfs://...")                        # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "foo" in s).count()            # Action
messages.filter(lambda s: "bar" in s).count()
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5–7 sec (vs 170 sec for on-disk data)
RDD
Demo
SQL
SQL Access and Data Frames
[Diagram: the Spark stack on YARN and HDFS, with Spark SQL highlighted]
Spark SQL
Table Structure
Integrated to work with tables and rows
Hive Queries via Spark
The Spark SQL context (HiveContext) can connect to Hive and run Hive queries
Bindings
to Python, Scala, Java, and R
Data Frames
A new abstraction that simplifies and speeds up SQL processing
[Diagram: the Data Frame DSL and Spark SQL sit on the Data Frame API, which uses the Data Source API on top of the Spark Core Engine]
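A sketch of the Hive integration using the Spark 1.x-era PySpark API (the table name employees is hypothetical):

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)    # connects to the Hive metastore and understands HiveQL
df = sqlContext.sql("SELECT dept, AVG(age) FROM employees GROUP BY dept")
df.show()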
What are Data Frames?
Data Frames represent the data in RDDs as a table
RDD is a low-level abstraction
– Think of the RDD as bytecode and the DataFrame as the Java program
Data Frame Properties
– Data Frames attach a schema to RDDs
– The schema allows aggressive query optimizations
– Brings the power of SQL to RDDs!
dept   name       age
Bio    H Smith    48
CS     A Turing   54
Bio    B Jones    43
Phys   E Witten   61

[Diagram: each row is a tuple in the relational view; the underlying storage can be columnar (ORCFile, Parquet) or unstructured (JSON, CSV, Text, Avro, and custom formats such as weblogs)]
Data Frames are intuitive
Task: find the average age by department in this table.

dept   name       age
Bio    H Smith    48
CS     A Turing   54
Bio    B Jones    43
Phys   E Witten   61
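The slide shows an RDD example and the equivalent DataFrame example as screenshots; a plausible PySpark rendering of each (not the slide's exact code) is:

# RDD version: the grouping and averaging logic is spelled out by hand
rows = sc.parallelize([("Bio", "H Smith", 48), ("CS", "A Turing", 54),
                       ("Bio", "B Jones", 43), ("Phys", "E Witten", 61)])
(rows.map(lambda r: (r[0], (r[2], 1)))
     .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
     .map(lambda kv: (kv[0], kv[1][0] / float(kv[1][1])))
     .collect())

# DataFrame version: declarative, so the optimizer can do the work
df = sqlContext.createDataFrame(rows, ["dept", "name", "age"])
df.groupBy("dept").avg("age").show()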
DataFrame
Demo
MLlib
Machine Learning Library
[Diagram: the Spark stack on YARN and HDFS, with MLlib highlighted]
What is Machine Learning?
Machine learning is the study of
algorithms that learn concepts from
data.
A key aspect of learning is
generalization: how well a learning
algorithm is able to predict on unseen
examples.
Machine Learning Primitives
Unsupervised Learning
Clustering (K-means)
Recommendation
Collaborative Filtering
- alternating least squares
Dimensionality Reduction
- Principal component analysis (PCA) and singular
value decomposition (SVD)
Supervised Learning
Classification
- Naïve Bayes, Decision Trees, Random Forests,
Gradient-Boosted Trees, logistic regression, and
Support Vector Machines (SVMs)
Regression
- linear regression (including Lasso and ridge variants)
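For instance, clustering with MLlib's K-means looks roughly like this in PySpark (toy data, in the style of the MLlib documentation examples):

from numpy import array
from pyspark.mllib.clustering import KMeans

points = sc.parallelize([array([0.0, 0.0]), array([1.0, 1.0]),
                         array([9.0, 8.0]), array([8.0, 9.0])])
model = KMeans.train(points, k=2, maxIterations=10)   # unsupervised: no labels required
model.predict(array([0.5, 0.5]))                      # cluster id for a new point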
ML Workflows are complex
[Diagram: a sponsored-search advertising pipeline – log parsing and cleaning over data in HDFS; feature extraction (Q-Q similarity, Q-A similarity, ad category mapping, query category mapping, polynomial expansion of Q-A features); a linear solver that trains a model evaluated with train/test metrics; results feed an ad server]

Challenges:
• Specify the pipeline
• Inspect and debug it
• Tune hyperparameters
• Productionize it
ML Pipeline makes ML workflows easier
Transformer
Transforms one dataset into another
Estimator
Fits model to data
Pipeline
Sequence of stages, consisting of estimators
or transformers
Parameters
Trait for components that take parameters
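A minimal sketch of these pieces in PySpark's spark.ml (1.x-era API; the toy training data is hypothetical):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")       # Transformer
hashingTF = HashingTF(inputCol="words", outputCol="features")   # Transformer
lr = LogisticRegression(maxIter=10)                             # Estimator, configured via its parameters
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

training = sqlContext.createDataFrame(
    [(0, "spark is fast", 1.0), (1, "slow batch job", 0.0)],
    ["id", "text", "label"])
model = pipeline.fit(training)   # fits every stage in order and returns a reusable PipelineModel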
Streaming
Real-Time Stream Processing
[Diagram: the Spark stack on YARN and HDFS, with Spark Streaming highlighted]
Spark Streaming
• Spark Streaming is an extension of the core Spark API that supports scalable, high-throughput, fault-tolerant streaming applications.
• Data can be ingested from many data sources like Kafka, Flume, Twitter, ZeroMQ or
TCP sockets
• Data is processed using the now-familiar API: map, filter, reduce, join and window
• Processed data can be stored in databases, filesystems, or live dashboards
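A hedged word-count sketch with PySpark Streaming, reading from a TCP socket (host and port are placeholders):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)                       # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)     # placeholder source
counts = (lines.flatMap(lambda l: l.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # or persist to a database / filesystem
ssc.start()
ssc.awaitTermination()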
GraphX
Graph Processing
[Diagram: the Spark stack on YARN and HDFS, with GraphX highlighted]
Spark GraphX
Graph API on Spark
Seamlessly work with graphs and collections
Growing library of graph algorithms
• SVD++, Connected Components, Triangle
Count, …
Iterative Graph Computations using
Pregel
Implements Valiant’s Bulk Synchronous
Parallel (BSP) model for distributing graph
algorithms.
Use Case
Social Media: Suggest new connections based
on existing relationships
Networking: Best routing through a given
network
Distributed Graphs as Tables (RDDs)

[Diagram: a property graph over vertices A–F is decomposed into a Vertex Table (RDD) split across partitions 1 and 2, an Edge Table (RDD) holding edges A-B, A-C, C-D, B-C, A-E, A-F, E-F, and E-D, and a Routing Table (RDD) that maps each vertex to the partitions holding its edges; edges are placed using a 2D vertex-cut heuristic]
How to Get Started with Spark
Try Spark Today
Download the Hortonworks Sandbox
http://hortonworks.com/products/hortonworks-sandbox/
Go to the Apache Spark Website
http://spark.apache.org/
Learn Spark
Build a Proof of Concept
Test New Functionality
Thank You!
Eric Mizell - Director, Solutions Engineering
emizell@hortonworks.com


Editor's Notes

  • #6 TALK TRACK – Ad-hoc experimentation: Spark, Hive, Shell, Flink, Tajo, Ignite, Lens, etc. Deeply integrated with Spark + Hadoop: can be managed via Ambari Stacks. Supports multiple language backends: pluggable “Interpreters”. Incubating at Apache: 100% open source and open community. [NEXT SLIDE]
  • #7 TALK TRACK – Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprises to deploy Spark apps with confidence, and since HDP is available across Windows, Linux, on-premises, and cloud deployment environments, it is that much easier for enterprises to adopt. [NEXT SLIDE] http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • #9 Key idea: add “variables” to the “functions” in functional programming
  • #14 Spark DataFrames represent tabular Data
  • #24 A vertex is an entity that can carry a bag of data (generally small); an edge connects vertices and can also carry a bag of data. https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf