Apache Spark
Lightning-Fast Cluster Computing
Eric Mizell – Director, Solutions Engineering
What is Apache Spark?
Apache Open Source Project
Distributed Compute Engine
for fast and expressive data processing
Designed for Iterative, In-Memory
computations and interactive data mining
Expressive Multi-Language APIs
for Java, Scala, Python, and R
Powerful Abstractions
Enable data workers to rapidly iterate over
data for:
• ETL, Machine Learning, SQL, Stream Processing,
and Graph Processing
[Diagram: Scala, Java, and Python APIs on top of the Spark Core Engine, which hosts Spark SQL, Spark Streaming, MLlib, and GraphX]
Why Spark?
Elegant Developer APIs
• DataFrames/SQL, machine learning, graph algorithms, and streaming
• Scala, Python, Java and R
• Single environment for pre-processing and Machine Learning
In-memory computation model
• Effective for iterative computations and machine learning
Machine Learning on Hadoop
• Implementations of distributed ML algorithms
• Pipeline API (Spark ML)
Runs on Hadoop via YARN, on Mesos, or standalone
Interactions with Spark
Command Line
• Scala shell – Scala/Java (./bin/spark-shell)
• Python - (./bin/pyspark)
Notebooks
• Apache Zeppelin Notebook
• Jupyter/IPython Notebook
• IRuby Notebook
ODBC/JDBC (Spark SQL only via Thrift)
• Simba driver
• DataDirect driver
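For a first taste, a minimal shell session might look like this (a sketch assuming a local install; the shell creates the SparkContext variable sc automatically):

$ ./bin/pyspark
>>> sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()   # distribute, transform, act
50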
Introducing Apache Zeppelin
A web-based notebook for interactive analytics
Features
Ad-hoc experimentation
Deeply integrated with Spark + Hadoop
Supports multiple language backends
Incubating at Apache
Use Case
Data exploration and discovery
Visualization
Interactive snippet-at-a-time experience
“Modern Data Science Studio”
Fundamental Abstraction: Resilient Distributed Datasets
RDD
Work with distributed collections as
primitives
RDD Properties
• Immutable collections of objects spread across
a cluster
• Built through parallel transformations (map,
filter, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM)
Multiple Languages
Broad developer, partner, and customer engagement
[Diagram: the developer writes a driver program (sc = new SparkContext; rdd = sc.textFile("hdfs://…"); rdd.filter(…); rdd.cache(); rdd.count(); rdd.map(…)) against one logical RDD; physically, the Spark driver distributes it as partitions 1–3 across worker nodes]
RDDs are collections of objects distributed across a cluster, cached in RAM or on disk. They are built through parallel transformations, are automatically rebuilt on failure, and are immutable (each transformation creates a new RDD).
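Controllable persistence in PySpark might look like the following sketch (the HDFS path is left elided, as in the diagram above):

from pyspark import StorageLevel

rdd = sc.textFile("hdfs://...")
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # keep in RAM, spill to disk if memory is tight
rdd.count()                                 # first action materializes and caches the partitions
rdd.count()                                 # served from cache; any lost partition is rebuilt from lineage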
What can developers do with RDDs?
RDD Operations
Transformations
• e.g. map, filter, groupBy, join
• Lazy operations to build RDDs from other
RDDs
Actions
• e.g. count, collect, save
• Return a result or write it to storage
Other primitives
• Accumulators
• Broadcast Variables
[Diagram: the developer writes RDD operations – transformations and actions – and can also use accumulators and broadcast variables]
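As a sketch of the two extra primitives in PySpark – an accumulator counting malformed lines and a broadcast variable shipping a small lookup table to each executor (the tab-separated log format here is a hypothetical example):

bad_lines = sc.accumulator(0)                       # executors add to it; only the driver reads it
codes = sc.broadcast({"E": "ERROR", "W": "WARN"})   # small read-only table, shipped once per executor

def parse(line):
    fields = line.split("\t")
    if len(fields) < 2:
        bad_lines.add(1)        # note: updates inside transformations may be recounted on task retries
        return None
    return (codes.value.get(fields[0], "UNKNOWN"), fields[1])

parsed = sc.textFile("hdfs://...").map(parse).filter(lambda x: x is not None)
parsed.count()                  # the action triggers the lazy transformations
print(bad_lines.value)          # number of malformed lines seen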
Example: Mining Console Logs
Load error messages from a log into memory, then interactively search for patterns

lines = sc.textFile("hdfs://...")                        # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "foo" in s).count()            # Action
messages.filter(lambda s: "bar" in s).count()
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5–7 sec (vs 170 sec for on-disk data)
RDD
Demo
SQL
SQL Access and Data Frames
[Diagram: the Spark stack on YARN and HDFS, with Spark SQL highlighted]
Spark SQL
Table Structure
Integrated to work with tables and rows
Hive Queries via Spark
The Spark SQL context (HiveContext) can connect to Hive and run Hive queries
Bindings
to Python, Scala, Java, and R
Data Frames
A new abstraction that simplifies and speeds up SQL processing
[Diagram: the Data Frame DSL and Spark SQL sit on the Data Frame API, which uses the Data Source API on top of the Spark Core Engine]
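A sketch of the Hive integration using the Spark 1.x-era PySpark API (the table name employees is hypothetical):

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)    # connects to the Hive metastore and understands HiveQL
df = sqlContext.sql("SELECT dept, AVG(age) FROM employees GROUP BY dept")
df.show()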
What are Data Frames?
Data Frames represent the data in RDDs as a table
RDD is a low-level abstraction
– Think of the RDD as bytecode and the DataFrame as the Java program
Data Frame Properties
– Data Frames attach a schema to RDDs
– The schema allows aggressive query optimizations
– Brings the power of SQL to RDDs!
dept   name       age
Bio    H Smith    48
CS     A Turing   54
Bio    B Jones    43
Phys   E Witten   61

[Diagram: each row is a tuple in the relational view; the underlying storage can be columnar (ORCFile, Parquet) or unstructured (JSON, CSV, Text, Avro, and custom formats such as weblogs)]
Data Frames are intuitive
Task: find the average age by department in this table.

dept   name       age
Bio    H Smith    48
CS     A Turing   54
Bio    B Jones    43
Phys   E Witten   61
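The slide shows an RDD example and the equivalent DataFrame example as screenshots; a plausible PySpark rendering of each (not the slide's exact code) is:

# RDD version: the grouping and averaging logic is spelled out by hand
rows = sc.parallelize([("Bio", "H Smith", 48), ("CS", "A Turing", 54),
                       ("Bio", "B Jones", 43), ("Phys", "E Witten", 61)])
(rows.map(lambda r: (r[0], (r[2], 1)))
     .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
     .map(lambda kv: (kv[0], kv[1][0] / float(kv[1][1])))
     .collect())

# DataFrame version: declarative, so the optimizer can do the work
df = sqlContext.createDataFrame(rows, ["dept", "name", "age"])
df.groupBy("dept").avg("age").show()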
DataFrame
Demo
MLlib
Machine Learning Library
[Diagram: the Spark stack on YARN and HDFS, with MLlib highlighted]
What is Machine Learning?
Machine learning is the study of
algorithms that learn concepts from
data.
A key aspect of learning is
generalization: how well a learning
algorithm is able to predict on unseen
examples.
Machine Learning Primitives
Unsupervised Learning
Clustering (K-means)
Recommendation
Collaborative Filtering
- alternating least squares
Dimensionality Reduction
- Principal component analysis (PCA) and singular
value decomposition (SVD)
Supervised Learning
Classification
- Naïve Bayes, Decision Trees, Random Forests,
Gradient-Boosted Trees, logistic regression, and
Support Vector Machines (SVMs)
Regression
- linear regression (including Lasso and ridge variants)
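For instance, clustering with MLlib's K-means looks roughly like this in PySpark (toy data, in the style of the MLlib documentation examples):

from numpy import array
from pyspark.mllib.clustering import KMeans

points = sc.parallelize([array([0.0, 0.0]), array([1.0, 1.0]),
                         array([9.0, 8.0]), array([8.0, 9.0])])
model = KMeans.train(points, k=2, maxIterations=10)   # unsupervised: no labels required
model.predict(array([0.5, 0.5]))                      # cluster id for a new point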
ML Workflows are complex
[Diagram: a sponsored-search advertising pipeline – log parsing and cleaning over data in HDFS; feature extraction (Q-Q similarity, Q-A similarity, ad category mapping, query category mapping, polynomial expansion of Q-A features); a linear solver that trains a model evaluated with train/test metrics; results feed an ad server]

Challenges:
• Specify the pipeline
• Inspect and debug it
• Tune hyperparameters
• Productionize it
ML Pipeline makes ML workflows easier
Transformer
Transforms one dataset into another
Estimator
Fits model to data
Pipeline
Sequence of stages, consisting of estimators
or transformers
Parameters
Trait for components that take parameters
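A minimal sketch of these pieces in PySpark's spark.ml (1.x-era API; the toy training data is hypothetical):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")       # Transformer
hashingTF = HashingTF(inputCol="words", outputCol="features")   # Transformer
lr = LogisticRegression(maxIter=10)                             # Estimator, configured via its parameters
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

training = sqlContext.createDataFrame(
    [(0, "spark is fast", 1.0), (1, "slow batch job", 0.0)],
    ["id", "text", "label"])
model = pipeline.fit(training)   # fits every stage in order and returns a reusable PipelineModel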
Streaming
Real-Time Stream Processing
[Diagram: the Spark stack on YARN and HDFS, with Spark Streaming highlighted]
Spark Streaming
• Spark Streaming is an extension of the core Spark API that supports scalable, high-throughput, fault-tolerant streaming applications.
• Data can be ingested from many data sources like Kafka, Flume, Twitter, ZeroMQ or
TCP sockets
• Data is processed using the now-familiar API: map, filter, reduce, join and window
• Processed data can be stored in databases, filesystems, or live dashboards
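A hedged word-count sketch with PySpark Streaming, reading from a TCP socket (host and port are placeholders):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)                       # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)     # placeholder source
counts = (lines.flatMap(lambda l: l.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # or persist to a database / filesystem
ssc.start()
ssc.awaitTermination()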
GraphX
Graph Processing
[Diagram: the Spark stack on YARN and HDFS, with GraphX highlighted]
Spark GraphX
Graph API on Spark
Seamlessly work with graphs and collections
Growing library of graph algorithms
• SVD++, Connected Components, Triangle
Count, …
Iterative Graph Computations using
Pregel
Implements Valiant’s Bulk Synchronous
Parallel (BSP) model for distributing graph
algorithms.
Use Case
Social Media: Suggest new connections based
on existing relationships
Networking: Best routing through a given
network
Distributed Graphs as Tables (RDDs)

[Diagram: a property graph over vertices A–F is decomposed into a Vertex Table (RDD) split across partitions 1 and 2, an Edge Table (RDD) holding edges A-B, A-C, C-D, B-C, A-E, A-F, E-F, and E-D, and a Routing Table (RDD) that maps each vertex to the partitions holding its edges; edges are placed using a 2D vertex-cut heuristic]
How to Get Started with Spark
Try Spark Today
Download the Hortonworks Sandbox
http://hortonworks.com/products/hortonworks-sandbox/
Go to the Apache Spark Website
http://spark.apache.org/
Learn Spark
Build a Proof of Concept
Test New Functionality
Thank You!
Eric Mizell - Director, Solutions Engineering
emizell@hortonworks.com


Editor's Notes

  • #6 TALK TRACK – Ad-hoc experimentation: Spark, Hive, Shell, Flink, Tajo, Ignite, Lens, etc. Deeply integrated with Spark + Hadoop: can be managed via Ambari Stacks. Supports multiple language backends: pluggable “Interpreters”. Incubating at Apache: 100% open source and open community. [NEXT SLIDE]
  • #7 TALK TRACK – Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprises to deploy Spark apps with confidence, and since HDP is available across Windows, Linux, on-premises, and cloud deployment environments, it is that much easier for enterprises to adopt. [NEXT SLIDE] http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • #9 Key idea: add “variables” to the “functions” in functional programming
  • #14 Spark DataFrames represent tabular Data
  • #24 A vertex is an entity that can carry a bag of data (generally small); an edge connects vertices and can also carry a bag of data. https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf