Apache Spark:
killer or savior of Apache Hadoop?	

Roman Shaposhnik	

Director of Open Source @Pivotal	

(Twitter: @rhatr)
Who’s this guy?	

•  Director of Open Source (building a team of OS contributors)	

•  Apache Software Foundation guy (Member, VP of Apache
Incubator, committer on Hadoop, Giraph, Sqoop, etc)	

•  Used to be root@Cloudera	

•  Used to be PHB@Yahoo! (original Hadoop team)	

•  Used to be a hacker at Sun Microsystems (Sun Studio compilers
and tools)
Shameless plug	

http://manning.com/martella
Dearly beloved…
40 minutes to figure out	

Hadoop vs. Spark
Hadoop++ == Spark
Hadoop + Spark
Long, long time ago…	

HDFS
ASF Projects	

 FLOSS Projects	

 Pivotal Products	

MapReduce
In a blink of an eye	

HDFS
Pig
Sqoop Flume
Coordination and
workflow
management	

Zookeeper
Command
Center
ASF Projects	

 FLOSS Projects	

 Pivotal Products	

GemFire XD
Oozie
MapReduce
Hive
Tez
Giraph
Hadoop UI	

Hue
SolrCloud
Phoenix
HBase
Crunch Mahout
Spark
Shark
Streaming
MLlib
GraphX
Impala
HAWQ
SpringXD
MADlib
Hamster
PivotalR
YARN
Tachyon
A Spark view?	

HDFS
Sqoop Flume
Coordination and
workflow
management	

Zookeeper
Command
Center
ASF Projects	

 FLOSS Projects	

 Pivotal Products	

GemFire XD
Oozie
Hadoop UI	

Hue
SolrCloud
Phoenix
HBase Spark
Shark
Streaming
MLlib
GraphX
SpringXD
YARN
Tachyon
BDAS
Principle #1	

HDFS is the datalake
Your datacenter	

…	

server 1	

server N
Hadoop’s view	

MapReduce	

server 1	

server N	

HDFS
HDFS: decoupled storage	

	

…	

 MR	

HDFS	

MR
Anatomy of MapReduce	

HDFS (input):   a a c  |  a b c
mappers:        a 1, a 1, c 1  |  a 1, b 1, c 1
shuffle:        a 1 1 1  |  b 1  |  c 1 1
reducers:       a 3  |  b 1  |  c 2
HDFS (output):  a 3, b 1, c 2
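The map, shuffle and reduce stages of this word count can be sketched with plain Scala collections. This is only a single-machine illustration, not Hadoop code: the object and method names are made up, and the two input lines are chosen so the totals come out as a 3, b 1, c 2.

```scala
// Plain Scala collections stand in for the cluster; nothing here uses Hadoop.
object MapReduceSketch {
  // Map phase: each input line becomes a list of (word, 1) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split(" ").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: gather all the 1s emitted for the same word.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Reduce phase: sum the values collected for each word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  // The whole job: map every line, shuffle by key, reduce each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))
}
```

Calling `MapReduceSketch.wordCount(Seq("a a c", "a b c"))` walks through the same three stages on two lines of input.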
Principle #2	

MR is assembly language
MapReduce 1.0	

[Diagram: one Job Tracker and two Task Tracker (HDFS) nodes running task1 … taskN]
YARN (AKA MR2.0)	

[Diagram: a Resource Manager, a Job Tracker and a Task Tracker running tasks]
Principle #3	

MR: YARN + library
What’s wrong with MR?	

Source: UC Berkeley Spark project (just the image)
Principle #4	

$ grep -R | awk | sort …
Spark philosophy	

• Make life easy for Data Scientists	

• Provide well documented and expressive APIs	

• Powerful Domain Specific Libraries	

• Easy integration with storage systems	

• Caching to avoid data movement	

• Well defined releases, stable API
Spark innovations	

• Resilient Distributed Datasets (RDDs)	

• Distributed on a cluster	

• Manipulated via parallel operators (map, etc.)	

• Automatically rebuilt on failure	

• A parallel ecosystem	

• A solution to iterative and multi-stage apps
RDDs	

val warnings = textFile(…).filter(_.contains("warning"))
                          .map(_.split(' ')(1))

HadoopRDD (path = hdfs://) → FilteredRDD (contains…) → MappedRDD (split…)
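The idea that an RDD remembers how it was derived, and can therefore be rebuilt, can be sketched in a few lines of plain Scala. This is a toy model and not Spark's implementation: `Dataset`, `Source`, `Filtered`, `Mapped` and the sample log lines are all hypothetical names and data introduced for illustration.

```scala
// Hypothetical names throughout; this is not Spark's RDD implementation.
object LineageSketch {
  // A dataset is described by how to compute it, not by its contents.
  sealed trait Dataset[A] {
    def compute(): Seq[A]
    def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
  }

  // Root of the lineage: re-reads the source every time it is computed.
  final case class Source[A](read: () => Seq[A]) extends Dataset[A] {
    def compute(): Seq[A] = read()
  }

  // Each derived dataset only stores its parent and the function to apply.
  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def compute(): Seq[A] = parent.compute().filter(p)
  }
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): Seq[B] = parent.compute().map(f)
  }

  // Made-up log lines standing in for a file on HDFS.
  val log: Seq[String] = Seq("warning disk full", "info all good", "warning cpu hot")

  // Same shape as the slide: keep the warnings, then take the second field.
  val warnings: Dataset[String] =
    Source(() => log).filter(_.contains("warning")).map(_.split(' ')(1))
}
```

Nothing is computed until `warnings.compute()` is called, and calling it again replays the same chain from the source, which is the property that lets a lost partition be rebuilt.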
Parallel operators	

• map, reduce	

• sample, filter	

• groupBy, reduceByKey	

• join, leftOuterJoin, rightOuterJoin	

• union, cross
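A rough feel for two of these operators, reduceByKey and join, using local Scala collections instead of a cluster. The helper names and sample data are assumptions made for illustration; Spark's own operators work on distributed pair RDDs and differ in detail.

```scala
// Local stand-ins for two pair operators; not the Spark API itself.
object OperatorsSketch {
  // reduceByKey on a local Seq: group by key, then combine each group's values.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

  // Inner join on key: every matching (left, right) combination is emitted.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    for {
      (k, a)  <- left
      (k2, b) <- right
      if k == k2
    } yield (k, (a, b))

  // Made-up sample data.
  val visits: Seq[(String, Int)] = Seq(("alice", 3), ("bob", 1), ("alice", 2))
  val cities: Seq[(String, String)] = Seq(("alice", "Paris"), ("bob", "Oslo"))

  val totals: Map[String, Int] = reduceByKey(visits)(_ + _)
  val joined: Seq[(String, (Int, String))] = join(totals.toSeq, cities)
}
```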
How do I use it?	

val file = spark.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")
Principle #5	

Memory is the new disk
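Why this matters for iterative jobs can be shown with a toy counter. In this sketch `loadFromDisk` is a hypothetical stand-in for an expensive read from storage, and the numbers are made up; the point is only how many times the source is touched.

```scala
// Toy illustration of caching; no Spark or HDFS involved.
object CachingSketch {
  var loads = 0 // counts how often the "disk" is read

  // Stand-in for an expensive read from storage.
  def loadFromDisk(): Vector[Double] = { loads += 1; Vector(1.0, 2.0, 3.0, 4.0) }

  // Without caching: every iteration goes back to the source.
  def uncached(iterations: Int): Double =
    (1 to iterations).map(i => loadFromDisk().map(_ * i).sum).sum

  // With caching: read once, keep the data in memory, reuse it.
  def cached(iterations: Int): Double = {
    val data = loadFromDisk()
    (1 to iterations).map(i => data.map(_ * i).sum).sum
  }
}
```

Both methods return the same result, but `uncached(n)` reads the source n times while `cached(n)` reads it once.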
RDDs are the foundation	

• SQL	

• Graph	

• ML	

• Streaming
Spark SQL	

• Library in Spark Core that models RDDs as relations	

• SchemaRDD	

• Replaces Shark	

• Lightweight with no code from Hive	

• Import/Export into different storage formats	

• Columnar storage (as in Shark)
Spark Streaming	

• Extend Spark to do large scale stream
processing	

• Simple, batch like API with RDDs	

• Single semantics for both real time and high
latency
D-Streams
Streaming from Twitter	

TwitterUtils.createStream(...)
    .filter(_.getText.contains("Spark"))
    .countByWindow(Seconds(5))
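The D-Stream idea, a stream treated as a sequence of small batches, can be mimicked locally. In this sketch each inner `Seq` plays the role of one micro-batch and the window is measured in batches rather than seconds; the data and the local `countByWindow` helper are assumptions for illustration, not the Spark Streaming API.

```scala
// A stream modelled as a sequence of micro-batches; plain Scala only.
object WindowSketch {
  // Each inner Seq is one micro-batch of tweet texts (made-up data).
  val batches: Seq[Seq[String]] = Seq(
    Seq("Spark is fast", "hello"),
    Seq("I like Spark", "Spark Spark", "Hadoop"),
    Seq("nothing here"),
    Seq("Spark again")
  )

  // Count matching items over a sliding window of `size` consecutive batches.
  def countByWindow(batches: Seq[Seq[String]], size: Int)(p: String => Boolean): List[Int] =
    batches.map(_.count(p)).sliding(size).map(_.sum).toList
}
```

`WindowSketch.countByWindow(WindowSketch.batches, 2)(_.contains("Spark"))` filters each batch and then sums the counts over every pair of adjacent batches.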
Spark GraphX	

• Pregel (BSP) (formerly known as Bagel)	

• Graph-centric modeling	

• Unification of processing	

• No more MR trickery
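A Pregel-style (BSP) computation proceeds in supersteps: vertices send messages, a barrier is reached, and each vertex updates from its inbox. The sketch below propagates the maximum value through a tiny graph. It is a single-machine toy with made-up vertex ids and values, not the GraphX API, and it stops when nothing changes rather than using per-vertex vote-to-halt.

```scala
// Toy BSP loop on local collections; names and data are hypothetical.
object PregelSketch {
  // Undirected graph as adjacency lists; vertex 4 is isolated.
  val edges: Map[Int, Seq[Int]] =
    Map(1 -> Seq(2), 2 -> Seq(1, 3), 3 -> Seq(2), 4 -> Seq.empty[Int])
  val initial: Map[Int, Int] = Map(1 -> 5, 2 -> 1, 3 -> 9, 4 -> 2)

  // One superstep: every vertex sends its value to its neighbours,
  // then each vertex keeps the max of its own value and its inbox.
  def superstep(values: Map[Int, Int]): Map[Int, Int] = {
    val messages: Seq[(Int, Int)] =
      edges.toSeq.flatMap { case (v, ns) => ns.map(n => (n, values(v))) }
    val inbox: Map[Int, Seq[Int]] =
      messages.groupBy(_._1).map { case (v, ms) => (v, ms.map(_._2)) }
    values.map { case (v, x) => (v, (x +: inbox.getOrElse(v, Seq.empty[Int])).max) }
  }

  // Run supersteps until no vertex changes (a global barrier between steps).
  @annotation.tailrec
  def run(values: Map[Int, Int]): Map[Int, Int] = {
    val next = superstep(values)
    if (next == values) values else run(next)
  }
}
```

After `PregelSketch.run(PregelSketch.initial)` the connected vertices 1, 2 and 3 agree on the largest value among them, while the isolated vertex keeps its own.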
You killed Apache Giraph?
MLbase	

• Machine Learning toolset	

• MATLAB for scale out computing	

• Built on Spark MLlib	

• Classification, Regression, Collaborative Filtering, etc.
What is really happening?	

HDFS
Pig
Sqoop Flume
Coordination and
workflow
management	

Zookeeper
Command
Center
ASF Projects	

 FLOSS Projects	

 Pivotal Products	

GemFire XD
Oozie
MapReduce
Hive
Tez
Giraph
Hadoop UI	

Hue
SolrCloud
Phoenix
HBase
Crunch Mahout
Spark
Shark
Streaming
MLlib
GraphX
Impala
HAWQ
SpringXD
MADlib
Hamster
PivotalR
YARN
Tachyon
Principle #6	

Spark: the ecosystem
Maybe it's not so bad	

server 1	

server N
But HDFS/YARN are safe?	

HDFS, Ceph, S3, NAS, etc.	

New	

HDFS	

New	

YARN
What is *really* going on?	

• 2009 Research at UCB, written in Scala	

• 2010 Open Sourced	

• 2013 Accepted into Apache Incubator	

• 2013 Databricks formed ($14M funding)	

• 2014 Becomes TLP with ASF	

• 2014 Spark 1.0 is out	

• 2014 Databricks gets an extra $33M
Bigdata: brought to U by ASF	

• 50% ML traffic	

• 100-200 contributors across 25-35 companies	

• More active than Hadoop	

• Cross-pollination with other TLPs
Principle #7	

Where Hadoop was ‘09
This is how hardening looks
What is Hadoop?	

Hadoop != MR + HDFS
The ecosystem	

• Apache HBase	

• Apache Crunch, Pig, Hive and Phoenix	

• Apache Giraph	

• Apache Oozie	

• Apache Mahout	

• Apache Sqoop and Flume
Principle #8	

Spark: an alternative
backend
Spark is best for cloud
Principle #9	

Memory is expensive
What’s new?	

• True elasticity	

• Resource partitioning	

• Security	

• Data marketplace	

• Multi datacenter deployments
Hadoop Maturity
ETL Offload	

Accommodate massive 
data growth with existing
EDW investments	

Data Lakes	

Unify Unstructured and
Structured Data Access	

Big Data
Apps	

Build analytic-led
applications impacting 
top line revenue	

Data-Driven
Enterprise	

App Dev and Operational
Management on HDFS
Data Architecture
Pivotal HD on Pivotal CF
Ÿ Enterprise PaaS Management System
Ÿ Flexible multi-language ‘buildpack’
architecture
Ÿ Deployed applications enjoy built-in
services
Ÿ On-Premise Hadoop as a Service
Ÿ Single cluster deployment of Pivotal HD
Ÿ Developers instantly bind to shared
Hadoop Clusters
Ÿ Speeds up time-to-value
Pivotal’s view	

Data Science Platform
Tachyon/Gem
Cluster Manager
MR
Application
Stream
Server
MPP
SQL
Data Lake / HDFS / Virtual Storage
GemFireXD	

...ETC	

Hadoop HDFS	

 Isilon	

App Dev / Ops	

MLbase	

Streaming	

Legacy	

Systems	

Legacy	

Data Scientists	

Data Sources	

 End Users	

SparkSQL
Principle #10	

The rumors of my
death…
It will be called Hadoop	

HDFS
Pig
Sqoop Flume
Coordination and
workflow
management	

Zookeeper
Command
Center
ASF Projects	

 FLOSS Projects	

 Pivotal Products	

GemFire with Tachyon
Oozie
MapReduce
Hive
Tez
Giraph
Hadoop UI	

Hue
SolrCloud
Phoenix
HBase
Crunch Mahout
Spark
Shark
Streaming
MLlib
GraphX
Impala
HAWQ
SpringXD
MADlib
Hamster
PivotalR
YARN
Spark recap	

• Is it “Big Data” (Yes)	

• Is it “Hadoop” (No)	

• It’s one of those “in memory” things, right (Yes)	

• JVM, Java, Scala (All)	

• Is it Real or just another shiny technology with
a long, but ultimately small tail (Yes and ?)
A NEW PLATFORM FOR A NEW
ERA
Credits	

• Wikipedia and Dilbert.com	

• Apache Software Foundation	

• Scott Deeg	

• Milind Bhandarkar	

• Susheel Kaushik	

• Mak Gokhale
Questions ?
