Apache Spark: killer or savior of Apache Hadoop?

Roman Shaposhnik

Director of Open Source @Pivotal

(Twitter: @rhatr)

Who’s this guy?

•  Director of Open Source (building a team of OS contributors)

•  Apache Software Foundation guy (Member, VP of Apache Incubator, committer on Hadoop, Giraph, Sqoop, etc.)

•  Used to be root@Cloudera

•  Used to be PHB@Yahoo! (original Hadoop team)

•  Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)

Shameless plug

http://manning.com/martella

40 minutes to figure out

Hadoop vs. Spark


Hadoop++ == Spark


Hadoop + Spark

Long, long time ago…

[Diagram: HDFS and MapReduce. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]

In a blink of an eye

[Diagram: the Hadoop ecosystem. Components shown: HDFS, YARN, Tachyon, MapReduce, Tez, Spark, Shark, Streaming, MLlib, GraphX, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Impala, HAWQ, MADlib, PivotalR, Hamster, GemFire XD, SpringXD, Sqoop, Flume, coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Command Center. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]

A Spark view?

[Diagram: the same ecosystem with only the Spark stack kept. Components shown: HDFS, YARN, Tachyon, Spark, Shark, Streaming, MLlib, GraphX, HBase, Phoenix, SolrCloud, GemFire XD, SpringXD, Sqoop, Flume, coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Command Center. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]

Principle #1

HDFS is the datalake

Your datacenter

[Diagram: server 1 … server N]

Hadoop’s view

[Diagram: MapReduce and HDFS over server 1 through server N]

HDFS: decoupled storage

[Diagram: MR … MR over a shared HDFS]

Anatomy of MapReduce

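[Figure: word count flowing through HDFS → mappers → reducers → HDFS]

Input lines shown: "d a c", "a b c"

Mapper output: (a 1, b 1, c 1), (a 1, c 1), (a 1)

Reducer input: a 1 1 1, b 1, c 1 1

Reducer output: a 3, b 1, c 2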
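
The three phases in the figure above can be sketched with plain Scala collections. This is only an illustration of the map, shuffle and reduce steps on a single machine, not Hadoop code. The object name `MapReduceSketch`, its helper names, and the three sample lines are assumptions chosen so the result matches the counts on the slide (a 3, b 1, c 2).

```scala
object MapReduceSketch {
  // Map phase: each input line is turned into (word, 1) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle: group the emitted pairs by key, which the framework
  // does between the mappers and the reducers.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: sum the grouped ones for each word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))

  def main(args: Array[String]): Unit =
    // Hypothetical input lines, not taken from the slide
    println(wordCount(Seq("a b c", "a c", "a")))
}
```

In a real job the input and output of this pipeline live in HDFS and each phase runs on many servers; the sketch only shows the data flow.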

Principle #2

MR is assembly language

MapReduce 1.0

[Diagram: a JobTracker dispatching tasks (task1 … taskN) to TaskTrackers, each paired with HDFS]

YARN (AKA MR2.0)

[Diagram: Resource Manager, Job Tracker, Task Tracker, and their tasks]

Principle #3

MR: YARN + library

What’s wrong with MR?

Source: UC Berkeley Spark project (just the image)

Principle #4

$ grep -R | awk | sort …

Spark philosophy

• Make life easy for Data Scientists

• Provide well-documented and expressive APIs

• Powerful domain-specific libraries

• Easy integration with storage systems

• Caching to avoid data movement

• Well-defined releases, stable API

Spark innovations

• Resilient Distributed Datasets (RDDs)

• Distributed on a cluster

• Manipulated via parallel operators (map, etc.)

• Automatically rebuilt on failure

• A parallel ecosystem

• A solution to iterative and multi-stage apps

RDDs

warnings = textFile(…).filter(_.contains("warning"))
                      .map(_.split(' ')(1))

HadoopRDD (path = hdfs://) → FilteredRDD (contains…) → MappedRDD (split…)
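
The chain above is what lets an RDD be rebuilt on failure: each dataset remembers its parent and the function applied to it, so a lost result can be recomputed rather than restored from a copy. Below is a toy model of that idea in plain Scala. It is not Spark's implementation; `LineageSketch`, `Source`, `Filtered`, `Mapped` and the sample log lines are all hypothetical names and data made up for this sketch.

```scala
object LineageSketch {
  // A dataset that knows how to recompute itself from its parent (its lineage).
  sealed trait Dataset[A] {
    def compute(): Seq[A]
    def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
  }

  // Root of the lineage: re-reads its source every time it is computed.
  final case class Source[A](read: () => Seq[A]) extends Dataset[A] {
    def compute(): Seq[A] = read()
  }

  // Remembers its parent and the predicate, not the filtered data.
  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def compute(): Seq[A] = parent.compute().filter(p)
  }

  // Remembers its parent and the function, not the mapped data.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): Seq[B] = parent.compute().map(f)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical log lines standing in for a file on HDFS
    val log = Seq("12:00 warning disk", "12:01 info ok", "12:02 warning cpu")
    val warnings = Source(() => log).filter(_.contains("warning")).map(_.split(' ')(1))
    // Nothing is stored: a lost result is rebuilt by replaying the chain.
    println(warnings.compute())
  }
}
```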

Parallel operators

• map, reduce

• sample, filter

• groupBy, reduceByKey

• join, leftOuterJoin, rightOuterJoin

• union, cross
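
A few of the operators listed above, sketched on tiny in-memory data. This assumes a recent Apache Spark release on the classpath and runs in local mode; the object name `OperatorsSketch` and the sample key/value pairs are made up for illustration. Note that the RDD API names the cross product `cartesian`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OperatorsSketch {
  // Runs two pair-RDD operators on invented sample data and returns the results.
  def run(sc: SparkContext): (Map[String, Int], Set[(String, (Int, Option[String]))]) = {
    val hits  = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("c", 1)))
    val names = sc.parallelize(Seq(("a", "alpha"), ("b", "beta")))

    // reduceByKey: combine all values that share a key
    val counts = hits.reduceByKey(_ + _)

    // leftOuterJoin: keep every key from the left side;
    // the right-hand value becomes an Option
    val joined = counts.leftOuterJoin(names)

    (counts.collect().toMap, joined.collect().toSet)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption: run on all cores of the current machine
    val conf = new SparkConf().setMaster("local[*]").setAppName("operators-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Keys with no match on the right side (here "c") survive the left outer join with `None` as their right-hand value.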

How do I use it?

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Principle #5

Memory is the new disk
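
Why this matters for iterative jobs can be shown with a toy model in plain Scala: a dataset that is read ten times either goes back to its source on every pass or keeps the data in memory after the first one. This is not Spark code. The class `Dataset`, its `reads` counter and the `passes` helper are hypothetical; only the method name `cache()` is borrowed from the RDD API.

```scala
object CacheSketch {
  // Wraps an expensive load and counts how often the source is actually read.
  final class Dataset[A](load: () => Seq[A]) {
    var reads = 0
    private var cached: Option[Seq[A]] = None
    private var keepInMemory = false

    // Ask for the data to be kept in memory after the first computation.
    def cache(): this.type = { keepInMemory = true; this }

    def collect(): Seq[A] = cached.getOrElse {
      reads += 1
      val data = load()
      if (keepInMemory) cached = Some(data)
      data
    }
  }

  // An iterative job: n passes over the same input.
  // Returns how many times the source had to be read.
  def passes[A](ds: Dataset[A], n: Int): Int = {
    (1 to n).foreach(_ => ds.collect())
    ds.reads
  }

  def main(args: Array[String]): Unit = {
    println(passes(new Dataset(() => Seq(1, 2, 3)), 10))         // uncached
    println(passes(new Dataset(() => Seq(1, 2, 3)).cache(), 10)) // cached
  }
}
```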

RDDs are the foundation

• SQL

• Graph

• ML

• Streaming

Spark SQL

• Library in Spark Core that models RDDs as relations

• SchemaRDD

• Replaces Shark

• Lightweight with no code from Hive

• Import/Export into different storage formats

• Columnar storage (as in Shark)

Spark Streaming

• Extends Spark to do large-scale stream processing

• Simple, batch-like API with RDDs

• Single semantics for both real-time and high-latency processing

Streaming from Twitter

TwitterUtils.createStream(...)
    .filter(_.getText.contains("Spark"))
    .countByWindow(Seconds(5))
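
The batch-like flavor of this API can be illustrated without a cluster: treat the stream as timestamped records, keep the ones that match, and count them per window. The sketch below uses fixed, non-overlapping windows as a simplification and is not how Spark Streaming is implemented. `WindowSketch`, `Tweet` and `countPerWindow` are hypothetical names, and the timestamps in the usage note are invented.

```scala
object WindowSketch {
  // Hypothetical record type standing in for a tweet
  final case class Tweet(timestampMs: Long, text: String)

  // Count matching tweets per fixed-size window, keyed by window start time.
  def countPerWindow(tweets: Seq[Tweet], windowMs: Long, keyword: String): Map[Long, Int] =
    tweets
      .filter(_.text.contains(keyword))
      .groupBy(t => (t.timestampMs / windowMs) * windowMs)
      .map { case (start, matching) => (start, matching.size) }

  def main(args: Array[String]): Unit = {
    val tweets = Seq(
      Tweet(1000L, "Spark is fast"),
      Tweet(4000L, "Trying Spark"),
      Tweet(6000L, "Hadoop still works"),
      Tweet(7000L, "More Spark")
    )
    // 5-second windows, mirroring Seconds(5) on the slide
    println(countPerWindow(tweets, 5000L, "Spark"))
  }
}
```

The same filter-then-count logic would work on a static log file, which is the point of having one set of semantics for streaming and batch.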

Spark GraphX

• Pregel (BSP) (formerly known as Bagel)

• Graph-centric modeling

• Unification of processing

• No more MR trickery
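
To make the Pregel (BSP) model concrete, here is a small single-machine sketch in plain Scala: in every superstep each vertex sends its label to its neighbours, waits at a barrier, then keeps the smallest label it has seen; the computation halts when nothing changes. This is not GraphX code. `BspSketch` and `components` are hypothetical names, and connected components was picked as the example algorithm only because it is short.

```scala
object BspSketch {
  // Label every vertex with the smallest vertex id in its connected component.
  def components(edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Undirected adjacency list built from the edge list
    val neighbours: Map[Int, Seq[Int]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => (v, es.map(_._2)) }

    @annotation.tailrec
    def superstep(labels: Map[Int, Int]): Map[Int, Int] = {
      // 1. Every vertex sends its current label to its neighbours.
      val messages: Seq[(Int, Int)] =
        labels.toSeq.flatMap { case (v, label) =>
          neighbours.getOrElse(v, Nil).map(n => (n, label))
        }
      val inbox: Map[Int, Seq[Int]] =
        messages.groupBy(_._1).map { case (v, ms) => (v, ms.map(_._2)) }

      // 2. After the barrier, each vertex keeps the minimum of its own
      //    label and the labels it received.
      val next = labels.map { case (v, label) =>
        (v, (label +: inbox.getOrElse(v, Nil)).min)
      }

      // 3. Halt when no vertex changed its label.
      if (next == labels) labels else superstep(next)
    }

    // Every vertex starts with its own id as its label.
    superstep(neighbours.keys.map(v => (v, v)).toMap)
  }

  def main(args: Array[String]): Unit =
    println(components(Seq((1, 2), (2, 3), (4, 5))))
}
```

Expressing the same loop as chained MapReduce jobs would mean one job per superstep with the graph written to disk in between, which is the trickery the slide refers to.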

MLbase

• Machine Learning toolset

• MATLAB for scale-out computing

• Built on Spark MLlib

• Classification, regression, collaborative filtering, etc.

What is really happening?

[Diagram: the full Hadoop ecosystem again. Components shown: HDFS, YARN, Tachyon, MapReduce, Tez, Spark, Shark, Streaming, MLlib, GraphX, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Impala, HAWQ, MADlib, PivotalR, Hamster, GemFire XD, SpringXD, Sqoop, Flume, coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Command Center. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]

Principle #6

Spark: the ecosystem

Maybe it's not so bad

server 1

server N

But HDFS/YARN are safe?

HDFS, Ceph, S3, NAS, etc.

New HDFS

New YARN

What is *really* going on?

• 2009 Research at UCB, written in Scala

• 2010 Open Sourced

• 2013 Accepted into Apache Incubator

• 2013 Databricks formed ($14M funding)

• 2014 Becomes a top-level project (TLP) at the ASF

• 2014 Spark 1.0 is out

• 2014 Databricks gets an extra $33M

Bigdata: brought to U by ASF

• 50% ML traffic

• 100-200 contributors across 25-35 companies

• More active than Hadoop

• Cross-pollination with other TLPs

Principle #7

Where Hadoop was in '09

What is Hadoop?

Hadoop != MR + HDFS

The ecosystem

• Apache HBase

• Apache Crunch, Pig, Hive and Phoenix

• Apache Giraph

• Apache Oozie

• Apache Mahout

• Apache Sqoop and Flume

Principle #8

Spark: an alternative backend

Principle #9

Memory is expensive

What’s new?

• True elasticity

• Resource partitioning

• Security

• Data marketplace

• Multi datacenter deployments

Hadoop Maturity

• ETL Offload: accommodate massive data growth with existing EDW investments

• Data Lakes: unify unstructured and structured data access

• Big Data Apps: build analytic-led applications impacting top-line revenue

• Data-Driven Enterprise: app dev and operational management on HDFS data architecture

Pivotal HD on Pivotal CF

• Enterprise PaaS Management System

• Flexible multi-language ‘buildpack’ architecture

• Deployed applications enjoy built-in services

• On-Premise Hadoop as a Service

• Single cluster deployment of Pivotal HD

• Developers instantly bind to shared Hadoop Clusters

• Speeds up time-to-value

Pivotal’s view

[Diagram: Pivotal's data platform. Labels shown: Data Science Platform, Tachyon/Gem, Cluster Manager, MR, Application, Stream, Server, MPP SQL, SparkSQL, MLbase, Streaming, GemFireXD, Data Lake / HDFS / Virtual Storage (Hadoop HDFS, Isilon, ...ETC), Legacy Systems, Legacy, Data Sources, App Dev / Ops, Data Scientists, End Users.]

Principle #10

The rumors of my death…

It will be called Hadoop

[Diagram: the full ecosystem once more, now with GemFire combined with Tachyon. Components shown: HDFS, YARN, MapReduce, Tez, Spark, Shark, Streaming, MLlib, GraphX, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Impala, HAWQ, MADlib, PivotalR, Hamster, GemFire with Tachyon, SpringXD, Sqoop, Flume, coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Command Center. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]

Spark recap

• Is it “Big Data”? (Yes)

• Is it “Hadoop”? (No)

• It’s one of those “in memory” things, right? (Yes)

• JVM, Java, Scala? (All)

• Is it real or just another shiny technology with a long, but ultimately small tail? (Yes and ?)

Credits

• Wikipedia and Dilbert.com

• Apache Software Foundation

• Scott Deeg

• Milind Bhandarkar

• Susheel Kaushik

• Mak Gokhale
