Scaling with Apache Spark*
or a lesson in unintended consequences
StrangeLoop 2017
*Not suitable for all audiences. Viewer discretion is
advised for individuals who believe vendor
marketing materials.
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● Apache Spark committer (as of January!) :)
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
● On twitter @BooProgrammer
IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark
community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business
applications — http://bigdatauniversity.com
Key statistics:
About 50 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now an Apache Incubator project.
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
What is going to be covered:
● What I think I might know about you
● Spark’s core abstractions for distributed data & computation
○ That wonderful wordcount example as always :)
● Why Spark is designed the way it is
● Re-using Data in Spark and why it needs special considerations
● Why I wish we had a different method for partitioning, and you will too
● How Spark in “other” (R & Python, C# & friends) works
○ Why this doesn’t always summon Cthulhu but definitely has the possibility of doing so
Torsten Reuschling
Who I think you wonderful humans are?
● Nice* people
● Don’t mind pictures of cats
● You might want to scale your Apache Spark jobs
● You might also be curious why Spark is designed the way it is
● Don’t overly mind a grab-bag of topics
● Likely no longer distracted with Pokemon GO :(
What is Spark?
● General purpose data parallel
distributed system
○ With a really nice API including Python :)
● Apache project
● Much faster than Hadoop Map/Reduce
● Good when your problem is too big for a single machine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
When we say distributed we mean...
Why people come to Spark:
Well this MapReduce
job is going to take
16 hours - how long
could it take to learn
Spark?
dougwoods
Why people come to Spark:
My DataFrame won’t fit
in memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
brownpau
What’s Spark history?
● Can be viewed as the descendant of several projects
● Map/Reduce (Google/Hadoop)
○ Except with more primitives & different resilience
● DryadLINQ
○ Different language, more in memory focus
● Flume Java
○ (not to be confused with Apache Flume)
○ Lazy evaluation instead of whole program optimizer
○ Does not compile to MR
● Came out of the UC Berkeley AMPLab; Spark was an early workload for Mesos
Spark specific terms in this talk
● RDD
○ Resilient Distributed Dataset - like a distributed collection. Supports
many of the same operations as Seqs in Scala but automatically
distributed and fault tolerant. Lazily evaluated, and handles faults by
recompute. Any* Java or Kryo serializable object.
● DataFrame
○ Spark DataFrame - not a Pandas or R DataFrame. Distributed,
supports a limited set of operations. Columnar structured, runtime
schema information only. Limited* data types.
● Dataset
○ Compile time typed version of DataFrame (templated)
skdevitt
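To make those terms concrete, a minimal PySpark sketch (the sample data and names here are made up for illustration; Datasets are Scala/Java only, so there is no Python line for them):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("terms").getOrCreate()
sc = spark.sparkContext

# RDD: a distributed collection of (almost) arbitrary objects, lazily evaluated
rdd = sc.parallelize([("boo", 10), ("spark", 3)])

# DataFrame: columnar, schema known only at runtime, a limited (but optimizable) set of operations
df = spark.createDataFrame(rdd, ["word", "count"])
df.printSchema()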
The different pieces of Spark: 2.0+
Apache Spark (core)
● Language APIs: Scala, Java, Python, & R
● SQL & DataFrames
● Streaming & Structured Streaming
● Spark ML & MLlib
● Graph tools: Bagel & GraphX
● Community packages
Design piece #1: Laziness
● In Spark most of our work is done by transformations
● Transformations return new RDDs or DataFrames
representing this data
● The RDD or DataFrame, however, isn’t eagerly evaluated
● RDD & DataFrames are really just “plans” of how to
make the data show up if we force Spark’s hand
● With a side order of immutable for “free”
● tl;dr - the data doesn’t exist until it “has” to
Photo by Dan G
The DAG (a.k.a. the query plan) Susanne Nilsson
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Photo By: Will Keightley
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
No data is read or processed until after this line - the save is an “action” which forces Spark to evaluate the RDD
daniilr
Why laziness is cool (and not)
● Pipelining (can put maps, filter, flatMap together)
● Can do interesting optimizations by delaying work
● We use the DAG to recompute on failure
○ (writing data out to 3 disks on different machines is so last season)
○ Or: the DAG puts the R in Resilient RDD, except DAG doesn’t have an R :(
How it hurts:
● Debugging is confusing
● Re-using data - laziness only sees up to the first action
● Some people really hate immutability
Matthew Hurst
RDD re-use - sadly not magic
● If we know we are going to re-use the RDD, what should we do? (sketch below)
○ If it fits nicely in memory, cache it in memory
○ Persist at another level
■ MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK,
MEMORY_AND_DISK_SER
○ Checkpoint it
● Noisy clusters
○ The replicated (_2) levels & checkpointing can help
● Persist first when checkpointing, so the checkpoint write doesn’t recompute everything
Richard Gillin
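A minimal sketch of those re-use options in PySpark (re-using the sc handle from the earlier sketch; the input path and checkpoint directory are made-up placeholders):

from pyspark import StorageLevel

sc.setCheckpointDir("/tmp/spark-checkpoints")    # hypothetical directory
words = sc.textFile("input.txt").flatMap(lambda line: line.split(" "))

words.cache()                                    # persist at the default (memory) level
# words.persist(StorageLevel.MEMORY_AND_DISK)    # spill to disk instead of recomputing
# words.persist(StorageLevel.MEMORY_AND_DISK_2)  # _2 levels keep a replica, for noisy clusters
words.checkpoint()                               # truncates the lineage; persist first so the
                                                 # checkpoint write doesn't recompute everything

words.count()   # first action materializes (and caches/checkpoints) the data
words.count()   # later actions re-use it instead of re-reading the input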
Design part #2: partitioning
● When reading data …
● When we need to get data to different machines (e.g.
shuffle) we get a special “known” partitioner
● Partitioners in Spark are deterministic on the key (i.e.
for any given key they must always send it to the same
partition)
● Impacts operations like groupByKey but also even just
sortByKey
Helen Olney
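A small sketch of giving a pair RDD a known, deterministic partitioner in PySpark (data made up):

from pyspark.sql import SparkSession
sc = SparkSession.builder.master("local[2]").getOrCreate().sparkContext

pairs = sc.parallelize([(94110, "A"), (10003, "D"), (94110, "E"), (67843, "T")])

# partitionBy hashes each key to a partition: the same key always lands in the
# same partition, which is what the per-key operations below depend on
partitioned = pairs.partitionBy(4)
partitioned.cache()                            # keep the partitioned layout around for re-use

print(partitioned.getNumPartitions())          # 4
print(partitioned.glom().map(len).collect())   # records per partition - skew shows up here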
Key-skew to the anti-rescue… :(
● Keys aren’t evenly distributed
○ Sales by zip code, or records by city, etc.
● groupByKey will explode (it’s pretty easy to break)
● We can have really unbalanced partitions
○ If we have enough key skew, sortByKey could even fail
○ Stragglers (uneven sharding can make some tasks take much longer)
Mitchell Joyce
So what does groupByKey look like?
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)(10003, A, R)
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])
Tomomi
So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn’t require the types to be the same (e.g. computing stats model or similar)
Both allow Spark to pipeline the reduction & skip materializing the per-key list
We also get a map-side reduction (note the difference in shuffled read)
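A sketch of both in PySpark, computing a per-key sum and a per-key (sum, count) without ever building the per-key list (data made up):

from pyspark.sql import SparkSession
sc = SparkSession.builder.master("local[2]").getOrCreate().sparkContext

sales = sc.parallelize([(94110, 3.0), (94110, 1.0), (10003, 2.0), (94110, 5.0)])

# reduceByKey: value type in == value type out, reduction starts map-side
totals = sales.reduceByKey(lambda x, y: x + y)

# aggregateByKey: the accumulator (sum, count) is a different type than the values
sum_counts = sales.aggregateByKey(
    (0.0, 0),                                    # zero value per key
    lambda acc, v: (acc[0] + v, acc[1] + 1),     # fold a value into a partition-local accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))     # merge accumulators across partitions

print(totals.collect())
print(sum_counts.mapValues(lambda p: p[0] / p[1]).collect())   # per-key mean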
Can just the shuffle cause problems?
● Sorting by key can put all of the records in the same partition
● We can run into partition size limits (around 2GB)
● Or just get bad performance
● So that we can handle data like the above, we can add some “junk” to our key
● Common approach in Hadoop MR -- other systems allow some combination of
non-deterministic partitioners OR dynamic splitting during compute.
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
PROTodd Klassy
Shuffle explosions :(
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110, A, B)
(94110, A, C)
(94110, E, F)
(94110, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)(10003, A, R)
(10003, D, E)
javier_artiles
100% less explosions
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, U, R)
(94110, T, R)
(94110_A, A, B)
(94110_A, A, C)
(94110_A, A, R)
(94110_D, D, R)
(94110_U, U, R)
(10003_A, A, R)
(10003_D, D, E)
(67843_T, T, R)
(94110_E, E, R)
(94110_E, E, R)
(94110_E, E, F)
(94110_T, T, R)
Jennifer Williams
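A minimal sketch of that salting trick in PySpark: fold part of the value into the key, so the hot key gets split up while the partitioner stays deterministic (the data mirrors the diagrams above):

from pyspark.sql import SparkSession
sc = SparkSession.builder.master("local[2]").getOrCreate().sparkContext

records = sc.parallelize([
    (94110, ("A", "B")), (94110, ("A", "C")), (10003, ("D", "E")),
    (94110, ("E", "F")), (94110, ("A", "R")), (10003, ("A", "R")),
    (94110, ("D", "R")), (94110, ("E", "R")), (67843, ("T", "R")),
])

# 94110 becomes 94110_A, 94110_D, 94110_E, ... - several much smaller groups
salted = records.map(lambda kv: ("{}_{}".format(kv[0], kv[1][0]), kv[1]))

for key, group in salted.groupByKey().collect():
    print(key, list(group))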
(Shuffle diagrams: groupByKey vs. reduceByKey)
Why (deterministic)* partitioning?
● Splits up our data (and our work)
● Known deterministic partitioners allow for fast joins
● You can even do interesting lookup type things this way
● co-location - yaaay
● Sorting - could split but would have to do more sampling
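One concrete payoff, sketched in PySpark: when both sides of a join were partitioned with the same partitioner (and partition count), matching keys are already co-located, so the join doesn’t have to redistribute everything again (data made up):

from pyspark.sql import SparkSession
sc = SparkSession.builder.master("local[2]").getOrCreate().sparkContext

sales = sc.parallelize([(94110, 3.0), (10003, 2.0)]).partitionBy(8).cache()
cities = sc.parallelize([(94110, "SF"), (10003, "NYC")]).partitionBy(8).cache()

joined = sales.join(cities)   # co-partitioned inputs: matching keys already live together
print(joined.collect())       # (94110, (3.0, 'SF')) and (10003, (2.0, 'NYC')), in some order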
Design part #3: Arbitrary functions
● Super powerful
● Difficult for the optimizer to look inside
● groupByKey + mapValues is effectively opaque (as
discussed)
● But so is filter -- what if I only need X partitions?
● Part of the motivation for DataFrames/Datasets
○ Can use SQL expressions which the optimizer can look at
○ For complicated things we can still do arbitrary work
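A small PySpark illustration of the difference: the Python lambda is a black box to the optimizer, while the equivalent Column expression is something it can reason about (and, for real data sources, push down). Column names are made up:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[2]").getOrCreate()

df = spark.createDataFrame([(94110, 3.0), (10003, 2.0)], ["zip", "amount"])

# Opaque: arbitrary Python - Spark has to run it as-is, after reading everything
opaque = df.rdd.filter(lambda row: row.amount > 2.5)

# Transparent: a declarative expression the optimizer can inspect
transparent = df.filter(df.amount > 2.5)
transparent.explain()    # the filter shows up in the query plan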
key-skew + black boxes == more sadness
● There is a worse way to do WordCount
● We can use the seemingly safe thing called groupByKey
● Then compute the sum
● But since it’s on a slide of “more sadness” we know where
this is going...
_torne
Bad word count :(
words = rdd.flatMap(lambda x: x.split(" "))
wordPairs = words.map(lambda w: (w, 1))
grouped = wordPairs.groupByKey()
counted_words = grouped.mapValues(lambda counts: sum(counts))
counted_words.saveAsTextFile("boop")
Tomomi
Mini “fix”: Datasets (aka DataFrames)
● Still super powerful
● Still allow arbitrary lambdas
● But give you more options to “help” the optimizer
● groupBy returns a special grouped structure
(RelationalGroupedDataset) and offers special aggregates
● Selects can push filters down for us*
● Etc.
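For instance, a hedged PySpark sketch of those special aggregates: groupBy hands back a grouped object, and the named agg calls are declarative enough for Spark to do partial (map-side) aggregation instead of building per-key lists (names made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([("boo", 1), ("boo", 1), ("spark", 1)], ["word", "count"])

counts = df.groupBy("word").agg(F.sum("count").alias("total"))
counts.show()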
Using Datasets to mix functional & relational style
val ds: Dataset[RawPanda] = ...
val happiness = ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
So what was that?
ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
A typed query (specifies the
return type). Without the as[] it
would return a DataFrame
(Dataset[Row])
Traditional functional
reduction:
arbitrary scala code :)
Robert Couse-Baker
And functional style maps:
/** Functional map + Dataset, sums the positive attributes for the pandas */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
How much faster can it be?
Andrew Skudder
But where do DataFrames explode?
● Iterative algorithms - large plans
● Some push downs are sad pandas :(
● Default shuffle size is sometimes too small for big data
(200 partitions)
● Default partition size when reading in is also sad
Adding/working with non-JVM languages
● Spark is written in Scala (runs on the JVM)
● Users want to work in their favourite language
● We also want to support “deep learning” (GPUs, etc.)
○ I live in the bay area, buzzwords =~ rent
● Python, R, C#, etc. all need a way to talk to the JVM
● How expensive could IPC be anyways? :P
○ Also strings are a great format for everything right?
A quick detour into PySpark’s internals
Photo by Bill Ward
Spark in Scala, how does PySpark work?
● Py4J + pickling + magic
○ This can be kind of slow sometimes
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
kristin klein
So what does that look like?
(Diagram: the Python driver talks to the JVM driver over py4j; each JVM worker
(Worker 1 … Worker K) pipes data to and from a Python worker process.)
So how does that impact PySpark?
● Data from Spark worker serialized and piped to Python
worker
○ Multiple iterator-to-iterator transformations are still pipelined :)
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● etc.
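One of the bright spots, sketched: mapPartitions with generator functions stays iterator-to-iterator, so chained transformations stream through a single Python worker per task instead of materializing whole partitions (the parsing logic is made up):

from pyspark.sql import SparkSession
sc = SparkSession.builder.master("local[2]").getOrCreate().sparkContext

def parse(lines):             # iterator in, iterator out - never holds the whole partition
    for line in lines:
        yield line.strip().lower()

def keep_interesting(lines):  # also iterator-to-iterator, pipelined with parse
    return (line for line in lines if "spark" in line)

lines = sc.parallelize(["Spark is lazy ", "barking at computers"])
print(lines.mapPartitions(parse).mapPartitions(keep_interesting).collect())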
And back to DataFrames…:
Andrew Skudder
*Note: do not compare absolute #s with previous graph -
different dataset sizes because I forgot to write it down when I
made the first one.
Andrew Skudder
*Arrow: possibly the future. I really hope so. Spark 2.3 and beyond!
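On Spark 2.3+ the Arrow-backed vectorized UDFs look roughly like this (a minimal sketch; the column and function names are made up, and it needs pandas & pyarrow on the workers):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["happiness"])

# The UDF gets a whole pandas Series per batch (shipped via Arrow) instead of one
# pickled row at a time - that batching is where most of the speedup comes from
@pandas_udf("double", PandasUDFType.SCALAR)
def double_it(s):
    return s * 2

df.select(double_it(df.happiness)).show()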
Patches Welcome?
● For most of these not really
○ Hard to fix core design changes incrementally
● SPIPs more welcome (w/ proof of concept code if you
want folks to read them)
○ Possibly thesis proposals as well :p
● Also building other systems that Spark can use (like
Apache Arrow)
Spark Videos
● Apache Spark Youtube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
Paul Anderson
PLZ test (Spark Testing Resources)
● Libraries
○ Scala: spark-testing-base (scalacheck & unit), sscheck (scalacheck),
example-spark (unit)
○ Java: spark-testing-base (unit)
○ Python: spark-testing-base (unittest2), pyspark.test (pytest)
● Strata San Jose Talk (up on YouTube)
● Blog posts
○ Unit Testing Spark with Java by Jesse Anderson
○ Making Apache Spark Testing Easy with Spark Testing Base
○ Unit testing Apache Spark with py.test
raider of gin
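If you want something minimal before reaching for those libraries, a local-mode pytest fixture is enough to get started (a sketch, not spark-testing-base’s API):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
    yield session
    session.stop()

def test_word_count(spark):
    lines = spark.sparkContext.parallelize(["boo boo", "spark"])
    counts = dict(lines.flatMap(lambda x: x.split(" "))
                       .map(lambda w: (w, 1))
                       .reduceByKey(lambda x, y: x + y)
                       .collect())
    assert counts == {"boo": 2, "spark": 1}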
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore :p
Cats love it!*
http://bit.ly/hkHighPerfSpark
Stephen Woods
*Or at least the box it comes in. No returns please.
And some upcoming talks:
● Spark Summit EU (Dublin, October)
● Big Data Spain (Madrid, November)
● Bee Scala (Ljubljana, November)
● Strata Singapore (Singapore, December)
● ScalaX (London, December)
● Linux Conf AU (Sydney, January)
● Know of interesting conferences/webinar things that
should be on my radar? Let me know!
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I need to give a testing talk next
month, help a “friend” out.
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
