Scaling with Apache Spark*
or a lesson in unintended consequences
StrangeLoop 2017
*Not suitable for all audiences. Viewer discretion is
advised for individuals who believe vendor
marketing materials.
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● Apache Spark committer (as of January!) :)
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
● On twitter @BooProgrammer
IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark
community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business
applications — http://bigdatauniversity.com
Key statistics:
About 50 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now an Apache Incubator project.
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
What is going to be covered:
● What I think I might know about you
● Spark’s core abstractions for distributed data & computation
○ That wonderful wordcount example as always :)
● Why Spark is designed the way it is
● Re-using Data in Spark and why it needs special considerations
● Why I wish we had a different method for partitioning, and you will too
● How Spark in “other” (R & Python, C# & friends) works
○ Why this doesn’t always summon Cthulhu but definitely has the possibility of doing so
Torsten Reuschling
Who I think you wonderful humans are?
● Nice* people
● Don’t mind pictures of cats
● You might want to scale your Apache Spark jobs
● You might also be curious why Spark is designed the way it is
● Don’t overly mind a grab-bag of topics
● Likely no longer distracted with Pokemon GO :(
What is Spark?
● General purpose data parallel
distributed system
○ With a really nice API including Python :)
● Apache project
● Much faster than Hadoop Map/Reduce
● Good when your problem is too big for a single machine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
When we say distributed we mean...
Why people come to Spark:
Well this MapReduce
job is going to take
16 hours - how long
could it take to learn
Spark?
dougwoods
Why people come to Spark:
My DataFrame won’t fit
in memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
brownpau
What’s Spark history?
● Can be viewed as the descendant of several projects
● Map/Reduce (Google/Hadoop)
○ Except with more primitives & different resilience
● DryadLINQ
○ Different language, more in memory focus
● Flume Java
○ (not to be confused with Apache Flume)
○ Lazy evaluation instead of whole program optimizer
○ Does not compile to MR
● Came out of the UC Berkeley AMPLab; Spark was an early workload for Mesos
Spark specific terms in this talk
● RDD
○ Resilient Distributed Dataset - like a distributed collection. Supports
many of the same operations as Seqs in Scala but automatically
distributed and fault tolerant. Lazily evaluated, and handles faults by
recompute. Any* Java or Kryo serializable object.
● DataFrame
○ Spark DataFrame - not a Pandas or R DataFrame. Distributed,
supports a limited set of operations. Columnar structured, runtime
schema information only. Limited* data types.
● Dataset
○ Compile time typed version of DataFrame (templated)
skdevitt
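To make those terms concrete, a minimal PySpark sketch (the sample data and names here are made up for illustration; Datasets are Scala/Java only, so there is no Python line for them):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("terms").getOrCreate()
sc = spark.sparkContext

# RDD: a distributed collection of (almost) arbitrary objects, lazily evaluated
rdd = sc.parallelize([("boo", 10), ("spark", 3)])

# DataFrame: columnar, schema known only at runtime, a limited (but optimizable) set of operations
df = spark.createDataFrame(rdd, ["word", "count"])
df.printSchema()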
The different pieces of Spark: 2.0+
Apache Spark (core)
● Language APIs: Scala, Java, Python, & R
● SQL & DataFrames
● Streaming & Structured Streaming
● Spark ML & MLlib
● Graph tools: Bagel & GraphX
● Community packages
Design piece #1: Laziness
● In Spark most of our work is done by transformations
● Transformations return new RDDs or DataFrames
representing this data
● The RDD or DataFrame, however, isn’t eagerly evaluated
● RDD & DataFrames are really just “plans” of how to
make the data show up if we force Spark’s hand
● With a side order of immutable for “free”
● tl;dr - the data doesn’t exist until it “has” to
Photo by Dan G
The DAG (a.k.a. the query plan) Susanne Nilsson
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Photo By: Will Keightley
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
No data is read or processed until after this line - the save is an “action” which forces Spark to evaluate the RDD
daniilr
Why laziness is cool (and not)
● Pipelining (can put maps, filter, flatMap together)
● Can do interesting optimizations by delaying work
● We use the DAG to recompute on failure
○ (writing data out to 3 disks on different machines is so last season)
○ Or: the DAG puts the R in Resilient RDD, except DAG doesn’t have an R :(
How it hurts:
● Debugging is confusing
● Re-using data - laziness only sees up to the first action
● Some people really hate immutability
Matthew Hurst
RDD re-use - sadly not magic
● If we know we are going to re-use the RDD, what should we do? (sketch below)
○ If it fits nicely in memory, cache it in memory
○ Persist at another level
■ MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK,
MEMORY_AND_DISK_SER
○ Checkpoint it
● Noisy clusters
○ The replicated (_2) levels & checkpointing can help
● Persist first when checkpointing, so the checkpoint write doesn’t recompute everything
Richard Gillin
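A minimal sketch of those re-use options in PySpark (re-using the sc handle from the earlier sketch; the input path and checkpoint directory are made-up placeholders):

from pyspark import StorageLevel

sc.setCheckpointDir("/tmp/spark-checkpoints")    # hypothetical directory
words = sc.textFile("input.txt").flatMap(lambda line: line.split(" "))

words.cache()                                    # persist at the default (memory) level
# words.persist(StorageLevel.MEMORY_AND_DISK)    # spill to disk instead of recomputing
# words.persist(StorageLevel.MEMORY_AND_DISK_2)  # _2 levels keep a replica, for noisy clusters
words.checkpoint()                               # truncates the lineage; persist first so the
                                                 # checkpoint write doesn't recompute everything

words.count()   # first action materializes (and caches/checkpoints) the data
words.count()   # later actions re-use it instead of re-reading the input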
Design part #2: partitioning
● When reading data …
● When we need to get data to different machines (e.g.
shuffle) we get a special “known” partitioner
● Partitioners in Spark are deterministic on the key (i.e.
for any given key they must always send it to the same
partition)
● Impacts operations like groupByKey but also even just
sortByKey
Helen Olney
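A small sketch of giving a pair RDD a known, deterministic partitioner in PySpark (data made up):

from pyspark.sql import SparkSession
sc = SparkSession.builder.master("local[2]").getOrCreate().sparkContext

pairs = sc.parallelize([(94110, "A"), (10003, "D"), (94110, "E"), (67843, "T")])

# partitionBy hashes each key to a partition: the same key always lands in the
# same partition, which is what the per-key operations below depend on
partitioned = pairs.partitionBy(4)
partitioned.cache()                            # keep the partitioned layout around for re-use

print(partitioned.getNumPartitions())          # 4
print(partitioned.glom().map(len).collect())   # records per partition - skew shows up here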
Key-skew to the anti-rescue… :(
● Keys aren’t evenly distributed
○ Sales by zip code, or records by city, etc.
● groupByKey will explode (it’s pretty easy to break)
● We can have really unbalanced partitions
○ If we have enough key skew, sortByKey could even fail
○ Stragglers (uneven sharding can make some tasks take much longer)
Mitchell Joyce
So what does groupByKey look like?
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)(10003, A, R)
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])
Tomomi
So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn’t require the types to be the same (e.g. computing stats model or similar)
Both allow Spark to pipeline the reduction & skip materializing the per-key list
We also get a map-side reduction (note the difference in shuffled read)
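A sketch of both in PySpark, computing a per-key sum and a per-key (sum, count) without ever building the per-key list (data made up):

from pyspark.sql import SparkSession
sc = SparkSession.builder.master("local[2]").getOrCreate().sparkContext

sales = sc.parallelize([(94110, 3.0), (94110, 1.0), (10003, 2.0), (94110, 5.0)])

# reduceByKey: value type in == value type out, reduction starts map-side
totals = sales.reduceByKey(lambda x, y: x + y)

# aggregateByKey: the accumulator (sum, count) is a different type than the values
sum_counts = sales.aggregateByKey(
    (0.0, 0),                                    # zero value per key
    lambda acc, v: (acc[0] + v, acc[1] + 1),     # fold a value into a partition-local accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))     # merge accumulators across partitions

print(totals.collect())
print(sum_counts.mapValues(lambda p: p[0] / p[1]).collect())   # per-key mean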
Can just the shuffle cause problems?
● Sorting by key can put all of the records in the same partition
● We can run into partition size limits (around 2GB)
● Or just get bad performance
● So that we can handle data like the above, we can add some “junk” to our key
● Common approach in Hadoop MR -- other systems allow some combination of
non-deterministic partitioners OR dynamic splitting during compute.
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
PROTodd Klassy
Shuffle explosions :(
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110, A, B)
(94110, A, C)
(94110, E, F)
(94110, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)(10003, A, R)
(10003, D, E)
javier_artiles
100% less explosions
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, U, R)
(94110, T, R)
(94110_A, A, B)
(94110_A, A, C)
(94110_A, A, R)
(94110_D, D, R)
(94110_U, U, R)
(10003_A, A, R)
(10003_D, D, E)
(67843_T, T, R)
(94110_E, E, R)
(94110_E, E, R)
(94110_E, E, F)
(94110_T, T, R)
Jennifer Williams
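A minimal sketch of that salting trick in PySpark: fold part of the value into the key, so the hot key gets split up while the partitioner stays deterministic (the data mirrors the diagrams above):

from pyspark.sql import SparkSession
sc = SparkSession.builder.master("local[2]").getOrCreate().sparkContext

records = sc.parallelize([
    (94110, ("A", "B")), (94110, ("A", "C")), (10003, ("D", "E")),
    (94110, ("E", "F")), (94110, ("A", "R")), (10003, ("A", "R")),
    (94110, ("D", "R")), (94110, ("E", "R")), (67843, ("T", "R")),
])

# 94110 becomes 94110_A, 94110_D, 94110_E, ... - several much smaller groups
salted = records.map(lambda kv: ("{}_{}".format(kv[0], kv[1][0]), kv[1]))

for key, group in salted.groupByKey().collect():
    print(key, list(group))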
(Shuffle diagrams: groupByKey vs. reduceByKey)
Why (deterministic)* partitioning?
● Splits up our data (and our work)
● Known deterministic partitioners allow for fast joins
● You can even do interesting lookup type things this way
● co-location - yaaay
● Sorting - could split but would have to do more sampling
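One concrete payoff, sketched in PySpark: when both sides of a join were partitioned with the same partitioner (and partition count), matching keys are already co-located, so the join doesn’t have to redistribute everything again (data made up):

from pyspark.sql import SparkSession
sc = SparkSession.builder.master("local[2]").getOrCreate().sparkContext

sales = sc.parallelize([(94110, 3.0), (10003, 2.0)]).partitionBy(8).cache()
cities = sc.parallelize([(94110, "SF"), (10003, "NYC")]).partitionBy(8).cache()

joined = sales.join(cities)   # co-partitioned inputs: matching keys already live together
print(joined.collect())       # (94110, (3.0, 'SF')) and (10003, (2.0, 'NYC')), in some order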
Design part #3: Arbitrary functions
● Super powerful
● Difficult for the optimizer to look inside
● groupByKey + mapValues is effectively opaque (as
discussed)
● But so is filter -- what if I only need X partitions?
● Part of the motivation for DataFrames/Datasets
○ Can use SQL expressions which the optimizer can look at
○ For complicated things we can still do arbitrary work
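A small PySpark illustration of the difference: the Python lambda is a black box to the optimizer, while the equivalent Column expression is something it can reason about (and, for real data sources, push down). Column names are made up:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[2]").getOrCreate()

df = spark.createDataFrame([(94110, 3.0), (10003, 2.0)], ["zip", "amount"])

# Opaque: arbitrary Python - Spark has to run it as-is, after reading everything
opaque = df.rdd.filter(lambda row: row.amount > 2.5)

# Transparent: a declarative expression the optimizer can inspect
transparent = df.filter(df.amount > 2.5)
transparent.explain()    # the filter shows up in the query plan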
key-skew + black boxes == more sadness
● There is a worse way to do WordCount
● We can use the seemingly safe thing called groupByKey
● Then compute the sum
● But since it’s on a slide of “more sadness” we know where
this is going...
_torne
Bad word count :(
words = rdd.flatMap(lambda x: x.split(" "))
wordPairs = words.map(lambda w: (w, 1))
grouped = wordPairs.groupByKey()
counted_words = grouped.mapValues(lambda counts: sum(counts))
counted_words.saveAsTextFile("boop")
Tomomi
Mini “fix”: Datasets (aka DataFrames)
● Still super powerful
● Still allow arbitrary lambdas
● But give you more options to “help” the optimizer
● groupBy returns a special grouped structure
(RelationalGroupedDataset) and offers special aggregates
● Selects can push filters down for us*
● Etc.
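For instance, a hedged PySpark sketch of those special aggregates: groupBy hands back a grouped object, and the named agg calls are declarative enough for Spark to do partial (map-side) aggregation instead of building per-key lists (names made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([("boo", 1), ("boo", 1), ("spark", 1)], ["word", "count"])

counts = df.groupBy("word").agg(F.sum("count").alias("total"))
counts.show()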
Using Datasets to mix functional & relational style
val ds: Dataset[RawPanda] = ...
val happiness = ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
So what was that?
ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
A typed query (specifies the
return type). Without the as[] it
would return a DataFrame
(Dataset[Row])
Traditional functional
reduction:
arbitrary scala code :)
Robert Couse-Baker
And functional style maps:
/** Functional map + Dataset, sums the positive attributes for the pandas */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
How much faster can it be?
Andrew Skudder
But where do DataFrames explode?
● Iterative algorithms - large plans
● Some push downs are sad pandas :(
● Default shuffle size is sometimes too small for big data
(200 partitions)
● Default partition size when reading in is also sad
Adding/working with non-JVM languages
● Spark is written in Scala (runs on the JVM)
● Users want to work in their favourite language
● We also want to support “deep learning” (GPUs, etc.)
○ I live in the bay area, buzzwords =~ rent
● Python, R, C#, etc. all need a way to talk to the JVM
● How expensive could IPC be anyways? :P
○ Also strings are a great format for everything right?
A quick detour into PySpark’s internals
Photo by Bill Ward
Spark in Scala, how does PySpark work?
● Py4J + pickling + magic
○ This can be kind of slow sometimes
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
kristin klein
So what does that look like?
(Diagram: the Python driver talks to the JVM driver over py4j; each JVM worker
(Worker 1 … Worker K) pipes data to and from a Python worker process.)
So how does that impact PySpark?
● Data from Spark worker serialized and piped to Python
worker
○ Multiple iterator-to-iterator transformations are still pipelined :)
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● etc.
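One of the bright spots, sketched: mapPartitions with generator functions stays iterator-to-iterator, so chained transformations stream through a single Python worker per task instead of materializing whole partitions (the parsing logic is made up):

from pyspark.sql import SparkSession
sc = SparkSession.builder.master("local[2]").getOrCreate().sparkContext

def parse(lines):             # iterator in, iterator out - never holds the whole partition
    for line in lines:
        yield line.strip().lower()

def keep_interesting(lines):  # also iterator-to-iterator, pipelined with parse
    return (line for line in lines if "spark" in line)

lines = sc.parallelize(["Spark is lazy ", "barking at computers"])
print(lines.mapPartitions(parse).mapPartitions(keep_interesting).collect())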
And back to DataFrames…:
Andrew Skudder
*Note: do not compare absolute #s with previous graph -
different dataset sizes because I forgot to write it down when I
made the first one.
Andrew Skudder
*Arrow: possibly the future. I really hope so. Spark 2.3 and beyond!
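On Spark 2.3+ the Arrow-backed vectorized UDFs look roughly like this (a minimal sketch; the column and function names are made up, and it needs pandas & pyarrow on the workers):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["happiness"])

# The UDF gets a whole pandas Series per batch (shipped via Arrow) instead of one
# pickled row at a time - that batching is where most of the speedup comes from
@pandas_udf("double", PandasUDFType.SCALAR)
def double_it(s):
    return s * 2

df.select(double_it(df.happiness)).show()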
Patches Welcome?
● For most of these not really
○ Hard to fix core design changes incrementally
● SPIPs more welcome (w/ proof of concept code if you
want folks to read them)
○ Possibly thesis proposals as well :p
● Also building other systems that Spark can use (like
Apache Arrow)
Spark Videos
● Apache Spark Youtube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
Paul Anderson
PLZ test (Spark Testing Resources)
● Libraries
○ Scala: spark-testing-base (scalacheck & unit), sscheck (scalacheck),
example-spark (unit)
○ Java: spark-testing-base (unit)
○ Python: spark-testing-base (unittest2), pyspark.test (pytest)
● Strata San Jose Talk (up on YouTube)
● Blog posts
○ Unit Testing Spark with Java by Jesse Anderson
○ Making Apache Spark Testing Easy with Spark Testing Base
○ Unit testing Apache Spark with py.test
raider of gin
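If you want something minimal before reaching for those libraries, a local-mode pytest fixture is enough to get started (a sketch, not spark-testing-base’s API):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
    yield session
    session.stop()

def test_word_count(spark):
    lines = spark.sparkContext.parallelize(["boo boo", "spark"])
    counts = dict(lines.flatMap(lambda x: x.split(" "))
                       .map(lambda w: (w, 1))
                       .reduceByKey(lambda x, y: x + y)
                       .collect())
    assert counts == {"boo": 2, "spark": 1}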
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore :p
Cats love it!*
http://bit.ly/hkHighPerfSpark
Stephen Woods
*Or at least the box it comes in. No returns please.
And some upcoming talks:
● Spark Summit EU (Dublin, October)
● Big Data Spain (Madrid, November)
● Bee Scala (Ljubljana, November)
● Strata Singapore (Singapore, December)
● ScalaX (London, December)
● Linux Conf AU (Sydney, January)
● Know of interesting conferences/webinar things that
should be on my radar? Let me know!
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I need to give a testing talk next
month, help a “friend” out.
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
