Writing your own RDD for
fun and profit
by Paweł Szulc
@rabbitonweb
Writing my own RDD? What for?
● To write your own RDD, you need to understand, to some
extent, the internal mechanics of Apache Spark
● Writing your own RDD will prove you understand them well
● When connecting to external storage, it is reasonable to
create your own RDD for it
Outline
1. The Recap
Outline
1. The Recap
2. The Internals
Outline
1. The Recap
2. The Internals
3. The Fun & Profit
Part I - The Recap
RDD - the definition
RDD - the definition
RDD stands for resilient distributed dataset
RDD - the definition
RDD stands for resilient distributed dataset
Dataset - the initial data comes from some distributed storage
RDD - the definition
RDD stands for resilient distributed dataset
Distributed - stored on nodes across the cluster
Dataset - the initial data comes from some distributed storage
RDD - the definition
RDD stands for resilient distributed dataset
Resilient - if data is lost, it can be recreated
Distributed - stored on nodes across the cluster
Dataset - the initial data comes from some distributed storage
RDD - example
RDD - example
val logs = sc.textFile("hdfs://logs.txt")
RDD - example
val logs = sc.textFile("hdfs://logs.txt")
From Hadoop Distributed
File System
RDD - example
val logs = sc.textFile("hdfs://logs.txt")
From Hadoop Distributed
File System
This is the RDD
RDD - example
val numbers = sc.parallelize(List(1, 2, 3, 4))
Programmatically from a
collection of elements
This is the RDD
RDD - example
val logs = sc.textFile("logs.txt")
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
Creates a new RDD
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
And yet another RDD
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
And yet another RDD
Performance Alert?!?!
RDD - Operations
1. Transformations
a. Map
b. Filter
c. FlatMap
d. Sample
e. Union
f. Intersect
g. Distinct
h. GroupByKey
i. ….
2. Actions
a. Reduce
b. Collect
c. Count
d. First
e. Take(n)
f. TakeSample
g. SaveAsTextFile
h. ….
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
val numberOfErrors = errors.count
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
val numberOfErrors = errors.count
This will trigger the
computation
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
val numberOfErrors = errors.count
This will be the calculated
value (Int)
This will trigger the
computation
Partitions?
Partitions?
A partition represents a subset of data within your
distributed collection.
Partitions?
A partition represents a subset of data within your
distributed collection.
The number of partitions is tightly coupled with the level of
parallelism.
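As a quick, hedged sketch of that relationship (assuming a running SparkContext named sc and an existing logs.txt; the minPartitions argument of textFile is only a hint):

// Inspecting and influencing the number of partitions
val logs = sc.textFile("logs.txt")
println(logs.partitions.length)           // how many partitions Spark chose

val logsIn8 = sc.textFile("logs.txt", 8)  // ask for at least 8 partitions (a hint)
println(logsIn8.partitions.length)

val wider = logs.repartition(16)          // force 16 partitions (triggers a shuffle)
println(wider.partitions.length)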
Partitions evaluation
val counted = sc.textFile(..).count
Partitions evaluation
val counted = sc.textFile(..).count
node 1
node 2
node 3
Pipeline
Pipeline
map
Pipeline
map count
Pipeline
map count
task
But what if...
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
But what if...
filter
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
And now what?
filter
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
And now what?
filter mapValues
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
And now what?
filter
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
Shuffling
filter groupBy
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
Shuffling
filter groupBy mapValues
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
Shuffling
filter groupBy mapValues reduceByKey
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
Shuffling
filter groupBy mapValues reduceByKey
Shuffling
filter groupBy mapValues reduceByKey
task
Shuffling
filter groupBy mapValues reduceByKey
task
Wait for calculations on all partitions before moving on
Shuffling
filter groupBy mapValues reduceByKey
task
Data flying around through cluster
Shuffling
filter groupBy mapValues reduceByKey
task task
Shuffling
filter groupBy mapValues reduceByKey
stage 1
Stage
filter groupBy mapValues reduceByKey
stage 1 stage 2
Stage
filter groupBy mapValues reduceByKey
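A hedged way to see this stage boundary yourself is rdd.toDebugString, which prints the lineage with an indentation step at each shuffle boundary (sketch assumes allShakespeare is an existing RDD[String] of lines):

val startings = allShakespeare
  .filter(_.trim != "")
  .groupBy(_.charAt(0))
  .mapValues(_.size)
  .reduceByKey { case (acc, length) => acc + length }

// The indentation levels in the printed lineage mark the shuffle (stage) boundaries.
println(startings.toDebugString)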
Part II - The Internals
What is an RDD?
What is an RDD?
Resilient Distributed Dataset
What is an RDD?
Resilient Distributed Dataset
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
What is an RDD?
node 1
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
node 2 node 3
What is an RDD?
node 1
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
node 2 node 3
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
What is an RDD?
What is an RDD?
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
What is a partition?
A partition represents a subset of data within your
distributed collection.
What is a partition?
A partition represents a subset of data within your
distributed collection.
override def getPartitions: Array[Partition] = ???
What is a partition?
A partition represents a subset of data within your
distributed collection.
override def getPartitions: Array[Partition] = ???
How this subset is defined depends on the type of the RDD
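As a sketch of how this looks in a custom RDD (RangePartition and RangeRDD are hypothetical names, not part of Spark): each partition is just a small, serializable marker describing its share of the data.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical: each partition covers a slice of the range [0, total)
case class RangePartition(index: Int, from: Int, until: Int) extends Partition

class RangeRDD(sc: SparkContext, total: Int, numPartitions: Int)
  extends RDD[Int](sc, Nil) {  // Nil = no parent RDDs

  override def getPartitions: Array[Partition] = {
    val size = math.ceil(total.toDouble / numPartitions).toInt
    (0 until numPartitions).map { i =>
      RangePartition(i, i * size, math.min((i + 1) * size, total)): Partition
    }.toArray
  }

  // Evaluation (compute) is covered later in the talk.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = ???
}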
example: HadoopRDD
val journal = sc.textFile("hdfs://journal/*")
example: HadoopRDD
val journal = sc.textFile("hdfs://journal/*")
How is a HadoopRDD partitioned?
example: HadoopRDD
val journal = sc.textFile("hdfs://journal/*")
How is a HadoopRDD partitioned?
In a HadoopRDD, a partition corresponds exactly to a file chunk in
HDFS
example: HadoopRDD
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {
  ...
  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    if (inputFormat.isInstanceOf[Configurable]) {
      inputFormat.asInstanceOf[Configurable].setConf(jobConf)
    }
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }
example: MapPartitionsRDD
val journal = sc.textFile("hdfs://journal/*")
val fromMarch = journal.filter {
  case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
}
example: MapPartitionsRDD
val journal = sc.textFile("hdfs://journal/*")
val fromMarch = journal.filter {
  case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
}
How is a MapPartitionsRDD partitioned?
example: MapPartitionsRDD
val journal = sc.textFile("hdfs://journal/*")
val fromMarch = journal.filter {
  case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
}
How is a MapPartitionsRDD partitioned?
A MapPartitionsRDD inherits partition information from its parent
RDD
example: MapPartitionsRDD
class MapPartitionsRDD[U: ClassTag, T: ClassTag](...) extends RDD[U](prev) {
...
override def getPartitions: Array[Partition] = firstParent[T].partitions
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
RDD parent
sc.textFile("hdfs://journal/*")
.groupBy(extractDate _)
.map { case (date, events) => (date, events.size) }
.filter {
case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
}
.take(300)
.foreach(println)
RDD parent
sc.textFile("hdfs://journal/*")
.groupBy(extractDate _)
.map { case (date, events) => (date, events.size) }
.filter {
case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
}
.take(300)
.foreach(println)
RDD parent
sc.textFile()
.groupBy()
.map { }
.filter {
}
.take()
.foreach()
Directed acyclic graph
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph
HadoopRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph
HadoopRDD
ShuffledRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph
HadoopRDD
ShuffledRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph
HadoopRDD
ShuffledRDD MapPartRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph
HadoopRDD
ShuffledRDD MapPartRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Two types of parent dependencies:
1. narrow dependency
2. wide dependency
Directed acyclic graph
HadoopRDD
ShuffledRDD MapPartRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
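A hedged way to observe the two dependency types from the shell: every RDD exposes its parents via rdd.dependencies, where narrow dependencies show up as OneToOneDependency and shuffles as ShuffleDependency (sketch assumes a running sc; the sample data is made up).

import org.apache.spark.HashPartitioner

val pairs    = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
val mapped   = pairs.map { case (k, v) => (k, v * 2) }
val shuffled = pairs.partitionBy(new HashPartitioner(4))

// map is narrow: a one-to-one dependency on its parent
println(mapped.dependencies)    // expect something like List(OneToOneDependency@...)
// partitionBy redistributes the data: a shuffle (wide) dependency
println(shuffled.dependencies)  // expect something like List(ShuffleDependency@...)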
Directed acyclic graph
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Stage 1
Stage 2
Directed acyclic graph
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
Stage 1
Stage 2
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { }
Stage 1
Stage 2
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { } .collect()
Stage 1
Stage 2
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { } .collect()
action
Stage 1
Stage 2
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { } .collect()
action
Actions are implemented
using the sc.runJob method
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
): Array[U]
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
rdd: RDD[T],
): Array[U]
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
rdd: RDD[T],
func: Iterator[T] => U,
): Array[U]
Running Job aka materializing DAG
/**
* Return an array that contains all of the elements in this RDD.
*/
def collect(): Array[T] = {
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}
Running Job aka materializing DAG
/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
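Any custom action can be built the same way; a small hedged sketch of a made-up action (totalLength is not part of Spark) that sums the lengths of all strings in an RDD:

import org.apache.spark.rdd.RDD

// Hypothetical custom action built on sc.runJob:
// run one function per partition, get an Array of partial results, combine them locally.
def totalLength(rdd: RDD[String]): Long = {
  val perPartition: Array[Long] =
    rdd.sparkContext.runJob(rdd, (iter: Iterator[String]) => iter.map(_.length.toLong).sum)
  perPartition.sum
}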
Multiple jobs for single action
/**
* Take the first num elements of the RDD. It works by first scanning one partition, and use the
results from that partition to estimate the number of additional partitions needed to satisfy the
limit.
*/
def take(num: Int): Array[T] = {
while (buf.size < num && partsScanned < totalParts) {
(….)
val left = num - buf.size
val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
(….)
res.foreach(buf ++= _.take(num - buf.size))
partsScanned += numPartsToTry
(….)
}
buf.toArray
}
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
rdd: RDD[T],
func: Iterator[T] => U,
): Array[U]
Running Job aka materializing DAG
/**
* :: DeveloperApi ::
* Implemented by subclasses to compute a given partition.
*/
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]
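Continuing the hypothetical RangeRDD sketched earlier: compute turns the partition marker into a lazy Iterator over that partition's share of the data.

// Hypothetical RangeRDD, continued: evaluating a single partition.
override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
  val p = split.asInstanceOf[RangePartition]  // the marker created in getPartitions
  (p.from until p.until).iterator             // produced lazily, element by element
}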
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
What is an RDD?
An RDD needs to hold 3 + 2 chunks of information
in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
4. data locality
5. partitioner
What is an RDD?
An RDD needs to hold 3 + 2 chunks of information
in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
4. data locality
5. partitioner
Data Locality: HDFS example
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
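In a custom RDD this locality hint is exposed by overriding getPreferredLocations; a hedged sketch for the hypothetical RangeRDD (the hostnames are made up; returning Nil, the default, simply means "no preference"):

// Sketch: tell the scheduler which hosts already hold a partition's data.
override def getPreferredLocations(split: Partition): Seq[String] =
  split.asInstanceOf[RangePartition].index % 3 match {
    case 0 => Seq("node1.example.com")  // hypothetical hostnames
    case 1 => Seq("node2.example.com")
    case _ => Seq("node3.example.com")
  }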
What is an RDD?
An RDD needs to hold 3 + 2 chunks of information
in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
4. data locality
5. partitioner
What is an RDD?
An RDD needs to hold 3 + 2 chunks of information
in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
4. data locality
5. partitioner
Spark performance - shuffle optimization
Spark performance - shuffle optimization
join
Spark performance - shuffle optimization
map groupBy
Spark performance - shuffle optimization
map groupBy join
Spark performance - shuffle optimization
map groupBy join
Optimization: shuffle avoided if
data is already partitioned
Spark performance - shuffle optimization
map groupBy
Spark performance - shuffle optimization
map groupBy map
Spark performance - shuffle optimization
map groupBy map join
Spark performance - shuffle optimization
map groupBy mapValues
Spark performance - shuffle optimization
map groupBy mapValues join
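A hedged sketch of the point these slides make (data and names are made up): after a groupByKey with an explicit Partitioner, mapValues preserves that partitioner while map drops it, so only the mapValues branch lets a later join skip the extra shuffle.

import org.apache.spark.HashPartitioner

val events = sc.parallelize(List(("2015-05-10", 1), ("2015-05-13", 2)))
val byDate = events.groupByKey(new HashPartitioner(8))    // shuffled once, now has a partitioner

val sizes1 = byDate.map { case (d, xs) => (d, xs.size) }  // map drops the partitioner
val sizes2 = byDate.mapValues(_.size)                     // mapValues keeps it

println(sizes1.partitioner)  // expect None      -> a later join would shuffle again
println(sizes2.partitioner)  // expect Some(...) -> a later join can reuse the partitioning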
Part III - The Fun & Profit
It’s all on github!
http://bit.do/scalapolis
RandomRDD
RandomRDD
sc.random()
.take(3)
.foreach(println)
RandomRDD
sc.random()
.take(3)
.foreach(println)
210
-321
21312
RandomRDD
sc.random()
.take(3)
.foreach(println)
RandomRDD
sc.random()
.take(3)
.foreach(println)
sc.random(maxSize = 10, numPartitions = 4)
.take(10)
.foreach(println)
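The real implementation lives in the repository linked above; what follows is only a rough sketch of how such a RandomRDD and the sc.random extension could look (the class body, the implicit wrapper and the defaults are guesses, not the talk's code).

import scala.util.Random
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical sketch: no parent RDD, the data is generated on the fly.
case class RandomPartition(index: Int) extends Partition

class RandomRDD(sc: SparkContext, maxSize: Int, numPartitions: Int)
  extends RDD[Int](sc, Nil) {

  override def getPartitions: Array[Partition] =
    (0 until numPartitions).map(i => RandomPartition(i): Partition).toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val howMany = Random.nextInt(maxSize) + 1   // each partition gets a random size
    Iterator.fill(howMany)(Random.nextInt())
  }
}

// Hypothetical syntax sugar so that sc.random(...) reads like on the slides:
implicit class RandomContext(sc: SparkContext) {
  def random(maxSize: Int = 5, numPartitions: Int = 2): RDD[Int] =
    new RandomRDD(sc, maxSize, numPartitions)
}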
CensorshipRDD
CensorshipRDD
val statement =
sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
CensorshipRDD
val statement =
sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
.censor()
.collect().toList.mkString(" ")
println(statement)
CensorshipRDD
CensorshipRDD
sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
.censor().collectLegal().foreach(println)
CensorshipRDD
sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
.censor().collectLegal().foreach(println)
We
all
know that
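Here censor() and collectLegal() are again the talk's own extensions (see the repository); a rough, hedged guess at their shape: a wrapper RDD that masks a banned phrase, plus a custom action built on runJob that drops it entirely, which is why "Hadoop rocks!" never reaches the output above.

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical sketch, not the talk's actual code.
class CensorshipRDD(prev: RDD[String], banned: Set[String] = Set("Hadoop rocks!"))
  extends RDD[String](prev) {

  override def getPartitions: Array[Partition] = prev.partitions

  // Replace banned phrases with a censored marker.
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    prev.iterator(split, context).map(w => if (banned(w)) "***" else w)

  // Custom action: skip banned phrases entirely instead of masking them.
  def collectLegal(): Array[String] =
    sparkContext.runJob(prev, (it: Iterator[String]) => it.filterNot(banned).toArray).flatten
}

// Hypothetical syntax sugar for the .censor() call used on the slides:
implicit class CensorOps(rdd: RDD[String]) {
  def censor(): CensorshipRDD = new CensorshipRDD(rdd)
}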
Fin
Fin
Paweł Szulc
Fin
Paweł Szulc
paul.szulc@gmail.com
Fin
Paweł Szulc
paul.szulc@gmail.com
Twitter: @rabbitonweb
Fin
Paweł Szulc
paul.szulc@gmail.com
Twitter: @rabbitonweb
http://rabbitonweb.com
Fin
Paweł Szulc
paul.szulc@gmail.com
Twitter: @rabbitonweb
http://rabbitonweb.com
http://github.com/rabbitonweb
Fin
Paweł Szulc
paul.szulc@gmail.com
Twitter: @rabbitonweb
http://rabbitonweb.com
http://github.com/rabbitonweb
http://bit.do/scalapolis
