Writing your own RDD for
fun and profit
by Paweł Szulc
@rabbitonweb
Writing my own RDD? What for?
● To write your own RDD, you need to understand, to some
extent, the internal mechanics of Apache Spark
● Writing your own RDD will prove you understand them well
● When connecting to external storage, it is reasonable to
create your own RDD for it
Outline
1. The Recap
Outline
1. The Recap
2. The Internals
Outline
1. The Recap
2. The Internals
3. The Fun & Profit
Part I - The Recap
RDD - the definition
RDD - the definition
RDD stands for resilient distributed dataset
RDD - the definition
RDD stands for resilient distributed dataset
Dataset - the initial data comes from some distributed storage
RDD - the definition
RDD stands for resilient distributed dataset
Distributed - stored on nodes across the cluster
Dataset - the initial data comes from some distributed storage
RDD - the definition
RDD stands for resilient distributed dataset
Resilient - if data is lost, it can be recreated
Distributed - stored on nodes across the cluster
Dataset - the initial data comes from some distributed storage
RDD - example
RDD - example
val logs = sc.textFile("hdfs://logs.txt")
RDD - example
val logs = sc.textFile("hdfs://logs.txt")
From Hadoop Distributed
File System
RDD - example
val logs = sc.textFile("hdfs://logs.txt")
From Hadoop Distributed
File System
This is the RDD
RDD - example
val numbers = sc.parallelize(List(1, 2, 3, 4))
Programmatically from a
collection of elements
This is the RDD
RDD - example
val logs = sc.textFile("logs.txt")
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
Creates a new RDD
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
And yet another RDD
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
And yet another RDD
Performance Alert?!?!
RDD - Operations
1. Transformations
a. Map
b. Filter
c. FlatMap
d. Sample
e. Union
f. Intersect
g. Distinct
h. GroupByKey
i. ….
2. Actions
a. Reduce
b. Collect
c. Count
d. First
e. Take(n)
f. TakeSample
g. SaveAsTextFile
h. ….
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
val numberOfErrors = errors.count
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
val numberOfErrors = errors.count
This will trigger the
computation
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
val numberOfErrors = errors.count
This will be the calculated
value (Int)
This will trigger the
computation
Partitions?
Partitions?
A partition represents a subset of data within your
distributed collection.
Partitions?
A partition represents a subset of data within your
distributed collection.
The number of partitions is tightly coupled with the level of
parallelism.
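As a quick, hedged sketch of that relationship (assuming a running SparkContext named sc and an existing logs.txt; the minPartitions argument of textFile is only a hint):

// Inspecting and influencing the number of partitions
val logs = sc.textFile("logs.txt")
println(logs.partitions.length)           // how many partitions Spark chose

val logsIn8 = sc.textFile("logs.txt", 8)  // ask for at least 8 partitions (a hint)
println(logsIn8.partitions.length)

val wider = logs.repartition(16)          // force 16 partitions (triggers a shuffle)
println(wider.partitions.length)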
Partitions evaluation
val counted = sc.textFile(..).count
Partitions evaluation
val counted = sc.textFile(..).count
node 1
node 2
node 3
Pipeline
Pipeline
map
Pipeline
map count
Pipeline
map count
task
But what if...
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
But what if...
filter
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
And now what?
filter
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
And now what?
filter mapValues
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
And now what?
filter
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
Shuffling
filter groupBy
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
Shuffling
filter groupBy mapValues
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
Shuffling
filter groupBy mapValues reduceByKey
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
Shuffling
filter groupBy mapValues reduceByKey
Shuffling
filter groupBy mapValues reduceByKey
task
Shuffling
filter groupBy mapValues reduceByKey
task
Wait for calculations on all partitions before moving on
Shuffling
filter groupBy mapValues reduceByKey
task
Data flying around through cluster
Shuffling
filter groupBy mapValues reduceByKey
task task
Shuffling
filter groupBy mapValues reduceByKey
stage 1
Stage
filter groupBy mapValues reduceByKey
stage 1 stage 2
Stage
filter groupBy mapValues reduceByKey
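A hedged way to see this stage boundary yourself is rdd.toDebugString, which prints the lineage with an indentation step at each shuffle boundary (sketch assumes allShakespeare is an existing RDD[String] of lines):

val startings = allShakespeare
  .filter(_.trim != "")
  .groupBy(_.charAt(0))
  .mapValues(_.size)
  .reduceByKey { case (acc, length) => acc + length }

// The indentation levels in the printed lineage mark the shuffle (stage) boundaries.
println(startings.toDebugString)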
Part II - The Internals
What is an RDD?
What is an RDD?
Resilient Distributed Dataset
What is an RDD?
Resilient Distributed Dataset
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
What is an RDD?
node 1
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
node 2 node 3
What is an RDD?
node 1
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
node 2 node 3
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
What is an RDD?
What is an RDD?
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
What is a partition?
A partition represents a subset of data within your
distributed collection.
What is a partition?
A partition represents a subset of data within your
distributed collection.
override def getPartitions: Array[Partition] = ???
What is a partition?
A partition represents a subset of data within your
distributed collection.
override def getPartitions: Array[Partition] = ???
How this subset is defined depends on the type of the RDD
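As a sketch of how this looks in a custom RDD (RangePartition and RangeRDD are hypothetical names, not part of Spark): each partition is just a small, serializable marker describing its share of the data.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical: each partition covers a slice of the range [0, total)
case class RangePartition(index: Int, from: Int, until: Int) extends Partition

class RangeRDD(sc: SparkContext, total: Int, numPartitions: Int)
  extends RDD[Int](sc, Nil) {  // Nil = no parent RDDs

  override def getPartitions: Array[Partition] = {
    val size = math.ceil(total.toDouble / numPartitions).toInt
    (0 until numPartitions).map { i =>
      RangePartition(i, i * size, math.min((i + 1) * size, total)): Partition
    }.toArray
  }

  // Evaluation (compute) is covered later in the talk.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = ???
}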
example: HadoopRDD
val journal = sc.textFile("hdfs://journal/*")
example: HadoopRDD
val journal = sc.textFile("hdfs://journal/*")
How is a HadoopRDD partitioned?
example: HadoopRDD
val journal = sc.textFile("hdfs://journal/*")
How is a HadoopRDD partitioned?
In a HadoopRDD, a partition corresponds exactly to a file chunk in
HDFS
example: HadoopRDD
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {
  ...
  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    if (inputFormat.isInstanceOf[Configurable]) {
      inputFormat.asInstanceOf[Configurable].setConf(jobConf)
    }
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }
example: MapPartitionsRDD
val journal = sc.textFile("hdfs://journal/*")
val fromMarch = journal.filter {
  case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
}
example: MapPartitionsRDD
val journal = sc.textFile("hdfs://journal/*")
val fromMarch = journal.filter {
  case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
}
How is a MapPartitionsRDD partitioned?
example: MapPartitionsRDD
val journal = sc.textFile("hdfs://journal/*")
val fromMarch = journal.filter {
  case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
}
How is a MapPartitionsRDD partitioned?
A MapPartitionsRDD inherits partition information from its parent
RDD
example: MapPartitionsRDD
class MapPartitionsRDD[U: ClassTag, T: ClassTag](...) extends RDD[U](prev) {
...
override def getPartitions: Array[Partition] = firstParent[T].partitions
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
RDD parent
sc.textFile("hdfs://journal/*")
.groupBy(extractDate _)
.map { case (date, events) => (date, events.size) }
.filter {
case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
}
.take(300)
.foreach(println)
RDD parent
sc.textFile("hdfs://journal/*")
.groupBy(extractDate _)
.map { case (date, events) => (date, events.size) }
.filter {
case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
}
.take(300)
.foreach(println)
RDD parent
sc.textFile()
.groupBy()
.map { }
.filter {
}
.take()
.foreach()
Directed acyclic graph
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph
HadoopRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph
HadoopRDD
ShuffledRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph
HadoopRDD
ShuffledRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph
HadoopRDD
ShuffledRDD MapPartRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Directed acyclic graph
HadoopRDD
ShuffledRDD MapPartRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Two types of parent dependencies:
1. narrow dependency
2. wide dependency
Directed acyclic graph
HadoopRDD
ShuffledRDD MapPartRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
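A hedged way to observe the two dependency types from the shell: every RDD exposes its parents via rdd.dependencies, where narrow dependencies show up as OneToOneDependency and shuffles as ShuffleDependency (sketch assumes a running sc; the sample data is made up).

import org.apache.spark.HashPartitioner

val pairs    = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
val mapped   = pairs.map { case (k, v) => (k, v * 2) }
val shuffled = pairs.partitionBy(new HashPartitioner(4))

// map is narrow: a one-to-one dependency on its parent
println(mapped.dependencies)    // expect something like List(OneToOneDependency@...)
// partitionBy redistributes the data: a shuffle (wide) dependency
println(shuffled.dependencies)  // expect something like List(ShuffleDependency@...)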
Directed acyclic graph
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Stage 1
Stage 2
Directed acyclic graph
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
Stage 1
Stage 2
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { }
Stage 1
Stage 2
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { } .collect()
Stage 1
Stage 2
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { } .collect()
action
Stage 1
Stage 2
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { } .collect()
action
Actions are implemented
using the sc.runJob method
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
): Array[U]
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
rdd: RDD[T],
): Array[U]
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
rdd: RDD[T],
func: Iterator[T] => U,
): Array[U]
Running Job aka materializing DAG
/**
* Return an array that contains all of the elements in this RDD.
*/
def collect(): Array[T] = {
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}
Running Job aka materializing DAG
/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
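Any custom action can be built the same way; a small hedged sketch of a made-up action (totalLength is not part of Spark) that sums the lengths of all strings in an RDD:

import org.apache.spark.rdd.RDD

// Hypothetical custom action built on sc.runJob:
// run one function per partition, get an Array of partial results, combine them locally.
def totalLength(rdd: RDD[String]): Long = {
  val perPartition: Array[Long] =
    rdd.sparkContext.runJob(rdd, (iter: Iterator[String]) => iter.map(_.length.toLong).sum)
  perPartition.sum
}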
Multiple jobs for single action
/**
* Take the first num elements of the RDD. It works by first scanning one partition, and use the
results from that partition to estimate the number of additional partitions needed to satisfy the
limit.
*/
def take(num: Int): Array[T] = {
while (buf.size < num && partsScanned < totalParts) {
(….)
val left = num - buf.size
val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
(….)
res.foreach(buf ++= _.take(num - buf.size))
partsScanned += numPartsToTry
(….)
}
buf.toArray
}
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
rdd: RDD[T],
func: Iterator[T] => U,
): Array[U]
Running Job aka materializing DAG
/**
* :: DeveloperApi ::
* Implemented by subclasses to compute a given partition.
*/
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]
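Continuing the hypothetical RangeRDD sketched earlier: compute turns the partition marker into a lazy Iterator over that partition's share of the data.

// Hypothetical RangeRDD, continued: evaluating a single partition.
override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
  val p = split.asInstanceOf[RangePartition]  // the marker created in getPartitions
  (p.from until p.until).iterator             // produced lazily, element by element
}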
What is an RDD?
An RDD needs to hold 3 chunks of information in
order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
What is an RDD?
An RDD needs to hold 3 + 2 chunks of information
in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
4. data locality
5. partitioner
What is an RDD?
An RDD needs to hold 3 + 2 chunks of information
in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
4. data locality
5. partitioner
Data Locality: HDFS example
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
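In a custom RDD this locality hint is exposed by overriding getPreferredLocations; a hedged sketch for the hypothetical RangeRDD (the hostnames are made up; returning Nil, the default, simply means "no preference"):

// Sketch: tell the scheduler which hosts already hold a partition's data.
override def getPreferredLocations(split: Partition): Seq[String] =
  split.asInstanceOf[RangePartition].index % 3 match {
    case 0 => Seq("node1.example.com")  // hypothetical hostnames
    case 1 => Seq("node2.example.com")
    case _ => Seq("node3.example.com")
  }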
What is an RDD?
An RDD needs to hold 3 + 2 chunks of information
in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
4. data locality
5. partitioner
What is an RDD?
An RDD needs to hold 3 + 2 chunks of information
in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
4. data locality
5. partitioner
Spark performance - shuffle optimization
Spark performance - shuffle optimization
join
Spark performance - shuffle optimization
map groupBy
Spark performance - shuffle optimization
map groupBy join
Spark performance - shuffle optimization
map groupBy join
Optimization: shuffle avoided if
data is already partitioned
Spark performance - shuffle optimization
map groupBy
Spark performance - shuffle optimization
map groupBy map
Spark performance - shuffle optimization
map groupBy map join
Spark performance - shuffle optimization
map groupBy mapValues
Spark performance - shuffle optimization
map groupBy mapValues join
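A hedged sketch of the point these slides make (data and names are made up): after a groupByKey with an explicit Partitioner, mapValues preserves that partitioner while map drops it, so only the mapValues branch lets a later join skip the extra shuffle.

import org.apache.spark.HashPartitioner

val events = sc.parallelize(List(("2015-05-10", 1), ("2015-05-13", 2)))
val byDate = events.groupByKey(new HashPartitioner(8))    // shuffled once, now has a partitioner

val sizes1 = byDate.map { case (d, xs) => (d, xs.size) }  // map drops the partitioner
val sizes2 = byDate.mapValues(_.size)                     // mapValues keeps it

println(sizes1.partitioner)  // expect None      -> a later join would shuffle again
println(sizes2.partitioner)  // expect Some(...) -> a later join can reuse the partitioning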
Part III - The Fun & Profit
It’s all on github!
http://bit.do/scalapolis
RandomRDD
RandomRDD
sc.random()
.take(3)
.foreach(println)
RandomRDD
sc.random()
.take(3)
.foreach(println)
210
-321
21312
RandomRDD
sc.random()
.take(3)
.foreach(println)
RandomRDD
sc.random()
.take(3)
.foreach(println)
sc.random(maxSize = 10, numPartitions = 4)
.take(10)
.foreach(println)
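The real implementation lives in the repository linked above; what follows is only a rough sketch of how such a RandomRDD and the sc.random extension could look (the class body, the implicit wrapper and the defaults are guesses, not the talk's code).

import scala.util.Random
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical sketch: no parent RDD, the data is generated on the fly.
case class RandomPartition(index: Int) extends Partition

class RandomRDD(sc: SparkContext, maxSize: Int, numPartitions: Int)
  extends RDD[Int](sc, Nil) {

  override def getPartitions: Array[Partition] =
    (0 until numPartitions).map(i => RandomPartition(i): Partition).toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val howMany = Random.nextInt(maxSize) + 1   // each partition gets a random size
    Iterator.fill(howMany)(Random.nextInt())
  }
}

// Hypothetical syntax sugar so that sc.random(...) reads like on the slides:
implicit class RandomContext(sc: SparkContext) {
  def random(maxSize: Int = 5, numPartitions: Int = 2): RDD[Int] =
    new RandomRDD(sc, maxSize, numPartitions)
}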
CensorshipRDD
CensorshipRDD
val statement =
sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
CensorshipRDD
val statement =
sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
.censor()
.collect().toList.mkString(" ")
println(statement)
CensorshipRDD
CensorshipRDD
sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
.censor().collectLegal().foreach(println)
CensorshipRDD
sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
.censor().collectLegal().foreach(println)
We
all
know that
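Here censor() and collectLegal() are again the talk's own extensions (see the repository); a rough, hedged guess at their shape: a wrapper RDD that masks a banned phrase, plus a custom action built on runJob that drops it entirely, which is why "Hadoop rocks!" never reaches the output above.

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical sketch, not the talk's actual code.
class CensorshipRDD(prev: RDD[String], banned: Set[String] = Set("Hadoop rocks!"))
  extends RDD[String](prev) {

  override def getPartitions: Array[Partition] = prev.partitions

  // Replace banned phrases with a censored marker.
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    prev.iterator(split, context).map(w => if (banned(w)) "***" else w)

  // Custom action: skip banned phrases entirely instead of masking them.
  def collectLegal(): Array[String] =
    sparkContext.runJob(prev, (it: Iterator[String]) => it.filterNot(banned).toArray).flatten
}

// Hypothetical syntax sugar for the .censor() call used on the slides:
implicit class CensorOps(rdd: RDD[String]) {
  def censor(): CensorshipRDD = new CensorshipRDD(rdd)
}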
Fin
Fin
Paweł Szulc
Fin
Paweł Szulc
paul.szulc@gmail.com
Fin
Paweł Szulc
paul.szulc@gmail.com
Twitter: @rabbitonweb
Fin
Paweł Szulc
paul.szulc@gmail.com
Twitter: @rabbitonweb
http://rabbitonweb.com
Fin
Paweł Szulc
paul.szulc@gmail.com
Twitter: @rabbitonweb
http://rabbitonweb.com
http://github.com/rabbitonweb
Fin
Paweł Szulc
paul.szulc@gmail.com
Twitter: @rabbitonweb
http://rabbitonweb.com
http://github.com/rabbitonweb
http://bit.do/scalapolis
