Migrating ETL Workflow to Spark at Scale in Pinterest
Daniel Dai, Zirui Li
Pinterest Inc
About Us
• Daniel Dai
• Tech Lead at Pinterest
• PMC member for Apache Hive and Pig
• Zirui Li
• Software Engineer at Pinterest Spark Platform Team
• Focus on building Pinterest in-house Spark platform & functionalities
Agenda
▪ Spark @ Pinterest
▪ Cascading/Scalding to Spark Conversion
▪ Technical Challenges
▪ Migration Process
▪ Result and Future Plan
Agenda
• Spark @ Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
We Are on Cloud
• We use AWS
• However, we build our own clusters
  • Avoid vendor lock-in
  • Timely support by our own team
• We store everything on S3
  • Costs less than HDFS
  • HDFS is for temporary storage
[Architecture diagram: shared S3 storage beneath multiple EC2 clusters, each running HDFS and YARN]
Spark Clusters
• We have a couple of Spark clusters
  • From several hundred nodes to 1000+ nodes
  • Spark-only clusters and mixed-use clusters
  • Cross-cluster routing
• R5D instance type for the Spark-only clusters
  • Faster local disk
  • High memory-to-CPU ratio
Spark Versions and Use Cases
• We are running Spark 2.4
• With quite a few internal fixes
• Will migrate to 3.1 this year
• Use cases
• Production use cases
• SparkSQL, PySpark, native Spark via Airflow
• Ad hoc use cases
• SparkSQL via Querybook, PySpark via Jupyter
Migration Plan
• 40% of workloads are already on Spark
  • The number was 12% one year ago
• Migration in progress
  • Hive to SparkSQL
  • Cascading/Scalding to Spark
  • Hadoop Streaming to Spark pipe
[Chart: workload share across Hive, Cascading/Scalding, and Hadoop Streaming — "Where are we?"]
Migration Plan
• Half of the workloads are still on Cascading/Scalding
  • ETL use cases
• Spark future
  • Query engine: Presto/SparkSQL
  • ETL: native Spark
  • Machine learning: PySpark
Agenda
• Spark in Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
Cascading
• Simple DAG
• Only 6 different pipes
• Most logic in UDF
• Each – UDF in map
• Every – UDF in reduce
• Java API
[DAG diagrams — Pattern 1: Source → Each → GroupBy → Every → Sink; Pattern 2: two Source → Each branches → CoGroup → Every → Sink]
Scalding
• Rich set of operators on top of Cascading
• Operators are very similar to Spark RDD
• Scala API
Migration Path
+
▪ UDF
interface is
private
▪ SQL easy to
migrate to
any engine
Recommend if there’s not
many UDFs
SparkSQL
−
PySpark
▪ Suboptimal
performanc
e, especially
for Python
UDF
▪ Rich Python
libraries
available to
use
+ −
Recommended for Machine
Learning only
+
Native Spark
▪ most structured path to enjoin
rich spark syntax
▪ Work for almost all
Cascading/Scalding
applications
Default & Recommended for
general cases
Spark API
• Spark Dataframe/Dataset
  • + Newer & recommended API
  • − Most inputs are thrift sequence files, and encoding/decoding thrift objects to/from a dataframe is slow
  • Recommended only for non-thrift-sequence-file inputs
• RDD
  • + More flexible in handling thrift object serialization/deserialization
  • + Semantically close to Scalding
  • − Older API
  • − Less performant than Dataframe
  • Default choice for the conversion (see the sketch below)
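A minimal sketch of why the RDD API stays close to the thrift sequence-file inputs: read the raw bytes and deserialize straight into the thrift class, with no dataframe encoding step in between. It assumes an existing SparkContext `sc`; the path, `PinEvent`, and `deserializeThrift` are illustrative placeholders, not actual Pinterest code.

import org.apache.hadoop.io.{BytesWritable, NullWritable}

// Read (key, value) pairs of raw bytes from a sequence file and decode each value
// into its thrift class; no Row/Dataframe encoders are involved.
val events = sc.sequenceFile[NullWritable, BytesWritable]("s3://bucket/events")
  .map { case (_, bytes) => deserializeThrift[PinEvent](bytes.copyBytes()) }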
Approach
• Rewrite the application manually
• Reuse most of the Cascading/Scalding library code
  • However, avoid Cascading-specific structures
• Automatic tool to help with result validation & performance tuning
Translate Cascading
• DAG is usually simple
• Most Cascading pipes have a one-to-one mapping to a Spark transformation
// val processedInput: RDD[(String, Token)]
// val tokenFreq: RDD[(String, Double)]
val tokenFreqVar = spark.sparkContext.broadcast(tokenFreq.collectAsMap())
val joined = processedInput.map { t =>
  (t._1, (t._2, tokenFreqVar.value.get(t._1)))
}
Cascading Pipe | Spark RDD Operator | Note
Each | map-side UDF |
Every | reduce-side UDF |
Merge | union |
CoGroup | join / leftOuterJoin / rightOuterJoin / fullOuterJoin |
GroupBy | groupBy / groupByKey | secondary sort might be needed
HashJoin | broadcast join | no native support in RDD; simulate via a broadcast variable
• Complexity is in UDF
UDF Translation
• Semantic difference
  • Cascading UDF: does both filtering & transformation; Java
  • Spark: map + filter; Scala
• Multi-threading
  • Cascading UDF: single-thread model
  • Spark: multi-thread model; worst case, set executor-cores=1
• UDF initialization and cleanup
  • Cascading UDF: class with initialization & cleanup hooks
  • Spark: no init/cleanup hook; use mapPartitions to simulate
.mapPartitions { iter =>
  // init block: expensive per-partition initialization goes here
  val results = iter.map(event => process(event)).toList // process the whole partition
  // cleanup block: runs once the partition has been fully processed
  results.iterator // mapPartitions must return an iterator
}
Translate Scalding
• Most operators have a one-to-one mapping to an RDD operator
• UDFs can be used in Spark without change (a small example follows the table)
Scalding Operator | Spark RDD Operator | Note
map | map |
flatMap | flatMap |
filter | filter |
filterNot | filter | Spark has no filterNot; use filter with a negated condition
groupBy | groupBy |
group | groupByKey |
groupAll | groupBy(t => 1) |
...
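A tiny illustration of the two non-trivial rows above, assuming an existing SparkContext `sc` and a placeholder input path:

val events  = sc.textFile("events")                         // placeholder input
val kept    = events.filter(line => !line.startsWith("#"))  // Scalding filterNot -> negated filter
val grouped = events.groupBy(_ => 1)                        // Scalding groupAll -> everything in one group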
Agenda
• Spark in Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
Secondary Sort
• Use "repartitionAndSortWithinPartitions" in Spark
• There's a gap in semantics: use GroupSortedIterator to fill it (see the sketch after the example)

output = new GroupBy(output, new Fields("user_id"), new Fields("sec_key")); // group key: user_id, sort key: sec_key

Input:
(2, 2), "apple"
(1, 3), "facebook"
(1, 1), "pinterest"
(1, 2), "twitter"
(3, 2), "google"

Cascading (one sorted iterator per group key):
key 1: (1, 1), "pinterest"; (1, 2), "twitter"; (1, 3), "facebook"
key 2: (2, 2), "apple"
key 3: (3, 2), "google"

Spark (one sorted stream per partition):
(1, 1), "pinterest"
(1, 2), "twitter"
(1, 3), "facebook"
(2, 2), "apple"
(3, 2), "google"
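A minimal sketch of the Spark side, assuming rows keyed by (user_id, sec_key): partition by user_id only, but sort within each partition by the full key. GroupSortedIterator (the in-house helper named above, not shown here) would then walk the sorted stream and hand out one iterator per user_id, restoring Cascading's per-group view.

import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, Partitioner}

// Route records by the group key (user_id) only, ignoring the sort key,
// so all rows for a user land in the same partition.
class GroupKeyPartitioner(partitions: Int) extends Partitioner {
  private val hash = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (userId, _) => hash.getPartition(userId)
  }
}

// rows: RDD[((Int, Int), String)], e.g. ((1, 2), "twitter"); the tuple ordering sorts
// by user_id first and sec_key second, giving the secondary sort within each partition.
def secondarySort(rows: RDD[((Int, Int), String)]): RDD[((Int, Int), String)] =
  rows.repartitionAndSortWithinPartitions(new GroupKeyPartitioner(200))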
Accumulators
• Spark accumulators are not accurate
  • Stage retry
  • The same code can run multiple times in different stages
• Solution
  • Deduplicate with stage+partition (see the sketch after the code)
  • persist
val sc = new SparkContext(conf);
val inputRecords = sc.longAccumulator("Input")
val a = sc.textFile("studenttab10k");
val b = a.map(line => line.split("\t"));
val c = b.map { t =>
inputRecords.add(1L)
(t(0), t(1).toInt, t(2).toDouble)
};
val sumScore = c.map(t => t._3).sum()
// c.persist()
c.map { t =>
(t._1, t._3/sumScore)
}.saveAsTextFile("output")
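A hedged sketch of the "deduplicate with stage+partition" idea: an AccumulatorV2 that keeps one count per (stageId, partitionId), so a retried task or stage attempt overwrites its earlier contribution instead of adding to it. This illustrates the approach, not Pinterest's actual implementation; recomputation in a different stage still needs persist (or the earliest-stage trick on the next slide).

import org.apache.spark.TaskContext
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

class DedupLongAccumulator extends AccumulatorV2[Long, Long] {
  // One running count per (stageId, partitionId)
  private val perTask = mutable.HashMap.empty[(Int, Int), Long]

  override def isZero: Boolean = perTask.isEmpty
  override def copy(): DedupLongAccumulator = {
    val acc = new DedupLongAccumulator
    acc.perTask ++= perTask
    acc
  }
  override def reset(): Unit = perTask.clear()

  override def add(v: Long): Unit = {
    val ctx = TaskContext.get() // null on the driver
    val key = if (ctx != null) (ctx.stageId(), ctx.partitionId()) else (-1, -1)
    perTask.update(key, perTask.getOrElse(key, 0L) + v)
  }

  override def merge(other: AccumulatorV2[Long, Long]): Unit = other match {
    case o: DedupLongAccumulator =>
      // Overwrite per key: a retried (stage, partition) replaces its earlier count
      o.perTask.foreach { case (k, v) => perTask.update(k, v) }
    case _ => throw new UnsupportedOperationException("incompatible accumulator")
  }

  override def value: Long = perTask.values.sum
}

// val inputRecords = new DedupLongAccumulator
// sc.register(inputRecords, "Input")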
Accumulator Continue
• Retrieve the accumulator value of the earliest stage (see the listener sketch after the code)
• Exception: the user intentionally uses the same accumulator in different stages
• Example: NUM_OUTPUT_TOKENS
  • Stage 14: 168006868318
  • Stage 21: 336013736636 (exactly double — counted again in a later stage)
val sc = new SparkContext(conf);
val inputRecords = sc.longAccumulator("Input")
val input1 = sc.textFile("input1");
val input1_processed = input1.map { line =>
  inputRecords.add(1L)
  val t = line.split("\t")
  (t(0), (t(1).toInt, t(2).toDouble))
};
val input2 = sc.textFile("input2");
val input2_processed = input2.map { line =>
  inputRecords.add(1L)
  val t = line.split("\t")
  (t(0), (t(1).toInt, t(2).toDouble))
};
input1_processed.join(input2_processed)
.saveAsTextFile("output")
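A hedged sketch of "retrieve the accumulator of the earliest stage": a SparkListener that records each named accumulator's value from the first (lowest-numbered) stage that reports it, ignoring later stages that re-run the same lambda. The class and field names are illustrative, not Pinterest's actual code.

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import scala.collection.mutable

class EarliestStageAccumulators extends SparkListener {
  // accumulator name -> (stageId, value as reported by that stage)
  val earliest = mutable.Map.empty[String, (Int, String)]

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val stageId = stageCompleted.stageInfo.stageId
    stageCompleted.stageInfo.accumulables.values.foreach { acc =>
      for (name <- acc.name; value <- acc.value) {
        earliest.get(name) match {
          case Some((earlierStage, _)) if earlierStage <= stageId => // keep the earlier stage's value
          case _ => earliest.update(name, (stageId, value.toString))
        }
      }
    }
  }
}

// sc.addSparkListener(new EarliestStageAccumulators) // register before running the job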
Accumulator Tab in Spark UI
• SPARK-35197
Profiling
• Visualize flame graphs using Nebula
• Real time
• Ability to segment by stage/task
• Focus only on useful threads
OutputCommitter
• Issues with the default OutputCommitter
  • Slow metadata operations
  • 503 errors
• Netflix s3committer
  • Wrapper for the Spark RDD API, since s3committer only supports the old API
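A hedged sketch of the plumbing involved: for an RDD save through the old Hadoop API, a custom committer class can be supplied via the JobConf. It assumes an existing SparkContext `sc` and a pair RDD; the committer class name is a placeholder for a wrapper built on Netflix's s3committer, not Pinterest's actual class.

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

// Plug a custom old-API committer into saveAsHadoopFile via mapred.output.committer.class.
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("mapred.output.committer.class", "com.example.S3OutputCommitterWrapper") // placeholder
pairRdd.saveAsHadoopFile(            // pairRdd: RDD[(Text, Text)], placeholder
  "s3://bucket/output",              // placeholder output path
  classOf[Text], classOf[Text],
  classOf[TextOutputFormat[Text, Text]],
  jobConf)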
Agenda
• Spark @ Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
Automatic Migration Service (AMS)
• A tool to automate the majority of the migration process
Data Validation
• Row count & checksum comparison
• Create a table around the output
• SparkSQL UDF: CountAndChecksumUdaf (a sketch of the comparison follows)
• − Doesn't work for double/float
• − Doesn't work for arrays if the order is different
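A hedged sketch of the shape of the comparison, assuming an existing SparkSession `spark` and both outputs exposed as tables; CountAndChecksumUdaf is the in-house UDAF named above, and the table and column names are placeholders.

// Same query against the legacy output and the Spark output; validation passes
// only if both rows match exactly (modulo the known double/float and array-order gaps).
val legacy   = spark.sql("SELECT COUNT(*) AS cnt, CountAndChecksumUdaf(col_a, col_b) AS chk FROM legacy_output")
val migrated = spark.sql("SELECT COUNT(*) AS cnt, CountAndChecksumUdaf(col_a, col_b) AS chk FROM spark_output")
val matches  = legacy.collect().sameElements(migrated.collect())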
Source of Uncertainty
• Input depends on the current timestamp
• There's a random number generator in the code
• Rounding differences, which change the outcome of filter-condition tests
• Unstable top results if there's a tie
Performance Tuning
• Collect runtime memory/vcore usage
• Tuning passes if the criteria are met:
  • Runtime reduced
  • Vcore*sec reduced by 20%+
  • Memory increase less than 100%
• Retry with tuned memory/vcore if necessary
Balancing Performance
• Trade-offs
  • More executors: better performance, but costs more
  • More cores per executor: saves memory, but costs more on CPU
• Using dynamic allocation usually saves cost
  • Skew won't cost more with dynamic allocation
• Control parallelism (a minimal example follows)
  • spark.default.parallelism for RDD
  • spark.sql.shuffle.partitions for dataframe/dataset/SparkSQL
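A minimal illustration of the two parallelism knobs above; the values are placeholders, not recommendations.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.default.parallelism", "2000")     // shuffle parallelism for the RDD API
  .config("spark.sql.shuffle.partitions", "2000")  // shuffle partitions for dataframe/dataset/SparkSQL
  .getOrCreate()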
Automatic Migration & Failure Handling
• Automatic migration
  • Automatically pick Spark over Cascading/Scalding at runtime if the conditions are met:
    • Data validation passes
    • Performance optimization passes
• Failure handling
  • Automatically handle failures with handlers where applicable:
    • Configuration incorrectness
    • OutOfMemory
    • ...
  • Manual troubleshooting is needed for other uncaught failures
Agenda
• Spark @ Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
Result
• 40% performance improvement
• 47% cost saving on CPU
• Uses 33% more memory
Future Plan
• Manual conversion for applications that are still evolving
• Spark backend for legacy applications
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
