Optimizing Large-Scale Graph Applications with Apache Spark: 4-5x Performance Improvements
Agenda
© 2020 PayPal Inc. Confidential and proprietary.
• Challenges
• Our Lessons Learned
• Improving the scalability of large graph computation
• Optimization & enhancement in the production environment
• Learning Summary
Challenges
The main challenges we are facing:
• Large graph with data skew in nature
• 2+ billion vertices
• 100+ billion edges
• Degrees: avg 110, max 2+ million
• Strict SLA but various limitations in production
• Limited resources
• Various production guidelines
• Dedicated pool, but shared common services (e.g., the NameNode)
Our Lessons Learned
Use Case #1: Community detection
SCALABILITY
• Using connected components to group the communities
• Reference paper: "Connected Components in MapReduce and Beyond"

Sample undirected graph with vertices 1-6. Finding the connected component yields the pairs (1,2), (1,3), (1,4), (1,5), (1,6), i.e., community (1).
The data skew in nature caused the "bucket effect" & OOM
SCALABILITY
Sample illustration

Each iteration runs the following steps:
1. Make it directed: for every pair (u, v), also emit (v, u).
2. Group the pairs by starting node.
3. Find the smallest node in each group.
4. Generate new pairs by linking each node to the smallest node in its group.
5. Dedup.

Iteration #1
• Input pairs: (1,6), (2,6), (2,5), (2,4), (2,3); after making directed, also (6,1), (6,2), (5,2), (4,2), (3,2)
• Grouped by starting node: (1, [6]), (2, [3,4,5,6]), (3, [2]), (4, [2]), (5, [2]), (6, [1,2])
• New pairs after linking to the smallest node and dedup: (6,1), (6,2), (2,1), (3,2), (4,2), (5,2)
• In the Reducer: find connected components, identify the unique representative vertex within the community → intermediate graph 1

Iteration #2
• Input pairs: (6,1), (6,2), (2,1), (3,2), (4,2), (5,2); after making directed, also (1,6), (2,6), (1,2), (2,3), (2,4), (2,5)
• Grouped by starting node: (1, [6,2]), (2, [1,3,4,5,6]), (3, [2]), (4, [2]), (5, [2]), (6, [1,2])
• New pairs after linking to the smallest node and dedup: (6,1), (2,1), (3,1), (4,1), (5,1), (3,2), (4,2), (5,2) → intermediate graph 2
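The iteration described above can be sketched in a few lines. The following is an illustrative single-machine version of the pair-generation loop, not the Spark/MapReduce job itself; the function names are my own, and convergence is only checked for the small sample here.

```python
def cc_iteration(pairs):
    """One iteration of the pair-generation step illustrated above."""
    # 1. Make the pair list directed: emit both (u, v) and (v, u).
    directed = set()
    for u, v in pairs:
        directed.add((u, v))
        directed.add((v, u))
    # 2. Group by starting node.
    groups = {}
    for u, v in directed:
        groups.setdefault(u, set()).add(v)
    # 3-4. Link every node (including the group key) to the smallest node
    # in its group.
    new_pairs = set()
    for key, members in groups.items():
        group = members | {key}
        smallest = min(group)
        for node in group:
            if node != smallest:
                new_pairs.add((node, smallest))
    # 5. Dedup happens naturally because new_pairs is a set.
    return new_pairs

def connected_components(edges):
    """Iterate until the pair set stops changing."""
    pairs = set(edges)
    while True:
        nxt = cc_iteration(pairs)
        if nxt == pairs:
            return pairs
        pairs = nxt
```

On the sample graph, this converges to every vertex linked to representative 1, matching the community (1) result on the earlier slide.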
The data skew in nature caused the "bucket effect" & OOM
SCALABILITY
Sample illustration – Cont.

Iteration #3
• Input pairs: (6,1), (2,1), (3,1), (4,1), (5,1), (3,2), (4,2), (5,2); after making directed, also (1,6), (1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5)
• Grouped by starting node: (1, [2,3,4,5,6]), (2, [1,3,4,5]), (3, [1,2]), (4, [1,2]), (5, [1,2]), (6, [1])
• New pairs after linking to the smallest node and dedup: (6,1), (5,1), (4,1), (3,1), (2,1)
• The Reducer finds one connected component: its id is 1 and its members are [1,2,3,4,5,6]

Why this breaks at scale:
• The size of the connected components increases significantly in each iteration.
• This causes the "bucket effect": a few slow Reduce tasks holding huge keys block the whole stage.
• Keeping the connected components in memory causes OOM in some Reducers, for example when 50,000,000+ nodes are connected.
Our approach to resolve the "bucket effect" & OOM
SCALABILITY

The fix has three parts: (1) separate keys, (2) split huge keys, (3) spill to disk.

1. Separate huge and normal keys.
• Normal keys, e.g., (2,1), (3,1), (3,2), (4,1), (4,2), (5,1), (5,2), (6,1), are processed as introduced before: group by starting node, find the smallest node in each group, explode the map to rows, dedup. This yields (2,1), (3,1), (4,1), (5,1), (6,1).
• Huge keys, e.g., (1,2), (1,3), (1,4), (1,5), (1,6), go through the steps below.
2. Split huge keys: find the min for each huge key, then divide the key by adding a random number as a prefix. For key 1 this yields, e.g., (01,6), (01,2), (11,3), (11,4), (11,5).
3. Spill to disk:
• Group by the salted key.
• Spill the list to an mmap file if the list length of a single key exceeds a threshold; keep the remaining list in memory.
• Keep the min value of the original key in each group. E.g., (11, ([file1], [5], 1)) with [3,4] spilled into file1, and (01, ([file2], [], 1)) with [2,6] spilled into file2; the min value of the original key is 1.
• Read the list of files and the in-memory list, then generate new pairs: (2,1), (3,1), (4,1), (5,1), (6,1).
4. Merge & dedup with the results of the normal keys: (2,1), (3,1), (4,1), (5,1), (6,1).
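The "split huge keys" step amounts to salting. Here is a minimal single-machine sketch of the idea, assuming a hypothetical NUM_SALTS prefix count; in the real job the salted groups would be shuffled to different reducers instead of living in one dict.

```python
import random

NUM_SALTS = 4  # assumed number of salt prefixes

def salt_pairs(pairs, huge_keys):
    """Prefix each huge key with a random salt so no single group is huge."""
    for key, value in pairs:
        if key in huge_keys:
            yield (random.randrange(NUM_SALTS), key), value
        else:
            yield (None, key), value

def min_per_original_key(pairs, huge_keys):
    # Group by the salted key (in Spark this would be a shuffle).
    groups = {}
    for salted_key, value in salt_pairs(pairs, huge_keys):
        groups.setdefault(salted_key, []).append(value)
    # Fold the per-sub-key minimums back into one minimum per original key,
    # including the key itself as a candidate.
    result = {}
    for (_, key), values in groups.items():
        candidate = min(values + [key])
        result[key] = min(result.get(key, key), candidate)
    return result
```

Because each sub-key's minimum is combined afterwards, the final minimum per original key is the same regardless of how the random salting scattered the values.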
Our Lesson & Learn of the scalability
v Don’t blame Spark when you see OOM
v Elegant memory usage is the KING
v Inevitable data skew, but scalability can be achieved
v Split huge key
v Spill to disk when necessary
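The "spill to disk when necessary" idea can be sketched as a value list that flushes to a temp file once it passes a threshold. This is an assumed simplification: the production version described above spills to an mmap file, whereas this sketch uses plain temp files.

```python
import os
import tempfile

class SpillableList:
    """Accumulates values for one key, spilling to disk past a threshold
    so the reducer never holds a huge list in memory at once."""

    def __init__(self, threshold=1000):
        self.threshold = threshold
        self.in_memory = []
        self.spill_files = []

    def append(self, value):
        self.in_memory.append(value)
        if len(self.in_memory) >= self.threshold:
            self._spill()

    def _spill(self):
        # Flush the in-memory buffer to a new temp file.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "w") as f:
            for v in self.in_memory:
                f.write(f"{v}\n")
        self.spill_files.append(path)
        self.in_memory = []

    def __iter__(self):
        # Stream spilled values from disk first, then the in-memory tail.
        for path in self.spill_files:
            with open(path) as f:
                for line in f:
                    yield int(line)
        yield from self.in_memory
```

Iterating the list streams values back in bounded memory, which is exactly what the pair-generation step needs.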
Our Lessons Learned
Use Case #2: Prepare the graph data using Hive
PERFORMANCE
How to choose the proper join strategy in Spark?

Note:
• Spark 2.3.0
• Joins without joining keys are not covered here.

JoinSelection decision flow:
• canBroadcastByHints? Yes → BroadcastJoin
• Otherwise, canBroadcastBySizes? Yes → BroadcastJoin
• Otherwise, preferSortMergeJoin? No, and canBuildLocalHashMap? Yes → ShuffleHashJoin
• Otherwise → SortMergeJoin

Comparison among the join strategies:
• Broadcast: the smaller table is broadcast; no shuffle.
• LocalHashMap (ShuffleHashJoin): shuffle needed; a hash map is built for the smaller side in the reducer.
• SortMergeJoin: shuffle needed; each partition of both sides is sorted before merging.

Quiz: Broadcast? LocalHashMap? SortMergeJoin?

select * from A inner join B on A.id = B.id where B.dt = '2020-06-25'

• Both Table A and Table B are extra-large tables.
• Table B is partitioned on the date (dt) column; the partition size is around 1 MB.
• The query is an inner join between a small partition of Table B and the extra-large Table A.
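The decision flow above can be written as a tiny function. The flag names here are descriptive stand-ins for the internal checks in Spark 2.3's JoinSelection strategy, not Spark configuration keys.

```python
def select_join(can_broadcast_by_hints, can_broadcast_by_sizes,
                prefer_sort_merge_join, can_build_local_hash_map):
    """Simplified sketch of the Spark 2.3 JoinSelection flow (equi-joins only)."""
    if can_broadcast_by_hints:
        return "BroadcastJoin"
    if can_broadcast_by_sizes:
        return "BroadcastJoin"
    # ShuffleHashJoin is only chosen when sort-merge is not preferred
    # and a local hash map can be built for the smaller side.
    if not prefer_sort_merge_join and can_build_local_hash_map:
        return "ShuffleHashJoin"
    return "SortMergeJoin"
```

For the quiz query, one would expect `can_broadcast_by_sizes` to be true for the 1 MB partition of Table B, but, as the next slides show, the size estimate used in that check tells a different story.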
Quiz revisited, expectation vs. execution: a broadcast join is expected for this query, yet Spark 2.3 actually executes a SortMergeJoin.
Our approach to enable broadcast join with 3x performance improved
© 2020 PayPal Inc. Confidential and proprietary.
select *
from A inner join B on A.id = B.id
where B.dt = ‘2020-06-25’
Parser
‘Project (*)
‘Filter (dt=‘2020-06-25’)
‘Join (A.id=B.id)
‘UnresolvedRelation A ‘UnresolvedRelation B
Project (*)
Filter (dt=‘2020-06-25’)
Join (A.id=B.id)
HiveTableRelation A
(sizeInBytes=1GB)
HiveTableRelation B
(sizeInBytes=1GB)
Analyzer
Optimizer
Project (*)
Join (A.id=B.id)
HiveTableRelation A
(sizeInBytes=1GB)
HiveTableRelation B
(sizeInBytes=1GB)
Filter (dt=‘2020-06-
25’)
Spark Strategies
(including JoinSelection)
ProjectExec (*)
SortMergeJoinExec (A.id=B.id)
HiveTableScanExec A HiveTableScanExec B
FilterExec (dt=‘2020-06-25’)
Before :
PERFORMANCE
Our approach to enable broadcast join with 3x performance improved
© 2020 PayPal Inc. Confidential and proprietary.
After:
select *
from A inner join B on A.id =
B.id
where B.dt = ‘2020-06-25’
Parser
‘Project (*)
‘Filter (dt=‘2020-06-25’)
‘Join (A.id=B.id)
‘UnresolvedRelation A ‘UnresolvedRelation B
Project (*)
Filter (dt=‘2020-06-25’)
Join (A.id=B.id)
HiveTableRelation A
(sizeInBytes=1GB)
HiveTableRelation B
(sizeInBytes=1GB)
Analyzer
Optimizer with rule
PruneHiveTablePartition
s
Project (*)
Join (A.id=B.id)
HiveTableRelation A
(sizeInBytes=1GB)
HiveTableRelation B
(sizeInBytes=1MB)
Filter (dt=‘2020-06-
25’)
Spark Strategies
(including JoinSelection)
ProjectExec (*)
BroadcastHashJoinExec (A.id=B.id)
HiveTableScanExec A HiveTableScanExec B
FilterExec (dt=‘2020-06-25’)
See PR #26805
merged in Spark 3.0
PERFORMANCE
Prune partitions and
update sizeInBytes
sizeInBytes updated
Broadcast join
selected
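A toy sketch of what the rule achieves: once the partition filter prunes the relation, the size estimate reflects only the surviving partitions, so the size-based broadcast check can fire. The function names are hypothetical; the 10 MB threshold mirrors Spark's default `spark.sql.autoBroadcastJoinThreshold`.

```python
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark's default broadcast threshold

def prune_and_estimate(partitions, predicate):
    """partitions: dict of partition value -> size in bytes.
    Returns the surviving partitions and their total size."""
    kept = {k: size for k, size in partitions.items() if predicate(k)}
    return kept, sum(kept.values())

def choose_join(size_a, size_b):
    # Size-based check analogous to canBroadcastBySizes.
    if min(size_a, size_b) <= BROADCAST_THRESHOLD:
        return "BroadcastHashJoin"
    return "SortMergeJoin"
```

Without pruning, Table B's estimate is the whole-table size and the check fails; with pruning, the 1 MB partition passes and the broadcast join is selected.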
Use Case #3: Persist the graph data into Hive tables
PERFORMANCE
Mis-partitioning the column(s) overloaded the HDFS NameNode in production.

Before:

Step 1. DDL auditing process. DDL query example reviewed:
create table default.emp (
  dept_id int,    --1
  emp_id int,     --2
  age int,        --3
  gender string,  --4
  address string  --5
) partitioned by (
  country string, --6
  city string     --7
)

Step 2. Manipulate the data in a DataFrame. DML query example reviewed:
// create a dataframe df1 from the other logic
df1.registerTempTable("tmpTable")
val df2 = sparkSession.sql("""
  select
    department_id as dept_id, --1
    employee_id as emp_id,    --2
    emp_age as age,           --3
    emp_gender as gender,     --4
    cnty as country,          --5
    addr as address,          --6
    city_name as city         --7
  from tmpTable""")
df2.write.insertInto("default.emp")

• The address column has been mismatched to the country column, because insertInto matches columns by position.
• country has 200+ distinct values, while address has 10+ million distinct values.
• Tons of new folders and files were created.
• Platform alerts were generated because the NameNode was continuously overloaded.
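The hazard can be shown with a toy positional mapping. The schemas below copy the slide's example; the helper is illustrative, not Spark's implementation.

```python
# Table schema: data columns first, then partition columns (from the DDL).
table_schema = ["dept_id", "emp_id", "age", "gender", "address",
                "country", "city"]

# DataFrame columns in the order produced by the select above.
df_columns = ["dept_id", "emp_id", "age", "gender", "country",
              "address", "city"]

def match_by_position(df_columns, table_schema):
    """Map each table column to the dataframe column at the same position,
    mimicking the positional behavior of insertInto."""
    return dict(zip(table_schema, df_columns))

mapping = match_by_position(df_columns, table_schema)
# mapping["country"] is "address": every distinct address value becomes
# a new partition directory on HDFS.
```

With 10+ million distinct address values landing in the country partition column, each one creates a partition directory, which is what overloaded the NameNode.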
Our approach: refine the interface explicitly
PERFORMANCE
This avoids column or partition-column mismatches when compiling your code.

After:

Step 1. DDL auditing process. DDL query example reviewed:
create table default.emp (
  dept_id int,    --1
  emp_id int,     --2
  age int,        --3
  gender string,  --4
  address string  --5
) partitioned by (
  country string, --6
  city string     --7
)

Step 2. Manipulate the data in a DataFrame. DML query example reviewed:
// create a dataframe df1 from the other logic
df1.registerTempTable("tmpTable")
val df2 = sparkSession.sql("""
  select
    department_id as dept_id, --1
    employee_id as emp_id,    --2
    emp_age as age,           --3
    emp_gender as gender,     --4
    cnty as country,          --5
    addr as address,          --6
    city_name as city         --7
  from tmpTable""")
df2.write.insertInto("default.emp", true)

def insertInto(tableName: String, byName: Boolean): Unit

If byName is true, Spark will:
1. Match the columns between the data frame and the target table by name.
2. Throw an exception if a column name in the data frame does not exist in the target table.
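The byName behavior can be sketched as follows. This is a hypothetical helper showing the matching semantics, not Spark's actual implementation.

```python
def match_by_name(df_columns, table_schema):
    """Match dataframe columns to the table schema by name.
    Raises if a dataframe column does not exist in the target table,
    instead of silently writing data into the wrong column."""
    missing = [c for c in df_columns if c not in table_schema]
    if missing:
        raise ValueError(f"columns not in target table: {missing}")
    # Reorder the dataframe columns into the table's schema order.
    return [c for c in table_schema if c in df_columns]
```

Swapping 'country' and 'address' in the select no longer matters: both are matched by name, and a typo such as 'addr' fails fast at write time rather than flooding the NameNode with partitions.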
Our Lessons Learned on optimization & enhancement in production
• Nothing is too tiny to optimize for performance.
• A deep understanding of Spark internals is helpful.
• Misuse may lead to serious impact on shared services.
• Explicit interfaces help avoid misuse.
• Overall, the performance has been improved by 4-5x.
Learning Summary
From our practices with real cases in production:
• Use memory elegantly in user code to improve scalability.
• Understanding Spark deeply is helpful for optimization.
• Achieved a performance improvement from 2 days to around 10 hours.
Open to a new learning journey by connecting with you all.
Q & A