Optimizing Large-Scale Graph Applications with Apache Spark: 4-5x Performance Improvements
Agenda
© 2020 PayPal Inc. Confidential and proprietary.
• Challenges
• Our Lessons Learned
• Improving the scalability of large graph computation
• Optimization & enhancement in the production environment
• Learning Summary
Challenges
The main challenges we are facing:
• Large graph with data skew in nature
• 2+ billion vertices
• 100+ billion edges
• Degrees: avg 110, max 2+ million
• Strict SLA but various limitations in production
• Limited resources
• Various production guidelines
• Dedicated pool, but shared common services (e.g., the NameNode)
Our Lessons Learned
Use Case #1: Community detection
SCALABILITY
• Using connected components to group the communities
• Reference paper: "Connected Components in MapReduce and Beyond"

Sample undirected graph with vertices 1-6. Finding the connected component yields the pairs (1,2), (1,3), (1,4), (1,5), (1,6), i.e., community (1).
The data skew in nature caused the "bucket effect" & OOM
SCALABILITY
Sample illustration

Each iteration runs the following steps:
1. Make it directed: for every pair (u, v), also emit (v, u).
2. Group the pairs by starting node.
3. Find the smallest node in each group.
4. Generate new pairs by linking each node to the smallest node in its group.
5. Dedup.

Iteration #1
• Input pairs: (1,6), (2,6), (2,5), (2,4), (2,3); after making directed, also (6,1), (6,2), (5,2), (4,2), (3,2)
• Grouped by starting node: (1, [6]), (2, [3,4,5,6]), (3, [2]), (4, [2]), (5, [2]), (6, [1,2])
• New pairs after linking to the smallest node and dedup: (6,1), (6,2), (2,1), (3,2), (4,2), (5,2)
• In the Reducer: find connected components, identify the unique representative vertex within the community → intermediate graph 1

Iteration #2
• Input pairs: (6,1), (6,2), (2,1), (3,2), (4,2), (5,2); after making directed, also (1,6), (2,6), (1,2), (2,3), (2,4), (2,5)
• Grouped by starting node: (1, [6,2]), (2, [1,3,4,5,6]), (3, [2]), (4, [2]), (5, [2]), (6, [1,2])
• New pairs after linking to the smallest node and dedup: (6,1), (2,1), (3,1), (4,1), (5,1), (3,2), (4,2), (5,2) → intermediate graph 2
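The iteration described above can be sketched in a few lines. The following is an illustrative single-machine version of the pair-generation loop, not the Spark/MapReduce job itself; the function names are my own, and convergence is only checked for the small sample here.

```python
def cc_iteration(pairs):
    """One iteration of the pair-generation step illustrated above."""
    # 1. Make the pair list directed: emit both (u, v) and (v, u).
    directed = set()
    for u, v in pairs:
        directed.add((u, v))
        directed.add((v, u))
    # 2. Group by starting node.
    groups = {}
    for u, v in directed:
        groups.setdefault(u, set()).add(v)
    # 3-4. Link every node (including the group key) to the smallest node
    # in its group.
    new_pairs = set()
    for key, members in groups.items():
        group = members | {key}
        smallest = min(group)
        for node in group:
            if node != smallest:
                new_pairs.add((node, smallest))
    # 5. Dedup happens naturally because new_pairs is a set.
    return new_pairs

def connected_components(edges):
    """Iterate until the pair set stops changing."""
    pairs = set(edges)
    while True:
        nxt = cc_iteration(pairs)
        if nxt == pairs:
            return pairs
        pairs = nxt
```

On the sample graph, this converges to every vertex linked to representative 1, matching the community (1) result on the earlier slide.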
The data skew in nature caused the "bucket effect" & OOM
SCALABILITY
Sample illustration – Cont.

Iteration #3
• Input pairs: (6,1), (2,1), (3,1), (4,1), (5,1), (3,2), (4,2), (5,2); after making directed, also (1,6), (1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5)
• Grouped by starting node: (1, [2,3,4,5,6]), (2, [1,3,4,5]), (3, [1,2]), (4, [1,2]), (5, [1,2]), (6, [1])
• New pairs after linking to the smallest node and dedup: (6,1), (5,1), (4,1), (3,1), (2,1)
• The Reducer finds one connected component: its id is 1 and its members are [1,2,3,4,5,6]

Why this breaks at scale:
• The size of the connected components increases significantly in each iteration.
• This causes the "bucket effect": a few slow Reduce tasks holding huge keys block the whole stage.
• Keeping the connected components in memory causes OOM in some Reducers, for example when 50,000,000+ nodes are connected.
Our approach to resolve the "bucket effect" & OOM
SCALABILITY

The fix has three parts: (1) separate keys, (2) split huge keys, (3) spill to disk.

1. Separate huge and normal keys.
• Normal keys, e.g., (2,1), (3,1), (3,2), (4,1), (4,2), (5,1), (5,2), (6,1), are processed as introduced before: group by starting node, find the smallest node in each group, explode the map to rows, dedup. This yields (2,1), (3,1), (4,1), (5,1), (6,1).
• Huge keys, e.g., (1,2), (1,3), (1,4), (1,5), (1,6), go through the steps below.
2. Split huge keys: find the min for each huge key, then divide the key by adding a random number as a prefix. For key 1 this yields, e.g., (01,6), (01,2), (11,3), (11,4), (11,5).
3. Spill to disk:
• Group by the salted key.
• Spill the list to an mmap file if the list length of a single key exceeds a threshold; keep the remaining list in memory.
• Keep the min value of the original key in each group. E.g., (11, ([file1], [5], 1)) with [3,4] spilled into file1, and (01, ([file2], [], 1)) with [2,6] spilled into file2; the min value of the original key is 1.
• Read the list of files and the in-memory list, then generate new pairs: (2,1), (3,1), (4,1), (5,1), (6,1).
4. Merge & dedup with the results of the normal keys: (2,1), (3,1), (4,1), (5,1), (6,1).
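The "split huge keys" step amounts to salting. Here is a minimal single-machine sketch of the idea, assuming a hypothetical NUM_SALTS prefix count; in the real job the salted groups would be shuffled to different reducers instead of living in one dict.

```python
import random

NUM_SALTS = 4  # assumed number of salt prefixes

def salt_pairs(pairs, huge_keys):
    """Prefix each huge key with a random salt so no single group is huge."""
    for key, value in pairs:
        if key in huge_keys:
            yield (random.randrange(NUM_SALTS), key), value
        else:
            yield (None, key), value

def min_per_original_key(pairs, huge_keys):
    # Group by the salted key (in Spark this would be a shuffle).
    groups = {}
    for salted_key, value in salt_pairs(pairs, huge_keys):
        groups.setdefault(salted_key, []).append(value)
    # Fold the per-sub-key minimums back into one minimum per original key,
    # including the key itself as a candidate.
    result = {}
    for (_, key), values in groups.items():
        candidate = min(values + [key])
        result[key] = min(result.get(key, key), candidate)
    return result
```

Because each sub-key's minimum is combined afterwards, the final minimum per original key is the same regardless of how the random salting scattered the values.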
Our Lesson & Learn of the scalability
v Don’t blame Spark when you see OOM
v Elegant memory usage is the KING
v Inevitable data skew, but scalability can be achieved
v Split huge key
v Spill to disk when necessary
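The "spill to disk when necessary" idea can be sketched as a value list that flushes to a temp file once it passes a threshold. This is an assumed simplification: the production version described above spills to an mmap file, whereas this sketch uses plain temp files.

```python
import os
import tempfile

class SpillableList:
    """Accumulates values for one key, spilling to disk past a threshold
    so the reducer never holds a huge list in memory at once."""

    def __init__(self, threshold=1000):
        self.threshold = threshold
        self.in_memory = []
        self.spill_files = []

    def append(self, value):
        self.in_memory.append(value)
        if len(self.in_memory) >= self.threshold:
            self._spill()

    def _spill(self):
        # Flush the in-memory buffer to a new temp file.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "w") as f:
            for v in self.in_memory:
                f.write(f"{v}\n")
        self.spill_files.append(path)
        self.in_memory = []

    def __iter__(self):
        # Stream spilled values from disk first, then the in-memory tail.
        for path in self.spill_files:
            with open(path) as f:
                for line in f:
                    yield int(line)
        yield from self.in_memory
```

Iterating the list streams values back in bounded memory, which is exactly what the pair-generation step needs.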
Our Lessons Learned
Use Case #2: Prepare the graph data using Hive
PERFORMANCE
How to choose the proper join strategy in Spark?

Note:
• Spark 2.3.0
• Joins without joining keys are not covered here.

JoinSelection decision flow:
• canBroadcastByHints? Yes → BroadcastJoin
• Otherwise, canBroadcastBySizes? Yes → BroadcastJoin
• Otherwise, preferSortMergeJoin? No, and canBuildLocalHashMap? Yes → ShuffleHashJoin
• Otherwise → SortMergeJoin

Comparison among the join strategies:
• Broadcast: the smaller table is broadcast; no shuffle.
• LocalHashMap (ShuffleHashJoin): shuffle needed; a hash map is built for the smaller side in the reducer.
• SortMergeJoin: shuffle needed; each partition of both sides is sorted before merging.

Quiz: Broadcast? LocalHashMap? SortMergeJoin?

select * from A inner join B on A.id = B.id where B.dt = '2020-06-25'

• Both Table A and Table B are extra-large tables.
• Table B is partitioned on the date (dt) column; the partition size is around 1 MB.
• The query is an inner join between a small partition of Table B and the extra-large Table A.
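The decision flow above can be written as a tiny function. The flag names here are descriptive stand-ins for the internal checks in Spark 2.3's JoinSelection strategy, not Spark configuration keys.

```python
def select_join(can_broadcast_by_hints, can_broadcast_by_sizes,
                prefer_sort_merge_join, can_build_local_hash_map):
    """Simplified sketch of the Spark 2.3 JoinSelection flow (equi-joins only)."""
    if can_broadcast_by_hints:
        return "BroadcastJoin"
    if can_broadcast_by_sizes:
        return "BroadcastJoin"
    # ShuffleHashJoin is only chosen when sort-merge is not preferred
    # and a local hash map can be built for the smaller side.
    if not prefer_sort_merge_join and can_build_local_hash_map:
        return "ShuffleHashJoin"
    return "SortMergeJoin"
```

For the quiz query, one would expect `can_broadcast_by_sizes` to be true for the 1 MB partition of Table B, but, as the next slides show, the size estimate used in that check tells a different story.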
Quiz revisited, expectation vs. execution: a broadcast join is expected for this query, yet Spark 2.3 actually executes a SortMergeJoin.
Our approach to enable broadcast join with 3x performance improved
© 2020 PayPal Inc. Confidential and proprietary.
select *
from A inner join B on A.id = B.id
where B.dt = ‘2020-06-25’
Parser
‘Project (*)
‘Filter (dt=‘2020-06-25’)
‘Join (A.id=B.id)
‘UnresolvedRelation A ‘UnresolvedRelation B
Project (*)
Filter (dt=‘2020-06-25’)
Join (A.id=B.id)
HiveTableRelation A
(sizeInBytes=1GB)
HiveTableRelation B
(sizeInBytes=1GB)
Analyzer
Optimizer
Project (*)
Join (A.id=B.id)
HiveTableRelation A
(sizeInBytes=1GB)
HiveTableRelation B
(sizeInBytes=1GB)
Filter (dt=‘2020-06-
25’)
Spark Strategies
(including JoinSelection)
ProjectExec (*)
SortMergeJoinExec (A.id=B.id)
HiveTableScanExec A HiveTableScanExec B
FilterExec (dt=‘2020-06-25’)
Before :
PERFORMANCE
Our approach to enable broadcast join with 3x performance improved
© 2020 PayPal Inc. Confidential and proprietary.
After:
select *
from A inner join B on A.id =
B.id
where B.dt = ‘2020-06-25’
Parser
‘Project (*)
‘Filter (dt=‘2020-06-25’)
‘Join (A.id=B.id)
‘UnresolvedRelation A ‘UnresolvedRelation B
Project (*)
Filter (dt=‘2020-06-25’)
Join (A.id=B.id)
HiveTableRelation A
(sizeInBytes=1GB)
HiveTableRelation B
(sizeInBytes=1GB)
Analyzer
Optimizer with rule
PruneHiveTablePartition
s
Project (*)
Join (A.id=B.id)
HiveTableRelation A
(sizeInBytes=1GB)
HiveTableRelation B
(sizeInBytes=1MB)
Filter (dt=‘2020-06-
25’)
Spark Strategies
(including JoinSelection)
ProjectExec (*)
BroadcastHashJoinExec (A.id=B.id)
HiveTableScanExec A HiveTableScanExec B
FilterExec (dt=‘2020-06-25’)
See PR #26805
merged in Spark 3.0
PERFORMANCE
Prune partitions and
update sizeInBytes
sizeInBytes updated
Broadcast join
selected
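A toy sketch of what the rule achieves: once the partition filter prunes the relation, the size estimate reflects only the surviving partitions, so the size-based broadcast check can fire. The function names are hypothetical; the 10 MB threshold mirrors Spark's default `spark.sql.autoBroadcastJoinThreshold`.

```python
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark's default broadcast threshold

def prune_and_estimate(partitions, predicate):
    """partitions: dict of partition value -> size in bytes.
    Returns the surviving partitions and their total size."""
    kept = {k: size for k, size in partitions.items() if predicate(k)}
    return kept, sum(kept.values())

def choose_join(size_a, size_b):
    # Size-based check analogous to canBroadcastBySizes.
    if min(size_a, size_b) <= BROADCAST_THRESHOLD:
        return "BroadcastHashJoin"
    return "SortMergeJoin"
```

Without pruning, Table B's estimate is the whole-table size and the check fails; with pruning, the 1 MB partition passes and the broadcast join is selected.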
Use Case #3: Persist the graph data into Hive tables
PERFORMANCE
Mis-partitioning the column(s) overloaded the HDFS NameNode in production.

Before:

Step 1. DDL auditing process. DDL query example reviewed:
create table default.emp (
  dept_id int,    --1
  emp_id int,     --2
  age int,        --3
  gender string,  --4
  address string  --5
) partitioned by (
  country string, --6
  city string     --7
)

Step 2. Manipulate the data in a DataFrame. DML query example reviewed:
// create a dataframe df1 from the other logic
df1.registerTempTable("tmpTable")
val df2 = sparkSession.sql("""
  select
    department_id as dept_id, --1
    employee_id as emp_id,    --2
    emp_age as age,           --3
    emp_gender as gender,     --4
    cnty as country,          --5
    addr as address,          --6
    city_name as city         --7
  from tmpTable""")
df2.write.insertInto("default.emp")

• The address column has been mismatched to the country column, because insertInto matches columns by position.
• country has 200+ distinct values, while address has 10+ million distinct values.
• Tons of new folders and files were created.
• Platform alerts were generated because the NameNode was continuously overloaded.
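The hazard can be shown with a toy positional mapping. The schemas below copy the slide's example; the helper is illustrative, not Spark's implementation.

```python
# Table schema: data columns first, then partition columns (from the DDL).
table_schema = ["dept_id", "emp_id", "age", "gender", "address",
                "country", "city"]

# DataFrame columns in the order produced by the select above.
df_columns = ["dept_id", "emp_id", "age", "gender", "country",
              "address", "city"]

def match_by_position(df_columns, table_schema):
    """Map each table column to the dataframe column at the same position,
    mimicking the positional behavior of insertInto."""
    return dict(zip(table_schema, df_columns))

mapping = match_by_position(df_columns, table_schema)
# mapping["country"] is "address": every distinct address value becomes
# a new partition directory on HDFS.
```

With 10+ million distinct address values landing in the country partition column, each one creates a partition directory, which is what overloaded the NameNode.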
Our approach: refine the interface explicitly
PERFORMANCE
This avoids column or partition-column mismatches when compiling your code.

After:

Step 1. DDL auditing process. DDL query example reviewed:
create table default.emp (
  dept_id int,    --1
  emp_id int,     --2
  age int,        --3
  gender string,  --4
  address string  --5
) partitioned by (
  country string, --6
  city string     --7
)

Step 2. Manipulate the data in a DataFrame. DML query example reviewed:
// create a dataframe df1 from the other logic
df1.registerTempTable("tmpTable")
val df2 = sparkSession.sql("""
  select
    department_id as dept_id, --1
    employee_id as emp_id,    --2
    emp_age as age,           --3
    emp_gender as gender,     --4
    cnty as country,          --5
    addr as address,          --6
    city_name as city         --7
  from tmpTable""")
df2.write.insertInto("default.emp", true)

def insertInto(tableName: String, byName: Boolean): Unit

If byName is true, Spark will:
1. Match the columns between the data frame and the target table by name.
2. Throw an exception if a column name in the data frame does not exist in the target table.
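The byName behavior can be sketched as follows. This is a hypothetical helper showing the matching semantics, not Spark's actual implementation.

```python
def match_by_name(df_columns, table_schema):
    """Match dataframe columns to the table schema by name.
    Raises if a dataframe column does not exist in the target table,
    instead of silently writing data into the wrong column."""
    missing = [c for c in df_columns if c not in table_schema]
    if missing:
        raise ValueError(f"columns not in target table: {missing}")
    # Reorder the dataframe columns into the table's schema order.
    return [c for c in table_schema if c in df_columns]
```

Swapping 'country' and 'address' in the select no longer matters: both are matched by name, and a typo such as 'addr' fails fast at write time rather than flooding the NameNode with partitions.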
Our Lessons Learned on optimization & enhancement in production
• Nothing is too tiny to optimize for performance.
• A deep understanding of Spark internals is helpful.
• Misuse may lead to serious impact on shared services.
• Explicit interfaces help avoid misuse.
• Overall, the performance has been improved by 4-5x.
Learning Summary
From our practices with real cases in production:
• Use memory elegantly in user code to improve scalability.
• Understanding Spark deeply is helpful for optimization.
• Achieved a performance improvement from 2 days to around 10 hours.
Open to a new learning journey by connecting with you all.
Q & A