Spark + Scikit-Learn -
Performance Tuning
Who am I?
● Kent (施晨揚)
● Passionate about Machine Learning & Big Data
● Father of two
https://www.facebook.com/texib
What? Key Factors That Influence Performance
• Large Raw Data Size (4 Billion Records)
• Large Number of Cookies (40 Million Records)
• Machine Learning Library - Prediction Function Cost
How?
● Spark
o Parallel Computing
o Scalable
o Very Powerful Data Processing Tool
o But its Machine Learning Library ….
● Python Scikit-Learn
o Very Powerful Machine Learning Library
o But parts of it can only use a single core XD
So We Have an Idea!
Data Size
[Figure: size of the Training Data vs. the Prediction Data; labels: "Major Problem", "<~ 100 times"]
Use Python to Prepare Prediction Data
Prepare Data
Train Model
Prepare Prediction Data
About 30 Mins
> 1 Week
Do Prediction
Aggregation - Where Is It Slow?
50%
• Aggregating 4 Billion Rows into 40 Million Cookies Is a Very Time-Consuming Job
Use mapPartitions()
• Instead of Using reduceByKey() with Your Aggregation Logic
• How:
• Step 1: use the DB (Redshift) to prepare the prediction data, ordered by cookie
• Step 2: use mapPartitions locally on each partition to do batch prediction
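A minimal sketch of the per-partition function from step 2, assuming each partition delivers (cookie, value) rows already sorted by cookie and that all rows for one cookie land in the same partition. The name `aggregate_sorted_partition` and the sum aggregation are illustrative assumptions, not the code used in the talk.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted_partition(rows):
    """Aggregate an iterator of (cookie, value) rows sorted by cookie.

    Because the rows are pre-sorted, each cookie's rows are contiguous,
    so one streaming pass is enough and no shuffle is needed.
    """
    for cookie, group in groupby(rows, key=itemgetter(0)):
        yield cookie, sum(value for _, value in group)


# Local stand-in for one pre-sorted partition (hypothetical data).
sorted_rows = [
    ("A", 1), ("A", 3), ("A", 3),
    ("B", 2), ("B", 4), ("B", 2),
    ("C", 2), ("C", 2), ("C", 1),
    ("D", 1), ("D", 2), ("D", 1),
]
result = list(aggregate_sorted_partition(iter(sorted_rows)))
print(result)

# With Spark, the same function would be passed to the RDD, e.g.:
#   aggregated = sorted_rdd.mapPartitions(aggregate_sorted_partition)
```

If a cookie's rows can straddle a partition boundary, the per-partition results for that cookie would still need one final merge.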
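Use reduceByKey()
Input pairs, three per partition:
(A,1) (B,2) (A,3) | (A,3) (B,4) (C,2) | (B,2) (C,2) (D,1) | (C,1) (D,2) (D,1)
After combining within each partition:
(A,4) (B,2) | (A,3) (B,4) (C,2) | (B,2) (C,2) (D,1) | (C,1) (D,3)
After merging partitions in pairs:
(A,7) (B,6) (C,2) | (B,2) (C,3) (D,4)
Final result:
(A,7) (B,8) (C,5) (D,4)
24 hours!!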
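The reduceByKey flow above can be imitated in plain Python to see why it costs more: every partition first combines its own keys, then the partial results have to be merged across partitions. This is only a local simulation under the assumption of four partitions with three pairs each; the helper names are hypothetical.

```python
from collections import Counter
from functools import reduce

# The twelve input pairs from the diagram, split into four partitions.
partitions = [
    [("A", 1), ("B", 2), ("A", 3)],
    [("A", 3), ("B", 4), ("C", 2)],
    [("B", 2), ("C", 2), ("D", 1)],
    [("C", 1), ("D", 2), ("D", 1)],
]


def combine_within_partition(pairs):
    """Sum values per key inside a single partition."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def merge(left, right):
    """Merge two partial results; in Spark this step moves data between nodes."""
    merged = Counter(left)
    for key, value in right.items():
        merged[key] += value
    return merged


partial = [combine_within_partition(p) for p in partitions]
final = reduce(merge, partial, Counter())
print(dict(final))

# In PySpark the whole flow is a single call, e.g.:
#   totals = pairs_rdd.reduceByKey(lambda a, b: a + b)
```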
Use DB to Pre-Sort and mapPartitions
Input pairs, pre-sorted by key so each key's rows are contiguous:
(A,1) (A,3) (A,3) (B,2) (B,4) (B,2) (C,2) (C,2) (C,1) (D,1) (D,2) (D,1)
Result:
(A,7) (B,8) (C,5) (D,4)
12 hours!!
Prediction - Atomic Job
Do Prediction
Prediction - Batch Job
Do Prediction
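A rough sketch of the difference between the two prediction styles, assuming a model object with a scikit-learn style `predict(X)` method that accepts a list of feature rows. `StandInModel` is a hypothetical placeholder that only counts calls; a real fitted estimator would take its place.

```python
from itertools import islice


class StandInModel:
    """Hypothetical stand-in for a fitted estimator; counts predict() calls."""

    def __init__(self):
        self.calls = 0

    def predict(self, X):
        self.calls += 1
        return [sum(features) > 1.0 for features in X]


def predict_atomic(rows, model):
    """One predict() call per row: the fixed per-call cost is paid every time."""
    for cookie, features in rows:
        yield cookie, model.predict([features])[0]


def predict_batch(rows, model, batch_size=1000):
    """One predict() call per batch of rows, suitable for use inside mapPartitions."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        cookies = [cookie for cookie, _ in batch]
        X = [features for _, features in batch]
        for cookie, label in zip(cookies, model.predict(X)):
            yield cookie, label


rows = [("cookie-%d" % i, [i * 0.1, 0.5]) for i in range(10)]
atomic_model, batch_model = StandInModel(), StandInModel()
atomic = list(predict_atomic(rows, atomic_model))
batched = list(predict_batch(rows, batch_model, batch_size=4))
print(atomic_model.calls, batch_model.calls)
```

Both styles return the same predictions; the batch version simply makes far fewer calls into the library.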
Conclusion
• Using the DB to pre-sort the data is better than doing the aggregation in Spark
• Batch prediction is better than atomic (one record at a time) prediction
Another Case - Spam Article Classifier
• Article Structure Classifier
• Article Content Classifier
• Bag of Words
• High-Dimensional Feature Space
• Very Sparse Vector
• Large Number of Documents
Original
sc.textFile -> RDD
text to terms -> RDD
collect to Python list (2 Million Docs) -> Bang!!
dict vectorize -> sparse vectors
tf-idf transform
build classifier
New
sc.textFile -> RDD
text to terms -> RDD
distinct RDD -> collect terms and build vectorizer
terms to sparse vector -> RDD
collect sparse vectors to list
use vstack: list to sparse matrix
tf-idf transform -> tf-idf sparse matrix
build classifier
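The tail of the new pipeline can be sketched locally, assuming the per-document term lists have already been produced (in Spark these steps would run as `map` calls on the RDD before collecting). The helper names `build_vocabulary` and `terms_to_sparse_row` and the sample documents are hypothetical; `scipy.sparse.vstack` and scikit-learn's `TfidfTransformer` are the real library pieces.

```python
from scipy.sparse import csr_matrix, vstack
from sklearn.feature_extraction.text import TfidfTransformer


def build_vocabulary(docs_terms):
    """Map each distinct term to a column index (the 'collect terms' step)."""
    distinct_terms = sorted({term for terms in docs_terms for term in terms})
    return {term: idx for idx, term in enumerate(distinct_terms)}


def terms_to_sparse_row(terms, vocabulary):
    """Turn one document's terms into a 1 x n_terms sparse count vector."""
    counts = {}
    for term in terms:
        col = vocabulary.get(term)
        if col is not None:
            counts[col] = counts.get(col, 0) + 1
    cols = sorted(counts)
    data = [counts[col] for col in cols]
    rows = [0] * len(cols)
    return csr_matrix((data, (rows, cols)), shape=(1, len(vocabulary)))


# Hypothetical tokenized documents standing in for the "text to terms" RDD.
docs_terms = [
    ["cheap", "pills", "cheap"],
    ["meeting", "notes"],
    ["cheap", "meeting"],
]

vocabulary = build_vocabulary(docs_terms)
sparse_rows = [terms_to_sparse_row(terms, vocabulary) for terms in docs_terms]

# Stack the small per-document vectors into one sparse matrix.
counts_matrix = vstack(sparse_rows).tocsr()

# Apply tf-idf weighting to the stacked count matrix.
tfidf_matrix = TfidfTransformer().fit_transform(counts_matrix)
print(counts_matrix.shape, tfidf_matrix.shape)
```

Only compact sparse vectors are collected to the driver, rather than the raw term lists of every document.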
