Spark + Scikit-Learn - Performance Tuning

Who am I?
● Kent (施晨揚)
● Passionate about Machine Learning & Big Data
● Father of two
https://www.facebook.com/texib

What? Key Factors Influencing Performance
• Key Factors Influencing Performance
• Large Raw Data Size (4 Billion Records)
• Large Number of Cookies (40 Million Records)
• Machine Learning Library - Prediction Function Cost

How?
● Spark
o Parallel Computing
o Scalable
o Very Powerful Data Processing Tool
o But its Machine Learning Library ….
● Python Scikit-Learn
o Very Powerful Machine Learning Library
o But parts of it can only use a single core XD

Data Size
• Training Data
• Prediction Data: ~ 100 times the training data (the major problem)

Use Python to Prepare Prediction Data
Prepare Data → Train Model → Prepare Prediction Data → Do Prediction
• About 30 mins
• > 1 week

Aggregation - Where Is It Slow?
50%
• Aggregating 4 billion rows into 40 million cookies is a very time-consuming job

Use mapPartitions()
• Instead of using reduceByKey() with your own aggregation logic
• How:
• Step 1: use the DB (Redshift) to prepare the prediction data, ordered by cookie
• Step 2: use mapPartitions() locally to do batch prediction

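Use reduceByKey()
Input (unsorted, spread across partitions):
(A,1) (B,2) (A,3) | (A,3) (B,4) (C,2) | (B,2) (C,2) (D,1) | (C,1) (D,2) (D,1)
After combining within each partition:
(A,4) (B,2) | (A,3) (B,4) (C,2) | (B,2) (C,2) (D,1) | (C,1) (D,3)
After merging partitions pairwise:
(A,7) (B,6) (C,2) | (B,2) (C,3) (D,4)
Result:
(A,7) (B,8) (C,5) (D,4)
24 hours!!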
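
The combine-then-merge flow in the diagram can be imitated in plain Python. This is only a sketch to make the stages concrete, not Spark's actual implementation: the helper names `combine_partition` and `merge_partials` are made up for illustration, and the split of the twelve pairs into four partitions of three is an assumption read off the diagram.

```python
from collections import defaultdict

def combine_partition(pairs):
    """Sum values per key inside one partition (the local combine stage)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def merge_partials(partials):
    """Merge per-partition partial sums into final per-key totals."""
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)

# Assumed partitioning of the pairs shown on the slide.
partitions = [
    [("A", 1), ("B", 2), ("A", 3)],
    [("A", 3), ("B", 4), ("C", 2)],
    [("B", 2), ("C", 2), ("D", 1)],
    [("C", 1), ("D", 2), ("D", 1)],
]

partials = [combine_partition(p) for p in partitions]
totals = merge_partials(partials)
print(totals)
```

Because every key is scattered over several partitions, each partition yields a partial sum per key, and all of those partials still have to be moved and merged before the result is complete.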

Use DB to Pre-Sort and mapPartitions()
Input, pre-sorted by key:
(A,1) (A,3) (A,3) | (B,2) (B,4) (B,2) | (C,2) (C,2) (C,1) | (D,1) (D,2) (D,1)
Result after aggregating within each partition:
(A,7) (B,8) (C,5) (D,4)
12 hours!!
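
When rows arrive already sorted by key, a partition can be aggregated in a single streaming pass. The following is a minimal sketch of such a per-partition function, assuming rows are `(cookie, value)` pairs and that all rows for a given cookie sit in the same partition. The name `aggregate_sorted_partition` is hypothetical and the data is the toy data from the slide.

```python
from itertools import groupby
from operator import itemgetter

def aggregate_sorted_partition(rows):
    """Yield (cookie, total) from an iterator of (cookie, value) rows.

    Assumes the rows are already sorted by cookie, so equal keys are adjacent.
    """
    for cookie, group in groupby(rows, key=itemgetter(0)):
        yield cookie, sum(value for _, value in group)

sorted_rows = [
    ("A", 1), ("A", 3), ("A", 3),
    ("B", 2), ("B", 4), ("B", 2),
    ("C", 2), ("C", 2), ("C", 1),
    ("D", 1), ("D", 2), ("D", 1),
]

result = list(aggregate_sorted_partition(iter(sorted_rows)))
print(result)
```

In PySpark a function of this shape could be passed to `rdd.mapPartitions(aggregate_sorted_partition)`. If a cookie's rows straddle a partition boundary, the partial sums on either side would still need a final merge.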

Prediction - Atomic Job
Do Prediction

Prediction - Batch Job
Do Prediction
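
The difference between the atomic and batch jobs can be sketched in plain Python. This is an illustration under assumptions, not the author's code: `StandInModel` is a hypothetical stand-in for a trained scikit-learn estimator (it only mimics the convention that `predict()` takes a list of feature rows), and `predict_atomic`, `predict_batch`, and the batch size are made-up names and values.

```python
from itertools import islice

class StandInModel:
    """Hypothetical stand-in for a trained estimator; counts predict() calls."""
    def __init__(self):
        self.calls = 0

    def predict(self, rows):
        self.calls += 1
        return [1 if sum(row) > 5 else 0 for row in rows]

def predict_atomic(records, model):
    """One predict() call per record."""
    for cookie, features in records:
        yield cookie, model.predict([features])[0]

def predict_batch(records, model, batch_size=1000):
    """One predict() call per chunk of up to batch_size records."""
    records = iter(records)
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            break
        cookies = [cookie for cookie, _ in batch]
        predictions = model.predict([features for _, features in batch])
        yield from zip(cookies, predictions)

records = [("cookie%d" % i, [i, i + 1]) for i in range(10)]
atomic_model, batch_model = StandInModel(), StandInModel()
atomic = list(predict_atomic(records, atomic_model))
batched = list(predict_batch(records, batch_model, batch_size=4))
print(atomic_model.calls, batch_model.calls)
```

Both variants return the same predictions, but the batch version makes far fewer calls into the model, so any fixed per-call cost is paid once per batch rather than once per record. In PySpark, something like `rdd.mapPartitions(lambda rows: predict_batch(rows, model))` would apply it to each partition.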

Conclusion
• Pre-sorting the data in the DB is better than doing the aggregation in Spark
• Batch prediction is better than atomic prediction

Another Case - Spam Article Classifier
• Article Structure Classifier
• Article Content Classifier
• Bag of Words
• High Dimension Feature Space
• Very Sparse Vector
• Large Number of Documents

Original
sc.textFile → RDD → text to terms → RDD → collect to Python list (2 million docs: Bang!!) → dict vectorize → sparse vectors → tf-idf transform → build classifier

New
sc.textFile → RDD → text to terms → RDD
RDD → distinct RDD → collect terms and build vectorizer
RDD → terms to sparse vector → RDD → collect sparse vectors to list
Use vstack: list to sparse matrix → tf-idf transform → tf-idf sparse matrix → build classifier
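
The core of the new flow is that only the distinct terms and the compact sparse rows are collected, never the raw documents. The sketch below imitates those steps with the standard library only. All function names and the toy documents are hypothetical, and `stack_rows` is a hand-rolled stand-in for the vstack step named on the slide.

```python
def build_vocabulary(distinct_terms):
    """Map each distinct term to a column index."""
    return {term: idx for idx, term in enumerate(sorted(distinct_terms))}

def to_sparse_row(terms, vocabulary):
    """Turn one document's terms into a sorted list of (column, count) pairs."""
    counts = {}
    for term in terms:
        col = vocabulary.get(term)
        if col is not None:
            counts[col] = counts.get(col, 0) + 1
    return sorted(counts.items())

def stack_rows(sparse_rows):
    """Stack sparse rows into CSR-style arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in sparse_rows:
        for col, count in row:
            indices.append(col)
            data.append(count)
        indptr.append(len(indices))
    return data, indices, indptr

# Toy documents, already split into terms.
docs = [["cheap", "pills", "cheap"], ["meeting", "notes"], ["cheap", "notes"]]

distinct_terms = {term for doc in docs for term in doc}
vocabulary = build_vocabulary(distinct_terms)
rows = [to_sparse_row(doc, vocabulary) for doc in docs]
data, indices, indptr = stack_rows(rows)
print(data, indices, indptr)
```

With SciPy available, the three arrays could be handed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=(len(rows), len(vocabulary)))`, or each row could be built as a one-row sparse matrix and combined with `scipy.sparse.vstack`, before the tf-idf transform.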
