Building a Recommender System in PySpark
Will Johnson
- Uline
- DePaul
LearnByMarketing.com
AGENDA
- RecSys
  * Basics
  * MF
  * Evaluation
  * Advanced
- PySpark
  * Basics
  * ALS
User Based Collaborative Filtering
[Figure: a user-item rating matrix; a target user's missing rating (3.8) is predicted from the ratings of users with similar taste]
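The idea behind user-based collaborative filtering can be sketched in plain NumPy (toy ratings, not the MovieLens data): predict a missing rating as a similarity-weighted average of what similar users gave that item.

```python
import numpy as np

# Toy ratings (0 = unrated); predict user 0's rating for item 2
# from the ratings of the other users.
R = np.array([[4.5, 4.0, 0.0],
              [5.0, 4.5, 3.0],
              [4.0, 4.5, 2.0],
              [1.0, 2.0, 1.5]])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target, item = 0, 2
others = [u for u in range(len(R)) if u != target and R[u, item] > 0]
# Similarity is computed on the co-rated items (items 0 and 1 here)
sims = np.array([cosine(R[target, :2], R[u, :2]) for u in others])
ratings = np.array([R[u, item] for u in others])

# Similarity-weighted average of the neighbours' ratings
pred = sims @ ratings / sims.sum()
```

Users with taste close to the target (high cosine similarity) pull the prediction toward their own rating for the item.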
Item Based Collaborative Filtering
Matrix Factorization
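A minimal sketch of the factorization idea, assuming a toy rating matrix and plain gradient descent on the observed entries (Spark's ALS uses a different optimizer; this is only to show that two small factor matrices can fill in the missing cells):

```python
import numpy as np

# Toy user-item rating matrix; 0 marks a missing rating
R = np.array([[4.5, 4.0, 0.0],
              [5.0, 0.0, 3.0],
              [0.0, 2.0, 1.0]])
mask = R > 0

rng = np.random.default_rng(0)
k = 2                                   # number of latent factors
U = rng.normal(scale=0.1, size=(3, k))  # user factors
V = rng.normal(scale=0.1, size=(3, k))  # item factors

# Gradient descent on the squared error over observed entries only
for _ in range(5000):
    err = (R - U @ V.T) * mask
    U += 0.01 * (err @ V)
    V += 0.01 * (err.T @ U)

pred = U @ V.T  # every cell now holds a prediction, including the missing ones
```

The product `U @ V.T` reconstructs the known ratings and, as a by-product, produces a rating estimate for every empty cell.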
Evaluation
RMSE = √( ∑(Predicted − Actual)² / n )
Precision (user u) = |hits_u| / |RecoSet_u|
Recall (user u) = |hits_u| / |TestSet_u|
Expert Review: Novelty, Context
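These metrics are easy to verify on toy numbers (hypothetical predictions and item IDs, plain Python):

```python
from math import sqrt

# RMSE over (predicted, actual) pairs
pairs = [(3.91, 3.0), (3.29, 3.0), (1.09, 1.0)]
rmse = sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

# Precision / recall for one user:
# hits = recommended items that also appear in the user's test set
reco_set = {242, 302, 377, 51}
test_set = {242, 51, 474}
hits = reco_set & test_set           # {242, 51}
precision = len(hits) / len(reco_set)  # 2/4 = 0.5
recall = len(hits) / len(test_set)     # 2/3
```

Precision asks "of what I recommended, how much did the user like?"; recall asks "of what the user liked, how much did I recommend?".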
CRISP-DM
Data Understanding
movielens = sc.textFile("../in/ml-100k/u.data")
movielens.first()
# u'196\t242\t3\t881250949'
movielens.count()
# 100,000
clean_data = movielens.map(lambda x: x.split('\t'))
rate = clean_data.map(lambda y: int(y[2]))
rate.mean()
# 3.52986
users = clean_data.map(lambda y: int(y[0]))
users.distinct().count()
# 943
clean_data.map(lambda y: int(y[1])).distinct().count()
# 1,682
Data Preparation
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
mls = movielens.map(lambda l: l.split('\t'))
ratings = mls.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
# e.g. Rating(user=196, product=242, rating=3.0)
train, test = ratings.randomSplit([0.7, 0.3], 7856)
train.count()
# 70,005
test.count()
# 29,995
train.cache()
test.cache()
Modeling
rank = 5 # Latent Factors to be made
numIterations = 10 # Times to repeat process
#Create the model on the training data
model = ALS.train(train, rank, numIterations)
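Conceptually, ALS alternates two closed-form least-squares solves: hold the item factors fixed and solve for the user factors, then swap. A dense NumPy sketch (toy matrix, no missing-entry handling, so illustrative only; Spark's implementation also weights by which entries are observed):

```python
import numpy as np

# Dense toy version of alternating least squares
rng = np.random.default_rng(7856)
R = np.array([[4.0, 3.0, 5.0],
              [5.0, 4.0, 4.0],
              [1.0, 2.0, 1.0]])
rank, lam = 2, 0.1            # latent factors, regularization
U = rng.normal(size=(3, rank))
V = rng.normal(size=(3, rank))

for _ in range(10):           # numIterations
    # Fix V, solve the regularized least-squares problem for U...
    U = R @ V @ np.linalg.inv(V.T @ V + lam * np.eye(rank))
    # ...then fix U and solve for V
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(rank))

pred = U @ V.T                # rank-2 reconstruction of R
```

Each half-step has an exact solution, which is why ALS converges in few iterations and parallelizes well: every user's (and every item's) solve is independent of the others.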
Modeling / Evaluation
model.userFeatures()     # latent factor vectors, one per user
model.productFeatures()  # latent factor vectors, one per product
# For Product X, Find N Users to Sell To
model.recommendUsers(242,100)
# For User Y Find N Products to Promote
model.recommendProducts(196,10)
#Predict Single Product for Single User
model.predict(196, 242)
# Predict for many users and products at once
# Pre-processing: strip down to (user, product) pairs
pred_input = train.map(lambda x: (x[0], x[1]))
# e.g. (196, 242)
# Lots of predictions
pred = model.predictAll(pred_input)
# Returns Rating(user, product, rating)
# e.g. Rating(user=894, product=1560, rating=3.845)
Evaluation
User  Item  Actual  Pred
196   242   3.0     3.91
186   302   3.0     3.29
22    377   1.0     1.09
244   51    2.0     3.66
298   474   4.0     4.11

TRAINING RMSE: 0.763
Evaluation
# Organize the data to make (user, product) the key
true_reorg = train.map(lambda x: ((x[0], x[1]), x[2]))
# e.g. ((196, 242), 3.0)
pred_reorg = pred.map(lambda x: ((x[0], x[1]), x[2]))
# Do the actual join
true_pred = true_reorg.join(pred_reorg)
# e.g. ((582, 1014), (4.0, 3.397))
from math import sqrt
MSE = true_pred.map(lambda r: (r[1][0] - r[1][1])**2).mean()
RMSE = sqrt(MSE)
# Results in 0.7629908117414474
Evaluation
test_input = test.map(lambda x: (x[0], x[1]))
pred_test = model.predictAll(test_input)
test_reorg = test.map(lambda x: ((x[0], x[1]), x[2]))
pred_reorg = pred_test.map(lambda x: ((x[0], x[1]), x[2]))
test_pred = test_reorg.join(pred_reorg)
test_MSE = test_pred.map(lambda r: (r[1][0] - r[1][1])**2).mean()
test_RMSE = sqrt(test_MSE)
# TEST RMSE: 1.0145
CRISP-DM
RECAP
- RecSys models are nearest-neighbor or matrix-factorization based
- ALS is implemented in Spark
rank = 5; numIterations = 10
#Create the model on the training data
model = ALS.train(train, rank, numIterations)
# Lots of Predictions
pred = model.predictAll(pred_input)
#Examine Model Features
model.productFeatures()
# Save your model!
model.save(sc, "../out/ml-model")
# ...and load it back later
same_model = MatrixFactorizationModel.load(sc, "../out/ml-model")
Questions?
LearnByMarketing.com

Recommender Systems with Apache Spark's ALS Function