Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and Spark
The document discusses the development of a large-scale adaptive recommendation engine using Apache Flink and Spark, funded by the European Union's Horizon 2020 program. It covers key topics including recommendation systems, matrix factorization, and differences between batch and online methods, while comparing the capabilities of Spark and Flink for implementing these systems. The conclusion emphasizes lessons learned regarding code stability, performance, and handling data skew in machine learning applications.
1.
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and Spark
Zoltán Zvara
zoltan.zvara@ilab.sztaki.hu
Gábor Hermann
ghermann@ilab.sztaki.hu
This project has received funding from the European Union’s Horizon 2020
research and innovation program under grant agreement No 688191.
2.
About us
• Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI)
• Informatics Laboratory
• „Big Data – Momentum” research group
• „Data Mining and Search” research group
• Research group with strong industry ties
• Ericsson, Rovio, Portugal Telecom, etc.
Recommendation with matrix factorization
[Figure: rating matrix 𝑅 with one row per user (Zoltán, Gábor) and one column per movie (Rogue One, Interstellar).]
Zoltán rated Rogue One with 5 stars
7.
Recommendation with matrix factorization
𝑈 ∙ 𝐼 ≈ 𝑅
[Figure: 𝑅 is approximated by the product of a user matrix 𝑈 (one user vector per user) and an item matrix 𝐼 (one item vector per movie). The vector components are latent factors, e.g. level of action, level of drama, X factor.]
Zoltán rated Rogue One with 5 stars
8.
Recommendation with matrix factorization
𝑈 ∙ 𝐼 ≈ 𝑅
$$\min_{u_*,\, i_*} \sum_{(p,q) \in \kappa_R} \left( r_{pq} - \mu - b_p - b_q - u_p i_q \right)^2 + \lambda \sum_{p \in \kappa_U} \left( \lVert u_p \rVert^2 + b_p^2 \right) + \lambda \sum_{q \in \kappa_I} \left( \lVert i_q \rVert^2 + b_q^2 \right)$$
Zoltán rated Rogue One with 5 stars
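A rough sketch of how the objective on this slide could be evaluated on a toy model. It reads μ as a global mean rating and b_p, b_q as user and item biases, which is the usual convention but is not spelled out on the slide. The names `Model`, `dot` and `loss` are made up for this example and are not taken from the talk's code.

```scala
object MfObjectiveSketch {
  type Factor = Vector[Double]

  // Hypothetical container for the parameters that appear in the objective.
  final case class Model(
      mu: Double,                    // assumed: global mean rating
      userBias: Map[String, Double], // b_p (assumed: per-user bias)
      itemBias: Map[String, Double], // b_q (assumed: per-item bias)
      userVec: Map[String, Factor],  // u_p
      itemVec: Map[String, Factor],  // i_q
      lambda: Double)                // regularization weight

  def dot(a: Factor, b: Factor): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Squared error over the known ratings plus L2 penalties on vectors and biases.
  def loss(m: Model, ratings: Seq[(String, String, Double)]): Double = {
    val fit = ratings.map { case (p, q, r) =>
      val e = r - m.mu - m.userBias(p) - m.itemBias(q) - dot(m.userVec(p), m.itemVec(q))
      e * e
    }.sum
    val userReg = m.userVec.iterator.map { case (p, u) =>
      dot(u, u) + m.userBias(p) * m.userBias(p)
    }.sum
    val itemReg = m.itemVec.iterator.map { case (q, i) =>
      dot(i, i) + m.itemBias(q) * m.itemBias(q)
    }.sum
    fit + m.lambda * (userReg + itemReg)
  }
}
```

Training then means searching for the vectors and biases that make `loss` small; the λ terms keep the factors from growing without bound on sparse data.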
9.
Recommendation with matrix factorization
𝑈 ∙ 𝐼 ≈ 𝑅
[Figure: the same factorization, with the entry of 𝑅 for Gábor and Interstellar marked "?".]
Zoltán rated Rogue One with 5 stars
Would Gábor like Interstellar?
11.
Recommendation with matrix factorization
𝑈 ∙ 𝐼 ≈ 𝑅
[Figure, built up over slides 11 to 13: the user vector (5, 4, -4) and the item vector (3, 2, 5) are highlighted; their product, 3, replaces the "?" in 𝑅.]
Zoltán rated Rogue One with 5 stars
Would Gábor like Interstellar?
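The prediction step in this build-up is a dot product of one user vector and one item vector. A minimal sketch, assuming the highlighted numbers are read as Gábor's user vector (5, 4, -4) and Interstellar's item vector (3, 2, 5); the object and value names are invented for the example.

```scala
object PredictSketch {
  // Dot product of a user vector and an item vector over the latent factors.
  def dot(a: Vector[Double], b: Vector[Double]): Double = {
    require(a.length == b.length, "vectors must have the same number of latent factors")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Numbers as read from the slide (an interpretation of the extracted figure).
  val gabor: Vector[Double] = Vector(5.0, 4.0, -4.0)
  val interstellar: Vector[Double] = Vector(3.0, 2.0, 5.0)

  // 5*3 + 4*2 + (-4)*5 = 15 + 8 - 20 = 3
  val predicted: Double = dot(gabor, interstellar)
}
```

The same arithmetic with the other user vector (5, -6, -1) and the other item vector (5, 3, 2) gives 25 - 18 - 2 = 5, which agrees with Zoltán's known 5-star rating of Rogue One under this reading.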
14.
Batch training
[Figure, built up over slides 14 to 16: rating records of the form [user; item; time; rating] are read from persistent storage into 𝑅, and the user and item vectors are then trained on the whole matrix.]
But how to scale?
• Spotify streamed 20 billion hours of music in 2015
• YouTube: over a billion users, billions of video views every day
• Use distributed data-analytics frameworks
• How can we combine batch + online?
Distributed online matrix factorization
[Figure, built up over slides 25 to 27: a new rating record [user; item; time; rating] arrives and touches one user vector of 𝑈 and one item vector of 𝐼.]
need to co-locate
then update
send updates
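The co-locate, update, send-updates cycle can be sketched as one stochastic gradient step on a single rating. This is a generic SGD update for a plain dot-product model without bias terms, not necessarily the exact rule used in the talk; `sgdStep`, `learningRate` and `lambda` are names and parameters assumed for the illustration.

```scala
object OnlineSgdSketch {
  type Factor = Vector[Double]

  def dot(a: Factor, b: Factor): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // One SGD step for a rating r, given the co-located user vector u and item vector i.
  // Returns the updated (user, item) pair, which the caller would send back
  // to wherever the factor matrices are stored.
  def sgdStep(u: Factor, i: Factor, r: Double,
              learningRate: Double, lambda: Double): (Factor, Factor) = {
    val err = r - dot(u, i) // prediction error on this rating
    val newU = u.zip(i).map { case (uk, ik) => uk + learningRate * (err * ik - lambda * uk) }
    val newI = i.zip(u).map { case (ik, uk) => ik + learningRate * (err * uk - lambda * ik) }
    (newU, newI)
  }
}
```

Both vectors are needed on the same worker because each update reads the other vector's old value, which is why co-location comes first.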
28.
Distributed online matrix factorization
[Figure, repeated on slides 28 to 30: two incoming ratings are processed against 𝑈 and 𝐼 at the same time.]
process two ratings in parallel
• Concurrent modification
• Similar problem with batch SGD
• Distributed SGD (Gemulla et al. 2011)
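One way to avoid concurrent modification is the blocking schedule commonly associated with distributed SGD: split users and items into k blocks each and, in sub-epoch s, let worker w handle user block w together with item block (w + s) mod k, so that no two workers touch the same user block or item block at once. The sketch below only illustrates that schedule; `stratum`, `blockOf` and `ratingsFor` are hypothetical names, and the modulo partitioning is an assumption.

```scala
object DsgdScheduleSketch {
  // (userBlock, itemBlock) pairs that can be processed in parallel in sub-epoch s.
  // Element w of the result is the pair assigned to worker w.
  def stratum(k: Int, s: Int): Seq[(Int, Int)] =
    (0 until k).map(w => (w, (w + s) % k))

  // Hypothetical partitioner: maps a non-negative id to one of k blocks.
  def blockOf(id: Int, k: Int): Int = id % k

  // Ratings (user, item, rating) that worker w processes in sub-epoch s.
  def ratingsFor(ratings: Seq[(Int, Int, Double)], k: Int, s: Int, w: Int): Seq[(Int, Int, Double)] = {
    val (userBlock, itemBlock) = stratum(k, s)(w)
    ratings.filter { case (u, i, _) =>
      blockOf(u, k) == userBlock && blockOf(i, k) == itemBlock
    }
  }
}
```

After k sub-epochs every (user block, item block) combination has been visited once, so each rating is used exactly once per pass.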
34.
Online MF in Spark
// we have our input
val ratings: DStream[Rating] = ...
// would like to have output like this
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
updateStateByKey?
Use batch DSGD for online updates! (discussion issue SPARK-6407)
38.
Online MF in Spark
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// we have our input
val ratings: DStream[Rating] = ...
// need to represent factor matrices
var users: RDD[(UserId, Vector)] = ...
var items: RDD[(ItemId, Vector)] = ...
// would like to have output like this
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
  // use transform to allow RDD operations
  ratings.transform { (rs: RDD[Rating]) =>
    // compute updates
    val updates = batchDSGD(rs, users, items)
    // apply updates to get updated matrices
    users = applyUserUpdates(users, updates)
    items = applyItemUpdates(items, updates)
    updates
  }
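The slide does not show how the updates are applied. As a rough illustration of the idea only, the sketch below applies a stream of `Either` updates to factor matrices held in local `Map`s instead of RDDs. The two functions are stand-ins for the slide's functions of the same name, whose real implementation is not shown; the `Int` ids and the `Factor` alias are assumptions.

```scala
object ApplyUpdatesSketch {
  type Factor = Vector[Double]
  // Left = updated user vector, Right = updated item vector,
  // mirroring the Either in the slide's updateStream type.
  type Update = Either[(Int, Factor), (Int, Factor)]

  // Overwrite the user vectors that appear in the updates; keep the rest.
  def applyUserUpdates(users: Map[Int, Factor], updates: Seq[Update]): Map[Int, Factor] =
    users ++ updates.collect { case Left(kv) => kv }

  // Overwrite the item vectors that appear in the updates; keep the rest.
  def applyItemUpdates(items: Map[Int, Factor], updates: Seq[Update]): Map[Int, Factor] =
    items ++ updates.collect { case Right(kv) => kv }
}
```

On RDDs the same effect would need a join or union keyed by id; the point here is only that each mini-batch produces replacement vectors that are merged into the previous factor matrices.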
Combining batch + online in Spark
• Easy: can run batch training periodically on the whole dataset
54.
Combining batch + online in Flink
• Combining Flink Batch API with Streaming API
  • Could only do it with an external system
• Batch with Streaming API
  • Feasible!
  • Asynchronous training (Schelter et al. 2014)
• Batch + online
  • Both with Streaming API
  • Share matrices in common state
  • Parameter Server approach
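To make "share matrices in common state" concrete, here is a minimal in-memory sketch of the pull/push interface usually meant by a parameter server. It says nothing about how the authors implemented this in Flink; the class, its method names and the string keys are assumptions, and the code is single-threaded on purpose.

```scala
import scala.collection.mutable

object ParameterServerSketch {
  type Factor = Vector[Double]

  // Hypothetical in-memory stand-in for factor state shared by batch and online training.
  final class ParameterServer(numFactors: Int) {
    private val store = mutable.Map.empty[String, Factor]

    // Pull the current vector for a key, starting from zeros if the key is unseen.
    def pull(key: String): Factor =
      store.getOrElse(key, Vector.fill(numFactors)(0.0))

    // Push an additive delta for a key; any training job can write here.
    def push(key: String, delta: Factor): Unit =
      store(key) = pull(key).zip(delta).map { case (a, b) => a + b }
  }
}
```

A batch job and an online job would both call `pull` before computing a gradient and `push` afterwards, so each sees the other's progress through the shared state.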
61.
Lessons learned
Aspect | Flink | Spark
Implementation | More complex solution, harder to implement | Easier to use: could use batch for streaming
Generality | Can express finer-grained updates | Updates limited by mini-batch
Code stability | Some parts are not mature enough (e.g. Loops API) | More mature
Performance | Optimal for online learning, can perform well on batch | Not always optimal for online learning (e.g. online MF)
Handling data skew | Currently hard to relocate long-running operators | Periodic scheduling enables easier modification of partitioning
Machine learning | Non-complete ML library and other efforts for ML in Flink | Spark MLlib is mature and used in production
62.
Thank you for your attention
Zoltán Zvara
zoltan.zvara@ilab.sztaki.hu
Gábor Hermann
ghermann@ilab.sztaki.hu
Source code:
https://github.com/gaborhermann/large-scale-recommendation
Batch + online combination
• Last.fm dataset of 30M music listening events
• Weekly batch training
• Evaluation on every incoming listening event, averaged weekly
• Around 45,000 users
65.
Online MF: Spark vs. Flink
• 30M music listening Last.fm dataset read from 12 Kafka partitions
• Spark batch duration: 5 sec
• Time of processing X ratings
• DSGD algorithm
• Using 6 nodes, 4 cores each
• Spark 2.1.0, Flink 1.2.0
66.
Batch on Flink Streaming
• MovieLens 1M movie rating dataset
• Using 6 nodes, 4 cores each
Editor's Notes
#4, #5, #6 Say that we focus on comparing the two systems for this use case.
#21 Story: it turned out that it is worth combining the two. Message: batch + online is better than batch alone or online alone. DCG: Discounted Cumulative Gain, measures ranking quality, higher is better. https://en.wikipedia.org/wiki/Discounted_cumulative_gain
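For reference, a small sketch of the DCG measure named in note #21, using the plain form in which the relevance at rank i is divided by log2(i + 1). Other variants exist (for example with an exponential gain), and the notes do not say which one the evaluation used; `dcg` and its input format are assumptions made for this example.

```scala
object DcgSketch {
  // DCG of a ranked list: the relevance at 1-based rank i is discounted by log2(i + 1).
  // zipWithIndex is 0-based, hence the idx + 2.
  def dcg(relevances: Seq[Double]): Double =
    relevances.zipWithIndex.map { case (rel, idx) =>
      rel / (math.log(idx + 2.0) / math.log(2.0))
    }.sum
}
```

A relevant item contributes more when it is ranked higher, so `dcg(Seq(1.0, 0.0))` is larger than `dcg(Seq(0.0, 1.0))`.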