Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and Spark

Zoltán Zvara
zoltan.zvara@ilab.sztaki.hu
Gábor Hermann
ghermann@ilab.sztaki.hu
This project has received funding from the European Union’s Horizon 2020
research and innovation program under grant agreement No 688191.

About us
• Institute for Computer Science and Control, Hungarian Academy of
Sciences (MTA SZTAKI)
• Informatics Laboratory
• "Big Data – Momentum" research group
• "Data Mining and Search" research group
• Research group with strong industry ties
• Ericsson, Rovio, Portugal Telecom, etc.

Agenda
1. Recommendation systems and matrix factorization
2. Batch vs. online
3. Matrix factorization
   1. Online
   2. Batch + online
4. Solution in Spark & Flink
5. Conclusions

Recommendation with matrix factorization
[Figure: rating matrix 𝑅 with users (Zoltán, Gábor) as rows and items (Rogue One, Interstellar) as columns; unknown ratings are shown as 0.]
• Zoltán rated Rogue One with 5 stars

[Figure: 𝑅 is approximated by the product of the user factor matrix 𝑈 and the item factor matrix 𝐼, 𝑈 ∙ 𝐼 ≈ 𝑅. Each user (Zoltán, Gábor) has a user vector and each item has an item vector of latent factors, e.g. level of action, level of drama, X factor.]

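\[
\min_{u_*, i_*} \sum_{(p,q) \in \kappa_R} \left( r_{pq} - \mu - b_p - b_q - u_p \cdot i_q \right)^2
+ \lambda \sum_{p \in \kappa_U} \left( \lVert u_p \rVert^2 + b_p^2 \right)
+ \lambda \sum_{q \in \kappa_I} \left( \lVert i_q \rVert^2 + b_q^2 \right)
\]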
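As a rough illustration of the objective above, the following Scala sketch evaluates the regularized squared error for a given set of parameters. It reads μ as a global mean rating and b_p, b_q as user and item biases, which is the usual interpretation but an assumption here. The names `Rating`, `Model`, `predict` and `objective` are hypothetical and not taken from the talk's source code.

```scala
object MfObjective {
  // A single known rating r_pq of item q by user p.
  final case class Rating(user: Int, item: Int, value: Double)

  // Hypothetical container for the parameters that appear in the objective.
  // Assumption: the maps contain exactly the users and items that occur in the ratings.
  final case class Model(
      mu: Double,                         // mu, read here as the global mean rating
      userBias: Map[Int, Double],         // b_p
      itemBias: Map[Int, Double],         // b_q
      userVec: Map[Int, Vector[Double]],  // u_p
      itemVec: Map[Int, Vector[Double]])  // i_q

  def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Predicted rating: mu + b_p + b_q + u_p . i_q
  def predict(m: Model, p: Int, q: Int): Double =
    m.mu + m.userBias(p) + m.itemBias(q) + dot(m.userVec(p), m.itemVec(q))

  // Squared error over the known ratings plus L2 regularization weighted by lambda.
  def objective(m: Model, ratings: Seq[Rating], lambda: Double): Double = {
    val error = ratings.map { r =>
      val e = r.value - predict(m, r.user, r.item)
      e * e
    }.sum
    val userReg = m.userVec.map { case (p, u) => dot(u, u) + m.userBias(p) * m.userBias(p) }.sum
    val itemReg = m.itemVec.map { case (q, i) => dot(i, i) + m.itemBias(q) * m.itemBias(q) }.sum
    error + lambda * userReg + lambda * itemReg
  }
}
```

Training then amounts to searching for the vectors and biases that make `objective` small, for example with stochastic gradient descent.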

Would Gábor like Interstellar?
[Figure: Gábor's rating of Interstellar is unknown ("?") in 𝑅. It is predicted from Gábor's user vector (5, 4, -4) and Interstellar's item vector (3, 2, 5): 5·3 + 4·2 + (-4)·5 = 3.]
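The prediction step shown in the figure is a dot product of a user vector and an item vector. A minimal Scala sketch with the numbers from the slide follows; the object and function names are illustrative, and the bias terms are left out as in the figure.

```scala
object PredictRating {
  // Dot product of a user vector and an item vector.
  def dot(u: Seq[Double], i: Seq[Double]): Double = {
    require(u.length == i.length, "vectors must have the same number of latent factors")
    u.zip(i).map { case (a, b) => a * b }.sum
  }

  // Values taken from the slide.
  val gabor: Seq[Double] = Seq(5.0, 4.0, -4.0)
  val interstellar: Seq[Double] = Seq(3.0, 2.0, 5.0)

  def main(args: Array[String]): Unit =
    println(dot(gabor, interstellar)) // 5*3 + 4*2 + (-4)*5 = 3.0
}
```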

Batch training
[Figure: rating records [user; item; time; rating] are read from PERSISTENT STORAGE into 𝑅, and the user and item vectors are trained on the whole dataset.]

Online training
[Figure: ratings arrive one by one as a stream (2, 5, 4, 2, 4). Each incoming rating immediately updates the user vector and the item vector it refers to.]
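One common way to realize such a per-rating update is a single stochastic gradient descent (SGD) step. The sketch below is an assumption about the update rule, not the authors' exact implementation: it omits the bias terms, and `learningRate` and `lambda` are hypothetical hyperparameters.

```scala
object OnlineSgd {
  type Vec = Vector[Double]

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum

  // One SGD step for a single incoming rating.
  // Returns the updated (user vector, item vector) pair.
  def update(user: Vec, item: Vec, rating: Double,
             learningRate: Double, lambda: Double): (Vec, Vec) = {
    val error = rating - dot(user, item)
    // Both updates use the old vectors, so they are computed before either is replaced.
    val newUser = user.zip(item).map { case (u, i) => u + learningRate * (error * i - lambda * u) }
    val newItem = item.zip(user).map { case (i, u) => i + learningRate * (error * u - lambda * i) }
    (newUser, newItem)
  }
}
```

Each rating touches exactly one user vector and one item vector, which is what makes the update cheap enough to apply on every incoming event.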

But how to scale?
• Spotify streamed 20 billion hours of music in 2015
• YouTube has over a billion users and billions of video views every day
• Use distributed data-analytics frameworks
• How can we combine batch + online?

Distributed online matrix factorization
[Figure: the user vectors and the item vectors are stored in separate partitions; ratings arrive as a stream.]
To process a rating:
1. need to co-locate the rating, the user vector and the item vector
2. then update
3. send updates

[Figure: two ratings that refer to different users and different items are processed at the same time.]
process two ratings in parallel
• Concurrent modification
• Similar problem with batch SGD
• Distributed SGD
(Gemulla et al. 2011)
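The idea behind distributed SGD is to split the users and the items into blocks and to process, at any one time, only rating blocks that share no user block and no item block, so no vector is modified concurrently. A cyclic shift is one simple way to pick such sets of blocks. The sketch below illustrates this under that assumption; the names `stratum` and `schedule` are hypothetical.

```scala
object DsgdStrata {
  // With k user blocks and k item blocks, sub-epoch s pairs user block b
  // with item block (b + s) % k. Within one sub-epoch every user block and
  // every item block occurs exactly once, so the k pairs can run in parallel.
  def stratum(k: Int, s: Int): Seq[(Int, Int)] =
    (0 until k).map(b => (b, (b + s) % k))

  // k sub-epochs together cover all k * k (user block, item block) pairs.
  def schedule(k: Int): Seq[Seq[(Int, Int)]] =
    (0 until k).map(s => stratum(k, s))
}
```

For k = 3, `DsgdStrata.stratum(3, 1)` yields the pairs (0,1), (1,2), (2,0): three workers can each take one pair without touching the same user or item vectors.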

Online MF in Spark
• updateStateByKey?
• Use batch DSGD for online updates!
(discussion issue SPARK-6407)

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// we have our input
val ratings: DStream[Rating] = ...

// need to represent factor matrices
var users: RDD[(UserId, Vector)] = ...
var items: RDD[(ItemId, Vector)] = ...

// would like to have output like this
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
  // use transform to allow RDD operations
  ratings.transform { (rs: RDD[Rating]) =>
    // compute updates
    val updates = batchDSGD(rs, users, items)
    // apply updates to get updated matrices
    users = applyUserUpdates(users, updates)
    items = applyItemUpdates(items, updates)
    updates
  }
Online MF in Spark
• Performance decreases over time
• Problem: tracking the lineage graph, which grows with every mini-batch because users and items are derived from their previous versions
• Solution: use checkpointing
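The effect can be illustrated with a toy model in plain Scala that does not use Spark: a dataset that only remembers its parent stands in for an RDD, and checkpointing cuts the chain of parents. The names below are hypothetical. In Spark itself the corresponding calls are `SparkContext.setCheckpointDir` and `RDD.checkpoint()`.

```scala
object LineageToy {
  // Toy stand-in for an RDD: it only remembers which dataset it was derived from.
  final case class Dataset(parent: Option[Dataset]) {
    // Number of ancestors that would have to be tracked.
    def depth: Int = parent.map(_.depth + 1).getOrElse(0)
    // Like users = applyUserUpdates(users, updates): the result depends on the old version.
    def derive: Dataset = Dataset(Some(this))
    // Like a checkpoint: the data is materialized and the parents are forgotten.
    def checkpoint: Dataset = Dataset(None)
  }

  // Simulates `batches` mini-batches and returns the final lineage depth.
  def run(batches: Int, checkpointEvery: Option[Int]): Int = {
    var users = Dataset(None)
    for (b <- 1 to batches) {
      users = users.derive
      if (checkpointEvery.exists(n => b % n == 0)) users = users.checkpoint
    }
    users.depth
  }
}
```

Without checkpointing the depth equals the number of mini-batches processed so far; with a checkpoint every n batches it stays below n.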

Online MF in Flink
[Figure: dataflow with two operators, one holding the user vectors and one holding the item vectors, connected by a backward edge.]
• long-running operators with state
• backward edge in dataflow (stream loop)
1. rating event arrives at the user vectors
2. rating event & user vector are sent to the item vectors
3. apply update
4. user vector update is sent back on the backward edge
WARNING!
Loops API (iterative streams) not mature enough yet, but there is ongoing effort
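The four steps can be simulated in plain Scala without any Flink API. This is only a sketch of the message flow between two stateful operators. The class names, the unregularized SGD update and the `learningRate` parameter are assumptions for illustration.

```scala
object StreamLoopSketch {
  type Vec = Vector[Double]
  final case class Rating(user: Int, item: Int, value: Double)

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum

  // Stateful operator holding the user vectors.
  final class UserOperator(init: Map[Int, Vec]) {
    private var state = init
    // Steps 1-2: on a rating event, emit the rating together with the user vector.
    def onRating(r: Rating): (Rating, Vec) = (r, state(r.user))
    // Step 4: a message from the backward edge replaces the stored user vector.
    def onUserUpdate(user: Int, v: Vec): Unit = { state = state.updated(user, v) }
    def vector(user: Int): Vec = state(user)
  }

  // Stateful operator holding the item vectors.
  final class ItemOperator(init: Map[Int, Vec], learningRate: Double) {
    private var state = init
    // Step 3: compute both updates, apply the item update locally and
    // return the user update that travels back over the loop.
    def onRatingWithUser(r: Rating, u: Vec): (Int, Vec) = {
      val i = state(r.item)
      val err = r.value - dot(u, i)
      state = state.updated(r.item, i.zip(u).map { case (iv, uv) => iv + learningRate * err * uv })
      (r.user, u.zip(i).map { case (uv, iv) => uv + learningRate * err * iv })
    }
    def vector(item: Int): Vec = state(item)
  }

  // Wires the two operators together for one rating event.
  def process(users: UserOperator, items: ItemOperator, r: Rating): Unit = {
    val (rating, u) = users.onRating(r)
    val (userId, newU) = items.onRatingWithUser(rating, u)
    users.onUserUpdate(userId, newU)
  }
}
```

In a real streaming job the two operators would run on different machines and the user update would arrive asynchronously, which is where the stream loop comes in.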

Combining batch + online in Spark
• Easy: can run batch training periodically on the whole dataset

Combining batch + online in Flink
• Combining Flink Batch API with Streaming API
  • Could only do it with an external system
• Batch with Streaming API
  • Feasible!
  • Asynchronous training
    (Schelter et al. 2014)
• Batch + online
  • Both with Streaming API
  • Share matrices in common state
  • Parameter Server approach
Lessons learned

                   | Flink                                                      | Spark
-------------------+------------------------------------------------------------+----------------------------------------------------------------
Implementation     | More complex solution, harder to implement                 | Easier to use: could use batch for streaming
Generality         | Can express finer grained updates                          | Updates limited by mini-batch
Code stability     | Some parts are not mature enough (e.g. Loops API)          | More mature
Performance        | Optimal for online learning, can perform well on batch     | Not always optimal for online learning (e.g. online MF)
Handling data skew | Currently hard to relocate long-running operators          | Periodic scheduling enables easier modification of partitioning
Machine learning   | Non-complete ML library and other efforts for ML in Flink  | Spark MLlib is mature and used in production

Thank you for your attention
Zoltán Zvara
zoltan.zvara@ilab.sztaki.hu
Gábor Hermann
ghermann@ilab.sztaki.hu
Source code:
https://github.com/gaborhermann/large-scale-recommendation

Batch + online combination
• Last.fm dataset with 30M music listening events
• Weekly batch training
• Evaluation: weekly average
  • on every incoming listening
• Around 45,000 users
Online MF: Spark vs. Flink
• 30M music listening Last.fm dataset read from 12 Kafka partitions
• Spark batch duration: 5 sec
• Time of processing X ratings
• DSGD algorithm
• Using 6 nodes, 4 cores each
• Spark 2.1.0, Flink 1.2.0

Batch on Flink Streaming
• Movielens 1M movie rating dataset
• Using 6 nodes, 4 cores each
