Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and Spark
Zoltán Zvara
zoltan.zvara@ilab.sztaki.hu
Gábor Hermann
ghermann@ilab.sztaki.hu
This project has received funding from the European Union’s Horizon 2020
research and innovation program under grant agreement No 688191.
About us
• Institute for Computer Science and Control, Hungarian Academy of
Sciences (MTA SZTAKI)
• Informatics Laboratory
• „Big Data – Momentum” research group
• „Data Mining and Search” research group
• Research group with strong industry ties
• Ericsson, Rovio, Portugal Telekom, etc.
Agenda
1. Recommendation systems and matrix factorization
2. Batch vs. online
3. Matrix factorization
   1. Online
   2. Batch + online
4. Solution in Spark & Flink
5. Conclusions
Recommendation systems
Recommendation with matrix factorization
[Figure: sparse rating matrix 𝑅; rows are users (Zoltán, Gábor), columns are items (Rogue One, Interstellar); missing ratings are shown as 0]
• Zoltán rated Rogue One with 5 stars
[Figure: the rating matrix 𝑅 is approximated by the product of the user factor matrix 𝑈 and the item factor matrix 𝐼, 𝑈 ∙ 𝐼 ≈ 𝑅; every user has a user vector and every item has an item vector]
• Latent factors: level of action, level of drama, X factor
• User vectors shown: (5, -6, -1) and (5, 4, -4)
• Item vectors shown: (3, 2, 5) and (5, 3, 2)
• Training objective:
$$\min_{u_*, i_*} \sum_{(p,q) \in \kappa_R} \left( r_{pq} - \mu - b_p - b_q - u_p \cdot i_q \right)^2 + \lambda \sum_{p \in \kappa_U} \left( \lVert u_p \rVert^2 + b_p^2 \right) + \lambda \sum_{q \in \kappa_I} \left( \lVert i_q \rVert^2 + b_q^2 \right)$$
• Would Gábor like Interstellar?
• Predicted rating: user vector (5, 4, -4) ∙ item vector (3, 2, 5) = 3
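The prediction on the slide is the dot product of a user vector and an item vector. The following is a minimal sketch of that arithmetic in plain Scala; the object name `PredictRating` and the function `predict` are illustrative names, not taken from the authors' code, and the bias terms of the training objective are left out.

```scala
object PredictRating {
  // Predicted rating = dot product of the user vector and the item vector.
  // Illustrative helper, not part of the authors' implementation.
  def predict(userVector: Array[Double], itemVector: Array[Double]): Double = {
    require(userVector.length == itemVector.length,
      "vectors must have the same number of latent factors")
    userVector.zip(itemVector).map { case (u, i) => u * i }.sum
  }

  def main(args: Array[String]): Unit = {
    val gabor = Array(5.0, 4.0, -4.0)        // user vector from the slide
    val interstellar = Array(3.0, 2.0, 5.0)  // item vector from the slide
    // 5*3 + 4*2 + (-4)*5 = 3.0
    println(predict(gabor, interstellar))
  }
}
```

With the vectors from the slide this gives 3, the value filled into the rating matrix for Gábor and Interstellar.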
Batch training
[Figure: ratings arrive as [user; item; time; rating] records and are collected in persistent storage; the user vectors and item vectors are then trained on the stored ratings]
Online training
[Figure: ratings arrive one by one as a stream of [user; item; time; rating] records; each incoming rating updates the corresponding user vector and item vector]
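Online training updates one user vector and one item vector per incoming rating. A minimal sketch of such an update as a single stochastic gradient step is shown below. It assumes a simplified objective without the bias terms, and the names `OnlineSgd`, `update`, `learningRate` and `lambda` are illustrative rather than taken from the authors' code.

```scala
object OnlineSgd {
  // One stochastic gradient step for a single rating (biases omitted for brevity).
  // Returns the updated (userVector, itemVector); the input arrays are not modified.
  def update(
      user: Array[Double],
      item: Array[Double],
      rating: Double,
      learningRate: Double,
      lambda: Double): (Array[Double], Array[Double]) = {
    require(user.length == item.length,
      "vectors must have the same number of latent factors")
    // Prediction error for this rating with the current vectors.
    val error = rating - user.zip(item).map { case (u, i) => u * i }.sum
    // Both new vectors are computed from the old values of the other vector.
    val newUser = user.zip(item).map { case (u, i) =>
      u + learningRate * (error * i - lambda * u)
    }
    val newItem = item.zip(user).map { case (i, u) =>
      i + learningRate * (error * u - lambda * i)
    }
    (newUser, newItem)
  }
}
```

Each call touches exactly one user vector and one item vector, which is why two ratings that share a user or an item cannot be applied independently in a distributed setting.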
Batch + online combination
But how to scale?
• Spotify streamed 20 billion hours of music in 2015
• YouTube: over a billion users, billions of video views every day
• Use distributed data-analytics frameworks
• How can we combine batch + online?
Apache Spark vs. Apache Flink
Distributed online matrix factorization
[Figure: a rating from the [user; item; time; rating] stream is processed together with its user vector and item vector]
1. Need to co-locate the rating with its user vector and item vector
2. Then update
3. Send updates
• Process two ratings in parallel
• Concurrent modification
• Similar problem with batch SGD
• Distributed SGD (Gemulla et al. 2011)
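Distributed SGD avoids concurrent modification by only processing, at the same time, blocks of ratings that share no users and no items. A minimal sketch of that blocking idea in plain Scala follows; it assumes users and items are assigned to `d` groups by their integer ids, and all names (`DsgdStrata`, `block`, `stratum`, `ratingsByBlock`) are illustrative, not the authors' implementation.

```scala
object DsgdStrata {
  final case class Rating(user: Int, item: Int, value: Double)

  // Block coordinates of a rating when users and items are each split into d groups.
  def block(r: Rating, d: Int): (Int, Int) =
    (Math.floorMod(r.user, d), Math.floorMod(r.item, d))

  // Stratum s consists of the blocks (b, (b + s) mod d) for b = 0 until d.
  // No two blocks of a stratum share a user group or an item group.
  def stratum(s: Int, d: Int): Seq[(Int, Int)] =
    (0 until d).map(b => (b, (b + s) % d))

  // Ratings of stratum s grouped by block; each group could go to a separate worker.
  def ratingsByBlock(ratings: Seq[Rating], s: Int, d: Int): Map[(Int, Int), Seq[Rating]] = {
    val blocks = stratum(s, d).toSet
    ratings.filter(r => blocks.contains(block(r, d))).groupBy(block(_, d))
  }
}
```

Iterating `s` from 0 until `d` covers every block once, so all ratings are eventually processed while the workers of one stratum never update the same vector.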
Online MF in Spark
• updateStateByKey?
• Use batch DSGD for online updates! (discussion issue SPARK-6407)
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// We have our input: a stream of ratings
val ratings: DStream[Rating] = ...
// Need to represent the factor matrices
var users: RDD[(UserId, Vector)] = ...
var items: RDD[(ItemId, Vector)] = ...
// Would like to have output like this: a stream of user and item vector updates
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
  // Use transform to allow RDD operations on each mini-batch
  ratings.transform { (rs: RDD[Rating]) =>
    // Compute updates
    val updates = batchDSGD(rs, users, items)
    // Apply updates to get the updated matrices
    users = applyUserUpdates(users, updates)
    items = applyItemUpdates(items, updates)
    updates
  }
Online MF in Spark
• Performance decreases over time
• Problem: tracking the lineage graph
• Solution: use checkpointing
Online MF in Flink
[Figure: dataflow with two operators, one holding the user vectors and one holding the item vectors, connected by a backward edge]
• Long-running operators with state
• Backward edge in dataflow (stream loop)
1. Rating event
2. Rating event & user vector
3. Apply update
4. User vector update
• WARNING! Loops API (iterative streams) not mature enough yet, but there is ongoing effort
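The four numbered steps can be read as messages passed between two stateful operators. The sketch below simulates that message flow in plain Scala without using the Flink API. It assumes the rating first reaches the operator holding the user vectors, which is one reading of the figure, and every class and method name here is hypothetical.

```scala
import scala.collection.mutable

object StreamLoopSketch {
  final case class Rating(user: Int, item: Int, value: Double)
  final case class RatingWithUser(rating: Rating, userVector: Array[Double])
  final case class UserUpdate(user: Int, userVector: Array[Double])

  // Stand-in for the long-running operator that holds the user vectors.
  final class UserOperator(init: Int => Array[Double]) {
    val state: mutable.Map[Int, Array[Double]] = mutable.Map.empty
    // Steps 1-2: attach the current user vector to the rating event.
    def onRating(r: Rating): RatingWithUser =
      RatingWithUser(r, state.getOrElseUpdate(r.user, init(r.user)))
    // Step 4: store the user vector update that arrives on the backward edge.
    def onUpdate(u: UserUpdate): Unit = state.update(u.user, u.userVector)
  }

  // Stand-in for the long-running operator that holds the item vectors.
  final class ItemOperator(init: Int => Array[Double], learningRate: Double) {
    val state: mutable.Map[Int, Array[Double]] = mutable.Map.empty
    // Step 3: apply the update (plain SGD step, no regularization or biases).
    def onRatingWithUser(m: RatingWithUser): UserUpdate = {
      val item = state.getOrElseUpdate(m.rating.item, init(m.rating.item))
      val user = m.userVector
      val error = m.rating.value - user.zip(item).map { case (u, i) => u * i }.sum
      state.update(m.rating.item,
        item.zip(user).map { case (i, u) => i + learningRate * error * u })
      UserUpdate(m.rating.user,
        user.zip(item).map { case (u, i) => u + learningRate * error * i })
    }
  }

  // One pass around the loop for a single rating event.
  def process(r: Rating, users: UserOperator, items: ItemOperator): Unit =
    users.onUpdate(items.onRatingWithUser(users.onRating(r)))
}
```

In a real streaming job the two operators run in parallel and the update travels asynchronously, so a user vector may be read again before its update has arrived; this sequential simulation does not show that effect.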
Online MF: Spark vs. Flink
Combining batch + online in Spark
• Easy: can run batch training periodically on whole dataset
Combining batch + online in Flink
• Combining Flink Batch API with Streaming API
  • Could only do it with an external system
• Batch with Streaming API
  • Feasible!
  • Asynchronous training (Schelter et al. 2014)
• Batch + online
  • Both with Streaming API
  • Share matrices in common state
  • Parameter Server approach
Lessons learned
| | Flink | Spark |
|---|---|---|
| Implementation | More complex solution, harder to implement | Easier to use: could use batch for streaming |
| Generality | Can express finer-grained updates | Updates limited by mini-batch |
| Code stability | Some parts are not mature enough (e.g. Loops API) | More mature |
| Performance | Optimal for online learning, can perform well on batch | Not always optimal for online learning (e.g. online MF) |
| Handling data skew | Currently hard to relocate long-running operators | Periodic scheduling enables easier modification of partitioning |
| Machine learning | Non-complete ML library and other efforts for ML in Flink | Spark MLlib is mature and used in production |
Thank you for your attention
Zoltán Zvara
zoltan.zvara@ilab.sztaki.hu
Gábor Hermann
ghermann@ilab.sztaki.hu
Source code:
https://github.com/gaborhermann/large-scale-recommendation
Measurements
Batch + online combination
• Last.fm dataset with 30M music listening events
• Weekly batch training
• Evaluation: weekly average
  • on every incoming listening event
• Around 45,000 users
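The speaker notes name DCG (Discounted Cumulative Gain) as the ranking-quality measure. A minimal sketch of computing DCG for one ranked list and averaging per-event scores over a week is given below. How the authors build the ranked list for each listening event is not stated in the slides, so the inputs here are assumptions and the names `Dcg`, `dcg` and `average` are illustrative.

```scala
object Dcg {
  // DCG of a ranked list: sum of relevance / log2(rank + 1) over 1-based ranks.
  def dcg(relevances: Seq[Double]): Double =
    relevances.zipWithIndex.map { case (rel, idx) =>
      // idx is 0-based, so rank + 1 equals idx + 2
      rel / (math.log(idx + 2.0) / math.log(2.0))
    }.sum

  // Average of per-event scores, for example all listening events of one week.
  def average(scores: Seq[Double]): Double =
    if (scores.isEmpty) 0.0 else scores.sum / scores.size
}
```

A relevant item at the top of the list contributes 1, and the contribution shrinks as the item is ranked lower, so a higher average means better rankings.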
Online MF: Spark vs. Flink
• Last.fm dataset with 30M music listening events, read from 12 Kafka partitions
• Spark batch duration: 5 sec
• Time of processing X ratings
• DSGD algorithm
• Using 6 nodes, 4 cores each
• Spark 2.1.0, Flink 1.2.0
Batch on Flink Streaming
• MovieLens 1M movie rating dataset
• Using 6 nodes, 4 cores each

Editor's Notes

  • #4-#6 Say that we focus on comparing the two systems for this use case.
  • #7 Ratings in a sparse matrix
  • #21 Story: it turned out to be worth combining the two. Message: batch + online is better than batch alone or online alone. DCG: Discounted Cumulative Gain, measures ranking quality, higher is better https://en.wikipedia.org/wiki/Discounted_cumulative_gain
  • #22 Sources: Spotify 2015 data https://techcrunch.com/2015/12/01/spotify-claims-streaming-music-throne-worldwide-but-pandora-is-still-top-service-in-u-s/?ncid=rss#.uuccs9:VA8w YouTube https://www.youtube.com/yt/press/en-GB/statistics.html
  • #43-#44 Vs. mini-batch. Send records without global synchronization.
  • #45-#50 TODO: animate the 4 slides