Large scale Collaborative Filtering using Apache Giraph
Maja Kabiljo & Aleksandar Ilic, Facebook
01 What is Apache Giraph?
02 Collaborative Filtering problem
03 Neighborhood-based models
04 Matrix factorization
05 Conclusion
What is Apache Giraph?
Iterative and graph processing on massive datasets
Billion vertices, trillion edges
Data mapped to a graph
•Vertex ids and values
•Edges and edge values
“Think like a vertex”
[Example graph with values 10, 5, 1, 3]
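To make “think like a vertex” concrete, here is a minimal sketch of a Giraph computation, using the classic single-source shortest paths example rather than code from this talk; SOURCE_ID is a hypothetical parameter. Each vertex keeps its best-known distance, reads candidate distances from incoming messages, and forwards improvements along its edges.

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Vertex-centric single-source shortest paths: the vertex value is the best
// known distance from the source, messages carry candidate distances.
public class ShortestPathsComputation extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final long SOURCE_ID = 1L;  // hypothetical source vertex id

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) {
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    // Best distance seen so far: either the source's zero or an incoming message.
    double minDist = vertex.getId().get() == SOURCE_ID ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        // Neighbors may be able to improve their distance through this vertex.
        sendMessage(edge.getTargetVertexId(),
            new DoubleWritable(minDist + edge.getValue().get()));
      }
    }
    vertex.voteToHalt();  // woken up again only if a message arrives
  }
}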
What is Apache Giraph?
Runs on top of Hadoop
Map-only jobs
Keeps data in memory
Mappers communicate through the network
Giraph workflow
[Diagram: Giraph workflow across Worker 1, Worker 2, Worker 3]
Collaborative Filtering Problem
Predict a user’s interests based on many other users’ interests
[Illustration: related interests such as Disney, Roller coasters, Disneyland, Six Flags]
Collaborative Filtering
Main challenge: Facebook data
•Billion users, 100 billion ratings
•Skewed item degrees
•No explicit ratings
Common approaches:
•Neighborhood-based models
•Matrix factorization
Neighborhood-Based Models
Neighborhood-based CF
Start from user-item ratings
Calculate item similarities
For each item pair:
•Users who rated first item
•Users who rated second item
•Users who rated both items
[Diagram: users u who rated item I1, item I2, or both; the I1–I2 similarity is marked “?”]
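A minimal single-machine sketch of this counting step (illustrative Java, not the distributed Giraph implementation; the input layout userId -> set of rated item ids is an assumption):

import java.util.*;

// For every item pair, count users who rated the first item, the second item,
// and both, then combine the counts into a cosine-style similarity.
public class ItemSimilaritySketch {

  /** userRatings: userId -> set of itemIds that user rated. */
  public static Map<String, Double> itemSimilarities(Map<Long, Set<Long>> userRatings) {
    Map<Long, Integer> itemDegree = new HashMap<>();      // users who rated the item
    Map<String, Integer> intersection = new HashMap<>();  // users who rated both items

    for (Set<Long> items : userRatings.values()) {
      for (long item : items) {
        itemDegree.merge(item, 1, Integer::sum);
      }
      // Every pair of items rated by the same user contributes to the intersection.
      List<Long> list = new ArrayList<>(items);
      for (int i = 0; i < list.size(); i++) {
        for (int j = i + 1; j < list.size(); j++) {
          long a = Math.min(list.get(i), list.get(j));
          long b = Math.max(list.get(i), list.get(j));
          intersection.merge(a + ":" + b, 1, Integer::sum);
        }
      }
    }

    // intersection / sqrt(degree1 * degree2), matching the configurable formula
    // shown later in the deck.
    Map<String, Double> similarities = new HashMap<>();
    for (Map.Entry<String, Integer> e : intersection.entrySet()) {
      String[] pair = e.getKey().split(":");
      int d1 = itemDegree.get(Long.parseLong(pair[0]));
      int d2 = itemDegree.get(Long.parseLong(pair[1]));
      similarities.put(e.getKey(), e.getValue() / Math.sqrt((double) d1 * d2));
    }
    return similarities;
  }
}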
Neighborhood based CF
Calculate user recommendations
For every user:
•Items rated by user
•Most similar items to these items
[Diagram: user u with rated items I1–I3 and candidate items I4–I7 whose recommendation scores are marked “?”]
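A matching single-machine sketch of the recommendation step (illustrative only; summed similarity stands in for the configurable scoring formula described on the next slide):

import java.util.*;

// For a user: gather items similar to the ones the user already rated,
// accumulate scores, and return the best unseen items.
public class ItemToItemRecommendSketch {

  /** similarItems: itemId -> (candidate itemId -> similarity score). */
  public static List<Long> recommend(Set<Long> ratedByUser,
                                     Map<Long, Map<Long, Double>> similarItems,
                                     int topK) {
    Map<Long, Double> scores = new HashMap<>();
    for (long rated : ratedByUser) {
      for (Map.Entry<Long, Double> candidate :
           similarItems.getOrDefault(rated, Collections.emptyMap()).entrySet()) {
        if (!ratedByUser.contains(candidate.getKey())) {
          // Sum similarities as a simple user-to-item recommendation score.
          scores.merge(candidate.getKey(), candidate.getValue(), Double::sum);
        }
      }
    }
    return scores.entrySet().stream()
        .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
        .limit(topK)
        .map(Map.Entry::getKey)
        .collect(java.util.stream.Collectors.toList());
  }
}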
Configurable formulas
Accommodating different use cases
Each calculation step is configurable
•User’s contribution to item similarities
•Item similarities based on all users’ contributions
•User to item recommendation score
Passing a piece of Java code through configuration, e.g.:
intersection / Math.sqrt(degree1 * degree2)
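As an illustration of how such a formula can be made pluggable (the interface below is hypothetical, not the actual API; as the slide says, the real system accepts a piece of Java code through configuration):

// Hypothetical interface to illustrate the idea of a configurable similarity step.
public interface SimilarityFormula {
  double similarity(long intersection, long degree1, long degree2);

  // The cosine-style default shown above: intersection / sqrt(degree1 * degree2).
  SimilarityFormula COSINE =
      (intersection, degree1, degree2) ->
          intersection / Math.sqrt((double) degree1 * degree2);
}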
Our solution
Users-to-items edges
Preprocessing:
•Filter out low-degree users/items
•Calculate global item stats
Users send item lists to items
•Items need other items’ global stats to calculate similarities
[Diagram: user vertices u and item vertices i spread across Worker 1, Worker 2, Worker 3]
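A single-process simulation of that message pattern (illustrative Java, not the actual Giraph code): every user “sends” its rated-item list to each of its items, and each item counts co-occurrences only with larger item ids, so each pair is counted once even when items live on different workers.

import java.util.*;

public class UserToItemMessagesSketch {

  public static Map<String, Integer> pairIntersections(Map<Long, Set<Long>> userRatings) {
    // "Mailbox" per item: the rated-item lists of all users who rated that item.
    Map<Long, List<Set<Long>>> inbox = new HashMap<>();
    for (Set<Long> ratedItems : userRatings.values()) {
      for (long item : ratedItems) {
        inbox.computeIfAbsent(item, k -> new ArrayList<>()).add(ratedItems);
      }
    }

    // Each item processes its inbox independently, which is what makes the
    // computation easy to spread across workers.
    Map<String, Integer> intersection = new HashMap<>();
    for (Map.Entry<Long, List<Set<Long>>> e : inbox.entrySet()) {
      long item = e.getKey();
      for (Set<Long> ratedItems : e.getValue()) {
        for (long other : ratedItems) {
          if (other > item) {  // count each unordered pair exactly once
            intersection.merge(item + ":" + other, 1, Integer::sum);
          }
        }
      }
    }
    return intersection;
  }
}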
Optimizations
Make item info globally available
•Using the reduce/broadcast API
Striping technique
•Split computation across multiple supersteps
•In each stripe process one subset of items
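A minimal sketch of the striping idea (the hashing scheme below is an assumption, not the exact implementation):

public final class StripingSketch {
  // Items are partitioned into numStripes disjoint subsets by hash; in stripe s
  // only items with inStripe(id, s, numStripes) participate, which bounds the
  // memory needed for pair counting in any single pass.
  public static boolean inStripe(long itemId, int stripe, int numStripes) {
    return Math.floorMod(Long.hashCode(itemId), numStripes) == stripe;
  }
}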
Applications
Direct user recommendations
Context-aware recommendations
User explore
Comparison with Hive
Item similarities implemented using Hive join
•Remapping all items to 1..N first

Dataset                               Hive CPU hours (after int remapping)   Giraph CPU hours
150M users, 15M items, 4B ratings     10                                     3
1.3B users, 35M items, 15B ratings    227                                    16
2.4B users, 8M items, 220B ratings    963                                    87
Ease of use
ratings = i2iRatings(table = 'user_item_ratings')
similarities = i2iSimilarities(table = 'item_similarities')
recommendations = i2iRecommendations(table = 'user_recommendations')
i2iCalculateSimilarities(ratings,
similarities,
similarity_formula = '...',
num_workers = 10)
i2iCalculateRecommendations(ratings,
similarities,
recommendations,
scoring_formula = '...',
num_workers = 50)
Matrix Factorization
Matrix factorization CF
[Illustration: partially observed user-item rating matrix with users U1–U5 as rows and items I1–I5 as columns; some ratings (4, 1, 5, 3, ...) are known, the rest are marked “?”]
Basic form
Objective function: squared error over known ratings, plus a regularization term
Two iterative approaches:
•Stochastic Gradient Descent
•Alternating Least Squares
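For reference, the standard form of the basic model and objective the slide refers to (standard notation, not copied from the slide): with user factors p_u, item factors q_i, the set K of known ratings and regularization weight lambda,

\hat{r}_{ui} = p_u^{\top} q_i, \qquad
\min_{p_*,\, q_*} \sum_{(u,i) \in K}
\Bigl[ \bigl( r_{ui} - p_u^{\top} q_i \bigr)^2
+ \underbrace{\lambda \bigl( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \bigr)}_{\text{regularization}} \Bigr]

SGD updates the two factor vectors after each observed rating, while ALS alternately solves for all user factors with item factors fixed and vice versa.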
Standard approach
A bipartite graph:
•Users and items are vertices
•Known ratings are edges
•Feature vectors sent through edges
Problems:
•Data sent per iteration: #knownRatings * #features
•Memory
•Large degree items
•SGD modifications differ from the sequential solution
[Diagram: user and item vertices (I1–I4) spread across Worker 1, Worker 2, Worker 3]
Our solution
Extending Giraph
•Worker data
•Worker to worker messages
Users are vertices, items are worker data
Our solution - rotational approach
[Diagram: item sets 1–3 rotating among Worker 1, Worker 2, Worker 3]
•Network traffic?
•Memory?
•Skewed item degrees?
•SGD calculation?
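A conceptual single-process sketch of the rotation schedule (illustrative; the Shard and Trainer abstractions are hypothetical, not the worker-to-worker Giraph extension itself): user shards stay on their workers, item shards rotate, and after numWorkers steps every user shard has trained against every item shard exactly once per iteration, so the network moves item shards rather than per-rating feature vectors.

import java.util.List;

public class RotationalSketch {

  interface Shard { }                  // hypothetical shard of users or items
  interface Trainer {                  // hypothetical SGD/ALS update over one user/item shard pair
    void train(Shard users, Shard items);
  }

  public static void iteration(List<Shard> userShards, List<Shard> itemShards, Trainer trainer) {
    int numWorkers = userShards.size();
    for (int step = 0; step < numWorkers; step++) {
      for (int worker = 0; worker < numWorkers; worker++) {
        // Each worker always holds its own user shard; in this step it holds
        // item shard (worker + step) % numWorkers, which then rotates onward.
        trainer.train(userShards.get(worker), itemShards.get((worker + step) % numWorkers));
      }
    }
  }
}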
Recommendations
Finding top inner products
Scoring every (user, item) pair is infeasible
Creating a ball tree from item vectors
•Greedy tree traversal
•Pruning subtrees
•100-1000x faster
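A minimal sketch of the pruning idea (assumed recursive ball-tree layout, not the exact implementation from the talk): for a ball with center c and radius r, the inner product of query q with any point inside the ball is at most q·c + ||q||·r, so a whole subtree can be skipped when that bound cannot beat the current best. The sketch returns the single best inner product; a top-k version keeps a heap of candidates instead.

public class BallTreeMipsSketch {

  public static class Node {
    double[] center;
    double radius;
    Node left, right;        // both null for leaves, both set for internal nodes
    double[][] leafVectors;  // item vectors stored in a leaf, null otherwise
  }

  public static double search(Node node, double[] q, double best) {
    double bound = dot(q, node.center) + norm(q) * node.radius;
    if (bound <= best) {
      return best;  // prune: nothing in this ball can beat the current best
    }
    if (node.leafVectors != null) {
      for (double[] v : node.leafVectors) {
        best = Math.max(best, dot(q, v));
      }
      return best;
    }
    // Greedy traversal: descend into the more promising child first.
    Node first = dot(q, node.left.center) >= dot(q, node.right.center) ? node.left : node.right;
    Node second = (first == node.left) ? node.right : node.left;
    best = search(first, q, best);
    return search(second, q, best);
  }

  static double dot(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += a[i] * b[i];
    return s;
  }

  static double norm(double[] a) { return Math.sqrt(dot(a, a)); }
}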
Additional features
Tracking RMSE, average rank and precision/recall
Combining SGD & ALS
Using other objective functions:
•CF for implicit feedback (one example objective is sketched after this list)
•Biases
•Degree-based regularization
•Optimizing ranks
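As one concrete example, the standard implicit-feedback objective (a standard reference formulation, not copied from the slide) replaces the squared error over known ratings with a confidence-weighted error over all user-item pairs, with binarized preference \bar r_{ui} and confidence c_{ui}:

\min_{p_*,\, q_*} \sum_{u,i} c_{ui} \bigl( \bar r_{ui} - p_u^{\top} q_i \bigr)^2
+ \lambda \Bigl( \sum_u \lVert p_u \rVert^2 + \sum_i \lVert q_i \rVert^2 \Bigr),
\qquad \bar r_{ui} = \mathbf{1}[r_{ui} > 0], \quad c_{ui} = 1 + \alpha\, r_{ui}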
Applications
Add user and item feature vectors in ranking
Get user-to-item score in real time
Direct user recommendations
Training / testing metrics example
[Plot: RMSE (y-axis, 0–0.8) vs. iterations (x-axis, 0–44) for Train f=8, Test f=8, Train f=128, Test f=128]
Comparison with Spark MLlib
Performance of Spark MLlib ALS CF published in July 2014
On scaled copies of the Amazon reviews dataset
[Plot: CPU minutes (y-axis, 0–600) vs. millions of examples (x-axis, 0–1200) for Standard (in Spark) and Rotational (in Giraph)]
Ease of use
ratings = CFRatings(table = 'cf_ratings')
feature_vectors = CFFeatureVectors(table = 'cf_feature_vectors')
CFTrain(ratings,
feature_vectors,
CFSettings(features_size = 10, iterations = 20),
num_workers = 5)
CFRecommend(ratings,
feature_vectors,
CFRecommendations(top_items_table = 'cf_top_items'),
num_workers = 50)
Conclusion
Scalable implementation of Collaborative Filtering
On top of Apache Giraph
Highly performant (>100 billion ratings)
Neighborhood-based models
Matrix factorization
Group and Page recommendations at Facebook
Thank you!
tinyurl.com/fb-mf-cf
Questions?
