Institute of Manufacturing Information and Systems (製造資訊與系統研究所)
Institute of Engineering Management (工程管理碩士在職專班)
National Cheng Kung University (國立成功大學)
Advisor: Dr. 李家岩
Presenter: 洪紹嚴 (Shao-Yen Hung)
Date: 2016/05/06
Online Optimization Problem (1)
0. Agenda
1. Problem Definition
2. Background Knowledge
◦ Convex Function, Subgradient
◦ Lagrange Multiplier, KKT Conditions
◦ Loss Function
◦ Regularization
◦ BGD (Batch Gradient Descent) vs. SGD (Stochastic Gradient Descent)
3. SCR (Simple Coefficient Rounding); TG (Truncated Gradient)
4. FOBOS (Forward-Backward Splitting / Forward Looking Subgradients)
5. RDA (Regularized Dual Averaging)
6. FTRL (Follow-the-Regularized-Leader)
0. Agenda
1. Problem Definition
2. Background Knowledge
◦ Convex Function, Subgradient
◦ Lagrange Multiplier, KKT Conditions
◦ Loss Function
◦ Regularization
◦ BGD (Batch Gradient Descent) vs. SGD (Stochastic Gradient Descent)
3. SCR (Simple Coefficient Rounding); TG (Truncated Gradient)
4. FOBOS (Forward-Backward Splitting / Forward Looking Subgradients)
5. RDA (Regularized Dual Averaging)
6. FTRL (Follow-the-Regularized-Leader)
1. Problem Definition (Online Optimization)
• Training a model (OLS, NN, SVM, ...) is an optimization process (e.g., finding the optimal w* that minimizes the total error between the predicted and actual values).
• Traditionally, when new data arrive, all previously seen data are retrained together with them to obtain a new w*. This approach is called Batch Learning.
  ◦ Drawbacks: slow, large memory requirement
• Correcting w* based only on the new data is called Online Learning (see the sketch after this list).
  ◦ Advantages: fast, small memory requirement
• Examples: ad click analysis, portfolio management, product recommendation, ... (big-data settings)
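As a concrete illustration of the online setting, the following minimal sketch processes one sample at a time and updates w immediately after each sample; the logistic-loss update and all names (online_learn, stream, eta) are illustrative assumptions, not something shown on the original slide.

import numpy as np

def online_learn(stream, dim, eta=0.1):
    """Minimal online learning loop: one gradient step per arriving sample."""
    w = np.zeros(dim)
    for x, y in stream:                            # y is a label in {-1, +1}
        margin = y * (w @ x)
        grad = -y * x / (1.0 + np.exp(margin))     # gradient of log(1 + exp(-y * w.x))
        w -= eta * grad                            # update immediately; old data are never revisited
    return w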
0. Agenda
1. Problem Definition
2. Background Knowledge
◦ Convex Function, Subgradient
◦ Lagrange Multiplier, KKT Conditions
◦ Loss Function
◦ Regularization
◦ BGD (Batch Gradient Descent) vs. SGD (Stochastic Gradient Descent)
3. SCR (Simple Coefficient Rounding); TG (Truncated Gradient)
4. FOBOS (Forward-Backward Splitting / Forward Looking Subgradients)
5. RDA (Regularized Dual Averaging)
6. FTRL (Follow-the-Regularized-Leader)
2. Background Knowledge
◦ Convex Function: for a twice-differentiable f, $\frac{\partial^2 f}{\partial x^2} \ge 0$
◦ Gradient and Subgradient
[Figure: plot of y = f(x) = |x|. Where the function is differentiable the gradient is ±1; at x = 0, the subgradient set is ∂f(0) = [-1, 1].]
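For completeness, a standard definition of the subgradient illustrated by the |x| example above (this formula is an addition, not from the original slide):

$$g \in \partial f(x_0) \quad\Longleftrightarrow\quad f(x) \ge f(x_0) + g\,(x - x_0) \;\; \text{for all } x$$

For $f(x) = |x|$: $\partial f(x) = \{\mathrm{sgn}(x)\}$ for $x \ne 0$, and $\partial f(0) = [-1, 1]$.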
2. Background Knowledge
If f(x) is a convex function:
(1) Unconstrained problem: solve directly by setting the gradient (subgradient) to zero.
(2) Problem with equality constraints: solve with a Lagrange multiplier.
(3) Problem with inequality constraints: solve with the KKT conditions.
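The slide's own equations were not recoverable; the following is a standard statement of the three cases it refers to, in my notation:

$$\text{(1)}\;\; \min_x f(x): \qquad \nabla f(x^*) = 0$$

$$\text{(2)}\;\; \min_x f(x)\;\; \text{s.t.}\;\; h(x) = 0: \qquad L(x, \lambda) = f(x) + \lambda\, h(x), \quad \nabla_x L(x^*, \lambda^*) = 0, \;\; h(x^*) = 0$$

$$\text{(3)}\;\; \min_x f(x)\;\; \text{s.t.}\;\; g(x) \le 0: \qquad \nabla f(x^*) + \mu\, \nabla g(x^*) = 0, \quad \mu \ge 0, \;\; g(x^*) \le 0, \;\; \mu\, g(x^*) = 0$$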
2. Background Knowledge
Formulation of the optimization problem:
• l(W, Z) = loss function
• Z = the set of observed samples
• $X_j$ = feature vector of the j-th sample
• $y_j$ = h(W, $X_j$) = predicted value of the j-th sample
• W = the feature weights (the parameters being solved for)
The overall loss can be viewed as the sum of the per-sample losses, e.g. squared loss for Linear Regression and log loss for Logistic Regression.
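One common way to write the objective and the two per-sample losses named above (my notation; the slide's formulas were images):

$$\min_W \; l(W, Z) = \sum_j l(W, z_j), \qquad z_j = (X_j, y_j)$$

$$\text{Linear Regression:}\quad l(W, z_j) = \tfrac{1}{2}\,\big(y_j - W^{\mathsf T} X_j\big)^2$$

$$\text{Logistic Regression:}\quad l(W, z_j) = \log\!\big(1 + \exp(-y_j\, W^{\mathsf T} X_j)\big), \quad y_j \in \{-1, +1\}$$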
2. Background Knowledge
• Regularization
  ◦ avoids the overfitting problem
  ◦ generates sparsity
The added term is called the regularizer (regularization term), a function of W.
Both the L1 norm (Lasso) and the squared L2 norm (Ridge) are convex; the L1 norm additionally induces sparsity.
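A standard way to write the two regularized objectives named above, with λ as the regularization weight (my notation):

$$\text{L1 (Lasso):}\quad \min_W \; \sum_j l(W, z_j) + \lambda \lVert W \rVert_1$$

$$\text{L2 (Ridge):}\quad \min_W \; \sum_j l(W, z_j) + \lambda \lVert W \rVert_2^2$$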
2. Background Knowledge
Batch Gradient Descent vs. Stochastic Gradient Descent:
In the update formulas, the learning rate is a constant and t is the current iteration; BGD computes its update over all of the data, whereas SGD updates using only the newly arrived data.
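A minimal sketch contrasting the two update rules for squared loss; the function and parameter names (grad, bgd_step, sgd_step, eta) are illustrative assumptions:

import numpy as np

def grad(w, X, y):
    """Gradient of the mean squared loss (1/2n) * ||Xw - y||^2."""
    return X.T @ (X @ w - y) / len(y)

def bgd_step(w, X_all, y_all, eta=0.1):
    # Batch GD: every update touches ALL of the data.
    return w - eta * grad(w, X_all, y_all)

def sgd_step(w, x_new, y_new, eta=0.1):
    # Stochastic / online GD: the update uses only the newly arrived sample.
    return w - eta * (x_new @ w - y_new) * x_new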
0. Agenda
1. Problem Definition
2. Background Knowledge
◦ Convex Function, Subgradient
◦ Lagrange Multiplier, KKT Conditions
◦ Loss Function
◦ Regularization
◦ BGD (Batch Gradient Descent) vs. SGD (Stochastic Gradient Descent)
3. SCR (Simple Coefficient Rounding); TG (Truncated Gradient)
4. FOBOS (Forward-Backward Splitting / Forward Looking Subgradients)
5. RDA (Regularized Dual Averaging)
6. FTRL (Follow-the-Regularized-Leader)
3. SCR (Simple Coefficient Rounding)
• Addresses the problem that L1 regularization under SGD does not actually generate sparsity.
• 3 parameters:
  ◦ θ: threshold for deciding whether a coefficient is set to 0
  ◦ K: truncation is performed once every K online steps
  ◦ η: learning rate

$$W^{t+1} = \begin{cases} W^t - \eta\,\dfrac{\partial l(W^t, Z)}{\partial W^t}, & \text{if } \bmod(t, K) \ne 0 \\[2mm] T_0\!\left(W^t - \eta\,\dfrac{\partial l(W^t, Z)}{\partial W^t},\; \theta\right), & \text{if } \bmod(t, K) = 0 \end{cases}$$

$$T_0(v_i, \theta) = \begin{cases} 0, & \text{if } |v_i| < \theta \\ v_i, & \text{otherwise} \end{cases}$$

"We first apply the standard stochastic gradient descent rule, and then round small coefficients to zero." (Langford et al., 2009)
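A minimal sketch of the SCR update defined above; the function and parameter names (t0, scr_step, eta, K, theta) are illustrative assumptions:

import numpy as np

def t0(v, theta):
    """Simple coefficient rounding: zero out entries with |v_i| < theta."""
    return np.where(np.abs(v) < theta, 0.0, v)

def scr_step(w, grad, t, eta=0.1, K=10, theta=1e-3):
    """One SCR online step: a plain SGD step, plus rounding every K-th step."""
    w = w - eta * grad          # standard stochastic gradient step
    if t % K == 0:              # every K online steps...
        w = t0(w, theta)        # ...round small coefficients to zero
    return w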
3. TG (Truncated Gradient)
• "We observe that the direct rounding to zero is too aggressive. A less aggressive version is to shrink the coefficient to zero by a smaller amount. We call this idea truncated gradient." (Langford et al., 2009)
(Lasso)
3. TG (Truncated Gradient)
$$T_1(v_i, \alpha, \theta) = \begin{cases} \max(0,\; v_i - \alpha), & \text{if } v_i \in [0, \theta] \\ \min(0,\; v_i + \alpha), & \text{if } v_i \in [-\theta, 0] \\ v_i, & \text{otherwise} \end{cases}$$

$$W^{t+1} = T_1\!\left(W^t - \eta\,\frac{\partial l(W^t, Z)}{\partial W^t},\; \eta g_i,\; \theta\right), \qquad g_i = \begin{cases} 0, & \text{if } \bmod(t, K) \ne 0 \\ K g, & \text{if } \bmod(t, K) = 0 \end{cases}$$

• 4 parameters:
  ◦ θ: threshold for deciding whether a coefficient is set to 0
  ◦ K: truncation is performed once every K online steps
  ◦ η: learning rate
  ◦ g: gravity parameter
A code sketch of the T1 operator and the TG step follows below.
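A minimal sketch of the T1 operator and the TG update defined above; the function and parameter names (t1, tg_step, eta, K, theta, g) are illustrative assumptions:

import numpy as np

def t1(v, alpha, theta):
    """Truncated gradient operator: shrink entries with |v_i| <= theta towards zero by alpha."""
    out = np.where((v >= 0) & (v <= theta), np.maximum(0.0, v - alpha), v)
    out = np.where((v < 0) & (v >= -theta), np.minimum(0.0, v + alpha), out)
    return out

def tg_step(w, grad, t, eta=0.1, K=10, theta=0.1, g=0.01):
    """One TG online step: SGD, then truncation with gravity eta*K*g every K-th step."""
    w = w - eta * grad
    if t % K == 0:
        w = t1(w, eta * K * g, theta)
    return w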
3. TG (Truncated Gradient)
$$W^{t+1} = \begin{cases} T_1\!\left(W^t - \eta\,\dfrac{\partial l(W^t, Z)}{\partial W^t},\; \eta g_i,\; \theta\right), & \text{if } \bmod(t, K) = 0 \\[2mm] W^t - \eta\,\dfrac{\partial l(W^t, Z)}{\partial W^t}, & \text{otherwise} \end{cases}$$

Rewrite the TG formulation:

$$T_1(v_i, \alpha, \theta) = \begin{cases} 0, & \text{if } |v_i| < \alpha \\ v_i - \alpha\,\mathrm{sgn}(v_i), & \text{if } \alpha \le |v_i| < \theta \\ v_i, & \text{otherwise} \end{cases}$$

1) If α = θ, TG reduces to SCR (simple rounding / truncation).
2) If K = 1 and θ = ∞:
$$W^{t+1} = T_1\!\left(W^t - \eta\,\frac{\partial l(W^t, Z)}{\partial W^t},\; \eta g_i,\; \infty\right) = W^t - \eta\,\frac{\partial l(W^t, Z)}{\partial W^t} - \eta g_i\,\mathrm{sgn}\!\left(W^t - \eta\,\frac{\partial l(W^t, Z)}{\partial W^t}\right) \quad (\text{if } \alpha \le |v_i|)$$
which is exactly the L1-regularized (Lasso) update.
3. SCR vs TG vs LASSO
[Figure: the truncation/shrinkage functions compared, with thresholds at −α and α. SCR corresponds to α = θ; the LASSO-style shrinkage corresponds to K = 1, θ = ∞.]
3. TG (Truncated Gradient)
0. Agenda
1. Problem Definition
2. Background Knowledge
◦ Convex Function, Subgradient
◦ Lagrange Multiplier, KKT Conditions
◦ Loss Function
◦ Regularization
◦ BGD (Batch Gradient Descent) vs. SGD (Stochastic Gradient Descent)
3. SCR (Simple Coefficient Rounding); TG (Truncated Gradient)
4. FOBOS (Forward-Backward Splitting / Forward Looking Subgradients)
5. RDA (Regularized Dual Averaging)
6. FTRL (Follow-the-Regularized-Leader)
4. FOBOS (Forward Backward Splitting)
(1) The standard gradient descent formula:
(2) The L1-FOBOS gradient descent formula, which can be split into two parts:
  ◦ First part: a fine-tuning step that stays close to the result of the gradient descent step, $W^{t+\frac{1}{2}}$
  ◦ Second part: handles the regularization and produces sparsity
r(w) = regularization function
• In fact, this method should arguably have been called FOBAS, but the original author John Langford initially called it FOLOS (Forward Looking Subgradients); to avoid confusion, the A was changed to O, giving FOBOS.
(FOBOS with L1 regularization)
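For reference, a standard way to write the two FOBOS steps described above, following Duchi & Singer (2009); the exact formulas on the slide were images:

$$W^{t+\frac{1}{2}} = W^{t} - \eta^{t} g^{t}, \qquad g^{t} \in \partial l(W^{t}, Z^{t})$$

$$W^{t+1} = \arg\min_{W} \left\{ \tfrac{1}{2}\,\big\lVert W - W^{t+\frac{1}{2}} \big\rVert^{2} + \eta^{t+\frac{1}{2}}\, r(W) \right\}$$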
4. FOBOS (Forward Backward Splitting)
(3) A sufficient condition for the optimum of (2): 0 belongs to the subgradient set of the objective at $W^{t+1}$.
(4) Because of this, (3) can be rewritten as:
(5) In other words, after rearranging (4):
  ◦ the state before the iteration, $W^{t}$, together with its gradient → the "backward" part
  ◦ the regularization information of the current iterate, $\partial r(W^{t+1})$ → the "forward" part
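A sketch of the derivation outlined in (3)-(5), under the usual FOBOS setup (this reconstruction is mine; the slide's own equations were images):

$$0 \in \partial\!\left\{ \tfrac{1}{2}\big\lVert W - W^{t+\frac{1}{2}} \big\rVert^{2} + \eta^{t+\frac{1}{2}}\, r(W) \right\}\Big|_{W = W^{t+1}}$$

$$\Rightarrow\;\; 0 = W^{t+1} - W^{t+\frac{1}{2}} + \eta^{t+\frac{1}{2}}\,\partial r(W^{t+1})$$

$$\Rightarrow\;\; W^{t+1} = W^{t} - \eta^{t} g^{t} - \eta^{t+\frac{1}{2}}\,\partial r(W^{t+1})$$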
4. FOBOS (Forward Backward Splitting)
• L1-FOBOS's Sparsity
(1) Rewrite the original update as a minimization problem (2), taking $r(w) = \lambda \lVert w \rVert_1$.
4. FOBOS (Forward Backward Splitting)
• L1-FOBOS's Sparsity
(2) This problem can be decomposed into a sum over the individual weight dimensions, yielding a per-dimension subproblem (3); a sketch of the subproblem follows below.
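For the proofs on the next two slides, the per-dimension subproblem can be written as follows (my reconstruction; v and λ are used as on the following slides, where λ stands for the effective weight $\eta^{t+\frac{1}{2}}\lambda$):

$$w^* = \arg\min_{w}\; \tfrac{1}{2}\,(w - v_j)^2 + \lambda\,|w|, \qquad v_j = W_j^{t} - \eta^{t} g_j^{t}$$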
4. FOBOS (Forward Backward Splitting)
• L1-FOBOS's Sparsity
(3) If w* is the optimal solution for some dimension j, then w*·v_j ≥ 0.
(Proof by contradiction) Suppose instead that w*·v_j < 0. Since w = 0 attains the objective value $\tfrac{1}{2}v_j^2$, we would have
$$\tfrac{1}{2}\,v_j^2 < \tfrac{1}{2}\,v_j^2 - w^* v_j + \tfrac{1}{2}\,(w^*)^2 = \tfrac{1}{2}\,(v_j - w^*)^2 \le \tfrac{1}{2}\,(v_j - w^*)^2 + \lambda\,|w^*|,$$
i.e., w = 0 would achieve a strictly smaller objective value than w*. This contradicts w* being the optimal solution, so w*·v_j ≥ 0.
4. FOBOS (Forward Backward Splitting)
• L1-FOBOS's Sparsity
(4) When w*·v_j ≥ 0, consider the case $v_j \ge 0$ (so the minimizer satisfies $w \ge 0$, i.e., the constraint $-w \le 0$ with multiplier β).
KKT: $\dfrac{\partial}{\partial w}\!\left[\tfrac{1}{2}(v_j - w)^2 + \lambda w - \beta w\right] = 0$ at $w = w^*$, with $\beta \ge 0$ and $\beta\, w^* = 0$.
a) $w^* > 0$: since $\beta w^* = 0$, β = 0, hence $w^* = v_j - \lambda$.
b) $w^* = 0$: since $\beta \ge 0$, $v_j - \lambda \le 0$.
That is, $w^* = \max(0,\; v_j - \lambda)$.
(5) Similarly, for $v_j < 0$: $w^* = -\max(0,\; -v_j - \lambda)$.
4. FOBOS (Forward Backward Splitting)
• L1-FOBOS's Sparsity
(6) Combining the conclusions of (4) and (5):
$$W_i^{t+1} = \mathrm{sgn}(v_i)\,\max\big(0,\; |v_i| - \lambda\big), \qquad v_i = W_i^t - \eta^t g_i^t$$
$$W_i^{t+1} = \mathrm{sgn}\!\big(W_i^t - \eta^t g_i^t\big)\,\max\!\big(0,\; \big|W_i^t - \eta^t g_i^t\big| - \eta^{t+\frac{1}{2}}\lambda\big)$$
4. FOBOS (Forward Backward Splitting)
• L1-FOBOS's Sparsity
(7) From the expression in (6),
$$W_i^{t+1} = \mathrm{sgn}\!\big(W_i^t - \eta^t g_i^t\big)\,\max\!\big(0,\; \big|W_i^t - \eta^t g_i^t\big| - \eta^{t+\frac{1}{2}}\lambda\big),$$
we can see that whenever $\big|W_i^t - \eta^t g_i^t\big| \le \eta^{t+\frac{1}{2}}\lambda$, the weight $W_i^{t+1}$ is truncated to zero. In other words:
$$W_i^{t+1} = \begin{cases} 0, & \text{if } \big|W_i^t - \eta^t g_i^t\big| \le \eta^{t+\frac{1}{2}}\lambda \\[1mm] \big(W_i^t - \eta^t g_i^t\big) - \mathrm{sgn}\!\big(W_i^t - \eta^t g_i^t\big)\,\eta^{t+\frac{1}{2}}\lambda, & \text{otherwise} \end{cases}$$
(Interpretation) When the gradient produced by a new sample is not large enough to move the weight of a given dimension sufficiently far, that dimension is regarded as unimportant in this update and its weight is set to 0.
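A minimal per-coordinate sketch of the closed-form L1-FOBOS update derived above; the function and parameter names (l1_fobos_step, eta_t, eta_half, lam) are illustrative assumptions:

import numpy as np

def l1_fobos_step(w, grad, eta_t, eta_half, lam):
    """One L1-FOBOS update: a gradient step followed by soft-thresholding."""
    v = w - eta_t * grad                                             # forward (gradient) step
    return np.sign(v) * np.maximum(0.0, np.abs(v) - eta_half * lam)  # backward (shrinkage) step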
4. TG vs FOBOS
• L1-FOBOS
$$W_i^{t+1} = \begin{cases} 0, & \text{if } \big|W_i^t - \eta^t g_i^t\big| \le \eta^{t+\frac{1}{2}}\lambda \\[1mm] \big(W_i^t - \eta^t g_i^t\big) - \mathrm{sgn}\!\big(W_i^t - \eta^t g_i^t\big)\,\eta^{t+\frac{1}{2}}\lambda, & \text{otherwise} \end{cases}$$

• TG
$$W^{t+1} = \begin{cases} T_1\!\left(W^t - \eta\,\dfrac{\partial l(W^t, Z)}{\partial W^t},\; \eta g_i,\; \theta\right), & \text{if } \bmod(t, K) = 0 \\[2mm] W^t - \eta\,\dfrac{\partial l(W^t, Z)}{\partial W^t}, & \text{otherwise} \end{cases}$$

$$T_1(v_i, \alpha, \theta) = \begin{cases} 0, & \text{if } |v_i| \le \alpha \\ v_i - \alpha\,\mathrm{sgn}(v_i), & \text{if } \alpha \le |v_i| \le \theta \\ v_i, & \text{otherwise} \end{cases}$$

Interestingly, when K = 1, θ = ∞, and $\alpha = \eta^{t+\frac{1}{2}}\lambda$, TG is identical to L1-FOBOS.
0. Agenda
1. Problem Definition
2. Background Knowledge
◦ Convex Function, Subgradient
◦ Lagrange Multiplier, KKT Conditions
◦ Loss Function
◦ Regularization
◦ BGD (Batch Gradient Descent) vs. SGD (Stochastic Gradient Descent)
3. SCR (Simple Coefficient Rounding); TG (Truncated Gradient)
4. FOBOS (Forward-Backward Splitting / Forward Looking Subgradients)
5. RDA (Regularized Dual Averaging)
6. FTRL (Follow-the-Regularized-Leader)
References
[1] John Langford, Lihong Li & Tong Zhang. Sparse Online Learning via Truncated Gradient. Journal of Machine Learning Research, 2009.
[2] John Duchi & Yoram Singer. Efficient Online and Batch Learning using Forward Backward Splitting. Journal of Machine Learning Research, 2009.
[3] Lin Xiao. Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization. Journal of Machine Learning Research, 2010.
[4] H. B. McMahan. Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization. In AISTATS, 2011.
[5] H. Brendan McMahan, Gary Holt, D. Sculley et al. Ad Click Prediction: a View from the Trenches. In KDD, 2013.