ICML2012 Reading Group:
Scaling Up Coordinate Descent Algorithms for
       Large L1 Regularization Problems
               2012-07-28
             Yoshihiko Suhara
              @sleepy_yoshi
Paper covered
• Scaling Up Coordinate Descent Algorithms for
  Large L1 Regularization Problems
  – by C. Scherrer, M. Halappanavar, A. Tewari, D. Haglin


• Parallelization of coordinate descent
  – Related work: [Bradley+ 11] Parallel Coordinate Descent for
    L1-Regularized Loss Minimization (ICML 2011)

Overview
• Introduces a generic framework for parallel coordinate
  descent on shared-memory multicore machines

• Proposes two methods
  – Thread-Greedy Coordinate Descent
  – Coloring-Based Coordinate Descent

• Compares four parallel CD methods experimentally
  – Thread-Greedy worked surprisingly well

Optimizing an L1-Regularized Loss Function
• L1-regularized objective (a small numerical sketch follows after this slide)

    \min_{\mathbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, (\mathbf{X}\mathbf{w})_i\big) + \lambda \|\mathbf{w}\|_1

• where
  – \mathbf{X} \in \mathbb{R}^{n \times k}: design matrix
  – \mathbf{w} \in \mathbb{R}^{k}: weight vector
  – \ell(y, \cdot): differentiable convex loss function

• Examples
  – Lasso (L1 + squared error)
  – L1-regularized logistic regression

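To make the objective concrete, here is a minimal NumPy sketch (not from the slides) that evaluates the L1-regularized objective for a squared loss and a logistic loss; the names X, y, w, and lam are illustrative.

```python
import numpy as np

def squared_loss(y, z):
    # l(y, z) = 0.5 * (y - z)^2
    return 0.5 * (y - z) ** 2

def logistic_loss(y, z):
    # l(y, z) = log(1 + exp(-y * z)), for labels y in {-1, +1}
    return np.log1p(np.exp(-y * z))

def l1_objective(X, y, w, lam, loss=squared_loss):
    """(1/n) * sum_i loss(y_i, (Xw)_i) + lam * ||w||_1"""
    z = X @ w                               # predictions Xw
    return np.mean(loss(y, z)) + lam * np.abs(w).sum()

# toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [1.0, -2.0, 0.5]
y = X @ w_true + 0.1 * rng.standard_normal(100)
print(l1_objective(X, y, np.zeros(20), lam=0.1))
```
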
Notation

    \mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_j, \ldots, \mathbf{X}_k)
        (columns of the design matrix)

    \mathbf{e}_j = (0, 0, \ldots, 1, \ldots, 0)^T
        (1 in the j-th position)

    \mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_i, \ldots, \mathbf{x}_n)^T
        (\mathbf{x}_i^T is the i-th row)

Supplement: Coordinate Descent
• Known in Japanese as 座標降下法 (?)
• Performs a line search along the selected coordinate
• Various ways to choose the coordinate
  – e.g., cyclic coordinate descent
• For parallel execution, a subset of all coordinates is selected
  and updated (a sequential baseline sketch follows below)

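As a concrete (sequential) baseline, here is a minimal cyclic coordinate descent sketch for the Lasso case, assuming the squared loss 0.5*(y - z)^2 so that the one-dimensional minimization has a closed form (soft-thresholding); none of the names below come from the paper.

```python
import numpy as np

def soft_threshold(z, t):
    # argmin_x 0.5 * (x - z)^2 + t * |x|
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cyclic_cd_lasso(X, y, lam, n_epochs=100):
    """Cyclic CD for (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1."""
    n, k = X.shape
    w = np.zeros(k)
    Xw = np.zeros(n)                        # maintain Xw incrementally
    col_sq = (X ** 2).sum(axis=0) / n       # ||X_j||^2 / n
    for _ in range(n_epochs):
        for j in range(k):                  # visit coordinates cyclically
            r = y - Xw + X[:, j] * w[j]     # residual excluding coordinate j
            z = X[:, j] @ r / n
            w_new = soft_threshold(z, lam) / col_sq[j]
            Xw += X[:, j] * (w_new - w[j])  # keep Xw consistent with w
            w[j] = w_new
    return w
```
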
GenCD: A Generic Framework for
  Parallel Coordinate Descent


                   (For some reason, the original slides switch to English from this point.)

Generic Coordinate Descent (GenCD)




Step 1: Select
• Select a set 𝐽 of coordinates
• The selection criterion differs across CD variants;
  an illustrative sketch of these rules follows below
   – cyclic CD (CCD)
   – stochastic CD (SCD)
      • selects a singleton
   – fully greedy CD
      • 𝐽 = {1, … , 𝑘}
   – Shotgun [Bradley+ 11]
      • selects a random subset of a given size

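The selection rules above can be written as small functions; this is an illustrative sketch (the function names are mine, not the paper's API).

```python
import numpy as np

rng = np.random.default_rng(0)

def select_cyclic(t, k):
    return [t % k]                    # CCD: one coordinate, visited in order

def select_stochastic(k):
    return [int(rng.integers(k))]     # SCD: one coordinate, uniformly at random

def select_greedy(k):
    return list(range(k))             # fully greedy CD: all coordinates

def select_shotgun(k, size):
    # Shotgun: a random subset of a given size
    return list(rng.choice(k, size=size, replace=False))
```
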
Step 2: Propose
• The Propose step computes a proposed increment 𝛿_j for
  each 𝑗 ∈ 𝐽
   – this step does not actually change the weights

• The algorithm maintains a vector 𝝋 ∈ ℝ^k, where 𝝋_j is a
  proxy for the objective function evaluated at 𝒘 + 𝛿_j 𝒆_j
   – 𝝋_j is updated whenever a new proposal is calculated for 𝑗
   – 𝝋 is not necessary if the algorithm accepts all proposals

Step 3: Accept
• In the Accept step, the algorithm accepts a subset 𝐽′ ⊆ 𝐽
   – [Bradley+ 11] show that correlations among features can
     lead to divergence if too many coordinates are updated at
     once (figure omitted)

• In CCD, SCD, and Shotgun, all proposals are accepted
   – No need to calculate 𝝋

Step 4: Update
• In the Update step, the algorithm updates the weights
  according to the accepted set 𝐽′
   – the vector 𝑿𝒘 is maintained incrementally
     (a skeleton of the full Select/Propose/Accept/Update loop follows below)

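Putting the four steps together, here is a minimal sketch of the GenCD loop, assembled from the Select/Propose/Accept/Update slides; the function names and signatures are illustrative assumptions, not the paper's interface.

```python
import numpy as np

def gencd(X, y, lam, select, propose, accept, n_rounds=50):
    """Skeleton of the GenCD loop: Select -> Propose -> Accept -> Update.
    The `select`, `propose`, and `accept` callbacks are what distinguish
    CCD, SCD, Shotgun, GREEDY, THREAD-GREEDY, and COLORING."""
    n, k = X.shape
    w = np.zeros(k)
    Xw = np.zeros(n)                              # maintained incrementally (Step 4)
    for t in range(n_rounds):
        J = select(t, k)                          # Step 1: candidate coordinates
        deltas = {j: propose(j, w, Xw, X, y, lam) for j in J}   # Step 2: proposals
        J_accepted = accept(deltas, w, Xw, X, y, lam)           # Step 3: accepted subset
        for j in J_accepted:                      # Step 4: apply accepted increments
            Xw += X[:, j] * deltas[j]
            w[j] += deltas[j]
    return w
```
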
Approximate Minimization (1/2)
• The Propose step calculates a proposed increment 𝛿_j for each 𝑗 ∈ 𝐽:

    \delta_j = \arg\min_{\delta} \; F(\mathbf{w} + \delta \mathbf{e}_j) + \lambda |w_j + \delta|,
    \qquad \text{where } F(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, (\mathbf{X}\mathbf{w})_i\big)

• For a general loss function, there is no closed-form solution
  along a given coordinate
   – Thus, consider approximate minimization

Approximate Minimization (2/2)
• Well-known approximate minimizer (e.g., [Yuan and Lin 10]):

    \delta = -\psi\left(w_j;\; \frac{\nabla_j F(\mathbf{w}) - \lambda}{\beta},\; \frac{\nabla_j F(\mathbf{w}) + \lambda}{\beta}\right)

    \text{where } \psi(x; a, b) =
      \begin{cases}
        a & \text{if } x < a \\
        b & \text{if } x > b \\
        x & \text{otherwise}
      \end{cases}

  For the squared loss \beta = 1; for the logistic loss \beta = 1/4.

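A minimal sketch of this proposal rule in code, assuming beta is the curvature constant quoted above (1 for the squared loss, 1/4 for the logistic loss) and grad_j is the already-computed coordinate gradient:

```python
import numpy as np

def psi(x, a, b):
    # clip x into the interval [a, b]
    return np.minimum(np.maximum(x, a), b)

def propose_delta(w_j, grad_j, lam, beta):
    """delta = -psi(w_j; (grad_j - lam)/beta, (grad_j + lam)/beta)"""
    a = (grad_j - lam) / beta
    b = (grad_j + lam) / beta
    return -psi(w_j, a, b)

# a zero weight with a small gradient (|grad| <= lam) proposes no move
print(propose_delta(w_j=0.0, grad_j=0.05, lam=0.1, beta=1.0))   # -0.0
```
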
Step 2: Propose (Approximated)

  (Algorithm figure omitted. The annotated quantity appears to be the coordinate
  gradient \nabla_j F(\mathbf{w}) = \frac{1}{n} \sum_i \ell'(y_i, z_i) X_{ij} with
  \mathbf{z} = \mathbf{X}\mathbf{w}, and the highlighted term is the decrease in
  the approximated objective.)

Experiments



Algorithms (conventional)
• SHOTGUN [Bradley+ 11]
   – Select step: a random subset of the columns
   – Accept step: accepts every proposal
       • No need to compute a proxy for the objective
   – convergence is guaranteed only if the number of coordinates selected
     is at most P^* = k / (2\rho) (*1)

• GREEDY
   – Select step: all coordinates
   – Propose step: each thread generates proposals for some subset
     of the coordinates using the approximation
   – Accept step: accepts only the single best proposal across all threads
     (a sketch of these Select/Accept rules follows below)

                             (*1) \rho is the spectral radius (largest eigenvalue) of \mathbf{X}^T \mathbf{X}

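For illustration, the SHOTGUN and GREEDY choices can be written as Select/Accept rules in the style of the GenCD skeleton above; this is a sketch with made-up names, using a dictionary phi of proxy objective values as in the Propose slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def shotgun_select(k, size):
    # SHOTGUN: a random subset of the columns
    return list(rng.choice(k, size=size, replace=False))

def shotgun_accept(J, phi):
    # SHOTGUN: accept every proposal (phi is not needed)
    return list(J)

def greedy_select(k):
    # GREEDY: consider all coordinates
    return list(range(k))

def greedy_accept(J, phi):
    # GREEDY: accept only the single best proposal,
    # where phi[j] is a proxy for the objective at w + delta_j * e_j
    return [min(J, key=lambda j: phi[j])]

# toy demonstration with fake proxy values
phi = {0: 1.2, 3: 0.7, 5: 0.9}
print(shotgun_accept(phi.keys(), phi))   # [0, 3, 5]
print(greedy_accept(phi.keys(), phi))    # [3]
```
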
Comparisons of the Algorithms




Algorithms (proposed)
• THREAD-GREEDY
   – Select step: a random set of coordinates (?)
   – Propose step: each thread generates proposals for some subset of the
     coordinates using the approximation
   – Accept step: each thread accepts the best of its own proposals
     (a sequential sketch of one round follows below)
   – no proof of convergence (however, empirical results are encouraging)

• COLORING
   – Preprocessing: structurally independent features are identified via
     partial distance-2 coloring
   – Select step: a random color is selected
   – Accept step: accepts every proposal
       • since features of the same color are structurally independent
         (their non-zero rows are disjoint)

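Here is a minimal, sequential sketch of one THREAD-GREEDY round, assuming the squared loss and the proposal rule from the Approximate Minimization slide; the "threads" are simulated as blocks of coordinates, and ranking proposals by the decrease of the quadratic surrogate is my own choice of proxy, not necessarily the paper's.

```python
import numpy as np

def thread_greedy_round(w, Xw, X, y, lam, beta=1.0, n_threads=4):
    """One THREAD-GREEDY round: coordinates are partitioned across
    `n_threads` blocks; each block proposes increments for its coordinates
    and accepts only its single best proposal."""
    n, k = X.shape
    grad = X.T @ (Xw - y) / n                         # gradient of the squared loss
    for block in np.array_split(np.arange(k), n_threads):
        best_j, best_dec, best_delta = None, 0.0, 0.0
        for j in block:
            a = (grad[j] - lam) / beta
            b = (grad[j] + lam) / beta
            delta = -np.clip(w[j], a, b)              # approximate 1-D minimizer
            # decrease of the quadratic surrogate plus the L1 term
            dec = -(grad[j] * delta + 0.5 * beta * delta ** 2
                    + lam * (abs(w[j] + delta) - abs(w[j])))
            if dec > best_dec:
                best_j, best_dec, best_delta = j, dec, delta
        if best_j is not None:                        # Update step for this "thread"
            Xw += X[:, best_j] * best_delta
            w[best_j] += best_delta
    return w, Xw
```
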
Implementation and Platform
• Implementation
  – gcc with OpenMP
     • -O3 -fopenmp flags
     • parallel for pragma
     • static scheduling
         – Given n iterations and p threads, each thread gets n/p iterations


• Platform
  – AMD Opteron (Magny-Cours)
     • with 48 cores (12 cores x 4 sockets)
  – 256GB Memory

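The static scheduling mentioned above simply gives each thread a contiguous block of roughly n/p iterations; below is a small sketch of that partitioning (an approximation of OpenMP's behavior, written in Python rather than the C/OpenMP used in the paper).

```python
def static_chunks(n, p):
    """Split n iterations into p contiguous chunks of roughly n/p each,
    mimicking OpenMP static scheduling without an explicit chunk size."""
    base, rem = divmod(n, p)
    chunks, start = [], 0
    for t in range(p):
        size = base + (1 if t < rem else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

print([list(c) for c in static_chunks(10, 4)])
# [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```
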
Datasets

  (Datasets table omitted; NNZ = number of non-zero entries)

Convergence rates

  (Convergence-rate plots omitted)

  Presenter's note: "I don't know why"

Scalability




Summary
• Presented GenCD, a generic framework for
  expressing parallel coordinate descent
  – Select, Propose, Accept, Update

• Performed convergence and scalability tests for the
  four algorithms
  – but the authors do not favor any of these algorithms
    over the others

• The condition for convergence of the THREAD-GREEDY
  algorithm is an open question

References
• [Yuan and Lin 10] G. Yuan, C. Lin, "A Comparison of Optimization Methods
  and Software for Large-scale L1-regularized Linear Classification", Journal
  of Machine Learning Research, vol. 11, pp. 3183-3234, 2010.
• [Bradley+ 11] J. K. Bradley, A. Kyrola, D. Bickson, C. Guestrin, "Parallel
  Coordinate Descent for L1-Regularized Loss Minimization", In Proc. ICML '11, 2011.

The End
