ICML2012 Reading Group:
Scaling Up Coordinate Descent Algorithms for
       Large L1 Regularization Problems
               2012-07-28
             Yoshihiko Suhara
              @sleepy_yoshi
Paper covered
• Scaling Up Coordinate Descent Algorithms for
  Large L1 Regularization Problems
  – by C. Scherrer, M. Halappanavar, A. Tewari, D. Haglin


• Parallelization of coordinate descent
  – Related work: [Bradley+ 11] Parallel Coordinate Descent for
    L1-Regularized Loss Minimization (ICML 2011)

Overview
• Introduces a generic framework for parallel coordinate
  descent on shared-memory multicore machines

• Proposes two methods
  – Thread-Greedy Coordinate Descent
  – Coloring-Based Coordinate Descent

• Compares four parallel CD methods experimentally
  – Thread-Greedy worked surprisingly well

Optimizing an L1-Regularized Loss Function
• L1-regularized objective (a small numerical sketch follows after this slide)

    \min_{\mathbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, (\mathbf{X}\mathbf{w})_i\big) + \lambda \|\mathbf{w}\|_1

• where
  – \mathbf{X} \in \mathbb{R}^{n \times k}: design matrix
  – \mathbf{w} \in \mathbb{R}^{k}: weight vector
  – \ell(y, \cdot): differentiable convex loss function

• Examples
  – Lasso (L1 + squared error)
  – L1-regularized logistic regression

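To make the objective concrete, here is a minimal NumPy sketch (not from the slides) that evaluates the L1-regularized objective for a squared loss and a logistic loss; the names X, y, w, and lam are illustrative.

```python
import numpy as np

def squared_loss(y, z):
    # l(y, z) = 0.5 * (y - z)^2
    return 0.5 * (y - z) ** 2

def logistic_loss(y, z):
    # l(y, z) = log(1 + exp(-y * z)), for labels y in {-1, +1}
    return np.log1p(np.exp(-y * z))

def l1_objective(X, y, w, lam, loss=squared_loss):
    """(1/n) * sum_i loss(y_i, (Xw)_i) + lam * ||w||_1"""
    z = X @ w                               # predictions Xw
    return np.mean(loss(y, z)) + lam * np.abs(w).sum()

# toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [1.0, -2.0, 0.5]
y = X @ w_true + 0.1 * rng.standard_normal(100)
print(l1_objective(X, y, np.zeros(20), lam=0.1))
```
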
Notation

    \mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_j, \ldots, \mathbf{X}_k)
        (columns of the design matrix)

    \mathbf{e}_j = (0, 0, \ldots, 1, \ldots, 0)^T
        (1 in the j-th position)

    \mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_i, \ldots, \mathbf{x}_n)^T
        (\mathbf{x}_i^T is the i-th row)

Supplement: Coordinate Descent
• Known in Japanese as 座標降下法 (?)
• Performs a line search along the selected coordinate
• Various ways to choose the coordinate
  – e.g., cyclic coordinate descent
• For parallel execution, a subset of all coordinates is selected
  and updated (a sequential baseline sketch follows below)

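As a concrete (sequential) baseline, here is a minimal cyclic coordinate descent sketch for the Lasso case, assuming the squared loss 0.5*(y - z)^2 so that the one-dimensional minimization has a closed form (soft-thresholding); none of the names below come from the paper.

```python
import numpy as np

def soft_threshold(z, t):
    # argmin_x 0.5 * (x - z)^2 + t * |x|
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cyclic_cd_lasso(X, y, lam, n_epochs=100):
    """Cyclic CD for (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1."""
    n, k = X.shape
    w = np.zeros(k)
    Xw = np.zeros(n)                        # maintain Xw incrementally
    col_sq = (X ** 2).sum(axis=0) / n       # ||X_j||^2 / n
    for _ in range(n_epochs):
        for j in range(k):                  # visit coordinates cyclically
            r = y - Xw + X[:, j] * w[j]     # residual excluding coordinate j
            z = X[:, j] @ r / n
            w_new = soft_threshold(z, lam) / col_sq[j]
            Xw += X[:, j] * (w_new - w[j])  # keep Xw consistent with w
            w[j] = w_new
    return w
```
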
GenCD: A Generic Framework for
  Parallel Coordinate Descent


                   (For some reason, the original slides switch to English from this point.)

Generic Coordinate Descent (GenCD)




Step 1: Select
• Select a set 𝐽 of coordinates
• The selection criterion differs across CD variants;
  an illustrative sketch of these rules follows below
   – cyclic CD (CCD)
   – stochastic CD (SCD)
      • selects a singleton
   – fully greedy CD
      • 𝐽 = {1, … , 𝑘}
   – Shotgun [Bradley+ 11]
      • selects a random subset of a given size

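The selection rules above can be written as small functions; this is an illustrative sketch (the function names are mine, not the paper's API).

```python
import numpy as np

rng = np.random.default_rng(0)

def select_cyclic(t, k):
    return [t % k]                    # CCD: one coordinate, visited in order

def select_stochastic(k):
    return [int(rng.integers(k))]     # SCD: one coordinate, uniformly at random

def select_greedy(k):
    return list(range(k))             # fully greedy CD: all coordinates

def select_shotgun(k, size):
    # Shotgun: a random subset of a given size
    return list(rng.choice(k, size=size, replace=False))
```
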
Step 2: Propose
• The Propose step computes a proposed increment 𝛿_j for
  each 𝑗 ∈ 𝐽
   – this step does not actually change the weights

• The algorithm maintains a vector 𝝋 ∈ ℝ^k, where 𝝋_j is a
  proxy for the objective function evaluated at 𝒘 + 𝛿_j 𝒆_j
   – 𝝋_j is updated whenever a new proposal is calculated for 𝑗
   – 𝝋 is not necessary if the algorithm accepts all proposals

Step 3: Accept
• In the Accept step, the algorithm accepts a subset 𝐽′ ⊆ 𝐽
   – [Bradley+ 11] show that correlations among features can
     lead to divergence if too many coordinates are updated at
     once (figure omitted)

• In CCD, SCD, and Shotgun, all proposals are accepted
   – No need to calculate 𝝋

Step 4: Update
• In the Update step, the algorithm updates the weights
  according to the accepted set 𝐽′
   – the vector 𝑿𝒘 is maintained incrementally
     (a skeleton of the full Select/Propose/Accept/Update loop follows below)

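Putting the four steps together, here is a minimal sketch of the GenCD loop, assembled from the Select/Propose/Accept/Update slides; the function names and signatures are illustrative assumptions, not the paper's interface.

```python
import numpy as np

def gencd(X, y, lam, select, propose, accept, n_rounds=50):
    """Skeleton of the GenCD loop: Select -> Propose -> Accept -> Update.
    The `select`, `propose`, and `accept` callbacks are what distinguish
    CCD, SCD, Shotgun, GREEDY, THREAD-GREEDY, and COLORING."""
    n, k = X.shape
    w = np.zeros(k)
    Xw = np.zeros(n)                              # maintained incrementally (Step 4)
    for t in range(n_rounds):
        J = select(t, k)                          # Step 1: candidate coordinates
        deltas = {j: propose(j, w, Xw, X, y, lam) for j in J}   # Step 2: proposals
        J_accepted = accept(deltas, w, Xw, X, y, lam)           # Step 3: accepted subset
        for j in J_accepted:                      # Step 4: apply accepted increments
            Xw += X[:, j] * deltas[j]
            w[j] += deltas[j]
    return w
```
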
Approximate Minimization (1/2)
• The Propose step calculates a proposed increment 𝛿_j for each 𝑗 ∈ 𝐽:

    \delta_j = \arg\min_{\delta} \; F(\mathbf{w} + \delta \mathbf{e}_j) + \lambda |w_j + \delta|,
    \qquad \text{where } F(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, (\mathbf{X}\mathbf{w})_i\big)

• For a general loss function, there is no closed-form solution
  along a given coordinate
   – Thus, consider approximate minimization

Approximate Minimization (2/2)
• Well-known approximate minimizer (e.g., [Yuan and Lin 10]):

    \delta = -\psi\left(w_j;\; \frac{\nabla_j F(\mathbf{w}) - \lambda}{\beta},\; \frac{\nabla_j F(\mathbf{w}) + \lambda}{\beta}\right)

    \text{where } \psi(x; a, b) =
      \begin{cases}
        a & \text{if } x < a \\
        b & \text{if } x > b \\
        x & \text{otherwise}
      \end{cases}

  For the squared loss \beta = 1; for the logistic loss \beta = 1/4.

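A minimal sketch of this proposal rule in code, assuming beta is the curvature constant quoted above (1 for the squared loss, 1/4 for the logistic loss) and grad_j is the already-computed coordinate gradient:

```python
import numpy as np

def psi(x, a, b):
    # clip x into the interval [a, b]
    return np.minimum(np.maximum(x, a), b)

def propose_delta(w_j, grad_j, lam, beta):
    """delta = -psi(w_j; (grad_j - lam)/beta, (grad_j + lam)/beta)"""
    a = (grad_j - lam) / beta
    b = (grad_j + lam) / beta
    return -psi(w_j, a, b)

# a zero weight with a small gradient (|grad| <= lam) proposes no move
print(propose_delta(w_j=0.0, grad_j=0.05, lam=0.1, beta=1.0))   # -0.0
```
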
Step 2: Propose (Approximated)

  (Algorithm figure omitted. The annotated quantity appears to be the coordinate
  gradient \nabla_j F(\mathbf{w}) = \frac{1}{n} \sum_i \ell'(y_i, z_i) X_{ij} with
  \mathbf{z} = \mathbf{X}\mathbf{w}, and the highlighted term is the decrease in
  the approximated objective.)

Experiments



Algorithms (conventional)
• SHOTGUN [Bradley+ 11]
   – Select step: a random subset of the columns
   – Accept step: accepts every proposal
       • No need to compute a proxy for the objective
   – convergence is guaranteed only if the number of coordinates selected
     is at most P^* = k / (2\rho) (*1)

• GREEDY
   – Select step: all coordinates
   – Propose step: each thread generates proposals for some subset
     of the coordinates using the approximation
   – Accept step: accepts only the single best proposal across all threads
     (a sketch of these Select/Accept rules follows below)

                             (*1) \rho is the spectral radius (largest eigenvalue) of \mathbf{X}^T \mathbf{X}

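For illustration, the SHOTGUN and GREEDY choices can be written as Select/Accept rules in the style of the GenCD skeleton above; this is a sketch with made-up names, using a dictionary phi of proxy objective values as in the Propose slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def shotgun_select(k, size):
    # SHOTGUN: a random subset of the columns
    return list(rng.choice(k, size=size, replace=False))

def shotgun_accept(J, phi):
    # SHOTGUN: accept every proposal (phi is not needed)
    return list(J)

def greedy_select(k):
    # GREEDY: consider all coordinates
    return list(range(k))

def greedy_accept(J, phi):
    # GREEDY: accept only the single best proposal,
    # where phi[j] is a proxy for the objective at w + delta_j * e_j
    return [min(J, key=lambda j: phi[j])]

# toy demonstration with fake proxy values
phi = {0: 1.2, 3: 0.7, 5: 0.9}
print(shotgun_accept(phi.keys(), phi))   # [0, 3, 5]
print(greedy_accept(phi.keys(), phi))    # [3]
```
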
Comparisons of the Algorithms




Algorithms (proposed)
• THREAD-GREEDY
   – Select step: a random set of coordinates (?)
   – Propose step: each thread generates proposals for some subset of the
     coordinates using the approximation
   – Accept step: each thread accepts the best of its own proposals
     (a sequential sketch of one round follows below)
   – no proof of convergence (however, empirical results are encouraging)

• COLORING
   – Preprocessing: structurally independent features are identified via
     partial distance-2 coloring
   – Select step: a random color is selected
   – Accept step: accepts every proposal
       • since features of the same color are structurally independent
         (their non-zero rows are disjoint)

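Here is a minimal, sequential sketch of one THREAD-GREEDY round, assuming the squared loss and the proposal rule from the Approximate Minimization slide; the "threads" are simulated as blocks of coordinates, and ranking proposals by the decrease of the quadratic surrogate is my own choice of proxy, not necessarily the paper's.

```python
import numpy as np

def thread_greedy_round(w, Xw, X, y, lam, beta=1.0, n_threads=4):
    """One THREAD-GREEDY round: coordinates are partitioned across
    `n_threads` blocks; each block proposes increments for its coordinates
    and accepts only its single best proposal."""
    n, k = X.shape
    grad = X.T @ (Xw - y) / n                         # gradient of the squared loss
    for block in np.array_split(np.arange(k), n_threads):
        best_j, best_dec, best_delta = None, 0.0, 0.0
        for j in block:
            a = (grad[j] - lam) / beta
            b = (grad[j] + lam) / beta
            delta = -np.clip(w[j], a, b)              # approximate 1-D minimizer
            # decrease of the quadratic surrogate plus the L1 term
            dec = -(grad[j] * delta + 0.5 * beta * delta ** 2
                    + lam * (abs(w[j] + delta) - abs(w[j])))
            if dec > best_dec:
                best_j, best_dec, best_delta = j, dec, delta
        if best_j is not None:                        # Update step for this "thread"
            Xw += X[:, best_j] * best_delta
            w[best_j] += best_delta
    return w, Xw
```
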
Implementation and Platform
• Implementation
  – gcc with OpenMP
     • -O3 -fopenmp flags
     • parallel for pragma
     • static scheduling
         – Given n iterations and p threads, each thread gets n/p iterations


• Platform
  – AMD Opteron (Magny-Cours)
     • with 48 cores (12 cores x 4 sockets)
  – 256GB Memory

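The static scheduling mentioned above simply gives each thread a contiguous block of roughly n/p iterations; below is a small sketch of that partitioning (an approximation of OpenMP's behavior, written in Python rather than the C/OpenMP used in the paper).

```python
def static_chunks(n, p):
    """Split n iterations into p contiguous chunks of roughly n/p each,
    mimicking OpenMP static scheduling without an explicit chunk size."""
    base, rem = divmod(n, p)
    chunks, start = [], 0
    for t in range(p):
        size = base + (1 if t < rem else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

print([list(c) for c in static_chunks(10, 4)])
# [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```
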
Datasets

  (Datasets table omitted; NNZ = number of non-zero entries)

Convergence rates

  (Convergence-rate plots omitted)

  Presenter's note: "I don't know why"

Scalability




Summary
• Presented GenCD, a generic framework for
  expressing parallel coordinate descent
  – Select, Propose, Accept, Update

• Performed convergence and scalability tests for the
  four algorithms
  – but the authors do not favor any of these algorithms
    over the others

• The condition for convergence of the THREAD-GREEDY
  algorithm is an open question

References
• [Yuan and Lin 10] G. Yuan, C. Lin, "A Comparison of Optimization Methods
  and Software for Large-scale L1-regularized Linear Classification", Journal
  of Machine Learning Research, vol. 11, pp. 3183-3234, 2010.
• [Bradley+ 11] J. K. Bradley, A. Kyrola, D. Bickson, C. Guestrin, "Parallel
  Coordinate Descent for L1-Regularized Loss Minimization", In Proc. ICML '11, 2011.

The End
