Introduction of XGBoost
XGBoost
XGBoost is a machine learning method based on the boosting algorithm.
This method uses many decision trees.
We do not explain decision trees here; please refer to the following URL:
https://en.wikipedia.org/wiki/Decision_tree
What is “Boosting”?
Boosting algorithm(1)
For a given dataset with $n$ examples and $m$ features, the explanatory variables $\boldsymbol{x}_i$ are defined as:
$$\boldsymbol{x}_i = (x_{i1}, x_{i2}, \cdots, x_{im})$$
Define the $i$-th objective variable $y_i,\ i = 1, 2, \cdots, n$.
Boosting algorithm(2)
Define the output of the $t$-th decision tree: $\hat{y}_i^{(t)}$
The error $\epsilon_i^{(1)}$ between the first decision tree's output and the objective variable $y_i$ is the following:
$$\epsilon_i^{(1)} = y_i - \hat{y}_i^{(1)}$$
Boosting algorithm(3)
The second decision tree predicts $\epsilon_i^{(1)}$.
Define the second decision tree's output as $\hat{\epsilon}_i^{(2)}$.
The predicted value $\hat{y}_i^{(2)}$ is the following:
$$\hat{y}_i^{(2)} = \hat{y}_i^{(1)} + \hat{\epsilon}_i^{(2)}$$
Boosting algorithm(4)
The error $\epsilon_i^{(2)}$ between the predicted value using two decision trees and the objective variable is the following:
$$\epsilon_i^{(2)} = y_i - \hat{y}_i^{(2)} = y_i - \left( \hat{y}_i^{(1)} + \hat{\epsilon}_i^{(2)} \right)$$
Boosting algorithm(5)
The third decision tree predicts $\epsilon_i^{(2)}$; define its output as $\hat{\epsilon}_i^{(3)}$.
The predicted value $\hat{y}_i^{(3)}$ using three decision trees is:
$$\hat{y}_i^{(3)} = \hat{y}_i^{(2)} + \hat{\epsilon}_i^{(3)} = \hat{y}_i^{(1)} + \hat{\epsilon}_i^{(2)} + \hat{\epsilon}_i^{(3)}$$
Boosting algorithm(6)
Constructing a new model using the information of the models learned so far: this is “Boosting”.
* Note: using the error as the objective variable is not, by itself, what defines the boosting algorithm.
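As a concrete illustration of this residual-fitting loop, here is a minimal Python sketch using scikit-learn's DecisionTreeRegressor as the base learner. The toy data, tree depth, and number of rounds are illustrative assumptions, not part of the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=200)

n_rounds = 3                 # three trees, as in the slides
trees = []
y_pred = np.zeros_like(y)    # \hat{y}^{(0)} = 0

for t in range(n_rounds):
    residual = y - y_pred               # epsilon^{(t)} = y - \hat{y}^{(t)}
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)               # the next tree predicts the current residual
    y_pred = y_pred + tree.predict(X)   # \hat{y}^{(t+1)} = \hat{y}^{(t)} + \hat{\epsilon}^{(t+1)}
    trees.append(tree)

print("mean squared error:", np.mean((y - y_pred) ** 2))
```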
What is “XGBoost”?
XGBoost
XGBoost has been shown to give state-of-the-art results on many standard classification benchmarks.
More than half of the winning solutions in Kaggle competitions use XGBoost.
XGBoost algorithm(1)
Define the $k$-th decision tree: $f_k$
The predicted value when boosting $K$ times is as follows:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
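This additive model over $K$ trees is what the xgboost library fits. A minimal sketch of its scikit-learn-style interface follows, assuming the xgboost package is installed; all parameter values are illustrative.

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K boosting rounds -> n_estimators trees f_1, ..., f_K
model = xgb.XGBRegressor(
    n_estimators=100,      # K
    max_depth=3,
    learning_rate=0.1,
    reg_lambda=1.0,        # the lambda of the penalty term (see later slides)
    gamma=0.0,             # the gamma of the penalty term
)
model.fit(X_train, y_train)
y_hat = model.predict(X_test)   # \hat{y}_i = sum_k f_k(x_i)
```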
XGBoost algorithm(2)
Define the loss function $l(y_i, \hat{y}_i)$.
Our purpose is to minimize the following objective:
$$\mathcal{L}(\phi) = \sum_{i=1}^{I} l(y_i, \hat{y}_i)$$
XGBoost algorithm(3)
Transformation of the formula:
$$\min_{f_t} \mathcal{L}(f_t) = \min_{f_t} \sum_{i=1}^{I} l\left(y_i, \hat{y}_i^{(t)}\right)
= \min_{f_t} \sum_{i=1}^{I} l\left(y_i, \sum_{k=1}^{t} f_k(x_i)\right)
= \min_{f_t} \sum_{i=1}^{I} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right)$$
XGBoost algorithm(4)
Define the penalty (regularization) function:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \| \boldsymbol{w} \|^2$$
$\gamma$ and $\lambda$ are hyperparameters.
$T$ is the number of leaves of the tree.
$\boldsymbol{w}$ is the vector of leaf weights.
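Transcribed directly into Python, the penalty of a single tree could look like this; the example weights and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def omega(w, gamma, lam):
    """Penalty of one tree: gamma * T + 0.5 * lambda * ||w||^2,
    where w is the vector of leaf weights and T = len(w) leaves."""
    T = len(w)
    return gamma * T + 0.5 * lam * np.sum(np.asarray(w) ** 2)

print(omega([0.4, -0.2, 0.1], gamma=1.0, lam=1.0))  # 3 leaves -> 3.105
```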
XGBoost algorithm(5)
Adding the penalty function to the loss helps smooth the final learnt weights and avoid over-fitting.
So, our new purpose is to minimize the following objective function $\mathcal{L}(\phi)$:
$$\mathcal{L}(\phi) = \sum_{i=1}^{I} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
XGBoost algorithm(6)
Minimizing $\mathcal{L}(\phi)$ is the same as minimizing each $\mathcal{L}^{(t)}$ in turn:
$$\min_{f_t} \mathcal{L}^{(t)} = \min_{f_t} \left[ \sum_{i=1}^{I} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \right]$$
XGBoost algorithm(7)
A second-order approximation can be used to quickly optimize the objective:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{I} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
Taylor expansion:
$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{I} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
where
$$g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right), \qquad h_i = \partial_{\hat{y}^{(t-1)}}^{2} l\left(y_i, \hat{y}^{(t-1)}\right)$$
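For a concrete loss these derivatives are simple to write down. Below is a sketch for the squared-error loss $l(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$, which gives $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the choice of loss and the sample values are illustrative assumptions.

```python
import numpy as np

def grad_hess_squared_error(y, y_pred):
    """First and second derivatives of l = 0.5 * (y - y_pred)^2
    with respect to y_pred: g = y_pred - y, h = 1."""
    g = y_pred - y
    h = np.ones_like(y)
    return g, h

y = np.array([1.0, 0.0, 2.0])
y_pred = np.array([0.8, 0.3, 1.5])   # \hat{y}^{(t-1)}
g, h = grad_hess_squared_error(y, y_pred)
print(g, h)   # [-0.2  0.3 -0.5] [1. 1. 1.]
```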
XGBoost algorithm(8)
We can remove the constant terms to obtain the following simplified objective at step $t$:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{I} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
Dropping the constant term $l\left(y_i, \hat{y}_i^{(t-1)}\right)$:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{I} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
XGBoost algorithm(9)
Define $I_j$ as the instance set of leaf $j$.
[Figure: a decision tree with four leaves (Leaf 1 to Leaf 4); the starred data points $x_i$ that fall into Leaf 4 form the instance set $I_4$.]
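A sketch of collecting the instance sets $I_j$ in Python, assuming we already have an array that maps each instance to its leaf; the assignments below are illustrative.

```python
from collections import defaultdict

# leaf[i] = index of the leaf that instance i falls into (illustrative)
leaf = [1, 4, 2, 4, 3, 1, 4]

I = defaultdict(set)        # I[j] = instance set of leaf j
for i, j in enumerate(leaf):
    I[j].add(i)

print(sorted(I[4]))         # instances in leaf 4, i.e. I_4 -> [1, 3, 6]
```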
XGBoost algorithm(11)
Transformation of the formula:
$$\min_{f_t} \mathcal{L}^{(t)} = \min_{f_t} \sum_{i=1}^{I} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
$$= \min_{f_t} \sum_{i=1}^{I} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \| \boldsymbol{w} \|^2$$
$$= \min_{f_t} \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$
The last line is a quadratic function of each $w_j$.
XGBoost algorithm(12)
We can solve the quadratic function $\mathcal{L}^{(t)}$ for $w_j$ by setting $\frac{d\mathcal{L}^{(t)}}{dw_j} = 0$:
$$w_j = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
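The optimal leaf weight transcribes directly into Python; the gradient and hessian values below are illustrative assumptions.

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal weight of one leaf: w_j = -sum(g_i) / (sum(h_i) + lambda),
    where g, h are the gradients/hessians of the instances in I_j."""
    return -np.sum(g) / (np.sum(h) + lam)

g = np.array([-0.2, 0.3, -0.5])   # gradients of the instances in leaf j
h = np.array([1.0, 1.0, 1.0])     # hessians of the same instances
print(leaf_weight(g, h, lam=1.0)) # 0.4 / 4.0 = 0.1
```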
XGBoost algorithm(13)
Remember that $g_i$ and $h_i$ are the derivatives of the loss function, and that they can be calculated from the output of the $(t-1)$-th tree and $y_i$:
$$g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right), \qquad h_i = \partial_{\hat{y}^{(t-1)}}^{2} l\left(y_i, \hat{y}^{(t-1)}\right)$$
So we can calculate $w_j$ and minimize $\mathcal{L}(\phi)$.
How to split the node
How to split the node
So far, I have explained how to minimize the loss.
One of the key problems in tree learning is to find the best split.
XGBoost algorithm(14)
Substitute $w_j$ into $\mathcal{L}^{(t)}$:
$$\mathcal{L}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$
In this equation, the bigger $\frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda}$ becomes in each node, the smaller $\mathcal{L}^{(t)}$ becomes.
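This "structure score" can be sketched as a Python function over per-leaf gradient statistics; the statistics and hyperparameter values below are illustrative assumptions.

```python
import numpy as np

def structure_score(G, H, lam, gamma):
    """Evaluate L^(t) for a fixed tree structure:
    -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T,
    where G_j = sum of g_i and H_j = sum of h_i over leaf j."""
    G = np.asarray(G)
    H = np.asarray(H)
    T = len(G)
    return -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * T

# Score a candidate structure with two leaves (illustrative statistics)
print(structure_score(G=[-0.4, 0.6], H=[3.0, 2.0], lam=1.0, gamma=1.0))
```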
XGBoost algorithm(15)
Compare $\frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda}$ before and after splitting a node.
Define the objective function before the split: $\mathcal{L}_{before}^{(t)}$
Define the objective function after the split: $\mathcal{L}_{after}^{(t)}$
XGBoost algorithm(16)
$$\mathcal{L}_{before}^{(t)} = -\frac{1}{2} \sum_{j \neq s}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} - \frac{1}{2} \cdot \frac{\left(\sum_{i \in I_s} g_i\right)^2}{\sum_{i \in I_s} h_i + \lambda} + \gamma T$$
$$\mathcal{L}_{after}^{(t)} = -\frac{1}{2} \sum_{j \neq s}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} - \frac{1}{2} \cdot \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} - \frac{1}{2} \cdot \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} + \gamma (T + 1)$$
Here $s$ is the leaf being split; the split replaces $I_s$ with two children $I_L$ and $I_R$, so the number of leaves grows from $T$ to $T + 1$.
[Figure: before the split, leaf $s$ holds the instance set $I_s$; after the split, its instances are divided into a left child $I_L$ and a right child $I_R$.]
XGBoost algorithm(17)
By choosing the split that maximizes $\mathcal{L}_{before}^{(t)} - \mathcal{L}_{after}^{(t)}$, we minimize $\mathcal{L}^{(t)}$.
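Since the terms for the unsplit leaves cancel, $\mathcal{L}_{before}^{(t)} - \mathcal{L}_{after}^{(t)}$ reduces to the split-gain formula used in the XGBoost paper, sketched below; the gradient statistics and the candidate split are illustrative assumptions.

```python
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """L_before - L_after for splitting one node into left/right children:
    0.5 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma.
    The -gamma term accounts for the one extra leaf created by the split."""
    G_L, H_L = np.sum(g_left), np.sum(h_left)
    G_R, H_R = np.sum(g_right), np.sum(h_right)
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

g = np.array([-0.5, -0.3, 0.4, 0.6])   # gradients of the instances in node s
h = np.ones(4)                          # hessians (squared-error loss)
# Candidate split: first two instances go left, last two go right
print(split_gain(g[:2], h[:2], g[2:], h[2:], lam=1.0, gamma=0.0))
```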


Editor's Notes

  • #3 Gini coefficient
  • #15 Write the last equation on the whiteboard
  • #16 Write the definition of Ω on the whiteboard
  • #18 Write the last equation on the whiteboard
  • #19 Prove this
  • #20 Write the last equation on the whiteboard
  • #21 The star marks are the data points x_i; in this example, I_4 is this instance set
  • #23 Differentiate; write the last step on the whiteboard
  • #24 Ask the audience to recall this
  • #27 The calculation substituting w