Introduction of XGBoost
XGBoost
XGBoost is a machine learning method based on the boosting algorithm.
This method uses many decision trees.
We do not explain decision trees here; please refer to the following URL:
https://en.wikipedia.org/wiki/Decision_tree
What is “Boosting”?
Boosting algorithm(1)
For a given dataset with $n$ examples and $m$ features, the explanatory variables $\boldsymbol{x}_i$ are defined as:
$$\boldsymbol{x}_i = (x_{i1}, x_{i2}, \cdots, x_{im})$$
Define the $i$-th objective variable $y_i,\ i = 1, 2, \cdots, n$.
Boosting algorithm(2)
Define the output of the $t$-th decision tree: $\hat{y}_i^{(t)}$
The error $\epsilon_i^{(1)}$ between the first decision tree's output and the objective variable $y_i$ is the following:
$$\epsilon_i^{(1)} = y_i - \hat{y}_i^{(1)}$$
Boosting algorithm(3)
The second decision tree predicts $\epsilon_i^{(1)}$.
Define the second decision tree's output as $\hat{\epsilon}_i^{(2)}$.
The predicted value $\hat{y}_i^{(2)}$ is the following:
$$\hat{y}_i^{(2)} = \hat{y}_i^{(1)} + \hat{\epsilon}_i^{(2)}$$
Boosting algorithm(4)
The error $\epsilon_i^{(2)}$ between the predicted value using two decision trees and the objective variable is the following:
$$\epsilon_i^{(2)} = y_i - \hat{y}_i^{(2)} = y_i - \left( \hat{y}_i^{(1)} + \hat{\epsilon}_i^{(2)} \right)$$
Boosting algorithm(5)
The third decision tree predicts $\epsilon_i^{(2)}$; define its output as $\hat{\epsilon}_i^{(3)}$.
The predicted value $\hat{y}_i^{(3)}$ using three decision trees is:
$$\hat{y}_i^{(3)} = \hat{y}_i^{(2)} + \hat{\epsilon}_i^{(3)} = \hat{y}_i^{(1)} + \hat{\epsilon}_i^{(2)} + \hat{\epsilon}_i^{(3)}$$
Boosting algorithm(6)
Constructing a new model using the information of the models learned so far: this is “Boosting”.
* Note: using the error as the objective variable is not, by itself, what defines the boosting algorithm.
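As a concrete illustration of this residual-fitting loop, here is a minimal Python sketch using scikit-learn's DecisionTreeRegressor as the base learner. The toy data, tree depth, and number of rounds are illustrative assumptions, not part of the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=200)

n_rounds = 3                 # three trees, as in the slides
trees = []
y_pred = np.zeros_like(y)    # \hat{y}^{(0)} = 0

for t in range(n_rounds):
    residual = y - y_pred               # epsilon^{(t)} = y - \hat{y}^{(t)}
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)               # the next tree predicts the current residual
    y_pred = y_pred + tree.predict(X)   # \hat{y}^{(t+1)} = \hat{y}^{(t)} + \hat{\epsilon}^{(t+1)}
    trees.append(tree)

print("mean squared error:", np.mean((y - y_pred) ** 2))
```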
What is “XGBoost”?
XGBoost
XGBoost has been shown to give state-of-the-art results on many standard classification benchmarks.
More than half of the winning solutions in Kaggle competitions use XGBoost.
XGBoost algorithm(1)
Define the $k$-th decision tree: $f_k$
The predicted value when boosting $K$ times is as follows:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
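This additive model over $K$ trees is what the xgboost library fits. A minimal sketch of its scikit-learn-style interface follows, assuming the xgboost package is installed; all parameter values are illustrative.

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K boosting rounds -> n_estimators trees f_1, ..., f_K
model = xgb.XGBRegressor(
    n_estimators=100,      # K
    max_depth=3,
    learning_rate=0.1,
    reg_lambda=1.0,        # the lambda of the penalty term (see later slides)
    gamma=0.0,             # the gamma of the penalty term
)
model.fit(X_train, y_train)
y_hat = model.predict(X_test)   # \hat{y}_i = sum_k f_k(x_i)
```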
XGBoost algorithm(2)
Define the loss function $l(y_i, \hat{y}_i)$.
Our purpose is to minimize the following objective:
$$\mathcal{L}(\phi) = \sum_{i=1}^{I} l(y_i, \hat{y}_i)$$
XGBoost algorithm(3)
Transformation of the formula:
$$\min_{f_t} \mathcal{L}(f_t) = \min_{f_t} \sum_{i=1}^{I} l\left(y_i, \hat{y}_i^{(t)}\right)
= \min_{f_t} \sum_{i=1}^{I} l\left(y_i, \sum_{k=1}^{t} f_k(x_i)\right)
= \min_{f_t} \sum_{i=1}^{I} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right)$$
XGBoost algorithm(4)
Define the penalty (regularization) function:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \| \boldsymbol{w} \|^2$$
$\gamma$ and $\lambda$ are hyperparameters.
$T$ is the number of leaves of the tree.
$\boldsymbol{w}$ is the vector of leaf weights.
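Transcribed directly into Python, the penalty of a single tree could look like this; the example weights and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def omega(w, gamma, lam):
    """Penalty of one tree: gamma * T + 0.5 * lambda * ||w||^2,
    where w is the vector of leaf weights and T = len(w) leaves."""
    T = len(w)
    return gamma * T + 0.5 * lam * np.sum(np.asarray(w) ** 2)

print(omega([0.4, -0.2, 0.1], gamma=1.0, lam=1.0))  # 3 leaves -> 3.105
```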
XGBoost algorithm(5)
Adding the penalty function to the loss helps smooth the final learnt weights and avoid over-fitting.
So, our new purpose is to minimize the following objective function $\mathcal{L}(\phi)$:
$$\mathcal{L}(\phi) = \sum_{i=1}^{I} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
XGBoost algorithm(6)
Minimizing $\mathcal{L}(\phi)$ is the same as minimizing each $\mathcal{L}^{(t)}$ in turn:
$$\min_{f_t} \mathcal{L}^{(t)} = \min_{f_t} \left[ \sum_{i=1}^{I} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \right]$$
XGBoost algorithm(7)
A second-order approximation can be used to quickly optimize the objective:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{I} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
Taylor expansion:
$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{I} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
where
$$g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right), \qquad h_i = \partial_{\hat{y}^{(t-1)}}^{2} l\left(y_i, \hat{y}^{(t-1)}\right)$$
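For a concrete loss these derivatives are simple to write down. Below is a sketch for the squared-error loss $l(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$, which gives $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the choice of loss and the sample values are illustrative assumptions.

```python
import numpy as np

def grad_hess_squared_error(y, y_pred):
    """First and second derivatives of l = 0.5 * (y - y_pred)^2
    with respect to y_pred: g = y_pred - y, h = 1."""
    g = y_pred - y
    h = np.ones_like(y)
    return g, h

y = np.array([1.0, 0.0, 2.0])
y_pred = np.array([0.8, 0.3, 1.5])   # \hat{y}^{(t-1)}
g, h = grad_hess_squared_error(y, y_pred)
print(g, h)   # [-0.2  0.3 -0.5] [1. 1. 1.]
```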
XGBoost algorithm(8)
We can remove the constant terms to obtain the following simplified objective at step $t$:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{I} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
Dropping the constant term $l\left(y_i, \hat{y}_i^{(t-1)}\right)$:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{I} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
XGBoost algorithm(9)
Define $I_j$ as the instance set of leaf $j$.
[Figure: a decision tree with four leaves (Leaf 1 to Leaf 4); the starred data points $x_i$ that fall into Leaf 4 form the instance set $I_4$.]
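A sketch of collecting the instance sets $I_j$ in Python, assuming we already have an array that maps each instance to its leaf; the assignments below are illustrative.

```python
from collections import defaultdict

# leaf[i] = index of the leaf that instance i falls into (illustrative)
leaf = [1, 4, 2, 4, 3, 1, 4]

I = defaultdict(set)        # I[j] = instance set of leaf j
for i, j in enumerate(leaf):
    I[j].add(i)

print(sorted(I[4]))         # instances in leaf 4, i.e. I_4 -> [1, 3, 6]
```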
XGBoost algorithm(11)
Transformation of the formula:
$$\min_{f_t} \mathcal{L}^{(t)} = \min_{f_t} \sum_{i=1}^{I} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
$$= \min_{f_t} \sum_{i=1}^{I} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \| \boldsymbol{w} \|^2$$
$$= \min_{f_t} \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$
The last line is a quadratic function of each $w_j$.
XGBoost algorithm(12)
We can solve the quadratic function $\mathcal{L}^{(t)}$ for $w_j$ by setting $\frac{d\mathcal{L}^{(t)}}{dw_j} = 0$:
$$w_j = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
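The optimal leaf weight transcribes directly into Python; the gradient and hessian values below are illustrative assumptions.

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal weight of one leaf: w_j = -sum(g_i) / (sum(h_i) + lambda),
    where g, h are the gradients/hessians of the instances in I_j."""
    return -np.sum(g) / (np.sum(h) + lam)

g = np.array([-0.2, 0.3, -0.5])   # gradients of the instances in leaf j
h = np.array([1.0, 1.0, 1.0])     # hessians of the same instances
print(leaf_weight(g, h, lam=1.0)) # 0.4 / 4.0 = 0.1
```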
XGBoost algorithm(13)
Remember that $g_i$ and $h_i$ are the derivatives of the loss function, and that they can be calculated from the output of the $(t-1)$-th tree and $y_i$:
$$g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right), \qquad h_i = \partial_{\hat{y}^{(t-1)}}^{2} l\left(y_i, \hat{y}^{(t-1)}\right)$$
So we can calculate $w_j$ and minimize $\mathcal{L}(\phi)$.
How to split the node
How to split the node
So far, I have explained how to minimize the loss.
One of the key problems in tree learning is to find the best split.
XGBoost algorithm(14)
Substitute $w_j$ into $\mathcal{L}^{(t)}$:
$$\mathcal{L}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$
In this equation, the bigger $\frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda}$ becomes in each node, the smaller $\mathcal{L}^{(t)}$ becomes.
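This "structure score" can be sketched as a Python function over per-leaf gradient statistics; the statistics and hyperparameter values below are illustrative assumptions.

```python
import numpy as np

def structure_score(G, H, lam, gamma):
    """Evaluate L^(t) for a fixed tree structure:
    -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T,
    where G_j = sum of g_i and H_j = sum of h_i over leaf j."""
    G = np.asarray(G)
    H = np.asarray(H)
    T = len(G)
    return -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * T

# Score a candidate structure with two leaves (illustrative statistics)
print(structure_score(G=[-0.4, 0.6], H=[3.0, 2.0], lam=1.0, gamma=1.0))
```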
XGBoost algorithm(15)
Compare $\frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda}$ before and after splitting a node.
Define the objective function before the split: $\mathcal{L}_{before}^{(t)}$
Define the objective function after the split: $\mathcal{L}_{after}^{(t)}$
XGBoost algorithm(16)
$$\mathcal{L}_{before}^{(t)} = -\frac{1}{2} \sum_{j \neq s}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} - \frac{1}{2} \cdot \frac{\left(\sum_{i \in I_s} g_i\right)^2}{\sum_{i \in I_s} h_i + \lambda} + \gamma T$$
$$\mathcal{L}_{after}^{(t)} = -\frac{1}{2} \sum_{j \neq s}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} - \frac{1}{2} \cdot \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} - \frac{1}{2} \cdot \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} + \gamma (T + 1)$$
Here $s$ is the leaf being split; the split replaces $I_s$ with two children $I_L$ and $I_R$, so the number of leaves grows from $T$ to $T + 1$.
[Figure: before the split, leaf $s$ holds the instance set $I_s$; after the split, its instances are divided into a left child $I_L$ and a right child $I_R$.]
XGBoost algorithm(17)
By choosing the split that maximizes $\mathcal{L}_{before}^{(t)} - \mathcal{L}_{after}^{(t)}$, we minimize $\mathcal{L}^{(t)}$.
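Since the terms for the unsplit leaves cancel, $\mathcal{L}_{before}^{(t)} - \mathcal{L}_{after}^{(t)}$ reduces to the split-gain formula used in the XGBoost paper, sketched below; the gradient statistics and the candidate split are illustrative assumptions.

```python
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """L_before - L_after for splitting one node into left/right children:
    0.5 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma.
    The -gamma term accounts for the one extra leaf created by the split."""
    G_L, H_L = np.sum(g_left), np.sum(h_left)
    G_R, H_R = np.sum(g_right), np.sum(h_right)
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

g = np.array([-0.5, -0.3, 0.4, 0.6])   # gradients of the instances in node s
h = np.ones(4)                          # hessians (squared-error loss)
# Candidate split: first two instances go left, last two go right
print(split_gain(g[:2], h[:2], g[2:], h[2:], lam=1.0, gamma=0.0))
```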


Editor's Notes

  • #3 Gini coefficient
  • #15 Write the last equation on the whiteboard
  • #16 Write the definition of Ω on the whiteboard
  • #18 Write the last equation on the whiteboard
  • #19 Prove this
  • #20 Write the last equation on the whiteboard
  • #21 The star marks are the data points x_i; in this example, I_4 is this instance set
  • #23 Differentiate; write the last step on the whiteboard
  • #24 Ask the audience to recall this
  • #27 The calculation substituting w