Decision Tree
HANSAM CHO
GROOT SEMINAR
Definition
Decision tree learning is a method commonly used in data mining. The goal is to
create a model that predicts the value of a target variable based on several input
variables.
Issues
1. How to split the training records → Impurity measure, Algorithm
2. When to stop splitting → Stopping condition, Pruning
Impurity Measure
• A measure of how good the result of a split is (homogeneity of the resulting nodes)
• Misclassification error
• Gini impurity
• Information gain
• Variance reduction
Misclassification error
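The formula on this slide was only an image; as a standard reference (my addition, not taken from the slide), for a node t with class proportions p_k the misclassification error is

\[ \mathrm{Error}(t) \;=\; 1 - \max_k \, p_k . \]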
Gini impurity
• Used by the CART (classification and regression tree) algorithm for classification trees
• Gini impurity is a measure of how often a randomly chosen element from the set
would be incorrectly labeled if it were randomly labeled according to the distribution
of labels in the subset.
(the probability that the chosen element belongs to a particular class) × (the probability that it is then mislabeled)
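The Gini formula slide itself was an image; the standard definition (my addition, consistent with the description above) for class proportions p_k at a node is

\[ I_G(p) \;=\; \sum_{k} p_k\,(1 - p_k) \;=\; 1 - \sum_{k} p_k^{2}, \]

i.e. the sum over classes of (probability of drawing class k) × (probability of then mislabeling it).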
Information gain
• Used by the ID3, C4.5 and C5.0 algorithms
• Information (the lower the probability of an event, the more information it carries / e.g., winning the lottery)
• Entropy (expectation of information) / Deviance
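The entropy and information-gain formulas were shown as images; for reference (standard definitions, not copied from the slides), with p_k the class proportions in S and S_v the subset where attribute A takes value v:

\[ \mathrm{H}(S) = -\sum_k p_k \log_2 p_k, \qquad \mathrm{IG}(S, A) = \mathrm{H}(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\,\mathrm{H}(S_v). \]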
Variance reduction
• Introduced in CART, variance reduction is often employed in cases where the target
variable is continuous (regression tree)
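As a standard reference (my addition), the variance-reduction criterion compares the variance of the target values in the parent node with the weighted variance of the children:

\[ \Delta = \mathrm{Var}(S) - \sum_{c} \frac{|S_c|}{|S|}\,\mathrm{Var}(S_c), \qquad \mathrm{Var}(S) = \frac{1}{|S|}\sum_{i \in S} \bigl(y_i - \bar{y}_S\bigr)^2 . \]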
Algorithm
• How good is the result of a split? – Impurity measure
• How do we actually split? – Algorithm
• ID3
• C 4.5
• C 5.0
• CART
ID3 - algorithm
• Calculate the entropy of every attribute a of the data set S.
• Partition ("split") the set S into subsets using the attribute for which the resulting
entropy after splitting is minimized; or, equivalently, information gain is maximum
• Make a decision tree node containing that attribute.
• Recurse on subsets using the remaining attributes.
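A minimal sketch of this split-selection step in Python (my own illustration; `entropy`, `information_gain`, and `best_attribute` are illustrative helper names, and X is a NumPy array of categorical attribute columns):

```python
import numpy as np

def entropy(labels):
    # H(S) = -sum_k p_k log2 p_k over the class proportions of this subset
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    # Gain = H(S) minus the weighted entropy of the subsets after splitting
    gain = entropy(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

def best_attribute(X, y, attributes):
    # ID3's choice: the attribute whose split maximizes information gain
    return max(attributes, key=lambda a: information_gain(X[:, a], y))
```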
ID3 - example
ID3 – stopping condition
• Every element in the subset belongs to the same class; in which case the node is
turned into a leaf node and labelled with the class of the examples.
• There are no more attributes to be selected, but the examples still do not belong to
the same class. In this case, the node is made a leaf node and labelled with the most
common class of the examples in the subset.
• There are no examples in the subset, which happens when no example in the parent
set was found to match a specific value of the selected attribute. An example could be
the absence of a person among the population with age over 100 years. Then a leaf
node is created and labelled with the most common class of the examples in the
parent node's set.
C 4.5 – Information gain ratio
• A notable problem occurs when information gain is applied to attributes that can take
on a large number of distinct values (e.g., a customer-ID attribute → overfitting).
• Information gain ratio
• Intrinsic value (a penalty for splitting into many branches: the entropy of the split itself)
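The gain-ratio formulas were image-only; the standard C4.5 definitions (my addition) are

\[ \mathrm{GainRatio}(S, A) = \frac{\mathrm{IG}(S, A)}{\mathrm{IV}(S, A)}, \qquad \mathrm{IV}(S, A) = -\sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}, \]

where the intrinsic value IV is exactly the "entropy of the split itself" mentioned above.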
C 4.5 – Improvements from the ID3 algorithm
C 4.5 – Handling continuous attributes (mid-point)
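A small sketch of the usual mid-point treatment (my own illustration, not from the slide): sort the observed values of the continuous attribute and take the mid-points between consecutive distinct values as candidate thresholds.

```python
import numpy as np

def candidate_thresholds(values):
    """Mid-points between consecutive distinct values of a continuous attribute."""
    v = np.unique(values)            # sorted distinct values
    return (v[:-1] + v[1:]) / 2.0    # mid-point of each adjacent pair

# candidate_thresholds([2.0, 3.0, 3.0, 5.0]) -> array([2.5, 4.0])
```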
Pruning
• Pre-pruning / Post-pruning
• Reduced error pruning
• Prune a subtree if removing it makes no difference in performance
• Cost complexity pruning
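For cost-complexity pruning, a sketch using scikit-learn's built-in pruning path (assuming scikit-learn is installed; the iris data is used only to make the example self-contained):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Post-pruning: compute the cost-complexity path on the training data,
# then keep the alpha that scores best on held-out data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = clf.score(X_val, y_val)   # validation accuracy
    if score > best_score:
        best_alpha, best_score = alpha, score
```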
C 5.0
CART
• The basic concept is similar to C 4.5
• Supports regression as well as classification
• Binary split
• Choose attribute recursively
• Classification – Gini impurity / Regression – Variance reduction
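scikit-learn's decision trees are CART-style (binary splits, Gini for classification, variance/squared-error reduction for regression); a minimal sketch, assuming scikit-learn ≥ 1.0 for the criterion name:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: binary splits chosen by Gini impurity
clf = DecisionTreeClassifier(criterion="gini", max_depth=3)

# Regression tree: binary splits chosen by variance (squared-error) reduction
# (the criterion is named "mse" in scikit-learn versions before 1.0)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3)
```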
Ensemble
Bootstrap Aggregation (Bagging)
◦ Random Forest
Boosting
◦ AdaBoost
◦ Gradient Boosting
Bootstrap Aggregation (Bagging)
• Bootstrapping
• Given a standard training set D of size n, bagging generates m new training sets Di, each of
size n′, by sampling from D uniformly and with replacement. By sampling with replacement,
some observations may be repeated in each Di. If n′=n, then for large n the set Di is
expected to have the fraction (1 - 1/e) (≈63.2%) of the unique examples of D, the rest being
duplicates.
• Aggregation
• This kind of sample is known as a bootstrap sample. Then, m models are fitted using the
above m bootstrap samples and combined by averaging the output (for regression) or
voting (for classification)
\[ \lim_{n \to \infty} \left[ 1 - \left(1 - \frac{1}{n}\right)^{n} \right] \;=\; 1 - \frac{1}{e} \;\approx\; 0.632 \]
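A compact sketch of bagging as described above (my own illustration; X, y are NumPy arrays with integer class labels, and scikit-learn trees serve as base learners):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, m=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)   # bootstrap sample: n draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Aggregate by majority vote across the m fitted trees (integer labels assumed)
    preds = np.stack([m.predict(X) for m in models])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```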
Random Forest – random subspace
• Random forests differ in only one way from this general scheme: they use a modified
tree learning algorithm that selects, at each candidate split in the learning process, a
random subset of the features. This process is sometimes called "feature bagging".
• The reason for doing this is the correlation of the trees in an ordinary bootstrap
sample: if one or a few features are very strong predictors for the response variable
(target output), these features will be selected in many of the B trees, causing them to
become correlated.
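In scikit-learn terms this random subset is controlled by `max_features` (a sketch, assuming scikit-learn; the parameter values are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# At every candidate split each tree looks at only a random subset of features
# ("feature bagging"); "sqrt" means sqrt(n_features), a common choice for classification.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
```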
Extra-Trees
• Its two main differences with other tree based ensemble methods are that it splits
nodes by choosing cut-points fully at random and that it uses the whole learning
sample (rather than a bootstrap replica) to grow the trees.
• n_min: the minimum sample size for splitting a node
• Reduces computational cost while maintaining comparable performance
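The scikit-learn counterpart, for comparison (a sketch; parameter values are illustrative):

```python
from sklearn.ensemble import ExtraTreesClassifier

# Extra-Trees: random cut-points and, by default, no bootstrap resampling
# (bootstrap=False -> each tree is grown on the whole learning sample);
# min_samples_split plays the role of n_min above.
et = ExtraTreesClassifier(n_estimators=500, bootstrap=False, min_samples_split=2, random_state=0)
```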
Random Forest – Variable importance
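The importance plot on this slide was an image; two common ways to obtain variable importances in scikit-learn (impurity-based and permutation-based) are sketched below, using the iris data only to keep the example self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

print(rf.feature_importances_)                        # impurity-based (mean decrease in Gini)
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean)                          # permutation importance (mean score drop)
```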
Boosting
Boosting algorithms consist of iteratively learning weak classifiers with respect to a
distribution and adding them to a final strong classifier. When they are added, they
are typically weighted in some way that is usually related to the weak learners'
accuracy
AdaBoost
AdaBoost Algorithm
AdaBoost
https://www.youtube.com/watch?v=LsK-xG1cLYA
Decision Stump (Weak learner)
AdaBoost
error↓ → α↑
Logit function
When the error is exactly 0 or 1 (α becomes unbounded)
AdaBoost
Weighted Gini Index
Bootstrapping
AdaBoost
Exponential error
AdaBoost
Assume the models up to the (m−1)-th stage have already been built and we are now adding the m-th weak learner
Optimization is carried out only over α_m and y_m
T_m: points correctly classified by y_m / M_m: points misclassified by y_m
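The equations on these slides (from PRML, Section 14.3.1) were images; roughly, the stage-m exponential error being minimized is of the form

\[ E \;=\; \sum_n w_n^{(m)} \exp\!\Bigl\{-\tfrac{1}{2}\, t_n\, \alpha_m\, y_m(\mathbf{x}_n)\Bigr\} \;=\; e^{-\alpha_m/2} \sum_{n \in T_m} w_n^{(m)} \;+\; e^{\alpha_m/2} \sum_{n \in M_m} w_n^{(m)}, \]

where t_n is the true label and w_n^{(m)} the current data weights; the expression referred to as (14.23) below is a rearrangement of this decomposition.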
AdaBoost
Minimize (14.23) with respect to y_m, treating α_m as a constant → derivation of Eq. (14.15)
AdaBoost
Minimize (14.23) with respect to α_m → derivation of Eq. (14.17)
https://en.wikipedia.org/wiki/AdaBoost
AdaBoost
The factor exp(−α_m / 2) is independent of n, so it can be dropped from the weight update
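A compact sketch of the discrete AdaBoost loop with decision stumps (my own illustration, using the common ½·ln((1−err)/err) convention; labels must be in {−1, +1}):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """Discrete AdaBoost with depth-1 trees; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # start with uniform weights
    learners, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)  # weighted error of this stump
        err = np.clip(err, 1e-10, 1 - 1e-10)       # guard against error exactly 0 or 1
        alpha = 0.5 * np.log((1 - err) / err)      # lower error -> larger alpha
        w *= np.exp(-alpha * y * pred)             # up-weight the misclassified points
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # Sign of the alpha-weighted vote of the weak learners
    return np.sign(sum(a * clf.predict(X) for a, clf in zip(alphas, learners)))
```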
Gradient Boosting
Gradient Boosting Algorithm
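A minimal sketch of gradient boosting for regression with squared loss (my own illustration: each new tree is fitted to the current residuals, i.e. the negative gradient of the loss):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, lr=0.1):
    f0 = float(np.mean(y))                  # initial constant prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        residual = y - pred                 # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        pred += lr * tree.predict(X)        # shrink each tree's contribution
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)
```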
Further topics…
XGBoost
LightGBM
CatBoost
Optimal decision tree
