김현우 (Hyunwoo Kim) a.k.a. 순록킴
yBigTa 9th cohort
Psychology & Computer Science
[Title slide: the Tree · Python · yBigTa]
Contents { decision tree,
RSS,
Gini,
pruning };
Random Forest

Tree Structure
[Diagram: tree structure with the Root at the top, internal Nodes, and Leaf nodes at the bottom]
Tree as an Algorithm
Tree as a Data Structure
[Diagram: a binary search tree with root 6, children 2 and 8, and leaves 1, 4, 9]
Tree as a Data Structure
Binary Search tree
Red-Black tree
AVL tree
2-4 tree
Heap
Tree as an Algorithm
Decision Tree
Random Forest
Decision Tree
Regression & Classification
Decision Tree
Regression
Predicting baseball players' salary (log-transformed)
Decision Tree: a top-down, greedy approach
Decision Tree: Greedy algorithm
How it works
[Diagram: greedy "Decision?" splits carve the predictor space into regions R1, R2, R3]
Splitting Criterion for Regression
Splitting criterion, Regression
RSS, Residual Sum of Squares: the sum of squared residuals
(also called SSE, Sum of Squared Errors of prediction) → minimize

RSS = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2

y_i: a single data point; \hat{y}_{R_j}: the mean within one region.
Inner sum: within one region, the squared differences between the data points and their mean.
Outer sum: add those up over all regions.
Decision Tree, Regression
RSS, Residual Sum of Squares: the sum of squared residuals
1. Select a predictor X and a cutpoint t
that split the predictor space into the regions
{ X | X < t } and { X | X >= t }
2. Select the pair that leads to the greatest possible reduction in RSS
(pick the X and t that shrink RSS the most)
== among the resulting trees, select the one with the lowest RSS
(each choice of X and t yields a different candidate tree;
the algorithm picks the candidate whose RSS comes out smallest; see the sketch below)
How it works
[Diagram: recursive splitting partitions the predictor space into regions R1 to R5]
How it looks
Decision Tree
Classification
Predicting Iris data with 2 variables: petal length, width
Splitting Criterion for Classification
Splitting Criterion
Gini Index
Cross Entropy
Splitting criterion, Classification
Classification error rate
Resubstitution error
Splitting criterion, Classification
Gini index: a measure of impurity

G_m = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})

\hat{p}_{mk}: the proportion of training observations in the m-th region that are from the k-th class
1 - \hat{p}_{mk}: the proportion of observations in region m that are NOT from class k
Splitting criterion, Classification
Gini index: a measure of impurity → minimize
Splitting criterion, Classification
Gini index: a measure of impurity
What happens as \hat{p}_{mk} gets close to 0 or 1..?
(the node gets purer, and its Gini index gets smaller)
Splitting criterion, Classification
Gini index: a measure of impurity
weight × G_m: each node's Gini index is weighted by
the number of observations in that node as a fraction of the total
Break
Splitting criterion, Classification
Gini index: a measure of impurity
Let's work one out by hand
Split on Gender vs. Split on Class (10th / 11th)
Splitting criterion, Classification
Gini index: a measure of impurity

Split on Gender
All: 30 students, 15 play Overwatch (50%)
Girl: 10 students, 2 play Overwatch (20%)
Boy: 20 students, 13 play Overwatch (65%)

Gini(Gender) = ( 0.2 * 0.8 + 0.8 * 0.2 ) * 0.33      [Girl: 10/30]
             + ( 0.65 * 0.35 + 0.35 * 0.65 ) * 0.66  [Boy: 20/30]
             ≈ 0.4
Splitting criterion, Classification
Gini index: a measure of impurity

Split on Class
All: 30 students, 15 play Overwatch (50%)
10th: 14 students, 6 play Overwatch (43%)
11th: 16 students, 9 play Overwatch (56%)

Gini(Class) = ( 0.43 * 0.57 + 0.57 * 0.43 ) * 0.47   [10th: 14/30]
            + ( 0.56 * 0.44 + 0.44 * 0.56 ) * 0.53   [11th: 16/30]
            ≈ 0.49
Splitting criterion, Classification
Gini index: a measure of impurity

Split on Gender wins: 0.4 < 0.49, so the Gender split
(Girl: 2/10 play Overwatch, Boy: 13/20) leaves the purer nodes and is the one chosen.
Stopping Criterion

Stopping criterion
1. The node is pure
2. There are fewer observations than MinLeafSize
3. The algorithm has already made MaxNumSplits splits
(scikit-learn analogues are sketched below)
A small problem
Overfitting
The algorithm becomes too specific to the data you used to train it,
so it cannot generalize well to data it has never seen before.
Overfitting
Bias & Variance
Overfitting & Accuracy
Decision Tree
Regression
Overfitting
Pruning
Pruning methods
Reduced Error Pruning
Cost-complexity Pruning
Cost-complexity Pruning
Tree constructing:
grow the tree split after split until the stopping criterion fires → Stop,
then cut the weakest branches back → Prune
Model selection
Training set
Test set
Decision Trees
CART (Classification And Regression Tree)
C5.0
CHAID
Decision Trees
[Table: CART vs. CHAID comparison]
Tree & Linearity
Unlike linear models, trees map non-linear relationships quite well
Tree & Linearity / Tree & Non-linearity
[Diagrams: linear vs. non-linear decision boundaries; a tree behaves like a set of logistic regressions]
Tree & advantages
1. Easy to understand: chew it, tear it, taste it, enjoy it [White box]
2. Needs little data cleaning: put the data straight in
3. Doesn't care whether features are numerical or categorical: just put them in
4. Handy for seeing what patterns the data holds: put it in and look
Tree & disadvantages
1. Overfitting
2. A bit weak with continuous numerical values…
Tree implementation
Tree
Used in more places than you'd expect
Tree
A good starting point for machine learning
Thank you

Decision Tree Intro