김현우 (Hyunwoo Kim) a.k.a. 순록킴
yBigTa 9th cohort
Psychology & Computer Science
[Title slide: the Tree · Python · yBigTa]
Contents { decision tree,
RSS,
Gini,
pruning };
Random Forest

Tree Structure
[Diagram: tree structure with the Root at the top, internal Nodes, and Leaf nodes at the bottom]
Tree as an Algorithm
Tree as a Data Structure
[Diagram: a binary search tree with root 6, children 2 and 8, and leaves 1, 4, 9]
Tree as a Data Structure
Binary Search tree
Red-Black tree
AVL tree
2-4 tree
Heap
Tree as an Algorithm
Decision Tree
Random Forest
Decision Tree
Regression & Classification
Decision Tree
Regression
Predicting baseball players' salary (log-transformed)
Decision Tree: a top-down, greedy approach
Decision Tree: Greedy algorithm
How it works
[Diagram: greedy "Decision?" splits carve the predictor space into regions R1, R2, R3]
Splitting Criterion for Regression
Splitting criterion, Regression
RSS, Residual Sum of Squares: the sum of squared residuals
(also called SSE, Sum of Squared Errors of prediction) → minimize

RSS = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2

y_i: a single data point; \hat{y}_{R_j}: the mean within one region.
Inner sum: within one region, the squared differences between the data points and their mean.
Outer sum: add those up over all regions.
Decision Tree, Regression
RSS, Residual Sum of Squares: the sum of squared residuals
1. Select a predictor X and a cutpoint t
that split the predictor space into the regions
{ X | X < t } and { X | X >= t }
2. Select the pair that leads to the greatest possible reduction in RSS
(pick the X and t that shrink RSS the most)
== among the resulting trees, select the one with the lowest RSS
(each choice of X and t yields a different candidate tree;
the algorithm picks the candidate whose RSS comes out smallest; see the sketch below)
How it works
[Diagram: recursive splitting partitions the predictor space into regions R1 to R5]
How it looks
Decision Tree
Classification
Predicting Iris data with 2 variables: petal length, width
Splitting Criterion for Classification
Splitting Criterion
Gini Index
Cross Entropy
Splitting criterion, Classification
Classification error rate
Resubstitution error
Splitting criterion, Classification
Gini index: a measure of impurity

G_m = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})

\hat{p}_{mk}: the proportion of training observations in the m-th region that are from the k-th class
1 - \hat{p}_{mk}: the proportion of observations in region m that are NOT from class k
Splitting criterion, Classification
Gini index: a measure of impurity → minimize
Splitting criterion, Classification
Gini index: a measure of impurity
What happens as \hat{p}_{mk} gets close to 0 or 1..?
(the node gets purer, and its Gini index gets smaller)
Splitting criterion, Classification
Gini index: a measure of impurity
weight × G_m: each node's Gini index is weighted by
the number of observations in that node as a fraction of the total
Break
Splitting criterion, Classification
Gini index: a measure of impurity
Let's work one out by hand
Split on Gender vs. Split on Class (10th / 11th)
Splitting criterion, Classification
Gini index: a measure of impurity

Split on Gender
All: 30 students, 15 play Overwatch (50%)
Girl: 10 students, 2 play Overwatch (20%)
Boy: 20 students, 13 play Overwatch (65%)

Gini(Gender) = ( 0.2 * 0.8 + 0.8 * 0.2 ) * 0.33      [Girl: 10/30]
             + ( 0.65 * 0.35 + 0.35 * 0.65 ) * 0.66  [Boy: 20/30]
             ≈ 0.4
Splitting criterion, Classification
Gini index: a measure of impurity

Split on Class
All: 30 students, 15 play Overwatch (50%)
10th: 14 students, 6 play Overwatch (43%)
11th: 16 students, 9 play Overwatch (56%)

Gini(Class) = ( 0.43 * 0.57 + 0.57 * 0.43 ) * 0.47   [10th: 14/30]
            + ( 0.56 * 0.44 + 0.44 * 0.56 ) * 0.53   [11th: 16/30]
            ≈ 0.49
Splitting criterion, Classification
Gini index: a measure of impurity

Split on Gender wins: 0.4 < 0.49, so the Gender split
(Girl: 2/10 play Overwatch, Boy: 13/20) leaves the purer nodes and is the one chosen.
Stopping Criterion

Stopping criterion
1. The node is pure
2. There are fewer observations than MinLeafSize
3. The algorithm has already made MaxNumSplits splits
(scikit-learn analogues are sketched below)
A small problem
Overfitting
The algorithm becomes too specific to the data you used to train it,
so it cannot generalize well to data it has never seen before.
Overfitting
Bias & Variance
Overfitting & Accuracy
Decision Tree
Regression
Overfitting
Pruning
Pruning methods
Reduced Error Pruning
Cost-complexity Pruning
Cost-complexity Pruning
Tree constructing:
grow the tree split after split until the stopping criterion fires → Stop,
then cut the weakest branches back → Prune
Model selection
Training set
Test set
Decision Trees
CART (Classification And Regression Tree)
C5.0
CHAID
Decision Trees
[Table: CART vs. CHAID comparison]
Tree & Linearity
Unlike linear models, trees map non-linear relationships quite well
Tree & Linearity / Tree & Non-linearity
[Diagrams: linear vs. non-linear decision boundaries; a tree behaves like a set of logistic regressions]
Tree & advantages
1. Easy to understand: chew it, tear it, taste it, enjoy it [White box]
2. Needs little data cleaning: put the data straight in
3. Doesn't care whether features are numerical or categorical: just put them in
4. Handy for seeing what patterns the data holds: put it in and look
Tree & disadvantages
1. Overfitting
2. A bit weak with continuous numerical values…
Tree implementation
Tree
Used in more places than you'd expect
Tree
A good starting point for machine learning
Thank you

Decision Tree Intro