The First NIDA Business Analytics and Data Sciences Contest/Conference
1-2 September 2016, นวมินทราธิราช Building, National Institute of Development Administration (NIDA)
https://businessanalyticsnida.wordpress.com
https://www.facebook.com/BusinessAnalyticsNIDA/
by Assoc. Prof. Dr. สุรพงค์ เอื้อวัฒนามงคล
Data Science Program
Graduate School of Applied Statistics, National Institute of Development Administration
Machine Learning: An introduction
How do machines learn?
What can machines learn?
What applications can machine learning be used for?
Is advanced mathematics required to study machine learning?
What software is available for machine learning?
How many types of machine learning are there, and what is each type used for?
Room นวมินทราธิราช 3003, 1 September 2016, 10:15-12:30
Machine Learning
An Introduction
Types of Machine Learning
• Supervised Learning (Classification, Prediction)
• Unsupervised Learning (Cluster Analysis)
• Association Analysis
• Reinforcement Learning
• Evolutionary Learning
Classification
• Based on Supervised Learning
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy
of the model. Usually, the given data set is
divided into training and test sets, with
training set used to build the model and test
set used to validate it.
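As a rough illustration of this workflow, the sketch below builds a classifier on a training set and checks its accuracy on a held-out test set. It is not part of the original slides; scikit-learn and the iris data set are assumptions chosen for brevity.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Records: attribute values in X, the class attribute in y
X, y = load_iris(return_X_y=True)

# Divide the given data set into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the model from the training set
model = DecisionTreeClassifier().fit(X_train, y_train)

# Assign classes to previously unseen records and measure accuracy on the test set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))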
Classification Task
[Diagram: a learning algorithm performs Induction on the Training Set to learn a Model; the Model is then applied (Deduction) to the Test Set]

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of Classification Tasks
• Predicting potential customers of a new product
• Identifying spam emails or network intrusion
connections
• Classifying credit risks of customers
• Categorizing news stories as finance,
weather, entertainment, sports, etc
Classification Techniques
• Decision Trees
• K-nearest Neighbors
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
• Ensemble Method
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
  Refund = Yes                                           -> NO
  Refund = No, MarSt = Married                           -> NO
  Refund = No, MarSt = Single or Divorced, TaxInc < 80K  -> NO
  Refund = No, MarSt = Single or Divorced, TaxInc > 80K  -> YES
Decision Tree Classification Task
[Diagram: the same workflow as the Classification Task above, with a Tree Induction algorithm learning a Decision Tree from the Training Set, and the Decision Tree then applied to the Test Set]
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch matching the test record at each node:
  Refund = No      -> go to the MarSt node
  MarSt = Married  -> reach the leaf labeled NO
Assign Cheat to "No".
Decision Boundary
[Figure: points in the unit square (x, y) partitioned by a decision tree with the test conditions x < 0.43, y < 0.47, and y < 0.33; each leaf region holds points of predominantly one class]
• The border line between two neighboring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time
Tree Induction
• Greedy strategy
– Split the training records assigned to each node
from root node to the leaf nodes based on an
attribute test that optimizes certain criterion e.g.
gains of homogeneity of training records for each
node in the tree
– Measures of homogeneity of training records for
a tree node : Entropy, GINI
– Stop splitting when some predefined criterion are
met e.g. the measures reach a predefined
certain thresholds
Measure of Impurity: GINI
• Gini Index for a given node t:
    GINI(t) = 1 − Σ_j [p(j|t)]²
• p(j|t) is the relative frequency of class j at node t.
– Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information
Measure of Impurity: Entropy
• Entropy at a given node t:
    Entropy(t) = − Σ_j p(j|t) log p(j|t)
• p(j|t) is the relative frequency of class j at node t.
– Measures impurity of a node.
• Maximum (log n_c) when records are equally distributed among all classes, implying least information
• Minimum (0.0) when all records belong to one class, implying most information
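A minimal sketch of the two impurity measures above, assuming the class distribution at a node is given as a list of counts (the helper names and the base-2 logarithm are illustrative choices, not from the slides):

import math

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Equally distributed records give maximum impurity; a pure node gives 0.0
print(gini([5, 5]), entropy([5, 5]))    # 0.5 1.0
print(gini([10, 0]), entropy([10, 0]))  # both 0.0 for a pure node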
Nearest Neighbor Classifiers
[Figure: for a given test record, compute the distance to the training records and choose the k "nearest" records]
Nearest-Neighbor Classifiers
 Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
 To classify an unknown record:
– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
Nearest Neighbor Classification
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes
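A minimal k-nearest-neighbor sketch following the three ingredients listed above (stored records, a distance metric, and k); NumPy, the Euclidean distance, and the toy data are assumptions for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)            # distance to every training record
    nearest = np.argsort(dists)[:k]                        # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote of their class labels

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "A"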
Bayesian Classifiers
• Consider each attribute and class label as random
variables
• Given a record with attributes (A1, A2,…,An)
– Goal is to predict class C
– Specifically, we want to find the value of C that
maximizes P(C| A1, A2,…,An )
• Can we estimate P(C| A1, A2,…,An ) directly from
data?
Bayesian Classifiers
• Approach:
– compute the posterior probability P(C | A1, A2, …, An) for
all values of C using the Bayes theorem
– Choose value of C that maximizes
P(C | A1, A2, …, An)
– Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
• How to estimate P(A1, A2, …, An | C )?
    P(C | A1, A2, …, An) = P(A1, A2, …, An | C) · P(C) / P(A1, A2, …, An)
Naïve Bayes Classifier
• Assume independence among attributes Ai when
class is given:
– P(A1, A2, …, An |Cj) = P(A1| Cj) P(A2| Cj)… P(An| Cj)
– Can estimate P(Ai| Cj) for all Ai and Cj.
– A new unknown record is classified to Cj if P(Cj) · Π_i P(Ai | Cj) is maximal.
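The sketch below applies this rule with scikit-learn's CategoricalNB, which estimates P(C) and each P(Ai | C) from frequency counts; the integer encoding and the tiny data set are illustrative assumptions:

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Two categorical attributes encoded as integers, e.g. A1 = Refund (0/1), A2 = Marital Status (0/1/2)
X = np.array([[1, 0], [0, 1], [0, 0], [1, 1], [0, 2], [0, 1], [1, 2], [0, 0], [0, 1], [0, 0]])
y = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])   # class: 0 = No, 1 = Yes

model = CategoricalNB().fit(X, y)   # estimates P(C) and P(Ai | C) for every attribute value
print(model.predict([[0, 1]]))      # picks the class with maximal P(C) * product of P(Ai | C)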
Artificial Neural Networks (ANN)
Perceptron Model:
    Y = I(Σ_i w_i X_i − t)   or   Y = sign(Σ_i w_i X_i − t)
• Model is an assembly of inter-connected nodes and weighted links
• Output node sums up each of its input values according to the weights of its links
• Compare output node against some threshold t
[Figure: a "black box" with input nodes X1, X2, X3, link weights w1, w2, w3, a threshold t, and an output node Y]
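A minimal sketch of the perceptron output rule above; the weights and threshold are fixed by hand here (learning them is a separate training step), and NumPy is an assumed dependency:

import numpy as np

def perceptron(x, w, t):
    # Y = sign(sum_i w_i * x_i - t), with ties broken toward +1
    return 1 if np.dot(w, x) - t >= 0 else -1

w = np.array([0.3, 0.3, 0.3])   # one weight per input link (illustrative values)
t = 0.4                         # threshold
print(perceptron(np.array([1, 1, 0]), w, t))   # 0.6 - 0.4 >= 0  -> +1
print(perceptron(np.array([1, 0, 0]), w, t))   # 0.3 - 0.4 <  0  -> -1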
General Structure of ANN
[Figure: neuron i receives inputs I1, I2, I3 over links with weights wi1, wi2, wi3, forms the weighted sum Si, and emits output Oi = g(Si), where g is the activation function and t the threshold; a full network stacks an Input Layer (x1 … x5), a Hidden Layer, and an Output Layer (y)]
Training an ANN means learning the weights of the neurons as well as the thresholds t.
    sigmoid(x) = 1 / (1 + e^(−x))
    Y = sigmoid(Σ_i w_i X_i − t)
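A sketch of one forward pass through such a structure (one hidden layer, sigmoid activations); the layer sizes, weights, and thresholds are random illustrative values, not from the slides:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.random(5)               # input layer x1 ... x5
W_hidden = rng.random((3, 5))   # weights into 3 hidden neurons
t_hidden = rng.random(3)        # hidden-neuron thresholds
W_out = rng.random(3)           # weights into the single output neuron
t_out = 0.5                     # output-neuron threshold

hidden = sigmoid(W_hidden @ x - t_hidden)   # Oi = g(Si) for each hidden neuron
y = sigmoid(W_out @ hidden - t_out)         # network output y
print(y)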
Backpropagation Algorithm
– Gradient Descent is illustrated using a single weight w1 of w
– Preferred values for w1 minimize the sum of squared errors
    SSE = Σ_i (Y_i − f(w, X_i))²
– Optimal value for w1 is w1*
[Figure: SSE plotted against w1, with the minimum at w1* between w1L and w1R]
Backpropagation Algorithm
– Direction for adjusting w_CURRENT is the negative sign of the derivative of SSE at w_CURRENT:
    direction = −sign( ∂SSE/∂w evaluated at w_CURRENT )
– To adjust, use the magnitude of the derivative of SSE at w_CURRENT:
    Δw_CURRENT = −η · ∂SSE/∂w evaluated at w_CURRENT
– When the curve is steep, the adjustment is large
– When the curve is nearly flat, the adjustment is small
– Learning Rate η has values in [0, 1]
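A gradient-descent sketch for a single weight following the update rule above; the simple model f(w, x) = w·x, the toy data, and the learning rate are illustrative assumptions:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X                    # targets generated with the "true" weight 2.0

w, eta = 0.0, 0.01             # initial weight and learning rate
for _ in range(100):
    grad = -2.0 * np.sum((Y - w * X) * X)   # dSSE/dw for SSE = sum (Y - w*X)^2
    w = w - eta * grad                      # step against the gradient
print(w)                                    # converges toward 2.0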
Support Vector Machines
• Find the hyperplane that maximizes the margin => B1 is better than B2
[Figure: two candidate hyperplanes B1 and B2 with margin boundaries b11, b12 and b21, b22; B1 has the larger margin]
Support Vector Machines
[Figure: hyperplane B1 with its margin boundaries b11 and b12]
    w · x + b = 0        (decision hyperplane)
    w · x + b = +1,  w · x + b = −1   (margin boundaries)
    f(x) = +1 (positive class) if w · x + b ≥ 1
           −1 (negative class) if w · x + b ≤ −1
    Margin = 2 / ||w||²
Support Vector Machines
• We want to maximize:  Margin = 2 / ||w||²
– Which is equivalent to minimizing:  L(w) = ||w||² / 2
– But subject to the following constraints:
    f(x_i) = +1 if w · x_i + b ≥ 1
             −1 if w · x_i + b ≤ −1
– This is a constrained optimization problem. Numerical approaches, e.g. quadratic programming, can be used to solve it.
Support Vector Machines
• Decision function for classifying a given data vector z:
    f(z) = sign( Σ_{i ∈ SV} λ_i y_i (x_i · z) + b )
  where SV is the set of support vectors.
Nonlinear Support Vector Machines
• What if the decision boundary is not linear?
Nonlinear Support Vector Machines
• Transform the data vectors X into a new, higher-dimensional space
• Some kernel functions can be used to compute the dot product between any two given original data vectors in the new space (without the need for an actual data transformation).
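As an illustration of both SVM ideas, the sketch below fits SVMs with a linear kernel and with an RBF kernel on data that is not linearly separable; scikit-learn and the toy data are assumptions, not part of the slides:

import numpy as np
from sklearn.svm import SVC

# Two concentric rings: not separable by any linear decision boundary
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.where(np.arange(200) < 100, 1.0, 3.0) + rng.normal(0, 0.1, 200)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
y = np.where(np.arange(200) < 100, 1, -1)

linear = SVC(kernel="linear").fit(X, y)   # solves the constrained quadratic problem above
rbf = SVC(kernel="rbf").fit(X, y)         # kernel replaces the dot product in the new space
print("linear kernel accuracy:", linear.score(X, y))
print("rbf kernel accuracy:   ", rbf.score(X, y))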
Ensemble Methods
• Construct a set of classifiers from the training
data
• Predict class label of previously unseen records
by aggregating predictions made by multiple
classifiers
General Idea
[Diagram: Step 1: create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D; Step 2: build a classifier Ci on each Di; Step 3: combine the classifiers into C*]
Why does it work?
• Suppose there are 25 base classifiers
– Each classifier has error rate ε = 0.35
– Assume classifiers are independent
– Probability that the ensemble classifier makes a wrong prediction (i.e. at least 13 of the 25 base classifiers are wrong):
    Σ_{i=13}^{25} C(25, i) · ε^i · (1 − ε)^(25−i) ≈ 0.06
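The value above can be checked directly; this one-off computation uses only the Python standard library:

from math import comb

eps, n = 0.35, 25
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))   # ~0.06, far below the 0.35 error rate of each base classifier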
Examples of Ensemble Methods
• How to generate an ensemble of classifiers?
– Bagging
– Boosting
Bagging
• Sampling with replacement
• Build classifier on each bootstrap sample
• Each record has probability 1 − (1 − 1/n)^n of being selected in each round
Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
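A bagging sketch matching the idea of the table above (bootstrap samples drawn with replacement, one decision tree per sample, majority-vote aggregation); scikit-learn and the iris data are assumptions:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 25 trees, each trained on its own bootstrap sample of the training data
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0).fit(X, y)
print(bag.predict(X[:5]))   # aggregated (majority-vote) predictions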
Boosting
• An iterative procedure to adaptively change
distribution of training data by focusing more on
previously misclassified records
– Initially, all N records are assigned equal
weights
– Unlike bagging, the weights may change at the end of each boosting round
Boosting
• Records that are wrongly classified will have their
weights increased
• Records that are classified correctly will have
their weights decreased
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
• Example 4 is hard to classify
• Its weight is increased, therefore it is more
likely to be chosen again in subsequent rounds
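AdaBoost is a standard realization of this reweighting idea (misclassified records receive larger weights in the next round); the scikit-learn call and data set below are assumptions for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each round fits a weak learner on the reweighted data; the weighted votes are combined
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(boost.score(X, y))   # accuracy of the combined classifier on the training data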
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: clusters for which intra-cluster distances are minimized and inter-cluster distances are maximized]
Applications of Cluster Analysis
• Understanding
– Group related documents for browsing, group
customers into segments or group stocks with similar
price fluctuations
• Summarization
– Reduce the size of large data sets by sampling data
from each cluster
K-means Clustering
• Each cluster is associated with a centroid
(center point)
• Each data point is assigned to the cluster with
the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
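A minimal K-means sketch: K is specified up front, each point is assigned to its closest centroid, and the centroids are then recomputed until the assignment stabilizes; scikit-learn and the toy points are assumptions:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [8.5, 7.5], [9.0, 8.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each data point
print(km.cluster_centers_)  # final centroids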
K-Means Algorithm
[Figure: six scatter plots (Iteration 1 through Iteration 6) showing how the cluster assignments and centroids converge over the iterations]
Limitations of K-means
• K-means has problems when clusters are of
differing
– Sizes
– Densities
– Non-globular shapes
• K-means has problems when the data contains
outliers.
Hierarchical Clustering
• Produces a set of nested clusters organized as a
hierarchical tree
• Can be visualized as a dendrogram
– A tree like diagram that records the sequences
of merges or splits
[Figure: a dendrogram over six points, with merge heights between 0 and 0.2, alongside the corresponding nested clusters]
Agglomerative Clustering Algorithm
• A popular hierarchical clustering technique
• Basic algorithm is straightforward
1. Compute the proximity matrix (similarities between
pairs of clusters)
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
May not be suitable for large datasets due to the cost
of computing and updating the proximity matrix
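A small sketch of the algorithm above using SciPy's hierarchical-clustering routines; the toy points, the group-average linkage, and the Matplotlib plotting are assumptions:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 1.0], [1.2, 1.1], [3.0, 3.0],
              [3.1, 2.9], [6.0, 6.0], [6.2, 6.1]])

# linkage repeatedly merges the two closest clusters and records the merge sequence
Z = linkage(X, method="average")   # "average" = group-average inter-cluster similarity
dendrogram(Z)                      # visualize the sequence of merges
plt.show()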
How to Define Inter-Cluster Similarity
[Figure: two clusters of points p1 … p5 and their proximity matrix; which entries define the similarity between the clusters?]
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Ward's Method uses squared error
Hierarchical Clustering: Comparison
[Figure: clusterings of the same six points produced by MIN, MAX, Group Average, and Ward's Method]
Other Issues
• Data Cleaning
• Data Sampling
• Dimension Reduction
• Data Visualization
• Overfitting and Underfitting Problems
• Imbalance Issues