The First NIDA Business Analytics and Data Sciences Contest/Conference
1-2 September 2016, นวมินทราธิราช Building, National Institute of Development Administration (NIDA)
https://businessanalyticsnida.wordpress.com
https://www.facebook.com/BusinessAnalyticsNIDA/
by Assoc. Prof. Dr. สุรพงค์ เอื้อวัฒนามงคล
Data Science Program
Graduate School of Applied Statistics, National Institute of Development Administration
Machine Learning: An introduction
How do machines learn?
What can machines learn?
What applications can machine learning be used for?
Is advanced mathematics required to study machine learning?
What software is available for machine learning?
How many types of machine learning are there, and what is each type used for?
Room นวมินทราธิราช 3003, 1 September 2016, 10:15-12:30
Machine Learning
An Introduction
Types of Machine Learning
• Supervised Learning (Classification, Prediction)
• Unsupervised Learning (Cluster Analysis)
• Association Analysis
• Reinforcement Learning
• Evolutionary Learning
Classification
• Based on Supervised Learning
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy
of the model. Usually, the given data set is
divided into training and test sets, with
training set used to build the model and test
set used to validate it.
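As a rough illustration of this workflow, the sketch below builds a classifier on a training set and checks its accuracy on a held-out test set. It is not part of the original slides; scikit-learn and the iris data set are assumptions chosen for brevity.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Records: attribute values in X, the class attribute in y
X, y = load_iris(return_X_y=True)

# Divide the given data set into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the model from the training set
model = DecisionTreeClassifier().fit(X_train, y_train)

# Assign classes to previously unseen records and measure accuracy on the test set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))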
Classification Task
[Diagram: a learning algorithm performs Induction on the Training Set to learn a Model; the Model is then applied (Deduction) to the Test Set]

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of Classification Tasks
• Predicting potential customers of a new product
• Identifying spam emails or network intrusion
connections
• Classifying credit risks of customers
• Categorizing news stories as finance,
weather, entertainment, sports, etc
Classification Techniques
• Decision Trees
• K-nearest Neighbors
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
• Ensemble Method
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
  Refund = Yes                                           -> NO
  Refund = No, MarSt = Married                           -> NO
  Refund = No, MarSt = Single or Divorced, TaxInc < 80K  -> NO
  Refund = No, MarSt = Single or Divorced, TaxInc > 80K  -> YES
Decision Tree Classification Task
[Diagram: the same workflow as the Classification Task above, with a Tree Induction algorithm learning a Decision Tree from the Training Set, and the Decision Tree then applied to the Test Set]
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch matching the test record at each node:
  Refund = No      -> go to the MarSt node
  MarSt = Married  -> reach the leaf labeled NO
Assign Cheat to "No".
Decision Boundary
[Figure: points in the unit square (x, y) partitioned by a decision tree with the test conditions x < 0.43, y < 0.47, and y < 0.33; each leaf region holds points of predominantly one class]
• The border line between two neighboring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time
Tree Induction
• Greedy strategy
– Split the training records assigned to each node
from root node to the leaf nodes based on an
attribute test that optimizes certain criterion e.g.
gains of homogeneity of training records for each
node in the tree
– Measures of homogeneity of training records for
a tree node : Entropy, GINI
– Stop splitting when some predefined criterion are
met e.g. the measures reach a predefined
certain thresholds
Measure of Impurity: GINI
• Gini Index for a given node t:
    GINI(t) = 1 − Σ_j [p(j|t)]²
• p(j|t) is the relative frequency of class j at node t.
– Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information
Measure of Impurity: Entropy
• Entropy at a given node t:
    Entropy(t) = − Σ_j p(j|t) log p(j|t)
• p(j|t) is the relative frequency of class j at node t.
– Measures impurity of a node.
• Maximum (log n_c) when records are equally distributed among all classes, implying least information
• Minimum (0.0) when all records belong to one class, implying most information
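A minimal sketch of the two impurity measures above, assuming the class distribution at a node is given as a list of counts (the helper names and the base-2 logarithm are illustrative choices, not from the slides):

import math

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Equally distributed records give maximum impurity; a pure node gives 0.0
print(gini([5, 5]), entropy([5, 5]))    # 0.5 1.0
print(gini([10, 0]), entropy([10, 0]))  # both 0.0 for a pure node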
Nearest Neighbor Classifiers
[Figure: for a given test record, compute the distance to the training records and choose the k "nearest" records]
Nearest-Neighbor Classifiers
 Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
 To classify an unknown record:
– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
Nearest Neighbor Classification
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes
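A minimal k-nearest-neighbor sketch following the three ingredients listed above (stored records, a distance metric, and k); NumPy, the Euclidean distance, and the toy data are assumptions for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)            # distance to every training record
    nearest = np.argsort(dists)[:k]                        # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote of their class labels

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "A"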
Bayesian Classifiers
• Consider each attribute and class label as random
variables
• Given a record with attributes (A1, A2,…,An)
– Goal is to predict class C
– Specifically, we want to find the value of C that
maximizes P(C| A1, A2,…,An )
• Can we estimate P(C| A1, A2,…,An ) directly from
data?
Bayesian Classifiers
• Approach:
– compute the posterior probability P(C | A1, A2, …, An) for
all values of C using the Bayes theorem
– Choose value of C that maximizes
P(C | A1, A2, …, An)
– Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
• How to estimate P(A1, A2, …, An | C )?
    P(C | A1, A2, …, An) = P(A1, A2, …, An | C) · P(C) / P(A1, A2, …, An)
Naïve Bayes Classifier
• Assume independence among attributes Ai when
class is given:
– P(A1, A2, …, An |Cj) = P(A1| Cj) P(A2| Cj)… P(An| Cj)
– Can estimate P(Ai| Cj) for all Ai and Cj.
– A new unknown record is classified to Cj if P(Cj) · Π_i P(Ai | Cj) is maximal.
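The sketch below applies this rule with scikit-learn's CategoricalNB, which estimates P(C) and each P(Ai | C) from frequency counts; the integer encoding and the tiny data set are illustrative assumptions:

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Two categorical attributes encoded as integers, e.g. A1 = Refund (0/1), A2 = Marital Status (0/1/2)
X = np.array([[1, 0], [0, 1], [0, 0], [1, 1], [0, 2], [0, 1], [1, 2], [0, 0], [0, 1], [0, 0]])
y = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])   # class: 0 = No, 1 = Yes

model = CategoricalNB().fit(X, y)   # estimates P(C) and P(Ai | C) for every attribute value
print(model.predict([[0, 1]]))      # picks the class with maximal P(C) * product of P(Ai | C)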
Artificial Neural Networks (ANN)
Perceptron Model:
    Y = I(Σ_i w_i X_i − t)   or   Y = sign(Σ_i w_i X_i − t)
• Model is an assembly of inter-connected nodes and weighted links
• Output node sums up each of its input values according to the weights of its links
• Compare output node against some threshold t
[Figure: a "black box" with input nodes X1, X2, X3, link weights w1, w2, w3, a threshold t, and an output node Y]
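A minimal sketch of the perceptron output rule above; the weights and threshold are fixed by hand here (learning them is a separate training step), and NumPy is an assumed dependency:

import numpy as np

def perceptron(x, w, t):
    # Y = sign(sum_i w_i * x_i - t), with ties broken toward +1
    return 1 if np.dot(w, x) - t >= 0 else -1

w = np.array([0.3, 0.3, 0.3])   # one weight per input link (illustrative values)
t = 0.4                         # threshold
print(perceptron(np.array([1, 1, 0]), w, t))   # 0.6 - 0.4 >= 0  -> +1
print(perceptron(np.array([1, 0, 0]), w, t))   # 0.3 - 0.4 <  0  -> -1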
General Structure of ANN
[Figure: neuron i receives inputs I1, I2, I3 over links with weights wi1, wi2, wi3, forms the weighted sum Si, and emits output Oi = g(Si), where g is the activation function and t the threshold; a full network stacks an Input Layer (x1 … x5), a Hidden Layer, and an Output Layer (y)]
Training an ANN means learning the weights of the neurons as well as the thresholds t.
    sigmoid(x) = 1 / (1 + e^(−x))
    Y = sigmoid(Σ_i w_i X_i − t)
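A sketch of one forward pass through such a structure (one hidden layer, sigmoid activations); the layer sizes, weights, and thresholds are random illustrative values, not from the slides:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.random(5)               # input layer x1 ... x5
W_hidden = rng.random((3, 5))   # weights into 3 hidden neurons
t_hidden = rng.random(3)        # hidden-neuron thresholds
W_out = rng.random(3)           # weights into the single output neuron
t_out = 0.5                     # output-neuron threshold

hidden = sigmoid(W_hidden @ x - t_hidden)   # Oi = g(Si) for each hidden neuron
y = sigmoid(W_out @ hidden - t_out)         # network output y
print(y)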
Backpropagation Algorithm
– Gradient Descent is illustrated using a single weight w1 of w
– Preferred values for w1 minimize the sum of squared errors
    SSE = Σ_i (Y_i − f(w, X_i))²
– Optimal value for w1 is w1*
[Figure: SSE plotted against w1, with the minimum at w1* between w1L and w1R]
Backpropagation Algorithm
– Direction for adjusting w_CURRENT is the negative sign of the derivative of SSE at w_CURRENT:
    direction = −sign( ∂SSE/∂w evaluated at w_CURRENT )
– To adjust, use the magnitude of the derivative of SSE at w_CURRENT:
    Δw_CURRENT = −η · ∂SSE/∂w evaluated at w_CURRENT
– When the curve is steep, the adjustment is large
– When the curve is nearly flat, the adjustment is small
– Learning Rate η has values in [0, 1]
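A gradient-descent sketch for a single weight following the update rule above; the simple model f(w, x) = w·x, the toy data, and the learning rate are illustrative assumptions:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X                    # targets generated with the "true" weight 2.0

w, eta = 0.0, 0.01             # initial weight and learning rate
for _ in range(100):
    grad = -2.0 * np.sum((Y - w * X) * X)   # dSSE/dw for SSE = sum (Y - w*X)^2
    w = w - eta * grad                      # step against the gradient
print(w)                                    # converges toward 2.0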
Support Vector Machines
• Find the hyperplane that maximizes the margin => B1 is better than B2
[Figure: two candidate hyperplanes B1 and B2 with margin boundaries b11, b12 and b21, b22; B1 has the larger margin]
Support Vector Machines
[Figure: hyperplane B1 with its margin boundaries b11 and b12]
    w · x + b = 0        (decision hyperplane)
    w · x + b = +1,  w · x + b = −1   (margin boundaries)
    f(x) = +1 (positive class) if w · x + b ≥ 1
           −1 (negative class) if w · x + b ≤ −1
    Margin = 2 / ||w||²
Support Vector Machines
• We want to maximize:  Margin = 2 / ||w||²
– Which is equivalent to minimizing:  L(w) = ||w||² / 2
– But subject to the following constraints:
    f(x_i) = +1 if w · x_i + b ≥ 1
             −1 if w · x_i + b ≤ −1
– This is a constrained optimization problem. Numerical approaches, e.g. quadratic programming, can be used to solve it.
Support Vector Machines
• Decision function for classifying a given data vector z:
    f(z) = sign( Σ_{i ∈ SV} λ_i y_i (x_i · z) + b )
  where SV is the set of support vectors.
Nonlinear Support Vector Machines
• What if the decision boundary is not linear?
Nonlinear Support Vector Machines
• Transform the data vectors X into a new, higher-dimensional space
• Some kernel functions can be used to compute the dot product between any two given original data vectors in the new space (without the need for an actual data transformation).
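As an illustration of both SVM ideas, the sketch below fits SVMs with a linear kernel and with an RBF kernel on data that is not linearly separable; scikit-learn and the toy data are assumptions, not part of the slides:

import numpy as np
from sklearn.svm import SVC

# Two concentric rings: not separable by any linear decision boundary
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.where(np.arange(200) < 100, 1.0, 3.0) + rng.normal(0, 0.1, 200)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
y = np.where(np.arange(200) < 100, 1, -1)

linear = SVC(kernel="linear").fit(X, y)   # solves the constrained quadratic problem above
rbf = SVC(kernel="rbf").fit(X, y)         # kernel replaces the dot product in the new space
print("linear kernel accuracy:", linear.score(X, y))
print("rbf kernel accuracy:   ", rbf.score(X, y))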
Ensemble Methods
• Construct a set of classifiers from the training
data
• Predict class label of previously unseen records
by aggregating predictions made by multiple
classifiers
General Idea
[Diagram: Step 1: create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D; Step 2: build a classifier Ci on each Di; Step 3: combine the classifiers into C*]
Why does it work?
• Suppose there are 25 base classifiers
– Each classifier has error rate ε = 0.35
– Assume classifiers are independent
– Probability that the ensemble classifier makes a wrong prediction (i.e. at least 13 of the 25 base classifiers are wrong):
    Σ_{i=13}^{25} C(25, i) · ε^i · (1 − ε)^(25−i) ≈ 0.06
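The value above can be checked directly; this one-off computation uses only the Python standard library:

from math import comb

eps, n = 0.35, 25
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))   # ~0.06, far below the 0.35 error rate of each base classifier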
Examples of Ensemble Methods
• How to generate an ensemble of classifiers?
– Bagging
– Boosting
Bagging
• Sampling with replacement
• Build classifier on each bootstrap sample
• Each record has probability 1 − (1 − 1/n)^n of being selected in each round
Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
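A bagging sketch matching the idea of the table above (bootstrap samples drawn with replacement, one decision tree per sample, majority-vote aggregation); scikit-learn and the iris data are assumptions:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 25 trees, each trained on its own bootstrap sample of the training data
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0).fit(X, y)
print(bag.predict(X[:5]))   # aggregated (majority-vote) predictions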
Boosting
• An iterative procedure to adaptively change
distribution of training data by focusing more on
previously misclassified records
– Initially, all N records are assigned equal
weights
– Unlike bagging, the weights may change at the end of each boosting round
Boosting
• Records that are wrongly classified will have their
weights increased
• Records that are classified correctly will have
their weights decreased
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
• Example 4 is hard to classify
• Its weight is increased, therefore it is more
likely to be chosen again in subsequent rounds
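AdaBoost is a standard realization of this reweighting idea (misclassified records receive larger weights in the next round); the scikit-learn call and data set below are assumptions for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each round fits a weak learner on the reweighted data; the weighted votes are combined
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(boost.score(X, y))   # accuracy of the combined classifier on the training data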
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: clusters for which intra-cluster distances are minimized and inter-cluster distances are maximized]
Applications of Cluster Analysis
• Understanding
– Group related documents for browsing, group
customers into segments or group stocks with similar
price fluctuations
• Summarization
– Reduce the size of large data sets by sampling data
from each cluster
K-means Clustering
• Each cluster is associated with a centroid
(center point)
• Each data point is assigned to the cluster with
the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
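A minimal K-means sketch: K is specified up front, each point is assigned to its closest centroid, and the centroids are then recomputed until the assignment stabilizes; scikit-learn and the toy points are assumptions:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [8.5, 7.5], [9.0, 8.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each data point
print(km.cluster_centers_)  # final centroids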
K-Means Algorithm
[Figure: six scatter plots (Iteration 1 through Iteration 6) showing how the cluster assignments and centroids converge over the iterations]
Limitations of K-means
• K-means has problems when clusters are of
differing
– Sizes
– Densities
– Non-globular shapes
• K-means has problems when the data contains
outliers.
Hierarchical Clustering
• Produces a set of nested clusters organized as a
hierarchical tree
• Can be visualized as a dendrogram
– A tree like diagram that records the sequences
of merges or splits
[Figure: a dendrogram over six points, with merge heights between 0 and 0.2, alongside the corresponding nested clusters]
Agglomerative Clustering Algorithm
• A popular hierarchical clustering technique
• Basic algorithm is straightforward
1. Compute the proximity matrix (similarities between
pairs of clusters)
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
May not be suitable for large datasets due to the cost
of computing and updating the proximity matrix
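A small sketch of the algorithm above using SciPy's hierarchical-clustering routines; the toy points, the group-average linkage, and the Matplotlib plotting are assumptions:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 1.0], [1.2, 1.1], [3.0, 3.0],
              [3.1, 2.9], [6.0, 6.0], [6.2, 6.1]])

# linkage repeatedly merges the two closest clusters and records the merge sequence
Z = linkage(X, method="average")   # "average" = group-average inter-cluster similarity
dendrogram(Z)                      # visualize the sequence of merges
plt.show()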
How to Define Inter-Cluster Similarity
[Figure: two clusters of points p1 … p5 and their proximity matrix; which entries define the similarity between the clusters?]
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Ward's Method uses squared error
Hierarchical Clustering: Comparison
[Figure: clusterings of the same six points produced by MIN, MAX, Group Average, and Ward's Method]
Other Issues
• Data Cleaning
• Data Sampling
• Dimension Reduction
• Data Visualization
• Overfitting and Underfitting Problems
• Imbalance Issues