MACHINE LEARNING (INTEGRATED)
(21ISE62)
Dr. Shivashankar
Professor
Department of Information Science & Engineering
GLOBAL ACADEMY OF TECHNOLOGY-Bengaluru
GLOBAL ACADEMY OF TECHNOLOGY
Ideal Homes Township, Rajarajeshwari Nagar, Bengaluru – 560 098
Department of Information Science & Engineering
Course Outcomes
After Completion of the course, student will be able to:
 Illustrate Regression Techniques and Decision Tree Learning
Algorithm.
 Apply SVM, ANN and KNN algorithm to solve appropriate problems.
 Apply Bayesian Techniques and derive effective learning rules.
 Illustrate performance of AI and ML algorithms using evaluation
techniques.
 Understand reinforcement learning and its application in real world
problems.
Text Book:
1. Tom M. Mitchell, Machine Learning, McGraw Hill Education, India Edition 2013.
2. Ethem Alpaydın, Introduction to Machine Learning, MIT Press, Second Edition.
3. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining,
Pearson, First Impression, 2014.
MODULE-2
SUPPORT VECTOR MACHINE
• Support Vector Machine (SVM) is one of the most popular supervised learning algorithms. It is used for both classification and regression, and it applies statistical learning theory to maximize predictive accuracy while automatically avoiding over-fitting to the data.
• SVM can be defined as a system that uses a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm.
• The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data
point in the correct category in the future.
• This best decision boundary is called a hyperplane.
• SVM became famous when, using pixel maps as input, it gave accuracy comparable to the best classifiers.
• The modern SVM was developed by Vladimir Vapnik and colleagues; its foundations date back to the 1960s, and the soft-margin and kernel formulations were introduced in the 1990s.
• The SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
• The SVM algorithm finds the data points from both classes that lie closest to the decision boundary.
• These points are called support vectors.
• The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
• The hyperplane with maximum margin is called the optimal hyperplane.
Cont…
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM:
Linear SVM: used for linearly separable data. If a dataset can be separated into two classes by a single straight line, the data is termed linearly separable and the classifier used is called a Linear SVM classifier.
Non-linear SVM: used for non-linearly separable data. If a dataset cannot be separated by a straight line, the data is termed non-linear and the classifier used is called a Non-linear SVM classifier.
Fig. 2.1: Concept of the SVM technique
Fig. 3: Examples of bad decision boundaries
Linearly Separable Case
If a dataset can be classified into two classes by a single straight line, the data is termed linearly separable, the classifier used is a Linear SVM classifier, and the classification problem is a binary (two-class) classification problem.
Binary classification can be viewed as the task of separating classes in feature space:
Hyperplane: f(x) = wᵀx + b
where
– w : weight vector
– x : input vector
– b : bias or offset value
Fig 2.2: Linearly separable classification
Cont..
Define the hyperplanes H such that:
w·xi + b ≥ +1 when yi = +1
w·xi + b ≤ −1 when yi = −1
H1 and H2 are the margins:
H1: w•xi+b = +1
H2: w•xi+b = –1
The points on the margins H1 and H2 are the tips of the Support Vectors.
The plane H0 is the median in between, where w•xi+b =0
d+ = the shortest distance to the closest positive point.
d- = the shortest distance to the closest negative point.
The margin (gutter) of a separating hyperplane is d+ + d–.
Maximizing the margin
We want a classifier with as large a margin as possible.
Recall that the distance from a point (x0, y0) to the line Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A² + B²).
The distance from H1 (or H2) to the median plane H0 is |w·x + b| / ||w|| = 1 / ||w||, so the distance between H1 and H2 is 2 / ||w||.
In order to maximize the margin, we therefore need to minimize ||w||, with the condition that there are no data points between H1 and H2:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ −1 when yi = −1
These can be combined into yi(xi·w + b) ≥ 1.
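As a quick numerical illustration (a minimal scikit-learn sketch on made-up, linearly separable data; a very large C is used so the soft-margin solver approximates the hard margin described here), w and b can be read off a fitted linear SVM and the margin computed as 2/||w||:

import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D data: class -1 on the left, class +1 on the right.
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0], [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C: behaves like a hard-margin SVM
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)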
Constrained optimization problem
• The problem of finding the optimal hyperplane is an optimization problem
and can be solved by optimization techniques.
• It can be solved by the Lagrange multiplier method, which leads to the weight vector
w = Σ_{i=1}^{m} αi yi xi
where αi is the Lagrange multiplier (one αi is needed for each constraint), the xi are the support vectors and the yi are their class labels (±1).
Cont…
Problems:
1. Draw the hyperplane for the given data points (1,1), (2,1), (1,−1), (2,−1), (4,0), (5,1), (5,−1), (6,0) using SVM and classify the new data point (2,−2).
Solution:
1. Plot the graph.
Select the support vectors: S1 = (2, 1), S2 = (2, −1), S3 = (4, 0).
S1, S2 and S3 are the support vectors because they are the data points closest to the boundary region between the two groups (around x = 3).
2. For the vector representation we add a bias entry to every support vector. Assuming bias = 1, the augmented support vectors become:
S̄1 = (2, 1, 1), S̄2 = (2, −1, 1), S̄3 = (4, 0, 1)
Cont…
3. Consider one group of support vectors as positive and the other as negative. Here, S1 and S2 are negative and S3 is positive.
4. Our objective is to find the optimal hyperplane, i.e. the values of w and b in f(x) = w·x + b = 0.
5. To find the optimal hyperplane we use the Lagrange multiplier (α) method. According to the Lagrange formulation,
w = Σ_{i=1}^{m} αi yi S̄i
where the S̄i are the (augmented) support vectors S̄1, S̄2 and S̄3 and the yi are their labels.
Substituting the support vectors into the margin conditions gives:
α1 S̄1·S̄1 + α2 S̄1·S̄2 + α3 S̄1·S̄3 = −1
α1 S̄2·S̄1 + α2 S̄2·S̄2 + α3 S̄2·S̄3 = −1
α1 S̄3·S̄1 + α2 S̄3·S̄2 + α3 S̄3·S̄3 = +1
Cont…
Let us substitute the values of S̄1, S̄2 and S̄3:
α1 (2,1,1)·(2,1,1) + α2 (2,1,1)·(2,−1,1) + α3 (2,1,1)·(4,0,1) = −1
α1 (2,−1,1)·(2,1,1) + α2 (2,−1,1)·(2,−1,1) + α3 (2,−1,1)·(4,0,1) = −1
α1 (4,0,1)·(2,1,1) + α2 (4,0,1)·(2,−1,1) + α3 (4,0,1)·(4,0,1) = +1
Therefore,
6α1 + 4α2 + 9α3 = −1
4α1 + 6α2 + 9α3 = −1
9α1 + 9α2 + 17α3 = +1
After solving the above equations, we get α1 = −3.25, α2 = −3.25, α3 = 3.5.
Cont…
Now let us find w:
w = Σ αi S̄i = −3.25 (2, 1, 1) − 3.25 (2, −1, 1) + 3.5 (4, 0, 1) = (1, 0, −3)
Therefore the hyperplane equation is f(x) = w·x + b, with w = (1, 0) and offset (bias) b = −3.
6. Plot the hyperplane.
Cont…
Since b = −3 and w = (1, 0), the hyperplane is the vertical line x = 3 (drawn 3 units along the positive x-axis, parallel to the y-axis).
Now let us classify the new data point (2, −2). We know that
w·x + b ≥ 0  →  belongs to class +1
w·x + b < 0  →  belongs to class −1
Substituting the values: y = w·x + b = (1, 0)·(2, −2) − 3 = 2 + 0 − 3 = −1
Therefore the new data point (2, −2) belongs to class −1.
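The worked result above can be checked numerically (a minimal scikit-learn sketch; a very large C approximates the hard-margin solution, so the learned w and b should come out close to (1, 0) and −3):

import numpy as np
from sklearn.svm import SVC

# Data points from Problem 1: the left group is class -1, the right group is class +1.
X = np.array([[1, 1], [2, 1], [1, -1], [2, -1], [4, 0], [5, 1], [5, -1], [6, 0]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print("w =", clf.coef_[0], " b =", clf.intercept_[0])   # expected roughly w = (1, 0), b = -3
print("class of (2,-2):", clf.predict([[2, -2]])[0])    # expected -1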
Cont…
Problem 2:
Draw the hyperplane using SVM for the positively labelled data points (3,1), (3,−1), (5,1), (5,−1) and the negatively labelled data points (1,0), (0,1), (0,−1), (−1,0).
Solution:
Select the support vectors: S1 = (1, 0), S2 = (3, 1), S3 = (3, −1).
Each vector is augmented with a bias entry of 1, so the augmented support vectors become:
S̄1 = (1, 0, 1), S̄2 = (3, 1, 1), S̄3 = (3, −1, 1)
Setting up and solving the Lagrange equations (S1 is negative, S2 and S3 are positive) gives
α1 = −3.5, α2 = 0.75, α3 = 0.75
w = −3.5 (1, 0, 1) + 0.75 (3, 1, 1) + 0.75 (3, −1, 1) = (1, 0, −2)
So w = (1, 0) and the offset or bias b = −2.
Non-Linear SVM or Nonlinear Separable Case
• If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line.
• So to separate these data points, we need to add one more dimension. For linear
data, we have used two dimensions x and y, so for non-linear data, we will add a
third dimension z. It can be calculated as:
z = x² + y² -------(1)
• We must use a nonlinear SVM, i.e. we need to map the data from one feature space to another. For the nonlinearly separable case, a mapping such as the following can be used:
Φ1(x1, x2) = (4 − x2 + |x1 − x2|,  4 − x1 + |x1 − x2|)   if x1² + x2² > 2
Φ1(x1, x2) = (x1, x2)   otherwise
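A minimal Python sketch of this idea (assuming the mapping Φ1 written above; the data are the positively and negatively labelled points of the worked example that follows, and a linear SVM is fitted in the mapped feature space):

import numpy as np
from sklearn.svm import SVC

def phi(p):
    # Feature map from the slide: points far from the origin are remapped.
    x1, x2 = p
    if x1**2 + x2**2 > 2:
        return [4 - x2 + abs(x1 - x2), 4 - x1 + abs(x1 - x2)]
    return [x1, x2]

pos = [(2, 2), (2, -2), (-2, -2), (-2, 2)]      # positively labelled points
neg = [(1, 1), (1, -1), (-1, -1), (-1, 1)]      # negatively labelled points
X = np.array([phi(p) for p in pos + neg])
y = np.array([1] * 4 + [-1] * 4)

clf = SVC(kernel="linear", C=1e6).fit(X, y)     # linear SVM in the mapped space
print("w =", clf.coef_[0], " b =", clf.intercept_[0])   # expected roughly w = (1, 1), b = -3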
Fig. 11: Nonlinear data points.  Fig. 12: After adding the 3rd axis.  Fig. 13: Best hyperplane for the nonlinear SVM after adding the 3rd axis.
Conti…
Problem 1: Draw the hyperplane using a nonlinear SVM for the positively labelled data points (2,2), (2,−2), (−2,−2), (−2,2) and the negatively labelled data points (1,1), (1,−1), (−1,−1), (−1,1).
Solution:
1. Plot the graph
2. Nonlinear separable case:
• From the plotted graph, no separating hyperplane exists in the input space.
• We must use a nonlinear SVM, i.e. we need to map the data from one feature space to another. For the nonlinearly separable case, the following mapping is used:
Φ1(x1, x2) = (4 − x2 + |x1 − x2|,  4 − x1 + |x1 − x2|)   if x1² + x2² > 2
Φ1(x1, x2) = (x1, x2)   otherwise
Conti…
By applying the nonlinear mapping, convert the given data points into the new feature space.
So the positive examples (2,2), (2,−2), (−2,−2), (−2,2) map to (2,2), (10,6), (6,6), (6,10),
and the negative examples (1,1), (1,−1), (−1,−1), (−1,1) map to (1,1), (1,−1), (−1,−1), (−1,1) (unchanged, since x1² + x2² ≤ 2 for these points).
3. Now plot the graph for the obtained new data points.
We can now easily identify the support vectors: S1 = (1, 1), S2 = (2, 2).
Each vector is augmented with 1 as a bias entry:
S̄1 = (1, 1, 1) and S̄2 = (2, 2, 1)
Conti..
According to the Lagrange formulation,
w = Σ_{i=1}^{m} αi yi S̄i
where the S̄i are the augmented support vectors S̄1 and S̄2. Substituting the support vectors into the margin conditions:
α1 S̄1·S̄1 + α2 S̄1·S̄2 = −1
α1 S̄1·S̄2 + α2 S̄2·S̄2 = +1
After substituting S̄1 and S̄2 and simplifying:
3α1 + 5α2 = −1
5α1 + 9α2 = +1
Therefore α1 = −7 and α2 = 4.
w = Σ αi S̄i = −7 (1, 1, 1) + 4 (2, 2, 1) = (1, 1, −3)
Therefore the hyperplane is y = w·x + b with w = (1, 1) and bias b = −3.
Support Vector Machine Terminology
Hyperplane: the decision boundary, chosen so that the margin between the closest points of different classes is as large as possible. In the case of linear classification it is the linear equation w·x + b = 0.
Support Vectors: the data points closest to the hyperplane, which play a critical role in deciding the hyperplane and margin.
Margin: the distance between the support vectors and the hyperplane. The main objective of the SVM algorithm is to maximize the margin; a wider margin indicates better classification performance.
Kernel: is the mathematical function, which is used in SVM to map the original input
data points into high-dimensional feature spaces. Some of the common kernel
functions are linear, polynomial and radial basis function(RBF).
Hard Margin: Also called as the maximum-margin hyperplane is a hyperplane that
properly separates the data points of different categories without any
misclassifications.
Soft Margin: When the data is not perfectly separable or contains outliers, SVM
permits a soft margin technique. It discovers a compromise between increasing the
margin and reducing violations.
Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect
classifications or margin violations.
How Does Support Vector Machine Algorithm Work?
• The best way to understand the SVM algorithm is through the SVM classifier.
• The hyperplane is chosen on the basis of the margin: the hyperplane providing the maximum margin between the two classes is selected.
• These margins are calculated using data points known as support vectors. Support vectors are the data points that lie nearest to the hyperplane and help in positioning it.
Cont…
If the functioning of the SVM classifier is to be understood mathematically, it can be described in the following steps:
Step 1: The SVM algorithm predicts the classes. One of the classes is identified as +1 while the other is identified as −1.
Step 2: As with most machine learning algorithms, the business problem is converted into a mathematical equation involving unknowns. These unknowns are then found by converting the problem into an optimization problem.
Step 3: A loss function (also called a cost function) is defined whose value is 0 when no class is incorrectly predicted; otherwise the error/loss is calculated.
Step 4: As with most optimization problems, the weights are optimized by calculating gradients using calculus, in particular partial derivatives.
Step 5: When there is no classification error, the gradients are updated using only the regularization parameter; when misclassification happens, the loss function is used as well.
Important Concepts in SVM
• Support vectors are the data points on whose basis the margins are calculated and maximized.
• The number of support vectors, or the strength of their influence, is one of the hyper-parameters.
Fig. 2: Presents Support vectors, margin and Classes
Cont…
Hard Margin:
• Hard Margin refers to that kind of decision boundary that makes sure that all
the data points are classified correctly.
• While this means the SVM classifier makes no training errors, it can also cause the margins to shrink, defeating the purpose of maximizing the margin.
Soft Margin:
• Soft Margin SVM introduces flexibility by allowing some margin violations
(misclassifications) to handle cases where the data is not perfectly separable.
SVM Implementation in Python
In Python, an SVM classifier can be developed using the sklearn library.
Step 1: Load the important libraries
>> import pandas as pd
>> import numpy as np
>> import sklearn
>> from sklearn import svm
>> from sklearn.model_selection import train_test_split
>> from sklearn import metrics
Step 2: Import dataset and extract the X variables and Y separately.
>> df = pd.read_csv('mydataset.csv')
>> X = df.loc[:, ['Var_X1', 'Var_X2', 'Var_X3', 'Var_X4']]
>> Y = df['Var_Y']
Step 3: Divide the dataset into train and test
>> X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3,
random_state=123)
Step 4: Initializing the SVM classifier model
>> svm_clf = svm.SVC(kernel='linear')
Cont…
Step 5: Fitting the SVM classifier model
>> svm_clf.fit(X_train, y_train)
Step 6: Coming up with predictions
>> y_pred_test = svm_clf.predict(X_test)
Step 7: Evaluating model’s performance
>> metrics.accuracy_score(y_test, y_pred_test)
>> metrics.precision_score(y_test, y_pred_test)
>> metrics.recall_score(y_test, y_pred_test)
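Putting the steps together, here is a minimal self-contained variant that can be run as-is; it substitutes scikit-learn's built-in iris dataset for the hypothetical mydataset.csv and uses macro averaging because iris has three classes:

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = datasets.load_iris(return_X_y=True)       # stand-in for mydataset.csv

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

svm_clf = svm.SVC(kernel='linear')
svm_clf.fit(X_train, y_train)

y_pred_test = svm_clf.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred_test))
print("precision:", precision_score(y_test, y_pred_test, average='macro'))
print("recall   :", recall_score(y_test, y_pred_test, average='macro'))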
Advantages & Disadvantages of SVM
Advantages
• It is one of the most accurate machine learning algorithms.
• It is a dynamic algorithm and can solve a range of problems, including linear and
non-linear problems, binary, binomial, and multi-class classification problems,
along with regression problems.
• SVM uses the concept of margins and tries to maximize the differentiation
between two classes; it reduces the chances of model overfitting, making the
model highly stable.
• SVM is known for its computation speed and memory management. It uses less memory, especially when compared to the deep learning algorithms with which it often competes.
Disadvantages:
• While SVM is fast and can work in high dimensions, it is outperformed by Naïve Bayes, which provides faster predictions in high dimensions. SVM also takes a relatively long time during the training phase.
• Compared to other linear algorithms such as Linear Regression, SVM is not highly
interpretable, especially when using kernels that make SVM non-linear. Thus, it
isn’t easy to assess how the independent variables affect the target variable.
Cont…
Applications of SVM:
• Text categorization
• Semantic role labeling (predicate, agent, ..)
• Image classification
• Image segmentation
• Hand-written recognition
Characteristics of SVM
• Based on supervised learning methods
• Using for classification or regression analysis
• A non-probabilistic binary linear classifier
• Representation of the examples as points in space
• Examples of the separate categories are divided by a clear gap that is as
wide as possible.
• New examples are then mapped into that same space and predicted to
belong to a category based on the side of the gap on which they fall
• Performing linear classification.
K-Nearest Neighbour
• The k-Nearest Neighbors (KNN) algorithm is a non-parametric, supervised learning classifier,
which uses proximity to make classifications or predictions about the grouping of an individual
data point.
• It is one of the popular and simplest classification and regression classifiers used in machine
learning today.
• The nearest neighbors of an instance are defined in terms of the standard Euclidean Distance.
More precisely, let an arbitrary instance x be described by the feature vector (a1(x), a2(x), ..., an(x)).
The distance between two instances xi and xj is defined to be d(xi, xj), where
d(xi, xj) = sqrt( Σ_{r=1}^{n} ( ar(xi) − ar(xj) )² )
The distance-weighted k-NN real-valued target function can be defined as:
f(x) = ( Σ_{i=1}^{k} wi f(xi) ) / ( Σ_{i=1}^{k} wi ),  where wi = 1 / d(xq, xi)²
Fig 2.1: K-NN example
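A minimal NumPy sketch of these two formulas, i.e. the Euclidean distance and the distance-weighted real-valued prediction (the small dataset here is made up purely for illustration):

import numpy as np

def euclidean(a, b):
    # Standard Euclidean distance between two feature vectors.
    return np.sqrt(np.sum((a - b) ** 2))

def knn_predict(X, y, xq, k=3):
    # Distance-weighted k-NN prediction of a real-valued target.
    # (Assumes xq does not coincide exactly with a training point.)
    d = np.array([euclidean(x, xq) for x in X])
    idx = np.argsort(d)[:k]                 # indices of the k nearest neighbours
    w = 1.0 / d[idx] ** 2                   # weights w_i = 1 / d(xq, xi)^2
    return np.sum(w * y[idx]) / np.sum(w)   # weighted average of the neighbour targets

# Made-up 2-D instances with a real-valued target.
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
y = np.array([1.0, 1.2, 2.0, 2.2])
print(knn_predict(X, y, np.array([1.5, 1.0]), k=3))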
Cont…
Euclidean distance between A and B = sqrt( (X2 − X1)² + (Y2 − Y1)² )
K-Nearest Neighbor (KNN) Algorithm for Machine Learning
• K-NN is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
• K-NN algorithm assumes the similarity between the new case/data and
available cases and put the new case into the category that is most similar to
the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears it can be easily classified into a well-suited category using the K-NN algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets
new data, then it classifies that data into a category that is much similar to the
new data.
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
Fig. 2.13: K-NN for the best classifier
Cont…
• Firstly, we will choose the number of neighbors, so we will choose the k value.
• Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. It can be calculated as:
• Euc-dist[(x1, y1); (x2, y2)] = sqrt( (x2 − x1)² + (y2 − y1)² )
• By calculating the Euclidean distance we got the nearest neighbors, as three
nearest neighbors in category A and two nearest neighbors in category B.
Consider the below graph:
• As we can see the 3 nearest neighbors are from category A, hence this new
data point must belong to category A.
Why do we need a K-NN Algorithm?
• Suppose there are two categories, Category A and Category B, and we have a new data point x1: in which of these categories does it lie?
• To solve this type of problem, we need a K-NN algorithm. With the help of K-
NN, we can easily identify the category or class of a particular dataset.
Fig. 11. Presents the importance of KNN
Advantages and Disadvantages of KNN Algorithm
Advantages:
• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.
Disadvantages:
• The value of K always needs to be determined, which may sometimes be complex.
• The computation cost is high because of calculating the distance
between the data points for all the training samples.
Applications of K-nearest Neighbor
1. Credit score
The KNN algorithm compares an individual's credit rating to others with comparable characteristics to help
calculate their credit rating.
2. Approval of the loan
The k-nearest neighbor technique, similar to credit scoring, is useful in detecting people who are more likely
to default on loans by comparing their attributes to those of similar people.
3. Preprocessing of data
Many missing values can be found in datasets. Missing data imputation is a procedure that uses the KNN
algorithm to estimate missing values.
4. Healthcare:
KNN has also had application within the healthcare industry, making predictions on the risk of heart attacks
and prostate cancer. The algorithm works by calculating the most likely gene expressions..
5. Prediction of stock prices
The KNN algorithm is useful in estimating the future value of stocks based on previous data since it has a
knack for anticipating the prices of unknown entities.
6. Recommendation systems
KNN can be used in recommendation systems since it can help locate people with comparable traits. It can
be used in an online video streaming platform, for example, to propose content that a user is more likely to
view based on what other users watch.
7. Computer Vision
For picture classification, the KNN algorithm is used. It's important in a variety of computer vision
applications since it can group comparable data points together, such as cats and dogs in separate classes.
8. Easy to implement:
Given the algorithm’s simplicity and accuracy, it is one of the first classifiers that a new data scientist will
learn.
Conti..
Problem 1: From the given dataset, determine whether (x, y) = (170, 57) belongs to the Underweight or Normal class. Assume K = 3.
Solution:
Find the Euclidean distance d = sqrt( (x2 − x1)² + (y2 − y1)² ):
d1 = sqrt( (170 − 167)² + (57 − 51)² ) = sqrt(3² + 6²) = sqrt(45) = 6.70
d2 = sqrt( (182 − 170)² + (62 − 57)² ) = sqrt(12² + 5²) = sqrt(169) = 13
and so on.
Height (cm) Weight (kg) Class
167 51 Underweight
182 62 Normal
176 69 Normal
173 64 Normal
172 65 Normal
174 56 Underweight
169 58 Normal
173 57 Normal
170 55 Normal
170 57 ?
Conti..
Since K = 3, take the 3 smallest distances (ranks 1 to 3):
• (169, 58) at distance 1.414: Normal
• (170, 55) at distance 2: Normal
• (173, 57) at distance 3: Normal
All 3 nearest neighbours are Normal, so (170, 57) belongs to the Normal class.
Height (cm) Weight (kg) Class Distance
167 51 Underweight 6.7
182 62 Normal 13
176 69 Normal 13.4
173 64 Normal 7.6
172 65 Normal 8.2
174 56 Underweight 4.1
169 58 Normal 1.414-1(R)
173 57 Normal 3-3(R)
170 55 Normal 2-2(R)
170 57 Normal 3
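The same answer can be reproduced with scikit-learn's KNeighborsClassifier (a minimal sketch, encoding Underweight as 0 and Normal as 1):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Height/weight training data from Problem 1 (0 = Underweight, 1 = Normal).
X = np.array([[167, 51], [182, 62], [176, 69], [173, 64], [172, 65],
              [174, 56], [169, 58], [173, 57], [170, 55]])
y = np.array([0, 1, 1, 1, 1, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # plain (unweighted) 3-NN
knn.fit(X, y)
print(knn.predict([[170, 57]]))             # expected: [1], i.e. Normal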
Conti..
Problem 2: From the given dataset, determine whether (x, y) = (157, 54) belongs to the Medium or Longer class. Assume K = 3.
Solution:
Find the Euclidean distance d = sqrt( (x2 − x1)² + (y2 − y1)² )
Sl. No. Height Weight Target
1 150 50 Medium
2 155 55 Medium
3 160 60 Longer
4 161 59 Longer
5 158 65 Longer
6 157 54 ?
Sl. No. Height Weight Target Distance
1 150 50 Medium 8.06
2 155 55 Medium 2.24 (1)
3 160 60 Longer 6.71(3)
4 161 59 Longer 6.40(2)
5 158 65 Longer 11.05
6 157 54 ?
Conti..
From the table, with K = 3 we take the 3 smallest distances: 2.24 (Medium), 6.40 (Longer) and 6.71 (Longer).
The (unweighted) k-NN discrete-valued target function is
f(xq) = argmax_{v ∈ V} Σ_{i=1}^{k} δ(v, f(xi)),  where δ(a, b) = 1 if a = b and δ(a, b) = 0 if a ≠ b.
Comparing Medium with the neighbours 2.24 (M), 6.40 (L) and 6.71 (L): δ(M,M) + δ(M,L) + δ(M,L) = 1 + 0 + 0 = 1
Comparing Longer with the neighbours: δ(L,M) + δ(L,L) + δ(L,L) = 0 + 1 + 1 = 2
Since Longer receives 2 votes, (157, 54) belongs to the Longer class.
Note, however, that the single closest neighbour (distance 2.24) is Medium, so a distance-weighted vote can favour Medium, as shown next.
Distance weighted NN:
1. Discrete valued target function
2. Real valued target function
Conti..
Discrete-valued target function (distance-weighted):
f(xq) = argmax_{v ∈ V} Σ_{i=1}^{k} wi δ(v, f(xi)),  where wi = 1 / d(xq, xi)²
The weights of the three neighbours are 1/2.24² = 0.199, 1/6.71² = 0.022 and 1/6.40² = 0.024.
W.r.t. Medium:  f(xq) = 0.199·δ(m,m) + 0.022·δ(m,l) + 0.024·δ(m,l) = 0.199·1 + 0.022·0 + 0.024·0 = 0.199
W.r.t. Longer:  f(xq) = 0.199·δ(l,m) + 0.022·δ(l,l) + 0.024·δ(l,l) = 0.199·0 + 0.022·1 + 0.024·1 = 0.046
Since 0.199 > 0.046, the new instance is classified as Medium.
Sl. No.  Height  Weight  Target   Distance    1/distance²
1        150     50      Medium   8.06
2        155     55      Medium   2.24 (1)    0.199
3        160     60      Longer   6.71 (3)    0.022
4        161     59      Longer   6.40 (2)    0.024
5        158     65      Longer   11.05
6        157     54      Medium
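A small Python sketch of this distance-weighted vote (1/d² weights, exactly as in the formula above):

# Three nearest neighbours of (157, 54): (distance, class label).
neighbours = [(2.24, "Medium"), (6.71, "Longer"), (6.40, "Longer")]

votes = {}
for d, label in neighbours:
    votes[label] = votes.get(label, 0.0) + 1.0 / d**2   # w_i = 1 / d^2

print(votes)                       # roughly {'Medium': 0.199, 'Longer': 0.046}
print(max(votes, key=votes.get))   # 'Medium'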
Conti..
Real-valued target function (distance-weighted):
f(x) = ( Σ_{i=1}^{k} wi f(xi) ) / ( Σ_{i=1}^{k} wi ),  where wi = 1 / d(xq, xi)²
Assume the neighbours carry the (illustrative, randomly chosen) real-valued targets 1.2, 1.8 and 2.1:
f(xq) = (0.199·1.2 + 0.022·1.8 + 0.024·2.1) / (0.199 + 0.022 + 0.024) ≈ 0.329 / 0.245 ≈ 1.34

Sl. No.  Height  Weight  Target   Distance    1/distance²
1        150     50      1.5      8.06
2        155     55      1.2      2.24 (1)    0.199
3        160     60      1.8      6.71 (3)    0.022
4        161     59      2.1      6.40 (2)    0.024
5        158     65      1.7      11.05
6        157     54      ≈1.34
Conti..
Problem 3: Build a centroid classifier for the given data and, for the test instance (6, 5), predict its class.
Solution:
• Step1: Compute the mean/centroid of each class.
• There are 2 classes, A & B.
• Centroid of class A=(3+5+4,1+2+3)/3=(12,6)/3=(4,2)
• Centroid of class B=(7+6+8,6+7+5)/3=(21,18)/3=(7,6)
• Step 2: calculate the Euclidean distance between test instance (6,5) and each of the
centroid.
X Y Class
3 1 A
5 2 A
4 3 A
7 6 B
6 7 B
8 5 B
Conti..
Euc-dist[(x1, y1); (x2, y2)] = sqrt( (x2 − x1)² + (y2 − y1)² )
Class A: d[(6,5); (4,2)] = sqrt( (4 − 6)² + (2 − 5)² ) = 3.6
Class B: d[(6,5); (7,6)] = sqrt( (7 − 6)² + (6 − 5)² ) = 1.414
The test instance has the smaller distance to class B; hence its class is predicted as B.
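scikit-learn's NearestCentroid implements exactly this rule; a minimal sketch with the data above:

import numpy as np
from sklearn.neighbors import NearestCentroid

X = np.array([[3, 1], [5, 2], [4, 3], [7, 6], [6, 7], [8, 5]])
y = np.array(["A", "A", "A", "B", "B", "B"])

clf = NearestCentroid()          # classifies by Euclidean distance to each class centroid
clf.fit(X, y)
print(clf.centroids_)            # expected: [[4, 2], [7, 6]]
print(clf.predict([[6, 5]]))     # expected: ['B']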
Problem 4. Given the following training instances in the table, each having two attributes
(x1 and x2). Compute the class label for test instance 𝑡1 = 3,7 , using 3 nearest neighbors
(k=3).
Training
Instances
𝑥1 𝑥2 Output
𝐼1 7 7 0
𝐼2 7 4 0
𝐼3 3 4 1
𝐼4 1 4 1
Conti..
Euclidean distance d = sqrt( (x1 − 3)² + (x2 − 7)² ), with neighbour ranks:
d1 = sqrt( (7 − 3)² + (7 − 7)² ) = 4    (rank 3)
d2 = sqrt( (7 − 3)² + (4 − 7)² ) = 5    (rank 4)
d3 = sqrt( (3 − 3)² + (4 − 7)² ) = 3    (rank 1)
d4 = sqrt( (1 − 3)² + (4 − 7)² ) = 3.6  (rank 2)
For K = 3 we consider I3 (rank 1), I4 (rank 2) and I1 (rank 3), whose outputs are 1, 1 and 0.
The majority output is 1, so for the test instance t1 = (3, 7) the predicted class is 1. Using distance-weighted votes (1/d², table below), the highest single vote (0.11) also belongs to output 1.
d     d²      Vote = 1/d²       Rank
4     16      1/16 = 0.06       3
5     25      1/25 = 0.04       4
3     9       1/9 = 0.11        1
3.6   12.96   1/12.96 = 0.08    2
Conti..
Problem 5: Apply the KNN classifier to predict diabetes from the given features BMI and Age. Assume K = 3. Test example: BMI = 43.6, Age = 40, Sugar = ?
BMI Age Sugar
33.6 50 1
26.6 30 0
23.4 40 0
43.1 67 0
35.3 23 1
35.9 67 1
36.7 45 1
25.7 46 0
23.3 29 0
31 56 1
Conti..
Solution:
First calculate the distance between the test instance and each training instance. Test example: BMI = 43.6, Age = 40, Sugar = ?
Euc-dist d = sqrt( (x2 − x1)² + (y2 − y1)² );  d1 = sqrt( (43.6 − 33.6)² + (40 − 50)² ) = 14.14, and so on.
Therefore, for the test example BMI = 43.6, Age = 40: the three nearest neighbours (ranks 1 to 3) have Sugar = 1, 1 and 0, so by majority vote Sugar = 1.
BMI Age Sugar Distance to new Rank
33.6 50 1 14.14 2
26.6 30 0 19.72 5
23.4 40 0 20.20 6
43.1 67 0 27.00 9
35.3 23 1 18.92 4
35.9 67 1 28.08 10
36.7 45 1* 8.52 1
25.7 46 0 18.88 3
23.3 29 0 23.09 8
31 56 1 20.37 7
Cont…
Problem 6: given the training data, predict the class of the following new examples using KNN for K=5,
age<=30, income = medium, student=yes, credit rating=fair.
Age      Income   Student   Credit rating   Buys computers
<=30 High No Fair No
<=30 High No Excellent No
30..40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31..40 Low Yes Excellent Yes
<=30 Medium No Fair no
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31..40 Medium No Excellent Yes
31..40 High Yes Fair Yes
>40 Medium no Excellent No
Cont…
Solution:
• For the similarity measure, use the attribute-value match:
similarity = ( Σ_{i=1}^{4} wi · ∂(ai, bi) ) / 4,  where ∂(ai, bi) = 1 if ai = bi and 0 otherwise.
• Here ai and bi are the values of age, income, student or credit rating; the weights are all 1 except for income, which is 2.
• The new example is: age <= 30, income = medium, student = yes, credit rating = fair.
• For RID = 1 (class = No), age and credit rating match the new example while income and student do not, so the distance (similarity) to the new example is (1·1 + 2·0 + 1·0 + 1·1)/4 = 0.5.
Cont…
Age      Income   Student   Credit rating   Buys computers   RID   Class   Distance
<=30 High No Fair No 1 No 0.5
<=30 High No Excellent No 2 No 0.25
30..40 High No Fair Yes 3 Yes 0.25
>40 Medium No Fair Yes* 4 Yes 0.75
>40 Low Yes Fair Yes 5 Yes 0.5
>40 Low Yes Excellent No 6 No 0.25
31..40 Low Yes Excellent Yes 7 Yes 0.25
<=30 Medium No Fair No 8 No 1
<=30 Low Yes Fair Yes* 9 Yes 0.75
>40 Medium Yes Fair Yes* 10 Yes 1
<=30 Medium Yes Excellent Yes* 11 Yes 1
31..40 Medium No Excellent Yes 12 Yes 0.5
31..40 High Yes Fair Yes 13 Yes 0.5
>40 Medium no Excellent No 14 No 0.5
Cont…
• Therefore, among the five nearest neighbors (RID and similarity values: 4: 0.75, 8: 1, 9: 0.75, 10: 1, 11: 1), four are from class Yes and one is from class No.
• Hence, the KNN classifier predicts buys computers = Yes.
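A small Python sketch of this weighted attribute-match similarity and the K = 5 majority vote (the rows are transcribed from the table above; the income weight of 2 is the assumption stated in the solution):

# (age, income, student, credit rating, buys computers) for the 14 training rows.
data = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),       (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),  ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),      (">40", "medium", "no", "excellent", "no"),
]
weights = [1, 2, 1, 1]                                   # income counts double
query = ("<=30", "medium", "yes", "fair")

def similarity(row):
    return sum(w * (a == b) for w, a, b in zip(weights, row[:4], query)) / 4

top5 = sorted(data, key=similarity, reverse=True)[:5]    # the 5 most similar rows
votes = [row[4] for row in top5]
print(votes, "->", max(set(votes), key=votes.count))     # expected: buys computers = 'yes'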
Clustering K-means
• The task of grouping data points based on their similarity with each other is
called Clustering or Cluster Analysis.
• This method is defined under the branch of Unsupervised Learning, which
aims at gaining insights from unlabelled data point
• Cluster analysis divides the data into groups (clusters) that are meaningful,
useful, or both.
• For instance, clustering can be regarded as a form of classification in that it
creates a labeling of objects with class (cluster) labels.
K-means
• K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science.
• K-means clustering assigns data points to one of K clusters depending on their distance from the centers of the clusters.
• It starts by randomly placing the cluster centroids in the space.
• Then each data point is assigned to one of the clusters based on its distance from the cluster centroids.
• After assigning each point to one of the clusters, new cluster centroids are computed.
• This process runs iteratively until it finds good clusters.
• Here, K defines the number of pre-defined clusters that need to be created in the process,
as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
• Hence, each cluster has data points with some commonalities/similarities, and it is away
from other clusters.
The Basic K-means Algorithm
• First, we randomly initialize k points, called means or cluster centroids.
• We categorize each item to its closest mean, and we update the mean’s coordinates,
which are the averages of the items categorized in that cluster so far.
• We repeat the process for a given number of iterations and at the end, we have our
clusters.
Basic K-means algorithm
Step-1: Select the number K to decide how many clusters are to be formed.
Step-2: Select K random points as centroids (they need not come from the input dataset).
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, go to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
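A minimal NumPy implementation of these steps (random initial centroids, assign, recompute, repeat until nothing changes; it assumes no cluster ever becomes empty):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Basic K-means: returns (centroids, labels).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # Step 2: random initial centroids
    for _ in range(n_iter):
        # Step 3/5: assign each point to its closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # Step 6: stop when nothing changes
            break
        centroids = new_centroids
    return centroids, labels

# Example: the 1-D dataset used in Problem 1 below, reshaped into a column.
S = np.array([[2], [3], [4], [10], [11], [12], [20], [25], [30]], dtype=float)
print(kmeans(S, k=2))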
Strengths and Weaknesses
Strength
• K-means is simple and can be used for a wide variety of data types.
• It is also quite efficient, even though multiple runs are often performed.
• This algorithm is very easy to understand and implement.
• This algorithm is efficient, Robust, and Flexible
• If the clusters in the data are distinct and spherical, it gives the best results.
Weaknesses
• This algorithm needs prior specification for the number of cluster centers that is the
value of K.
• It cannot handle outliers and noisy data, as the centroids get deflected
• It does not work well with a very large set of datasets as it takes huge computational
time.
Cont…
Problem 1: Divide the given sample data into two clusters using the K-means algorithm: S = {2, 3, 4, 10, 11, 12, 20, 25, 30}. Given K = 2, identify the cluster to which the new data point 15 belongs.
Solution:
1. Choose 2 random clusters from the given data sets C1=4, C2=12.
2. Find the distance between given samples and centroids, put the sample in the nearest
cluster.
3. Repeat the same for all data points.
Cluster K1 = {2, 3, 4}   (|2−4| = 2, |3−4| = 1, |4−4| = 0, |10−4| = 6, ...;  |2−12| = 10, |3−12| = 9, |4−12| = 8, |10−12| = 2, ...)
so 2, 3 and 4 are placed in cluster 1 because their distance to C1 = 4 is smallest, and
Cluster K2 = {10, 11, 12, 20, 25, 30}.
4. Compute new centroids
K1 = {2, 3, 4}: C1 = (2 + 3 + 4)/3 = 3
K2 = {10, 11, 12, 20, 25, 30}: C2 = (10 + 11 + 12 + 20 + 25 + 30)/6 = 18
So C1 = 3 and C2 = 18.
Cont…
5. Find the new clustering with C1 = 3 and C2 = 18:
K1 = {2, 3, 4, 10}, C1 = (2 + 3 + 4 + 10)/4 = 4.75
K2 = {11, 12, 20, 25, 30}, C2 = (11 + 12 + 20 + 25 + 30)/5 = 19.6
6. Find the new clustering with C1 = 4.75 and C2 = 19.6:
K1 = {2, 3, 4, 10, 11, 12}, C1 = (2 + 3 + 4 + 10 + 11 + 12)/6 = 7
K2 = {20, 25, 30}, C2 = (20 + 25 + 30)/3 = 25
7. Find the new clustering with C1 = 7 and C2 = 25:
K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 25, 30}
Since the clusters and centroid values remain the same, the given dataset is divided into the two clusters
K1 = {2, 3, 4, 10, 11, 12} and K2 = {20, 25, 30}, with centroids C1 = 7 and C2 = 25.
8. Identify the cluster for the new data point 15:
Distance between 15 and C1: |15 − 7| = 8
Distance between 15 and C2: |15 − 25| = 10
Since the distance to C1 is smaller, the new data point 15 belongs to the cluster of C1 (= 7).
Cont…
Problem 2: Divide the following data points into two clusters using K-mean and identify (5,4) belongs
to which cluster.
Solution:
Step 1: Choosing randomly 2 clusters centers
C1=(2,1) and C2=(2,3)
Step 2: Finding distance between two clusters centers and each data point (Apply Euclidean distance)
For the data point (1,1) and C1(2,1): d = sqrt( (1 − 2)² + (1 − 1)² ) = 1
(2,1) and (2,1): d = sqrt( (2 − 2)² + (1 − 1)² ) = 0
(2,3) and (2,1): d = sqrt( (2 − 2)² + (3 − 1)² ) = 2, and so on.
X 1 2 2 3 4 5
Y 1 1 3 2 3 5
Data points Distance from C1
(2,1)
Distance from C2(2,3) New clusters
(1,1) 1 2.24 C1
(2,1) 0 2 C1
(2,3) 2 0 C2
(3,2) 1.41 1.41 C1
(4,3) 2.83 2 C2
(5,5) 5 3.61 C2
Cont…
Step 3: cluster 1 of C1 = {(1,1), (2,1), (3,2)}
cluster 2 of C2 = {(2,3), (4,3), (5,5)}
Step 4: Recalculate the cluster centers
C1 = (1/3)[(1,1) + (2,1) + (3,2)] = (1/3)(6, 4) = (2, 1.33)
C2 = (1/3)[(2,3) + (4,3) + (5,5)] = (1/3)(11, 11) = (3.67, 3.67)
Step 5: Repeat the step 2 until we get same cluster center or same cluster elements
Data points   Distance from C1(2,1.33)   Distance from C2(3.67,3.67)   New cluster
(1,1) 1.05 3.78 C1
(2,1) 0.33 3.15 C1
(2,3) 1.67 1.8 C1
(3,2) 1.204 1.8 C1
(4,3) 2.605 0.75 C2
(5,5) 4.74 1.88 C2
Cont…
cluster 1 of C1 = {(1,1), (2,1), (2,3), (3,2)}
cluster 2 of C2 = {(4,3), (5,5)}
Step 6: Recalculate the cluster centers
C1 = (1/4)[(1,1) + (2,1) + (2,3) + (3,2)] = (1/4)(8, 7) = (2, 1.75)
C2 = (1/2)[(4,3) + (5,5)] = (1/2)(9, 8) = (4.5, 4)
Step 7: Repeat the step 2 until we get same cluster center or same cluster elements
Step 8: cluster 1 of C1 = {(1,1), (2,1), (2,3), (3,2)}; cluster 2 of C2 = {(4,3), (5,5)}.
Since the cluster elements are the same as in the previous iteration, stop.
For the new point (5,4): its distance to C1 = (2, 1.75) is about 3.75 and to C2 = (4.5, 4) is 0.5, so (5,4) belongs to cluster C2.
Data points Distance from C1(2,1.75) Distance from C2(4.5,4) New clusters
(1,1) 1.25 4.61 C1
(2,1) 0.75 3.9 C1
(2,3) 1.25 2.69 C1
(3,2) 1.03 2.5 C1
(4,3) 2.36 1.12 C2
(5,5) 4.42 1.12 C2
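The same clustering can be reproduced with scikit-learn (a minimal sketch; note that KMeans may report the two clusters in either order):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [2, 3], [3, 2], [4, 3], [5, 5]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)     # roughly (2, 1.75) and (4.5, 4), in some order
print(km.labels_)              # cluster membership of each training point
print(km.predict([[5, 4]]))    # (5, 4) falls in the cluster whose centre is (4.5, 4)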
Cont…
Problem 3 Use K-means clustering to cluster the following data into two groups. Data
points {2,4,10,12,3,20,30,11,25}, initial cluster centroids are M1=4 and M2=11.
Solution: Initial centroids: M1=4, M2=11.
The distance is calculated as d(x2, x1) = |x2 − x1|.
Therefore, C1={2,4,3}
M1=(2+4+3)/3=3
C2={10,12,20,30,11,25}
M2=(10+12+20+30+11+25)/6=18
so new centroids: M1=3, M2=18
Data points   Distance to M1(4)   Distance to M2(11)   Cluster
2 2 9 C1
4 0 7 C1
10 6 1 C2
12 8 1 C2
3 1 8 C1
20 16 9 C2
30 26 19 C2
11 7 0 C2
25 21 14 C2
Cont…
Current centroids: M1 = 3, M2 = 18
Therefore, C1 = {2, 4, 10, 3}, M1 = (2 + 4 + 10 + 3)/4 = 4.75
C2 = {12, 20, 30, 11, 25}, M2 = (12 + 20 + 30 + 11 + 25)/5 = 19.6
So the new centroids are M1 = 4.75, M2 = 19.6.
Data points   Distance to M1   Distance to M2   Cluster   New cluster
2 1 16 C1 C1
4 1 14 C1 C1
10 7 8 C2 C1
12 9 6 C2 C2
3 0 15 C1 C1
20 17 2 C2 C2
30 27 12 C2 C2
11 8 7 C2 C2
25 22 7 C2 C2
Cont…
Current centroids: M1=4.75, M2=19.6
Therefore, C1={2,4,10,11,12,3}
C2={20,30,25}
So,
New centroids: M1=7
M2=25
Data points   Distance to M1   Distance to M2   Cluster   New cluster
2 2.75 17.6 C1 C1
4 0.75 15.6 C1 C1
10 5.25 9.6 C1 C1
12 7.25 7.6 C2 C1
3 1.75 16.6 C1 C1
20 15.25 0.4 C2 C2
30 25.25 10.4 C2 C2
11 6.25 8.6 C2 C1
25 20.25 5.4 C2 C2
Cont…
Current centroids: M1=7, M2=25
Therefore, the final clusters are
• C1 = {2, 4, 10, 11, 12, 3}
• C2 = {20, 30, 25}
Data points   Distance to M1   Distance to M2   Cluster   New cluster
2 5 23 C1 C1
4 3 21 C1 C1
10 3 15 C1 C1
12 5 13 C1 C1
3 4 22 C1 C1
20 13 5 C2 C2
30 23 5 C2 C2
11 4 14 C1 C1
25 18 0 C2 C2
Cont…
Problem 4: Suppose the data-mining task is to cluster the following points into 3 clusters: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). Suppose we initially assign A1, B1 and C1 as the centers of the three clusters respectively.
Solution: Initial centroids: A1 = (2, 10), B1 = (5, 8), C1 = (1, 2)
Distances are calculated by d(P1, P2) = sqrt( (x2 − x1)² + (y2 − y1)² ).
Therefore, cluster 1 = {(2,10)}, cluster 2 = {(8,4), (5,8), (7,5), (6,4), (4,9)} and cluster 3 = {(2,5), (1,2)}.
So the new centroids are A1 = (2, 10), B1 = (6, 6) and C1 = (1.5, 3.5).
Point   x   y    d to (2,10)   d to (5,8)   d to (1,2)   Cluster
A1 2 10 0 3.61 8.06 1
A2 2 5 5 4.24 3.16 3
A3 8 4 8.49 5 7.28 2
B1 5 8 3.61 0 7.21 2
B2 7 5 7.07 3.61 6.71 2
B3 6 4 7.21 4.12 5.39 2
C1 1 2 8 7.21 0 3
C2 4 9 2.24 1.41 7.62 2
Cont…
Current centroids: A1 = (2, 10), B1 = (6, 6), C1 = (1.5, 3.5)
Therefore, cluster 1 = {(2,10), (4,9)}, cluster 2 = {(8,4), (5,8), (7,5), (6,4)} and cluster 3 = {(2,5), (1,2)}.
So the new centroids are A1 = (3, 9.5), B1 = (6.5, 5.25) and C1 = (1.5, 3.5).
Point   x   y    d to (2,10)   d to (6,6)   d to (1.5,3.5)   Cluster   New cluster
A1 2 10 0 5.66 6.52 1 1
A2 2 5 5 4.12 1.58 3 3
A3 8 4 8.49 2.83 6.52 2 2
B1 5 8 3.61 2.24 5.7 2 2
B2 7 5 7.07 1.41 5.7 2 2
B3 6 4 7.21 2.00 4.53 2 2
C1 1 2 8.06 6.46 1.58 3 3
C2 4 9 2.24 3.61 6.04 2 1
Cont…
Current centroids: A1 = (3, 9.5), B1 = (6.5, 5.25), C1 = (1.5, 3.5)
From the table below, cluster 1 becomes {(2,10), (5,8), (4,9)}, so the new centroids are:
A1 = (3.67, 9), B1 = (7, 4.33) and C1 = (1.5, 3.5)
Point   x   y    d to (3,9.5)   d to (6.5,5.25)   d to (1.5,3.5)   Cluster   New cluster
A1 2 10 1.12 6.54 6.52 1 1
A2 2 5 4.61 4.51 1.58 3 3
A3 8 4 7.43 1.95 6.52 2 2
B1 5 8 2.5 3.13 5.7 2 1
B2 7 5 6.02 0.56 5.7 2 2
B3 6 4 6.26 1.35 4.53 2 2
C1 1 2 7.76 6.39 1.58 3 3
C2 4 9 1.12 4.51 6.04 1 1
Cont…
Current centroids: A1 = (3.67, 9), B1 = (7, 4.33), C1 = (1.5, 3.5)
Since no point changes cluster, the final clusters are:
Cluster 1 = {(2,10), (5,8), (4,9)}
Cluster 2 = {(8,4), (7,5), (6,4)}
Cluster 3 = {(2,5), (1,2)}
Point   x   y    d to (3.67,9)   d to (7,4.33)   d to (1.5,3.5)   Cluster   New cluster
A1 2 10 1.94 7.56 6.52 1 1
A2 2 5 4.33 5.04 1.58 3 3
A3 8 4 6.62 1.05 6.52 2 2
B1 5 8 1.67 4.18 5.70 1 1
B2 7 5 5.21 0.67 5.70 2 2
B3 6 4 5.52 1.05 4.53 2 2
C1 1 2 7.49 6.44 1.58 3 3
C2 4 9 0.33 5.55 6.04 1 1
Hierarchical Clustering
• Hierarchical clustering is another unsupervised machine learning algorithm,
which is used to group the unlabeled datasets into a cluster.
• It is a connectivity-based clustering model that groups the data points
together that are close to each other based on the measure of similarity or
distance.
• The assumption is that data points that are close to each other are more
similar or related than data points that are farther apart.
• It is based on the idea of creating a hierarchy of clusters, where each cluster
is made up of smaller clusters that can be further divided into even smaller
clusters.
• This hierarchical structure makes it easy to visualize the data and identify
patterns within the data.
Hierarchical clustering is of two types.
Agglomerative clustering
Divisive clustering
Agglomerative Clustering
• Agglomerative clustering is a type of data clustering method used in
unsupervised learning.
• It begins with N groups, each containing initially one entity, and then the two
most similar groups merge at each stage until there is a single group
containing all the data.
• It is an iterative process that groups similar objects into clusters based on
some measure of similarity.
• It uses a bottom-up approach for dividing data points into clusters.
• The algorithm begins by assigning each object to its own cluster.
• It then uses a distance metric to determine the similarity between objects and
clusters.
• If two clusters have similar elements, they are merged together into a larger
cluster.
• This continues until all objects are grouped into one final cluster.
Agglomerative Hierarchical Clustering Algorithm
• Step 1: Consider each dataset as a single cluster and calculate the distance of
one cluster from all the other clusters.
• Step 2: In the second step, comparable clusters are merged together to form
a single cluster. Let’s say cluster (B) and cluster (C) are very similar to each
other, therefore we merge them in the second step similarly to cluster (D)
and (E) and at last, we get the clusters [(A), (BC), (DE), (F)]
• Step 3: We recalculate the proximity according to the algorithm and merge
the two nearest clusters([(DE), (F)]) together to form new clusters as [(A),
(BC), (DEF)]
• Step 4: Repeating the same process, the clusters DEF and BC are comparable and are merged together to form a new cluster. We are now left with the clusters [(A), (BCDEF)].
• Step 5: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
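These merge steps can be carried out with SciPy (a minimal sketch on a small made-up one-dimensional dataset, similar to the exercises later in this module; linkage performs the successive merges and a dendrogram of the tree can also be drawn from Z):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster  # dendrogram is also available here

# Small made-up 1-D dataset.
X = np.array([[18], [22], [25], [27], [42], [43]], dtype=float)

Z = linkage(X, method="single", metric="euclidean")   # bottom-up merges, nearest pair first
print(Z)                                              # each row: the two clusters merged and their distance
print(fcluster(Z, t=5, criterion="distance"))         # flat clusters obtained by cutting at distance 5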
Cont…
The average linkage clustering uses the average formula, i.e. the distance between two clusters A and B is
d(A, B) = avg{ d(x, y) : x ∈ A, y ∈ B } = ( Σ d(x, y), x ∈ A, y ∈ B ) / ( |A| · |B| )
Fig.9. Concept of Agglomerative Clustering
Key Issues in Hierarchical Clustering
Lack of a Global Objective Function:
• Agglomerative hierarchical clustering techniques use various criteria to decide locally,
at each step, which clusters should be merged (or split for divisive approaches). This
approach yields clustering algorithms that avoid the difficulty of attempting to solve a
hard combinatorial optimization problem.
• Do not have problems with local minima or difficulties in choosing initial points.
Ability to Handle Different Cluster Sizes:
• There are two approaches: weighted, which treats all clusters equally, and unweighted,
which takes the number of points in each cluster into account.
• Treating clusters of unequal size equally gives different weights to the points in
different clusters, while taking the cluster size into account gives points in different
clusters the same weight.
Merging Decisions are Final:
• Agglomerative hierarchical clustering algorithms tend to make good local decisions about combining two clusters since they can use information about the pairwise similarity of all points. However, once a decision is made to merge two clusters, it cannot be undone later.
• This approach prevents a local optimization criterion from becoming a global optimization criterion.
Advantage and disadvantages of Agglomerative Hierarchical
Clustering Algorithm
Advantages
1. Performance: it adapts to the observed shape of the data and returns accurate results.
2. Easy: It is easy to use and provides better user guidance with good community support. So much
content and good documentation are available for a better user experience.
3. More Approaches: Two approaches are there using which datasets can be trained and tested,
agglomerative and divisive.
4. Performance on Small Datasets: The hierarchical clustering algorithms are effective on small
datasets and return accurate and reliable results with lower training and testing time.
Disadvantages
1. Time Complexity: As many iterations and calculations are associated, the time complexity of
hierarchical clustering is high. In some cases, it is one of the main reasons for preferring K-Means
clustering.
2. Space Complexity: As many calculations of errors with losses are associated with every epoch, the
space complexity of the algorithm is very high. Due to this, while implementing the hierarchical
clustering, the space of the model is considered. In such cases, we prefer K-Means clustering.
3. Poor performance on Large Datasets: When training a hierarchical clustering algorithm for large
datasets, the training process takes so much time with space which results in poor performance of the
algorithms.
Exercise problems
Problem 1: Consider the following set of 6 one-dimensional data points: 18, 22, 25, 27, 42, 43. Merge the clusters using the minimum (single-link) distance and update the proximity matrix accordingly. Show the proximity matrix at each iteration.
Solution:
Since the minimum distance is 1, between 42 and 43, merge 42 and 43.
In the updated matrix the minimum distance is 2, so merge 25 and 27.
18 22 25 27 42 43
18 0 4 7 9 24 25
22 4 0 3 5 20 21
25 7 3 0 2 17 18
27 9 5 2 0 15 16
42 24 20 17 15 0 1
43 25 21 18 16 1 0
18 22 25 27 42,43
18 0 4 7 9 24
22 4 0 3 5 20
25 7 3 0 2 17
27 9 5 2 0 15
42,43 24 20 17 15 0
Exercise problems
Since the minimum distance is now 3, merge 22 with (25, 27): {22, (25, 27)}.
Since the minimum distance is then 4, merge 18 with {22, (25, 27)}: [18, {22, (25, 27)}].
Draw the dendrogram for the merged data points.
18 22 25,27 42,43
18 0 4 7 24
22 4 0 3 20
25,27 7 3 0 15
42,43 24 20 15 0
18 22,25,27 42,43
18 0 4 24
22,25,27 4 0 15
42,43 24 15 0
Problems
Problem 2: For the given dataset, find the clusters using a single link technique. Use
Euclidean distance and draw the dendrogram.
Solution:
Step 1: Compute the distance matrix using Euclidean distance.
Let A(x1, y1) and B(x2, y2). Then the Euclidean distance between the two points is
d(A, B) = sqrt( (x2 − x1)² + (y2 − y1)² )
Sample No X Y
P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30
Conti..
d(P1, P2) = sqrt( (0.22 − 0.40)² + (0.38 − 0.53)² ) = 0.23
d(P1, P3) = sqrt( (0.35 − 0.40)² + (0.32 − 0.53)² ) = 0.22
d(P2, P3) = sqrt( (0.35 − 0.22)² + (0.32 − 0.38)² ) = 0.14, and so on
Step 2: Merge the two closest members.
Here the minimum value is 0.10, and hence we combine P3 and P6 (0.10 appears in the P6 row and P3 column).
Now form the cluster corresponding to this minimum value and update the distance matrix.
P1 P2 P3 P4 P5 P6
P1 0
P2 0.23 0
P3 0.22 0.14 0
P4 0.37 0.19 0.13 0
P5 0.34 0.14 0.28 0.23 0
P6 0.24 0.24 0.10 0.22 0.39 0
Conti..
Cluster formed: (P3, P6).
Merge the two closest members of the clusters: the minimum value is now 0.13, hence we combine (P3, P6) with P4:
{(P3, P6), P4}
P1 P2 P3 P4 P5 P6
P1 0
P2 0.23 0
P3 0.22 0.14 0
P4 0.37 0.19 0.13 0
P5 0.34 0.14 0.28 0.23 0
P6 0.24 0.24 0.10 0.22 0.39 0
P1 P2 P3,P6 P4 P5
P1 0
P2 0.23 0
P3,P6 0.22 0.14 0
P4 0.37 0.19 0.13 0
P5 0.34 0.14 0.28 0.23 0
P1 P2 P3,P6,P4 P5
P1 0
P2 0.23 0
P3,P6,P4 0.22 0.14 0
P5 0.34 0.14 0.28 0
Conti..
Now combined P2 and P5
[{(P3, P6), P4},(P2,P5)]
Now update the matrix and merge P2,P5,P3,P6 and P4
([{(P3, P6), P4},(P2,P5)], P1)
Now we have reached to the solution.
7/19/2024 80
Dr. Shivashankar, ISE, GAT
P1 P2 P3,P6,P4 P5
P1 0
P2 0.23 0
P3,P6,P4 0.22 0.14 0
P5 0.34 0.14 0.28 0
P1 P2,P5 P3,P6,P4
P1 0
P2,P5 0.23 0
P3,P6,P4 0.22 0.14 0
P1 P2,P5,P3,P6,P
4
P1 0
P2,P5,P3,P6,P4 0.22 0
Conti
The dendrogram for this solution has its leaves in the order P3, P6, P4, P2, P5, P1.
Dendrogram of the clusters formed for the group P1, P2, P3, P4, P5 and P6.
Conti..
Problem 3: Given the one-dimensional dataset {1, 5, 8, 10, 2}, use the agglomerative clustering algorithm with complete link and Euclidean distance to establish a hierarchical grouping. Using a cutting threshold of 5, how many clusters are there, and what is the membership of each group?
Solution:
Euclidean distance = sqrt( (x2 − x1)² + (y2 − y1)² ); for one-dimensional data, Euc-dist = sqrt( (x2 − x1)² ) = |x2 − x1|.
Apply the 1-D Euclidean distance to calculate the matrix (below, the second matrix relabels the data values 1, 5, 8, 10, 2 as points 1 to 5).
1 5 8 10 2
1 0 4 7 9 1
5 4 0 3 5 3
8 7 3 0 2 6
10 9 5 2 0 8
2 1 3 6 8 0
1 2 3 4 5
1 0 4 7 9 1
2 4 0 3 5 3
3 7 3 0 2 6
4 9 5 2 0 8
5 1 3 6 8 0
Conti..
From the distance matrix, the smallest distance is 1, between points 1 and 5 (the data values 1 and 2). So merge {1, 5}.
Now recalculate the distances (complete link uses the maximum):
d(2, {1,5}) = max{d(2,1), d(2,5)} = max(4, 3) = 4
d(3, {1,5}) = max{d(3,1), d(3,5)} = max(7, 6) = 7
d(4, {1,5}) = max{d(4,1), d(4,5)} = max(9, 8) = 9
From the matrix, the distance between points 3 and 4 is now the smallest, i.e. 2, hence they merge to form the cluster {3, 4}.
Using the complete link, the distances between the remaining points/clusters are:
d({1,5}, {3,4}) = max{d({1,5},3), d({1,5},4)} = max(7, 9) = 9
d(2, {3,4}) = max{d(2,3), d(2,4)} = max(3, 5) = 5
Thus we can update the distance matrix, where row 2 corresponds to point 2 and rows 1 and 3 correspond to the clusters {1,5} and {3,4}, as follows.
1,5 2 3 4
1,5 0 4 7 9
2 4 0 3 5
3 7 3 0 2
4 9 5 2 0
1,5 2 3,4
1,5 0 4 9
2 4 0 5
3,4 9 5 0
Conti..
Following the same procedure, the smallest remaining distance is 4, so we merge point 2 with the cluster {1, 5} to form {1, 2, 5} and update the distance matrix as follows.
With the cutting threshold of 5, the final merge at distance 9 is not performed, so there are two clusters: {1, 5, 2} (data values 1, 2, 5) and {3, 4} (data values 8, 10). Only after increasing the distance threshold to 9 would all clusters merge.
Fig 12: Dendrogram for the given dataset
[1,5],2 [3,4]
[1,5],2 0 9
[3,4] 9 0
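This complete-link hierarchy and the threshold-5 cut can be checked with SciPy (a minimal sketch):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1], [5], [8], [10], [2]], dtype=float)   # the 1-D dataset of this problem

Z = linkage(X, method="complete", metric="euclidean")   # complete (maximum) linkage
print(Z)                                                # merge heights: 1, 2, 4, 9
print(fcluster(Z, t=5, criterion="distance"))           # cutting at 5 gives the two clusters {1, 5, 2} and {8, 10}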
Conti..
Problem 4: Given the dataset {a, b, c, d, e} and the following distance matrix, construct a dendrogram by average-linkage hierarchical clustering using the agglomerative method.
The average linkage clustering uses the average formula, i.e. the distance between two clusters A and B is
d(A, B) = avg{ d(x, y) : x ∈ A, y ∈ B } = ( Σ d(x, y), x ∈ A, y ∈ B ) / ( |A| · |B| )
a b c d e
a 0 9 3 6 11
b 9 0 7 5 10
c 3 7 0 9 2
d 6 5 9 0 8
e 11 10 2 8 0
Conti..
Dataset: {a, b, c, d, e}
Initial clustering (singleton sets): C1 = {a}, {b}, {c}, {d}, {e}
From the table, the minimum distance is between the clusters {c} and {e}: d({c}, {e}) = 2.
We merge {c} and {e} to form the cluster {c, e}.
The new set of clusters is C2 = {a}, {b}, {d}, {c, e}.
a b c d e
a 0 9 3 6 11
b 9 0 7 5 10
c 3 7 0 9 2
d 6 5 9 0 8
e 11 10 2 8 0
a b c,e d
a 0 9 ? 6
b 9 0 ? 5
c,e ? ? 0 ?
d 6 5 ? 0
Conti..
Let us compute the distance of {c, e} from the other clusters:
d({c,e}, {a}) = avg{d(c,a), d(e,a)} = (3 + 11)/2 = 7
d({c,e}, {b}) = avg{d(c,b), d(e,b)} = (7 + 10)/2 = 8.5
d({c,e}, {d}) = avg{d(c,d), d(e,d)} = (9 + 8)/2 = 8.5
Now update the table.
From the C2 table, the minimum distance is between the clusters {b} and {d}: d({b}, {d}) = 5.
We merge {b} and {d} to form the cluster {b, d}.
The new set of clusters is C3: {a}, {c, e}, {b, d}.
a b c,e d
a 0 9 7 6
b 9 0 8.5 5
c,e 7 8.5 0 8.5
d 6 5 8.5 0
Conti..
Let us compute the distance of {b, d} from the other clusters:
d({b,d}, {a}) = avg{d(b,a), d(d,a)} = (9 + 6)/2 = 7.5
d({b,d}, {c,e}) = avg{d(b,c), d(b,e), d(d,c), d(d,e)} = (7 + 10 + 9 + 8)/(2 · 2) = 8.5
a b c,e d
a 0 9 7 6
b 9 0 8.5 5
c,e 7 8.5 0 8.5
d 6 5 8.5 0
a b,d c,e
A 0 ? 7
b,d ? 0 ?
c,e 7 ? 0
a b,d c,e
a 0 7.5 7
b,d 7.5 0 8.5
c,e 7 8.5 0
Conti..
From the table, the minimum distance is between the clusters {a} and {c, e}: d({a}, {c,e}) = 7.
We merge {a} and {c, e} to form the cluster {a, c, e}.
The new set of clusters is C4: {a, c, e}, {b, d}.
Let us compute the distance of {a, c, e} from the other cluster:
d({a,c,e}, {b,d}) = avg{d(a,b), d(a,d), d(c,b), d(c,d), d(e,b), d(e,d)} = (9 + 6 + 7 + 9 + 10 + 8)/(3 · 2) ≈ 8.16
Fig 11: Dendrogram for the dataset {a, b, c, d, e}.
a,c,e b,d
a,c,e 0 ?
b,d ? 0
a,c,e b,d
a,c,e 0 8.16
b,d 8.16 0
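SciPy can reproduce this average-linkage hierarchy directly from the distance matrix (a minimal sketch; squareform converts the square matrix into the condensed form that linkage expects):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Distance matrix for the points a, b, c, d, e (in that order).
D = np.array([[0, 9, 3, 6, 11],
              [9, 0, 7, 5, 10],
              [3, 7, 0, 9, 2],
              [6, 5, 9, 0, 8],
              [11, 10, 2, 8, 0]], dtype=float)

Z = linkage(squareform(D), method="average")   # average linkage (UPGMA)
print(Z)   # merges at heights 2 ({c,e}), 5 ({b,d}), 7 ({a,c,e}), about 8.17 (final)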
Divisive Clustering
• Divisive clustering is also a type of hierarchical clustering that is used to create
clusters of data points.
• It is an unsupervised learning algorithm that begins by placing all the data
points in a single cluster and then progressively splits the clusters until each
data point is in its own cluster.
• It is useful for analyzing datasets that may have complex structures or
patterns, as it can help identify clusters that may not be obvious at first
glance.
• Divisive clustering works by first assigning all the data points to one cluster.
• Then, it looks for ways to split this cluster into two or more smaller clusters.
• This process continues until each data point is in its own cluster.
Cont…
Steps to Divisive Hierarchical Clustering
The algorithm for divisive hierarchical clustering involves several steps.
Step 1: Consider all objects a part of one big cluster.
Step 2: Split the big cluster into smaller clusters using any flat-clustering method, e.g. k-means.
Step 3: Selects an object or subgroup to split into two smaller sub-clusters based on some
distance metric such as Euclidean distance or correlation coefficients.
Step 4: The process continues recursively until each object forms its own cluster.
Fig. 12: Concept of Divisive Hierarchical
Clustering
Cont…
Fig.13. Presents the differences between Agglomerative and Divisive
algorithms.
Conti..
1. k-NN algorithm does more computation on test time rather than train time.
A)TRUE
B) FALSE
2. Which of the following distance metric can not be used in k-NN?
A) Manhattan
B) Minkowski
C) Tanimoto
D) Jaccard
E) Mahalanobis
F) All can be used
3) Which of the following option is true about k-NN algorithm?
A) It can be used for classification
B) It can be used for regression
C) It can be used in both classification and regression
4) Which of the following machine learning algorithm can be used for imputing missing values of both categorical and continuous
variables?
A) K-NN
B) Linear Regression
C) Logistic Regression
5) Which of the following will be Euclidean Distance between the two data point A(1,3) and B(2,3)?
A) 1
B) 2
C) 4
D) 8
A. K-Means Clustering comes under
1.Supervised learning Algorithm
2. Unsupervised Learning Algorithm
3. Reinforcement Learning
4. None of the above
B. Which of the following is true for clustering
1. Clustering is a technique used to group similar objects into clusters.
2. partition data into groups
3. dividing entire data, based on patterns in data
4. All of the above
C. Which of the following is true for K-Means Clustering
1. All data points in a cluster should be similar to each other.
2. The data points from different clusters should be as different as possible.
3. Both 1 and 2
4. Only 1
5. Only 2
D. Which of the following applications comes under clustering
1. Customer Segmentation
2. Targeted Marketing
3. Recommendation Engines
4. Predicting the temperature
5. Only 1,2,3,4
6. All the above
E. What is intra cluster distance
1. distance between points in the cluster to its centroid
2. distance between each point in the cluster
3. sum of squares of distances between points
4. None of the above
7/19/2024 94
Dr. Shivashankar, ISE, GAT
Conti..
Q1. Movie recommendation systems are an example of:
1. Classification
2. Clustering
3. Reinforcement Learning
4. Regression
Options:
A. 2 Only
B. 1 and 2
C. 1 and 3
D. 2 and 3
E. 1, 2, and 3
F. 1, 2, 3, and 4
Q2. Sentiment Analysis is an example of:
1. Regression
2. Classification
3. Clustering
4. Reinforcement Learning
Options:
A. 1 Only
B. 1 and 2
C. 1 and 3
D. 1, 2 and 3
E. 1, 2 and 4
F. 1, 2, 3 and 4
7/19/2024 95
Dr. Shivashankar, ISE, GAT
Conti..
Q3. Can decision trees be used for performing clustering?
A. True
B. False
Q4. What is the minimum no. of variables/ features required to perform clustering?
Options:
A. 0
B. 1
C. 2
D. 3
Q5. For two runs of K-Mean clustering, is it expected to get the same clustering results?
A. Yes
B. No
Q6. Which of the following clustering algorithms suffers from the problem of convergence at local optima?
1. K-Means clustering algorithm
2. Agglomerative clustering algorithm
3. Expectation-Maximization clustering algorithm
4. Diverse clustering algorithm
Options:
A. 1 only
B. 2 and 3
C. 2 and 4
D. 1 and 3
E. 1,2 and 4
F. All of the above
7/19/2024 96
Dr. Shivashankar, ISE, GAT
Machine Learning_SVM_KNN_K-MEANSModule 2.pdf

  • 1.
    MACHINE LEARNING (INTEGRATED) (21ISE62) Dr.Shivashankar Professor Department of Information Science & Engineering GLOBAL ACADEMY OF TECHNOLOGY-Bengaluru 7/19/2024 1 Dr. Shivashankar, ISE, GAT GLOBAL ACADEMY OF TECHNOLOGY Ideal Homes Township, Rajarajeshwari Nagar, Bengaluru – 560 098 Department of Information Science & Engineering
  • 2.
    Course Outcomes After Completionof the course, student will be able to:  Illustrate Regression Techniques and Decision Tree Learning Algorithm.  Apply SVM, ANN and KNN algorithm to solve appropriate problems.  Apply Bayesian Techniques and derive effective learning rules.  Illustrate performance of AI and ML algorithms using evaluation techniques.  Understand reinforcement learning and its application in real world problems. Text Book: 1. Tom M. Mitchell, Machine Learning, McGraw Hill Education, India Edition 2013. 2. EthemAlpaydın, Introduction to machine learning, MIT press, Second edition. 3. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Pearson, First Impression, 2014. 7/19/2024 2 Dr. Shivashankar, ISE, GAT
  • 3.
    MODULE-2 SUPPORT VECTOR MACHINE •Support Vector Machine called as SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression prediction tool that uses machine learning theory to maximize predictive accuracy while automatically avoiding over-fit to the data. • SVM can be defined as systems which use hypothesis space of a linear functions in a high dimensional feature space, trained with a learning algorithm. • The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. • This best decision boundary is called a hyperplane. • SVM becomes famous when, using pixel maps as input; it gives best accuracy. • SVM was developed by Vladimir Vapnik in the 1970s. • SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called as a hyperplane. • SVM algorithm finds the closest data points of the lines from both the classes. • These points are called support vectors. • The distance between the vectors and the hyperplane is called as margin and the goal of SVM is to maximize this margin. • The hyperplane with maximum margin is called the optimal hyperplane. 7/19/2024 3 Dr. Shivashankar, ISE, GAT
  • 4.
    Cont… SVM algorithm canbe used for Face detection, image classification, text categorization, etc. Types of SVM: Linear SVM: Used for linearly separable data, if a dataset can be classified into two classes by using a single straight line, then such data is termed as linearly separable data, and classifier is used called as Linear SVM classifier. Non-linear SVM: Used for non-linearly separated data, if a dataset cannot be classified by using a straight line, then such data is termed as non-linear data and classifier used is called as Non-linear SVM classifier. 7/19/2024 4 Dr. Shivashankar, ISE, GAT Fig. 2.1. Concept of SVM Technique
  • 5.
    Examples of BadDecision Boundaries Class 1 Class 2 Class 1 Class 2 Fig. 3: Examples of Bad Decision Boundaries
  • 6.
    Linearly Separable Case Ifa dataset can be classified into two classes by using a single straight line, then such data is termed as linearly separable data, and classifier is used called as Linear SVM classifier and classification problem is Binary classification or two class classification. Binary classification can be viewed as the task of separating classes in feature space: Hyperplane: Where 7/19/2024 6 Dr. Shivashankar, ISE, GAT f(x) = (wTx + b) – w : weight vector – x : input vector – b : bias or offset value Fig 2.2: Linearly Separable classification
  • 7.
    Cont.. Define the hyperplanesH such that w•xi+b ≥1, when yi =+1 w•xi+b < -1, when yi =–1 H1 and H2 are the margins: H1: w•xi+b = +1 H2: w•xi+b = –1 The points on the margins H1 and H2 are the tips of the Support Vectors. The plane H0 is the median in between, where w•xi+b =0 d+ = the shortest distance to the closest positive point. d- = the shortest distance to the closest negative point. The margin (gutter) of a separating hyperplane is d+ + d–. 7/19/2024 7 Dr. Shivashankar, ISE, GAT
  • 8.
    Maximizing the margin Wewant a classifier with as big margin as possible Recall the distance from a point (x0,y0) to a line: Ax+By+c = 0 is |A x0 +B y0 +c|/sqrt(A2+B2) The distance between H1 and H2 is: |w•x+b|/||w||=1/||w|| The distance between H1 and H2 is: 2/||w|| In order to maximize the margin, we need to minimize ||w||. With the condition that there are no datapoints between H1 and H2 : xi•w+b  +1 when yi =+1 xi•w+b  -1 when yi = -1 Can be combined into yi(xi•w)  1 7/19/2024 8 Dr. Shivashankar, ISE, GAT
  • 9.
    Constrained optimization problem •The problem of finding the optimal hyperplane is an optimization problem and can be solved by optimization techniques. • It can be solved by the Lagrangian Multipler method (αi), Which can be formulated as: 𝑤 = ෍ 𝑖=1 𝑚 𝛼𝑖𝑥𝑖𝑦𝑖 𝛼𝑖: the Lagrange multiplier, we need a Lagrange multiplier 𝛼i for each of the constraints 𝑥𝑖 𝑎𝑛𝑑 𝑦𝑖 are called as the support vectors. 7/19/2024 9 Dr. Shivashankar, ISE, GAT
  • 10.
    Cont… Problems: 1. Draw thehyperplane for the given data points (1,1) (2,1) (1,-1) (2,-1) (4,0) (5,1) (5,-1) (6,0) using SVM and classifying new data points (2,-2). Solution: 1. Plot the graph: 𝑆𝑒𝑙𝑒𝑐𝑡𝑠 𝑡ℎ𝑒 𝑣𝑒𝑐𝑡𝑜𝑟 𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑠: 𝑆1 = 2 1 𝑆2 = 2 −1 𝑆3 = 4 0 𝑆1 𝑆2 𝑆3--Support Vector because these are closest data points to the centroid (3-x-axis) 2. To provide vector representation, we need to add bias on all support vectors. Here we assume bias=1. So, our support vector now become: ҧ 𝑆1 2 1 1 ҧ 𝑆2 2 −1 1 ҧ 𝑆3 4 0 1 7/19/2024 10 Dr. Shivashankar, ISE, GAT
  • 11.
    Cont… 3. Consider onepart of the support vector as +ve and other as –ve. Here, 𝑆1 and 𝑆2 𝑎𝑟𝑒 − 𝑣𝑒 𝑎𝑛𝑑 𝑆3 𝑖𝑠 + 𝑣𝑒. 4. Our objective is to find an optimal hyperplane which means, we need to find the values of w and b of the optimal hyperplane. f 𝒙 = w.x +b=0 5. To find the optimal hyperplane, we use Lagrange (α) Multiplier method. Now let us complete w and b which determine the Optimal hyperplane. According to Lagrange equation, 𝑤 = ෍ 𝑖=1 𝑚 𝛼𝑖𝑥𝑖𝑦𝑖 Here, 𝑥𝑖 𝑎𝑛𝑑 𝑦𝑖 𝑎𝑟𝑒 𝑡ℎ𝑒 support vectors, 𝑆1 , 𝑆2 𝑎𝑛𝑑 𝑆3 Let us substitute support vectors in above equations ∝1 ഥ 𝑆1 ഥ 𝑆1+ ∝2 ഥ 𝑆1 𝑆2+∝3 ഥ 𝑆1 𝑆3 = −1 ∝1 𝑆2𝑆1+ ∝2 𝑆2 𝑆2+∝3 𝑆2 𝑆3 = −1 ∝1 𝑆3𝑆1+ ∝2 𝑆3 𝑆2+∝3 𝑆3 𝑆3 = 1 7/19/2024 11 Dr. Shivashankar, ISE, GAT
  • 12.
    Cont… Let us substitutevalues of 𝑆1 , 𝑆2 𝑎𝑛𝑑 𝑆3 ∝1 2 1 1 2 1 1 + ∝2 2 1 1 2 −1 1 + ∝3 2 1 1 4 0 1 = −1 ∝1 2 −1 1 2 1 1 + ∝2 2 −1 1 2 −1 1 + ∝3 2 −1 1 4 0 1 = −1 ∝1 4 0 1 2 1 1 + ∝2 4 0 1 2 −1 1 + ∝3 4 0 1 4 0 1 = 1 Therefore, 6 ∝1+4 ∝2+9 ∝3=-1 4 ∝1+6 ∝2+9 ∝3=-1 9 ∝1+9 ∝2+17 ∝3=1 After solving the above equations, we get ∝1=-3.25 ∝2= -3.25 ∝3= 3.5 7/19/2024 12 Dr. Shivashankar, ISE, GAT
  • 13.
    Cont… Now let usfind w, i.e. 𝒘 = ෍ ∝𝒊 ഥ 𝑺𝒊 𝑤 = −3.25 2 1 1 −3.25 2 −1 1 + 3.5 4 0 1 W= 1 0 −3 Therefore, hyperplane equation, f(x)=w.x+b So, w= 1 0 and offset or bias, b=-3 5. Plot hyperplane 7/19/2024 13 Dr. Shivashankar, ISE, GAT
  • 14.
    Cont… Since b=-3, ahyperplane is drawn +3 to the positive side and w is 1 0 , the hyperplane is drawn parallel to y – axis. Now let us clarify the new data points 2 −2 We know that w.x+b ≥ 𝟎 −− −𝒃𝒆𝒍𝒐𝒏𝒈𝒔 𝒕𝒐 𝒄𝒍𝒂𝒔𝒔 + 𝟏 w.x+b < 𝟎 −− −𝒃𝒆𝒍𝒐𝒏𝒈𝒔 𝒕𝒐 𝒄𝒍𝒂𝒔𝒔 − 𝟏 Let us substitute the values in the above equation Y= w.x+b Y= 1 0 2 −2 − 𝟑 Y=2-0-3 =-1 Therefore, new data point 2 −2 belongs to class -1 7/19/2024 14 Dr. Shivashankar, ISE, GAT
  • 15.
    Cont… Proble-2: Draw the hyperplanefor the given data points Positively labelled data points (3,1)(3,-1)(5,1)(5,-1) and Negatively labelled data points (1,0)(0,1)(0,-1)(-1,0) using SVM and classifying the solution. Solution: 𝑆𝑒𝑙𝑒𝑐𝑡𝑠 𝑡ℎ𝑒 𝑣𝑒𝑐𝑡𝑜𝑟 𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑠: 𝑆1 = 1 0 𝑆2 = 3 1 𝑆3 = 3 −1 Each vector is augmented with bias 1 So, 2. To provide vector representation, we need to add bias on all support vectors. Here we assume bias=1. So, our support vector now become: ҧ 𝑆1 1 0 1 ҧ 𝑆2 3 1 1 ҧ 𝑆3 3 −1 1 ∝1=-3.5 ∝2= 0.75 ∝3= 0.75 W= 1 0 −3 , So, w= 1 0 and offset or bias, b=-2 7/19/2024 15 Dr. Shivashankar, ISE, GAT
  • 16.
    Non-Linear SVM orNonlinear Separable Case • If data is linearly arranged, then we can separate it by using a straight line, but for non- linear data, we cannot draw a single straight line. • So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as: z=x2 +y2 -------(1) • We must use a nonlinear SVM (i.e. we need to convert data from one feature space to another). For nonlinear separable case: • Φ1 𝑥1 𝑥2 = 4 − 𝑥2 + 𝑥1 − 𝑥2 4 − 𝑥1 + 𝑥1 − 𝑥2 𝑖𝑓 𝑥1 2 + 𝑥2 2 > 2 𝑥1 𝑥2 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 7/19/2024 16 Dr. Shivashankar, ISE, GAT Fig. 11: Nonlinear data points Fig. 12: Added 3rd axis Fig. 11: After added 3rd axis, best hyperplane for nonlinear SVM
  • 17.
    Conti… Problem 1: Drawthe hyperplane for the given data points Positively labelled data points (2,2)(2,-2)(-2,-2)(-2,2) and Negatively labelled data points (1,1)(1,-1)(-1,-1)(-1,1) using nonlinear SVM and classifying the solution. Solution: 1. Plot the graph 2. Nonlinear separable case: • From the plotted graph, there is no hyperplane exists in the input space. • We must use a nonlinear SVM (i.e. we need to convert data from one feature space to another). For nonlinear separable case: • Φ1 𝑥1 𝑥2 = 4 − 𝑥2 + 𝑥1 − 𝑥2 4 − 𝑥1 + 𝑥1 − 𝑥2 𝑖𝑓 𝑥1 2 + 𝑥2 2 > 2 𝑥1 𝑥2 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 7/19/2024 17 Dr. Shivashankar, ISE, GAT
  • 18.
    Conti… By applying nonlinearequation, convert the given data pints into other features. So, positive examples are 𝟐 𝟐 , 𝟐 −𝟐 , −𝟐 −𝟐 , −𝟐 𝟐 ----- 𝟐 𝟐 , 𝟏𝟎 𝟔 , 𝟔 𝟔 , 𝟔 𝟏𝟎 And negative examples are 𝟏 𝟏 , 𝟏 −𝟏 , −𝟏 −𝟏 , −𝟏 𝟏 ----- 𝟏 𝟏 , 𝟏 −𝟏 , −𝟏 −𝟏 , −𝟏 𝟏 3. Now plot the graph for obtained new data points Now we can classify easily identify the Support vectors 𝑺𝟏 = 𝟏 𝟏 , 𝑺𝟐 = 𝟐 𝟐 Each vector is augmented with 1 as bias input ҧ 𝑆1 1 1 1 𝑎𝑛𝑑 ҧ 𝑆2 2 2 1 7/19/2024 18 Dr. Shivashankar, ISE, GAT
  • 19.
    Conti.. According to Lagrangeequation, 𝑤 = ෍ 𝑖=1 𝑚 𝛼𝑖𝑥𝑖𝑦𝑖 Here, 𝑥𝑖 𝑎𝑛𝑑 𝑦𝑖 𝑎𝑟𝑒 𝑡ℎ𝑒 support vectors, 𝑆1 𝑎𝑛𝑑 𝑆2 Let us substitute support vectors in above equations ∝1 ഥ 𝑆1 ഥ 𝑆1+ ∝2 ഥ 𝑆1 𝑆2 = −1 ∝1 ഥ 𝑆1 𝑆2+ ∝2 𝑆2 𝑆2 = 1 After substitute 𝑆1 and 𝑆2 values and simplified the above equations, 3∝1 +5 ∝2= −1 5∝1 +9 ∝2= −1 Therefore ,∝1= −7 𝑎𝑛𝑑 ∝2= 4 𝒘 = ෍ ∝𝒊 ഥ 𝑺𝒊 𝑤 = −7 1 1 1 +4 2 2 1 = 1 1 −3 . Therefore, hyperplane y=wx+b, with w= 𝟏 𝟏 and bias =-3 7/19/2024 19 Dr. Shivashankar, ISE, GAT
  • 20.
    Support Vector MachineTerminology Hyperplane: The hyperplane tries that the margin between the closest points of different classes should be as maximum as possible. In the case of linear classifications, it will be a linear equation i.e. wx+b = 0. Support Vectors: The closest data points to the hyperplane, which makes a critical role in deciding the hyperplane and margin. Margin: is the distance between the support vector and hyperplane. The main objective of the SVM algorithm is to maximize the margin. The wider margin indicates better classification performance. Kernel: is the mathematical function, which is used in SVM to map the original input data points into high-dimensional feature spaces. Some of the common kernel functions are linear, polynomial and radial basis function(RBF). Hard Margin: Also called as the maximum-margin hyperplane is a hyperplane that properly separates the data points of different categories without any misclassifications. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft margin technique. It discovers a compromise between increasing the margin and reducing violations. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect classifications or margin violations. 7/19/2024 20 Dr. Shivashankar, ISE, GAT
  • 21.
    How Does SupportVector Machine Algorithm Work? • The best way to understand the SVM algorithm is by the SVM classifier. • This hyper-pane is chosen based on margin as the hyperplane providing the maximum margin between the two classes is considered. • These margins are calculated using data points known as Support Vectors. Support Vectors are those data points that are near to the hyper-plane and help in positioning data points it. 7/19/2024 21 Dr. Shivashankar, ISE, GAT
  • 22.
    Cont… The functioning ofSVM classifier is to be understood mathematically then it can be understood in the following ways- Step 1: SVM algorithm predicts the classes. One of the classes is identified as 1 while the other is identified as -1. Step 2: As all machine learning algorithms convert the business problem into a mathematical equation involving unknowns. These unknowns are then found by converting the problem into an optimization problem. Step 3: This loss function can also be called a cost function whose cost is 0 when no class is incorrectly predicted. If this is not the case, then error/loss is calculated. Step 4: As is the case with most optimization problems, weights are optimized by calculating the gradients using advanced mathematical concepts of calculus viz. partial derivatives. Step 5: The gradients are updated only by using the regularization parameter when there is no error in the classification while the loss function is also used when misclassification happens. 7/19/2024 22 Dr. Shivashankar, ISE, GAT
  • 23.
    Important Concepts inSVM • Support vectors are those data points whose basis the margins are calculated and maximized. • The number of support vectors or the strength of their influence is one of the hyper-parameters. 7/19/2024 23 Dr. Shivashankar, ISE, GAT Fig. 2: Presents Support vectors, margin and Classes
  • 24.
    Cont… Hard Margin: • HardMargin refers to that kind of decision boundary that makes sure that all the data points are classified correctly. • While this leads to the SVM classifier not causing any error, it can also cause the margins to shrink thus making the whole purpose of running an SVM algorithm without results. Soft Margin: • Soft Margin SVM introduces flexibility by allowing some margin violations (misclassifications) to handle cases where the data is not perfectly separable. 7/19/2024 24 Dr. Shivashankar, ISE, GAT
  • 25.
    SVM Implementation inPython In Python, an SVM classifier can be developed using the sklearn library. Step 1: Load the important libraries >> import pandas as pd >> import numpy as np >> import sklearn >> from sklearn import svm >> from sklearn.model_selection import train_test_split >> from sklearn import metrics Step 2: Import dataset and extract the X variables and Y separately. >> df = pd.read_csv(“mydataset.csv”) >> X = df.loc[:,[‘Var_X1’,’Var_X2’,’Var_X3’,’Var_X4’]] >> Y = df[[‘Var_Y’]] Step 3: Divide the dataset into train and test >> X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state=123) Step 4: Initializing the SVM classifier mode >> svm_clf = svm.SVC(kernel = ‘linear’) 7/19/2024 25 Dr. Shivashankar, ISE, GAT
  • 26.
    Cont… Step 5: Fittingthe SVM classifier model >> svm_clf.fit(X_train, y_train) Step 6: Coming up with predictions >> y_pred_test = svm_clf.predict(X_test) Step 7: Evaluating model’s performance >> metrics.accuracy(y_test, y_pred_test) >> metrics.precision(y_test, y_pred_test) >> metrics.recall(y_test, y_pred_test) 7/19/2024 26 Dr. Shivashankar, ISE, GAT
  • 27.
    Advantages & Disadvantagesof SVM Advantages • It is one of the most accurate machine learning algorithms. • It is a dynamic algorithm and can solve a range of problems, including linear and non-linear problems, binary, binomial, and multi-class classification problems, along with regression problems. • SVM uses the concept of margins and tries to maximize the differentiation between two classes; it reduces the chances of model overfitting, making the model highly stable. • SVM is known for its computation speed and memory management. It uses less memory, especially when compared to machine vs deep learning algorithms with whom SVM often competes. Disadvantages: • While SVM is fast and can work in high dimensions, it still fails in front of Naïve Bayes, providing faster predictions in high dimensions. Also, it takes a relatively long time during the training phase. • Compared to other linear algorithms such as Linear Regression, SVM is not highly interpretable, especially when using kernels that make SVM non-linear. Thus, it isn’t easy to assess how the independent variables affect the target variable. 7/19/2024 27 Dr. Shivashankar, ISE, GAT
  • 28.
    Cont… Applications of SVM: •Text categorization • Semantic role labeling (predicate, agent, ..) • Image classification • Image segmentation • Hand-written recognition Characteristics of SVM • Based on supervised learning methods • Using for classification or regression analysis • A non-probabilistic binary linear classifier • Representation of the examples as points in space • Examples of the separate categories are divided by a clear gap that is as wide as possible. • New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall • Performing linear classification. 7/19/2024 28 Dr. Shivashankar, ISE, GAT
  • 29.
    K-Nearest Neighbour • Thek-Nearest Neighbors (KNN) algorithm is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. • It is one of the popular and simplest classification and regression classifiers used in machine learning today. • The nearest neighbors of an instance are defined in terms of the standard Euclidean Distance. More precisely, let an arbitrary instance x be described by the feature vector (𝑎1 𝑥 , 𝑎2 𝑥 , … … . , 𝑎𝑛(𝑥)) Distance between two instances 𝑥𝑖 and 𝑥𝑗 is defined to be d(𝒙𝒊, 𝒙𝒋 ), where, 𝑑 𝑥𝑖, 𝑥𝑗 ≡ ෍ 𝑟=1 𝑛 𝑎𝑟 𝑥𝑖 − 𝑎𝑟 𝑥𝑗 2 The K-NN Real valued target function can be defined as: f(x)= σ𝒊=𝟏 𝒌 𝒘𝒊𝒇(𝒙𝒊) σ𝒊=𝟏 𝒌 𝒘𝒊 Where, 𝒘𝒊 = 𝟏 𝒅 𝒙𝒒,𝒙𝒊 𝟐 7/19/2024 29 Dr. Shivashankar, ISE, GAT Fig 2.1: K-NN example
  • 30.
    Cont… 7/19/2024 30 Dr. Shivashankar,ISE, GAT 𝑬𝒄𝒍𝒊𝒅𝒆𝒂𝒏 𝑫𝒊𝒔𝒕𝒂𝒏𝒄𝒆 𝒃𝒆𝒕𝒘𝒆𝒆𝒏 𝑨 𝒂𝒏 𝑩 = 𝑿𝟐 − 𝑿𝟏 𝟐 + 𝒀𝟐 − 𝒀𝟏 𝟐
  • 31.
    K-Nearest Neighbor (KNN)Algorithm for Machine Learning • K-NN is one of the simplest Machine Learning algorithms based on Supervised Learning technique. • K-NN algorithm assumes the similarity between the new case/data and available cases and put the new case into the category that is most similar to the available categories. • K-NN algorithm stores all the available data and classifies a new data point based on the similarity. This means when new data appears then it can be easily classified into a well suite category by using K- NN algorithm. • K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the Classification problems. • K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying data. • It is also called a lazy learner algorithm because it does not learn from the training set immediately instead it stores the dataset and at the time of classification, it performs an action on the dataset. • KNN algorithm at the training phase just stores the dataset and when it gets new data, then it classifies that data into a category that is much similar to the new data. 7/19/2024 31 Dr. Shivashankar, ISE, GAT
  • 32.
    How does K-NNwork? The K-NN working can be explained on the basis of the below algorithm: Step-1: Select the number K of the neighbors Step-2: Calculate the Euclidean distance of K number of neighbors Step-3: Take the K nearest neighbors as per the calculated Euclidean distance. Step-4: Among these k neighbors, count the number of the data points in each category. Step-5: Assign the new data points to that category, number of the neighbor is maximum. Step-6: Our model is ready. Suppose we have a new data point and we need to put it in the required category. Consider the below image: 7/19/2024 32 Dr. Shivashankar, ISE, GAT Fig. 2.13: K-NN for best classifier
  • 33.
    Cont… • Firstly, wewill choose the number of neighbors, so we will choose the k value. • Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as: • Euc-dist[(𝑥1, 𝑦1); (𝑥2, 𝑦2)= 𝑥2 − 𝑥1 2 + 𝑦2 − 𝑦1 2 • By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in category A and two nearest neighbors in category B. Consider the below graph: • As we can see the 3 nearest neighbors are from category A, hence this new data point must belong to category A. 7/19/2024 33 Dr. Shivashankar, ISE, GAT
  • 34.
    Why do weneed a K-NN Algorithm? • Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1, so this data point will lie in which of these categories. • To solve this type of problem, we need a K-NN algorithm. With the help of K- NN, we can easily identify the category or class of a particular dataset. 7/19/2024 34 Dr. Shivashankar, ISE, GAT Fig. 11. Presents the importance of KNN
  • 35.
    Advantages and Disadvantagesof KNN Algorithm Advantages: • It is simple to implement. • It is robust to the noisy training data • It can be more effective if the training data is large. Disadvantages: • Always needs to determine the value of K which may be complex some time. • The computation cost is high because of calculating the distance between the data points for all the training samples. 7/19/2024 35 Dr. Shivashankar, ISE, GAT
  • 36.
    Applications of K-nearestNeighbor 1. Credit score The KNN algorithm compares an individual's credit rating to others with comparable characteristics to help calculate their credit rating. 2. Approval of the loan The k-nearest neighbor technique, similar to credit scoring, is useful in detecting people who are more likely to default on loans by comparing their attributes to those of similar people. 3. Preprocessing of data Many missing values can be found in datasets. Missing data imputation is a procedure that uses the KNN algorithm to estimate missing values. 4. Healthcare: KNN has also had application within the healthcare industry, making predictions on the risk of heart attacks and prostate cancer. The algorithm works by calculating the most likely gene expressions.. 5. Prediction of stock prices The KNN algorithm is useful in estimating the future value of stocks based on previous data since it has a knack for anticipating the prices of unknown entities. 6. Recommendation systems KNN can be used in recommendation systems since it can help locate people with comparable traits. It can be used in an online video streaming platform, for example, to propose content that a user is more likely to view based on what other users watch. 7. Computer Vision For picture classification, the KNN algorithm is used. It's important in a variety of computer vision applications since it can group comparable data points together, such as cats and dogs in separate classes. 8. Easy to implement: Given the algorithm’s simplicity and accuracy, it is one of the first classifiers that a new data scientist will learn. 7/19/2024 36 Dr. Shivashankar, ISE, GAT
  • 37.
    Conti.. Problem 1: Fromthe given dataset, find (x,y)= (170, 57) whether belongs to under or normal weight. Assume K=3. Solution: Find the Euc-dist:d= 𝑥2 − 𝑥1 2 + 𝑦2 − 𝑦1 2 d1= 170 − 167 2 + 57 − 51 2 = 32 + 62 = 45 = 6.70 d2= 122 + 52 = 169 =13 And so on 7/19/2024 37 Dr. Shivashankar, ISE, GAT Height (cm) Weight (kg) Class 167 51 Underweight 182 62 Normal 176 69 Normal 173 64 Normal 172 65 Normal 174 56 Underweight 169 58 Normal 173 57 Normal 170 55 Normal 170 57 ?
  • 38.
    Conti.. Since K=3, withmaximum 3 ranks with distances. The smallest distance is • (169,58)-1.414: Normal • (170,55)-2: Normal • (173,57)-3:Normal Hence all 3 points, so (170,57)belongs to normal class, 7/19/2024 38 Dr. Shivashankar, ISE, GAT Height (cm) Weight (kg) Class Distance 167 51 Underweight 6.7 182 62 Normal 13 176 69 Normal 13.4 173 64 Normal 7.6 172 65 Normal 8.2 174 56 Underweight 4.1 169 58 Normal 1.414-1(R) 173 57 Normal 3-3(R) 170 55 Normal 2-2(R) 170 57 Normal 3
  • 39.
    Conti.. Problem 2: Fromthe given dataset, find (x,y)= (157, 54) whether belongs to medium or longer. Assume K=3. Solution: Find the Euc-dist:d= 𝑥2 − 𝑥1 2 + 𝑦2 − 𝑦1 2 7/19/2024 39 Dr. Shivashankar, ISE, GAT Sl. No. Height Weight Target 1 150 50 Medium 2 155 55 Medium 3 160 60 Longer 4 161 59 Longer 5 158 65 Longer 6 157 54 ? Sl. No. Height Weight Target Distance 1 150 50 Medium 8.06 2 155 55 Medium 2.24 (1) 3 160 60 Longer 6.71(3) 4 161 59 Longer 6.40(2) 5 158 65 Longer 11.05 6 157 54 ?
  • 40.
    Conti.. From the tableand K=3, with maximum 3 ranks with distances We have 2.24 (medium), 6.40(Longer) and 6.71(Longer) f(𝑥𝑣) = f(𝑥𝑣) = angmax 𝑣𝜖𝑉 ෍ 𝑖=1 𝑘 𝛿 𝑣, 𝑓(𝑥𝑣 ) −− −𝛿 𝑎, 𝑏 = 1 𝑖𝑓 𝑎 == 𝑏 𝛿 𝑎, 𝑏 = 0 𝑖𝑓 𝑎 ≠ 𝑏 Compare medium with 2.24(m), 6.40(L) and 6.71(L) ==𝛿 𝑀, 𝑀 + 𝛿 𝑀, 𝐿 + 𝛿 𝑀, 𝐿 1+0+0=1 Compare longer with 2.24(m), 6.40(L) and 6.71(L) ==𝛿 𝐿, 𝑀 + 𝛿 𝐿, 𝐿 + 𝛿 𝐿, 𝐿 0+1+1=2 Since 2 is longer, (157,54)belong to longer If we consider the distance 2.24, 6.71 and 6.40, -----2.24 is smaller, hence medium could be consider. Distance weighted NN: 1. Discrete valued target function 2. Real valued target function 7/19/2024 40 Dr. Shivashankar, ISE, GAT
  • 41.
    Conti.. Discrete valued function: f(𝑥𝑣)= angmax 𝑣𝜖𝑉 ෍ 𝑖=1 𝑘 𝑤𝑖𝛿 𝑣, 𝑓(𝑥𝑖 ) Where, 𝑤𝑖 = 1 𝑑 𝑥𝑞,𝑥𝑖 2 W.r.t. medium: f(𝑥𝑞) =0.199*𝛿 𝑚, 𝑚 + 0.022∗𝛿 𝑚, 𝑙 + 0.024*𝛿 𝑚, 𝑙 =0.199*1 + 0.022∗0+0.024∗0= 0.199 W.r.t. Longer: f(𝑥𝑞) =0.199*𝛿 𝑙, 𝑚 + 0.022∗𝛿 𝑙, 𝑙 + 0.024*𝛿 𝑙, 𝑙 = 0.199*0 + 0.022∗1+0.024∗1= 0.046 Since 0.199 > 0.046—new instance is Classified to medium. 7/19/2024 41 Dr. Shivashankar, ISE, GAT Sl. No. Height Weight Target Distance 1 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 2 1 150 50 Mediu m 8.06 2 155 55 Mediu m 2.24 (1) 0.199 3 160 60 Longer 6.71(3) 0.022 4 161 59 Longer 6.40(2) 0.024 5 158 65 Longer 11.05 6 157 54 Mediu m
  • 42.
    Conti.. Real valued targetfunction: f(x)= σ𝑖=1 𝑘 𝑤𝑖𝑓(𝑥𝑖) σ𝑖=1 𝑘 𝑤𝑖 Where, 𝑤𝑖 = 1 𝑑 𝑥𝑞,𝑥𝑖 2 weighted vectors-randomly we will consider f(𝑥𝑞) = (0.199∗1.2+0.022∗1.8+0.024∗2.1) 0.45+0.15+0.16 =1.51 7/19/2024 42 Dr. Shivashankar, ISE, GAT Sl. No. Height Weight Target Distance 1 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 2 1 150 50 1.5 8.06 2 155 55 1.2 2.24 (1) 0.199 3 160 60 1.8 6.71(3) 0.022 4 161 59 2.1 6.40(2) 0.024 5 158 65 1.7 11.05 6 157 54 1.5
  • 43.
    Conti.. Problem 3: Calculatethe centroid classifier for the give data and the given a test instance (6,5), predict the class. Solution: • Step1: Compute the mean/centroid of each class. • There are 2 classes, A & B. • Centroid of class A=(3+5+4,1+2+3)/3=(12,6)/3=(4,2) • Centroid of class B=(7+6+8,6+7+5)/3=(21,18)/3=(7,6) • Step 2: calculate the Euclidean distance between test instance (6,5) and each of the centroid. 7/19/2024 43 Dr. Shivashankar, ISE, GAT X Y Class 3 1 A 5 2 A 4 3 A 7 6 B 6 7 B 8 5 B
  • 44.
    Conti.. Euc-dist[(𝑥1, 𝑦1); (𝑥2,𝑦2)= 𝑥2 − 𝑥1 2 + 𝑦2 − 𝑦1 2 Class A : [(6,5);(4,2)] = 4 − 6 2 + 2 − 5 2 = 3.6 Class B: [(6,5);(7,6)] = 7 − 6 2 + 6 − 5 2 =1.414 The test instance has smaller distance to class B Hence, the class of this test instance is predicted as B. Problem 4. Given the following training instances in the table, each having two attributes (x1 and x2). Compute the class label for test instance 𝑡1 = 3,7 , using 3 nearest neighbors (k=3). 7/19/2024 44 Dr. Shivashankar, ISE, GAT Training Instances 𝑥1 𝑥2 Output 𝐼1 7 7 0 𝐼2 7 4 0 𝐼3 3 4 1 𝐼4 1 4 1
  • 45.
    Conti.. Euc-dist[(𝑥1, 𝑦1); (𝑥2,𝑦2)=d= 𝑥1 − 𝑦1 2 + 𝑥2 − 𝑦2 2 d= 𝑥1 − 𝑦1 2 + 𝑥2 − 𝑦2 2 Neighbor rank 𝑑1 = 7 − 3 2 + 7 − 7 2 =4 3 𝑑2 = 7 − 3 2 + 4 − 7 2 = 5 4 𝑑3 = 3 − 3 2 + 4 − 7 2 = 3 1 𝑑4 = 1 − 3 2 + 4 − 7 2 = 3.6 2 For K=3, we will consider 𝐼1 = 3, 𝐼3 = 1,and 𝐼4 = 2 So K=3, 𝑡2=(3,7) -----output is 1 Highest vote=0.11, so output =1 7/19/2024 45 Dr. Shivashankar, ISE, GAT d 𝑑2 Vote =1/𝑑2 Rank 4 16 1/16=0.06 3 5 25 1/25=0.04 4 3 9 1/9=0.11 1 3.6 12.96 1/12.96=0 .08 2
  • 46.
    Conti.. Problem 5: ApplyKNN classifier to predict the diabetic patience with the given features BMI, Age. If the training examples are: Assume K=3, Test example: BMI=43.6, Age=40, Sugar=? 7/19/2024 46 Dr. Shivashankar, ISE, GAT BMI Age Sugar 33.6 50 1 26.6 30 0 23.4 40 0 43.1 67 0 35.3 23 1 35.9 67 1 36.7 45 1 25.7 46 0 23.3 29 0 31 56 1
  • 47.
    Conti.. Solution: First calculate thedistance between the test instances and training instance: Test examples: BMI=43.6. Age=40, sugar=? Euc-dist=d= 𝑥2 − 𝑥1 2 + 𝑦2 − 𝑦1 2 , 𝒅𝟏 = 43.6 − 33.6 2 + 40 − 50 2 = 14.14 Therefore, for test examples: BMI=43.6, Age=40, sugar=1, because in the rank 1, sugar=1 7/19/2024 47 Dr. Shivashankar, ISE, GAT BMI Age Sugar Distance to new Rank 33.6 50 1 14.14 2 26.6 30 0 19.72 5 23.4 40 0 20.20 6 43.1 67 0 27.00 9 35.3 23 1 18.92 4 35.9 67 1 28.08 10 36.7 45 1* 8.52 1 25.7 46 0 18.88 3 23.3 29 0 23.09 8 31 56 1 20.37 7
  • 48.
    Cont… Problem 6: giventhe training data, predict the class of the following new examples using KNN for K=5, age<=30, income = medium, student=yes, credit rating=fair. 7/19/2024 48 Dr. Shivashankar, ISE, GAT Age Income Student Credit rating Buys computers <=30 High No Fair No <=30 High No Excellent No 30..40 High No Fair Yes >40 Medium No Fair Yes >40 Low Yes Fair Yes >40 Low Yes Excellent No 31..40 Low Yes Excellent Yes <=30 Medium No Fair no <=30 Low Yes Fair Yes >40 Medium Yes Fair Yes <=30 Medium Yes Excellent Yes 31..40 Medium No Excellent Yes 31..40 High Yes Fair Yes >40 Medium no Excellent No
  • 49.
    Cont… Solution: • For similaritymeasures, use a single match of attribute values: • σ𝑖=1 4 𝑤𝑖 ∗ 𝜕 𝑎𝑖,𝑏𝑖 4 • Where, 𝜕 𝑎𝑖, 𝑏𝑖 =1 if 𝑎𝑖 = 𝑏𝑖 and • =0 otherwise. • 𝑎𝑖𝑎𝑛𝑑 𝑏𝑖 are either age, income, stude or credit rating • Weight are all 1 except for income it is 2. • Now, new examples using KNN for K=5, age<=30, income = medium, student=yes, credit rating=fair. • For RID=1 class=no, distance to new: (1*1+2*0+1*0+1*1)/4=0.5 7/19/2024 49 Dr. Shivashankar, ISE, GAT Age<=30 from the table Age<=30 from the given new examples Income-high from the table Income-medium Student-no from the table Student-yes Credit rating-fair from the table Credit rating-fair from new example
  • 50.
    Cont… 7/19/2024 50 Dr. Shivashankar,ISE, GAT Age Income Student Credit rating Buys computers RID class distance <=30 High No Fair No 1 No 0.5 <=30 High No Excellent No 2 No 0.25 30..40 High No Fair Yes 3 Yes 0.25 >40 Medium No Fair Yes* 4 Yes 0.75 >40 Low Yes Fair Yes 5 Yes 0.5 >40 Low Yes Excellent No 6 No 0.25 31..40 Low Yes Excellent Yes 7 Yes 0.25 <=30 Medium No Fair No 8 No 1 <=30 Low Yes Fair Yes* 9 Yes 0.75 >40 Medium Yes Fair Yes* 10 Yes 1 <=30 Medium Yes Excellent Yes* 11 Yes 1 31..40 Medium No Excellent Yes 12 Yes 0.5 31..40 High Yes Fair Yes 13 Yes 0.5 >40 Medium no Excellent No 14 No 0.5
  • 51.
    Cont… • Therefore, amongthe five nearest neighbors (RID and distance values: 4-0.75,8-1,9—0.75,10-1,11-1), four are from class Yes and one from class No. • Hence, the KNN-classifier, buy computers=yes. 7/19/2024 51 Dr. Shivashankar, ISE, GAT
  • 52.
    Clustering K-means • Thetask of grouping data points based on their similarity with each other is called Clustering or Cluster Analysis. • This method is defined under the branch of Unsupervised Learning, which aims at gaining insights from unlabelled data point • Cluster analysis divides the data into groups (clusters) that are meaningful, useful, or both. • For instance, clustering can be regarded as a form of classification in that it creates a labeling of objects with class (cluster) labels. 7/19/2024 52 Dr. Shivashankar, ISE, GAT
  • 53.
    K-means • K-Means Clusteringis an unsupervised learning algorithm that is used to solve the clustering problems in machine learning or data science. • K means clustering, assigns data points to one of the K clusters depending on their distance from the center of the clusters. • It starts by randomly assigning the clusters centroid in the space. • Then each data point assign to one of the cluster based on its distance from centroid of the cluster. • After assigning each point to one of the cluster, new cluster centroids are assigned. • This process runs iteratively until it finds good cluster. • Here, K defines the number of pre-defined clusters that need to be created in the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on. • Hence, each cluster has data points with some commonalities/similarities, and it is away from other clusters. 7/19/2024 53 Dr. Shivashankar, ISE, GAT
  • 54.
    The Basic K-meansAlgorithm • First, we randomly initialize k points, called means or cluster centroids. • We categorize each item to its closest mean, and we update the mean’s coordinates, which are the averages of the items categorized in that cluster so far. • We repeat the process for a given number of iterations and at the end, we have our clusters. Basic K-means algorithm Step-1: Select the number K (clusters) randomly to decide the number of clusters. Step-2: Select random K points or centroids. (It can be other from the input dataset). Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters. Step-4: Calculate the variance and place a new centroid of each cluster. Step-5: Repeat the third steps, which means reassign each datapoint to the new closest centroid of each cluster. Step-6: If any reassignment occurs, then go to step-4 else go to FINISH. Step-7: The model is ready. 7/19/2024 54 Dr. Shivashankar, ISE, GAT
  • 55.
    Strengths and Weaknesses Strength •K-means is simple and can be used for a wide variety of data types. • It is also quite efficient, even though multiple runs are often performed. • This algorithm is very easy to understand and implement. • This algorithm is efficient, Robust, and Flexible • If data sets are distinct and spherical clusters, then give the best result Weaknesses • This algorithm needs prior specification for the number of cluster centers that is the value of K. • It cannot handle outliers and noisy data, as the centroids get deflected • It does not work well with a very large set of datasets as it takes huge computational time. 7/19/2024 55 Dr. Shivashankar, ISE, GAT
  • 56.
    Cont… Problem 1: Dividethe given sample data into two clusters [2] using K means algorithm S={2,3,4,10,11,12,20,25,30}. Given K=2, for new data point 15, identify the cluster belongs to. Solution: 1. Choose 2 random clusters from the given data sets C1=4, C2=12. 2. Find the distance between given samples and centroids, put the sample in the nearest cluster. 3. Repeat the same for all data points. Cluster k1={2,3,4} -------------(2-4=2, 3-4=1, 4-4=0, 10-4=6,…….. 2-12=10, 3-12=9, 5-12=7, 10-12=2,…….. so 2,3 and 4 are placed in cluster 1 as its distance is nearest to C1=4 and Cluster K2={10,11,12,20,25,30} 4. Compute new centroids K1={2,3,4} K2={10,11,12,20,25,30} C1={2+3+4/3}=3 C2={10+11+12+20+25+30}/6=18 So C1=3 C2=18 7/19/2024 56 Dr. Shivashankar, ISE, GAT
  • 57.
    Cont… 5. Find newclustering C1=3 and C2=18 K1={2,3,4,10} K2={11,12,20,25,30} C1=2+3+4+10/4 =4.75 K2=11+12+20+25+30/5=19.6 6. Find new clustering C1=4.75 and C2=19.6 K1={2,3,4,10,11,12} K2-{20,25,30} C1=2+3+4+10+11+12/6=7 C2=20+25+30/3=25 7. Find new clustering C1=7 and C2=25 K1={2,3,4,10,11,12} K2-{20,25,30} Since clustering and centroid values remains same. So the given dataset is dividing into 2 clusters as K1={2,3,4,10,11,12} K2-{20,25,30} With centroids C1=7 and C2=25. 8. Identify the cluster for new data points 15 Distance between 15 and C1(15-7)=8 Distance between 15 and C2(15-25)=10 Since distance between 15 and C1 is less, new data point 15 belongs to C1(=7). 7/19/2024 57 Dr. Shivashankar, ISE, GAT
  • 58.
    Cont… Problem 2: Dividethe following data points into two clusters using K-mean and identify (5,4) belongs to which cluster. Solution: Step 1: Choosing randomly 2 clusters centers C1=(2,1) and C2=(2,3) Step 2: Finding distance between two clusters centers and each data point (Apply Euclidean distance) For data points, (1,1) and C1(2,1): d= 1 − 2 2 + 1 − 1 2 = 1 (2,1) and (2,1): d= 2 − 2 2 + 1 − 1 2 = 0 (2,3) and (2,1): d= 2 − 2 2 + 3 − 1 2 = 2 and so on 7/19/2024 58 Dr. Shivashankar, ISE, GAT X 1 2 2 3 4 5 Y 1 1 3 2 3 5 Data points Distance from C1 (2,1) Distance from C2(2,3) New clusters (1,1) 1 2.24 C1 (2,1) 0 2 C1 (2,3) 2 0 C2 (3,2) 1.41 1.41 C1 (4,3) 2.83 2 C2 (5,5) 5 3.61 C2
  • 59.
    Cont… Step 3: cluster1 of C1={ (1,1), (2,1), (3,2)} cluster 2 of C2={ (2,3), (4,3), (5,5)} Step 4: Recalculate cluster center C1= 1 3 [(1,1)+(2,1)+(3,2)]= 1 3 [6,4]= (2,1.33) C2= 1 3 [(2,3)+(4,3)+(5,5)]= 1 3 [11,11]= (3.67,3.67) Step 5: Repeat the step 2 until we get same cluster center or same cluster elements 7/19/2024 59 Dr. Shivashankar, ISE, GAT Data points Distance from C1(2,1.33) Distance from C2(3.67,3.67) New clusters (1,1) 1.05 3.78 C1 (2,1) 0.33 3.15 C1 (2,3) 1.67 1.8 C1 (3,2) 1.204 1.8 C1 (4,3) 2.605 0.75 C2 (5,5) 4.74 1.88 C2
  • 60.
    Cont… cluster 1 ofC1={ (1,1), (2,1),(2,3), (3,2)} cluster 2 of C2={ (4,3), (5,5)} Step 6: Recalculate cluster center C1= 1 4 [(1,1)+(2,1)+(2,3)+(3,2)]= 1 4 [8,7]= (2,1.75) C2= 1 2 [(4,3)+(5,5)]= 1 2 [9,8]= (4.5,4) Step 7: Repeat the step 2 until we get same cluster center or same cluster elements Step 8: cluster 1 of C1={ (1,1), (2,1),(2,3), (3,2)} cluster 2 of C2={ (4,3), (5,5)} Since cluster elements are same as compared to previous iteration, stop. 7/19/2024 60 Dr. Shivashankar, ISE, GAT Data points Distance from C1(2,1.75) Distance from C2(4.5,4) New clusters (1,1) 1.25 4.61 C1 (2,1) 0.75 3.9 C1 (2,3) 1.25 2.69 C1 (3,2) 1.03 2.5 C1 (4,3) 2.36 1.12 C2 (5,5) 4.42 1.12 C2
  • 61.
    Cont… Problem 3 UseK-means clustering to cluster the following data into two groups. Data points {2,4,10,12,3,20,30,11,25}, initial cluster centroids are M1=4 and M2=11. Solution: Initial centroids: M1=4, M2=11. Distance to is calculated by d(𝑥2, 𝑥1) = 𝑥2 − 𝑥1 2 Therefore, C1={2,4,3} M1=(2+4+3)/3=3 C2={10,12,20,30,11,25} M2=(10+12+20+30+11+25)/6=18 so new centroids: M1=3, M2=18 7/19/2024 61 Dr. Shivashankar, ISE, GAT Data points Distance to Cluster New cluster M1(4) M2(11) 2 2 9 C1 4 0 7 C1 10 6 1 C2 12 8 1 C2 3 1 8 C1 20 16 9 C2 30 26 19 C2 11 7 0 C2 25 21 14 C2
  • 62.
    Cont… Current centroids: M1=3,M2=18 Therefore, C1={2,4,20,3} C2={12,20,30,11,25} So, New centroids: M1=4.75 M2=19.6 7/19/2024 62 Dr. Shivashankar, ISE, GAT Data points Distance to Cluster New cluster M1 M2 2 1 16 C1 C1 4 1 14 C1 C1 10 7 8 C2 C1 12 9 6 C2 C2 3 0 15 C1 C1 20 17 2 C2 C2 30 27 12 C2 C2 11 8 7 C2 C2 25 22 7 C2 C2
  • 63.
    Cont… Current centroids: M1=4.75,M2=19.6 Therefore, C1={2,4,10,11,12,3} C2={20,30,25} So, New centroids: M1=7 M2=25 7/19/2024 63 Dr. Shivashankar, ISE, GAT Data points Distance to Cluster New cluster M1 M2 2 2.75 17.6 C1 C1 4 0.75 15.6 C1 C1 10 5.25 9.6 C1 C1 12 7.25 7.6 C2 C1 3 1.75 16.6 C1 C1 20 15.25 0.4 C2 C2 30 25.25 10.4 C2 C2 11 6.25 8.6 C2 C1 25 20.25 5.4 C2 C2
  • 64.
    Cont… Current centroids: M1=7,M2=25 Therefore, final cluster are • C1=(2,4,10,11,12,13} • C2={20,30,5} 7/19/2024 64 Dr. Shivashankar, ISE, GAT Data points Distance to Cluster New cluster M1 M2 2 5 23 C1 C1 4 3 21 C1 C1 10 3 15 C1 C1 12 5 13 C1 C1 3 4 22 C1 C1 20 13 5 C2 C2 30 23 5 C2 C2 11 4 14 C1 C1 25 18 0 C2 C2
  • 65.
    Cont… Problem 4: UseK-means clustering to cluster and suppose that the data mining task is to cluster points into 3 cluster. Where the data points are A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4) and C1(1,2), c2(4,9). Suppose initially we assign A1, B1 and C1 as the center of each cluster respectively. Solution: Initial centroids: A1=(2,10), B1=(5,8), C1=(1,2) Distance to is calculated by d(𝑃1, 𝑃2) = 𝑥2 − 𝑥1 2 + 𝑦2 − 𝑦1 2 Therefore, C1={2,10} C2={(8,5,7,6,4) (4,8,5,4,9)} C3={(2,1)(5,2)} So new centroids: A1=(2,10), B1=(6,6) and C1=(1.5,3.5) 7/19/2024 65 Dr. Shivashankar, ISE, GAT Data points Distance to Cluster New cluster 2 10 5 8 1 2 A1 2 10 0 3.61 8.06 1 A2 2 5 5 4.24 3.16 3 A3 8 4 8.49 5 7.28 2 B1 5 8 3.61 0 7.21 2 B2 7 5 7.07 3.61 6.71 2 B3 6 4 7.21 4.12 5.39 2 C1 1 2 8 7.21 0 3 C2 4 9 2.24 1.41 7.62 2
  • 66.
    Cont… Current centroids: A1=(2,10),B1=(6,6), C1=(1.5,3.5) Therefore, C1={2,4) (10,9)} C2={(8,5,7,6) (4,8,5,4)} C3={(2,1)(5,2)} So new centroids: A1=(3,9.5), B1=(6.5,5.25) and C1=(1.5,3.5) 7/19/2024 66 Dr. Shivashankar, ISE, GAT Data points Distance to Cluster New cluster 2 10 6 6 1.5 3.5 A1 2 10 0 5.66 6.52 1 1 A2 2 5 5 4.12 1.58 3 3 A3 8 4 8.49 2.83 6.52 2 2 B1 5 8 3.61 2.24 5.7 2 2 B2 7 5 7.07 1.41 5.7 2 2 B3 6 4 7.21 2.00 4.53 2 2 C1 1 2 8.06 6.46 1.58 3 3 C2 4 9 2.24 3.61 6.04 2 1
  • 67.
    Cont… Current centroids: A1=(3,9.5),B1=(6.5,5.25), C1=(1.5,3.5) Therefore, the new centroids: A1=(3.6, 7.9), B1=(7,4.33) and C1=(1.5,3.5) 7/19/2024 67 Dr. Shivashankar, ISE, GAT Data points Distance to Cluster New cluster 3 9.5 6.5 65. 25 1.5 3.5 A1 2 10 1.12 6.54 6.52 1 1 A2 2 5 4.61 4.51 1.58 3 3 A3 8 4 7.43 1.95 6.52 2 2 B1 5 8 2.5 3.13 5.7 2 1 B2 7 5 6.02 0.56 5.7 2 2 B3 6 4 6.26 1.35 4.53 2 2 C1 1 2 7.76 6.39 1.58 3 3 C2 4 9 1.12 4.51 6.04 1 1
  • 68.
    Cont… Current centroids: A1=(3.6,7.9), B1=(7,4.33), C1=(1.5,3.5) Therefore, the final clusters: C1={(2,5,4)(10,8,9)}, C2={(8,7,6)(4,5,4)} C3={(2,1)(5,2)} 7/19/2024 68 Dr. Shivashankar, ISE, GAT Data points Distance to Cluster New cluster 3. 6 7.9 7 4.3 3 1.5 3.5 A1 2 10 1.94 7.56 6.52 1 1 A2 2 5 4.33 5.04 1.58 3 3 A3 8 4 6.62 1.05 6.52 2 2 B1 5 8 1.67 4.18 5.70 1 1 B2 7 5 5.21 0.67 5.70 2 2 B3 6 4 5.52 1.05 4.53 2 2 C1 1 2 7.49 6.44 1.58 3 3 C2 4 9 0.33 5.55 6.04 1 1
  • 69.
    Hierarchical Clustering • Hierarchicalclustering is another unsupervised machine learning algorithm, which is used to group the unlabeled datasets into a cluster. • It is a connectivity-based clustering model that groups the data points together that are close to each other based on the measure of similarity or distance. • The assumption is that data points that are close to each other are more similar or related than data points that are farther apart. • It is based on the idea of creating a hierarchy of clusters, where each cluster is made up of smaller clusters that can be further divided into even smaller clusters. • This hierarchical structure makes it easy to visualize the data and identify patterns within the data. Hierarchical clustering is of two types. Agglomerative clustering Divisive clustering 7/19/2024 69 Dr. Shivashankar, ISE, GAT
  • 70.
    Agglomerative Clustering • Agglomerativeclustering is a type of data clustering method used in unsupervised learning. • It begins with N groups, each containing initially one entity, and then the two most similar groups merge at each stage until there is a single group containing all the data. • It is an iterative process that groups similar objects into clusters based on some measure of similarity. • It uses a bottom-up approach for dividing data points into clusters. • The algorithm begins by assigning each object to its own cluster. • It then uses a distance metric to determine the similarity between objects and clusters. • If two clusters have similar elements, they are merged together into a larger cluster. • This continues until all objects are grouped into one final cluster. 7/19/2024 70 Dr. Shivashankar, ISE, GAT
  • 71.
    Agglomerative Hierarchical ClusteringAlgorithm • Step 1: Consider each dataset as a single cluster and calculate the distance of one cluster from all the other clusters. • Step 2: In the second step, comparable clusters are merged together to form a single cluster. Let’s say cluster (B) and cluster (C) are very similar to each other, therefore we merge them in the second step similarly to cluster (D) and (E) and at last, we get the clusters [(A), (BC), (DE), (F)] • Step 3: We recalculate the proximity according to the algorithm and merge the two nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)] • Step 4: Repeating the same process; The clusters DEF and BC are comparable and merged together to form a new cluster. We’re now left with clusters [(A), (BCDEF)]. • Step 4: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)]. 7/19/2024 71 Dr. Shivashankar, ISE, GAT
  • 72.
    Cont… The average linkageclustering uses the average formula, i.e. distance between two clustering A & B d(A,B)=avg{d(a,y): x𝜖𝐴, 𝑦𝜖𝐵} d(A,B)= ∈𝑑 𝑥,𝑦 :x𝜖𝐴,𝑦𝜖𝐵 𝐴 𝐵 7/19/2024 72 Dr. Shivashankar, ISE, GAT Fig.9. Concept of Agglomerative Clustering
  • 73.
    Key Issues inHierarchical Clustering Lack of a Global Objective Function: • Agglomerative hierarchical clustering techniques use various criteria to decide locally, at each step, which clusters should be merged (or split for divisive approaches). This approach yields clustering algorithms that avoid the difficulty of attempting to solve a hard combinatorial optimization problem. • Do not have problems with local minima or difficulties in choosing initial points. Ability to Handle Different Cluster Sizes: • There are two approaches: weighted, which treats all clusters equally, and unweighted, which takes the number of points in each cluster into account. • Treating clusters of unequal size equally gives different weights to the points in different clusters, while taking the cluster size into account gives points in different clusters the same weight. Merging Decisions are Final: • Agglomerative hierarchical clustering algorithms tend to make good local decisions about combining two clusters since they can use information about the pairwise similarity of all points. • This approach prevents a local optimization criterion from becoming a global optimization criterion. 7/19/2024 73 Dr. Shivashankar, ISE, GAT
  • 74.
    Advantage and disadvantagesof Agglomerative Hierarchical Clustering Algorithm Advantages 1. Performance: It is effective in data observation from the data shape and returns accurate results 2. Easy: It is easy to use and provides better user guidance with good community support. So much content and good documentation are available for a better user experience. 3. More Approaches: Two approaches are there using which datasets can be trained and tested, agglomerative and divisive. 4. Performance on Small Datasets: The hierarchical clustering algorithms are effective on small datasets and return accurate and reliable results with lower training and testing time. Disadvantages 1. Time Complexity: As many iterations and calculations are associated, the time complexity of hierarchical clustering is high. In some cases, it is one of the main reasons for preferring K-Means clustering. 2. Space Complexity: As many calculations of errors with losses are associated with every epoch, the space complexity of the algorithm is very high. Due to this, while implementing the hierarchical clustering, the space of the model is considered. In such cases, we prefer K-Means clustering. 3. Poor performance on Large Datasets: When training a hierarchical clustering algorithm for large datasets, the training process takes so much time with space which results in poor performance of the algorithms. 7/19/2024 74 Dr. Shivashankar, ISE, GAT
  • 75.
    Exercise problems Problem 1:Consider the following set of 6 one dimensional data points : 18,22,25, 42,27,43. merge the clusters using minimum distance and update proximity matrix accordingly. Show proximity matrix to each iteration. Solution: Since minimum distance is 1—(42,43) or (43,42), so ,merge 42 and 43 From matrix 2, since 2 is minimum distance, merge (25,27) 7/19/2024 75 Dr. Shivashankar, ISE, GAT 18 22 25 27 42 43 18 0 4 7 9 24 25 22 4 0 3 5 20 21 25 7 3 0 2 17 18 27 9 5 2 0 15 16 42 24 20 17 15 0 1 43 25 21 18 16 1 0 18 22 25 27 42,43 18 0 4 7 9 24 22 4 0 3 5 20 25 7 3 0 2 17 27 9 5 2 0 15 42,43 24 20 17 15 0
  • 76.
    Exercise problems Since 3is minimum distance, merge 22,25.and 27---{22,(25,27)} Since 4 is minimum distance, merge 18,22,25,27---[18,{22,(25,27)}] Draw the dendrogram for the merged data points. 7/19/2024 76 Dr. Shivashankar, ISE, GAT 18 22 25,27 42,43 18 0 4 7 24 22 4 0 3 20 25,27 7 3 0 15 42,43 24 20 15 0 18 22,25,27 42,43 18 0 4 24 22,25,27 4 0 15 42,43 24 15 0
  • 77.
    Problems Problem 2: Forthe given dataset, find the clusters using a single link technique. Use Euclidean distance and draw the dendrogram. Solution: Step 1: Compute the distance matrix using Euclidean distance. Let A(𝑥1, 𝑦1) 𝑎𝑛𝑑 B(𝑥2, 𝑦2) Then Euclidean distance between two points d(A,B)= x2 − x1 2 + y2 − y1 2 7/19/2024 77 Dr. Shivashankar, ISE, GAT Sample No X Y P1 0.40 0.53 P2 0.22 0.38 P3 0.35 0.32 P4 0.26 0.19 P5 0.08 0.41 P6 0.45 0.30
  • 78.
    Conti.. d(P1,P2)= 𝟎. 𝟐𝟐− 𝟎. 𝟒𝟎 𝟐 + 𝟎. 𝟑𝟖 − 𝟎. 𝟓𝟑 𝟐 = 0.23 d(P1,P3)= 𝟎. 𝟑𝟓 − 𝟎. 𝟒𝟎 𝟐 + 𝟎. 𝟑𝟐 − 𝟎. 𝟓𝟑 𝟐 = 0.22 d(P2,P3)= 𝟎. 𝟑𝟓 − 𝟎. 𝟐𝟐 𝟐 + 𝟎. 𝟑𝟐 − 𝟎. 𝟑𝟖 𝟐 = 0.14 and so on Step 2: Merging the two closest members Here, the minimum values is 0.10 and hence we combine P3 and P6 (as 0.10 came in the P6 row and p3 column). Now, form the clusters of elements corresponding to the minimum value and update the distance matrix. 7/19/2024 78 Dr. Shivashankar, ISE, GAT P1 P2 P3 P4 P5 P6 P1 0 P2 0.23 0 P3 0.22 0.14 0 P4 0.37 0.19 0.13 0 P5 0.34 0.14 0.28 0.23 0 P6 0.24 0.24 0.10 0.22 0.39 0
  • 79.
    Conti.. (P3,P6) Merge two closestmembers of the two clusters. The minimum value is 0.13 and hence we combine P3, P6, P4 {(P3, P6), P4} 7/19/2024 79 Dr. Shivashankar, ISE, GAT P1 P2 P3 P4 P5 P6 P1 0 P2 0.23 0 P3 0.22 0.14 0 P4 0.37 0.19 0.13 0 P5 0.34 0.14 0.28 0.23 0 P6 0.24 0.24 0.10 0.22 0.39 0 P1 P2 P3,P6 P4 P5 P1 0 P2 0.23 0 P3,P6 0.22 0.14 0 P4 0.37 0.19 0.13 0 P5 0.34 0.14 0.28 0.23 0 P1 P2 P3,P6 P4 P5 P1 0 P2 0.23 0 P3,P6 0.22 0.14 0 P4 0.37 0.19 0.13 0 P5 0.34 0.14 0.28 0.23 0 P1 P2 P3,P6,P4 P5 P1 0 P2 0.23 0 P3,P6,P4 0.22 0.14 0 P5 0.34 0.14 0.28 0
  • 80.
    Conti.. Now combined P2and P5 [{(P3, P6), P4},(P2,P5)] Now update the matrix and merge P2,P5,P3,P6 and P4 ([{(P3, P6), P4},(P2,P5)], P1) Now we have reached to the solution. 7/19/2024 80 Dr. Shivashankar, ISE, GAT P1 P2 P3,P6,P4 P5 P1 0 P2 0.23 0 P3,P6,P4 0.22 0.14 0 P5 0.34 0.14 0.28 0 P1 P2,P5 P3,P6,P4 P1 0 P2,P5 0.23 0 P3,P6,P4 0.22 0.14 0 P1 P2,P5 P3,P6,P4 P1 0 P2,P5 0.23 0 P3,P6,P4 0.22 0.14 0 P1 P2,P5,P3,P6,P 4 P1 0 P2,P5,P3,P6,P4 0.22 0
  • 81.
    Conti The dendrogram asper the solution is as follow P3 P6 P4 P2 P5 P1 Dendrogram of the cluster formed for the group P1,P2,P3,P4,P5 and P6. 7/19/2024 81 Dr. Shivashankar, ISE, GAT
  • 82.
    Conti.. Problem 3: Givena one dimensional dataset {1,5,8,10,2}, use the Agglomerative clustering algorithm with complete link with Euclidean distance to establish a hierarchical grouping relationship. By using the cutting threshold of 5, how many clusters are there? What is there membership in each group? Solution: Euclidean distance = 𝑥2 − 𝑥1 2 + 𝑦2 − 𝑦1 2 for 1 dimensional Euc-dist= 𝑥2 − 𝑥1 2 Apply 1D Euclidean distance to calculate the matrix 7/19/2024 82 Dr. Shivashankar, ISE, GAT 1 5 8 10 2 1 0 4 7 9 1 5 4 0 3 5 3 8 7 3 0 2 6 10 9 5 2 0 8 2 1 3 6 8 0 1 2 3 4 5 1 0 4 7 9 1 2 4 0 3 5 3 3 7 3 0 2 6 4 9 5 2 0 8 5 1 3 6 8 0
  • 83.
    Conti.. From the distancematrix, we can find distance between points 1 and 5 is smallest, i,e.2. Then merge {1,5}. Now recalculate the distance: d(2,{1,5}}=max{d(2,1), d(2,5)}=max(4,3)=4 d(3,{1,5}}=max{d(3,1), d(3,5)}=max(7,6)=7 d(4,{1,5}}=max{d(4,1), d(4,5)}=max(4,5)=9 From the matrix, the distance between points 3 and 4 is smallest , i.e.2 Hence they merge together as to form a cluster {3,4}. Using the complete link, we have the distance between different points/cluster as follows. d({1,5}, {3,4})=max{d({1,5},3), d ({1,5},4)}=max(7,9)=9 d(2, {3,4})=max{d(2,3), d (2,4)}=max(3,5)=5 Thus, we can update the distance matrix, where row 2 corresponds to point 2, row 1 and 3 corresponds to Cluster {1,5} and {3,4} as follows. 7/19/2024 83 Dr. Shivashankar, ISE, GAT 1,5 2 3 4 1,5 0 4 7 9 2 4 0 3 5 3 7 3 0 2 4 9 5 2 0 1,5 2 3,4 1,5 0 4 9 2 4 0 5 3,4 9 5 0
  • 84.
    Conti.. Following the sameprocedure, we merge pints 2 with the cluster {1,5} to form {1,2,5} and update the distance matrix as follows. After increase the distance threshold to 9, all clusters would merge. Fig 12: Dendogram for the given datasets 7/19/2024 84 Dr. Shivashankar, ISE, GAT [1,5],2 [3,4] [1,5],2 0 9 [3,4] 9 0
  • 85.
    Conti.. Problem 3: Giventhe data set {a,b,c,d,e} and following distance matrix. Construct a dendrogram by average linkage hierarchical clustering using the Agglomerative method. Solution: The average linkage clustering uses the average formula, i.e. distance between two clustering A & B d(A,B)=avg{d(a,y): x𝜖𝐴, 𝑦𝜖𝐵} d(A,B)= ∈𝑑 𝑥,𝑦 :x𝜖𝐴,𝑦𝜖𝐵 𝐴 𝐵 7/19/2024 85 Dr. Shivashankar, ISE, GAT a b c d e a 0 9 3 6 11 b 9 0 7 5 10 c 3 7 0 9 2 d 6 5 9 0 8 e 11 10 2 8 0
  • 86.
    Conti.. Dataset : {a,b,c,d,e} Initialclustering (Single to a sets) C1={a},{b},{c},{d},{e} From the table, the minimum distance is the distance between the clusters {c} and {e}. Also, d({c}:{e})=2 We merge {c} ad {e} to form the cluster {c,e} The new set of cluster C2 ={a},{b},{d},{c,e} 7/19/2024 86 Dr. Shivashankar, ISE, GAT a b c d e a 0 9 3 6 11 b 9 0 7 5 10 c 3 7 0 9 2 d 6 5 9 0 8 e 11 10 2 8 0 a b c,e d a 0 9 ? 6 b 9 0 ? 5 c,e ? ? 0 ? d 6 5 ? 0
  • 87.
    Conti.. Let us computethe distance of{c,e} from other clusters. d({c,e},{a})=avg{d(c,a),d(e,a)}= 3+11 2∗1 =7 d({c,e},{b})=avg{d(c,b),d(e,b)}= 7+10 2∗1 =8.5 d({c,e},{d})=avg{d(c,d),d(e,d)}= 9+8 2∗1 =7 Now update the table. From C2 table, the minimum distance is the distance between the cluster {d} and {b}. Also, d({b},{d})=5 We merge {b} and {d} to form the cluster {b,d} The new set of cluster, C3: {a},{c,e},{b,d} 7/19/2024 87 Dr. Shivashankar, ISE, GAT a b c,e d a 0 9 7 6 b 9 0 8.5 5 c,e 7 8.5 0 8.5 d 6 5 8.5 0
  • 88.
    Conti.. Let us computethe distance of {b,d} from other clusters. d({b,d},{a})=avg{d(b,a),d(d,a)} d({b,d},{a}) = 9+6 2∗1 = 7.5 d({b,d},{c,e}) =Avg{d(b,c): d(b,e),d(d,c),d(d,e)} d({b,d},{c,e})= 7+10+9+8 2∗2 = 8.5 7/19/2024 88 Dr. Shivashankar, ISE, GAT a b c,e d a 0 9 7 6 b 9 0 8.5 5 c,e 7 8.5 0 8.5 d 6 5 8.5 0 a b,d c,e A 0 ? 7 b,d ? 0 ? c,e 7 ? 0 a b,d c,e a 0 7.5 7 b,d 7.5 0 8.5 c,e 7 8.5 0
  • 89.
    Conti.. From the table,the minimum distance is the distance between the clusters {a} and {c,e} is 7. Also, d({a});{c,e})=7 We merge {a} and {b,d} to form the cluster {a,b,d} The new set of clusters C4: {a,c,e},{b,d} Let us compute the distance of {a,c,e}from other cluster. D({a,c,e}, {b,d})=Avg{d(a,b),d(a,d),d(c,b),d(c,d),d(e,b),d(e,d) D({a,c,e};{bd})= 9+6+7+9+10+8 3∗2 = 8.16 Fig 11: Dendogram for the dataset {a,b,c,d,e}. 7/19/2024 89 Dr. Shivashankar, ISE, GAT a,c,e b,d a,c,e 0 ? b,d ? 0 a,c,e b,d a,c,e 0 8.16 b,d 8.16 0
Divisive Clustering
• Divisive clustering is also a type of hierarchical clustering that is used to create clusters of data points.
• It is an unsupervised learning algorithm that begins by placing all the data points in a single cluster and then progressively splits the clusters until each data point is in its own cluster.
• It is useful for analyzing datasets that may have complex structures or patterns, as it can help identify clusters that may not be obvious at first glance.
• Divisive clustering works by first assigning all the data points to one cluster.
• Then, it looks for ways to split this cluster into two or more smaller clusters.
• This process continues until each data point is in its own cluster.
Cont…
Steps to Divisive Hierarchical Clustering
The algorithm for divisive hierarchical clustering involves several steps (a minimal sketch is given below the figure).
Step 1: Consider all objects as part of one big cluster.
Step 2: Split the big cluster into smaller clusters using any flat-clustering method, e.g. k-means.
Step 3: Select an object or subgroup to split into two smaller sub-clusters based on some distance metric such as Euclidean distance or correlation coefficients.
Step 4: The process continues recursively until each object forms its own cluster.
Fig. 12: Concept of Divisive Hierarchical Clustering
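To make Steps 1-4 concrete, here is a minimal, hedged sketch of divisive clustering implemented as bisecting k-means: keep taking the largest remaining cluster and splitting it with 2-means until every point stands alone. It is not from the slides; the sample data, parameters and function name are illustrative assumptions.

# Illustrative divisive (bisecting k-means) clustering sketch; names and data are my own.
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, min_size=1):
    """Step 1: start with one cluster holding every point, then repeatedly split
    the largest remaining cluster with 2-means (Steps 2-3) until every cluster
    has at most min_size points (Step 4). Assumes a cluster never consists of
    identical points, so 2-means always yields two non-empty sub-clusters."""
    clusters = [np.arange(len(X))]                    # clusters stored as index arrays
    while True:
        idx = max(range(len(clusters)), key=lambda k: len(clusters[k]))
        if len(clusters[idx]) <= min_size:            # nothing left to split
            return clusters
        target = clusters.pop(idx)                    # split the largest cluster
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[target])
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])

# Tiny made-up dataset, just to show the call.
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 9.2]])
for c in divisive_clustering(X):
    print(c)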
Cont…
Fig. 13: Differences between the Agglomerative and Divisive algorithms.
Conti..
1. The k-NN algorithm does more computation at test time than at training time.
   A) TRUE  B) FALSE
2. Which of the following distance metrics cannot be used in k-NN?
   A) Manhattan  B) Minkowski  C) Tanimoto  D) Jaccard  E) Mahalanobis  F) All can be used
3. Which of the following is true about the k-NN algorithm?
   A) It can be used for classification  B) It can be used for regression  C) It can be used for both classification and regression
4. Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
   A) k-NN  B) Linear Regression  C) Logistic Regression
5. What is the Euclidean distance between the two data points A(1,3) and B(2,3)?
   A) 1  B) 2  C) 4  D) 8
A. K-Means clustering comes under:
   1. Supervised Learning Algorithm  2. Unsupervised Learning Algorithm  3. Reinforcement Learning  4. None of the above
B. Which of the following is true for clustering?
   1. Clustering is a technique used to group similar objects into clusters.
   2. It partitions data into groups.
   3. It divides the entire data based on patterns in the data.
   4. All of the above
C. Which of the following is true for K-Means clustering?
   1. All data points in a cluster should be similar to each other.
   2. The data points from different clusters should be as different as possible.
   3. Both 1 and 2  4. Only 1  5. Only 2
D. Which of the following applications come under clustering?
   1. Customer Segmentation  2. Targeted Marketing  3. Recommendation Engines  4. Predicting the temperature
   5. Only 1, 2, 3 and 4  6. All of the above
E. What is intra-cluster distance?
   1. Distance between points in the cluster and its centroid
   2. Distance between each pair of points in the cluster
   3. Sum of squares of distances between points
   4. None of the above
Conti..
Q1. Movie recommendation systems are an example of:
    1. Classification  2. Clustering  3. Reinforcement Learning  4. Regression
    Options: A. 2 only  B. 1 and 2  C. 1 and 3  D. 2 and 3  E. 1, 2 and 3  F. 1, 2, 3 and 4
Q2. Sentiment analysis is an example of:
    1. Regression  2. Classification  3. Clustering  4. Reinforcement Learning
    Options: A. 1 only  B. 1 and 2  C. 1 and 3  D. 1, 2 and 3  E. 1, 2 and 4  F. 1, 2, 3 and 4
Conti..
Q3. Can decision trees be used for performing clustering?
    A. True  B. False
Q4. What is the minimum number of variables/features required to perform clustering?
    A. 0  B. 1  C. 2  D. 3
Q5. For two runs of K-Means clustering, is it expected to get the same clustering results?
    A. Yes  B. No
Q6. Which of the following clustering algorithms suffer from the problem of convergence at local optima?
    1. K-Means clustering algorithm  2. Agglomerative clustering algorithm  3. Expectation-Maximization clustering algorithm  4. Diverse clustering algorithm
    Options: A. 1 only  B. 2 and 3  C. 2 and 4  D. 1 and 3  E. 1, 2 and 4  F. All of the above