Support Vector Machine (SVM)
Introduction to Support Vector Machine (SVM)
• Support Vector Machines are supervised machine learning models used for both classification and regression problems.
• Vladimir Vapnik developed the Support Vector Machine (SVM) at AT&T Bell Laboratories during 1995-1998.
• SVMs are among the most widely used models and are grounded in the statistical learning framework of the Vapnik-Chervonenkis (VC) dimension.
• The main objective of SVM is to find the optimal hyperplane that linearly separates the data points into two classes by maximizing the margin.
• A data point is viewed as a p-dimensional vector (p features), and we want to know whether such points can be separated by a (p-1)-dimensional hyperplane. This is called a linear classifier.
• There are infinitely many hyperplanes that can separate the data points of two linearly separable classes.
• The main objective is to choose the best hyperplane that represents
the largest separation or margin between two classes.
• Hence, studying SVM means choosing an optimal hyperplane so that the distance from it to the nearest data point on each side is maximized.
• Intuitively, the good separation achieved by the optimal hyperplane lowers the generalization error of the classifier.
• The SVM can provide good generalization performance (low test error) on test-set classification without incorporating problem-domain knowledge. This property is unique to support vector machines.
• The key idea of the SVM learning algorithm is to identify, from the training set, the support vectors of both classes that define the maximum margin.
• The support vectors form a small subset of the training data points, extracted by the learning algorithm.
• The study of SVM begins with the linearly separable binary classification problem. For classifying non-linearly separable data, it is necessary to map the data points to a higher-dimensional space. The mapping scheme of SVM is designed so that the dot products (kernel function) between support vectors and vectors drawn from the input space can be computed efficiently.
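As a quick illustration of these ideas (not part of the original slides), the following sketch trains a linear SVM with scikit-learn on a hypothetical toy dataset and inspects the support vectors the learner extracts; the data, parameter values, and variable names are assumptions chosen only for demonstration.

```python
# Minimal sketch: fit a linear SVM and inspect the support vectors it selects.
import numpy as np
from sklearn.svm import SVC

# Tiny, assumed linearly separable toy data: two classes in a 2-D feature space.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates a hard margin
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("weight vector w:", clf.coef_[0])
print("bias w0:", clf.intercept_[0])
print("prediction for [4, 4]:", clf.predict([[4.0, 4.0]]))
```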
Linearly vs. Nonlinearly Separable Data
Selection of Optimal Hyperplane
SVM Uses Kernel Functions for Implicit Nonlinear Mapping
Binary Linearly Separable Classification
• To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e., the maximum distance between data points of both classes.
• Support vectors are the data points that lie closest to the hyperplane and influence its position and orientation.
• Using these support vectors, we maximize the margin of the classifier.
• Deleting a support vector changes the position of the hyperplane, as the sketch below illustrates.
• These are the points that help us build an SVM that generalizes to new/test data.
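The following sketch (an assumed toy example, again using scikit-learn) checks this behaviour: refitting after dropping a non-support vector leaves the hyperplane essentially unchanged, while dropping a support vector typically moves it.

```python
# Refit the SVM with one training point removed and see whether the hyperplane moves.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.5],
              [5.0, 5.0], [6.0, 4.0], [5.5, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

base = SVC(kernel="linear", C=1e6).fit(X, y)
sv_idx = set(base.support_)                      # indices of the support vectors

for i in range(len(X)):
    mask = np.arange(len(X)) != i                # drop point i and refit
    refit = SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])
    moved = not np.allclose(refit.coef_, base.coef_, atol=1e-4)
    print(f"dropped point {i} ({'SV' if i in sv_idx else 'non-SV'}): hyperplane moved = {moved}")
```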
[Figure: maximum-margin hyperplane with two test data points, one predicted as square and one as circle]
Applications
• SVM is used for text classification tasks such as category assignment, spam detection, and sentiment analysis.
• It is also commonly used for image recognition tasks, performing particularly well in aspect-based recognition and color-based classification.
• SVM also plays a vital role in many areas of handwritten digit
recognition, such as postal automation services.
Derivation of Support Vector Equation
Hard Margin Vs Soft Margin
What is Vector Projection?
• Vector projection is a method of finding the component of a vector along the direction of a second vector.
• By projecting one vector onto another, we obtain a vector that represents the component of the first vector along the direction of the second.
• The scalar projection gives the magnitude of one vector along the direction of another; it can be positive or negative.
• The projection is negative when the angle between the two vectors is greater than 90 degrees, indicating that the projection points in the direction opposite to the base vector.
Properties of Dot Product
• The (scalar) projection of $\mathbf{a}$ on $\mathbf{b}$ is given as $\dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{b}\rVert}$
• The (scalar) projection of $\mathbf{b}$ on $\mathbf{a}$ is given as $\dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert}$
• Inequalities based on the dot product:
• Cauchy–Schwarz inequality: for any two vectors $\mathbf{a}$ and $\mathbf{b}$, the magnitude of the dot product is at most the product of the magnitudes of the two vectors, that is, $|\mathbf{a}\cdot\mathbf{b}| \le \lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert$
• Triangle inequality: for any two vectors $\mathbf{a}$ and $\mathbf{b}$, we always have $\lVert\mathbf{a}+\mathbf{b}\rVert \le \lVert\mathbf{a}\rVert + \lVert\mathbf{b}\rVert$
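A small numerical check of the projection formula and the two inequalities, written with NumPy on assumed example vectors:

```python
# Illustrative check of scalar/vector projection, Cauchy-Schwarz, and the triangle inequality.
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

scalar_proj_a_on_b = a.dot(b) / np.linalg.norm(b)      # component of a along b
vector_proj_a_on_b = (a.dot(b) / b.dot(b)) * b         # that component as a vector

print("scalar projection of a on b:", scalar_proj_a_on_b)   # 3.0
print("vector projection of a on b:", vector_proj_a_on_b)   # [3. 0.]

# Cauchy-Schwarz: |a.b| <= ||a|| ||b||
assert abs(a.dot(b)) <= np.linalg.norm(a) * np.linalg.norm(b)
# Triangle inequality: ||a + b|| <= ||a|| + ||b||
assert np.linalg.norm(a + b) <= np.linalg.norm(a) + np.linalg.norm(b)
```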
Linear decision boundary between two classes
• Geometry for $n = 2$ with $w_1 > 0$, $w_2 > 0$ and $w_0 < 0$ is shown in the figure below.
• The origin is on the negative side of ℋ if $w_0 < 0$; if $w_0 > 0$, the origin is on the positive side of ℋ.
• If $w_0 = 0$, the hyperplane passes through the origin.
The location of any point $\mathbf{x}$ may be considered relative to ℋ. Defining $\mathbf{x}_p$ as the normal projection of $\mathbf{x}$ onto ℋ, we can write
$$\mathbf{x} = \mathbf{x}_p + r\,\frac{\mathbf{w}}{\lVert\mathbf{w}\rVert}$$
where $\lVert\mathbf{w}\rVert$ is the Euclidean norm of $\mathbf{w}$, $\frac{\mathbf{w}}{\lVert\mathbf{w}\rVert}$ is a unit vector, and $r$ is the algebraic measure of the distance from $\mathbf{x}$ to the hyperplane.
Since $g(\mathbf{x}_p) = 0$, it follows that $g(\mathbf{x}) = r\,\lVert\mathbf{w}\rVert$, i.e., $r = \dfrac{g(\mathbf{x})}{\lVert\mathbf{w}\rVert}$, so $g(\mathbf{x})$ is a measure of the Euclidean distance of the point $\mathbf{x}$ from the decision hyperplane ℋ:
$$g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 \;\begin{cases} > 0 & \text{if } \mathbf{x} \in \mathcal{H}^+ \\ = 0 & \text{if } \mathbf{x} \in \mathcal{H} \\ < 0 & \text{if } \mathbf{x} \in \mathcal{H}^- \end{cases}$$
The perpendicular distance $d$ from the coordinate origin to ℋ is $d = \dfrac{|w_0|}{\lVert\mathbf{w}\rVert}$.
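The following sketch evaluates $g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$ and the signed distance $r = g(\mathbf{x})/\lVert\mathbf{w}\rVert$ for an assumed 2-D hyperplane; the values of w and w0 are illustrative only.

```python
# Signed Euclidean distance of a point from the hyperplane w^T x + w0 = 0.
import numpy as np

w = np.array([2.0, 1.0])    # normal vector of the assumed hyperplane
w0 = -4.0                   # bias term (w0 < 0, so the origin lies on the negative side)

def signed_distance(x, w, w0):
    g = w.dot(x) + w0                 # algebraic measure g(x)
    return g / np.linalg.norm(w)      # signed Euclidean distance r

print(signed_distance(np.array([3.0, 1.0]), w, w0))   # > 0: point lies in H+
print(signed_distance(np.array([0.0, 0.0]), w, w0))   # < 0: origin lies in H-
print(abs(w0) / np.linalg.norm(w))                    # perpendicular distance of the origin to H
```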
Scalar Projection
Derive the width of the margin
Hyperplane ℋ separates the feature space into two half-spaces ℋ+ and ℋ−
Geometry for 3-dimensions (n=3)
Linear Maximal Margin Classifier for Linearly Separable Data
• For linearly separable data, many hyperplanes exist that perform the separation.
• The SVM framework tells us which hyperplane is best.
• It is the hyperplane with the largest margin, which is expected to minimize test error.
• Select the decision boundary that is far away from both classes.
• Large-margin separation is expected to yield good generalization.
• In $\mathbf{w}^T\mathbf{x} + w_0 = 0$, $\mathbf{w}$ defines a direction perpendicular to the hyperplane.
• $\mathbf{w}$ is called the normal vector (or simply the normal) of the hyperplane.
• Without changing the normal vector $\mathbf{w}$, varying $w_0$ moves the hyperplane parallel to itself.
Large-margin and small-margin separation
[Figure: two test data points, one predicted as square and one as circle]
Geometric interpretation of the algebraic distances of points to a hyperplane for the two-dimensional case
Two parallel hyperplanes ℋ1 and ℋ2 pass through $\mathbf{x}^{(i)}$ and $\mathbf{x}^{(k)}$ respectively; ℋ divides the input space into two half-spaces. We can rescale $\mathbf{w}$ and $w_0$ to obtain ℋ1 and ℋ2, both parallel to the hyperplane $\mathbf{w}^T\mathbf{x} + w_0 = 0$:
$$\mathcal{H}_1:\ \mathbf{w}^T\mathbf{x} + w_0 = +1, \qquad \mathcal{H}_2:\ \mathbf{w}^T\mathbf{x} + w_0 = -1$$
such that
$$\mathbf{w}^T\mathbf{x}^{(i)} + w_0 \ge +1 \ \text{ if } y^{(i)} = +1, \qquad \mathbf{w}^T\mathbf{x}^{(i)} + w_0 \le -1 \ \text{ if } y^{(i)} = -1,$$
or equivalently,
$$y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) \ge 1.$$
The distance between the two hyperplanes is the margin
$$M = \frac{2}{\lVert\mathbf{w}\rVert}.$$
This equation states that maximizing the margin of separation between
classes is equivalent to minimizing the Euclidean norm of the weight vector w.
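As an illustration (assumed toy data, not from the slides), the margin $M = 2/\lVert\mathbf{w}\rVert$ can be read off a fitted linear SVM; a very large C is used here to approximate the hard margin.

```python
# Compute the geometric margin M = 2 / ||w|| from a fitted linear SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)
print("w =", w, " margin M =", margin)   # expected margin ~ 2*sqrt(2) for this toy data
```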
The maximization objective $M = 2/\lVert\mathbf{w}\rVert$ is a non-convex function
• Hence, our goal is simply to find the maximum margin M. The optimal hyperplane, defined by the equation $\mathbf{w}_o^T\mathbf{x} + b_o = 0$, is unique in the sense that the optimum weight vector $\mathbf{w}_o$ provides the maximum possible separation between positive and negative data points. This optimal condition is obtained by minimizing the Euclidean norm of the weight vector $\mathbf{w}$.
• Quadratic optimization for finding the optimal hyperplane
• Problem: maximize the margin $M = \dfrac{2}{\lVert\mathbf{w}\rVert}$ subject to the constraints $y_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) \ge 1$ for all $i = 1, 2, \ldots, N$; equivalently, find the weight vector that minimizes the cost function
$$\Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} = \frac{1}{2}\lVert\mathbf{w}\rVert^2.$$
• This constrained optimization problem is called the primal problem.
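A minimal sketch of the primal problem as a convex program, assuming the cvxpy package and an illustrative toy dataset; it is meant only to mirror the formulation above, not the solver used by SVM libraries in practice.

```python
# Primal hard-margin SVM: minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1.
import cvxpy as cp
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [5.0, 5.0]])   # assumed toy data
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]

problem = cp.Problem(objective, constraints)
problem.solve()

print("w =", w.value, " b =", b.value)
```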
The primal problem is a convex optimization problem
Primal problem
• The primal problem is characterized as follows:
• The cost function Φ(w) is a convex function of w.
• The constraints are linear in w.
• We can solve this constrained optimization problem using the method of Lagrange multipliers.
• We represent the cost function as a Lagrange function using Lagrange multipliers as follows:
$$J(\mathbf{w}, b, \alpha) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N} \alpha_i\left[d_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) - 1\right] \qquad (1)$$
where the auxiliary nonnegative variables $\alpha_i$ are called Lagrange multipliers and $d_i$ denotes the class label of $\mathbf{x}_i$.
• The solution to the constrained optimization problem is determined by the saddle point of the Lagrange function $J(\mathbf{w}, b, \alpha)$, which has to be minimized with respect to $\mathbf{w}$ and $b$ and maximized with respect to $\alpha$.
• Differentiating $J(\mathbf{w}, b, \alpha)$ with respect to $\mathbf{w}$ and $b$ and setting the results equal to zero, we get the following two conditions of optimality:
$$\text{Condition 1:}\quad \frac{\partial J(\mathbf{w}, b, \alpha)}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \alpha_i d_i \mathbf{x}_i$$
$$\text{Condition 2:}\quad \frac{\partial J(\mathbf{w}, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i d_i = 0$$
• Applying optimality conditions 1 and 2 to the Lagrange function yields the dual problem below.
Dual Problem
• After differentiation, we obtained the two expressions above for $\mathbf{w}$ and $\sum_i \alpha_i d_i$.
• Substituting these two expressions into equation (1), we get:
$$J(\mathbf{w}, b, \alpha) = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j \mathbf{x}_i^T\mathbf{x}_j - \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j \mathbf{x}_i^T\mathbf{x}_j + \sum_{i=1}^{N} \alpha_i \qquad (2)$$
• Rearranging equation (2):
$$Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j \mathbf{x}_i^T\mathbf{x}_j \qquad (3)$$
• In equation (3), it is observed that the quadratic programming problem depends only on the Lagrange multipliers α and no longer on w and b.
• Equation (3) represents the dual problem. The final dual problem can be stated as follows:
$$\max_{\alpha}\; Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j \mathbf{x}_i^T\mathbf{x}_j \quad \text{subject to } \sum_{i=1}^{N} \alpha_i d_i = 0,\ \ \alpha_i \ge 0 \ \forall i \qquad (4)$$
• Equation (4) is a quadratic programming problem for maximization, where the $\alpha_i$ are nonnegative.
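For concreteness, the dual objective $Q(\alpha)$ of equations (3)-(4) can be written as a short NumPy helper; the data and multiplier values below are assumptions chosen only for illustration.

```python
# Dual objective Q(alpha), using a precomputed Gram matrix K[i, j] = x_i^T x_j and labels d.
import numpy as np

def dual_objective(alpha, d, K):
    # Q(alpha) = sum_i alpha_i - 0.5 * sum_i sum_j alpha_i alpha_j d_i d_j x_i^T x_j
    return np.sum(alpha) - 0.5 * (alpha * d) @ K @ (alpha * d)

# Assumed two-point example (both points are support vectors).
X = np.array([[1.0, 1.0], [3.0, 3.0]])
d = np.array([-1.0, 1.0])
K = X @ X.T

alpha = np.array([0.25, 0.25])          # satisfies sum_i alpha_i d_i = 0
print(dual_objective(alpha, d, K))
```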
To compute the optimal bias $b_o$, we may use the optimal $\mathbf{w}_o$ together with a support vector on the positive hyperplane, for which $g(\mathbf{x}) = 1$, to obtain $b_o$.
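A sketch of this recovery step, assuming the optimal multipliers are already known for the two-point toy problem above: the weight vector follows from condition 1 and the bias from a support vector on the positive hyperplane.

```python
# Recover w_o = sum_i alpha_i d_i x_i and b_o from a support vector with g(x_s) = +1.
import numpy as np

X = np.array([[1.0, 1.0], [3.0, 3.0]])      # assumed toy points (both are support vectors)
d = np.array([-1.0, 1.0])
alpha = np.array([0.25, 0.25])              # assumed optimal multipliers

w_o = (alpha * d) @ X                       # weight vector from the dual variables
x_s = X[d == 1][0]                          # a support vector with label +1
b_o = 1.0 - w_o.dot(x_s)                    # enforce w_o^T x_s + b_o = +1

print("w_o =", w_o, " b_o =", b_o)
```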
This formulation works well when the data points are linearly separable, which is why this default version is also called the hard-margin SVM. But, as we can imagine, it is very unlikely to find datasets that are completely separable in real life. Therefore, we almost always use a variation that extends SVMs to scenarios where some points may be misclassified, known as the soft-margin SVM.
Soft Margin SVM
• The distance from the hyperplane to the closest points of each class is called the margin. We would like to choose a hyperplane that maximizes the margin between classes.
• Soft margin:
• As most real-world data are not fully linearly separable, we allow some margin violations to occur; this is called soft-margin classification. It is better to have a large margin, even though some constraints are violated.
• Margin violation means choosing a hyperplane that allows some data points to lie either on the incorrect side of the hyperplane or between the margin boundary and the hyperplane on the correct side.
Hard Margin Vs. Soft Margin
Soft Margin
• What the soft margin does is:
• it tolerates a few points being misclassified;
• it balances the trade-off between finding a line that maximizes the margin and one that minimizes the misclassifications.
• Two types of misclassification can happen:
• the point is on the wrong side of the decision boundary but on the correct side of the margin boundary (i.e., within the margin);
• the point is on the wrong side of the decision boundary and on the wrong side of the margin boundary.
• In either case, the support vector machine tolerates those points when it tries to find the linear decision boundary; a brief sketch of this trade-off follows below.
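The sketch below (assumed non-separable toy data, using scikit-learn's C parameter) shows the trade-off: a smaller C tolerates more margin violations and yields a wider margin, while a larger C penalizes violations more heavily.

```python
# Soft-margin trade-off: vary C and observe the resulting margin and support-vector count.
import numpy as np
from sklearn.svm import SVC

# The last +1 point lies inside the -1 cluster, so the classes are not linearly separable.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [1.6, 1.6]])
y = np.array([-1, -1, -1, 1, 1, 1])

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:6.1f}  margin={2 / np.linalg.norm(w):.3f}  "
          f"n_support={clf.n_support_.sum()}")
```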
Slack Variable (𝜉)
Optimal Hyperplane for Non-Separable Data
