What’s in it for you?
Clustering
What is Clustering?
K-Means Clustering
Flowchart to understand K-Means Clustering
Clustering of cars based on brands
Logistic Regression
What is Logistic Regression?
Logistic Regression Curve & Sigmoid function
Classify whether a tumor is malignant or benign based on features
Clustering
Suppose we have a pile of books of different genres.
Now, we divide them into different groups, such as:
Fiction
Horror
Educational
Organizing objects into groups based on their similarity is clustering!
K-means Clustering
K-Means Clustering is an example of unsupervised learning.
It is used when you have unlabeled data, to find clusters in the data based on feature similarity.
Steps for K-Means
Suppose we have these data points and we want to assign them to clusters.
STEP 1: Initialize Cluster Centroids
We pick ‘K’ clusters and assign random centroids to the clusters. Then, we compute the distance from the objects to the centroids.
STEP 2: Compute Minimum Distance
Now, we form new clusters based on the minimum distance and calculate their centroids.
STEP 3: Assign Points to New Clusters
We repeat the previous two steps iteratively until the cluster centroids stop changing their positions and become static.
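The three steps can be sketched in a few lines of plain Python. This is only a minimal illustration under some assumptions, not a reference implementation: the function name `k_means`, the optional `init` argument, and the rule that an empty cluster keeps its old centroid are choices made here and do not come from the slides.

```python
import math
import random

def k_means(points, k, init=None, max_iter=100, seed=0):
    """Group `points` (a list of coordinate tuples) into k clusters.

    Returns (labels, centroids). `init` optionally supplies the starting
    centroids; otherwise k data points are drawn at random.
    """
    rng = random.Random(seed)
    # STEP 1: initialize the cluster centroids.
    centroids = [tuple(c) for c in (init or rng.sample(points, k))]
    labels = []
    for _ in range(max_iter):
        # STEP 2: assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Recompute each centroid as the mean of its members
        # (assumption: a cluster that ends up empty keeps its old centroid).
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])
        # STEP 3: repeat until the centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `k_means(points, 2)` on a list of `(x, y)` tuples returns one label per point plus the final centroids. Passing `init` makes the run repeatable, which helps when following a worked example by hand.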
Shall we see a flowchart to understand?
Flowchart to understand K-Means
START
Choose K (Elbow Method)
Assign random centroids to clusters
Compute distance from objects to centroids
Form new clusters based on minimum distance and calculate their centroids
Compute distance from objects to new centroids
Repeat until no observations change groups
Let’s see an example!
K-Means Algorithm
Suppose we have this dataset of 7 individuals and their scores on two topics (A and B):
Subject   A     B
1         1     1
2         1.5   2
3         3     4
4         5     7
5         3.5   5
6         4.5   5
7         3.5   4.5
Now, let’s take the two farthest-apart points (subjects 1 and 4) as the initial cluster centroids.
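One way to find the two farthest-apart points is to compare every pair of subjects. The sketch below assumes Euclidean distance; the names `scores` and `farthest_pair` are made up for this illustration.

```python
import math
from itertools import combinations

# Scores from the dataset above: subject -> (A, B).
scores = {
    1: (1, 1), 2: (1.5, 2), 3: (3, 4), 4: (5, 7),
    5: (3.5, 5), 6: (4.5, 5), 7: (3.5, 4.5),
}

def farthest_pair(data):
    """Return the two keys whose points are farthest apart (Euclidean)."""
    return max(
        combinations(data, 2),
        key=lambda pair: math.dist(data[pair[0]], data[pair[1]]),
    )

print(farthest_pair(scores))  # (1, 4)
```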
K-Means Algorithm
Each point is then assigned to the closest cluster, based on its distance from the centroids.
Now, we again calculate the centroid of each cluster:
            Individuals   Mean vector (centroid)
Cluster 1   1, 2, 3       (1.8, 2.3)
Cluster 2   4, 5, 6, 7    (4.1, 5.4)
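Each centroid is simply the per-topic mean of the cluster's members. A small sketch that recomputes the two mean vectors from the dataset (the helper name `centroid` is an assumption of this example):

```python
# Scores from the dataset above: subject -> (A, B).
scores = {
    1: (1, 1), 2: (1.5, 2), 3: (3, 4), 4: (5, 7),
    5: (3.5, 5), 6: (4.5, 5), 7: (3.5, 4.5),
}

def centroid(members):
    """Mean vector of the listed subjects."""
    pts = [scores[m] for m in members]
    return tuple(sum(axis) / len(pts) for axis in zip(*pts))

print(centroid([1, 2, 3]))     # approximately (1.83, 2.33)
print(centroid([4, 5, 6, 7]))  # (4.125, 5.375)
```

Rounded to one decimal place, these give the (1.8, 2.3) and (4.1, 5.4) shown on the slide.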
K-Means Algorithm
We compare each individual’s distance to its own cluster mean and to that of the opposite cluster, using the Euclidean distance between the points and the means. We find:
Individual   Distance to mean (centroid) of Cluster 1   Distance to mean (centroid) of Cluster 2
1            1.5                                        5.4
2            0.4                                        4.3
3            2.1                                        1.8
4            5.7                                        1.8
5            3.2                                        0.7
6            3.8                                        0.6
7            2.8                                        1.1
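The distances in this table can be reproduced with a short script. The sketch assumes the table was computed from the rounded centroids (1.8, 2.3) and (4.1, 5.4). That assumption reproduces every value on the slide, while the unrounded means would give slightly different last digits for some rows.

```python
import math

# Scores from the dataset above: subject -> (A, B).
scores = {
    1: (1, 1), 2: (1.5, 2), 3: (3, 4), 4: (5, 7),
    5: (3.5, 5), 6: (4.5, 5), 7: (3.5, 4.5),
}
# Assumption: rounded centroids, as printed on the slide.
centroids = {1: (1.8, 2.3), 2: (4.1, 5.4)}

def distance_table():
    """Map each subject to its distances to Cluster 1 and Cluster 2."""
    return {
        subject: tuple(
            round(math.dist(point, centroids[c]), 1) for c in (1, 2)
        )
        for subject, point in scores.items()
    }

for subject, (d1, d2) in distance_table().items():
    print(subject, d1, d2)
```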
K-Means Algorithm
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to that of its own (Cluster 1), so we move point 3 to the new cluster.
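The same check can be written as a comparison of each row's two distances against the current assignment. This is a small sketch using the distance values from the slide; `assignment`, `nearest_cluster`, and `movers` are names chosen for this example.

```python
# Distances from the slide: subject -> (to Cluster 1, to Cluster 2).
distances = {
    1: (1.5, 5.4), 2: (0.4, 4.3), 3: (2.1, 1.8), 4: (5.7, 1.8),
    5: (3.2, 0.7), 6: (3.8, 0.6), 7: (2.8, 1.1),
}
# Current partition: subjects 1-3 in Cluster 1, subjects 4-7 in Cluster 2.
assignment = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 2}

def nearest_cluster(subject):
    """Cluster whose centroid is closer to the subject."""
    d1, d2 = distances[subject]
    return 1 if d1 <= d2 else 2

movers = [s for s in distances if nearest_cluster(s) != assignment[s]]
print(movers)  # [3]
```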
K-Means Algorithm
Thus, individual 3 is relocated to Cluster 2, resulting in the new partition:
Cluster 1: 1, 2
Cluster 2: 3, 4, 5, 6, 7
K-Means Algorithm
For the new clusters, we find the actual cluster centroids:
            Individuals      Mean vector (centroid)
Cluster 1   1, 2             (1.25, 1.5)
Cluster 2   3, 4, 5, 6, 7    (3.9, 5.1)
K-Means Algorithm
On comparing each individual’s distance to its own cluster mean and to that of the opposite cluster, we find that the data points are stable; hence we have our final clusters!
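Stability can be confirmed by checking that every individual's nearest centroid is the centroid of the cluster it already belongs to. A minimal sketch, assuming the final centroids (1.25, 1.5) and (3.9, 5.1) and the partition in which individual 3 has moved to Cluster 2; the function name `is_stable` is made up here.

```python
import math

# Scores from the dataset above: subject -> (A, B).
scores = {
    1: (1, 1), 2: (1.5, 2), 3: (3, 4), 4: (5, 7),
    5: (3.5, 5), 6: (4.5, 5), 7: (3.5, 4.5),
}
centroids = {1: (1.25, 1.5), 2: (3.9, 5.1)}
assignment = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2}

def is_stable():
    """True if no subject is closer to the other cluster's centroid."""
    return all(
        min(centroids, key=lambda c: math.dist(scores[s], centroids[c]))
        == assignment[s]
        for s in scores
    )

print(is_stable())  # True
```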
K-Means Algorithm
To find the appropriate number of clusters in a dataset, we use the elbow method:
[Graph: WSS plotted against the number of clusters, with the elbow point marked]
Within sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid.
Finding the optimal number of clusters from the elbow of this graph is called the Elbow method.
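To make the WSS definition concrete, here is a small sketch that computes it for the 7-point example, once with everything in a single cluster and once with the final two clusters. The function name `wss` and the way the clusters are passed in (a list of point lists) are assumptions of this example.

```python
import math

def wss(clusters):
    """Within sum of squares: squared distance of each point to its cluster mean."""
    total = 0.0
    for members in clusters:
        centroid = tuple(sum(axis) / len(members) for axis in zip(*members))
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
one_cluster = [points]
two_clusters = [points[:2], points[2:]]  # subjects {1, 2} and {3, ..., 7}

print(round(wss(one_cluster), 2))   # about 37.1
print(round(wss(two_clusters), 2))  # about 8.5
```

Plotting such values for K = 1, 2, 3, ... gives the curve in which the elbow is read off; the sharp drop from one cluster to two is the kind of bend the method looks for.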
Use Case
Using K-means clustering to cluster cars into brands, using parameters such as horsepower, cubic inches, make year, etc.
Dataset: Cars data with information about 3 brands of cars, namely Toyota, Honda and Nissan
Logistic Regression
Now, let’s look into Logistic Regression
Logistic Regression
Logistic Regression is one of the simplest classification algorithms, used for binary or multi-class classification problems.
Logistic Regression
In the previous tutorial, we learnt about Linear Regression and about dependent and independent variables.
To brush up,
y = mx + c
The dependent variable (y) is the target class variable we are going to predict.
The independent variables (x1…xn) are the features or attributes we are going to use to predict the target class.
Logistic Regression
[Graph: Marks (0 to 100) plotted against No. of hours studied, with a fitted straight line]
We know what a linear regression looks like, but using this graph we cannot divide the outcome into categories.
For example, a linear regression graph can tell us that as the number of hours studied increases, the marks of a student will increase. But it will not tell us whether the student will pass or not!
In such cases, where we need the output as a categorical value, we will use logistic regression!
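Logistic Regression
[Graphs: the straight line of Marks (0 to 100) against No. of hours studied, and the sigmoid curve (0 to 1) against No. of hours studied]
Sigmoid Function
$$y = m x + c$$
$$p = \frac{1}{1 + e^{-y}}$$
$$\ln\left(\frac{p}{1 - p}\right) = m x + c$$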
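The sigmoid function and its inverse (the log-odds) are easy to try out numerically. The sketch below uses the names `sigmoid` and `logit`, which are chosen here for illustration, and checks that the log-odds of p recover the linear value y.

```python
import math

def sigmoid(y):
    """Squash any real value y into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-y))

def logit(p):
    """Log-odds of a probability p; the inverse of the sigmoid."""
    return math.log(p / (1 - p))

print(sigmoid(0))  # 0.5
print(math.isclose(logit(sigmoid(1.7)), 1.7))  # True
```

Large negative values of y map to probabilities near 0 and large positive values map to probabilities near 1, which is what gives the curve its S shape.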
Logistic Regression
[Graph: sigmoid curve with a vertical axis from 0 to 1.2 and a horizontal axis from 0 to 9; the values 0.30 and 0.82 are marked]
Threshold value
Probability > 0.50: the value is rounded off to 1, indicating that the student will pass.
Probability < 0.50: the value is rounded off to 0, indicating that the student will fail.
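The thresholding rule can be written as a one-line function. In this sketch the name `classify` is made up, and a probability of exactly 0.50 is treated as 0; that is an assumption, since the slide only covers values above and below the threshold.

```python
def classify(probability, threshold=0.50):
    """Return 1 (pass) if the probability exceeds the threshold, else 0 (fail)."""
    # Assumption: a probability exactly equal to the threshold counts as 0.
    return 1 if probability > threshold else 0

for p in (0.30, 0.82):
    print(p, "->", classify(p))
```

With the two example probabilities from the graph, 0.30 is classified as 0 (fail) and 0.82 as 1 (pass).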
Use Case
Problem statement: To classify whether a tumor is ‘malignant’ or ‘benign’
So, this model is able to predict the type of tumor with 91% accuracy!
Finally, let’s discuss the answers to the quiz asked in
Machine Learning Tutorial Part-1
for the instructor
What do you understand from Measures and Dimensions?
Each field from the data source is automatically assigned a
datatype (such as string, integer) and a role (dimension or
measure)
Aggregation applied on measures is ‘Sum’ by default but you
can always change the default aggregation in the settings
Can you tell what’s happening in the
following cases?
A. Grouping documents into different categories based on the
topic and content of each document
“This is an example of Clustering where K-means
clustering can be used to group the documents by
topics using bag-of-words approach”
B. Identifying hand-written digits in images correctly
“This is an example of Classification. The traditional
approach to solving this would be to extract digit-dependent
features, such as the curvature of different digits,
and then use a classifier like SVM to distinguish
between images”
C. Behavior of a website indicating that the site is not working
as designed
“This is an example of Anomaly Detection. In this case,
the algorithm learns what is "normal" and what is "not
normal", usually by observing the logs of the website”
D. Predicting the salary of an individual based on his/her years
of experience
“This is an example of Regression. This problem can
be mathematically defined as a function between the
independent variable (years of experience) and the dependent
variable (salary of an individual)”
Summary
What is K-Means
Elbow Method to choose K
Clustering cars with K-means
What is logistic regression
Classifying tumor with logistic regression
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners Part - 2 | Simplilearn
