Clustering Algorithms
Utkarsh Sharma
Asst. Prof. CSE Dept.
JUET Guna (M.P.) India
Mean Shift Clustering
• Mean shift clustering is a sliding-window-based algorithm that attempts to find dense areas of data points. It is a centroid-based algorithm, meaning that the goal is to locate the center point of each group/class; it works by updating candidates for center points to be the mean of the points within the sliding window. These candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of center points and their corresponding groups. (A minimal usage sketch follows.)
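For context, here is a minimal sketch of mean shift in practice using scikit-learn's MeanShift estimator; the synthetic blobs and the quantile value are illustrative assumptions, not values from the slides. The bandwidth parameter plays the role of the sliding-window radius r.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Three loose 2-D blobs as toy data (assumed, not from the slides)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 4], [0, 4])])

# bandwidth ~ the sliding-window radius r; estimated here from the data
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)

print("estimated centers:\n", ms.cluster_centers_)
print("number of clusters:", len(np.unique(labels)))
```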
Mean Shift Clustering Example
Mean Shift Clustering
1. To explain mean shift we will consider a set of points in two-dimensional space, as in the illustration above. We begin with a circular sliding window centered at a point C (randomly selected) and having radius r as the kernel. Mean shift is a hill-climbing algorithm that involves shifting this kernel iteratively to a higher-density region at each step until convergence.
2. At every iteration, the sliding window is shifted towards regions of higher density by moving the center point to the mean of the points within the window (hence the name). The density within the sliding window is proportional to the number of points inside it, so by shifting to the mean of the points in the window it gradually moves towards areas of higher point density.
Mean Shift Clustering
3. We continue shifting the sliding window according to the mean until there is no direction in which a shift can accommodate more points inside the kernel. Looking at the graphic above, we keep moving the circle until we are no longer increasing the density (i.e. the number of points in the window).
4. This process of steps 1 to 3 is carried out with many sliding windows until all points lie within a window. When multiple sliding windows overlap, the window containing the most points is preserved. The data points are then clustered according to the sliding window in which they reside. (A from-scratch sketch of the whole procedure follows.)
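As a rough illustration of steps 1 to 4, here is a from-scratch sketch using a flat (circular) kernel. The window radius r, the convergence tolerance, and the simplified merge rule (keep the first of any group of near-duplicate centers, rather than the most populated window) are illustrative assumptions.

```python
import numpy as np

def mean_shift(X, r, max_iter=100, tol=1e-4):
    """Flat-kernel mean shift: one sliding window started at every point."""
    centers = X.copy()
    for _ in range(max_iter):
        new_centers = []
        for c in centers:
            # step 2: points inside the circular window of radius r around c
            in_window = X[np.linalg.norm(X - c, axis=1) <= r]
            # shift the center to the mean of those points (keep c if empty)
            new_centers.append(in_window.mean(axis=0) if len(in_window) else c)
        new_centers = np.array(new_centers)
        # step 3: stop once no window moves appreciably
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    # step 4 (post-processing): merge near-duplicate windows
    merged = []
    for c in centers:
        if all(np.linalg.norm(c - m) >= r for m in merged):
            merged.append(c)
    merged = np.array(merged)
    # assign each point to its nearest surviving center
    dists = np.linalg.norm(X[:, None, :] - merged[None, :, :], axis=2)
    return merged, dists.argmin(axis=1)
```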
Mean Shift Clustering Example
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• DBSCAN is a density-based clustering algorithm similar to mean shift, but with a couple of notable advantages.
• DBSCAN begins with an arbitrary starting data point that has not been visited. The neighborhood of this point is extracted using a distance epsilon ε (all points within distance ε are neighborhood points).
• If there are a sufficient number of points (according to minPoints) within this neighborhood, then the clustering process starts and the current data point becomes the first point in the new cluster. Otherwise, the point is labeled as noise (later this noisy point might still become part of a cluster). In both cases the point is marked as “visited”. (A minimal usage sketch of these two parameters follows.)
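As a minimal sketch of these two parameters in practice, scikit-learn's DBSCAN exposes ε as eps and minPoints as min_samples; the synthetic data and the parameter values below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# One dense blob plus scattered points that should come out as noise
rng = np.random.default_rng(1)
dense = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(60, 2))
scattered = rng.uniform(low=-4, high=4, size=(10, 2))
X = np.vstack([dense, scattered])

# eps = ε neighborhood radius, min_samples = minPoints (values assumed)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points (label -1):", int(np.sum(labels == -1)))
```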
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• For this first point in the new cluster, the points within its ε-distance neighborhood also become part of the same cluster. This procedure of making all points in the ε neighborhood belong to the same cluster is then repeated for all of the new points that have just been added to the cluster group.
• This process of steps 2 and 3 is repeated until all points in the cluster are determined, i.e. all points within the ε neighborhood of the cluster have been visited and labeled.
• Once we are done with the current cluster, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise. This process repeats until all points are marked as visited. Since every point has then been visited, each point will have been marked as either belonging to a cluster or being noise. (A from-scratch sketch of the whole procedure follows.)
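Putting the steps together, here is a from-scratch sketch of the procedure; the helper name region_query and the label encoding (0 = unvisited, -1 = noise, positive integers = cluster ids) are illustrative choices, not a library API.

```python
import numpy as np

NOISE, UNVISITED = -1, 0

def dbscan(X, eps, min_points):
    n = len(X)
    labels = np.full(n, UNVISITED)
    cluster_id = 0

    def region_query(i):
        # indices of all points within ε of point i
        return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

    for i in range(n):
        if labels[i] != UNVISITED:
            continue                      # already assigned or marked noise
        neighbors = region_query(i)
        if len(neighbors) < min_points:
            labels[i] = NOISE             # may later be absorbed by a cluster
            continue
        cluster_id += 1                   # start a new cluster at this point
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                      # grow the cluster through ε links
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id    # border point reclaimed from noise
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_points:
                seeds.extend(j_neighbors) # j is dense enough: expand further
    return labels
```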
DBSCAN Example
Agglomerative Hierarchical Clustering
• Hierarchical clustering algorithms fall into two categories: top-down or bottom-up. Bottom-up algorithms treat each data point as a single cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all data points. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering, or HAC. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. (A dendrogram-building sketch follows.)
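As a sketch of how such a dendrogram is built and drawn, the snippet below uses SciPy's hierarchical-clustering utilities; the synthetic data is an illustrative assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Two small 2-D blobs as toy data (assumed)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(15, 2))
               for c in ([0, 0], [3, 3])])

# 'average' linkage = mean pairwise distance between two clusters
Z = linkage(X, method="average")
dendrogram(Z)   # leaves are single samples, the root is one big cluster
plt.show()
```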
Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering
• We begin by treating each data point as a single cluster, i.e. if there are X data points in our dataset then we have X clusters. We then select a distance metric that measures the distance between two clusters. As an example, we will use average linkage, which defines the distance between two clusters to be the average distance between data points in the first cluster and data points in the second cluster.
• On each iteration, we combine two clusters into one. The two clusters to be combined are selected as those with the smallest average linkage, i.e. according to our selected distance metric these two clusters have the smallest distance between each other and are therefore the most similar, so they should be combined.
• Step 2 is repeated until we reach the root of the tree, i.e. we have only one cluster containing all data points. In this way we can select how many clusters we want in the end, simply by choosing when to stop combining the clusters, i.e. when we stop building the tree! (A sketch of cutting the tree at a chosen cluster count follows.)
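To sketch that final point, cutting the tree at a chosen size, scikit-learn's AgglomerativeClustering with average linkage stops merging once the requested number of clusters remains; the data and the n_clusters value are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two small 2-D blobs as toy data (assumed)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 2))
               for c in ([0, 0], [3, 3])])

# Stop combining clusters once two remain (n_clusters is assumed)
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)
```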
Reference
• Source: https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68