Clustering Algorithms
Utkarsh Sharma
Asst. Prof. CSE Dept.
JUET Guna (M.P.) India
Mean Shift Clustering
• Mean shift clustering is a sliding-window-based algorithm that attempts to find dense areas of data points. It is a centroid-based algorithm, meaning that the goal is to locate the center point of each group/class; it works by updating candidates for center points to be the mean of the points within the sliding window. These candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of center points and their corresponding groups. (A minimal usage sketch follows.)
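For context, here is a minimal sketch of mean shift in practice using scikit-learn's MeanShift estimator; the synthetic blobs and the quantile value are illustrative assumptions, not values from the slides. The bandwidth parameter plays the role of the sliding-window radius r.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Three loose 2-D blobs as toy data (assumed, not from the slides)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 4], [0, 4])])

# bandwidth ~ the sliding-window radius r; estimated here from the data
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)

print("estimated centers:\n", ms.cluster_centers_)
print("number of clusters:", len(np.unique(labels)))
```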
Mean Shift Clustering Example
Mean Shift Clustering
1. To explain mean shift we will consider a set of points in two-dimensional space, as in the illustration above. We begin with a circular sliding window centered at a point C (randomly selected) and having radius r as the kernel. Mean shift is a hill-climbing algorithm that involves shifting this kernel iteratively to a higher-density region at each step until convergence.
2. At every iteration, the sliding window is shifted towards regions of higher density by moving the center point to the mean of the points within the window (hence the name). The density within the sliding window is proportional to the number of points inside it, so by shifting to the mean of the points in the window it gradually moves towards areas of higher point density.
Mean Shift Clustering
3. We continue shifting the sliding window according to the mean until there is no direction in which a shift can accommodate more points inside the kernel. Looking at the graphic above, we keep moving the circle until we are no longer increasing the density (i.e. the number of points in the window).
4. This process of steps 1 to 3 is carried out with many sliding windows until all points lie within a window. When multiple sliding windows overlap, the window containing the most points is preserved. The data points are then clustered according to the sliding window in which they reside. (A from-scratch sketch of the whole procedure follows.)
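As a rough illustration of steps 1 to 4, here is a from-scratch sketch using a flat (circular) kernel. The window radius r, the convergence tolerance, and the simplified merge rule (keep the first of any group of near-duplicate centers, rather than the most populated window) are illustrative assumptions.

```python
import numpy as np

def mean_shift(X, r, max_iter=100, tol=1e-4):
    """Flat-kernel mean shift: one sliding window started at every point."""
    centers = X.copy()
    for _ in range(max_iter):
        new_centers = []
        for c in centers:
            # step 2: points inside the circular window of radius r around c
            in_window = X[np.linalg.norm(X - c, axis=1) <= r]
            # shift the center to the mean of those points (keep c if empty)
            new_centers.append(in_window.mean(axis=0) if len(in_window) else c)
        new_centers = np.array(new_centers)
        # step 3: stop once no window moves appreciably
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    # step 4 (post-processing): merge near-duplicate windows
    merged = []
    for c in centers:
        if all(np.linalg.norm(c - m) >= r for m in merged):
            merged.append(c)
    merged = np.array(merged)
    # assign each point to its nearest surviving center
    dists = np.linalg.norm(X[:, None, :] - merged[None, :, :], axis=2)
    return merged, dists.argmin(axis=1)
```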
Mean Shift Clustering Example
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• DBSCAN is a density-based clustering algorithm similar to mean shift, but with a couple of notable advantages.
• DBSCAN begins with an arbitrary starting data point that has not been visited. The neighborhood of this point is extracted using a distance epsilon ε (all points within distance ε are neighborhood points).
• If there are a sufficient number of points (according to minPoints) within this neighborhood, then the clustering process starts and the current data point becomes the first point in the new cluster. Otherwise, the point is labeled as noise (later this noisy point might still become part of a cluster). In both cases the point is marked as “visited”. (A minimal usage sketch of these two parameters follows.)
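As a minimal sketch of these two parameters in practice, scikit-learn's DBSCAN exposes ε as eps and minPoints as min_samples; the synthetic data and the parameter values below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# One dense blob plus scattered points that should come out as noise
rng = np.random.default_rng(1)
dense = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(60, 2))
scattered = rng.uniform(low=-4, high=4, size=(10, 2))
X = np.vstack([dense, scattered])

# eps = ε neighborhood radius, min_samples = minPoints (values assumed)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points (label -1):", int(np.sum(labels == -1)))
```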
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• For this first point in the new cluster, the points within its ε-distance neighborhood also become part of the same cluster. This procedure of making all points in the ε neighborhood belong to the same cluster is then repeated for all of the new points that have just been added to the cluster group.
• This process of steps 2 and 3 is repeated until all points in the cluster are determined, i.e. all points within the ε neighborhood of the cluster have been visited and labeled.
• Once we are done with the current cluster, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise. This process repeats until all points are marked as visited. Since every point has then been visited, each point will have been marked as either belonging to a cluster or being noise. (A from-scratch sketch of the whole procedure follows.)
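Putting the steps together, here is a from-scratch sketch of the procedure; the helper name region_query and the label encoding (0 = unvisited, -1 = noise, positive integers = cluster ids) are illustrative choices, not a library API.

```python
import numpy as np

NOISE, UNVISITED = -1, 0

def dbscan(X, eps, min_points):
    n = len(X)
    labels = np.full(n, UNVISITED)
    cluster_id = 0

    def region_query(i):
        # indices of all points within ε of point i
        return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

    for i in range(n):
        if labels[i] != UNVISITED:
            continue                      # already assigned or marked noise
        neighbors = region_query(i)
        if len(neighbors) < min_points:
            labels[i] = NOISE             # may later be absorbed by a cluster
            continue
        cluster_id += 1                   # start a new cluster at this point
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                      # grow the cluster through ε links
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id    # border point reclaimed from noise
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_points:
                seeds.extend(j_neighbors) # j is dense enough: expand further
    return labels
```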
DBSCAN Example
Agglomerative Hierarchical Clustering
• Hierarchical clustering algorithms fall into two categories: top-down or bottom-up. Bottom-up algorithms treat each data point as a single cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all data points. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering, or HAC. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. (A dendrogram-building sketch follows.)
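As a sketch of how such a dendrogram is built and drawn, the snippet below uses SciPy's hierarchical-clustering utilities; the synthetic data is an illustrative assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Two small 2-D blobs as toy data (assumed)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(15, 2))
               for c in ([0, 0], [3, 3])])

# 'average' linkage = mean pairwise distance between two clusters
Z = linkage(X, method="average")
dendrogram(Z)   # leaves are single samples, the root is one big cluster
plt.show()
```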
Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering
• We begin by treating each data point as a single cluster, i.e. if there are X data points in our dataset then we have X clusters. We then select a distance metric that measures the distance between two clusters. As an example, we will use average linkage, which defines the distance between two clusters to be the average distance between data points in the first cluster and data points in the second cluster.
• On each iteration, we combine two clusters into one. The two clusters to be combined are selected as those with the smallest average linkage, i.e. according to our selected distance metric these two clusters have the smallest distance between each other and are therefore the most similar, so they should be combined.
• Step 2 is repeated until we reach the root of the tree, i.e. we have only one cluster containing all data points. In this way we can select how many clusters we want in the end, simply by choosing when to stop combining the clusters, i.e. when we stop building the tree! (A sketch of cutting the tree at a chosen cluster count follows.)
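To sketch that final point, cutting the tree at a chosen size, scikit-learn's AgglomerativeClustering with average linkage stops merging once the requested number of clusters remains; the data and the n_clusters value are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two small 2-D blobs as toy data (assumed)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 2))
               for c in ([0, 0], [3, 3])])

# Stop combining clusters once two remain (n_clusters is assumed)
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)
```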
Reference
• Source: https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68