Faculty of Computer Engineering
Seminar for the Master's Degree in
the Major of Artificial Intelligence and Robotics
Title
DATA CLUSTERING
Supervisor
Associate Professor Askar Poer
Advisor
Prof. …………
Researcher
Mohammed Ayoub Mamaseeni
Outline
 Introduction
 What is Data Clustering?
 Types of Clustering Algorithms
 K-Means Clustering
 Hierarchical Clustering
 DBSCAN Clustering
 Choosing the Right Clustering Algorithm
 Evaluating Clustering Performance
 Applications of Data Clustering
 Conclusion and Key Takeaways
Introduction to Data Clustering
Data clustering is a powerful technique in machine learning and data
analysis that groups similar data points together, revealing
underlying patterns and structures within complex datasets. This
provides valuable insights for a wide range of applications, from
customer segmentation to image recognition.
What is Data Clustering?
Data clustering is the process of grouping similar data points together into distinct clusters or
groups. The goal is to identify natural patterns and structures within complex datasets, enabling
deeper insights and better decision-making. By organizing data into meaningful clusters, analysts
can uncover hidden relationships and trends that may not be immediately apparent.
Types of Clustering Algorithms
1. Partitioning Algorithms: These divide data into k distinct
clusters, such as K-Means, which assigns each data point to the
nearest cluster center.
2. Hierarchical Algorithms: These build a hierarchy of clusters,
allowing analysis at different levels of granularity, like
Agglomerative and Divisive clustering.
3. Density-Based Algorithms: These identify clusters based on
the density of data points, like DBSCAN, which finds high-
density regions separated by low-density areas.
K-Means Clustering
K-Means is a popular partitioning clustering algorithm that groups
data points into k distinct clusters based on their similarity. It works
by iteratively assigning each data point to the nearest cluster centroid
and then recalculating the centroids until convergence.
The key advantages of K-Means are its simplicity, scalability, and the
ability to handle large datasets effectively. It is widely used in
customer segmentation, image segmentation, and anomaly
detection applications.
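To make the assign-and-recompute loop concrete, here is a minimal Python sketch using scikit-learn; the toy dataset and the choice k = 3 are illustrative assumptions, not values from this seminar.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset: 300 points drawn around three well-separated centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K-Means iteratively assigns each point to the nearest centroid and
# recomputes the centroids until the assignments stabilize.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroid coordinates
print(labels[:10])              # cluster index of the first ten points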
Hierarchical Clustering
Hierarchical clustering is a powerful technique that builds a hierarchy
of clusters, allowing analysis at different levels of granularity. It can
identify complex, nested structures within data by iteratively merging
or splitting clusters based on their proximity.
This approach is particularly useful when the number of clusters is
unknown or the data exhibits a clear hierarchical relationship.
Hierarchical methods include Agglomerative and Divisive clustering,
each with its own strengths and applications.
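A short agglomerative (bottom-up) sketch with SciPy follows; the Ward linkage and the cut at three clusters are illustrative choices.

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Agglomerative clustering: build the full merge hierarchy bottom-up,
# repeatedly joining the two closest clusters.
Z = linkage(X, method="ward")

# Cut the hierarchy at a chosen level of granularity, here 3 clusters;
# other cut levels yield coarser or finer groupings from the same tree.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)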
DBSCAN Clustering
Density-Based Clustering
DBSCAN is a density-based clustering algorithm that groups together data points that are close to each other based on density, identifying clusters of arbitrary shape and size.
Handling Outliers
One of the key advantages of DBSCAN is its ability to identify and handle outliers, which are data points that do not belong to any well-defined cluster.
Parameters and Considerations
The performance of DBSCAN depends on the selection of its two key parameters, epsilon (eps) and the minimum number of points (minPoints), which determine the density threshold for cluster formation.
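A minimal DBSCAN sketch in Python appears below; eps = 0.3 and min_samples = 5 are illustrative parameter choices, not recommended defaults.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters with some noise.
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

# eps sets the neighborhood radius; min_samples sets the density threshold.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# DBSCAN labels outliers as -1 instead of forcing them into a cluster.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("noise points:", np.sum(db.labels_ == -1))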
Choosing the Right Clustering Algorithm
Data Characteristics
Consider the size, dimensionality, and structure of your dataset. Different algorithms excel with specific data types and properties.
Cluster Shapes
K-Means works best for spherical clusters, while DBSCAN can handle arbitrary shapes. Hierarchical methods suit nested structures. A small comparison sketch follows below.
Noise Handling
DBSCAN can identify and isolate outliers, while K-Means is more sensitive to noise. Hierarchical methods have varied noise tolerance.
Computational Efficiency
K-Means is highly scalable, while DBSCAN and hierarchical methods can be more computationally intensive for large datasets.
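These trade-offs show up directly on non-spherical data. The sketch below compares three algorithms on a two-moons dataset; the dataset, the parameter values, and scoring with the adjusted Rand index against the known generating labels are all illustrative assumptions.

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=400, noise=0.06, random_state=1)

models = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=1),
    "DBSCAN": DBSCAN(eps=0.25, min_samples=5),
    "Agglomerative": AgglomerativeClustering(n_clusters=2, linkage="single"),
}

# The adjusted Rand index scores each labeling against the true grouping
# (1.0 = perfect agreement); K-Means typically lags on these curved shapes.
for name, model in models.items():
    labels = model.fit_predict(X)
    print(f"{name:14s} ARI = {adjusted_rand_score(y_true, labels):.2f}")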
Evaluating Clustering Performance
Assessing the quality and effectiveness of clustering models is crucial to ensure they deliver
meaningful insights. Several evaluation metrics can be used to measure clustering performance,
such as intra-cluster distance, inter-cluster distance, and silhouette score.
In a well-performing model, low intra-cluster distance and high inter-cluster distance indicate that the clusters are compact and well separated. The silhouette score, which measures how well each data point fits its assigned cluster, further validates the clustering quality.
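These metrics are straightforward to compute in practice. The sketch below scores a K-Means result with the silhouette score and the related Davies-Bouldin index; the dataset and k = 4 are illustrative choices.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)
labels = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)

# Silhouette combines intra- and inter-cluster distance per point:
# values near +1 mean points sit well inside their assigned cluster.
print("silhouette:", silhouette_score(X, labels))

# Davies-Bouldin compares within-cluster spread to between-cluster
# separation; lower values indicate compact, well-separated clusters.
print("davies-bouldin:", davies_bouldin_score(X, labels))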
Applications of Data Clustering
Customer Segmentation
Cluster customers based on their behaviors, preferences, and demographics to personalize marketing and improve user experiences.
Biomedical Research
Identify subgroups of patients with similar genetic profiles or disease characteristics to enable precision medicine.
Image Segmentation
Partition images into meaningful regions or objects, enabling applications like object detection and recognition.
Network Analysis
Cluster nodes in a network to uncover communities, detect anomalies, and understand complex relationships.
Related Studies
1- Two-pronged feature reduction in spectral clustering with optimized landmark selection
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=qNQSCOoAAAAJ&pagesize=80&citft=3&email_for_op=mahamad97ayoub%40gmail.com&authuser=1&citation_for_view=qNQSCOoAAAAJ:EUQCXRtRnyEC
The paper discusses a novel spectral clustering algorithm called BVA_LSC (Barnes-Hut t-SNE Variational Autoencoder Landmark-
based Spectral Clustering), which aims to improve the performance and efficiency of spectral clustering on high-dimensional
datasets. The key contributions and methods presented in the paper are as follows:
 Two-Pronged Feature Reduction:
- Barnes-Hut t-SNE: This method is used for dimensionality reduction, cutting the computational cost by shrinking the
similarity matrix used in spectral clustering. Barnes-Hut t-SNE is particularly effective for high-dimensional data.
- Variational Autoencoder (VAE): A deep learning technique used alongside Barnes-Hut t-SNE to capture non-linear
relationships in data and further reduce dimensionality.
 Adaptive Landmark Selection:
- K-harmonic means clustering: This algorithm is used initially to group data points and narrow down potential landmarks (a
subset of representative data points).
- Grey Wolf Optimization (GWO): An optimization algorithm inspired by the social hierarchy of grey wolves, which is used to
select the most effective landmarks based on a novel objective function. This selection process ensures that the landmarks are
evenly distributed across the dataset and represent the data well.
Related Studies
 Optimized Similarity Matrix:
- By reducing the number of features and carefully selecting landmarks, the algorithm decreases the size of the similarity
matrix, which reduces the computational burden during eigen decomposition—a critical step in spectral clustering.
 Dynamic Landmark Count Determination:
- The paper introduces a new equation to dynamically determine the optimal number of landmarks based on the dataset’s
features. This allows the algorithm to adapt to different datasets without requiring manual tuning.
 Experimental Validation:
- The algorithm was tested on several real-world datasets (e.g., MNIST, USPS, Fashion-MNIST) and compared against various
state-of-the-art spectral clustering methods. The results showed that BVA_LSC generally outperforms other methods in terms of
clustering accuracy (ACC) and normalized mutual information (NMI), particularly for complex and high-dimensional datasets.
 Computational Efficiency:
- While BVA_LSC demonstrates superior clustering performance, it does so at the cost of slightly higher computational time
compared to some of the other methods, especially as the number of landmarks increases.
 Overall, the paper introduces a robust and efficient spectral clustering method that leverages advanced feature reduction and
optimized landmark selection to tackle the challenges of high-dimensional data clustering. The approach balances accuracy
with computational efficiency, making it suitable for large-scale data analysis tasks.
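For orientation only, the sketch below illustrates the general landmark idea that BVA_LSC builds on: reduce dimensionality with Barnes-Hut t-SNE, pick a small set of representative landmarks, and form a points-by-landmarks similarity matrix instead of a full n-by-n one. It uses plain k-means centers as stand-in landmarks; the paper's VAE branch, Grey Wolf Optimization, and dynamic landmark-count equation are not reproduced here.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# Feature reduction with Barnes-Hut t-SNE (scikit-learn's
# method="barnes_hut" approximation, limited to <= 3 components).
X_low = TSNE(n_components=2, method="barnes_hut",
             random_state=0).fit_transform(X)

# Stand-in landmark selection: k-means centers as representative points
# (the paper instead optimizes this choice with GWO).
n_landmarks = 50
landmarks = KMeans(n_clusters=n_landmarks, n_init=10,
                   random_state=0).fit(X_low).cluster_centers_

# The point-to-landmark similarity matrix is (n_samples, n_landmarks),
# far smaller than (n, n), which is what cheapens the eigendecomposition.
dists = np.linalg.norm(X_low[:, None, :] - landmarks[None, :, :], axis=2)
similarity = np.exp(-dists**2 / (2 * dists.std() ** 2))
print(similarity.shape)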
Conclusion and Key Takeaways
Powerful Insights from Data
Clustering algorithms unlock hidden patterns and structures in complex data, enabling organizations to uncover valuable business insights.
Adaptable to Various Domains
From customer segmentation to image analysis, clustering techniques can be applied across a wide range of industries and use cases.
Importance of Algorithm Selection
Carefully choosing the right clustering algorithm based on data characteristics and business objectives is crucial for successful deployment.
Continuous Improvement
Evaluating clustering performance and iterating on models can lead to ongoing refinements and better decision-making support.