Clustering Algorithms: An Introduction
Classification
- A method of supervised learning.
- Learns a model for predicting an instance's class from pre-labeled (classified) instances.

Clustering
- A method of unsupervised learning.
- Finds a “natural” grouping of instances given unlabeled data.
Clustering Methods
Many different methods and algorithms:
- For numeric and/or symbolic data
- Deterministic vs. probabilistic
- Exclusive vs. overlapping
- Hierarchical vs. flat
- Top-down vs. bottom-up
Clusters: exclusive vs. overlapping
(Figure: points labeled a–k, shown once with exclusive cluster assignments and once with overlapping clusters.)
Example of an Outlier
(Figure: a scatter of points with one isolated point marked as an outlier.)
Methods of Clustering
- Hierarchical (agglomerative): initially, each point is a cluster by itself; repeatedly combine the two “nearest” clusters into one.
- Point assignment: maintain a set of clusters; place each point into its “nearest” cluster.
Hierarchical Clustering
- Bottom up: start with single-instance clusters; at each step, join the two closest clusters.
  - Design decision: the distance between clusters, e.g. the two closest instances in the clusters vs. the distance between their means.
- Top down: start with one universal cluster, find two clusters, and proceed recursively on each subset. Can be very fast.
- Both methods produce a dendrogram (see the sketch below).
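A minimal sketch of the bottom-up (agglomerative) variant using SciPy. The data points, the choice of single linkage, and the cut threshold are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0]])

# "single" linkage = distance between the two closest instances of each cluster;
# "centroid" linkage = distance between cluster means (the two options mentioned above).
Z = linkage(points, method="single")

# Cut the dendrogram to obtain flat clusters (here: merge only below distance 2.0).
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)        # e.g. [1 1 2 2 3]
# dendrogram(Z)      # would draw the merge tree
```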
Incremental Clustering
- Heuristic approach (COBWEB/CLASSIT): form a hierarchy of clusters incrementally.
- Start: the tree consists of an empty root node.
- Then: add instances one by one, updating the tree appropriately at each stage.
- To update, find the right leaf for the instance; this may involve restructuring the tree.
- Update decisions are based on category utility (one common formulation is given below).
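For reference, a common formulation of category utility for nominal attributes, as used by COBWEB (this formula is background knowledge, not taken from these slides):

```latex
CU(C_1,\dots,C_k) = \frac{1}{k}\sum_{l=1}^{k} P(C_l)
  \sum_{i}\sum_{j}\Big[P(a_i = v_{ij} \mid C_l)^2 - P(a_i = v_{ij})^2\Big]
```

Here $a_i = v_{ij}$ ranges over all attribute–value pairs; the measure rewards clusterings that make attribute values more predictable within a cluster than they are overall.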
And in the Non-Euclidean Case?
- The only “locations” we can talk about are the points themselves, i.e. there is no “average” of two points.
- Approach 1: clustroid = the point “closest” to the other points. Treat the clustroid as if it were a centroid when computing intercluster distances.
“Closest” Point?
Possible meanings:
- Smallest maximum distance to the other points.
- Smallest average distance to the other points.
- Smallest sum of squares of distances to the other points (used in the sketch below).
- Etc.
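A small sketch of choosing a clustroid in a non-Euclidean space, using the "smallest sum of squared distances" meaning. The edit-distance metric and the example strings are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: an example of a non-Euclidean metric on strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def clustroid(members):
    # The member minimizing the sum of squared distances to the other members.
    return min(members,
               key=lambda p: sum(edit_distance(p, q) ** 2 for q in members))

print(clustroid(["abcd", "abce", "abde", "wxyz"]))  # prints the most central member
```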
k-Means Algorithm(s)
- Assumes a Euclidean space.
- Start by picking k, the number of clusters.
- Initialize clusters by picking one starting point per cluster.
  - Example: pick one point at random, then k-1 other points, each as far away as possible from the previously chosen points (sketched below).
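A sketch of that initialization: pick one point at random, then repeatedly pick the point whose nearest already-chosen point is farthest away. The data, k, and the random seed are illustrative assumptions.

```python
import random
import numpy as np

def farthest_point_init(points: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    rng = random.Random(seed)
    chosen = [points[rng.randrange(len(points))]]       # first point at random
    while len(chosen) < k:
        # For each candidate, the distance to its nearest already-chosen point;
        # take the candidate for which that distance is largest.
        dists = [min(np.linalg.norm(p - c) for c in chosen) for p in points]
        chosen.append(points[int(np.argmax(dists))])
    return np.array(chosen)
```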
Populating Clusters
- Place each point in the cluster whose current centroid is nearest to it.
- After all points are assigned, recompute the centroids of the k clusters.
- Optional: reassign all points to their closest centroid; this sometimes moves points between clusters.
Simple Clustering: K-means
Works with numeric data only.
1. Pick a number K of cluster centers (at random).
2. Assign every item to its nearest cluster center (e.g. using Euclidean distance).
3. Move each cluster center to the mean of its assigned items.
4. Repeat steps 2 and 3 until convergence (change in cluster assignments falls below a threshold).
A minimal implementation is sketched below.
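A minimal sketch of the K-means loop just described, assuming numeric data in a NumPy array; the random initialization and the "no change in assignments" stopping rule are illustrative choices.

```python
import numpy as np

def kmeans(points: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K cluster centers at random from the data.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    assign = np.full(len(points), -1, dtype=int)
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned items.
        for j in range(k):
            if np.any(new_assign == j):
                centers[j] = points[new_assign == j].mean(axis=0)
        # Stop when cluster assignments no longer change (zero-change threshold).
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return centers, assign
```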
K-means Clustering Summary
- Advantages: simple and understandable; items are automatically assigned to clusters.
- Disadvantages: must pick the number of clusters beforehand; all items are forced into a cluster; too sensitive to outliers.
K-means Variations
- K-medoids: instead of the mean, use the median of each cluster.
  - Mean of 1, 3, 5, 7, 9 is 5.
  - Mean of 1, 3, 5, 7, 1009 is 205.
  - Median of 1, 3, 5, 7, 1009 is 5.
  - Median advantage: not affected by extreme values.
- For large databases, use sampling.
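A quick check of the numbers above: the mean is pulled far off by the extreme value 1009, while the median is unaffected.

```python
from statistics import mean, median

print(mean([1, 3, 5, 7, 9]))       # 5
print(mean([1, 3, 5, 7, 1009]))    # 205
print(median([1, 3, 5, 7, 1009]))  # 5
```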
Examples of Clustering Applications
- Marketing: discover customer groups and use them for targeted marketing and re-organization.
- Astronomy: find groups of similar stars and galaxies.
- Earthquake studies: observed earthquake epicenters should cluster along continental faults.
- Genomics: find groups of genes with similar expression.
- And many more.
Clustering Summary
- Clustering is unsupervised; there are many approaches.
- K-means: simple, sometimes useful.
- K-medoids: less sensitive to outliers.
- Hierarchical clustering: also works for symbolic attributes.
References
This presentation is compiled from: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed., The Morgan Kaufmann Series in Data Management Systems (Jim Gray, Series Editor), Morgan Kaufmann Publishers, March 2006. ISBN 1-55860-901-6.