200 likes | 212 Views
Learn about clustering, including its definition, similarity measures, and methods for determining the number of clusters. Explore algorithms such as K-means and spectral clustering, and examine their pros and cons. Use various datasets to understand and compare different clustering methods.
E N D
Clustering Gilad Lerman Math Department, UMN Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore
What is Clustering? • Partitioning data into classes with high intra-class similarity low inter-class similarity • Is it well-defined?
What is Similarity? • Clearly, subjective measure or problem-dependent
How Similar Clusters are? • Ex1: Two clusters or one clusters?
How Similar Clusters are? • Ex2: Cluster or outliers
Sum-Squares Intra-class Similarity • Given Cluster Mean: Within Cluster Sum of Squares: • Note that
Within Cluster Sum of Squares • For Set of Clusters S={S1,…,SK} • Can use • So get Within Clusters Manhattan Distance • Question: how to compute/estimate c?
Minimizing WCSS • Precise minimization is “NP-hard” • Approximate minimization for WCSS by K-means • Approximate minimization for WCMD by K-medians
The K-means Algorithm • Input: Data & number of clusters (K) • Randomly guess locations of K cluster centers • For each center – assign nearest cluster • Repeat till convergence ….
Demonstration: K-means/medians • Applet
K-means: Pros and Cons • Pros • Often fast • Often terminates at a local minimum • Cons • May not obtain the global minimum • Depends on initialization • Need to specify K • Sensitive to outliers • Sensitive to variations in sizes and densities of clusters • Not suitable for non-convex shapes • Does not apply directly to categorical data
Spectral Clustering Idea: embed data for easy clustering • Construct weights based on proximity: (Normalize W ) • Embed using eigenvectors of W
Clustering vs. Classification • Clustering – find classes in an unsupervised way (often K is given though) • Classification – labels of clusters are given for some data points (supervised learning)
Data 1: Face images • Facial images (e.g., of persons 5,8,10) live on different “planes” in the “image space” • They are often well-separated so that simple clustering can apply to them (but not always…) • Question: What is the high-dimensional image space? • Question: How can we present high-dim. data in 3D?
Data 2: Iris Data Set • 50 samples from each of 3 species • 4 features per sample: length & width of sepal and petal Setosa Versicolor Virginica
Data 2: Iris Data Set • Setosa is clearly separated from 2 others • Can’t separate Virginica and Versicolor (need training set as done by Fischer in 1936) • Question: What are other ways to visualize?
Data 3: Color-based Compression of Images • Applet • Question: What are the actual data points? • Question: What does the error mean?
Some methods for # of Clusters(with online codes) • Gap statistics • Model-based clustering • G-means • X-means • Data-spectroscopic clustering • Self-tuning clustering
Your mission • Learn about clustering (theoretical results, algorithms, codes) • Focus: methods for determining # of clusters • Understand details • Compare using artificial and real data • Conclude good/bad scenarios for each (prove?) • Come up with new/improved methods • Summarize info: literature survey and possibly new/improved demos/applets • We can suggest additional questions tailored to your interest