Clustering

Clustering • Gilad Lerman, Math Department, UMN • Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore

What is Clustering? • Partitioning data into classes with high intra-class similarity and low inter-class similarity • Is it well-defined?

What is Similarity? • Clearly a subjective or problem-dependent measure

How Similar Are Clusters? • Ex1: Two clusters or one cluster?

How Similar Are Clusters? • Ex2: Cluster or outliers?

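Sum-Squares Intra-class Similarity • Given a cluster $S_j$ with $N_j$ points • Mean: $\mu_j = \frac{1}{N_j} \sum_{x \in S_j} x$ • Within Cluster Sum of Squares: $\mathrm{WCSS}(S_j) = \sum_{x \in S_j} \|x - \mu_j\|_2^2$ • Note that [formula not preserved]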
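The slide's closing note is not preserved here. One standard identity about the within cluster sum of squares, offered as an assumption about what such a note could contain rather than as the slide's own content, expresses it through pairwise distances only. The symbols $S_j$ (a cluster), $N_j$ (its number of points) and $\mu_j$ (its mean) are notation chosen for this sketch.

```latex
\begin{align*}
\sum_{x \in S_j} \sum_{y \in S_j} \|x - y\|_2^2
  &= \sum_{x \in S_j} \sum_{y \in S_j} \|(x - \mu_j) - (y - \mu_j)\|_2^2 \\
  &= 2 N_j \sum_{x \in S_j} \|x - \mu_j\|_2^2
     - 2 \Big\langle \sum_{x \in S_j} (x - \mu_j),\ \sum_{y \in S_j} (y - \mu_j) \Big\rangle \\
  &= 2 N_j \sum_{x \in S_j} \|x - \mu_j\|_2^2 ,
\end{align*}
% the inner product vanishes because \sum_{x \in S_j} (x - \mu_j) = 0, hence
\[
\sum_{x \in S_j} \|x - \mu_j\|_2^2
  = \frac{1}{2 N_j} \sum_{x \in S_j} \sum_{y \in S_j} \|x - y\|_2^2 .
\]
```

Read this way, the sum of squares around the mean measures intra-class similarity without referring to the mean at all.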

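Within Cluster Sum of Squares • For a set of clusters $S = \{S_1, \dots, S_K\}$: $\mathrm{WCSS}(S) = \sum_{j=1}^{K} \sum_{x \in S_j} \|x - \mu_j\|_2^2$, where $\mu_j$ is the mean of $S_j$ • Can use the $\ell_1$ distance $\|x - c_j\|_1$ to a center $c_j$ instead • So get Within Clusters Manhattan Distance: $\mathrm{WCMD}(S) = \sum_{j=1}^{K} \sum_{x \in S_j} \|x - c_j\|_1$ • Question: how to compute/estimate the center c?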
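A minimal sketch of the two costs, assuming each cluster is stored as a NumPy array with one point per row. The function names `wcss`, `wcmd` and `l1_cost` are made up for this illustration. For the Manhattan cost the sketch takes the coordinate-wise median as the center, which is one standard answer to the question above, since the median minimizes a sum of absolute deviations in each coordinate.

```python
import numpy as np

def wcss(clusters):
    """Sum over clusters of squared Euclidean distances to the cluster mean."""
    return sum(((S - S.mean(axis=0)) ** 2).sum() for S in clusters)

def wcmd(clusters):
    """Sum over clusters of Manhattan (l1) distances to the coordinate-wise median."""
    return sum(np.abs(S - np.median(S, axis=0)).sum() for S in clusters)

def l1_cost(S, c):
    """Manhattan cost of a single cluster S around an arbitrary center c."""
    return np.abs(S - c).sum()

# Two small hand-made clusters (illustrative data only).
S1 = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
S2 = np.array([[5.0, 5.0], [5.0, 7.0]])
clusters = [S1, S2]

total_wcss = wcss(clusters)
total_wcmd = wcmd(clusters)

# For S1 the median center gives a smaller l1 cost than the mean center.
cost_at_median = l1_cost(S1, np.median(S1, axis=0))
cost_at_mean = l1_cost(S1, S1.mean(axis=0))
```

The point at 10 pulls the mean of `S1` far from the other two points, while the median stays at 1, which is why the Manhattan variant is less affected by that point.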

Minimizing WCSS • Precise minimization is “NP-hard” • Approximate minimization for WCSS by K-means • Approximate minimization for WCMD by K-medians

The K-means Algorithm • Input: Data & number of clusters (K) • Randomly guess locations of K cluster centers • For each point – assign it to the nearest center • For each center – move it to the mean of its assigned points • Repeat till convergence
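The steps on this slide can be sketched in a few lines of NumPy. This is an illustrative version, not the code behind the slides. It assumes the random guess picks K distinct data points as initial centers, treats "convergence" as the assignments no longer changing, and leaves a center in place if it ends up with no points. The name `kmeans` and its parameters are chosen for this sketch.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means (Lloyd iterations) on the rows of X."""
    rng = np.random.default_rng(seed)
    # Random initial guess: K distinct data points (an assumption of this sketch).
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Move every center to the mean of the points assigned to it.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers, labels

# Two well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(10.0, 0.5, (20, 2))])
centers, labels = kmeans(X, K=2)
```

Each pass can only lower the within cluster sum of squares or leave it unchanged, so the loop stops, though possibly at a local rather than the global minimum.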

Demonstration: K-means/medians • Applet

K-means: Pros and Cons • Pros • Often fast • Often terminates at a local minimum • Cons • May not obtain the global minimum • Depends on initialization • Need to specify K • Sensitive to outliers • Sensitive to variations in sizes and densities of clusters • Not suitable for non-convex shapes • Does not apply directly to categorical data

Spectral Clustering • Idea: embed data for easy clustering • Construct weights $W_{ij}$ based on proximity of points $x_i$ and $x_j$ (normalize $W$) • Embed using eigenvectors of $W$
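The slide does not preserve the weight formula or the normalization, so the sketch below fills them with one common choice, in the style of Ng, Jordan and Weiss: Gaussian weights with bandwidth `sigma`, the symmetric normalization $D^{-1/2} W D^{-1/2}$, the top K eigenvectors, and unit-length rows. All of these are assumptions, as are the helper names `spectral_embedding` and `cluster_rows`. The embedded rows are clustered with a small K-means that uses a deterministic farthest-point start.

```python
import numpy as np

def spectral_embedding(X, K, sigma=0.5):
    """Embed the rows of X using the top K eigenvectors of normalized weights."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian proximity weights (assumed form)
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    W_norm = W / np.sqrt(np.outer(deg, deg))  # D^{-1/2} W D^{-1/2}
    _, eigvecs = np.linalg.eigh(W_norm)       # eigenvalues come in ascending order
    Y = eigvecs[:, -K:]                       # eigenvectors of the K largest eigenvalues
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

def cluster_rows(Y, K, n_iter=20):
    """K-means on the rows of Y with a farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, K):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[d2.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = Y[labels == k].mean(axis=0)
    return labels

# Two concentric rings: a non-convex example (illustrative data only).
angles = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
inner = np.column_stack([np.cos(angles), np.sin(angles)])
outer = 5.0 * np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([inner, outer])
labels = cluster_rows(spectral_embedding(X, K=2), K=2)
```

With these settings the weights between the rings are negligible, so each ring collapses to roughly a single point in the embedding, which is the "easy clustering" the slide refers to. The result depends on `sigma`; a bandwidth much larger than the gap between the rings would blur them together.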

Clustering vs. Classification • Clustering – find classes in an unsupervised way (often K is given though) • Classification – labels of clusters are given for some data points (supervised learning)

Data 1: Face images • Facial images (e.g., of persons 5,8,10) live on different “planes” in the “image space” • They are often well-separated so that simple clustering can apply to them (but not always…) • Question: What is the high-dimensional image space? • Question: How can we present high-dim. data in 3D?
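One possible reading of the two questions, sketched under assumptions: an image of h × w gray-level pixels can be viewed as a single vector in $\mathbb{R}^{hw}$ (the "image space"), and one common way to display such data in 3D is to project onto the top three principal components. The sketch uses random arrays in place of face images, and the function name `project_to_3d` is made up for this illustration. Other projections are equally legitimate answers.

```python
import numpy as np

def project_to_3d(images):
    """Flatten each image to a vector and project onto 3 principal components."""
    X = images.reshape(len(images), -1).astype(float)  # one point in R^(h*w) per image
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:3].T

# Stand-in data: 12 random 16 x 16 "images" (not real faces).
rng = np.random.default_rng(0)
images = rng.random((12, 16, 16))
coords = project_to_3d(images)  # shape (12, 3), ready for a 3D scatter plot
```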

Data 2: Iris Data Set • 50 samples from each of 3 species • 4 features per sample: length & width of sepal and petal • Species: Setosa, Versicolor, Virginica

Data 2: Iris Data Set • Setosa is clearly separated from the 2 others • Can’t separate Virginica and Versicolor (need a training set, as done by Fisher in 1936) • Question: What are other ways to visualize?

Data 3: Color-based Compression of Images • Applet • Question: What are the actual data points? • Question: What does the error mean?
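The applet is not available here, so the following is only a guess at what it does: treat each pixel's RGB triple as one data point, run K-means on those points, and replace every pixel by its cluster's mean color. Under that reading, a natural error is the mean squared difference between the original and the compressed image, which is the within cluster sum of squares divided by the number of color values. The function name `quantize_colors` and the synthetic two-tone image are assumptions of this sketch.

```python
import numpy as np

def quantize_colors(image, K, n_iter=20, seed=0):
    """Replace every pixel by the nearest of K colors found by K-means."""
    pixels = image.reshape(-1, 3).astype(float)  # data points: one RGB vector per pixel
    rng = np.random.default_rng(seed)
    palette = pixels[rng.choice(len(pixels), size=K, replace=False)]
    for _ in range(n_iter):
        d2 = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                palette[k] = pixels[labels == k].mean(axis=0)
    # Final assignment against the final palette.
    d2 = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    compressed = palette[labels].reshape(image.shape)
    mse = ((image - compressed) ** 2).mean()     # error: mean squared color difference
    return compressed, mse

# Synthetic 8 x 8 image: reddish left half, bluish right half, plus noise.
rng = np.random.default_rng(2)
image = np.empty((8, 8, 3))
image[:, :4] = [200.0, 30.0, 30.0]
image[:, 4:] = [30.0, 30.0, 200.0]
image += rng.normal(0.0, 5.0, image.shape)
compressed, mse = quantize_colors(image, K=2)
```

Storing the K palette colors plus one small label per pixel, instead of three color values per pixel, is where the compression comes from.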

Some Methods for # of Clusters (with online codes) • Gap statistic • Model-based clustering • G-means • X-means • Data-spectroscopic clustering • Self-tuning clustering
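As an illustration of the first item, here is a simplified sketch of the gap statistic idea: compare the log of the within cluster sum of squares on the data with its average on reference data drawn uniformly from the data's bounding box, and pick the smallest K whose gap is within one error term of the next gap. It is not one of the online codes the slide mentions. The helper names `best_wcss`, `gap_statistic` and `choose_k`, the number of restarts, and the number of reference sets are all choices made for this sketch.

```python
import numpy as np

def best_wcss(X, K, rng, n_restarts=5, n_iter=50):
    """Smallest WCSS found by a few random restarts of plain K-means."""
    best = np.inf
    for _ in range(n_restarts):
        centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            for k in range(K):
                if np.any(labels == k):
                    centers[k] = X[labels == k].mean(axis=0)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

def gap_statistic(X, k_max=3, n_refs=10, seed=0):
    """Gap values and error terms for K = 1, ..., k_max."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, errs = [], []
    for K in range(1, k_max + 1):
        log_w = np.log(best_wcss(X, K, rng))
        # Reference sets: uniform samples in the bounding box of the data.
        ref_log_w = np.array([
            np.log(best_wcss(rng.uniform(lo, hi, size=X.shape), K, rng))
            for _ in range(n_refs)
        ])
        gaps.append(ref_log_w.mean() - log_w)
        errs.append(ref_log_w.std() * np.sqrt(1.0 + 1.0 / n_refs))
    return np.array(gaps), np.array(errs)

def choose_k(gaps, errs):
    """Smallest K with gap(K) >= gap(K+1) - err(K+1); falls back to the largest K."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errs[i + 1]:
            return i + 1
    return len(gaps)

# Two well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(10.0, 0.5, (20, 2))])
gaps, errs = gap_statistic(X)
k_hat = choose_k(gaps, errs)
```

On data like this the gap jumps sharply between K = 1 and K = 2. Whether the rule then stops at 2 or goes on to 3 depends on the random reference sets, which is the kind of behavior worth comparing across methods.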

Your mission • Learn about clustering (theoretical results, algorithms, codes) • Focus: methods for determining # of clusters • Understand details • Compare using artificial and real data • Conclude good/bad scenarios for each (prove?) • Come up with new/improved methods • Summarize info: literature survey and possibly new/improved demos/applets • We can suggest additional questions tailored to your interest
