Exploring Unsupervised Learning: Statistical and Computational Insights

Unsupervised learning: Statistical and computational perspectives. Werner Stuetzle, Professor and Chair, Statistics; Adjunct Professor, Computer Science and Engineering; University of Washington, Seattle. Supported by NSF grant DMS-9803226 and NSA grant 62-1942. Work performed while on sabbatical at AT&T Labs - Research.

1. Introduction • Given: Collection of n objects, characterized by feature vectors x1, …, xn. • General goal of unsupervised learning: • Detect presence of distinct groups • Assign objects to groups • Note: Important to distinguish between unsupervised learning and compact partitioning • Unsupervised learning: Identify distinct groups • Compact partitioning: Partition collection of objects into compact strata

The prototypical compact partitioning method: K-means clustering • Let P = (P1, …, Pk) be a partition of the observations into k groups. • Measure badness of a partition by the sum of squared distances of observations from their group means: sum over groups j = 1, …, k of sum over xi in Pj of || xi - mean(Pj) ||^2 • Search for a partition minimizing this sum (for example with the Lloyd algorithm, which converges to a local optimum) • Note: • K-means clustering can be successful at finding groups if • we picked the correct k • groups are roughly spherical, and • approximately of the same size • For the remainder of the talk, will focus on unsupervised learning
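The badness criterion and the Lloyd iteration mentioned on this slide can be sketched as follows. This is a minimal illustration, not the speaker's code: the function names (`group_means`, `badness`, `lloyd`), the random initialisation from sample points, and the handling of empty groups are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def group_means(points, labels, k):
    """Mean of each group; an empty group gets None."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, g in zip(points, labels):
        counts[g] += 1
        for j in range(dim):
            sums[g][j] += p[j]
    return [[s / counts[g] for s in sums[g]] if counts[g] else None
            for g in range(k)]

def badness(points, labels, k):
    """Sum of squared distances of observations from their group means."""
    means = group_means(points, labels, k)
    return sum(sq_dist(p, means[g]) for p, g in zip(points, labels))

def lloyd(points, k, n_iter=100, seed=0):
    """Lloyd iteration: alternate nearest-center assignment and mean update.

    Assumption: centers start at k randomly chosen sample points.
    The result is a local optimum, not necessarily the best partition.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda g: sq_dist(p, centers[g]))
                      for p in points]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        means = group_means(points, labels, k)
        # Keep the old center if a group happens to be empty.
        centers = [m if m is not None else centers[g]
                   for g, m in enumerate(means)]
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = lloyd(points, k=2)
print(labels, badness(points, labels, 2))
```

On two well-separated, roughly spherical groups of equal size, as in this toy input, the iteration recovers the groups; the slide's caveats describe exactly the situations where it need not.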

2. Approaches to Unsupervised Learning • Regard feature vectors x1, …, xn as a sample from some density p(x) • Parametric approach: (Cheeseman, McLachlan, Raftery) • Based on the premise that each group g is represented by a density pg that is a member of some parametric family => p(x) is a mixture • Estimate the parameters of the group densities, the mixing proportions, and the number of groups from the sample. • Nonparametric approach: (Wishart, Hartigan) • Based on the premise that distinct groups manifest themselves as multiple modes of p(x) • Estimate modes from sample • Will pursue nonparametric approach

3. Describing the modal structure of a density • Consider feature vectors x1, …, xn as a sample from some density p(x). • Define level set L(c ; p) as the subset of feature space for which the density p(x) is greater than c. • Note: • Level sets with multiple connected components indicate multi-modality • There might not be a single level set that reveals all the modes

The cluster tree of a density • Modal structure of density is described by cluster tree. • Each node N of cluster tree • represents a subset D(N) of feature space • is associated with a density level c(N) • Root node • represents the entire feature space • is associated with density level c(N) = 0 • Tree defined recursively: to determine descendants of node N • Find lowest level c for which intersection of D(N) with L(c ; p) has two connected components • If there is no such c then N is leaf of tree; leaves of tree <==> modes • Otherwise, create daughter nodes representing the connected components, with associated level c
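To make the recursive definition concrete, here is a small sketch that builds the cluster tree of a density tabulated on a regular 1-d grid. Everything about the representation is an assumption for illustration: the density is a list of values, a connected component is a run of consecutive grid indices, candidate levels are the tabulated values themselves, and a node is a plain dictionary.

```python
def components(indices):
    """Split a sorted list of grid indices into runs of consecutive indices."""
    runs = []
    for i in indices:
        if runs and i == runs[-1][-1] + 1:
            runs[-1].append(i)
        else:
            runs.append([i])
    return runs

def cluster_tree(density, region=None, level=0.0):
    """Recursively build the cluster tree of a gridded 1-d density.

    region: sorted grid indices making up D(N); defaults to the whole grid.
    level:  the density level c(N) associated with the node.
    """
    if region is None:
        region = list(range(len(density)))
    node = {"region": (region[0], region[-1]), "level": level, "children": []}
    # Try candidate levels from lowest to highest.
    for c in sorted({density[i] for i in region}):
        parts = components([i for i in region if density[i] > c])
        if len(parts) >= 2:
            # Lowest level at which D(N) intersected with L(c; p) splits.
            node["children"] = [cluster_tree(density, part, c) for part in parts]
            break
    return node   # no split found: the node is a leaf, i.e. a mode

def leaves(node):
    """Regions (first index, last index) of the leaves, left to right."""
    if not node["children"]:
        return [node["region"]]
    return [r for child in node["children"] for r in leaves(child)]

tree = cluster_tree([0, 2, 5, 3, 6, 2, 1, 4, 0])
print(leaves(tree))
```

In this toy density the root splits at level 1 into a left region and a right region, and the left region splits again at level 3, so no single level set shows all three modes at once.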

Goal: Estimate the cluster tree of the underlying density p(x) from the sample feature vectors x1, …, xn. • First step: Estimate p(x) by density estimate p*(x) (see below) • Second step: Compute cluster tree of p* (maybe approximately)

4. Density estimation • Consider feature vectors x1, …, xn as a sample from some density p(x). • Goal: Estimate p(x) • Simplest idea: Let S(x, r) denote a sphere in feature space with radius r, centered at x. Assuming density is roughly constant over S(x, r), the expected number of sample points in S(x, r) is k ~ n * Volume(S(x, r)) * p(x), giving p(x) ~ k / (n * Volume(S(x, r))) • Kernel estimate: Fix radius r; k = # of sample feature vectors in S(x, r) • K-near-neighbor estimate: Fix count k; r = smallest radius for which S(x, r) contains k sample feature vectors • Many refinements have been suggested
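Both estimates follow directly from p(x) ~ k / (n * Volume(S(x, r))). The sketch below is a brute-force illustration only; the function names and the use of the closed ball (distance <= r) are assumptions, and none of the refinements mentioned on the slide are included.

```python
import math

def ball_volume(d, r):
    """Volume of a d-dimensional ball of radius r."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kernel_estimate(x, sample, r):
    """Fix radius r; k = number of sample vectors in S(x, r)."""
    k = sum(1 for s in sample if dist(x, s) <= r)
    return k / (len(sample) * ball_volume(len(x), r))

def knn_estimate(x, sample, k):
    """Fix count k; r = smallest radius for which S(x, r) holds k sample vectors.

    Note: r is 0 (and the estimate undefined) if x coincides with
    k or more sample vectors.
    """
    r = sorted(dist(x, s) for s in sample)[k - 1]
    return k / (len(sample) * ball_volume(len(x), r))

sample = [(0.0,), (1.0,), (2.0,), (3.0,)]
print(kernel_estimate((1.5,), sample, r=1.0))   # 2 points within an interval of length 2
print(knn_estimate((1.5,), sample, k=2))        # 2 points within an interval of length 1
```

The two estimates differ at the same query point because they fix different quantities: the kernel estimate fixes the volume, the near-neighbor estimate fixes the count.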

Example - kernel density estimate in 2-d • Swept under the rug: • Choice of sphere radius r (for kernel estimate) or count k (for near-neighbor estimate) --- critical !! There are automatic methods. • Down-weight observations depending on distance from query point • Adaptive estimation --- vary radius r depending on density • Other types of estimates, etc, etc, etc (extensive literature)

Computational complexity • Computing kernel or near-neighbor estimate at query point x requires finding nearest neighbors of x in sample x1, …, xn. • Can find k nearest neighbors of x in time ~ log n using spatial partitioning schemes such as k-d trees, after n log n pre-processing • However: • Spatial partitioning most effective if n large relative to d. • Theoretical analysis shows that number of nearest neighbors should increase with n and decrease with dimensionality d: k ~ n ^ (4 / (d + 4)). Relevance? • In low dimensions (d <= 4) can use histogram or average shifted histogram density estimates based on regular binning. • Evaluation for query point in constant time, after pre-processing ~ n • High dimensionality may present problem
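Two of the points above lend themselves to a short sketch: the growth rate k ~ n ^ (4 / (d + 4)), and a histogram estimate on a regular grid with pre-processing ~ n and constant-time evaluation. The names `neighbor_rate`, `fit_histogram` and `histogram_density` are made up for this example, the rate ignores the unknown proportionality constant, and the plain (not average shifted) histogram is used for brevity.

```python
import math
from collections import Counter

def neighbor_rate(n, d):
    """Order of magnitude n^(4/(d+4)); the constant factor is not specified."""
    return n ** (4 / (d + 4))

def fit_histogram(sample, width):
    """Pre-processing ~ n: count sample vectors per cubic bin of side `width`."""
    counts = Counter(tuple(math.floor(v / width) for v in x) for x in sample)
    return counts, len(sample), width

def histogram_density(x, model):
    """Constant-time evaluation: one hash lookup of the bin containing x."""
    counts, n, width = model
    cell = tuple(math.floor(v / width) for v in x)
    return counts[cell] / (n * width ** len(x))

print(neighbor_rate(10**6, 2), neighbor_rate(10**6, 20))

sample = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (1.5, 0.5)]
model = fit_histogram(sample, width=1.0)
print(histogram_density((0.5, 0.5), model))
```

For a million observations the rate gives on the order of 10^4 neighbors in 2 dimensions but only about 10 in 20 dimensions, which illustrates how strongly the theoretical prescription depends on d.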

5. Recursive algorithms for constructing a cluster tree • For most density estimates p*(x), computing level sets and finding their connected components is a daunting problem --- especially in high dimensions. • Idea: Compute sample cluster tree instead • Each node N of sample cluster tree • represents a subset X(N) of the sample • is associated with a density level c(N) • Root node • represents the entire sample • is associated with density level c(N) = 0

To determine descendants of node N • Find lowest level c for which the intersection of X(N) with L(c ; p*) falls into two connected components. Note: Intersection of X(N) with L(c ; p*) consists of those feature vectors in the node N for which estimated density p*(xi) > c. @ • If there is no such c then N is leaf of tree; • Otherwise, create daughter nodes representing the “connected components”, with associated level c. • Note: • @ is the critical step. Will in general have to rely on heuristic. • Daughters of a node N do not define a partition of X(N). Assigning low density observations in X(N) to one of the daughters is a supervised learning problem
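The last note says that observations of X(N) with estimated density at or below the split level belong to neither daughter, and that assigning them is a supervised learning problem. One simple way to do this, chosen here purely for illustration (the talk does not prescribe a classifier), is a nearest-neighbor rule with the daughters as training classes. The function name and the index-list representation are assumptions.

```python
import math

def assign_low_density(points, node, daughters):
    """Extend the daughters to a partition of the node.

    points:    list of feature vectors
    node:      indices of the observations in X(N)
    daughters: list of index lists, the high-density "connected components"
    Each leftover observation joins the daughter containing its nearest
    already-assigned observation (a 1-nearest-neighbor classifier).
    """
    assigned = [list(d) for d in daughters]
    in_daughter = {i for d in daughters for i in d}
    for i in node:
        if i in in_daughter:
            continue
        best = min(
            range(len(daughters)),
            key=lambda g: min(math.dist(points[i], points[j])
                              for j in daughters[g]),
        )
        assigned[best].append(i)
    return assigned

points = [(0, 0), (1, 0), (4, 0), (9, 0), (10, 0)]
print(assign_low_density(points, node=[0, 1, 2, 3, 4],
                         daughters=[[0, 1], [3, 4]]))
```

Here the low-density observation at (4, 0) is closer to the left daughter and is assigned to it; any other classifier trained on the daughters could be substituted.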

Illustration

Critical step • Find lowest level c for which observations in X(N) with estimated density p*(xi) > c fall into two connected components of level set L(c ; p*) • Heuristic 1: (goes with k-near-neighbor density estimate) • Select feature vectors xi in X(N) with p*(xi) > c • Generate graph connecting each feature vector to its k nearest neighbors • Check whether graph has 1 or 2 connected components • Heuristic 2: (goes with kernel density estimate) • Select feature vectors xi in X(N) with p*(xi) > c • Generate graph connecting feature vectors with distance < r • Check whether graph has 1 or 2 connected components
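Heuristic 2 combined with the search for the lowest splitting level might look like the sketch below. It assumes the estimated densities p*(xi) have already been computed and are passed in as a list, tries only the observed density values as candidate levels, and uses a brute-force O(m^2) neighbor search; the function names are invented for the example.

```python
import math

def graph_components(points, idx, r):
    """Connected components of the graph on `idx` that joins feature
    vectors at distance < r (depth-first search, brute-force neighbors)."""
    unvisited = set(idx)
    comps = []
    while unvisited:
        start = unvisited.pop()
        comp, stack = [start], [start]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) < r]
            for j in near:
                unvisited.remove(j)
            comp.extend(near)
            stack.extend(near)
        comps.append(sorted(comp))
    return sorted(comps)

def split_node(points, density, node, r):
    """Lowest level c at which the observations in the node with
    density > c fall into two or more graph components.

    Returns (c, components), or None if the node is a leaf.
    """
    for c in sorted({density[i] for i in node}):
        high = [i for i in node if density[i] > c]
        comps = graph_components(points, high, r)
        if len(comps) >= 2:
            return c, comps
    return None

points = [(0,), (1,), (2,), (3,), (4,), (5,), (6,)]
density = [3, 5, 3, 1, 3, 5, 3]            # assumed precomputed p*(xi)
print(split_node(points, density, node=list(range(7)), r=1.5))
```

Heuristic 1 differs only in how the graph is built: each selected feature vector would be linked to its k nearest neighbors instead of to all vectors within distance r.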

Related work • Looking for the connected components of a level set --- One-level Mode Analysis --- was first suggested by David Wishart (1969). • Wishart’s paper appeared in obscure place --- Proceedings of the Colloquium in Numerical Taxonomy, St. Andrews, 1968. Nobody in CS cites Wishart. • Idea has been re-invented multiple times --- “sharpening” (Tukey & Tukey); DBSCAN (Ester et al.)… Methods differ in heuristics for finding connected components of level set. • Wishart also realized that looking at single level set might not be enough to detect all the modes ==> Hierarchical Mode Analysis. Did not think of it as estimating cluster tree. Algorithm awkward --- based on iterative merging instead of recursive partitioning. • OPTICS method of Ankerst et al. also considers level sets for different levels.

6. Constructing the cluster tree of the 1-near-neighbor density estimate • The 1-near-neighbor density estimate is defined by p*(x) ~ 1 / dist(x, X)^d, where dist(x, X) is the distance from x to the nearest sample feature vector and d is the dimensionality of feature space. • Advantage of 1-near-neighbor estimate: Connected components of level sets of p* can be found exactly by analyzing the minimal spanning tree of the sample. • Disadvantage of 1-near-neighbor estimate: Not a very good density estimate: noisy, singularities at observed feature vectors xi. (Not necessarily fatal --- we don’t care about density per se) • Noise and singularities produce spurious nodes => specify a minimum cluster size
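Why the minimal spanning tree suffices: the level set {p* > c} is the set of points closer than some radius rho to the sample, i.e. a union of open balls of radius rho around the sample feature vectors, with rho shrinking as c grows. Two balls overlap exactly when their centers are less than 2 * rho apart, and the minimal spanning tree preserves the connectivity of that threshold graph. The sketch below illustrates this with an O(n^2) Prim construction rather than the ~ n log n method mentioned in the talk; the function names and the direct use of rho instead of c are assumptions, and the minimum cluster size rule is not included.

```python
import math

def minimal_spanning_tree(points):
    """Prim's algorithm, O(n^2). Returns a list of edges (length, i, j)."""
    n = len(points)
    # best[j] = (distance to the tree so far, tree vertex realising it)
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda t: best[t][0])
        length, i = best.pop(j)
        edges.append((length, i, j))
        for t in best:
            d = math.dist(points[j], points[t])
            if d < best[t][0]:
                best[t] = (d, j)
    return edges

def level_set_components(n, mst, rho):
    """Groups of sample indices lying in the same connected component of
    the union of radius-rho balls: keep MST edges shorter than 2 * rho."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for length, i, j in mst:
        if length < 2 * rho:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
mst = minimal_spanning_tree(points)
print(level_set_components(len(points), mst, rho=1.0))
```

Raising the level c corresponds to lowering rho, so a cluster first splits when the longest edge of its spanning subtree is removed.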

Computationally attractive • Computing and pre-processing minimal spanning tree ~ n log n. • Deciding on whether a cluster with m observations should be split ~ m • Have implemented this method and run a number of experiments on simulated data and data sets from machine learning. • Competitive with other methods that make implicit assumptions about shape of groups (like k-means, average linkage ..) • A lot better when assumptions made by those methods are violated.

7. Summary and future work • The term “clustering” is ambiguous --- need to distinguish between compact partitioning and unsupervised learning. • Goal of unsupervised learning: detect presence of distinct groups. • Assumption: groups ~ modes --- connected components of level sets --- of feature density. • This definition accommodates elongated and non-linear groups. • Modal structure of density is described by cluster tree. • Cluster tree is defined recursively --- suggests recursive partitioning. • Potentially many variations on basic algorithm, differing in • (1) estimate of feature density (2) heuristic for deciding when to split a node • Attractive choice: 1-near-neighbor density estimate. Level sets and their connected components can be found exactly by analyzing minimal spanning tree of sample

Future work • Principled method for deciding on number of groups --- hard! • Sampling or aggregation methods for dealing with large data sets • Visualization: Link cluster tree with other displays such as histograms, scatterplots, etc, to understand location and shape of clusters in feature space • Quantitative evaluation and comparison of methods

4. Finding the cluster tree of the estimated density • For most density estimates p*(x), computing level sets and finding their connected components is a daunting problem --- especially in high dimensions. • Idea: Compute sample cluster tree instead. Density cluster tree: • Each node N • represents a subset D(N) of feature space • is associated with a density level c(N) • Root node • represents the entire feature space • is associated with density level c(N) = 0. Sample cluster tree: • Each node N • represents a subset X(N) of the sample • is associated with a density level c(N) • Root node • represents the entire sample • is associated with density level c(N) = 0

Density cluster tree: • To determine descendants of node N • Find lowest level b for which intersection of D(N) with L(b ; p) has two connected components • If there is no such b then N is leaf of tree; • Otherwise, create daughter nodes representing the connected components, with associated level b. Sample cluster tree: • To determine descendants of node N • Find lowest level b for which the intersection of X(N) with L(b ; p*) falls into two connected components @ • If there is no such b then N is leaf of tree; • Otherwise, create daughter nodes representing the subsets of X(N), with associated level b. @ The critical step: • Easy to compute intersection of X(N) with level set L(b ; p*): it is the subset of the observations in X(N) for which p*(xi) > b • Hard to decide whether they fall into one or two connected components --- usually need heuristic
