8. Estimating the cluster tree of a density from the MST by Runt Pruning
Problem: 1-nn density estimate is very noisy --- singularity at each observation => cluster tree would have n leaves
Idea: Control size of cluster tree by runt size threshold
Split of connected component of L(c, p*) is considered “significant” if both daughter components are larger than runt size threshold.
Sketch of algorithm
Repeat { Break longest edge of MST } Until min (size of left subtree, size of right subtree) > runt size threshold
If … apply recursively to subtrees
Runt analysis
• Define runt size (J. H.) of MST edge e:
• Break all MST edges that are longer than e
• runt_size (e) = min (#obs in left subtree, #obs in right subtree)
[Figure labels: rs = 2, rs = 1, rs = 5]
Algorithm:
compute_cluster_tree (mst, runt_size_threshold) {
  node = new_cluster_tree_node;
  node.leftson = node.rightson = NULL;
  node.obs = leaves (mst);
  cut_edge = longest_edge_with_large_runt_size (mst, runt_size_threshold);
  if (cut_edge) {
    node.leftson = compute_cluster_tree (left_subtree(mst, cut_edge), runt_size_threshold);
    node.rightson = compute_cluster_tree (right_subtree(mst, cut_edge), runt_size_threshold);
  }
  return(node);
}
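The pseudocode above can be sketched in runnable Python. This is a minimal sketch, not the author's implementation. It assumes the MST is given as a list of (u, v, length) tuples, that edge lengths are distinct, and that "large runt size" means strictly greater than the threshold, as in the sketch of the algorithm. The names Node, runt_sizes and split are hypothetical helpers introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Edge = Tuple[int, int, float]  # (observation u, observation v, edge length)


@dataclass
class Node:
    obs: List[int]
    leftson: Optional["Node"] = None
    rightson: Optional["Node"] = None


def runt_sizes(nodes: List[int], edges: List[Edge]) -> List[Tuple[Edge, int]]:
    """Runt size of every tree edge, via union-find over edges sorted by length."""
    parent = {v: v for v in nodes}
    size = {v: 1 for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    out = []
    for e in sorted(edges, key=lambda e: e[2]):
        a, b = find(e[0]), find(e[1])
        # a and b are the two subtrees held together only by edges shorter than e
        out.append((e, min(size[a], size[b])))
        if size[a] < size[b]:
            a, b = b, a
        parent[b] = a
        size[a] += size[b]
    return out


def split(nodes: List[int], edges: List[Edge], cut: Edge):
    """Remove `cut` and return (nodes, edges) for each of the two sides."""
    rest = [e for e in edges if e != cut]
    adj = {v: [] for v in nodes}
    for u, v, _ in rest:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {cut[0]}, [cut[0]]
    while stack:  # depth-first search from one endpoint of the cut edge
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    left = ([v for v in nodes if v in seen], [e for e in rest if e[0] in seen])
    right = ([v for v in nodes if v not in seen], [e for e in rest if e[0] not in seen])
    return left, right


def compute_cluster_tree(nodes: List[int], edges: List[Edge], threshold: int) -> Node:
    node = Node(obs=sorted(nodes))
    # candidate cut edges: runt size strictly above the threshold (assumption)
    big = [e for e, rs in runt_sizes(nodes, edges) if rs > threshold]
    if big:
        cut = max(big, key=lambda e: e[2])  # longest such edge
        (ln, le), (rn, re_) = split(nodes, edges, cut)
        node.leftson = compute_cluster_tree(ln, le, threshold)
        node.rightson = compute_cluster_tree(rn, re_, threshold)
    return node


# Toy path-shaped MST: two groups of three observations plus one straggler (obs 6).
edges = [(0, 1, 1.0), (1, 2, 1.1), (2, 3, 8.0), (3, 4, 0.9), (4, 5, 1.2), (5, 6, 18.0)]
tree = compute_cluster_tree(list(range(7)), edges, threshold=2)
```

In this toy example the longest edge (length 18.0) has runt size 1 and is never cut, so the straggler stays attached to its neighbours; the only split is at the edge of length 8.0, whose runt size is 3.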
Heuristic justification: MST edges with large runt size indicate presence of multiple modes
• Recall multi-fragment algorithm for MST construction:
• Define distance d (G1, G2) between groups as minimum distance between observations
• Initialize each obs to form its own group
• Repeat { Find closest groups; Add shortest edge connecting them; Merge closest groups } Until only one group remains
• What will happen?
• Fragments will start and grow in high density regions, where distances are small
• Eventually, those fragments will be joined by edges
• Those edges will have large runt size
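The merge process described above can be traced on a small example. The following is a rough sketch for one-dimensional data, assuming a brute-force scan of all pairwise distances (quadratic in n, for illustration only). The function name multi_fragment and the sample points are hypothetical.

```python
from itertools import combinations


def multi_fragment(points):
    """Merge closest groups until one remains.

    Returns a list of ((i, j), size of group of i, size of group of j),
    one entry per MST edge, in the order the edges are added.
    """
    group = {i: frozenset([i]) for i in range(len(points))}  # obs -> its group
    # all pairs of observations, closest first
    pairs = sorted(combinations(range(len(points)), 2),
                   key=lambda p: abs(points[p[0]] - points[p[1]]))
    history = []
    for i, j in pairs:
        gi, gj = group[i], group[j]
        if gi is gj:
            continue  # both observations already belong to the same fragment
        history.append(((i, j), len(gi), len(gj)))
        merged = gi | gj
        for k in merged:
            group[k] = merged  # every member now shares one group object
    return history


# Two dense regions, around 1 and around 11 (hypothetical data).
points = [0.0, 1.0, 2.1, 10.0, 10.9, 12.1]
history = multi_fragment(points)
last_edge, size_a, size_b = history[-1]
```

Here the first four edges grow fragments inside each dense region, and the last edge joins observation 2 to observation 3, merging two fragments of three observations each. Its runt size, min(size_a, size_b), is 3, while every earlier edge has runt size 1.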
Illustration
Left: data set
Middle: rootogram of runt sizes
Right: MST after removal of all edges with length > length (edge with largest runt size)
Computational complexity
• Computing MST: O (n log n) using spatial hashing
• Computing runt sizes for edges of MST: O (n log n)
• Deciding on whether a cluster with m observations should be split: O (m)
However
• Spatial partitioning most effective if n large relative to d.
Relationship to single linkage clustering
• Single linkage clustering = standard way of extracting clusters from MST
To obtain k clusters, break k-1 longest edges in MST
• Problems:
• Breaking longest edges tends to separate stragglers from the bulk of the data and often results in one large and many small clusters (“chaining”)
• Choosing a single threshold for edge length <=> choosing a single cut level for 1-NN density estimate. However, there might not be a single cut level that reveals all the leaves of the mode tree.
Cut at upper level reveals two leftmost modes. Cut at lower level reveals right mode. Need to consider cuts at all levels
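The straggler problem is easy to reproduce. The sketch below breaks the k-1 longest edges of a toy MST and lists the resulting clusters. It assumes the MST is a list of (u, v, length) tuples over observations 0..n-1; the function name single_linkage_clusters and the data are hypothetical.

```python
def single_linkage_clusters(n, edges, k):
    """Break the k-1 longest MST edges and return the connected components."""
    kept = sorted(edges, key=lambda e: e[2])[: len(edges) - (k - 1)]
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v, _ in kept:
        parent[find(u)] = find(v)  # join the two endpoints' components

    comps = {}
    for v in range(n):
        comps.setdefault(find(v), []).append(v)
    return sorted(comps.values())


# Two groups of three observations plus one straggler (obs 6).
mst = [(0, 1, 1.0), (1, 2, 1.1), (2, 3, 8.0), (3, 4, 0.9), (4, 5, 1.2), (5, 6, 18.0)]
two = single_linkage_clusters(7, mst, k=2)
three = single_linkage_clusters(7, mst, k=3)
```

With k = 2 the straggler is split off and the two groups stay merged; the groups are only separated at k = 3, at the cost of a singleton cluster.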
Illustration - olive oil data
Objects: 572 olive oil samples coming from 9 different areas, grouped into 3 regions (1, 2, 3, 4) (5, 6) (7, 8, 9)
Features: Concentration of 8 different chemicals
Question: How well can we recover the grouping into regions and areas
Note: To evaluate performance of unsupervised learning methods, need labeled data
20 largest runt sizes: 168 97 59 51 42 42 33 13 13 12 11 11 11 10 10 8 8 8 8 7
Fairly clear gap: Choose runt size 33 as threshold
Note: Situation not always that clear cut
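The gap in the list above can also be located mechanically. The following is one possible heuristic, not the procedure used on the slide, where the threshold is chosen by eye: take the largest ratio between consecutive sorted runt sizes. The ratio criterion is an assumption; the largest absolute difference (168 to 97) would point to a different place.

```python
# The 20 largest runt sizes quoted above, in decreasing order.
runt = [168, 97, 59, 51, 42, 42, 33, 13, 13, 12, 11, 11, 11, 10, 10, 8, 8, 8, 8, 7]

# Ratio of each runt size to the next smaller one.
ratios = [a / b for a, b in zip(runt, runt[1:])]

# Position of the largest relative drop; the value just before it is the threshold.
i = max(range(len(ratios)), key=ratios.__getitem__)
threshold = runt[i]
n_at_or_above = sum(r >= threshold for r in runt)
```

For these values the largest relative drop is from 33 to 13, so `threshold` is 33, and seven of the listed runt sizes are at or above it.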
Estimate of cluster tree, olive oil data
Interpretation:
• Bottom split separates region 3 from regions 1, 2
• Next split on left separates region 1 from region 2
• Not able to correctly partition region 1 into areas
Areas vs clusters
Interpretation of table: There are 25 olive oil samples from area 1. One of them ended up in cluster 2, 17 in cluster 6, and 7 in cluster 8
Not able to recognize areas 1-4 in region 1
Diagnostic plot: Do the two clusters in area 3 really correspond to modes?
(a) cluster tree with node splitting area 3 selected;
(b) projection of data in node on Fisher discriminant direction separating daughters;
(c) cluster tree with node separating area 3 from area 2 selected;
(d) projection of data on Fisher direction
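The projection used in panels (b) and (d) can be sketched as follows. This is a minimal illustration of the standard two-class Fisher discriminant direction, w = Sw^{-1} (m_a - m_b), restricted to two-dimensional data so that the 2x2 inverse can be written out by hand; the olive oil data have 8 features and would need a general linear solver. The function names and sample points are hypothetical.

```python
def fisher_direction(a, b):
    """Fisher discriminant direction separating two groups of 2-D points."""
    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

    def scatter(pts, m):
        # entries (sxx, sxy, syy) of the within-group scatter matrix
        sxx = sum((p[0] - m[0]) ** 2 for p in pts)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
        syy = sum((p[1] - m[1]) ** 2 for p in pts)
        return sxx, sxy, syy

    ma, mb = mean(a), mean(b)
    sa, sb = scatter(a, ma), scatter(b, mb)
    sxx, sxy, syy = (sa[i] + sb[i] for i in range(3))  # pooled scatter Sw
    det = sxx * syy - sxy * sxy  # assumed nonzero (Sw invertible)
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # w = Sw^{-1} (ma - mb), using the closed-form inverse of a 2x2 matrix
    return ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)


def project(pts, w):
    """One-dimensional coordinates of the points along direction w."""
    return [p[0] * w[0] + p[1] * w[1] for p in pts]


# Two hypothetical daughter clusters of a cluster tree node.
left = [(0.0, 0.0), (1.0, 0.2), (0.0, 1.0), (1.0, 1.1)]
right = [(5.0, 0.0), (6.0, 0.1), (5.0, 1.0), (6.0, 1.2)]
w = fisher_direction(left, right)
proj_left, proj_right = project(left, w), project(right, w)
```

A histogram of the combined projections would then be inspected for two distinct modes. In this toy example the two groups do not overlap along w at all.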
Diagnostic plot: Do areas 1 and 4 really correspond to modes?
Projection of areas 1 (black), 2 (green), 3 (blue), and 4 (red) on the plane spanned by first two discriminant coordinates
Note: Not an operational diagnostic --- assumes knowledge of true labels