Hierarchical Clustering
Hierarchical Clustering • Produces a set of nested clusters organized as a hierarchical tree • Can be visualized as a dendrogram • A tree-like diagram that records the sequence of merges or splits
Strengths of Hierarchical Clustering • Do not have to assume any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level • They may correspond to meaningful taxonomies • Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Two main types of hierarchical clustering • Agglomerative: • Start with the points as individual clusters • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left • Divisive: • Start with one, all-inclusive cluster • At each step, split a cluster until each cluster contains a single point (or there are k clusters) • Traditional hierarchical algorithms use a similarity or distance matrix • Merge or split one cluster at a time
An Agglomerative Clustering Algorithm • The basic algorithm is straightforward • Compute the proximity matrix • Let each data point be a cluster • Repeat • Merge the two closest clusters • Update the proximity matrix • Until only a single cluster remains • The key operation is the computation of the proximity of two clusters • Different approaches to defining the distance between clusters distinguish the different algorithms
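A minimal sketch of this loop in Python, using only NumPy; the function name `agglomerative` and the choice of single-link (MIN) as the cluster proximity are illustrative assumptions, not something prescribed by the slides:

```python
import numpy as np

def agglomerative(points, k=1):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters (single-link proximity here) until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]   # each point starts as its own cluster
    # proximity matrix of pairwise Euclidean distances
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    def cluster_dist(a, b):
        # single-link (MIN) distance between two clusters
        return min(d[i, j] for i in a for j in b)

    while len(clusters) > k:
        # find the closest pair of clusters
        pairs = [(cluster_dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        # merge them; the proximity "update" happens implicitly in cluster_dist
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Example: two obvious groups of 2-D points
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]])
print(agglomerative(pts, k=2))   # -> [[0, 1, 2], [3, 4]]
```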
Starting Situation • Start with clusters of individual points and a proximity matrix [Figure: proximity matrix over points p1, p2, p3, p4, p5, …]
Intermediate Situation • After some merging steps, we have some clusters [Figure: clusters C1–C5 and their proximity matrix]
Intermediate Situation • We want to merge the two closest clusters (C2 and C5) and update the proximity matrix [Figure: clusters C1–C5 and their proximity matrix]
After Merging • The question is “How do we update the proximity matrix?” [Figure: proximity matrix after merging, with the row and column for the new cluster C2 U C5 marked with ?]
How to Define Inter-Cluster Similarity • MIN • MAX • Group Average • Distance Between Centroids • Other methods driven by an objective function • Ward’s Method uses squared error [Figure: proximity matrix over points p1–p5, with the similarity between two clusters highlighted]
Cluster Similarity: MIN or Single Link • Similarity of two clusters is based on the two most similar (closest) points in the different clusters • Determined by one pair of points, i.e., by one link in the proximity graph
Hierarchical Clustering: MIN [Figure: nested clusters and the corresponding dendrogram for single-link clustering]
dist({3,6}, {2,5}) = min{dist(3,2), dist(6,2), dist(3,5), dist(6,5)} = min{0.15, 0.25, 0.28, 0.39} = 0.15
Strength of MIN • Can handle non-elliptical shapes [Figure: original points and the two clusters found]
Limitations of MIN • Sensitive to noise and outliers [Figure: original points and the two clusters found]
Cluster Similarity: MAX or Complete Linkage • Similarity of two clusters is based on the two least similar (most distant) points in the different clusters • Determined by all pairs of points in the two clusters
Hierarchical Clustering: MAX [Figure: nested clusters and the corresponding dendrogram for complete-link clustering]
dist({3,6}, {4}) = max{dist(3,4), dist(6,4)} = max{0.15, 0.22} = 0.22
dist({3,6}, {2,5}) = max{dist(3,2), dist(6,2), dist(3,5), dist(6,5)} = max{0.15, 0.25, 0.28, 0.39} = 0.39
Strength of MAX • Less susceptible to noise and outliers [Figure: original points and the two clusters found]
Limitations of MAX • Tends to break large clusters • Biased towards globular clusters [Figure: original points and the two clusters found]
Cluster Similarity: Group Average • Proximity of two clusters is the average of the pairwise proximities between points in the two clusters • Need to use average connectivity for scalability, since total proximity favors large clusters
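Written out, the group-average proximity described above takes the standard form (added here for reference; the formula itself is not printed on the slide):

```latex
\mathrm{proximity}(C_i, C_j) \;=\;
\frac{\displaystyle\sum_{x \in C_i}\sum_{y \in C_j} \mathrm{proximity}(x, y)}
     {|C_i| \cdot |C_j|}
```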
Hierarchical Clustering: Group Average [Figure: nested clusters and the corresponding dendrogram for group-average clustering]
Hierarchical Clustering: Group Average • Compromise between Single and Complete Link • Strengths • Less susceptible to noise and outliers • Limitations • Biased towards globular clusters
Cluster Similarity: Ward’s Method • Similarity of two clusters is based on the increase in squared error when two clusters are merged • Similar to group average if distance between points is distance squared • Less susceptible to noise and outliers • Biased towards globular clusters • Hierarchical analogue of K-means • Can be used to initialize K-means
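For reference, the increase in squared error when two clusters A and B with centroids c_A and c_B are merged has a standard closed form (not given on the slide):

```latex
\Delta \mathrm{SSE}(A, B) \;=\; \frac{|A|\,|B|}{|A| + |B|}\,
\bigl\lVert \mathbf{c}_A - \mathbf{c}_B \bigr\rVert^{2}
```

This is what makes Ward's method behave like a hierarchical analogue of K-means: each merge is the one that increases the K-means objective the least.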
Hierarchical Clustering: Comparison [Figure: nested clusters produced by MIN, MAX, Group Average, and Ward’s Method on the same data set]
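A sketch of how the four schemes can be compared side by side with SciPy; the `scipy.cluster.hierarchy` calls are standard, while the sample data and parameter choices are purely illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two loose 2-D blobs as toy data (illustrative only)
pts = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

for method in ["single", "complete", "average", "ward"]:   # MIN, MAX, group average, Ward
    Z = linkage(pts, method=method)                  # full merge history (the dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes per linkage scheme
```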
Hierarchical Clustering: Time and Space Requirements • O(N²) space, since it uses the proximity matrix • N is the number of points • O(N³) time in many cases • There are N steps, and at each step a proximity matrix of size O(N²) must be updated and searched • Complexity can be reduced to O(N² log N) time for some approaches
Hierarchical Clustering: Problems and Limitations • No global objective function is directly optimized • Different schemes have problems with one or more of the following: • Sensitivity to noise and outliers • Difficulty handling clusters of different sizes and convex shapes • Breaking large clusters • Once a merge is made, it cannot be undone • The O(n² log n) time complexity limits its applicability
MST: Divisive Hierarchical Clustering • Build MST (Minimum Spanning Tree) • Start with a tree that consists of any point • In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not • Add q to the tree and put an edge between p and q
MST: Divisive Hierarchical Clustering • Use MST for constructing hierarchy of clusters
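A minimal Python sketch of MST-based divisive clustering, under the common convention that a split is made by removing the longest remaining MST edge; the Prim-style tree growth mirrors the steps above, and the function name and edge-removal rule are assumptions for illustration:

```python
import numpy as np

def mst_divisive(points, k):
    """Divisive clustering: build an MST with Prim's algorithm, then
    remove the k-1 longest edges; the connected components are the clusters."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    # Prim's algorithm: start from any point, repeatedly attach the closest outside point
    in_tree = {0}
    edges = []                                  # (length, p, q) edges of the MST
    while len(in_tree) < n:
        length, p, q = min((d[p, q], p, q)
                           for p in in_tree for q in range(n) if q not in in_tree)
        edges.append((length, p, q))
        in_tree.add(q)

    # Drop the k-1 longest edges, then collect connected components via union-find
    edges.sort()
    kept = edges[: n - k]                       # keep the n-k shortest edges
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, p, q in kept:
        parent[find(p)] = find(q)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]])
print(mst_divisive(pts, k=2))   # -> two groups, e.g. [[0, 1, 2], [3, 4]]
```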
The advantage • It can serve as an algorithm for both flat (planar) and hierarchical clustering • The time complexity of the algorithm is only O(n²), which is the time complexity of building an MST
DBSCAN • DBSCAN is a density-based algorithm • Density = number of points within a specified radius (Eps) • A point is a core point if it has more than a specified number of points (MinPts) within Eps • These are points in the interior of a cluster • A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point • A noise point is any point that is not a core point or a border point
The DBSCAN Algorithm • Input: Eps, MinPts; Output: a clustering • Label each point as core, border, or noise according to Eps and MinPts • Eliminate the noise points • Put an edge between every pair of core points within Eps of each other • Compute the connected components of the resulting graph; each connected component is one cluster • Assign each border point to the cluster of one of its associated core points, chosen arbitrarily if there is more than one
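A sketch of the same idea using scikit-learn's DBSCAN implementation rather than the graph construction above; the `sklearn.cluster.DBSCAN` class and its attributes are real, while the data and parameter values are only illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered noise points (illustrative data)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)),
                 rng.normal(4, 0.3, (50, 2)),
                 rng.uniform(-2, 6, (10, 2))])

db = DBSCAN(eps=0.5, min_samples=4).fit(pts)   # Eps and MinPts from the slides
labels = db.labels_                            # -1 marks noise points
core_mask = np.zeros(len(pts), dtype=bool)
core_mask[db.core_sample_indices_] = True      # which points are core points

print("clusters:", set(labels) - {-1})
print("noise points:", np.sum(labels == -1))
print("core points:", core_mask.sum(),
      "border points:", np.sum((labels != -1) & ~core_mask))
```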
DBSCAN: Core, Border and Noise Points (Eps = 10, MinPts = 4) [Figure: original points and the same points labeled by point type: core, border, and noise]
When DBSCAN Works Well • Resistant to noise • Can handle clusters of different shapes and sizes [Figure: original points and the clusters found]
When DBSCAN Does NOT Work Well • Varying densities • High-dimensional data [Figure: original points and the clusters found with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]
Cluster Evaluation • Accuracy, precision, recall • Purposes of evaluation: • To avoid finding patterns in noise • To compare clustering algorithms • To compare two sets of clusters • To compare two clusters
Clusters found in random data [Figure: random points and the clusters imposed on them by DBSCAN, K-means, and complete link]
Different Aspects of Cluster Validation • 1. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels • 2. Evaluating how well the results of a cluster analysis fit the data without reference to external information — use only the data • 3. Comparing the results of two different sets of cluster analyses to determine which is better • 4. Determining the ‘correct’ number of clusters • For 1, 2 and 3, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters
Cluster Validity Measures • SSE: the sum of squared distances from points to their cluster centroid,
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2
• SSB (between-group sum of squares): let c be the centroid of all points, with distances computed as Euclidean distances, and define the total
SSB = \sum_{i=1}^{K} m_i \, \mathrm{dist}(c_i, c)^2
where m_i is the number of points in cluster i. The larger SSB is, the better the separation between clusters.
If there are K clusters of equal size, so that m_i = m/K where m is the total number of points, this simplifies to
SSB = \frac{m}{K} \sum_{i=1}^{K} \mathrm{dist}(c_i, c)^2
If the total sum of squares TSS is defined as
TSS = \sum_{x} \mathrm{dist}(x, c)^2
then it can be shown that TSS = SSE + SSB, i.e., SSE + SSB is a constant that does not depend on the clustering.
Unsupervised Cluster Evaluation: Using the Proximity Matrix • Suppose we are given the similarity matrix of a data set together with its cluster labels. Ideally, the similarity of any two points within the same cluster is 1 and the similarity of two points in different clusters is 0, so if we sort the points by cluster label, the similarity matrix is block diagonal. In an actual implementation we can use the similarity values themselves as pixel intensities and evaluate the clustering quality visually.
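A small numeric check of the TSS = SSE + SSB identity in Python; the toy data is made up purely for illustration:

```python
import numpy as np

# Toy data: two 1-D clusters
clusters = [np.array([1.0, 2.0, 3.0]), np.array([8.0, 9.0, 10.0])]
all_pts = np.concatenate(clusters)
c = all_pts.mean()                                            # overall centroid

sse = sum(((ci - ci.mean()) ** 2).sum() for ci in clusters)   # within-cluster scatter
ssb = sum(len(ci) * (ci.mean() - c) ** 2 for ci in clusters)  # between-cluster scatter
tss = ((all_pts - c) ** 2).sum()                              # total scatter

print(sse, ssb, tss)   # 4.0  73.5  77.5 -> tss == sse + ssb
```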
Ward’s algorithm • start out with all sample units in n clusters of size 1 each. Merge two clusters with In the largest r2 value (equally the minimum error square), repeat the process until there is only one cluster.
Measuring Cluster Validity Via Correlation • Correlation of the incidence and proximity matrices for the K-means clusterings of two data sets: Corr = -0.9235 for clearly separated clusters; Corr = -0.5810 for clusters that are not clearly separated.
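A sketch of how such a correlation can be computed in Python: the incidence matrix marks whether two points share a cluster label, the proximity matrix holds pairwise distances, and the two are correlated entry by entry. The helper name and sample data are illustrative assumptions:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def cluster_label_correlation(points, labels):
    """Pearson correlation between the proximity (distance) matrix and the
    incidence matrix (1 if two points share a cluster label, else 0)."""
    dist = squareform(pdist(points))                      # pairwise distance matrix
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(points), k=1)                # use each pair only once
    return np.corrcoef(dist[iu], incidence[iu])[0, 1]

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)
print(cluster_label_correlation(pts, labels))   # strongly negative for well-separated clusters
```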
Using the Similarity Matrix for Cluster Validation • Evaluate the clustering by visualizing the similarity matrix: order the similarity matrix with respect to cluster labels and inspect it visually.
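A sketch of this visual check in Python; matplotlib's `imshow` is a real call, while converting distances to similarities via 1 - d/d.max() and the sample data are assumptions made for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)

order = np.argsort(labels)                    # sort points by cluster label
dist = squareform(pdist(pts[order]))
sim = 1 - dist / dist.max()                   # map distances to [0, 1] similarities

plt.imshow(sim, cmap="viridis")               # good clusterings show bright diagonal blocks
plt.colorbar(label="similarity")
plt.show()
```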
Using the Similarity Matrix for Cluster Validation • Clusters in random data are not so crisp [Figure: reordered similarity matrix for the DBSCAN clusters found in random data]