Hierarchical Clustering
Hierarchical Clustering • Produces a set of nested clusters organized as a hierarchical tree • Can be visualized as a dendrogram • A tree-like diagram that records the sequence of merges or splits
Strengths of Hierarchical Clustering • Do not have to assume any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level • They may correspond to meaningful taxonomies • Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Two main types of hierarchical clustering • Agglomerative: • Start with the points as individual clusters • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left • Divisive: • Start with one, all-inclusive cluster • At each step, split a cluster until each cluster contains a single point (or there are k clusters) • Traditional hierarchical algorithms use a similarity or distance matrix • Merge or split one cluster at a time
An Agglomerative Clustering Algorithm • The basic algorithm is straightforward • Compute the proximity matrix • Let each data point be a cluster • Repeat • Merge the two closest clusters • Update the proximity matrix • Until only a single cluster remains • The key operation is the computation of the proximity of two clusters • Different approaches to defining the distance between clusters distinguish the different algorithms
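A minimal sketch of this loop in Python, using only NumPy; the function name `agglomerative` and the choice of single-link (MIN) as the cluster proximity are illustrative assumptions, not something prescribed by the slides:

```python
import numpy as np

def agglomerative(points, k=1):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters (single-link proximity here) until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]   # each point starts as its own cluster
    # proximity matrix of pairwise Euclidean distances
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    def cluster_dist(a, b):
        # single-link (MIN) distance between two clusters
        return min(d[i, j] for i in a for j in b)

    while len(clusters) > k:
        # find the closest pair of clusters
        pairs = [(cluster_dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        # merge them; the proximity "update" happens implicitly in cluster_dist
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Example: two obvious groups of 2-D points
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]])
print(agglomerative(pts, k=2))   # -> [[0, 1, 2], [3, 4]]
```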
Starting Situation • Start with clusters of individual points and a proximity matrix [Figure: proximity matrix over points p1, p2, p3, p4, p5, …]
Intermediate Situation • After some merging steps, we have some clusters [Figure: clusters C1–C5 and their proximity matrix]
Intermediate Situation • We want to merge the two closest clusters (C2 and C5) and update the proximity matrix [Figure: clusters C1–C5 and their proximity matrix]
After Merging • The question is “How do we update the proximity matrix?” [Figure: proximity matrix after merging, with the row and column for the new cluster C2 U C5 marked with ?]
How to Define Inter-Cluster Similarity • MIN • MAX • Group Average • Distance Between Centroids • Other methods driven by an objective function • Ward’s Method uses squared error [Figure: proximity matrix over points p1–p5, with the similarity between two clusters highlighted]
Cluster Similarity: MIN or Single Link • Similarity of two clusters is based on the two most similar (closest) points in the different clusters • Determined by one pair of points, i.e., by one link in the proximity graph
Hierarchical Clustering: MIN [Figure: nested clusters and the corresponding dendrogram for single-link clustering]
dist({3,6}, {2,5}) = min{dist(3,2), dist(6,2), dist(3,5), dist(6,5)} = min{0.15, 0.25, 0.28, 0.39} = 0.15
Strength of MIN • Can handle non-elliptical shapes [Figure: original points and the two clusters found]
Limitations of MIN • Sensitive to noise and outliers [Figure: original points and the two clusters found]
Cluster Similarity: MAX or Complete Linkage • Similarity of two clusters is based on the two least similar (most distant) points in the different clusters • Determined by all pairs of points in the two clusters
Hierarchical Clustering: MAX [Figure: nested clusters and the corresponding dendrogram for complete-link clustering]
dist({3,6}, {4}) = max{dist(3,4), dist(6,4)} = max{0.15, 0.22} = 0.22
dist({3,6}, {2,5}) = max{dist(3,2), dist(6,2), dist(3,5), dist(6,5)} = max{0.15, 0.25, 0.28, 0.39} = 0.39
Strength of MAX • Less susceptible to noise and outliers [Figure: original points and the two clusters found]
Limitations of MAX • Tends to break large clusters • Biased towards globular clusters [Figure: original points and the two clusters found]
Cluster Similarity: Group Average • Proximity of two clusters is the average of the pairwise proximities between points in the two clusters • Need to use average connectivity for scalability, since total proximity favors large clusters
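Written out, the group-average proximity described above takes the standard form (added here for reference; the formula itself is not printed on the slide):

```latex
\mathrm{proximity}(C_i, C_j) \;=\;
\frac{\displaystyle\sum_{x \in C_i}\sum_{y \in C_j} \mathrm{proximity}(x, y)}
     {|C_i| \cdot |C_j|}
```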
Hierarchical Clustering: Group Average [Figure: nested clusters and the corresponding dendrogram for group-average clustering]
Hierarchical Clustering: Group Average • Compromise between Single and Complete Link • Strengths • Less susceptible to noise and outliers • Limitations • Biased towards globular clusters
Cluster Similarity: Ward’s Method • Similarity of two clusters is based on the increase in squared error when two clusters are merged • Similar to group average if distance between points is distance squared • Less susceptible to noise and outliers • Biased towards globular clusters • Hierarchical analogue of K-means • Can be used to initialize K-means
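For reference, the increase in squared error when two clusters A and B with centroids c_A and c_B are merged has a standard closed form (not given on the slide):

```latex
\Delta \mathrm{SSE}(A, B) \;=\; \frac{|A|\,|B|}{|A| + |B|}\,
\bigl\lVert \mathbf{c}_A - \mathbf{c}_B \bigr\rVert^{2}
```

This is what makes Ward's method behave like a hierarchical analogue of K-means: each merge is the one that increases the K-means objective the least.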
Hierarchical Clustering: Comparison [Figure: nested clusters produced by MIN, MAX, Group Average, and Ward’s Method on the same data set]
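A sketch of how the four schemes can be compared side by side with SciPy; the `scipy.cluster.hierarchy` calls are standard, while the sample data and parameter choices are purely illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two loose 2-D blobs as toy data (illustrative only)
pts = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

for method in ["single", "complete", "average", "ward"]:   # MIN, MAX, group average, Ward
    Z = linkage(pts, method=method)                  # full merge history (the dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes per linkage scheme
```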
Hierarchical Clustering: Time and Space Requirements • O(N²) space, since it uses the proximity matrix • N is the number of points • O(N³) time in many cases • There are N steps, and at each step a proximity matrix of size O(N²) must be updated and searched • Complexity can be reduced to O(N² log N) time for some approaches
Hierarchical Clustering: Problems and Limitations • No global objective function is directly optimized • Different schemes have problems with one or more of the following: • Sensitivity to noise and outliers • Difficulty handling clusters of different sizes and convex shapes • Breaking large clusters • Once a merge is made, it cannot be undone • The O(n² log n) time complexity limits its applicability
MST: Divisive Hierarchical Clustering • Build MST (Minimum Spanning Tree) • Start with a tree that consists of any point • In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not • Add q to the tree and put an edge between p and q
MST: Divisive Hierarchical Clustering • Use MST for constructing hierarchy of clusters
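A minimal Python sketch of MST-based divisive clustering, under the common convention that a split is made by removing the longest remaining MST edge; the Prim-style tree growth mirrors the steps above, and the function name and edge-removal rule are assumptions for illustration:

```python
import numpy as np

def mst_divisive(points, k):
    """Divisive clustering: build an MST with Prim's algorithm, then
    remove the k-1 longest edges; the connected components are the clusters."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    # Prim's algorithm: start from any point, repeatedly attach the closest outside point
    in_tree = {0}
    edges = []                                  # (length, p, q) edges of the MST
    while len(in_tree) < n:
        length, p, q = min((d[p, q], p, q)
                           for p in in_tree for q in range(n) if q not in in_tree)
        edges.append((length, p, q))
        in_tree.add(q)

    # Drop the k-1 longest edges, then collect connected components via union-find
    edges.sort()
    kept = edges[: n - k]                       # keep the n-k shortest edges
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, p, q in kept:
        parent[find(p)] = find(q)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]])
print(mst_divisive(pts, k=2))   # -> two groups, e.g. [[0, 1, 2], [3, 4]]
```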
The advantage • It can serve as an algorithm for both flat (planar) and hierarchical clustering • The time complexity of the algorithm is only O(n²), which is the time complexity of building an MST
DBSCAN • DBSCAN is a density-based algorithm • Density = number of points within a specified radius (Eps) • A point is a core point if it has more than a specified number of points (MinPts) within Eps • These are points in the interior of a cluster • A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point • A noise point is any point that is not a core point or a border point
The DBSCAN Algorithm • Input: Eps, MinPts; Output: a clustering • Label each point as core, border, or noise according to Eps and MinPts • Eliminate the noise points • Put an edge between every pair of core points within Eps of each other • Compute the connected components of the resulting graph; each connected component is one cluster • Assign each border point to the cluster of one of its associated core points, chosen arbitrarily if there is more than one
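A sketch of the same idea using scikit-learn's DBSCAN implementation rather than the graph construction above; the `sklearn.cluster.DBSCAN` class and its attributes are real, while the data and parameter values are only illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered noise points (illustrative data)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)),
                 rng.normal(4, 0.3, (50, 2)),
                 rng.uniform(-2, 6, (10, 2))])

db = DBSCAN(eps=0.5, min_samples=4).fit(pts)   # Eps and MinPts from the slides
labels = db.labels_                            # -1 marks noise points
core_mask = np.zeros(len(pts), dtype=bool)
core_mask[db.core_sample_indices_] = True      # which points are core points

print("clusters:", set(labels) - {-1})
print("noise points:", np.sum(labels == -1))
print("core points:", core_mask.sum(),
      "border points:", np.sum((labels != -1) & ~core_mask))
```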
DBSCAN: Core, Border and Noise Points (Eps = 10, MinPts = 4) [Figure: original points and the same points labeled by point type: core, border, and noise]
When DBSCAN Works Well • Resistant to noise • Can handle clusters of different shapes and sizes [Figure: original points and the clusters found]
When DBSCAN Does NOT Work Well • Varying densities • High-dimensional data [Figure: original points and the clusters found with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]
Cluster Evaluation • Accuracy, precision, recall • Purposes of evaluation: • To avoid finding patterns in noise • To compare clustering algorithms • To compare two sets of clusters • To compare two clusters
Clusters found in random data [Figure: random points and the clusters imposed on them by DBSCAN, K-means, and complete link]
Different Aspects of Cluster Validation • 1. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels • 2. Evaluating how well the results of a cluster analysis fit the data without reference to external information — use only the data • 3. Comparing the results of two different sets of cluster analyses to determine which is better • 4. Determining the ‘correct’ number of clusters • For 1, 2 and 3, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters
Cluster Validity Measures • SSE: the sum of squared distances from points to their cluster centroid,
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2
• SSB (between-group sum of squares): let c be the centroid of all points, with distances computed as Euclidean distances, and define the total
SSB = \sum_{i=1}^{K} m_i \, \mathrm{dist}(c_i, c)^2
where m_i is the number of points in cluster i. The larger SSB is, the better the separation between clusters.
If there are K clusters of equal size, so that m_i = m/K where m is the total number of points, this simplifies to
SSB = \frac{m}{K} \sum_{i=1}^{K} \mathrm{dist}(c_i, c)^2
If the total sum of squares TSS is defined as
TSS = \sum_{x} \mathrm{dist}(x, c)^2
then it can be shown that TSS = SSE + SSB, i.e., SSE + SSB is a constant that does not depend on the clustering.
Unsupervised Cluster Evaluation: Using the Proximity Matrix • Suppose we are given the similarity matrix of a data set together with its cluster labels. Ideally, the similarity of any two points within the same cluster is 1 and the similarity of two points in different clusters is 0, so if we sort the points by cluster label, the similarity matrix is block diagonal. In an actual implementation we can use the similarity values themselves as pixel intensities and evaluate the clustering quality visually.
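A small numeric check of the TSS = SSE + SSB identity in Python; the toy data is made up purely for illustration:

```python
import numpy as np

# Toy data: two 1-D clusters
clusters = [np.array([1.0, 2.0, 3.0]), np.array([8.0, 9.0, 10.0])]
all_pts = np.concatenate(clusters)
c = all_pts.mean()                                            # overall centroid

sse = sum(((ci - ci.mean()) ** 2).sum() for ci in clusters)   # within-cluster scatter
ssb = sum(len(ci) * (ci.mean() - c) ** 2 for ci in clusters)  # between-cluster scatter
tss = ((all_pts - c) ** 2).sum()                              # total scatter

print(sse, ssb, tss)   # 4.0  73.5  77.5 -> tss == sse + ssb
```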
Ward’s algorithm • start out with all sample units in n clusters of size 1 each. Merge two clusters with In the largest r2 value (equally the minimum error square), repeat the process until there is only one cluster.
Measuring Cluster Validity Via Correlation • Correlation of the incidence and proximity matrices for the K-means clusterings of two data sets: Corr = -0.9235 for clearly separated clusters; Corr = -0.5810 for clusters that are not clearly separated.
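A sketch of how such a correlation can be computed in Python: the incidence matrix marks whether two points share a cluster label, the proximity matrix holds pairwise distances, and the two are correlated entry by entry. The helper name and sample data are illustrative assumptions:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def cluster_label_correlation(points, labels):
    """Pearson correlation between the proximity (distance) matrix and the
    incidence matrix (1 if two points share a cluster label, else 0)."""
    dist = squareform(pdist(points))                      # pairwise distance matrix
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(points), k=1)                # use each pair only once
    return np.corrcoef(dist[iu], incidence[iu])[0, 1]

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)
print(cluster_label_correlation(pts, labels))   # strongly negative for well-separated clusters
```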
Using the Similarity Matrix for Cluster Validation • Evaluate the clustering by visualizing the similarity matrix: order the similarity matrix with respect to cluster labels and inspect it visually.
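A sketch of this visual check in Python; matplotlib's `imshow` is a real call, while converting distances to similarities via 1 - d/d.max() and the sample data are assumptions made for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)

order = np.argsort(labels)                    # sort points by cluster label
dist = squareform(pdist(pts[order]))
sim = 1 - dist / dist.max()                   # map distances to [0, 1] similarities

plt.imshow(sim, cmap="viridis")               # good clusterings show bright diagonal blocks
plt.colorbar(label="similarity")
plt.show()
```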
Using the Similarity Matrix for Cluster Validation • Clusters in random data are not so crisp [Figure: reordered similarity matrix for the DBSCAN clusters found in random data]