Clustering II Relevant keywords: K-nearest neighbor, Single-link, Complete-link, Average-link, and Centroid-based Clustering
Outline • Motivations • Hierarchical • Overview • Dendrogram • Single-link vs. centroid-based approach • Other clustering approaches
Motivation • Help identify structure • To support browsing • To help refine queries • Recall vs. precision • To improve retrieval efficiency
Hierarchical • [Figure: agglomerative clustering runs from step 0 to step 4, fusing the items a, b, c, d, e into the clusters (a, b), (d, e), (c, d, e), and finally (a, b, c, d, e); divisive clustering runs the same steps in the opposite direction, from 4 to 0]
Agglomerative • More widely applied • Start with n distinct entities (single-member clusters) and end with one cluster of all n entities • At each stage, fuse the entities or groups that are closest (most similar) • Variant agglomerative methods exist depending on how distance (similarity) is defined between an entity and a group or between two groups
Single link agglomerative approach • Requires starting with a similarity (or distance) matrix • Distance between two clusters is defined as the distance between the closest pair of items, one from each cluster, i.e., nearest neighbor
Example - SLA • Let's assume we begin with a distance matrix D1 of five entities. The goal is to derive a dendrogram as depicted in the next slide. D1 =
[Figure: dendrogram for items 1 to 5, with a vertical distance axis running from 0.0 to 5.0] • Partitions: P1 = [1] [2] [3] [4] [5] • P2 = [1 2] [3] [4] [5] • P3 = [1 2] [3] [4 5] • P4 = [1 2] [3 4 5] • P5 = [1 2 3 4 5]
Dendrogram • The height at which two branches join represents the distance between the items or clusters being merged • Each branching point represents a merge conducted to form a cluster
Agglomerative SL Algorithm • Given the distance matrix, find the smallest non-zero entry and merge the corresponding two clusters • Recalculate distances between clusters based on the closest two neighbors of all clusters (i.e., nearest neighbor approach) • Test to see if the desired number of clusters is achieved; if not, loop back to the top step
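The steps above can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the names `cluster_distance` and `single_link` and the 0-based item indices are choices made here. Instead of rewriting the matrix after each merge, the sketch recomputes each cluster-to-cluster distance as the minimum over the original entries, which gives the same result for single link. The matrix `D1` is an assumed example: the entries d23 = 5.0, d24 = 9.0, d25 = 8.0, and d34 = 4.0 match values quoted in the walkthrough slides, and the remaining entries are hypothetical values chosen to be consistent with them.

```python
from itertools import combinations

def cluster_distance(dist, a, b):
    """Single-link distance: the closest pair of items, one from each cluster."""
    return min(dist[i][j] for i in a for j in b)

def single_link(dist, n_clusters=1):
    """Fuse the two closest clusters until n_clusters remain.

    dist is a symmetric distance matrix (list of lists). Returns the final
    clusters and the merges as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(i,) for i in range(len(dist))]  # one cluster per item
    merges = []
    while len(clusters) > n_clusters:
        # Step 1: find the closest pair of clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(dist, *pair))
        merges.append((a, b, cluster_distance(dist, a, b)))
        # Step 2: replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return clusters, merges

# ASSUMED distance matrix (index 0 is item 1, index 4 is item 5).
D1 = [
    [0.0, 2.0, 6.0, 10.0, 9.0],
    [2.0, 0.0, 5.0, 9.0, 8.0],
    [6.0, 5.0, 0.0, 4.0, 5.0],
    [10.0, 9.0, 4.0, 0.0, 3.0],
    [9.0, 8.0, 5.0, 3.0, 0.0],
]

clusters, merges = single_link(D1)
print([height for _, _, height in merges])  # [2.0, 3.0, 4.0, 5.0]
```

Passing `n_clusters=2` stops the loop one merge earlier, which corresponds to the "desired number of clusters" test in the last step.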
Stepping through the SLA – Merge 1 • The smallest non-zero entry in the initial matrix is for items 1 and 2; these are fused, and the distances are recalculated using the single-link rule • After the first merge: • d(12)3 = min[d13, d23] = d23 = 5.0 • d(12)4 = min[d14, d24] = d24 = 9.0 • d(12)5 = min[d15, d25] = d25 = 8.0
SLA – Post merge 1 • The new matrix after merge 1 is above. The smallest entry is for entities 4 and 5. D2 =
SLA – Merge 2 • The entities 4 and 5 are merged and we recalculate the distances: • d(12)3 = 5.0 as before • d(12)(45) = min[d14, d15, d24, d25] = d25 = 8.0 • d(45)3 = min[d34, d35] = d34 = 4.0
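The matrix update used in merges 1 and 2 can be sketched as a single reduction step. The function name `merge_single_link` and the string labels are hypothetical, and `D1` is an assumed matrix: only d23, d24, d25, and d34 are taken from the slides, while the other entries are illustrative values consistent with them.

```python
def merge_single_link(labels, dist, i, j):
    """Fuse the clusters at positions i and j of a distance matrix.

    Returns the reduced labels and matrix, with the fused cluster placed
    first. The distance from the fused cluster to any other cluster k is
    min(dist[i][k], dist[j][k]), i.e., the nearest-neighbor rule.
    """
    keep = [k for k in range(len(labels)) if k not in (i, j)]
    new_labels = [labels[i] + labels[j]] + [labels[k] for k in keep]
    fused_row = [min(dist[i][k], dist[j][k]) for k in keep]
    new_dist = [[0.0] + fused_row]
    for pos, k in enumerate(keep):
        # Distances among the untouched clusters are copied unchanged.
        new_dist.append([fused_row[pos]] + [dist[k][m] for m in keep])
    return new_labels, new_dist

# ASSUMED distance matrix for items 1..5 (see the lead-in).
D1 = [
    [0.0, 2.0, 6.0, 10.0, 9.0],
    [2.0, 0.0, 5.0, 9.0, 8.0],
    [6.0, 5.0, 0.0, 4.0, 5.0],
    [10.0, 9.0, 4.0, 0.0, 3.0],
    [9.0, 8.0, 5.0, 3.0, 0.0],
]

# Merge 1: items 1 and 2 (positions 0 and 1).
labels2, D2 = merge_single_link(["1", "2", "3", "4", "5"], D1, 0, 1)
print(labels2, D2[0])  # ['12', '3', '4', '5'] [0.0, 5.0, 9.0, 8.0]

# Merge 2: items 4 and 5 (positions 2 and 3 of the reduced matrix).
labels3, D3 = merge_single_link(labels2, D2, 2, 3)
print(labels3, D3[0])  # ['45', '12', '3'] [0.0, 8.0, 4.0]
```

The first row of each reduced matrix reproduces the values in the walkthrough: d(12)3 = 5.0, d(12)4 = 9.0, d(12)5 = 8.0 after merge 1, and d(12)(45) = 8.0, d(45)3 = 4.0 after merge 2.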
SLA – Post merge 2 • The smallest entry in the matrix above is d(45)3
SL – Merges 3 and 4 • Item 3 is merged with (45), and we achieve two clusters, namely (345) and (12) • This corresponds to partition level 4 (P4) in the dendrogram • These two clusters, (345) and (12), are then merged into one to form the top-level cluster
Centroid Clustering • Another type of clustering takes into account all members of a cluster and requires access to the original raw data • The centroid approach may produce clusters with different “topologies” compared to the single link method
Euclidean Distance Centroid Clustering • Recall that Euclidean distance is the “as the crow flies” distance, i.e., a geometric measure • Most such distance measures are special cases of the so-called Minkowski metric • Euclidean distance between items i and j over p variables: $d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$
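For reference, the Minkowski metric mentioned above is usually written as below. The symbols are chosen here for illustration and are not taken from the slides: $x_{ik}$ is the value of variable $k$ for item $i$, $p$ is the number of variables, and $r$ is the order of the metric.

```latex
d_{ij} = \left( \sum_{k=1}^{p} \lvert x_{ik} - x_{jk} \rvert^{r} \right)^{1/r}, \qquad r \ge 1
% r = 2 gives the Euclidean distance:
%   d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}
% r = 1 gives the city-block (Manhattan) distance:
%   d_{ij} = \sum_{k=1}^{p} \lvert x_{ik} - x_{jk} \rvert
```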
CC - Example • Let’s assume we start with the “raw” matrix below:
Euclidean Distance Matrix • Inter-object distance based on Euclidean distance is as below: C1 =
CC - Merge 1 • Examination of C1 shows that c12 is the smallest entry, so objects 1 and 2 are merged into one cluster • The mean vector (centroid) of the group (12) is calculated as (1, 1.5), and a new Euclidean distance matrix is produced
CC – Post Merge 1 • The new resulting matrix is as below: C2 =
CC – Merge 2 • In the new matrix the smallest entry is c45, hence entities 4 and 5 are merged to form a second cluster • The mean vector (centroid) of the new cluster containing 4 and 5 is (8.0, 1.0) • A new distance matrix is now calculated
CC – Post Merge 2 • After calculating a new distance matrix the following is achieved: C3 =
CC – Merges 3 and 4 • In the last distance matrix the smallest entry is C(45)3, so entities 4, 5, and 3 are merged into one cluster (Merge 3) • Now there are only two clusters, (12) and (453), and in the next iteration these two are merged into one (Merge 4)
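The centroid procedure walked through above can be sketched in Python. The raw data matrix is not reproduced in this text, so `RAW` below is an assumed two-variable example: objects 1 and 2 are chosen so that their centroid is (1, 1.5), objects 4 and 5 so that their centroid is (8.0, 1.0), as stated in the slides, and object 3 is a hypothetical point. The function names are likewise illustrative.

```python
import math
from itertools import combinations

def centroid(points):
    """Mean vector of a list of equal-length coordinate tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def centroid_clustering(data, n_clusters=1):
    """Repeatedly fuse the two clusters whose centroids are closest.

    Clusters are keyed by tuples of 1-based object numbers. Returns the
    final clusters and the merges as (cluster_a, cluster_b, distance).
    """
    clusters = {(i,): [p] for i, p in enumerate(data, start=1)}
    merges = []
    while len(clusters) > n_clusters:
        def gap(pair):
            # Euclidean distance between the two cluster centroids.
            return math.dist(centroid(clusters[pair[0]]),
                             centroid(clusters[pair[1]]))
        a, b = min(combinations(clusters, 2), key=gap)
        merges.append((a, b, gap((a, b))))
        # The fused cluster keeps all raw points, so its centroid can be
        # recomputed from the original data on the next pass.
        clusters[tuple(sorted(a + b))] = clusters.pop(a) + clusters.pop(b)
    return clusters, merges

# ASSUMED raw data: one (variable 1, variable 2) pair per object 1..5.
RAW = [(1.0, 1.0), (1.0, 2.0), (6.0, 3.0), (8.0, 2.0), (8.0, 0.0)]

clusters, merges = centroid_clustering(RAW)
for a, b, d in merges:
    print(a, b, round(d, 2))
```

With this assumed data the merge order is (1)+(2), then (4)+(5), then (3)+(45), then (12)+(345), matching the sequence described in the slides. Unlike the single-link sketch, this method needs the raw coordinates, because a centroid cannot be derived from a distance matrix alone.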
Additional Clustering Methods • We can think of two classes of techniques: • Those that rely only on the proximity or distance matrix • Single link, complete link, and average link • Those that require access to the “raw” data matrix • Centroid clustering
Illustrations of other methods • [Figure: Cluster A (items 1 and 2) and Cluster B (items 3, 4, and 5), with the single-link, complete-link, and average-link distances between them drawn] • Average link: DAB = (d13 + d14 + d15 + d23 + d24 + d25)/6
Additional Methods - Explanation • Complete link: the distance between clusters is defined as the distance between the furthest members of the two clusters • Average link: the distance between clusters is calculated by averaging the distances between all pairs of items, one from each cluster
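The three matrix-based definitions differ only in how they summarize the cross-cluster distances, which a short Python sketch can make concrete. The function names are illustrative, and `D1` is an assumed distance matrix for items 1 to 5: d23, d24, d25, and d34 match values quoted earlier in the slides, and the rest are hypothetical.

```python
def cross_distances(dist, a, b):
    """All distances between an item of cluster a and an item of cluster b."""
    return [dist[i][j] for i in a for j in b]

def single_link_distance(dist, a, b):
    return min(cross_distances(dist, a, b))   # nearest members

def complete_link_distance(dist, a, b):
    return max(cross_distances(dist, a, b))   # furthest members

def average_link_distance(dist, a, b):
    pairs = cross_distances(dist, a, b)
    return sum(pairs) / len(pairs)            # mean over all cross pairs

# ASSUMED distance matrix (index 0 is item 1, index 4 is item 5).
D1 = [
    [0.0, 2.0, 6.0, 10.0, 9.0],
    [2.0, 0.0, 5.0, 9.0, 8.0],
    [6.0, 5.0, 0.0, 4.0, 5.0],
    [10.0, 9.0, 4.0, 0.0, 3.0],
    [9.0, 8.0, 5.0, 3.0, 0.0],
]

A = [0, 1]     # cluster A: items 1 and 2
B = [2, 3, 4]  # cluster B: items 3, 4, and 5

print(single_link_distance(D1, A, B))    # 5.0
print(complete_link_distance(D1, A, B))  # 10.0
print(average_link_distance(D1, A, B))   # (6 + 10 + 9 + 5 + 9 + 8) / 6
```

The average-link call sums exactly the six cross pairs d13, d14, d15, d23, d24, d25 and divides by 6.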