CS533 Information Retrieval
Dr. Michal Cutler
Lecture #18, April 6, 1999
Clustering • Applications in information retrieval • Algorithms
Applications • Document clustering • libraries • retrieval • “information browsing” • document structuring
Applications • Item clustering • Term variants (n-grams method) • Thesaurus • Semantic networks
Clustering algorithms • An algorithm for agglomerative hierarchical clustering • Single link, complete link, average link, centroid and Ward clustering • Heuristic single pass clustering • Cluster based retrieval
Clustering • Similar items are grouped together • Requires computing the similarity between pairs of clusters • The number of clusters is either decided in advance, or • determined by a cut-off value on similarity
Similarity functions for binary vectors • Based on matching the corresponding 1s and 0s of the two vectors • For example, the number of matching 1s plus the number of matching 0s, divided by the size of a vector, or • coefficients such as the inner product, Dice, Cosine, and Jaccard
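As a concrete illustration, here is a minimal Python sketch of several of these coefficients for binary vectors; the function name and the example vectors are mine, not from the lecture:

```python
def binary_similarities(x, y):
    """Similarity coefficients for two equal-length 0/1 vectors.
    Assumes neither vector is all zeros."""
    ones = sum(a & b for a, b in zip(x, y))     # matching 1s (the inner product)
    union = sum(a | b for a, b in zip(x, y))    # positions with a 1 in either vector
    nx, ny = sum(x), sum(y)
    return {
        "inner product": ones,
        "Dice": 2 * ones / (nx + ny),
        "Cosine": ones / (nx * ny) ** 0.5,
        "Jaccard": ones / union,
    }

# Two vectors sharing two 1s: Dice = 2/3, Cosine = 2/3, Jaccard = 1/2.
print(binary_similarities([1, 1, 0, 1], [1, 1, 1, 0]))
```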
Similarity functions for points in space • Distance usually used • Most similar pair is minimum distance pair
Similarity functions for points in space • For Euclidean space, cosine similarity and Euclidean distance are commonly used • Statistical measures, such as the correlation coefficient between two items, are also used
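For points in space the two standard choices look like this (a minimal sketch; the function names are mine):

```python
import math

def euclidean_distance(x, y):
    """Distance between two points; the most similar pair is the minimum-distance pair."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms
```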
Hierarchical methods • Agglomerative • Divisive • Heuristic (faster methods)
Agglomerative • Initially each item is in its own cluster • Clusters are joined until all items are in one cluster
Divisive • All items start in one cluster • Clusters are divided into smaller clusters • The process continues until each item is in a separate cluster
Heuristic (faster methods) • Do not compute similarity between every pair • Important for dynamic incremental clustering
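A sketch of one such heuristic, single-pass clustering (listed among the algorithms above): each incoming item is compared only to a representative of each existing cluster, joining the best one if its similarity clears a threshold and starting a new cluster otherwise. The representative choice (the cluster's first item) and the names are my assumptions:

```python
def single_pass_cluster(items, sim, threshold):
    """Heuristic single-pass clustering: one comparison per existing cluster
    for each item, instead of comparisons with every other item."""
    clusters = []                         # each cluster is a list of items
    for x in items:
        best, best_sim = None, threshold
        for c in clusters:
            s = sim(x, c[0])              # compare to the cluster representative
            if s >= best_sim:
                best, best_sim = c, s
        if best is not None:
            best.append(x)                # join the most similar cluster
        else:
            clusters.append([x])          # start a new cluster
    return clusters
```

Because items are never revisited, the result depends on arrival order, which is exactly what makes the method suitable for dynamic, incremental collections.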
Agglomerative clustering (initialize)
C := N {one item per cluster}
for i := 1 to N do {create the N clusters}
    Gi := {xi}
endfor
S := {G1, …, GN} {create the set S of all clusters}
{compute the similarity s(xi, xj) between all cluster pairs}
for all 1 ≤ i, j ≤ N with i ≠ j do
    s(Gi, Gj) := s(xi, xj)
endfor
Hierarchical clustering (main)
while C > 1 do
    find Gp, Gq s.t. s(Gp, Gq) = max{s(Gi, Gj) | i ≠ j}
    Gr := Gp ∪ Gq
    delete Gp and Gq from S and add Gr to S
    save the information needed to build a dendrogram
    C := C − 1 {decrease the number of clusters}
    {compute the similarity to the new cluster}
    forall Gi ∈ S s.t. Gi ≠ Gr do
        compute s(Gi, Gr)
    endfor
endwhile
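The pseudocode above translates almost line for line into Python. The sketch below (all names mine) keeps the pairwise cluster similarities in a dictionary and takes the cluster-similarity function as a parameter, so single, complete, and average link all fit the same skeleton:

```python
def agglomerative(items, sim, cluster_sim):
    """Agglomerative hierarchical clustering following the pseudocode above.
    sim(x, y): item-item similarity; cluster_sim(A, B, sim): cluster-cluster
    similarity. Returns the merges needed to build a dendrogram."""
    S = [frozenset([x]) for x in items]               # the N initial clusters
    s = {frozenset([A, B]): cluster_sim(A, B, sim)    # all pairwise similarities
         for i, A in enumerate(S) for B in S[i + 1:]}
    merges = []
    while len(S) > 1:
        Gp, Gq = max(((A, B) for i, A in enumerate(S) for B in S[i + 1:]),
                     key=lambda pair: s[frozenset(pair)])
        Gr = Gp | Gq
        merges.append((set(Gp), set(Gq), s[frozenset([Gp, Gq])]))
        S.remove(Gp); S.remove(Gq)
        for Gi in S:                                  # similarity to the new cluster
            s[frozenset([Gi, Gr])] = cluster_sim(Gi, Gr, sim)
        S.append(Gr)
    return merges

def single_link(A, B, sim):
    """Similarity of the most similar pair, one item from each cluster."""
    return max(sim(x, y) for x in A for y in B)
```

Run with single_link on the five-item example used later in this lecture, this reproduces the merge order {b,c}, {d,e}, {b,c,d,e}, {a,b,c,d,e}.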
Analysis • Number of clusters O(N) • Number of pairs O(N²) • Initialization O(N²)
Analysis • Finding the pair of clusters to join: O(N²) • Computing the similarity to the new cluster: O(N·g(N)), where g(N) is an upper bound on the cost of computing s(Gi, Gj) • In total: O(N³ + N²·g(N))
Analysis • Performance is better for some clustering algorithms, for example single link • It can be further improved by using parallel computation • When a similarity matrix is used, the time and space are Θ(N²) • For very large N, computing the similarity matrix is too time consuming
Single link • Uses the similarity between the most similar pair of items • x ∈ Gr and y ∈ Gi such that: s(Gr, Gi) = max{s(x, y) | x ∈ Gr and y ∈ Gi}
Complete link • Uses the similarity between the least similar pair of items • x ∈ Gr and y ∈ Gi such that: s(Gr, Gi) = min{s(x, y) | x ∈ Gr and y ∈ Gi}
Single and complete link • To save computation time, s(Gr, Gi) can be computed from s(Gp, Gi) and s(Gq, Gi) by taking • the maximum of the two for single link, • and the minimum for complete link (see the sketch below)
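In code this update is a single line; a sketch (names mine). It also makes g(N) constant, so the per-merge update loop over the remaining clusters is only O(N):

```python
def merged_similarity(s_p_i, s_q_i, link="single"):
    """s(Gr, Gi) for Gr = Gp U Gq, computed from the old values
    s(Gp, Gi) and s(Gq, Gi)."""
    return max(s_p_i, s_q_i) if link == "single" else min(s_p_i, s_q_i)
```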
Single link - update example [Figure: the joined new cluster and its similarity to each remaining cluster; edges show distance (minimum distance = maximum similarity).]
Complete link - update example [Figure: Gp and Gq are joined into Gr; its edge to Gi shows the maximum distance (minimum similarity).] • s(Gr, Gi) = min{s(Gp, Gi), s(Gq, Gi)} = s(Gp, Gi) (in the figure, s(Gp, Gi) is the smaller of the two)
Complete link [Figure: the joined new cluster and its similarity to each remaining cluster under complete link.]
Maximum spanning tree • Single link clustering can be solved with a maximum spanning tree algorithm • Build the complete graph whose edge weights are the similarities • Kruskal gives the hierarchy directly, in O(N² lg N) • Prim derives the tree in O(N²); the cluster hierarchy is then derived from the tree
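A sketch of the Kruskal route (union-find with path halving; names mine): scan the edges in order of decreasing similarity, and every edge that connects two different components is exactly one merge of the single-link hierarchy:

```python
def single_link_by_kruskal(items, edges):
    """edges: (similarity, x, y) triples of the complete graph.
    Returns the single-link merges from most to least similar."""
    parent = {x: x for x in items}                   # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]            # path halving
            x = parent[x]
        return x
    merges = []
    for s, x, y in sorted(edges, reverse=True):      # decreasing similarity
        rx, ry = find(x), find(y)
        if rx != ry:                                 # joins two components:
            parent[rx] = ry                          # a spanning-tree edge,
            merges.append((x, y, s))                 # i.e. a merge at level s
    return merges
```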
Example
      a     b     c     d     e
a     1    0.4   0.3   0.2   0.1
b           1    0.8   0.2   0.1
c                 1    0.5   0.4
d                       1    0.7
e                             1
The graph [Figure: the complete graph on {a, b, c, d, e}; each edge is weighted with the similarity from the matrix above, e.g. s(b,c) = 0.8, s(d,e) = 0.7, s(c,d) = 0.5.]
Single link - Kruskal • s(b,c)=0.8 - {b,c} • s(d,e)=0.7 - {d,e} • s(c,d)=0.5 - {b,c,d,e} • s(a,b)=0.4 - {a,b,c,d,e}
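Feeding the ten weighted edges of the example graph to the Kruskal sketch above reproduces exactly this merge sequence:

```python
edges = [(0.4, 'a', 'b'), (0.3, 'a', 'c'), (0.2, 'a', 'd'), (0.1, 'a', 'e'),
         (0.8, 'b', 'c'), (0.2, 'b', 'd'), (0.1, 'b', 'e'),
         (0.5, 'c', 'd'), (0.4, 'c', 'e'), (0.7, 'd', 'e')]
print(single_link_by_kruskal('abcde', edges))
# [('b', 'c', 0.8), ('d', 'e', 0.7), ('c', 'd', 0.5), ('a', 'b', 0.4)]
```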
Single link - problem • A problem with the single link method is that items such as e and b, whose similarity is only 0.1, end up clustered together (the chaining effect)
Single link dendrogram [Figure: dendrogram over a, b, c, d, e — b and c join at 0.8, d and e join at 0.7, {b,c} and {d,e} join at 0.5, and a joins at 0.4.]
Single link (based on agglomerative)
      a     b     c     d     e
a     1    0.4   0.3   0.2   0.1
b           1    0.8   0.2   0.1
c                 1    0.5   0.4
d                       1    0.7
e                             1
Initially {a}, {b}, {c}, {d}, {e}
Join {b, c} with s(b, c) = 0.8
s(a, {b,c}) = 0.4, s(d, {b,c}) = 0.5, s(e, {b,c}) = 0.4
Single link
        a    {b,c}   d     e
a       1     0.4   0.2   0.1
{b,c}         1     0.5   0.4
d                   1     0.7
e                         1
Join {d, e} with s(d, e) = 0.7
s(a, {d,e}) = 0.2, s({d,e}, {b,c}) = 0.5
Single link
        a    {b,c}  {d,e}
a       1     0.4    0.2
{b,c}         1      0.5
{d,e}                1
Join {b,c} and {d,e} into {b,c,d,e} with s({b,c}, {d,e}) = 0.5
s(a, {b,c,d,e}) = 0.4
Single link
            a    {b,c,d,e}
a           1       0.4
{b,c,d,e}           1
Join a and {b,c,d,e} into {a,b,c,d,e} with s(a, {b,c,d,e}) = 0.4
Single link [Figure: the original similarity matrix next to the resulting dendrogram — b and c join at 0.8, d and e at 0.7, {b,c} and {d,e} at 0.5, and a at 0.4.]
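The same hierarchy can be cross-checked against a library implementation, assuming SciPy is available; converting similarities to distances as 1 − s preserves the ordering:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

sim = np.array([[1.0, 0.4, 0.3, 0.2, 0.1],    # a
                [0.4, 1.0, 0.8, 0.2, 0.1],    # b
                [0.3, 0.8, 1.0, 0.5, 0.4],    # c
                [0.2, 0.2, 0.5, 1.0, 0.7],    # d
                [0.1, 0.1, 0.4, 0.7, 1.0]])   # e
Z = linkage(squareform(1.0 - sim), method='single')
# Each row of Z: the two clusters merged, the merge distance, the new size.
# Merge distances 0.2, 0.3, 0.5, 0.6 correspond to similarities 0.8, 0.7, 0.5, 0.4.
print(Z)
```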
Complete link example • Initially {a}, {b}, {c}, {d}, {e} • The most similar pair of clusters is {b} and {c}, with s(b,c) = 0.8 • The first cluster is {b,c}
Complete link (based on agglomerative)
      a     b     c     d     e
a     1    0.4   0.3   0.2   0.1
b           1    0.8   0.2   0.1
c                 1    0.5   0.4
d                       1    0.7
e                             1
s(a, {b,c}) = 0.3, s(d, {b,c}) = 0.2, s(e, {b,c}) = 0.1
Complete link
        a    {b,c}   d     e
a       1     0.3   0.2   0.1
{b,c}         1     0.2   0.1
d                   1     0.7
e                         1
Join {d, e} with s(d, e) = 0.7
s(a, {d,e}) = 0.1, s({d,e}, {b,c}) = 0.1
Complete link
        a    {b,c}  {d,e}
a       1     0.3    0.1
{b,c}         1      0.1
{d,e}                1
Join {a,b,c} with s(a, {b,c}) = 0.3
Complete link
          {a,b,c}  {d,e}
{a,b,c}      1      0.1
{d,e}               1
s({d,e}, {a,b,c}) = min{0.1, 0.1} = 0.1
The final cluster is {a,b,c,d,e}
Complete link dendrogram [Figure: b and c join at 0.8, d and e at 0.7, a joins {b,c} at 0.3, and {a,b,c} joins {d,e} at 0.1.]
Average link • Uses the average of the similarities between all pairs of items x ∈ Gr and y ∈ Gi
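A direct sketch (name mine), which plugs into the agglomerative skeleton above in place of single_link:

```python
def average_link(A, B, sim):
    """Average similarity over all pairs (x, y) with x in A and y in B."""
    return sum(sim(x, y) for x in A for y in B) / (len(A) * len(B))
```

Like single and complete link, it also admits an incremental update: s(Gr, Gi) = (|Gp|·s(Gp, Gi) + |Gq|·s(Gq, Gi)) / (|Gp| + |Gq|).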
Average link example • The most similar pair of clusters is {b} and {c}, with s(b,c) = 0.8 • The first cluster is {b,c}
Average link example • s(a, {b,c}) = (0.4+0.3)/2 = 0.35 • s(d, {b,c}) = (0.2+0.5)/2 = 0.35 • s(e, {b,c}) = (0.1+0.4)/2 = 0.25 • s(a,d) = 0.2, s(a,e) = 0.1, s(d,e) = 0.7 • The second cluster is {d,e}
Average link example • s(a, {d,e}) = (0.2+0.1)/2 = 0.15 • s({d,e}, {b,c}) = (0.2+0.5+0.1+0.4)/4 = 0.3 • s(a, {b,c}) = 0.35 • The next cluster is {a,b,c}
Average link example • s({a,b,c}, {d,e}) = (0.2+0.1+0.2+0.1+0.5+0.4)/6 = 0.25 • The last cluster is {a,b,c,d,e}
Average link dendrogram [Figure: b and c join at 0.8, d and e at 0.7, a joins {b,c} at 0.35, and {a,b,c} joins {d,e} at 0.25.]