
Hierarchical Clustering Algorithms for Document Structuring

Learn about clustering algorithms for document structuring, including agglomerative and divisive methods, similarity functions, and performance analysis.

mariandavis


Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #18 April 6, 1999

  2. Clustering • Applications in information retrieval • Algorithms

  3. Applications • Document clustering • libraries • retrieval • “information browsing” • document structuring

  4. Applications • Item clustering • Term variants (n-grams method) • Thesaurus • Semantic networks

  5. Clustering algorithms • An algorithm for agglomerative hierarchical clustering • Single link, complete link, average link, centroid and Ward clustering • Heuristic single pass clustering • Cluster based retrieval

  6. Clustering • Similar items are clustered together • Need to compute similarity between pairs of clusters • Number of clusters needed decided in advance, or • Determined by using a cut-off value of similarity

  7. Similarity functions for binary vectors • Based on matching corresponding 1s and 0s in the two vectors. • For example, “(the number of matching 1s + the number of matching 0s) divided by the size of the vector”, or • using inner product, Dice, Cosine, Jaccard
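
The measures named on this slide can be sketched for binary vectors as follows. This is a minimal illustration using the standard textbook definitions, not code from the lecture; the function names and the example vectors `u` and `v` are hypothetical, and the vectors are assumed to be non-zero and of equal length.

```python
import math

def inner_product(u, v):
    """Number of positions where both binary vectors have a 1."""
    return sum(a * b for a, b in zip(u, v))

def simple_matching(u, v):
    """(matching 1s + matching 0s) divided by the size of the vector."""
    return sum(1 for a, b in zip(u, v) if a == b) / len(u)

def dice(u, v):
    """Twice the shared 1s over the total number of 1s in both vectors."""
    return 2 * inner_product(u, v) / (sum(u) + sum(v))

def cosine(u, v):
    """For binary vectors the squared norm equals the number of 1s."""
    return inner_product(u, v) / math.sqrt(sum(u) * sum(v))

def jaccard(u, v):
    """Shared 1s over the number of positions where either vector has a 1."""
    common = inner_product(u, v)
    return common / (sum(u) + sum(v) - common)

# Hypothetical example vectors (not taken from the lecture).
u = [1, 1, 0, 0, 1]
v = [1, 0, 0, 1, 1]
for f in (inner_product, simple_matching, dice, cosine, jaccard):
    print(f.__name__, round(f(u, v), 3))
```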

  8. Similarity functions for points in space • Distance usually used • Most similar pair is minimum distance pair

  9. Similarity functions for points in space • In Euclidean space, cosine and Euclidean distance are commonly used. • Statistical measures such as correlation coefficients between two items are also used
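
For points in space, the following sketch shows Euclidean distance, a Pearson correlation coefficient, and the "most similar pair is the minimum distance pair" rule. The helper names and the sample points are assumptions made for illustration only.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def most_similar_pair(points):
    """Return (distance, i, j) for the minimum-distance pair of points."""
    best = None
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = euclidean_distance(points[i], points[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best

# Hypothetical sample points (not from the lecture).
points = [(0, 0), (3, 4), (1, 1), (6, 8)]
print(most_similar_pair(points))
```

When distance is used instead of similarity, every "max" in the clustering algorithm becomes a "min".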

  10. Hierarchical methods • Agglomerative • Divisive • Heuristic (faster methods)

  11. Agglomerative • Initially each item is in its own cluster • Clusters are joined until all items are in one cluster

  12. Divisive • All items in one cluster • Clusters divided into smaller clusters • Process continues until each item in separate cluster

  13. Heuristic (faster methods) • Do not compute similarity between every pair • Important for dynamic incremental clustering

  14. Agglomerative clustering (initialize)
      C := N {one item per cluster}
      for i := 1 to N do {create the N clusters}
          Gi = {xi}
      endfor
      S = {G1, …, GN} {create a set of all clusters S}

  15. Agglomerative clustering (initialize)
      {compute the similarity s(xi, xj) between all cluster pairs}
      for 1 ≤ i, j ≤ N where i ≠ j do
          s(Gi, Gj) := s(xi, xj)
      endfor

  16. Hierarchical clustering (main)
      while C > 1 do
          find Gp, Gq s.t. s(Gp, Gq) = max{s(Gi, Gj) | i ≠ j}
          Gr := Gp ∪ Gq
          delete Gp and Gq from S and add Gr to S

  17. Hierarchical clustering (main)
          Save information needed to build a dendrogram
          C := C - 1 {decrease number of clusters}

  18. Hierarchical clustering (main)
          {Compute the similarity to the new cluster}
          forall Gi ∈ S s.t. Gi ≠ Gr do
              Compute s(Gi, Gr)
          endfor
      endwhile
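
The pseudocode on slides 14 to 18 can be sketched in Python roughly as follows. This is an illustrative, unoptimized version, not the lecturer's implementation: the names `agglomerative`, `item_sim` and `linkage` are assumptions, the "dendrogram information" is simply a list of merges, and the data is the five-item similarity matrix used in the lecture's worked example.

```python
from itertools import combinations

def agglomerative(items, item_sim, linkage):
    """Naive agglomerative clustering.

    items    : list of item labels
    item_sim : dict mapping frozenset({x, y}) to the similarity s(x, y)
    linkage  : function reducing a list of item similarities to one value
               (max for single link, min for complete link)
    Returns the merges as (members of Gp, members of Gq, similarity).
    """
    clusters = [frozenset([x]) for x in items]           # S = {G1, ..., GN}

    def compute(g, h):
        return linkage([item_sim[frozenset((x, y))] for x in g for y in h])

    # Initialization: similarity between every pair of singleton clusters.
    s = {frozenset((g, h)): compute(g, h) for g, h in combinations(clusters, 2)}

    merges = []                                          # dendrogram information
    while len(clusters) > 1:                             # while C > 1
        gp, gq = max(combinations(clusters, 2), key=lambda pr: s[frozenset(pr)])
        gr = gp | gq                                     # Gr := Gp U Gq
        merges.append((sorted(gp), sorted(gq), s[frozenset((gp, gq))]))
        clusters = [g for g in clusters if g not in (gp, gq)]
        for gi in clusters:                              # similarity to the new cluster
            s[frozenset((gi, gr))] = compute(gi, gr)
        clusters.append(gr)
    return merges

# Upper triangle of the example similarity matrix from the lecture.
pairs = {"ab": 0.4, "ac": 0.3, "ad": 0.2, "ae": 0.1, "bc": 0.8,
         "bd": 0.2, "be": 0.1, "cd": 0.5, "ce": 0.4, "de": 0.7}
item_sim = {frozenset(p): w for p, w in pairs.items()}

single = agglomerative(list("abcde"), item_sim, max)
complete = agglomerative(list("abcde"), item_sim, min)
for merge in single:
    print(merge)
```

Passing the linkage as a function keeps the main loop identical across methods; only the way s(Gi, Gr) is computed changes.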

  19. Analysis • Number of clusters O(N) • Number of pairs O(N^2) • Initialization O(N^2)

  20. Analysis • Finding the pair of clusters to join is O(N^2) per iteration • Computing the similarity to the new cluster is O(N g(N)) per iteration, where g(N) is an upper bound on the computation of s(Gi, Gj) • Over the N-1 iterations we get O(N^3 + N^2 g(N))
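
The total can be written out as a sum over the iterations of the main loop. The derivation below is a sketch under the slide's own assumptions (a naive scan over all cluster pairs, and g(N) bounding one cluster-to-cluster similarity computation); C denotes the current number of clusters.

```latex
\begin{align*}
T(N) &= \underbrace{O(N^2)}_{\text{initialization}}
      + \sum_{C=2}^{N} \Big( \underbrace{O(C^2)}_{\text{find best pair}}
      + \underbrace{O\big(C\, g(N)\big)}_{\text{update to new cluster}} \Big) \\
     &\le O(N^2) + (N-1)\,\Big( O(N^2) + O\big(N\, g(N)\big) \Big) \\
     &= O\big(N^3 + N^2 g(N)\big).
\end{align*}
```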

  21. Analysis • Performance is better for some clustering algorithms, for example single link • It is further improved by using parallel computation • When a similarity matrix is used, time is Ω(N^2) • For very large N, computing the similarity matrix is too time consuming

  22. Single link • Uses the similarity between the most similar pair of items x ∈ Gr and y ∈ Gi: • s(Gr, Gi) = max{s(x, y) | x ∈ Gr and y ∈ Gi}

  23. Complete link • Uses the similarity between the least similar pair of items x ∈ Gr and y ∈ Gi: • s(Gr, Gi) = min{s(x, y) | x ∈ Gr and y ∈ Gi}

  24. Single and complete link • To save computation time: • s(Gr, Gi) can be computed from s(Gp, Gi) and s(Gq, Gi) by taking • the maximum of the two for single link, • and the minimum for complete link
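
The shortcut on this slide can be sketched as an update step that never looks at individual items again: s(Gr, Gi) is derived only from the stored values s(Gp, Gi) and s(Gq, Gi). The helper names `key`, `initial_state` and `merge_update` are hypothetical, and the data is the lecture's five-item example matrix.

```python
def key(g, h):
    """Unordered pair of clusters, used as a dictionary key."""
    return frozenset((g, h))

# Upper triangle of the example similarity matrix from the lecture.
pairs = {"ab": 0.4, "ac": 0.3, "ad": 0.2, "ae": 0.1, "bc": 0.8,
         "bd": 0.2, "be": 0.1, "cd": 0.5, "ce": 0.4, "de": 0.7}

def initial_state():
    """Singleton clusters and their pairwise similarities."""
    clusters = [frozenset(x) for x in "abcde"]
    s = {key(frozenset(p[0]), frozenset(p[1])): w for p, w in pairs.items()}
    return clusters, s

def merge_update(s, clusters, gp, gq, combine):
    """Join gp and gq; combine is max (single link) or min (complete link)."""
    gr = gp | gq
    rest = [g for g in clusters if g not in (gp, gq)]
    for gi in rest:
        # Constant-time update from the two stored similarities.
        s[key(gr, gi)] = combine(s[key(gp, gi)], s[key(gq, gi)])
    return rest + [gr]

b, c = frozenset("b"), frozenset("c")
bc = b | c

clusters, s = initial_state()
merge_update(s, clusters, b, c, max)
single = {x: s[key(bc, frozenset(x))] for x in "ade"}

clusters, s = initial_state()
merge_update(s, clusters, b, c, min)
complete = {x: s[key(bc, frozenset(x))] for x in "ade"}

print(single)    # similarities of a, d, e to {b,c} under single link
print(complete)  # similarities of a, d, e to {b,c} under complete link
```

With this update, g(N) in the earlier analysis becomes O(1) for both methods.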

  25. Single link - update example • Edges show distance (minimum distance = maximum similarity) (figure: the joined new cluster and the similarity to the new cluster)

  26. Complete link - update example • Maximum distance (minimum similarity) • s(Gr, Gi) = min{s(Gp, Gi), s(Gq, Gi)} = s(Gp, Gi) in the pictured case (figure: clusters Gp and Gq joined into Gr, with edges s(Gp, Gi) and s(Gq, Gi) to cluster Gi)

  27. Complete link (figure: the joined new cluster and the similarity to the new cluster)

  28. Maximum spanning tree • Single link can be solved using a maximum spanning tree algorithm • The items form a complete graph in which the weight of each edge is the similarity • Kruskal gives the hierarchy directly in O(N^2 lg N) • Prim will derive the tree in O(N^2) • Then the cluster hierarchy is derived from the tree
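
A minimal sketch of the O(N^2) array-based form of Prim's algorithm, adapted to pick the maximum-similarity edge at each step. The function name and data layout are assumptions for illustration; the matrix is the lecture's five-item example. Sorting the resulting tree edges by decreasing weight gives the order in which single link merges clusters.

```python
def prim_max_spanning_tree(labels, sim):
    """Maximum spanning tree of a complete graph given as a symmetric matrix."""
    n = len(labels)
    in_tree = [False] * n
    best = [float("-inf")] * n    # best similarity from each node to the tree
    parent = [None] * n
    best[0] = float("inf")        # start from the first item
    edges = []
    for _ in range(n):
        # O(N) scan for the node outside the tree that is most similar to it.
        u = max((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] is not None:
            edges.append((labels[parent[u]], labels[u], sim[parent[u]][u]))
        for v in range(n):        # O(N) update of the best similarities
            if not in_tree[v] and sim[u][v] > best[v]:
                best[v] = sim[u][v]
                parent[v] = u
    return edges

# Build the symmetric matrix from the upper triangle of the lecture's example.
labels = "abcde"
upper = {"ab": 0.4, "ac": 0.3, "ad": 0.2, "ae": 0.1, "bc": 0.8,
         "bd": 0.2, "be": 0.1, "cd": 0.5, "ce": 0.4, "de": 0.7}
n = len(labels)
sim = [[1.0] * n for _ in range(n)]
for pair, w in upper.items():
    i, j = labels.index(pair[0]), labels.index(pair[1])
    sim[i][j] = sim[j][i] = w

tree = prim_max_spanning_tree(labels, sim)
print(sorted(tree, key=lambda e: e[2], reverse=True))
```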

  29. Example
           a     b     c     d     e
      a    1     0.4   0.3   0.2   0.1
      b          1     0.8   0.2   0.1
      c                1     0.5   0.4
      d                      1     0.7
      e                            1

  30. The graph (figure: complete graph on items a, b, c, d, e, with each edge labeled by its similarity from the example matrix)

  31. Single link - Kruskal • s(b,c)=0.8 - {b,c} • s(d,e)=0.7 - {d,e} • s(c,d)=0.5 - {b,c,d,e} • s(a,b)=0.4 - {a,b,c,d,e}
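
The four steps above can be reproduced with a short Kruskal-style sketch: edges are visited in order of decreasing similarity, and an edge that joins two different components corresponds to one single link merge. The union-find helper and the function name are illustrative assumptions, not part of the lecture.

```python
def kruskal_single_link(labels, edges):
    """Return the single link merges as (similarity, members of new cluster)."""
    parent = {x: x for x in labels}
    members = {x: {x} for x in labels}

    def find(x):
        # Find the component representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    for w, x, y in sorted(edges, reverse=True):   # highest similarity first
        rx, ry = find(x), find(y)
        if rx != ry:                              # edge joins two clusters
            parent[ry] = rx
            members[rx] |= members.pop(ry)
            merges.append((w, "".join(sorted(members[rx]))))
    return merges

# Edges of the complete graph from the lecture's example matrix.
upper = {"ab": 0.4, "ac": 0.3, "ad": 0.2, "ae": 0.1, "bc": 0.8,
         "bd": 0.2, "be": 0.1, "cd": 0.5, "ce": 0.4, "de": 0.7}
edges = [(w, p[0], p[1]) for p, w in upper.items()]

merges = kruskal_single_link("abcde", edges)
print(merges)
```

The edge (c, e) with similarity 0.4 is skipped because c and e are already in the same component when it is reached.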

  32. Single link - Problem • The problem with the single link method is that items such as e and b, with a similarity of only 0.1, are clustered together

  33. Single link dendrogram (figure: dendrogram over a, b, c, d, e with merges at 0.8 for {b,c}, 0.7 for {d,e}, 0.5 for {b,c,d,e} and 0.4 for {a,b,c,d,e})

  34. Single link (based on agglomerative)
           a     b     c     d     e
      a    1     0.4   0.3   0.2   0.1
      b          1     0.8   0.2   0.1
      c                1     0.5   0.4
      d                      1     0.7
      e                            1
      • Initially {a}, {b}, {c}, {d}, {e}
      • {b,c} with s(b, c) = 0.8
      • s(a, {b,c}) = 0.4
      • s(d, {b,c}) = 0.5
      • s(e, {b,c}) = 0.4

  35. Single link
               a     {b,c}   d     e
      a        1     0.4     0.2   0.1
      {b,c}          1       0.5   0.4
      d                      1     0.7
      e                            1
      • {d,e} with s(d, e) = 0.7
      • s(a, {d,e}) = 0.2
      • s({d,e}, {b,c}) = 0.5

  36. Single link
               a     {b,c}   {d,e}
      a        1     0.4     0.2
      {b,c}          1       0.5
      {d,e}                  1
      • {b,c,d,e} with s({b,c}, {d,e}) = 0.5
      • s(a, {b,c,d,e}) = 0.4

  37. Single link
                   a     {b,c,d,e}
      a            1     0.4
      {b,c,d,e}          1
      • {a,b,c,d,e}

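  38. Single link
           a     b     c     d     e
      a    1     0.4   0.3   0.2   0.1
      b          1     0.8   0.2   0.1
      c                1     0.5   0.4
      d                      1     0.7
      e                            1
      (figure: maximum spanning tree with edges b-c 0.8, d-e 0.7, c-d 0.5, a-b 0.4, shown next to the single link dendrogram over a, b, c, d, e)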

  39. Complete link example • Initially {a}, {b}, {c}, {d}, {e} • The most similar pair of clusters is {b} and {c} with s(b,c)=0.8 • The first cluster is {b,c}

  40. Complete link (based on agglomerative)
           a     b     c     d     e
      a    1     0.4   0.3   0.2   0.1
      b          1     0.8   0.2   0.1
      c                1     0.5   0.4
      d                      1     0.7
      e                            1
      • s(a, {b,c}) = 0.3
      • s(d, {b,c}) = 0.2
      • s(e, {b,c}) = 0.1

  41. Complete link
               a     {b,c}   d     e
      a        1     0.3     0.2   0.1
      {b,c}          1       0.2   0.1
      d                      1     0.7
      e                            1
      • {d,e} with s(d, e) = 0.7
      • s(a, {d,e}) = 0.1
      • s({d,e}, {b,c}) = 0.1

  42. Complete link
               a     {b,c}   {d,e}
      a        1     0.3     0.1
      {b,c}          1       0.1
      {d,e}                  1
      • {a,b,c} with s(a, {b,c}) = 0.3

  43. Complete link
                 {a,b,c}   {d,e}
      {a,b,c}    1         0.1
      {d,e}                1
      • s({d,e}, {a,b,c}) = min{0.1, 0.1} = 0.1
      • {a,b,c,d,e}

  44. Complete link dendrogram (figure: dendrogram over a, b, c, d, e with merges at 0.8 for {b,c}, 0.7 for {d,e}, 0.3 for {a,b,c} and 0.1 for {a,b,c,d,e})

  45. Average link • Average link - use the average of the similarities between all pairs of items x ∈ Gr and y ∈ Gi
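
Written as a formula, the average link similarity is the mean over all cross-cluster item pairs. This is the reading that matches the worked example on the following slides (for instance, s(a, {b,c}) = (0.4 + 0.3)/2), using the same symbols as the single link and complete link definitions.

```latex
s(G_r, G_i) \;=\; \frac{1}{|G_r|\,|G_i|} \sum_{x \in G_r} \sum_{y \in G_i} s(x, y)
```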

  46. Average link ex. • The most similar pair of clusters is {b} and {c} with s(b,c)=0.8 • The first cluster is {b,c}

  47. Average link ex. • s(a, {b,c})=(0.4+0.3)/2=0.35 • s(d, {b,c})=(0.2+0.5)/2=0.35 • s(e, {b,c})=(0.1+0.4)/2=0.25 • s(a,d)=0.2 s(a,e)=0.1 s(d,e)=0.7 • The second cluster is {d,e}

  48. Average link ex. • s(a, {d,e})=(0.2+0.1)/2=0.15 • s({d,e},{b,c})=(0.2 +0.5+0.1+0.4)/4=0.3 • s(a, {b,c})=0.35 • The next cluster is {a,b,c}

  49. Average link ex. • s({a,b,c}, {d,e}) = (0.2+0.1+0.2+0.1+0.5+0.4)/6 = 0.25 • The last cluster is {a,b,c,d,e}
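
The averages in this example can be checked with a few lines of Python. This is only a sketch for verifying the arithmetic; the function name `average_link` and the string-based cluster representation are assumptions made for brevity.

```python
# Upper triangle of the example similarity matrix from the lecture.
pairs = {"ab": 0.4, "ac": 0.3, "ad": 0.2, "ae": 0.1, "bc": 0.8,
         "bd": 0.2, "be": 0.1, "cd": 0.5, "ce": 0.4, "de": 0.7}
sim = {frozenset(p): w for p, w in pairs.items()}

def average_link(g, h):
    """Mean similarity over all item pairs (x, y) with x in g and y in h.

    Clusters are given as strings of one-letter item labels, e.g. "bc".
    """
    total = sum(sim[frozenset((x, y))] for x in g for y in h)
    return total / (len(g) * len(h))

print(round(average_link("a", "bc"), 2))     # s(a, {b,c})
print(round(average_link("de", "bc"), 2))    # s({d,e}, {b,c})
print(round(average_link("abc", "de"), 2))   # s({a,b,c}, {d,e})
```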

  50. Average link dendrogram (figure: dendrogram over a, b, c, d, e with merges at 0.8 for {b,c}, 0.7 for {d,e}, 0.35 for {a,b,c} and 0.25 for {a,b,c,d,e})
