
CS533 Information Retrieval

This lecture covers applications and algorithms of document clustering, including item clustering, term variants, and the usage of similarity functions for clustering. It delves into hierarchical clustering methods and analysis of clustering processes.




  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #20 April 14, 1999

  2. Clustering • Applications in information retrieval • Algorithms

  3. Applications • Document clustering • libraries • retrieval • “information browsing” • document structuring

  4. Applications • Item clustering • Term variants (n-grams method) • Thesaurus • Semantic networks

  5. Clustering algorithms • An algorithm for agglomerative hierarchical clustering • Single link, complete link, average link, centroid and Ward clustering • Clique, star, string • Heuristic clustering • Cluster based retrieval

  6. Clustering • Similar items are clustered together • Need to compute similarity between pairs of clusters • Number of clusters needed decided in advance, or • Determined by using a cut-off value of similarity

  7. Similarity functions for binary vectors • Based on matching corresponding 1s and 0s in the two vectors. • For example “(the number of matching 1s + the number of matching 0s) divided by the size of a vector” or, • using inner product, Dice, Cosine, Jaccard
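The measures named on this slide can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function names and the two sample vectors are assumptions, and the formulas are the standard binary-vector forms (for 0/1 vectors the sum of the entries equals the squared length). Vectors are assumed to have equal length and at least one 1 each.

```python
import math

def simple_matching(x, y):
    """(matching 1s + matching 0s) divided by the vector size."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def inner_product(x, y):
    """Number of positions where both vectors have a 1."""
    return sum(a * b for a, b in zip(x, y))

def dice(x, y):
    return 2 * inner_product(x, y) / (sum(x) + sum(y))

def jaccard(x, y):
    common = inner_product(x, y)
    return common / (sum(x) + sum(y) - common)

def cosine(x, y):
    return inner_product(x, y) / math.sqrt(sum(x) * sum(y))

# Hypothetical sample vectors, chosen only for illustration.
x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]
print(simple_matching(x, y), inner_product(x, y), jaccard(x, y))  # 0.6 2 0.5
```

For these two vectors Dice and cosine both come out at about 0.67, because the vectors contain the same number of 1s.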

  8. Similarity functions for points in space • Distance usually used • Most similar pair is minimum distance pair

  9. Similarity functions for points in space • In Euclidean space, cosine and Euclidean distance are commonly used. • Statistical measures such as correlation coefficients between two items are also used
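As a small sketch of the last two slides, the code below computes Euclidean distance and cosine for real-valued points and picks the most similar pair as the minimum-distance pair. The helper names and the sample points are assumptions made for illustration; cosine is undefined for the zero vector, so it is not called on the origin.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def closest_pair(points):
    """Indices (i, j) of the minimum-distance, i.e. most similar, pair."""
    return min(combinations(range(len(points)), 2),
               key=lambda ij: euclidean(points[ij[0]], points[ij[1]]))

# Hypothetical sample points.
points = [(0.0, 0.0), (3.0, 4.0), (0.5, 0.5), (6.0, 8.0)]
print(closest_pair(points))            # (0, 2)
print(euclidean(points[0], points[1])) # 5.0
```

Note that the two measures can disagree: (3, 4) and (6, 8) point in the same direction, so their cosine is 1, yet they are 5 units apart.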

  10. Hierarchical Methods • Agglomerative • Divisive • Heuristic (faster methods)

  11. Agglomerative • Initially each item is in its own cluster • Clusters are joined until all items are in one cluster

  12. Divisive • All items in one cluster • Clusters divided into smaller clusters • Process continues until each item in separate cluster

  13. Heuristic (faster methods) • Do not compute similarity between every pair • Important for dynamic incremental clustering

  14. Agglomerative clustering (initialize)
      C := N                      {one item per cluster}
      for i := 1 to N do          {create the N clusters}
          Gi = {xi}
      endfor
      S = {G1, ..., GN}           {create a set of all clusters S}

  15. Agglomerative clustering (initialize)
      {compute the similarity s(xi, xj) between all cluster pairs}
      for 1 ≤ i, j ≤ N where i ≠ j do
          s(Gi, Gj) := s(xi, xj)
      endfor

  16. Hierarchical clustering (main)
      while C > 1 do
          find Gp, Gq s.t. s(Gp, Gq) = max{s(Gi, Gj) | i ≠ j}
          Gr := Gp ∪ Gq
          delete Gp and Gq from S and add Gr to S

  17. Hierarchical clustering (main)
          Save information needed to build a dendrogram
          C := C - 1              {decrease number of clusters}

  18. Hierarchical clustering (main)
          {Compute the similarity to the new cluster}
          forall Gi ∈ S s.t. Gi ≠ Gr do
              Compute s(Gi, Gr)
          endfor
      endwhile
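The pseudocode on slides 14 to 18 can be turned into a short runnable sketch. The version below is one possible reading, not the lecturer's implementation: the function name `agglomerative`, the `cluster_sim` callback, and the sample data (numbers on a line, with negative distance as similarity) are all assumptions. The cluster similarity is passed in as a function so that any of the linkage rules can be plugged in.

```python
from itertools import combinations

def agglomerative(items, cluster_sim):
    """Naive agglomerative clustering; returns the merges in order.

    cluster_sim(g1, g2) gives the similarity of two frozensets of items.
    """
    clusters = [frozenset([x]) for x in items]        # S = {G1, ..., GN}
    sim = {frozenset([g, h]): cluster_sim(g, h)       # all cluster pairs
           for g, h in combinations(clusters, 2)}
    merges = []
    while len(clusters) > 1:                          # while C > 1
        pair = max(sim, key=sim.get)                  # most similar pair
        gp, gq = tuple(pair)
        gr = gp | gq                                  # Gr := Gp U Gq
        merges.append((gp, gq, sim[pair]))            # dendrogram information
        clusters = [g for g in clusters if g not in (gp, gq)]
        sim = {k: v for k, v in sim.items() if gp not in k and gq not in k}
        for gi in clusters:                           # similarity to new cluster
            sim[frozenset([gi, gr])] = cluster_sim(gi, gr)
        clusters.append(gr)
    return merges

def single_link(g1, g2):
    # Hypothetical similarity: negative distance between numbers.
    return max(-abs(x - y) for x in g1 for y in g2)

merges = agglomerative([1, 2, 6, 7, 15], single_link)
for gp, gq, s in merges:
    print(sorted(gp | gq), s)
```

On this sample the merges are {1,2} and {6,7} at similarity -1, then {1,2,6,7} at -4, and finally all five items at -8.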

  19. Analysis • Number of clusters O(N) • Number of pairs O(N^2) • Initialization O(N^2)

  20. Analysis • Finding the pair of clusters to join: O(N^2) per iteration • Computing the similarity to the new cluster: O(N g(N)) per iteration, where g(N) is an upper bound on the computation of s(Gi, Gj) • Over the O(N) iterations of the main loop we get O(N^3 + N^2 g(N))
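One way to see where the total comes from is to sum the per-iteration costs over the main loop. The following is a sketch of that step, writing C for the current number of clusters as in the pseudocode; it assumes the naive search over all remaining pairs in every iteration.

```latex
\sum_{C=2}^{N} \Bigl( O(C^2) + O\bigl(C\, g(N)\bigr) \Bigr)
  = O(N^3) + O\bigl(N^2 g(N)\bigr)
  = O\bigl(N^3 + N^2 g(N)\bigr)
```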

  21. Single link • Uses similarity between the most similar pair of items • x ∈ Gr and y ∈ Gi such that: s(Gr, Gi) = max{s(x,y) | x ∈ Gr and y ∈ Gi}

  22. Complete link • Uses similarity between the least similar pair of items • x ∈ Gr and y ∈ Gi such that: s(Gr, Gi) = min{s(x,y) | x ∈ Gr and y ∈ Gi}

  23. Single and complete link • To save computation time: • s(Gr, Gi) can be computed from s(Gp, Gi) and s(Gq, Gi) by taking • the maximum of the two for single link, • and the minimum for complete link
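The shortcut on this slide can be checked against the definitions. The sketch below compares the similarity computed from scratch over all item pairs with the value obtained by combining the two old similarities. The item similarities are those of the lecture's five-item example (slide 28); the function names `link` and `updated` are assumptions.

```python
# Item similarities from the lecture's example matrix (slide 28).
S = {frozenset(k): v for k, v in {
    "ab": 0.4, "ac": 0.3, "ad": 0.2, "ae": 0.1, "bc": 0.8,
    "bd": 0.2, "be": 0.1, "cd": 0.5, "ce": 0.4, "de": 0.7}.items()}

def link(g1, g2, pick):
    """From-scratch similarity: pick=max is single link, pick=min is complete link."""
    return pick(S[frozenset([x, y])] for x in g1 for y in g2)

def updated(gp, gq, gi, pick):
    """Shortcut: combine s(Gp, Gi) and s(Gq, Gi) instead of rescanning item pairs."""
    return pick(link(gp, gi, pick), link(gq, gi, pick))

# Join Gp = {b} and Gq = {c}, then compare both ways for every other item.
for pick in (max, min):
    for gi in ("a", "d", "e"):
        assert updated("b", "c", gi, pick) == link("bc", gi, pick)

print(updated("b", "c", "a", max), updated("b", "c", "a", min))  # 0.4 0.3
```

The shortcut works because the maximum (or minimum) over the union of two clusters is the maximum (or minimum) of the two per-cluster values, so each update costs O(1) instead of a scan over all item pairs.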

  24. Single Link - update example • Edges show distance (minimum distance = maximum similarity) • [Figure labels: the similarity to the new cluster; the joined new cluster]

  25. Complete link - update ex. • Maximum distance (minimum similarity) • [Figure: clusters Gp and Gq joined into Gr, with edges s(Gp, Gi) and s(Gq, Gi) to cluster Gi] • s(Gr, Gi) = min{s(Gp, Gi), s(Gq, Gi)} = s(Gp, Gi)

  26. Complete Link • [Figure labels: the similarity to the new cluster; the joined new cluster]

  27. Maximum spanning tree • Single link can be solved using a maximum spanning tree algorithm • The weight of edges is the similarity • Kruskal gives the hierarchy directly • Prim will first derive the tree and then the cluster hierarchy

  28. Example
               a     b     c     d     e
      a        1   0.4   0.3   0.2   0.1
      b              1   0.8   0.2   0.1
      c                    1   0.5   0.4
      d                          1   0.7
      e                                1

  29. The graph • [Figure: graph on the nodes a, b, c, d, e whose edge weights are the similarities of the example on slide 28]

  30. Single link - Kruskal • s(b,c)=0.8 - {b,c} • s(d,e)=0.7 - {d,e} • s(c,d)=0.5 - {b,c,d,e} • s(a,b)=0.4 - {a,b,c,d,e}
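The Kruskal steps listed on this slide can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `kruskal_single_link` and the union-find helper are illustrative choices, and the edge weights are the similarities from the example on slide 28. Edges are scanned from most to least similar, and an edge is kept only when it joins two different clusters.

```python
# Similarities from the example matrix on slide 28.
SIM = {("a", "b"): 0.4, ("a", "c"): 0.3, ("a", "d"): 0.2, ("a", "e"): 0.1,
       ("b", "c"): 0.8, ("b", "d"): 0.2, ("b", "e"): 0.1,
       ("c", "d"): 0.5, ("c", "e"): 0.4, ("d", "e"): 0.7}

def kruskal_single_link(items, sim):
    """Maximum spanning tree edges, in the order the clusters are joined."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    for (x, y), w in sorted(sim.items(), key=lambda kv: -kv[1]):
        rx, ry = find(x), find(y)
        if rx != ry:              # edge joins two different clusters
            parent[rx] = ry
            merges.append((x, y, w))
    return merges

print(kruskal_single_link("abcde", SIM))
# [('b', 'c', 0.8), ('d', 'e', 0.7), ('c', 'd', 0.5), ('a', 'b', 0.4)]
```

The edge (c, e) also has weight 0.4, but by the time it is considered c and e are already in the same cluster, so it is skipped.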

  31. Single link - Problem • A problem with the single link method is that items such as e and b, with a similarity of only 0.1, are clustered together

  32. Single link dendrogram • [Figure: dendrogram over a, b, c, d, e; {b,c} joined at 0.8, {d,e} at 0.7, {b,c,d,e} at 0.5, {a,b,c,d,e} at 0.4]

  33. Single link (based on agglomerative)
               a     b     c     d     e
      a        1   0.4   0.3   0.2   0.1
      b              1   0.8   0.2   0.1
      c                    1   0.5   0.4
      d                          1   0.7
      e                                1
      • Initially {a}, {b}, {c}, {d}, {e} • {b,c} with s(b, c) = 0.8 • s(a, {b,c}) = 0.4 • s(d, {b,c}) = 0.5 • s(e, {b,c}) = 0.4

  34. Single link
                       a    {b,c}        d        e
      a                1      0.4      0.2      0.1
      {b,c}                     1      0.5      0.4
      d                                  1      0.7
      e                                           1
      • {d,e} with s(d, e) = 0.7 • s(a, {d,e}) = 0.2 • s({d,e}, {b,c}) = 0.5

  35. Single link
                       a    {b,c}    {d,e}
      a                1      0.4      0.2
      {b,c}                     1      0.5
      {d,e}                              1
      • {b,c,d,e} with s({b,c}, {d,e}) = 0.5 • s(a, {b,c,d,e}) = 0.4

  36. Single link
                           a  {b,c,d,e}
      a                    1        0.4
      {b,c,d,e}                       1
      • {a,b,c,d,e}

  37. Single link
               a     b     c     d     e
      a        1   0.4   0.3   0.2   0.1
      b              1   0.8   0.2   0.1
      c                    1   0.5   0.4
      d                          1   0.7
      e                                1
      • [Figure: graph and dendrogram over a, b, c, d, e with levels 0.8, 0.7, 0.5 and 0.4]

  38. Complete link ex. • Initially {a}, {b}, {c}, {d}, {e} • The most similar pair of clusters is {b} and {c} with s(b,c)=0.8 • The first cluster is {b,c}

  39. Complete link (based on agglomerative)
               a     b     c     d     e
      a        1   0.4   0.3   0.2   0.1
      b              1   0.8   0.2   0.1
      c                    1   0.5   0.4
      d                          1   0.7
      e                                1
      • s(a, {b,c}) = 0.3 • s(d, {b,c}) = 0.2 • s(e, {b,c}) = 0.1

  40. Complete link
                       a    {b,c}        d        e
      a                1      0.3      0.2      0.1
      {b,c}                     1      0.2      0.1
      d                                  1      0.7
      e                                           1
      • {d,e} with s(d, e) = 0.7 • s(a, {d,e}) = 0.1 • s({d,e}, {b,c}) = 0.1

  41. Complete link
                       a    {b,c}    {d,e}
      a                1      0.3      0.1
      {b,c}                     1      0.1
      {d,e}                              1
      • {a,b,c} with s(a, {b,c}) = 0.3 • s({a,b,c}, {d,e}) = 0.1 • {a,b,c,d,e}

  42. Complete link
                 {a,b,c}    {d,e}
      {a,b,c}          1      0.1
      {d,e}                     1
      • {a,b,c,d,e}
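The single link and complete link walkthroughs can be re-run mechanically. The sketch below is an illustration under assumptions (the function name `link_clustering` and the dictionary layout are not from the lecture): it keeps a table of cluster similarities and, after each merge, fills in the new entries with the max or min update rule from slide 23. The input is the example matrix of slide 28.

```python
from itertools import combinations

# Item similarities from the example matrix (slide 28).
S = {frozenset(k): v for k, v in {
    "ab": 0.4, "ac": 0.3, "ad": 0.2, "ae": 0.1, "bc": 0.8,
    "bd": 0.2, "be": 0.1, "cd": 0.5, "ce": 0.4, "de": 0.7}.items()}

def link_clustering(items, item_sim, combine):
    """combine=max gives single link, combine=min gives complete link."""
    clusters = [frozenset([x]) for x in items]
    sim = {frozenset([g, h]): item_sim[g | h]
           for g, h in combinations(clusters, 2)}
    merges = []
    while len(clusters) > 1:
        pair = max(sim, key=sim.get)          # most similar pair of clusters
        gp, gq = pair
        gr = gp | gq
        merges.append((sorted(gr), sim[pair]))
        clusters = [g for g in clusters if g != gp and g != gq]
        for gi in clusters:                   # update rule: combine old values
            sim[frozenset([gi, gr])] = combine(sim[frozenset([gi, gp])],
                                               sim[frozenset([gi, gq])])
        sim = {k: v for k, v in sim.items() if gp not in k and gq not in k}
        clusters.append(gr)
    return merges

complete = link_clustering("abcde", S, min)
single = link_clustering("abcde", S, max)
print([level for _, level in complete])  # [0.8, 0.7, 0.3, 0.1]
print([level for _, level in single])    # [0.8, 0.7, 0.5, 0.4]
```

With `min` the third merge is {a,b,c} at 0.3 and the last one joins {a,b,c} with {d,e} at 0.1; with `max` the third merge is {b,c,d,e} at 0.5 and a joins last at 0.4, matching the two worked examples.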

  43. Complete link dendrogram • [Figure: dendrogram over a, b, c, d, e; {b,c} joined at 0.8, {d,e} at 0.7, {a,b,c} at 0.3, {a,b,c,d,e} at 0.1]

  44. Average link • Average link uses the average of the similarities between all pairs of items x ∈ Gr and y ∈ Gi

  45. Average link ex. • The most similar pair of clusters is {b} and {c} with s(b,c)=0.8 • The first cluster is {b,c}

  46. Average link ex. • s(a, {b,c})=(0.4+0.3)/2=0.35 • s(d, {b,c})=(0.2+0.5)/2=0.35 • s(e, {b,c})=(0.1+0.4)/2=0.25 • s(a,d)=0.2 s(a,e)=0.1 s(d,e)=0.7 • The second cluster is {d,e}

  47. Average link ex. • s(a, {d,e})=(0.2+0.1)/2=0.15 • s({d,e},{b,c})=(0.2 +0.5+0.1+0.4)/4=0.3 • s(a, {b,c})=0.35 • The next cluster is {a,b,c}

  48. Average link ex. • s({a,b,c}, {d,e}) = (0.2+0.1+0.2+0.1+0.5+0.4)/6 = 0.25 • The last cluster is {a,b,c,d,e}
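The averages in this example can be recomputed directly from the item similarities. The sketch below assumes the example matrix of slide 28; the function name `average_link` is an illustrative choice. The last lines also show a size-weighted update, a standard property of average link that the slides do not state: the similarity to a merged cluster is the average of the two old similarities weighted by cluster size.

```python
# Item similarities from the example matrix (slide 28).
S = {frozenset(k): v for k, v in {
    "ab": 0.4, "ac": 0.3, "ad": 0.2, "ae": 0.1, "bc": 0.8,
    "bd": 0.2, "be": 0.1, "cd": 0.5, "ce": 0.4, "de": 0.7}.items()}

def average_link(g1, g2):
    """Mean similarity over all item pairs (x, y) with x in g1 and y in g2."""
    total = sum(S[frozenset([x, y])] for x in g1 for y in g2)
    return total / (len(g1) * len(g2))

print(round(average_link("a", "bc"), 2))    # 0.35
print(round(average_link("de", "bc"), 2))   # 0.3
print(round(average_link("abc", "de"), 2))  # 0.25

# Size-weighted update after joining Gp = {a} and Gq = {b,c} (not from the slides).
weighted = (1 * average_link("a", "de") + 2 * average_link("bc", "de")) / 3
print(round(weighted, 2))                   # 0.25
```

Values are rounded for printing because the sums are floating-point.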

  49. Average link dendrogram • [Figure: dendrogram over a, b, c, d, e; {b,c} joined at 0.8, {d,e} at 0.7, {a,b,c} at 0.35, {a,b,c,d,e} at 0.25]

  50. Centroids • Item (document, or term) is a point in t dimensional space. • The centroid is a new point in t dimensional space which represents the cluster • Centroid computed by averaging coordinate values of all cluster points, for each of the t coordinates
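A centroid as described here is a coordinate-wise average. The following minimal sketch assumes points are given as equal-length tuples; the function name `centroid` and the sample cluster are hypothetical.

```python
def centroid(points):
    """Average each of the t coordinates over all points in the cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Hypothetical cluster of two points in 3-dimensional space.
cluster = [(1.0, 0.0, 2.0), (3.0, 4.0, 0.0)]
print(centroid(cluster))  # (2.0, 2.0, 1.0)
```

The centroid need not coincide with any item in the cluster; it is a new point that stands in for the cluster when similarities are computed.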
