CS533 Information Retrieval
Dr. Michal Cutler
Lecture #18, April 6, 1999
Clustering • Applications in information retrieval • Algorithms
Applications • Document clustering • libraries • retrieval • “information browsing” • document structuring
Applications • Item clustering • Term variants (n-grams method) • Thesaurus • Semantic networks
Clustering algorithms • An algorithm for agglomerative hierarchical clustering • Single link, complete link, average link, centroid and Ward clustering • Heuristic single pass clustering • Cluster based retrieval
Clustering • Similar items are grouped together • Requires computing the similarity between pairs of clusters • The number of clusters is either decided in advance, or • determined by a cut-off value on similarity
Similarity functions for binary vectors • Based on matching the corresponding 1s and 0s of the two vectors • For example, the number of matching 1s plus the number of matching 0s, divided by the size of a vector, or • coefficients such as the inner product, Dice, Cosine, and Jaccard
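As a concrete illustration, here is a minimal Python sketch of several of these coefficients for binary vectors; the function name and the example vectors are mine, not from the lecture:

```python
def binary_similarities(x, y):
    """Similarity coefficients for two equal-length 0/1 vectors.
    Assumes neither vector is all zeros."""
    ones = sum(a & b for a, b in zip(x, y))     # matching 1s (the inner product)
    union = sum(a | b for a, b in zip(x, y))    # positions with a 1 in either vector
    nx, ny = sum(x), sum(y)
    return {
        "inner product": ones,
        "Dice": 2 * ones / (nx + ny),
        "Cosine": ones / (nx * ny) ** 0.5,
        "Jaccard": ones / union,
    }

# Two vectors sharing two 1s: Dice = 2/3, Cosine = 2/3, Jaccard = 1/2.
print(binary_similarities([1, 1, 0, 1], [1, 1, 1, 0]))
```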
Similarity functions for points in space • Distance usually used • Most similar pair is minimum distance pair
Similarity functions for points in space • For Euclidean space, cosine similarity and Euclidean distance are commonly used • Statistical measures, such as the correlation coefficient between two items, are also used
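For points in space the two standard choices look like this (a minimal sketch; the function names are mine):

```python
import math

def euclidean_distance(x, y):
    """Distance between two points; the most similar pair is the minimum-distance pair."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms
```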
Hierarchical methods • Agglomerative • Divisive • Heuristic (faster methods)
Agglomerative • Initially each item is in its own cluster • Clusters are joined until all items are in one cluster
Divisive • All items start in one cluster • Clusters are divided into smaller clusters • The process continues until each item is in a separate cluster
Heuristic (faster methods) • Do not compute similarity between every pair • Important for dynamic incremental clustering
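A sketch of one such heuristic, single-pass clustering (listed among the algorithms above): each incoming item is compared only to a representative of each existing cluster, joining the best one if its similarity clears a threshold and starting a new cluster otherwise. The representative choice (the cluster's first item) and the names are my assumptions:

```python
def single_pass_cluster(items, sim, threshold):
    """Heuristic single-pass clustering: one comparison per existing cluster
    for each item, instead of comparisons with every other item."""
    clusters = []                         # each cluster is a list of items
    for x in items:
        best, best_sim = None, threshold
        for c in clusters:
            s = sim(x, c[0])              # compare to the cluster representative
            if s >= best_sim:
                best, best_sim = c, s
        if best is not None:
            best.append(x)                # join the most similar cluster
        else:
            clusters.append([x])          # start a new cluster
    return clusters
```

Because items are never revisited, the result depends on arrival order, which is exactly what makes the method suitable for dynamic, incremental collections.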
Agglomerative clustering (initialize)
C := N {one item per cluster}
for i := 1 to N do {create the N clusters}
    Gi := {xi}
endfor
S := {G1, …, GN} {create the set S of all clusters}
{compute the similarity s(xi, xj) between all cluster pairs}
for all 1 ≤ i, j ≤ N with i ≠ j do
    s(Gi, Gj) := s(xi, xj)
endfor
Hierarchical clustering (main)
while C > 1 do
    find Gp, Gq s.t. s(Gp, Gq) = max{s(Gi, Gj) | i ≠ j}
    Gr := Gp ∪ Gq
    delete Gp and Gq from S and add Gr to S
    save the information needed to build a dendrogram
    C := C − 1 {decrease the number of clusters}
    {compute the similarity to the new cluster}
    forall Gi ∈ S s.t. Gi ≠ Gr do
        compute s(Gi, Gr)
    endfor
endwhile
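The pseudocode above translates almost line for line into Python. The sketch below (all names mine) keeps the pairwise cluster similarities in a dictionary and takes the cluster-similarity function as a parameter, so single, complete, and average link all fit the same skeleton:

```python
def agglomerative(items, sim, cluster_sim):
    """Agglomerative hierarchical clustering following the pseudocode above.
    sim(x, y): item-item similarity; cluster_sim(A, B, sim): cluster-cluster
    similarity. Returns the merges needed to build a dendrogram."""
    S = [frozenset([x]) for x in items]               # the N initial clusters
    s = {frozenset([A, B]): cluster_sim(A, B, sim)    # all pairwise similarities
         for i, A in enumerate(S) for B in S[i + 1:]}
    merges = []
    while len(S) > 1:
        Gp, Gq = max(((A, B) for i, A in enumerate(S) for B in S[i + 1:]),
                     key=lambda pair: s[frozenset(pair)])
        Gr = Gp | Gq
        merges.append((set(Gp), set(Gq), s[frozenset([Gp, Gq])]))
        S.remove(Gp); S.remove(Gq)
        for Gi in S:                                  # similarity to the new cluster
            s[frozenset([Gi, Gr])] = cluster_sim(Gi, Gr, sim)
        S.append(Gr)
    return merges

def single_link(A, B, sim):
    """Similarity of the most similar pair, one item from each cluster."""
    return max(sim(x, y) for x in A for y in B)
```

Run with single_link on the five-item example used later in this lecture, this reproduces the merge order {b,c}, {d,e}, {b,c,d,e}, {a,b,c,d,e}.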
Analysis • Number of clusters O(N) • Number of pairs O(N²) • Initialization O(N²)
Analysis • Finding the pair of clusters to join: O(N²) • Computing the similarity to the new cluster: O(N·g(N)), where g(N) is an upper bound on the cost of computing s(Gi, Gj) • In total: O(N³ + N²·g(N))
Analysis • Performance is better for some clustering algorithms, for example single link • It can be further improved by using parallel computation • When a similarity matrix is used, the time and space are Θ(N²) • For very large N, computing the similarity matrix is too time consuming
Single link • Uses the similarity between the most similar pair of items • x ∈ Gr and y ∈ Gi such that: s(Gr, Gi) = max{s(x, y) | x ∈ Gr and y ∈ Gi}
Complete link • Uses the similarity between the least similar pair of items • x ∈ Gr and y ∈ Gi such that: s(Gr, Gi) = min{s(x, y) | x ∈ Gr and y ∈ Gi}
Single and complete link • To save computation time, s(Gr, Gi) can be computed from s(Gp, Gi) and s(Gq, Gi) by taking • the maximum of the two for single link, • and the minimum for complete link (see the sketch below)
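In code this update is a single line; a sketch (names mine). It also makes g(N) constant, so the per-merge update loop over the remaining clusters is only O(N):

```python
def merged_similarity(s_p_i, s_q_i, link="single"):
    """s(Gr, Gi) for Gr = Gp U Gq, computed from the old values
    s(Gp, Gi) and s(Gq, Gi)."""
    return max(s_p_i, s_q_i) if link == "single" else min(s_p_i, s_q_i)
```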
Single link - update example [Figure: the joined new cluster and its similarity to each remaining cluster; edges show distance (minimum distance = maximum similarity).]
Complete link - update example [Figure: Gp and Gq are joined into Gr; its edge to Gi shows the maximum distance (minimum similarity).] • s(Gr, Gi) = min{s(Gp, Gi), s(Gq, Gi)} = s(Gp, Gi) (in the figure, s(Gp, Gi) is the smaller of the two)
Complete link [Figure: the joined new cluster and its similarity to each remaining cluster under complete link.]
Maximum spanning tree • Single link clustering can be solved with a maximum spanning tree algorithm • Build the complete graph whose edge weights are the similarities • Kruskal gives the hierarchy directly, in O(N² lg N) • Prim derives the tree in O(N²); the cluster hierarchy is then derived from the tree
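A sketch of the Kruskal route (union-find with path halving; names mine): scan the edges in order of decreasing similarity, and every edge that connects two different components is exactly one merge of the single-link hierarchy:

```python
def single_link_by_kruskal(items, edges):
    """edges: (similarity, x, y) triples of the complete graph.
    Returns the single-link merges from most to least similar."""
    parent = {x: x for x in items}                   # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]            # path halving
            x = parent[x]
        return x
    merges = []
    for s, x, y in sorted(edges, reverse=True):      # decreasing similarity
        rx, ry = find(x), find(y)
        if rx != ry:                                 # joins two components:
            parent[rx] = ry                          # a spanning-tree edge,
            merges.append((x, y, s))                 # i.e. a merge at level s
    return merges
```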
Example
      a     b     c     d     e
a     1    0.4   0.3   0.2   0.1
b           1    0.8   0.2   0.1
c                 1    0.5   0.4
d                       1    0.7
e                             1
The graph [Figure: the complete graph on {a, b, c, d, e}; each edge is weighted with the similarity from the matrix above, e.g. s(b,c) = 0.8, s(d,e) = 0.7, s(c,d) = 0.5.]
Single link - Kruskal • s(b,c)=0.8 - {b,c} • s(d,e)=0.7 - {d,e} • s(c,d)=0.5 - {b,c,d,e} • s(a,b)=0.4 - {a,b,c,d,e}
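Feeding the ten weighted edges of the example graph to the Kruskal sketch above reproduces exactly this merge sequence:

```python
edges = [(0.4, 'a', 'b'), (0.3, 'a', 'c'), (0.2, 'a', 'd'), (0.1, 'a', 'e'),
         (0.8, 'b', 'c'), (0.2, 'b', 'd'), (0.1, 'b', 'e'),
         (0.5, 'c', 'd'), (0.4, 'c', 'e'), (0.7, 'd', 'e')]
print(single_link_by_kruskal('abcde', edges))
# [('b', 'c', 0.8), ('d', 'e', 0.7), ('c', 'd', 0.5), ('a', 'b', 0.4)]
```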
Single link - problem • A problem with the single link method is that items such as e and b, whose similarity is only 0.1, end up clustered together (the chaining effect)
Single link dendrogram [Figure: dendrogram over a, b, c, d, e — b and c join at 0.8, d and e join at 0.7, {b,c} and {d,e} join at 0.5, and a joins at 0.4.]
Single link (based on agglomerative)
      a     b     c     d     e
a     1    0.4   0.3   0.2   0.1
b           1    0.8   0.2   0.1
c                 1    0.5   0.4
d                       1    0.7
e                             1
Initially {a}, {b}, {c}, {d}, {e}
Join {b, c} with s(b, c) = 0.8
s(a, {b,c}) = 0.4, s(d, {b,c}) = 0.5, s(e, {b,c}) = 0.4
Single link
        a    {b,c}   d     e
a       1     0.4   0.2   0.1
{b,c}         1     0.5   0.4
d                   1     0.7
e                         1
Join {d, e} with s(d, e) = 0.7
s(a, {d,e}) = 0.2, s({d,e}, {b,c}) = 0.5
Single link
        a    {b,c}  {d,e}
a       1     0.4    0.2
{b,c}         1      0.5
{d,e}                1
Join {b,c} and {d,e} into {b,c,d,e} with s({b,c}, {d,e}) = 0.5
s(a, {b,c,d,e}) = 0.4
Single link
            a    {b,c,d,e}
a           1       0.4
{b,c,d,e}           1
Join a and {b,c,d,e} into {a,b,c,d,e} with s(a, {b,c,d,e}) = 0.4
Single link [Figure: the original similarity matrix next to the resulting dendrogram — b and c join at 0.8, d and e at 0.7, {b,c} and {d,e} at 0.5, and a at 0.4.]
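The same hierarchy can be cross-checked against a library implementation, assuming SciPy is available; converting similarities to distances as 1 − s preserves the ordering:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

sim = np.array([[1.0, 0.4, 0.3, 0.2, 0.1],    # a
                [0.4, 1.0, 0.8, 0.2, 0.1],    # b
                [0.3, 0.8, 1.0, 0.5, 0.4],    # c
                [0.2, 0.2, 0.5, 1.0, 0.7],    # d
                [0.1, 0.1, 0.4, 0.7, 1.0]])   # e
Z = linkage(squareform(1.0 - sim), method='single')
# Each row of Z: the two clusters merged, the merge distance, the new size.
# Merge distances 0.2, 0.3, 0.5, 0.6 correspond to similarities 0.8, 0.7, 0.5, 0.4.
print(Z)
```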
Complete link example • Initially {a}, {b}, {c}, {d}, {e} • The most similar pair of clusters is {b} and {c}, with s(b,c) = 0.8 • The first cluster is {b,c}
Complete link (based on agglomerative)
      a     b     c     d     e
a     1    0.4   0.3   0.2   0.1
b           1    0.8   0.2   0.1
c                 1    0.5   0.4
d                       1    0.7
e                             1
s(a, {b,c}) = 0.3, s(d, {b,c}) = 0.2, s(e, {b,c}) = 0.1
Complete link
        a    {b,c}   d     e
a       1     0.3   0.2   0.1
{b,c}         1     0.2   0.1
d                   1     0.7
e                         1
Join {d, e} with s(d, e) = 0.7
s(a, {d,e}) = 0.1, s({d,e}, {b,c}) = 0.1
Complete link
        a    {b,c}  {d,e}
a       1     0.3    0.1
{b,c}         1      0.1
{d,e}                1
Join {a,b,c} with s(a, {b,c}) = 0.3
Complete link
          {a,b,c}  {d,e}
{a,b,c}      1      0.1
{d,e}               1
s({d,e}, {a,b,c}) = min{0.1, 0.1} = 0.1
The final cluster is {a,b,c,d,e}
Complete link dendrogram [Figure: b and c join at 0.8, d and e at 0.7, a joins {b,c} at 0.3, and {a,b,c} joins {d,e} at 0.1.]
Average link • Uses the average of the similarities between all pairs of items x ∈ Gr and y ∈ Gi
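A direct sketch (name mine), which plugs into the agglomerative skeleton above in place of single_link:

```python
def average_link(A, B, sim):
    """Average similarity over all pairs (x, y) with x in A and y in B."""
    return sum(sim(x, y) for x in A for y in B) / (len(A) * len(B))
```

Like single and complete link, it also admits an incremental update: s(Gr, Gi) = (|Gp|·s(Gp, Gi) + |Gq|·s(Gq, Gi)) / (|Gp| + |Gq|).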
Average link example • The most similar pair of clusters is {b} and {c}, with s(b,c) = 0.8 • The first cluster is {b,c}
Average link example • s(a, {b,c}) = (0.4+0.3)/2 = 0.35 • s(d, {b,c}) = (0.2+0.5)/2 = 0.35 • s(e, {b,c}) = (0.1+0.4)/2 = 0.25 • s(a,d) = 0.2, s(a,e) = 0.1, s(d,e) = 0.7 • The second cluster is {d,e}
Average link example • s(a, {d,e}) = (0.2+0.1)/2 = 0.15 • s({d,e}, {b,c}) = (0.2+0.5+0.1+0.4)/4 = 0.3 • s(a, {b,c}) = 0.35 • The next cluster is {a,b,c}
Average link example • s({a,b,c}, {d,e}) = (0.2+0.1+0.2+0.1+0.5+0.4)/6 = 0.25 • The last cluster is {a,b,c,d,e}
Average link dendrogram [Figure: b and c join at 0.8, d and e at 0.7, a joins {b,c} at 0.35, and {a,b,c} joins {d,e} at 0.25.]