CS533 Information Retrieval

CS533 Information Retrieval Dr. Michal Cutler Lecture #20 April 14, 1999

Clustering • Applications in information retrieval • Algorithms

Applications • Document clustering • libraries • retrieval • “information browsing” • document structuring

Applications • Item clustering • Term variants (n-grams method) • Thesaurus • Semantic networks

Clustering algorithms • An algorithm for agglomerative hierarchical clustering • Single link, complete link, average link, centroid and Ward clustering • Clique, star, string • Heuristic clustering • Cluster based retrieval

Clustering • Similar items are clustered together • Need to compute similarity between pairs of clusters • Number of clusters needed decided in advance, or • Determined by using a cut-off value of similarity
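The two stopping rules above (a fixed number of clusters, or a similarity cut-off) can be sketched as follows. This is only an illustration: the function name `clusters_at` and the merge-list format are assumptions, and the merge values are taken from the single link example worked later in this lecture.

```python
def clusters_at(items, merges, cutoff=None, k=None):
    """Replay merges until a stopping rule fires.

    merges: list of (similarity, cluster_a, cluster_b), in decreasing
            order of similarity (hypothetical format for this sketch).
    cutoff: stop before any merge whose similarity is below this value.
    k:      stop once only k clusters remain.
    """
    clusters = {frozenset([x]) for x in items}   # one item per cluster
    for s, ga, gb in merges:
        if cutoff is not None and s < cutoff:
            break
        if k is not None and len(clusters) <= k:
            break
        clusters -= {frozenset(ga), frozenset(gb)}
        clusters.add(frozenset(ga) | frozenset(gb))
    return clusters


# Merges from the single link example later in the lecture.
merges = [(0.8, "b", "c"), (0.7, "d", "e"), (0.5, "bc", "de"), (0.4, "a", "bcde")]
by_cutoff = clusters_at("abcde", merges, cutoff=0.6)
by_count = clusters_at("abcde", merges, k=2)
```

With a cut-off of 0.6 the sketch keeps {a}, {b,c} and {d,e}; asking for two clusters keeps {a} and {b,c,d,e}.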

Similarity functions for binary vectors • Based on matching corresponding 1s and 0s in the two vectors. • For example “the number of matching 1s+the number of matching 0s divided by the size of a vector” or, • using inner product, Dice, Cosine, Jaccard
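As a minimal sketch of the measures named above, the following computes each of them for two 0/1 vectors. The function name `binary_similarities` and the sample vectors are assumptions made for illustration; the formulas are the standard definitions of simple matching, inner product, Dice, cosine and Jaccard for binary data.

```python
import math


def binary_similarities(x, y):
    """Return several similarity scores for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    both_one = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)   # matching 1s
    both_zero = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # matching 0s
    ones_x, ones_y = sum(x), sum(y)
    return {
        # (matching 1s + matching 0s) / size of a vector
        "simple_matching": (both_one + both_zero) / len(x),
        # for binary vectors the inner product is the number of matching 1s
        "inner_product": both_one,
        "dice": 2 * both_one / (ones_x + ones_y) if ones_x + ones_y else 0.0,
        "cosine": both_one / math.sqrt(ones_x * ones_y) if ones_x and ones_y else 0.0,
        "jaccard": both_one / (ones_x + ones_y - both_one) if ones_x + ones_y else 0.0,
    }


scores = binary_similarities([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Note that only simple matching rewards matching 0s; the other four measures depend on the matching 1s alone.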

Similarity functions for points in space • Distance usually used • Most similar pair is minimum distance pair

Similarity functions for points in space • For Euclidean space, cosine and Euclidean distance are commonly used • Statistical measures such as correlation coefficients between two items are also used
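The measures mentioned above can be sketched in a few lines. The function names are assumptions chosen for this illustration, and Pearson correlation is used here as one example of a correlation coefficient; the slides do not say which coefficient is meant.

```python
import math


def euclidean_distance(x, y):
    """Distance between two points; a smaller distance means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def pearson_correlation(x, y):
    """Pearson correlation: the cosine of the mean-centered vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])


d = euclidean_distance([0, 0], [3, 4])
c = cosine_similarity([1, 2], [2, 4])
r = pearson_correlation([1, 2, 3], [2, 4, 6])
```

When a distance is used in place of a similarity, the clustering algorithm joins the pair with the minimum distance rather than the maximum similarity.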

Hierarchical Methods • Agglomerative • Divisive • Heuristic (faster methods)

Agglomerative • Initially each item is in its own cluster • Clusters are joined until all items are in one cluster

Divisive • All items in one cluster • Clusters divided into smaller clusters • Process continues until each item in separate cluster

Heuristic (faster methods) • Do not compute similarity between every pair • Important for dynamic incremental clustering

Agglomerative clustering (initialize)
C := N {one item per cluster}
for i := 1 to N do {create the N clusters}
    Gi := {xi}
endfor
S := {G1, …, GN} {create a set of all clusters S}

Agglomerative clustering (initialize)
{compute the similarity s(xi, xj) between all cluster pairs}
for 1 ≤ i, j ≤ N where i ≠ j do
    s(Gi, Gj) := s(xi, xj)
endfor

Hierarchical clustering (main)
while C > 1 do
    find Gp, Gq s.t. s(Gp, Gq) = max{s(Gi, Gj) | i ≠ j}
    Gr := Gp ∪ Gq
    delete Gp and Gq from S and add Gr to S

Hierarchical clustering (main)
    save information needed to build a dendrogram
    C := C - 1 {decrease number of clusters}

Hierarchical clustering (main)
    {compute the similarity to the new cluster}
    for all Gi ∈ S s.t. Gi ≠ Gr do
        compute s(Gi, Gr)
    endfor
endwhile
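The pseudocode above can be sketched in Python roughly as follows. This is a naive illustration, not an efficient implementation: it recomputes each cluster-to-cluster similarity from the item pairs on every pass. The names `agglomerative`, `item_sim` and `SIM` are assumptions for this sketch, and the similarity values are those of the five-item example used later in the lecture.

```python
from itertools import combinations

# Item-to-item similarities (upper triangle of the lecture's example matrix).
SIM = {
    ("a", "b"): 0.4, ("a", "c"): 0.3, ("a", "d"): 0.2, ("a", "e"): 0.1,
    ("b", "c"): 0.8, ("b", "d"): 0.2, ("b", "e"): 0.1,
    ("c", "d"): 0.5, ("c", "e"): 0.4,
    ("d", "e"): 0.7,
}


def item_sim(x, y):
    """Symmetric lookup into the upper-triangular table."""
    return SIM[(x, y)] if (x, y) in SIM else SIM[(y, x)]


def agglomerative(items, item_sim, linkage):
    """Join the most similar pair of clusters until one cluster remains.

    linkage reduces the list of item-pair similarities between two
    clusters to one number (max = single link, min = complete link).
    Returns the merges as (sorted members, similarity), in merge order.
    """
    clusters = [frozenset([x]) for x in items]        # S = {G1, ..., GN}

    def s(g, h):
        return linkage([item_sim(x, y) for x in g for y in h])

    merges = []                                        # dendrogram information
    while len(clusters) > 1:                           # while C > 1
        gp, gq = max(combinations(clusters, 2), key=lambda pair: s(*pair))
        gr = gp | gq                                   # Gr := Gp ∪ Gq
        merges.append((sorted(gr), s(gp, gq)))
        clusters = [g for g in clusters if g not in (gp, gq)] + [gr]
    return merges


single = agglomerative("abcde", item_sim, max)
complete = agglomerative("abcde", item_sim, min)
```

Passing the linkage as a function keeps the main loop identical for single and complete link; only the way two clusters are compared changes.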

Analysis • Number of clusters O(N) • Number of pairs O(N^2) • Initialization O(N^2)

Analysis • Finding the pair of clusters to join is O(N^2) per iteration • Computing the similarity to the new cluster is O(N g(N)) per iteration, where g(N) is an upper bound on the cost of computing s(Gi, Gj) • Over the N-1 iterations of the main loop we get O(N^3 + N^2 g(N))

Single link • Uses similarity between the most similar pair of items • x ∈ Gr and y ∈ Gi such that: s(Gr, Gi) = max{s(x,y) | x ∈ Gr and y ∈ Gi}

Complete link • Uses similarity between the least similar pair of items • x ∈ Gr and y ∈ Gi such that: s(Gr, Gi) = min{s(x,y) | x ∈ Gr and y ∈ Gi}

Single and complete link • To save computation time: • s(Gr, Gi) can be computed from s(Gp, Gi) and s(Gq, Gi) by taking • the maximum of the two for single link, • and the minimum for complete link
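The shortcut above can be sketched as a variant of the agglomerative loop that keeps a table of cluster-to-cluster similarities and, after each join, derives s(Gr, Gi) from s(Gp, Gi) and s(Gq, Gi) alone. The function name `cluster_with_updates` and the table layout are assumptions for this illustration; the similarity values are those of the lecture's five-item example.

```python
SIM = {
    ("a", "b"): 0.4, ("a", "c"): 0.3, ("a", "d"): 0.2, ("a", "e"): 0.1,
    ("b", "c"): 0.8, ("b", "d"): 0.2, ("b", "e"): 0.1,
    ("c", "d"): 0.5, ("c", "e"): 0.4,
    ("d", "e"): 0.7,
}


def cluster_with_updates(items, sim, combine):
    """combine is max for single link and min for complete link."""
    clusters = [frozenset([x]) for x in items]
    # s maps an unordered pair of clusters to their similarity.
    s = {frozenset((frozenset([x]), frozenset([y]))): v
         for (x, y), v in sim.items()}
    merges = []
    while len(clusters) > 1:
        pair = max(s, key=s.get)               # most similar pair of clusters
        gp, gq = pair
        gr = gp | gq
        merges.append((sorted(gr), s[pair]))
        del s[pair]
        others = [g for g in clusters if g not in (gp, gq)]
        for gi in others:
            # Update rule: no item pairs are revisited here.
            s[frozenset((gr, gi))] = combine(
                s.pop(frozenset((gp, gi))), s.pop(frozenset((gq, gi))))
        clusters = others + [gr]
    return merges


single = cluster_with_updates("abcde", SIM, max)
complete = cluster_with_updates("abcde", SIM, min)
```

Each update costs constant time per remaining cluster, so in terms of the analysis earlier in the lecture g(N) is O(1) for these two methods.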

Single link - update example [Figure: edges show distance (minimum distance = maximum similarity); labels mark the joined new cluster and the similarity to the new cluster]

Complete link - update example • s(Gr, Gi) = min{s(Gp, Gi), s(Gq, Gi)} = s(Gp, Gi) [Figure: Gp and Gq are joined into Gr; edges s(Gp, Gi) and s(Gq, Gi) connect them to Gi; maximum distance = minimum similarity]

Complete link [Figure: labels mark the joined new cluster and the similarity to the new cluster]

Maximum spanning tree • Single link can be solved using a maximum spanning tree algorithm • The weight of edges is the similarity • Kruskal gives the hierarchy directly • Prim will first derive the tree and then the cluster hierarchy

Example
     a    b    c    d    e
a    1    0.4  0.3  0.2  0.1
b         1    0.8  0.2  0.1
c              1    0.5  0.4
d                   1    0.7
e                        1

The graph [Figure: complete graph on the items a, b, c, d, e; each edge is labeled with the similarity of its two endpoints from the example matrix]

Single link - Kruskal • s(b,c)=0.8 - {b,c} • s(d,e)=0.7 - {d,e} • s(c,d)=0.5 - {b,c,d,e} • s(a,b)=0.4 - {a,b,c,d,e}
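The Kruskal trace above can be reproduced with a short sketch: sort the edges by decreasing similarity and join two clusters whenever an edge connects different ones. The function name `single_link_kruskal` and the union-find bookkeeping are assumptions of this illustration; the edge weights are the similarities from the example matrix.

```python
def single_link_kruskal(items, edges):
    """edges: list of (similarity, x, y). Returns (similarity, members) per join."""
    parent = {x: x for x in items}       # union-find forest
    members = {x: {x} for x in items}    # items under each root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    merges = []
    for s, x, y in sorted(edges, reverse=True):   # heaviest edge first
        rx, ry = find(x), find(y)
        if rx != ry:                              # edge joins two clusters
            parent[ry] = rx
            members[rx] |= members.pop(ry)
            merges.append((s, sorted(members[rx])))
    return merges


EDGES = [
    (0.4, "a", "b"), (0.3, "a", "c"), (0.2, "a", "d"), (0.1, "a", "e"),
    (0.8, "b", "c"), (0.2, "b", "d"), (0.1, "b", "e"),
    (0.5, "c", "d"), (0.4, "c", "e"),
    (0.7, "d", "e"),
]
hierarchy = single_link_kruskal("abcde", EDGES)
```

Edges that fall inside an existing cluster, such as s(c,e)=0.4 once {b,c,d,e} has formed, are skipped, which is why only N-1 edges end up in the tree.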

Single link - Problem • The problem with the single link method is that items such as e and b, with a similarity of only 0.1, are clustered together

Single link dendrogram [Figure: dendrogram over the leaves a, b, c, d, e; {b,c} join at 0.8, {d,e} at 0.7, {b,c,d,e} at 0.5, and a joins at 0.4]

Single link (based on agglomerative)
     a    b    c    d    e
a    1    0.4  0.3  0.2  0.1
b         1    0.8  0.2  0.1
c              1    0.5  0.4
d                   1    0.7
e                        1
Initially {a}, {b}, {c}, {d}, {e}
{b,c} with s(b,c) = 0.8
s(a, {b,c}) = 0.4   s(d, {b,c}) = 0.5   s(e, {b,c}) = 0.4

Single link
        a    {b,c}  d    e
a       1    0.4    0.2  0.1
{b,c}        1      0.5  0.4
d                   1    0.7
e                        1
{d,e} with s(d,e) = 0.7
s(a, {d,e}) = 0.2   s({d,e}, {b,c}) = 0.5

Single link
        a    {b,c}  {d,e}
a       1    0.4    0.2
{b,c}        1      0.5
{d,e}               1
{b,c,d,e} with s({b,c}, {d,e}) = 0.5
s(a, {b,c,d,e}) = 0.4

Single link
            a    {b,c,d,e}
a           1    0.4
{b,c,d,e}        1
{a,b,c,d,e} with s(a, {b,c,d,e}) = 0.4

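Single link
     a    b    c    d    e
a    1    0.4  0.3  0.2  0.1
b         1    0.8  0.2  0.1
c              1    0.5  0.4
d                   1    0.7
e                        1
[Figure: the edges used by single link, s(b,c)=0.8, s(d,e)=0.7, s(c,d)=0.5 and s(a,b)=0.4, with the resulting dendrogram over a, b, c, d, e]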

Complete link ex. • Initially {a}, {b}, {c}, {d}, {e} • The most similar pair of clusters is {b} and {c} with s(b,c)=0.8 • The first cluster is {b,c}

Complete link (based on agglomerative)
     a    b    c    d    e
a    1    0.4  0.3  0.2  0.1
b         1    0.8  0.2  0.1
c              1    0.5  0.4
d                   1    0.7
e                        1
s(a, {b,c}) = 0.3   s(d, {b,c}) = 0.2   s(e, {b,c}) = 0.1

Complete link
        a    {b,c}  d    e
a       1    0.3    0.2  0.1
{b,c}        1      0.2  0.1
d                   1    0.7
e                        1
{d,e} with s(d,e) = 0.7
s(a, {d,e}) = 0.1   s({d,e}, {b,c}) = 0.1

Complete link
        a    {b,c}  {d,e}
a       1    0.3    0.1
{b,c}        1      0.1
{d,e}               1
{a,b,c} with s(a, {b,c}) = 0.3
s({a,b,c}, {d,e}) = 0.1
{a,b,c,d,e}

Complete link
          {a,b,c}  {d,e}
{a,b,c}   1        0.1
{d,e}              1
{a,b,c,d,e}

Complete link dendrogram [Figure: dendrogram over the leaves a, b, c, d, e; {b,c} join at 0.8, {d,e} at 0.7, a joins {b,c} at 0.3, and {a,b,c} joins {d,e} at 0.1]

Average link • Average link - use the average of the similarities between all pairs of items x ∈ Gr and y ∈ Gi

Average link ex. • The most similar pair of clusters is {b} and {c} with s(b,c)=0.8 • The first cluster is {b,c}

Average link ex. • s(a, {b,c})=(0.4+0.3)/2=0.35 • s(d, {b,c})=(0.2+0.5)/2=0.35 • s(e, {b,c})=(0.1+0.4)/2=0.25 • s(a,d)=0.2 s(a,e)=0.1 s(d,e)=0.7 • The second cluster is {d,e}

Average link ex. • s(a, {d,e})=(0.2+0.1)/2=0.15 • s({d,e},{b,c})=(0.2+0.5+0.1+0.4)/4=0.3 • s(a, {b,c})=0.35 • The next cluster is {a,b,c}

Average link ex. • s({a,b,c},{d,e})=(0.2+0.1+0.2+0.1+0.5+0.4)/6=0.25 • The last cluster is {a,b,c,d,e}
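The averages in this example can be checked with a small sketch. The helper names `item_sim` and `average_link` are assumptions for this illustration; the values in `SIM` are the upper triangle of the lecture's example matrix.

```python
SIM = {
    ("a", "b"): 0.4, ("a", "c"): 0.3, ("a", "d"): 0.2, ("a", "e"): 0.1,
    ("b", "c"): 0.8, ("b", "d"): 0.2, ("b", "e"): 0.1,
    ("c", "d"): 0.5, ("c", "e"): 0.4,
    ("d", "e"): 0.7,
}


def item_sim(x, y):
    """Symmetric lookup into the upper-triangular table."""
    return SIM[(x, y)] if (x, y) in SIM else SIM[(y, x)]


def average_link(g, h):
    """Average similarity over all pairs (x, y) with x in g and y in h."""
    total = sum(item_sim(x, y) for x in g for y in h)
    return total / (len(g) * len(h))


# Each cluster is written as a string of its one-letter item names.
step2 = {name: average_link(name, "bc") for name in ("a", "d", "e")}
step3 = average_link("de", "bc")
step4 = average_link("abc", "de")
```

Unlike single and complete link, the average cannot be obtained from s(Gp, Gi) and s(Gq, Gi) by a plain max or min; a weighted mean using the cluster sizes is needed instead.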

Average link dendrogram [Figure: dendrogram over the leaves a, b, c, d, e; {b,c} join at 0.8, {d,e} at 0.7, a joins {b,c} at 0.35, and {a,b,c} joins {d,e} at 0.25]

Centroids • An item (document or term) is a point in t-dimensional space • The centroid is a new point in t-dimensional space which represents the cluster • The centroid is computed by averaging the coordinate values of all cluster points, for each of the t coordinates
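The centroid computation described above amounts to a coordinate-wise mean. A minimal sketch follows; the function name `centroid` and the sample points are assumptions made for illustration.

```python
def centroid(points):
    """Average each of the t coordinates over all points in the cluster."""
    if not points:
        raise ValueError("a cluster must contain at least one point")
    t = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(t)]


# Two hypothetical items in 3-dimensional space (e.g. term weights).
cluster = [[1, 0, 2], [3, 4, 0]]
center = centroid(cluster)
```

The centroid need not coincide with any item in the cluster; it is a new point that stands in for the cluster when similarities are computed.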
