Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong Presenter: 조이현
Overview • Previous step result • Identifying base cluster • Basic idea • Basic definition • Orthogonal clustering • Determine cluster number • Combining base clusters • Prototype system • Conclusion
Previous step result • Term-document matrix (the m × n matrix A) • Row vectors represent the terms (key phrases). • Column vectors represent the documents. • The element A(i, j) = 1 if the i-th term Ti occurs in the j-th document Dj, and 0 otherwise.
Basic idea • Terms (documents) linked with the same document (term) should be semantically close. • Densely linked terms or documents should be grouped together. [Figure: the association between terms (key phrases) and documents]
Definitions concerning clusters • Cluster vector (xg) • Cg is a cluster of m objects t1, t2, ···, tm. • xg denotes the cluster vector of Cg. • xg is an m-dimensional vector with |xg| = 1. • xg(i) represents the intensity with which ti belongs to Cg. • Cluster density • Assume xg (yg) is a cluster of the row (column) vectors of A. • The cluster density of xg (yg) is |xgᵀA| (|Ayg|).
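As a hedged sketch (the function name and interface are my own), the density of a row-cluster vector xg is simply the Euclidean norm of xgᵀA:

import numpy as np

def cluster_density(x_g, A):
    # x_g: unit-length m-dimensional cluster vector over the terms (rows of A)
    # A:   m x n term-document matrix
    # Returns |x_g^T A|, the density of the cluster described by x_g.
    return np.linalg.norm(x_g @ A)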
Eigenvalue & Eigenvector • Let A be a linear transformation represented by a matrix A. • If Ax = λx for a non-zero vector x, then λ is an eigenvalue of A and x is a right eigenvector.
Orthogonal clustering def. (1/2) • Let x1 be the cluster with maximum density and x2 another cluster. • Clusters with high density capture the main implicit concepts. • The larger the component η of x2 along x1, the higher the cluster density of x2, so with no constraint on x2 it will be arbitrarily close to x1. • To obtain meaningful clusters, x2 should be orthogonal to x1.
Orthogonal clustering def. (2/2) • The orthogonal clustering of the row (column) vectors of A is the discovery of a set of cluster vectors x1, x2, ···, xk. • xg (1 ≤ g ≤ k) is the cluster with maximum density subject to being orthogonal to x1, ···, xg−1.
Finding the solution (1/3) (orthogonal clustering problem) • Rayleigh quotient def. • M is a real m × m symmetric matrix. • λ1 ≥ λ2 ≥ ··· ≥ λm are the eigenvalues of M. • p1, p2, ···, pm are the orthonormal eigenvectors corresponding to these eigenvalues. • The Rayleigh quotient of M and a non-zero vector x is R(M, x) = xᵀMx / xᵀx. • Theorem 1: among all non-zero vectors x orthogonal to p1, ···, pg−1, the Rayleigh quotient R(M, x) attains its maximum λg at x = pg.
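A minimal numerical check (my own example, not from the slides) that the Rayleigh quotient of a symmetric matrix is maximized by the eigenvector of its largest eigenvalue:

import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # real symmetric matrix
eigvals, eigvecs = np.linalg.eigh(M)       # eigenvalues in ascending order
p1 = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue

def rayleigh(x):
    return (x @ M @ x) / (x @ x)

print(rayleigh(p1), eigvals[-1])           # both are approximately 3.0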
Finding the solution (2/3) (orthogonal clustering problem) • SVD (Singular Value Decomposition) def. • A is an m × n matrix with rank(A) = r. • λ1 ≥ λ2 ≥ ··· ≥ λr > 0 are the r non-zero eigenvalues of AAᵀ. • x1, x2, ···, xm (y1, y2, ···, yn) are the orthonormal eigenvectors of AAᵀ (AᵀA), called the left (right) singular vectors of A. • U = [x1, x2, ···, xm], V = [y1, y2, ···, yn], and A = UΣVᵀ, where Σ is the m × n diagonal matrix of singular values σg = √λg.
Finding the solution (3/3) (orthogonal clustering problem) • Theorem 2 • The left (right) singular vectors of A are the cluster vectors discovered through orthogonal clustering of the row (column) vectors of A. • Proof sketch • cg must have maximum density subject to being orthogonal to c1, ···, cg−1 (by the definition of orthogonal clustering). • Since |cgᵀA|² = cgᵀAAᵀcg, maximizing the density amounts to maximizing the Rayleigh quotient of AAᵀ over unit vectors orthogonal to c1, ···, cg−1. • By Theorem 1, cg must be the g-th eigenvector pg of AAᵀ, i.e. the g-th left singular vector of A.
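The following sketch (my own illustration of Theorem 2, assuming the term-document matrix A is available as a numpy array) obtains the cluster vectors directly from the SVD:

import numpy as np

A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)     # toy term-document matrix

U, S, Vt = np.linalg.svd(A, full_matrices=False)
term_clusters = U.T     # row g is the cluster vector x_g over the terms
doc_clusters = Vt       # row g is the cluster vector y_g over the documents
# S[g] = sigma_g is also the density |x_g^T A| of the g-th cluster.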
Determining the cluster number (1/2) • Cluster matrix def. • The clusters described respectively by xg and yg in fact have the same "meaning".
Determining the cluster number (2/2) • Orthogonal clustering quality def. • Given a quality threshold q* (e.g. 80%), the ideal cluster number k* is the minimum k satisfying q(A, k) ≥ q*.
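The exact quality measure q(A, k) is not reproduced on the slide; the sketch below assumes the common energy-ratio choice (cumulative squared singular values over their total) and picks the minimum k that reaches the threshold:

import numpy as np

def ideal_cluster_number(singular_values, q_star=0.8):
    # Assumed quality: q(A, k) = sum_{g<=k} sigma_g^2 / sum_g sigma_g^2.
    energy = np.cumsum(np.square(singular_values))
    quality = energy / energy[-1]
    return int(np.searchsorted(quality, q_star)) + 1   # minimum k with q(A, k) >= q_star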
Combining base clusters

Combining base clusters X and Y:
if ( |X ∩ Y| / |X ∪ Y| > t1 ) {
    X and Y are merged into one cluster;
} else if ( |X| > |Y| ) {
    if ( |X ∩ Y| / |Y| > t2 ) { let Y become X's child; }
} else {
    if ( |X ∩ Y| / |X| > t2 ) { let X become Y's child; }
}

Merging labels:
if ( label_x is a substring of label_y ) {
    label_xy = label_y;
} else if ( label_y is a substring of label_x ) {
    label_xy = label_x;
} else {
    label_xy = label_x + " " + label_y;
}
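A runnable sketch of the combining rule above (the threshold values and the set representation of base clusters are assumptions for illustration):

def combine(X, Y, t1=0.5, t2=0.75):
    # X, Y: sets of document ids belonging to two base clusters.
    inter = len(X & Y)
    union = len(X | Y)
    if union and inter / union > t1:
        return "merge X and Y"
    smaller = Y if len(X) > len(Y) else X
    if len(smaller) and inter / len(smaller) > t2:
        return "Y becomes X's child" if smaller is Y else "X becomes Y's child"
    return "keep separate"

def merge_labels(label_x, label_y):
    if label_x in label_y:
        return label_y
    if label_y in label_x:
        return label_x
    return label_x + " " + label_y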
Prototype system • A prototype system named WICE (Web Information Clustering Engine) was created. • It deals well with the special problems related to Chinese. • Output for the query "object oriented": • object oriented programming • object oriented analysis, etc.
Conclusion • Main contributions • The benefit of using key phrases. • A suffix-array-based method for key-phrase identification. • The concept of orthogonal clustering. • The WICE system was designed and implemented. • Future work • Further experiments. • Detailed analysis and interpretation of the experimental results. • Comparison with other clustering algorithms.