Document Clustering with Cluster Refinement and Model Selection Capabilities • Advisor: Dr. Hsu • Presenter: Shu-Ya Li • Authors: Xin Liu, Yihong Gong, Wei Xu, Shenghuo Zhu • SIGIR 2002, pages 191-198
Outline • Motivation • Objective • Method • Experimental Result • Conclusion • Personal Opinions
Motivation • The problems and limitations of traditional text search engines: • The user must formulate the query using keywords. • A traditional text search is a narrowly specified search for documents matching the user’s query. • A traditional search engine returns hundreds, or even thousands, of hits.
Objective • We propose a document clustering method that strives to achieve: • a high accuracy of document clustering • the capability of estimating the number of clusters in the document corpus (i.e. the model selection capability)
Method • Feature Set • Term frequencies (TF) • Named entities (NE) • Term pairs (TP) • Example: the documents reporting the Clinton-Lewinsky scandal • Common named entities: "Clinton", "Lewinsky", "Ken Starr", "Linda Tripp", etc. • Word pairs: "grand jury", "independent counsel", "supreme court"
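The three feature types can be illustrated with a minimal Python sketch. It is not the authors' extraction pipeline: the function names are hypothetical, term pairs are approximated here by adjacent word pairs, and named entities are matched against a caller-supplied list instead of a real named-entity tagger.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(tokens):
    """TF feature: how often each single term occurs."""
    return Counter(tokens)


def term_pairs(tokens):
    """TP feature, approximated (assumption) by counts of adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))


def named_entity_counts(text, known_entities):
    """NE feature: count entities from a supplied list.

    Assumption: a fixed entity list stands in for a real named-entity tagger.
    """
    lowered = text.lower()
    return Counter(
        {e: lowered.count(e.lower()) for e in known_entities if e.lower() in lowered}
    )


doc = (
    "The grand jury heard Linda Tripp. "
    "Ken Starr, the independent counsel, addressed the grand jury."
)
tokens = tokenize(doc)
tf = term_frequencies(tokens)
tp = term_pairs(tokens)
ne = named_entity_counts(doc, ["Clinton", "Lewinsky", "Ken Starr", "Linda Tripp"])

print(tf.most_common(3))
print(tp[("grand", "jury")])
print(ne)
```

Keeping the three feature sets in separate counters makes it easy to try them alone or in combination when building document vectors.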
Method - self-refinement process • Step 1: Cluster the documents with the GMM + EM algorithm to obtain the initial document cluster set C. • Step 2: From C, identify the discriminative features F = {f1, f2, ..., fΛ} along with their cluster labels S = {σ1, σ2, ..., σΛ}, using the discriminative feature metric DFM(fi). • Step 3: Apply the iterative voting scheme to refine the document clusters, which yields a new document cluster set. • Step 4: Compare the new document cluster set with C. If the result has converged, terminate the process; otherwise, set C to the new cluster set and go to Step 2.
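The refinement loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the initial labels are passed in instead of being produced by GMM + EM, and because the slide does not give the DFM formula, a simple smoothed in-cluster versus out-of-cluster rate ratio is used as a stand-in. All function names and the `threshold` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def discriminative_features(docs, labels, threshold=2.0):
    """Return {feature: cluster_label} for features concentrated in one cluster.

    Stand-in for DFM (assumption): ratio of the feature's document rate inside
    its best cluster to its rate outside that cluster, with add-one smoothing.
    """
    cluster_sizes = Counter(labels)
    doc_counts = defaultdict(Counter)  # feature -> cluster -> docs containing it
    for features, label in zip(docs, labels):
        for f in set(features):
            doc_counts[f][label] += 1
    n_docs = len(docs)
    selected = {}
    for f, per_cluster in doc_counts.items():
        best, inside = per_cluster.most_common(1)[0]
        outside = sum(per_cluster.values()) - inside
        rate_in = (inside + 1) / (cluster_sizes[best] + 2)
        rate_out = (outside + 1) / (n_docs - cluster_sizes[best] + 2)
        if rate_in / rate_out >= threshold:
            selected[f] = best
    return selected


def vote(docs, labels, feature_labels):
    """Move each document to the cluster that most of its discriminative features vote for."""
    new_labels = []
    for features, old in zip(docs, labels):
        votes = Counter(feature_labels[f] for f in features if f in feature_labels)
        new_labels.append(votes.most_common(1)[0][0] if votes else old)
    return new_labels


def refine(docs, initial_labels, max_iterations=20):
    """Repeat feature selection and voting until the cluster set stops changing."""
    labels = list(initial_labels)
    for _ in range(max_iterations):
        new_labels = vote(docs, labels, discriminative_features(docs, labels))
        if new_labels == labels:  # converged
            break
        labels = new_labels
    return labels


docs = [
    ["clinton", "lewinsky", "jury"],
    ["clinton", "starr", "jury"],
    ["clinton", "lewinsky", "counsel"],
    ["market", "stocks", "shares"],
    ["stocks", "shares", "bank"],
    ["market", "bank", "shares"],
]
initial = [0, 0, 1, 1, 1, 1]  # the third document starts in the wrong cluster
print(refine(docs, initial))
```

In this toy corpus the third document is pulled back to the first cluster because "clinton" is selected as a discriminative feature of that cluster and casts the only vote for the document.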
Method - Model Selection • Measure the similarity between two document cluster sets C and C’. • The model selection algorithm: • Step 1: Guess the possible range (Rl, Rh) of the number of document clusters from the data. • Step 2: Set k = Rl. • Step 3: Cluster the document corpus into k clusters, repeating the clustering so that there are several results to compare. • Step 4: Compute the similarity between each pair of the results, and take the average over all pairs. • Step 5: If k < Rh, set k = k + 1 and go to Step 3; otherwise, go to Step 6. • Step 6: Select the k that yields the largest average similarity.
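The model selection loop can be sketched as follows. Everything concrete here is an assumption made for illustration: a tiny 1-D k-means with random initial centres stands in for GMM + EM, the Rand index stands in for the similarity measure between two cluster sets (the slide does not show the measure actually used), and `runs` and `seed` are hypothetical parameters.

```python
import random
from itertools import combinations


def kmeans_1d(points, k, rng, iterations=25):
    """Tiny 1-D k-means with random initial centres (stand-in for GMM + EM)."""
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: abs(p - centres[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = sum(members) / len(members)
    return labels


def rand_index(a, b):
    """Fraction of item pairs on which two clusterings agree (stand-in similarity)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def select_k(points, r_low, r_high, runs=5, seed=0):
    """For each k in [r_low, r_high], average the pairwise similarity of several runs."""
    rng = random.Random(seed)
    average = {}
    for k in range(r_low, r_high + 1):                        # Steps 2 and 5
        results = [kmeans_1d(points, k, rng) for _ in range(runs)]  # Step 3
        scores = [rand_index(a, b) for a, b in combinations(results, 2)]  # Step 4
        average[k] = sum(scores) / len(scores)
    best_k = max(average, key=average.get)                    # Step 6
    return best_k, average


points = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9, 9.0, 9.1, 8.8]
best_k, average = select_k(points, 2, 4)
print(best_k, average)
```

The idea behind the loop is that runs started from different initializations tend to agree with each other when k matches the structure of the data, so the k with the most stable results is chosen.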
Experimental Result - Document Clustering Evaluation • GMM + EM algorithm • ABC+CNN-01-13-18-32-48-70-71-77-86 [news agency - news event category - number of reports]
Experimental Result - Model Selection Evaluation • Compared with the BIC-based model selection method
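For readers unfamiliar with the baseline, a BIC-based selection can be sketched as below. It uses the standard definition BIC = p ln(n) - 2 ln(L), where lower is better; the exact BIC variant used in the comparison is not shown on the slide, and the log-likelihoods and parameter counts in the example are made-up numbers for illustration only.

```python
import math


def bic(log_likelihood, num_parameters, num_samples):
    """Standard Bayesian information criterion: p * ln(n) - 2 * ln(L)."""
    return num_parameters * math.log(num_samples) - 2.0 * log_likelihood


def select_by_bic(candidates, num_samples):
    """candidates maps k -> (log_likelihood, num_parameters); return the k with the lowest BIC."""
    return min(candidates, key=lambda k: bic(*candidates[k], num_samples))


# Hypothetical fitted models: k -> (log-likelihood, number of free parameters).
candidates = {2: (-1250.0, 9), 3: (-1180.0, 14), 4: (-1175.0, 19)}
print(select_by_bic(candidates, num_samples=500))
```

The penalty term grows with the number of parameters, so a larger k is chosen only when it improves the likelihood enough to pay for the extra parameters.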
Conclusion • The method accurately clusters the given document corpus by using the GMM together with the EM algorithm. • The model selection capability is achieved by guessing a value C for the number of clusters N.
Personal Opinions • Advantage • high accuracy of document clustering • the model selection capability • Drawback • … • Application • …