Document Clustering with Cluster Refinement and Model Selection Capabilities

Advisor: Dr. Hsu • Presenter: Shu-Ya Li • Authors: Xin Liu, Yihong Gong, Wei Xu, Shenghuo Zhu • SIGIR 2002, pages 191-198

Outline • Motivation • Objective • Method • Experimental Result • Conclusion • Personal Opinions

Motivation • Problems and limitations of traditional text search: • The user must formulate the query using keywords. • Traditional text search engines perform a narrowly specified search for documents matching the user's query. • A traditional search engine returns hundreds, or even thousands, of hits.

Objective • We propose a document clustering method that strives to achieve: • a high accuracy of document clustering • the capability of estimating the number of clusters in the document corpus (i.e. the model selection capability)

Method • Feature Set • Term frequencies (TF) • Named entities (NE) • Term pairs (TP) • Example: documents reporting the Clinton-Lewinsky scandal • Common named entities: "Clinton", "Lewinsky", "Ken Starr", "Linda Tripp", etc. • Word pairs: "grand jury", "independent counsel", "supreme court"
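A minimal sketch of how two of the three feature types (TF and TP) might be extracted. The slides do not say how term pairs are formed, so counting adjacent words is an assumption here, and the helper names `tokenize`, `term_frequencies`, and `term_pairs` are hypothetical. Named entities would need a separate recognizer and are left out.

```python
import re
from collections import Counter

def tokenize(text):
    # Deliberately simple tokenizer: lowercase alphabetic words only.
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(tokens):
    # TF feature: how often each term occurs in the document.
    return Counter(tokens)

def term_pairs(tokens):
    # TP feature (assumption): count pairs of adjacent words.
    return Counter(zip(tokens, tokens[1:]))

doc = "The grand jury heard the independent counsel. The grand jury adjourned."
tokens = tokenize(doc)
tf = term_frequencies(tokens)
tp = term_pairs(tokens)
print(tf.most_common(3))
print(tp[("grand", "jury")], tp[("independent", "counsel")])
```

In this toy document "grand jury" is counted twice and "independent counsel" once, which is the kind of signal the term-pair feature is meant to capture.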

Method - self-refinement process • GMM + EM algorithm • Apply the iterative voting scheme to refine the document clusters.

Method - self-refinement process • Identify discriminative features F = {f1, f2, ..., fΛ} along with cluster labels S = {σ1, σ2, ..., σΛ} • Define the discriminative feature metric DFM(fi) • Compare the new document cluster set with C. • If the result converges → terminate the process • Otherwise → set C to the new cluster set, and go to Step 2.
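The refinement loop above can be sketched roughly as follows. The slides do not give the formula for DFM(fi), so the concentration score used here (the share of a feature's occurrences that fall in its dominant cluster) and the `threshold` parameter are hypothetical stand-ins, and the function names are illustrative only. Documents are assumed to be feature-count dictionaries, and the initial labels are assumed to come from a prior clustering step.

```python
from collections import Counter, defaultdict

def discriminative_features(docs, labels, threshold=0.6):
    # Map each sufficiently concentrated feature to its dominant cluster.
    # The concentration score is a hypothetical substitute for DFM(f_i).
    per_feature = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for feature, count in doc.items():
            per_feature[feature][label] += count
    selected = {}
    for feature, by_cluster in per_feature.items():
        top_label, top_count = by_cluster.most_common(1)[0]
        if top_count / sum(by_cluster.values()) >= threshold:
            selected[feature] = top_label
    return selected

def vote(docs, labels, selected):
    # Each discriminative feature in a document votes for its cluster label.
    new_labels = []
    for doc, old_label in zip(docs, labels):
        votes = Counter()
        for feature, count in doc.items():
            if feature in selected:
                votes[selected[feature]] += count
        # A document with no discriminative features keeps its old label.
        new_labels.append(votes.most_common(1)[0][0] if votes else old_label)
    return new_labels

def refine(docs, labels, max_iter=20):
    # Repeat feature selection and voting until the labels stop changing.
    for _ in range(max_iter):
        selected = discriminative_features(docs, labels)
        new_labels = vote(docs, labels, selected)
        if new_labels == labels:
            break
        labels = new_labels
    return labels

docs = [
    {"clinton": 3, "jury": 2},
    {"clinton": 2, "starr": 1},
    {"earthquake": 4, "rescue": 1},
    {"earthquake": 2, "rescue": 1},  # starts in the wrong cluster
]
refined = refine(docs, [0, 0, 1, 0])
print(refined)
```

In this toy run the last document starts in cluster 0 and is moved to cluster 1 because "earthquake" is concentrated there.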

Method - Model Selection • Measure the similarity between two cluster sets C and C' • The model selection algorithm: 1. Guess the range (Rl, Rh) that contains the possible number of document clusters. 2. Set k = Rl. 3. Cluster the document corpus into k clusters. 4. Compute the similarity between each pair of the clustering results obtained for this k, and take the average over all pairs. 5. If k < Rh, set k = k + 1 and go to Step 3; otherwise, go to Step 6. 6. Select the k which yields the largest average similarity.
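A minimal sketch of the model selection loop described above. The symbol for the similarity measure did not survive in the slides, so normalized mutual information is assumed here. The `cluster(data, k, seed)` callback, the `runs` parameter (number of repeated clusterings per k), and `toy_cluster` are hypothetical names introduced only for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def normalized_mutual_information(a, b):
    # Similarity of two labelings of the same items (assumed measure).
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
             for (x, y), c in pab.items())
    ha = -sum((c / n) * math.log(c / n) for c in pa.values())
    hb = -sum((c / n) * math.log(c / n) for c in pb.values())
    if max(ha, hb) == 0:
        return 1.0  # both labelings put every item in a single cluster
    return mi / max(ha, hb)

def select_k(data, cluster, r_low, r_high, runs=5):
    # Return the k in [r_low, r_high] whose repeated clusterings agree most.
    if runs < 2:
        raise ValueError("need at least two runs to compare results")
    best_k, best_score = None, -1.0
    for k in range(r_low, r_high + 1):
        results = [cluster(data, k, seed) for seed in range(runs)]
        scores = [normalized_mutual_information(a, b)
                  for a, b in combinations(results, 2)]
        average = sum(scores) / len(scores)
        if average > best_score:
            best_k, best_score = k, average
    return best_k

def toy_cluster(data, k, seed):
    # Stand-in clusterer: stable only when k matches the two obvious groups.
    if k == 2:
        return [0 if x < 5 else 1 for x in data]
    return [(i * (seed + 1)) % k for i in range(len(data))]

data = [1, 2, 3, 8, 9, 10]
best = select_k(data, toy_cluster, 2, 4)
print(best)
```

The idea is that a correct k yields clusterings that agree across runs, while a wrong k yields unstable results with low average pairwise similarity. In practice `toy_cluster` would be replaced by the actual clustering procedure run with different initializations.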

Experimental Result - Document Clustering Evaluation • GMM + EM algorithm • ABC+CNN-01-13-18-32-48-70-71-77-86 [news organization - news event category - number of reports]

Experimental Result - Model Selection Evaluation • Compared with the BIC-based model selection method

Conclusion • The method accurately clusters the given document corpus by using the GMM together with the EM algorithm. • The model selection capability is achieved by guessing a value C for the number of clusters N.

Personal Opinions • Advantage • high accuracy of document clustering • the model selection capability • Drawback • … • Application • …
