Semantic Smoothing of Document Models for Agglomerative Clustering
Xiaohua Zhou, Xiaodan Zhang, Tony Hu
College of Information Science & Technology, Drexel University, USA
Agglomerative Clustering
• Algorithm Overview
  • Initially assign each document to its own cluster, then repeatedly merge the most similar pair of clusters until only one cluster is left.
  • The core of this algorithm is computing pairwise document similarities.
  • Cosine similarity and Euclidean distance are frequently used.
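The merge loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization, raw term counts, cosine similarity, and the complete-linkage criterion named later in the experiment settings. The function names `cosine` and `agglomerate` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def agglomerate(docs, n_clusters=1):
    """Merge the most similar pair of clusters until n_clusters remain.

    Cluster similarity uses complete linkage: the similarity of the
    least similar pair of documents across the two clusters.
    """
    vectors = [Counter(d.lower().split()) for d in docs]
    # Pairwise document similarities are computed once, up front.
    sim = [[cosine(a, b) for b in vectors] for a in vectors]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                link = min(sim[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or link > best[0]:
                    best = (link, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Stopping at `n_clusters` greater than one simply cuts the hierarchy early; swapping `cosine` for a divergence between smoothed document models is the change the rest of the talk motivates.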
Where is the problem?
• Density of Topic-free General Words
  • An extreme example is stop words.
  • Such words are assigned a high probability (or a high score) but contribute nothing to the clustering task.
  • Any document pair could be considered similar for clustering because the two documents share many common words (Steinbach et al., 2000).
  • The effect of these general words needs to be discounted. This is the same idea as the TF-IDF weighting scheme.
Where is the problem?
• Sparsity of Topic-specific Words
Doc1: I am looking for any information about the space program. This includes NASA, the shuttles, history, anything! I would like to know if anyone could suggest books, periodicals, even ftp sites for a novice who is interested in the space program.
Doc2: The Phobos mission did return some useful data including images of Phobos itself. By the way, the new book entitled "Mars" (Kieffer et al, 1992, University of Arizona Press) has a great chapter on spacecraft exploration of the planet. The chapter is co-authored by V.I. Moroz of the Space Research Institute in Moscow, and includes details never before published in the West.
(From 20 Newsgroups)
Existing Solutions
• Density of general words
  • Removing stop words
  • Using TF-IDF scores
  • Term reweighting techniques
• Sparsity of topic-specific words
  • Ontology-based term similarity
• Problems of existing solutions
  • All these approaches are heuristic.
  • An ontology is often unavailable, or very limited in coverage.
Language Modeling Approach
• Agglomerative Clustering
  • Assume each document is generated by a language model.
  • The pairwise document similarity is defined through the divergence (i.e., KL-divergence) between the corresponding document models.
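As a rough illustration of comparing two document models, the sketch below computes the KL-divergence between two word distributions. The symmetrised variant is an assumption added for convenience, since clustering usually wants a symmetric measure; the slide only names KL-divergence. Function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two distributions given as word -> probability dicts.

    q must give non-zero probability to every word with p[w] > 0,
    which is exactly what smoothing the document model guarantees.
    """
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


def symmetric_kl(p, q):
    """Symmetrised divergence (an assumption; the slide only names KL)."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```

A smaller divergence means more similar documents, so a clustering loop would merge the pair of clusters with the lowest value rather than the highest.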
Jelinek-Mercer Smoothing
• The document model is smoothed by the corpus model (simple language model)
  • Discounts general words
  • Partially solves the data sparsity problem
c(w; d) is the count of word w in document d. C denotes the corpus (Zhai and Lafferty 2001).
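A minimal sketch of Jelinek-Mercer smoothing in its standard form, p_b(w|d) = (1 - α) · p_ml(w|d) + α · p(w|C), where p_ml(w|d) is c(w; d) divided by the document length. The value α = 0.5 and the function names are assumptions for illustration only, and the document's words are assumed to appear in the corpus vocabulary.

```python
from collections import Counter


def ml_model(tokens):
    """Maximum-likelihood unigram model: c(w; d) / sum of all counts."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jm_smooth(doc_tokens, corpus_model, alpha=0.5):
    """Interpolate the document model with the corpus model.

    Returns p_b(w|d) for every word in the corpus vocabulary, so words
    unseen in the document still receive non-zero probability.
    """
    p_ml = ml_model(doc_tokens)
    return {w: (1 - alpha) * p_ml.get(w, 0.0) + alpha * pc
            for w, pc in corpus_model.items()}
```

For example, with a corpus model built by `ml_model` over all tokens in the collection, `jm_smooth(["space", "program"], corpus_model)` gives "shuttle" a small positive probability even though the document never mentions it.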
Semantic Smoothing
• Description
  • Like the statistical translation model (Berger and Lafferty 1999), term semantic relationships are used for model smoothing.
  • Unlike the statistical translation model, contextual and sense information is taken into account.
  • Decompose a document into a set of context-sensitive multiword phrases, then statistically translate the phrases into individual words.
Semantic Smoothing Model
• Linearly interpolate the phrase-based translation model with a simple language model.
The translation coefficient (λ) controls the influence of the translation component in the mixture model. c(ti, d) is the frequency of topic signature ti in document d.
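The interpolation described above can be written out as follows. This is a reconstruction from the definitions on the slide rather than a copy of its equations: the symbol names p_bt, p_b, p_t and p_ml are assumptions, with p_b standing for the simple (Jelinek-Mercer smoothed) language model and p(w | t_i) for the probability of translating topic signature t_i into word w.

```latex
\begin{align}
p_{bt}(w \mid d) &= (1-\lambda)\, p_b(w \mid d) + \lambda\, p_t(w \mid d) \\
p_t(w \mid d)    &= \sum_i p(w \mid t_i)\, p_{ml}(t_i \mid d) \\
p_{ml}(t_i \mid d) &= \frac{c(t_i, d)}{\sum_j c(t_j, d)}
\end{align}
```

Setting λ = 0 recovers the simple language model, while λ = 1 relies on the phrase translations alone.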
Semantic Smoothing Example
Doc1: I am looking for any information about the space program. This includes NASA, the shuttles, history, anything! I would like to know if anyone could suggest books, periodicals, even ftp sites for a novice who is interested in the space program.
Doc2: The Phobos mission did return some useful data including images of Phobos itself. By the way, the new book entitled "Mars" (Kieffer et al, 1992, University of Arizona Press) has a great chapter on spacecraft exploration of the planet. The chapter is co-authored by V.I. Moroz of the Space Research Institute in Moscow, and includes details never before published in the West.
Doc3: ROCKET LAUNCH OBSERVED! A bright light phenomenon was observed in the Eastern Finland on April 21. I don't know if there were satellite launches in Plesetsk Cosmodrome near Arkhangelsk, but this may be a rocket experiment too.
Translation Probability Estimate
• Method
  • Use co-occurrence counts (multiword phrases and individual words).
  • Use a mixture model to remove noise from topic-free general words.
Figure 1. Illustration of document indexing. Vt, Vd and Vw are the phrase set, document set and word set, respectively.
• Let Dk denote the set of documents containing the phrase tk. The parameter α is the coefficient controlling the influence of the corpus model in the mixture model.
Translation Probability Estimate
• Log likelihood of generating Dk
• EM for estimation
The count used for each term w is its document frequency in Dk, i.e., the co-occurrence count of w and tk in the whole collection.
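The estimation step can be sketched as standard EM for a two-component mixture with a fixed mixing weight. The sketch assumes that each word in Dk is generated by (1 - α) · p(w|tk) + α · p(w|C) and that the observations are the document frequencies described above. The slide's own update equations are not reproduced here, so the function name, the default α and the iteration count are all assumptions.

```python
def estimate_translation(doc_freq, corpus_model, alpha=0.5, iterations=50):
    """Estimate p(w | t_k) from co-occurrence counts with EM.

    doc_freq: word -> number of documents in D_k containing the word.
    corpus_model: word -> p(w | C), the background model.
    Mixture assumed: p(w | D_k) = (1 - alpha) * p(w | t_k) + alpha * p(w | C).
    """
    total = sum(doc_freq.values())
    p_t = {w: c / total for w, c in doc_freq.items()}  # initial guess
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from the phrase model.
        post = {}
        for w in doc_freq:
            topic = (1 - alpha) * p_t[w]
            mix = topic + alpha * corpus_model[w]
            post[w] = topic / mix if mix > 0 else 0.0
        # M-step: re-normalise the counts credited to the phrase model.
        weighted = {w: doc_freq[w] * post[w] for w in doc_freq}
        norm = sum(weighted.values())
        p_t = {w: v / norm for w, v in weighted.items()}
    return p_t
```

Words that are frequent in the whole corpus, such as "the", lose probability mass to the background component, which is how the mixture removes noise from topic-free general words.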
Phrase Extraction
• Phrase Dictionary
  • Use Xtract (Smadja 1993) to learn a phrase dictionary.
• Phrase Extraction
  • Extract phrases from documents using exact string matching.
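The matching step might look like the sketch below, which assumes the phrase dictionary has already been learned and is available as a plain list of strings. Lower-casing, the token pattern and the longest-match-first rule are illustrative assumptions, not details given on the slide, and `extract_phrases` is a hypothetical name.

```python
import re
from collections import Counter


def extract_phrases(text, phrase_dictionary):
    """Count dictionary phrases occurring in text by exact token matching.

    Longer phrases are tried first at each position. Punctuation is
    dropped during tokenization, so a match may span a sentence boundary.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    phrases = {tuple(p.lower().split()) for p in phrase_dictionary}
    max_len = max((len(p) for p in phrases), default=0)
    counts = Counter()
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in phrases:
                counts[" ".join(candidate)] += 1
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here
    return counts
```

The resulting counts play the role of c(ti, d), the frequency of each topic signature in the document.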
Experiment Settings
• Agglomerative clustering
  • Complete linkage
• Evaluation criteria
  • Normalized mutual information (NMI, Banerjee and Ghosh, 2002)
  • Entropy (Steinbach et al., 2000)
  • Purity (Zhao and Karypis, 2001)
• Experiment Design
  • Randomly create test collections, selecting 100 documents at random for each class.
  • Execute 5 runs for each collection and average the results.
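For reference, two of the evaluation criteria can be computed as in the sketch below. The NMI here normalises mutual information by the geometric mean of the two entropies, which is one common definition and is assumed rather than taken from the slide; the function names are hypothetical.

```python
import math
from collections import Counter


def purity(labels, clusters):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for y, c in zip(labels, clusters):
        by_cluster.setdefault(c, Counter())[y] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)


def nmi(labels, clusters):
    """Mutual information normalised by sqrt(H(labels) * H(clusters))."""
    n = len(labels)
    joint = Counter(zip(labels, clusters))
    pl, pc = Counter(labels), Counter(clusters)
    mi = sum(c / n * math.log(n * c / (pl[y] * pc[k])) for (y, k), c in joint.items())
    h_l = -sum(c / n * math.log(c / n) for c in pl.values())
    h_c = -sum(c / n * math.log(c / n) for c in pc.values())
    return mi / math.sqrt(h_l * h_c) if h_l > 0 and h_c > 0 else 0.0
```

Both scores are higher for better clusterings, with 1.0 meaning the clusters reproduce the class labels exactly.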
Statistics of Three Datasets
Table 1. Statistics of three datasets.
Agglomerative Clustering
Table 2. NMI results of the agglomerative hierarchical clustering with complete linkage criterion. "JM" and "Semantic" denote Jelinek-Mercer smoothing and semantic smoothing, respectively. * means stop words are not removed. The translation coefficient λ is trained from TDT2.
Effect of Document Smoothing
Figure 2. The variation of cluster quality with the translation coefficient (λ), which controls the influence of semantic smoothing.
Comparison to K-Means
Table 3. * means stop words are not removed. Agglomerative clustering with semantic smoothing is comparable to standard K-Means clustering.
Summary
• Proposed a context-sensitive semantic smoothing method that statistically translates multiword phrases into individual terms.
• Semantic smoothing not only discounts general words but also handles the data sparsity problem very well.
• Semantic smoothing is much more effective than other schemes on agglomerative clustering, where data sparsity is the major problem.
• Whether or not stop words are removed has no effect on TF*IDF, background smoothing, and semantic smoothing, but has a significant effect on other schemes.
Future Work
• How to optimize the translation coefficient
• Alternative translation intermediates (e.g., word pairs, concept pairs)
• Apply semantic document smoothing to other applications such as text retrieval, text summarization, and text classification.
Dragon Toolkit
• Description
  • Text retrieval and mining toolkit
  • Written in Java
• Used for this work
  • Phrase extraction
  • Phrase-word translation probability estimates
  • Clustering
• Download
  • http://www.ischool.drexel.edu/dmbio/dragontool
  • Search Google with keywords "dragon toolkit"