A Comparative Study on Feature Selection in Text Categorization
(Proc. 14th International Conference on Machine Learning, 1997)
Paper by: Yiming Yang, CMU; Jan O. Pedersen, Verity, Inc.
Presented by: Prerak Sanghvi, Computer Science and Engineering Department, State University of New York at Buffalo
Introduction
• This paper is a comparative study of feature selection methods in statistical learning of text categorization.
• Five methods were evaluated:
  • Document Frequency (DF)
  • Information Gain (IG)
  • Mutual Information (MI)
  • χ² test (CHI)
  • Term Strength (TS)
Document Frequency (DF)
• Document frequency is the number of documents in which a term occurs.
• Terms whose document frequency is below some predetermined threshold are removed from the feature space.
• The basic assumption is that rare terms are either non-informative for category prediction or not influential in global performance. However, this assumption must be handled carefully, since it runs against the common view in information retrieval that low-frequency terms can be highly informative.
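As a concrete illustration, here is a minimal Python sketch of DF thresholding; the toy corpus, whitespace tokenization, and MIN_DF value are assumptions for illustration, not the paper's setup.

```python
# Minimal sketch of document-frequency thresholding.
# Corpus, tokenization, and threshold are illustrative assumptions.
from collections import Counter

docs = [
    "wheat prices rise",
    "wheat exports fall",
    "stock prices rise sharply",
]

# Document frequency: the number of documents each term occurs in.
df = Counter(term for doc in docs for term in set(doc.split()))

MIN_DF = 2  # assumed threshold; rarer terms are dropped from the feature space
vocabulary = {term for term, count in df.items() if count >= MIN_DF}
print(sorted(vocabulary))  # ['prices', 'rise', 'wheat']
```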
Information Gain (IG)
• IG measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document.
• For a term t and a set of classes {c_1, …, c_m}:

G(t) = -\sum_{i=1}^{m} \Pr(c_i) \log \Pr(c_i) + \Pr(t) \sum_{i=1}^{m} \Pr(c_i \mid t) \log \Pr(c_i \mid t) + \Pr(\bar{t}) \sum_{i=1}^{m} \Pr(c_i \mid \bar{t}) \log \Pr(c_i \mid \bar{t})

where \bar{t} denotes the absence of term t.
Information Gain (IG)…
• Given a training corpus, IG is computed for each unique term, and terms whose IG is less than some predetermined threshold are removed from the feature space.
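A minimal sketch of computing G(t) as defined above, assuming a toy labeled corpus and base-2 logarithms (so the score is in bits); the data and helper names are illustrative, not from the paper.

```python
# Minimal sketch of the information-gain score G(t) defined above.
# The labeled toy corpus is an illustrative assumption.
from math import log2

docs = [
    ({"wheat", "prices"}, "grain"),
    ({"wheat", "exports"}, "grain"),
    ({"stock", "prices"}, "finance"),
    ({"stock", "market"}, "finance"),
]

def info_gain(term, docs):
    n = len(docs)
    classes = {c for _, c in docs}
    labels_with_t = [c for terms, c in docs if term in terms]
    labels_without_t = [c for terms, c in docs if term not in terms]
    p_t = len(labels_with_t) / n

    def plogp(p):  # p * log2(p), with 0 log 0 taken as 0
        return p * log2(p) if p > 0 else 0.0

    # -sum_i Pr(ci) log Pr(ci)
    all_labels = [c for _, c in docs]
    g = -sum(plogp(all_labels.count(ci) / n) for ci in classes)
    # + Pr(t) sum_i Pr(ci|t) log Pr(ci|t)
    if labels_with_t:
        g += p_t * sum(plogp(labels_with_t.count(ci) / len(labels_with_t))
                       for ci in classes)
    # + Pr(not t) sum_i Pr(ci|not t) log Pr(ci|not t)
    if labels_without_t:
        g += (1 - p_t) * sum(plogp(labels_without_t.count(ci) / len(labels_without_t))
                             for ci in classes)
    return g

print(info_gain("wheat", docs))  # 1.0: 'wheat' perfectly separates the two classes
```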
Mutual Information (MI)
• Each word is ranked according to its mutual information with respect to the class labels.
• The mutual information criterion is defined as:

I(t, c) = \log \frac{\Pr(t \wedge c)}{\Pr(t) \cdot \Pr(c)}

• Category-specific scores are often combined as:

I_{avg}(t) = \sum_{i=1}^{m} \Pr(c_i) \, I(t, c_i)
I_{max}(t) = \max_{i=1}^{m} I(t, c_i)
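A minimal sketch of I(t, c) and the two combination schemes, with all probabilities estimated by document counts over an assumed toy corpus; the corpus and function names are illustrative.

```python
# Minimal sketch of pointwise mutual information I(t, c) and the
# averaged/max combinations. The toy corpus is an illustrative assumption.
from math import log2

docs = [
    ({"wheat", "prices"}, "grain"),
    ({"wheat", "exports"}, "grain"),
    ({"stock", "prices"}, "finance"),
    ({"stock", "market"}, "finance"),
]
classes = ["grain", "finance"]

def mutual_information(term, cls, docs):
    n = len(docs)
    p_t = sum(term in terms for terms, _ in docs) / n
    p_c = sum(c == cls for _, c in docs) / n
    p_tc = sum(term in terms and c == cls for terms, c in docs) / n
    # I(t, c) = log [ Pr(t ^ c) / (Pr(t) Pr(c)) ]; -inf if t and c never co-occur
    return log2(p_tc / (p_t * p_c)) if p_tc > 0 else float("-inf")

p_class = {c: sum(lbl == c for _, lbl in docs) / len(docs) for c in classes}
i_avg = sum(p_class[c] * mutual_information("prices", c, docs) for c in classes)
i_max = max(mutual_information("prices", c, docs) for c in classes)
print(i_avg, i_max)  # 0.0 0.0: 'prices' is independent of both classes
```

Because Pr(t) appears in the denominator, MI scores are inflated for rare terms; the paper points to this bias as a weakness of the criterion.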
χ² statistic (CHI)
• The χ² statistic measures the lack of independence between a term t and a category c, computed from the two-way contingency table of term occurrence and category membership.
• The χ² statistic is known to be unreliable for low-frequency terms.
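A minimal sketch of the statistic as computed from a 2×2 contingency table; the helper name and the example counts are assumptions for illustration.

```python
# Minimal sketch of the chi-square statistic for a term/class pair,
# computed from its 2x2 contingency table. Example counts are illustrative.
def chi_square(a, b, c, d):
    """a: docs containing the term and belonging to the class
    b: docs containing the term outside the class
    c: docs in the class without the term
    d: docs with neither the term nor the class
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# A term in 40 of 50 in-class docs but only 10 of the 950 remaining docs:
print(round(chi_square(40, 10, 10, 940), 1))  # 623.3 -> strong dependence
```

With very small counts the squared difference is dominated by sampling noise, which is why the statistic is unreliable for low-frequency terms.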
Term Strength (TS)
• This method estimates term importance based on how likely a term is to appear in "closely related" documents.
• It uses a training set of documents to derive document pairs whose similarity is above a threshold.
• The criterion is based on document clustering: it assumes that documents with many shared words are related, and that terms in the heavily overlapping area of related documents are relatively informative.
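A minimal sketch of term strength, estimating s(t) = Pr(t occurs in y | t occurs in x) over related document pairs; Jaccard word overlap stands in for the paper's document-similarity measure, and the corpus and threshold are assumptions.

```python
# Minimal sketch of term strength. Related pairs are found by word-overlap
# (Jaccard) similarity; the similarity measure, threshold, and corpus are
# illustrative assumptions, not the paper's exact setup.
from itertools import permutations

docs = [
    {"wheat", "prices", "rise"},
    {"wheat", "prices", "fall"},
    {"stock", "market", "news"},
    {"stock", "market", "rally"},
]

def jaccard(x, y):
    return len(x & y) / len(x | y)

SIM_THRESHOLD = 0.4  # assumed similarity cutoff for "related" pairs

# Ordered pairs (x, y) of distinct, sufficiently similar documents.
pairs = [(x, y) for x, y in permutations(docs, 2) if jaccard(x, y) >= SIM_THRESHOLD]

def term_strength(term):
    # s(t) = Pr(t in y | t in x), estimated over the related pairs
    containing = [(x, y) for x, y in pairs if term in x]
    if not containing:
        return 0.0
    return sum(term in y for _, y in containing) / len(containing)

print(term_strength("wheat"), term_strength("rise"))  # 1.0 0.0
```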
Conclusion
• IG and CHI were found to be the most effective for aggressive term removal without losing categorization accuracy, in experiments with kNN and LLSF (Linear Least Squares Fit) on the Reuters-22173 and OHSUMED collections.
• DF was found comparable to IG and CHI with up to 90% term removal, while TS was comparable with up to 50-60% removal.
• MI showed inferior performance.