A Text Categorization Based on Summarization Technique. Sue J. Ker, Department of Computer Science, Soochow University. Jen-Nan Chen, Department of Management, Ming Chuan University. ACL 2000. Presenter: 翁鴻加.
Abstract • Text categorization based on summarization • Combines word-frequency and position methods to acquire knowledge • Summarization-based categorization can achieve acceptable performance
Introduction • Growth of internet usage • Categorization should provide accurate information quickly • Predefined categories are used to label new documents • Knowledge is acquired from the title field only
Text Summarization • Why use the title in categorization? 1. Summarization identifies informative evidence in a document 2. Summarization techniques include position, cue phrases, word frequency, and discourse segmentation 3. Word frequency and position are easy to implement 4. The title fits the position method (Hovy and Lin, 1997) 5. TREC evaluation shows no significant difference between long and short queries
Preprocessing and Feature Selection • Tokenize on white space and punctuation • Convert to lower case • Remove stop words • Stem
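The four preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the stop list `STOP_WORDS` and the suffix stripper `naive_stem` are hypothetical stand-ins, and a real system would likely use a full stop list and a proper stemmer.

```python
import re

# Hypothetical, deliberately tiny stop list (assumption, not from the slides).
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def naive_stem(token: str) -> str:
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(title: str) -> list[str]:
    """Split on white space/punctuation, lower-case, drop stop words, stem."""
    tokens = re.split(r"[^A-Za-z0-9]+", title)
    tokens = [t.lower() for t in tokens if t]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]
```

Calling `preprocess` on a title string returns the list of features that the later weighting and ranking steps would consume.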
Term Weight • W(f,c): weight of term f in category c • [Diagram: categories C1, C2, ..., Cn and documents D1, D2, ..., Dm] • TF_{f,c}: frequency of feature f in category c • T: the number of categories • DF_f: the number of categories that contain feature f • MAX_c: maximum frequency of any feature in category c • N_c: the number of documents belonging to category c
Term Weight • Normalized tf • Probability of category
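The quantities defined above suggest a TF-IDF-style weight computed per category. The sketch below assumes one plausible combination, W(f,c) = (TF[f,c] / MAX[c]) * log(T / DF[f]) * P(c), with P(c) = N_c divided by the total number of training documents. This exact form, the function name `term_weights`, and the one-label-per-training-pair input are assumptions for illustration and may differ from the authors' formula.

```python
import math
from collections import Counter, defaultdict


def term_weights(training):
    """training: iterable of (tokens, category) pairs.

    Returns {category: {feature: weight}} under the ASSUMED form
    W(f,c) = (TF[f,c] / MAX[c]) * log(T / DF[f]) * P(c).
    """
    tf = defaultdict(Counter)   # TF[f,c]: frequency of feature f in category c
    n_docs = Counter()          # N_c: number of documents in category c
    for tokens, category in training:
        tf[category].update(tokens)
        n_docs[category] += 1

    num_categories = len(tf)    # T
    df = Counter()              # DF_f: number of categories containing f
    for counts in tf.values():
        df.update(counts.keys())
    total_docs = sum(n_docs.values())

    weights = {}
    for category, counts in tf.items():
        max_tf = max(counts.values(), default=1)   # MAX_c
        p_c = n_docs[category] / total_docs        # P(c), assumed N_c / total
        weights[category] = {
            f: (freq / max_tf) * math.log(num_categories / df[f]) * p_c
            for f, freq in counts.items()
        }
    return weights
```

Under this assumed form, a feature that occurs in every category gets weight zero, and categories with many training documents are favoured through P(c).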
Category Ranking • F_c: the set of features f in category c • tf_{f,d}: the frequency of feature f in document d
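Ranking can be illustrated by scoring each category with the document features that also occur in F_c. The sketch assumes score(c, d) = sum over f in F_c of tf[f,d] * W(f,c). The summation form, the function name `rank_categories`, and the toy weights are illustrative assumptions, not the authors' stated formula.

```python
from collections import Counter


def rank_categories(doc_tokens, weights):
    """Rank categories for one document.

    weights: {category: {feature: W(f,c)}}; the keys of each inner dict
    play the role of F_c. ASSUMED score: sum of tf[f,d] * W(f,c) over f in F_c.
    """
    tf_d = Counter(doc_tokens)   # tf[f,d]: frequency of feature f in document d
    scores = {
        category: sum(tf_d[f] * w for f, w in feature_weights.items())
        for category, feature_weights in weights.items()
    }
    # Highest score first; ties are broken by category name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy weights, made up for illustration only.
toy_weights = {
    "crude": {"oil": 0.5, "export": 0.2},
    "grain": {"wheat": 0.4, "export": 0.1},
}
ranking = rank_categories(["oil", "export", "oil"], toy_weights)
```

Here `ranking` lists the categories from best to worst match; how many top-ranked categories to assign to a document is left open in this sketch.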
Experiments • The Reuters corpus • 7789 training documents • 3309 test documents • 93 categories • Average number of categories per document: 1.23 • The number of training documents per category varies widely (2 to 2877), so P(c) also varies widely
Experiment Design • Only the title field is used as the scope of the text • 1. Test the effect of MAX_c and P(c) • 2. Locate the minimum term frequency
Experiment Design • Larger feature sets perform better • Full text reaches about 92%
Experiment Design • Titles contain little noise…
Conclusion • A small text size (the title alone) is adequate for categorization • A short title field reduces execution time • The system is suited to online document classification • The position method can make use of specific positions in a document