
A Text Categorization Based on Summarization Technique

A Text Categorization Based on Summarization Technique. Sue J. Ker, Department of Computer Science, Soochow University. Jen-Nan Chen, Department of Management, Ming Chuan University. ACL 2000. Presenter: 翁鴻加. Abstract: text categorization based on summarization.


Presentation Transcript


  1. A Text Categorization Based on Summarization Technique. Sue J. Ker, Department of Computer Science, Soochow University; Jen-Nan Chen, Department of Management, Ming Chuan University. ACL 2000. Presenter: 翁鴻加

  2. Abstract • Text categorization based on summarization • Combines word-frequency and position methods to extract knowledge • Summarization-based categorization can achieve acceptable performance

  3. Introduction • Internet usage is growing • Categorization should provide accurate information quickly • New documents are labeled with predefined categories • Knowledge is extracted from the title field only

  4. Text Summarization • Why use the title in categorization? 1. Summarization identifies informative evidence in a document 2. Summarization techniques include position, cue phrases, word frequency, and discourse segmentation 3. Word frequency and position are easy to implement 4. The title fits the position method (Hovy and Lin, 1997) 5. TREC evaluation shows no significant difference between long and short queries

  5. Preprocessing and Feature Selection • Delimit tokens by white space and punctuation • Convert to lower case • Remove stop words • Stem
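The four preprocessing steps on slide 5 can be illustrated with a minimal Python sketch. The slides do not say which stop-word list or stemmer was used, so `STOP_WORDS` and `crude_stem` below are hypothetical stand-ins (a real system would more likely use a full stop list and a Porter-style stemmer).

```python
import re

# Hypothetical, deliberately tiny stop-word list; the slides do not name the list used.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "is"}


def crude_stem(token: str) -> str:
    """Rough suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(title: str) -> list[str]:
    # Lower-case first, then keep runs of letters/digits; this yields the same
    # tokens as splitting on white space and punctuation and then lower-casing.
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    # Remove stop words, then stem what is left.
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("The Prices of Oil, Rising!")` drops "the" and "of" and reduces the remaining tokens to crude stems.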

  6. Term Weight • W(f,c): weight of term f in category c • [Diagram: categories C1, C2, …, Cn and documents D1, D2, …, Dm] • TF_{f,c}: frequency of feature f in category c • T: the number of categories • DF_f: the number of categories that contain feature f • MAX_c: maximum frequency of any feature in category c • N_c: the number of documents belonging to category c
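The quantities defined on slide 6 can all be gathered in one pass over the training data. The sketch below only collects the counts; it does not reproduce the weight formula W(f,c), which the transcript does not preserve. The function name `collect_statistics` and the input layout (a list of `(tokens, categories)` pairs) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def collect_statistics(training):
    """training: iterable of (tokens, categories) pairs, one pair per document."""
    tf = defaultdict(Counter)  # tf[c][f]  -> TF_{f,c}
    n_docs = Counter()         # n_docs[c] -> N_c
    for tokens, categories in training:
        for c in categories:
            n_docs[c] += 1
            tf[c].update(tokens)
    # DF_f: number of categories whose training text contains feature f.
    df = Counter(f for c in tf for f in tf[c])
    # MAX_c: highest frequency of any single feature in category c.
    max_tf = {c: max(counts.values(), default=0) for c, counts in tf.items()}
    # T: number of categories seen in training.
    t = len(n_docs)
    return tf, df, max_tf, n_docs, t
```

A document labeled with several categories contributes its tokens to each of them, which matches a corpus where documents carry more than one label.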

  7. Term Weight • Normalized tf • Probability of category

  8. Category Ranking • F_c: the set of features f in category c • tf_{f,d}: the frequency of feature f in document d
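The ranking formula itself is not preserved in the transcript. As a sketch only, one common way to combine the quantities named on slide 8 is to score each category by summing tf_{f,d} times W(f,c) over the features in F_c; that scoring rule is an assumption here, not necessarily the authors' formula. The function `rank_categories` and the `weights` layout are likewise hypothetical.

```python
from collections import Counter


def rank_categories(doc_tokens, weights):
    """weights: {category: {feature: W(f, c)}}; the keys of weights[c] act as F_c."""
    tf_d = Counter(doc_tokens)  # tf_{f,d}; missing features count as 0
    scores = {
        # Assumed scoring rule: sum of tf_{f,d} * W(f,c) over f in F_c.
        c: sum(tf_d[f] * w for f, w in feature_weights.items())
        for c, feature_weights in weights.items()
    }
    # Highest-scoring category first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because the Reuters documents can belong to more than one category, a caller would typically keep the top few entries of the returned list rather than only the first.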

  9. Experiments • The Reuters corpus • 7789 training documents • 3309 test documents • 93 categories • Average number of categories per document: 1.23 • The number of training documents per category varies widely (2 to 2877), so P(c) also varies widely
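To see how widely P(c) can vary, the sketch below plugs the extremes from slide 9 into a simple estimate. It assumes P(c) is the fraction of training documents labeled with category c (N_c divided by the number of training documents); the transcript does not preserve the exact definition, so `category_prior` is a hypothetical helper.

```python
def category_prior(n_c: int, n_total: int) -> float:
    """Assumed estimate: share of training documents labeled with category c."""
    return n_c / n_total


N_TRAIN = 7789  # training documents, from slide 9

smallest = category_prior(2, N_TRAIN)    # rarest category: 2 documents
largest = category_prior(2877, N_TRAIN)  # most frequent category: 2877 documents
ratio = largest / smallest               # how many times larger the biggest prior is

print(f"smallest P(c) = {smallest:.6f}")
print(f"largest  P(c) = {largest:.6f}")
print(f"ratio         = {ratio:.1f}")
```

Under this assumption the largest prior is more than a thousand times the smallest, which is why slide 10 tests the effect of P(c) separately.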

  10. Experiment Design • Only the title field is used as the scope of text • 1. Test MAX_c and P(c) • 2. Locate the minimum term frequency

  11. Experiment Design • Larger feature sets perform better • With full text, the result is about 92%

  12. Experiment Design • Titles contain little noise…

  13. Conclusion • A small text size (title only) performs acceptably for categorization • The short title field reduces execution time • The system suits online document classification • The position method can use certain specific positions
