
Advisor : Dr. Hsu Presenter : Yu Cheng Chen Author: YU-SHENG LAI AND CHUNG-HSIEN WU

Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology. Authors: Yu-Sheng Lai and Chung-Hsien Wu. ACM Transactions on Asian Language Information Processing, 2002, pages 34-64. Advisor: Dr. Hsu. Presenter: Yu Cheng Chen.


Presentation Transcript


  1. Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology • Advisor: Dr. Hsu • Presenter: Yu Cheng Chen • Authors: Yu-Sheng Lai and Chung-Hsien Wu • ACM Transactions on Asian Language Information Processing, 2002, pages 34-64

  2. Outline • Motivation • Objective • Introduction • System Overview • Term Extraction and Selection • Discriminative Term Selection • Indexing And Classification • Experimental Result • Conclusions • Personal Opinion

  3. Motivation • In text categorization, terms are extracted from documents and used to estimate the textual similarity between documents. • The extracted terms often determine system performance. • N-grams are typically employed for textual indexing. • They need comparatively more storage space. • An n-gram is not a meaningful unit in linguistics. • Inconsistency problem. • The unknown words presented are more domain-specific than traditional words. • Domain dependency

  4. Objective • Propose a method for extracting meaningful and highly domain-specific unknown words from Chinese text documents.

  5. Introduction • Two main methods for detecting unknown words • Statistical • Some of these are restricted to a particular type of unknown word • Rule-based • Use a dictionary • Need part-of-speech information • Handle only unknown words of limited length

  6. System Overview • [Figure: system overview diagram. Surviving labels: T1 新聞 (news), T2 體育 (sports), n = 1~8, document j]

  7. System Overview

  8. Term Extraction and Selection • Phrase-like Unit (PLU) • A frequently occurring word sequence P is a PLU if, for each word wi in P, the preceding words w1w2…wi are always followed by the word sequence wi+1wi+2… • P is then probably an unknown word or phrase • For example, 陳水扁 (Chen Shui-bian) • PLU-based likelihood ratio PLR(p) • Example frequencies: 陳水扁 250, 陳 1000, 水扁 200

  9. Term Extraction and Selection • A word sequence p is considered an unknown word if • n > 1 • tf(p) >= c • PLR(p) >= 1-ε or PLR(p)*tf(p) >= d
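
The acceptance rule on this slide can be sketched as a small predicate. This is a minimal sketch, assuming the three bullets are combined with a logical AND and that the "or" applies only within the third bullet. The function name and the default values for c, ε and d are hypothetical, since the transcript does not give them, and PLR(p) is taken as an already-computed input because its formula is not reproduced in the transcript.

```python
def is_unknown_word(n, tf, plr, c=5, eps=0.1, d=50.0):
    """Return True if a word sequence p passes the slide's three conditions.

    n   -- length of p in words (slide: n > 1)
    tf  -- term frequency of p (slide: tf(p) >= c)
    plr -- PLU-based likelihood ratio of p, computed elsewhere
    c, eps, d -- thresholds; the defaults are illustrative assumptions only
    """
    if n <= 1:
        return False  # single words are not candidates
    if tf < c:
        return False  # too rare to be reliable
    # Accept on a near-1 ratio, or on a lower ratio backed by a high frequency.
    return plr >= 1.0 - eps or plr * tf >= d
```

The second branch of the last condition lets a frequent sequence through even when its ratio falls short of 1-ε, which is how the slide's rule trades ratio against frequency.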

  10. Further Purification • Some PLUs are useless or interfering • Discard stopping terms • Deal with cross-included terms • Reliability degree

  11. Discriminative Term Selection • Here the term “discriminative” indicates the utility of a term in distinguishing categories. • For example, the term 陳水扁 (Chen Shui-bian) is used to distinguish the 政治 (politics) and 體育 (sports) classes. • For a term t representing category g • discriminability W(t, g) can be defined as

  12. INDEXING AND CLASSIFICATION • Index machine • Used for locating keywords in a text. • M = (S, I, g, f, s0, O) • For example, “半自動套裝遊程”
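
The tuple M = (S, I, g, f, s0, O) resembles a keyword-matching automaton with a goto function g, a failure function f, a start state s0 and an output set O. The sketch below is an Aho-Corasick-style matcher written under that assumption; the transcript does not confirm that the paper's index machine is built this way. The function names and the keyword list used in the example are hypothetical.

```python
from collections import deque


def build_index_machine(keywords):
    """Build goto (g), failure (f) and output (O) tables; state 0 is s0."""
    goto = [{}]    # g: state -> {character: next state}
    fail = [0]     # f: state -> fallback state
    output = [[]]  # O: state -> keywords recognised on reaching that state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper states are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # Inherit keywords that end at the fallback state (proper suffixes).
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_keywords(text, machine):
    """Return (start position, keyword) pairs found in a single pass over text."""
    goto, fail, output = machine
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((pos - len(word) + 1, word))
    return hits


# Hypothetical keyword list applied to the slide's example string.
machine = build_index_machine(["半自動", "自動", "套裝", "遊程"])
print(find_keywords("半自動套裝遊程", machine))
# → [(0, '半自動'), (1, '自動'), (3, '套裝'), (5, '遊程')]
```

A single left-to-right pass reports overlapping keywords such as 半自動 and 自動, which matters for Chinese text because there are no spaces to delimit candidate terms.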

  13. INDEXING AND CLASSIFICATION • For improving performance • The vector space model (VSM) is used. • Each document is represented as a vector • Each member of the vector is a weighted indexing feature • Term weighting for training documents • K categories, with Nk documents in the kth category • Dk,j is the jth document in the kth category

  14. INDEXING AND CLASSIFICATION • Term weighting for training documents • S(w) is a smooth 0-1 function used to avoid the bias problem • α is a constant

  15. INDEXING AND CLASSIFICATION • Term weighting for unclassified documents • The category of an unclassified document is not known, • so each unclassified document should be represented as multiple description vectors. • An unclassified document is represented as K vectors • Xk, k=1…K

  16. INDEXING AND CLASSIFICATION • Classification Function • Combine the vectors of each category into a mean vector • Classification function fGk(X; A) is
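
The mean-vector step and the per-category description vectors Xk from the previous slide can be sketched as a centroid classifier. This is a minimal sketch, not the paper's classification function: the formula for fGk(X; A) and the role of the parameter A are not reproduced in the transcript, so cosine similarity is used here as a stand-in assumption, and all function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Combine the training vectors of one category into a mean vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; an assumed stand-in for the slide's fGk(X; A)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(description_vectors, centroids):
    """Score X_k against the mean vector of category k and pick the best k.

    description_vectors -- the K vectors X_1..X_K of one unclassified document
    centroids           -- the K category mean vectors, in the same order
    """
    scores = [cosine(x_k, m_k) for x_k, m_k in zip(description_vectors, centroids)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Each description vector is compared only with the mean vector of its own category, which mirrors the slide's point that a document's weights depend on the category assumed for it.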

  17. EXPERIMENTAL RESULT • CORPUS • Min-Sheng Daily News (MSDN) • 44,675 text documents, consisting of over 35 million words • 1997 to April 1997 was for training, and 1999 to July 1999 was for testing. • Performance Evaluation

  18. EXPERIMENTAL RESULT

  19. EXPERIMENTAL RESULT • Baseline performance • Using the words defined in dictionary

  20. EXPERIMENTAL RESULT • Parameter Testing • The number of representative terms is variable • Constrain the number of terms selected from each category or not • Examine the effect of discriminability (Nor) on performance

  21. EXPERIMENTAL RESULT • Parameter Testing • The number of representative terms is variable • Constrain the number of terms selected from each category or not • Examine the effect of discriminability (Nor) on performance

  22. Experimental Result • Experimental Results on Purification Process

  23. Experimental Result • Combined Approach-unknown word-based

  24. Experimental Result • Comparative Performance

  25. Experimental Result • Consistency between Training and Testing Data

  26. Conclusions • We have proposed two new concepts: meaningful term extraction and discriminative term selection. • PLUs improve the performance of text categorization. • The purification process reduces the dimensionality of the feature space.

  27. Personal Opinion • Advantages • Takes into account meaningful and discriminative terms. • The purification process saves time. • Terms can be extracted automatically and systematically. • Application • ICD-9 code classification and so on. • May help with patient records that mix Chinese and English. • Limitation • The sparse-data problem still needs to be solved.
