Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology. Advisor: Dr. Hsu. Presenter: Yu Cheng Chen. Authors: Yu-Sheng Lai and Chung-Hsien Wu. ACM Transactions on Asian Language Information Processing, 2002, pages 34-64.
Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology • Advisor: Dr. Hsu • Presenter: Yu Cheng Chen • Authors: Yu-Sheng Lai and Chung-Hsien Wu • ACM Transactions on Asian Language Information Processing, 2002, pages 34-64
Outline • Motivation • Objective • Introduction • System Overview • Term Extraction and Selection • Discriminative Term Selection • Indexing And Classification • Experimental Result • Conclusions • Personal Opinion
Motivation • In text categorization, terms are extracted from documents and used to estimate the textual similarity between documents. • The extracted terms often determine system performance. • N-grams are typically employed for textual indexing. • They need comparatively more storage space. • An n-gram is not a meaningful linguistic unit. • They suffer from an inconsistency problem. • Unknown words are more domain-specific than traditional dictionary words. • Domain dependency
Objective • Propose a method for extracting meaningful and highly domain-specific unknown words from Chinese text documents.
Introduction • Two main methods for detecting unknown words • Statistical • Some are restricted to particular types of unknown words • Rule-based • Use a dictionary • Need part-of-speech information • Limited to unknown words of bounded length
System Overview • [Figure: system overview diagram. Recoverable labels: terms T1 新聞 (news) and T2 體育 (sports), n = 1~8, document j]
Term Extraction and Selection • Phrase-like Unit (PLU) • A frequently occurring word sequence P = w1w2…wn is a PLU if, for each word wi in P, the preceding sequence w1w2…wi is always followed by the word sequence wi+1wi+2…wn • P is probably an unknown word or phrase • For example, 陳水扁 (Chen Shui-bian, a personal name) • PLU-based likelihood ratio PLR(p) • Example frequencies: 陳水扁 250, 陳 1000, 水扁 200
Term Extraction and Selection • A word sequence p of n words is considered an unknown word if • n > 1 • tf(p) >= c • PLR(p) >= 1 - ε or PLR(p) * tf(p) >= d
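The selection criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three bullets are assumed to be joined by AND, n-grams are counted over characters, and the PLR value is passed in by the caller because the slides do not reproduce its formula. The names `count_ngrams` and `is_unknown_word` and all threshold values are hypothetical.

```python
from collections import Counter


def count_ngrams(text: str, max_n: int = 8) -> Counter:
    """Count every character n-gram of length 1..max_n (the slides use n = 1~8)."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def is_unknown_word(n: int, tf: int, plr: float,
                    c: int, eps: float, d: float) -> bool:
    """Apply the three slide criteria, assumed here to be combined with AND.

    n   -- length of the candidate sequence p
    tf  -- term frequency tf(p)
    plr -- PLR(p), supplied by the caller (formula not given in the slides)
    c, eps, d -- thresholds; any concrete values are illustrative only
    """
    return n > 1 and tf >= c and (plr >= 1 - eps or plr * tf >= d)
```

A candidate with a high PLR passes through the first branch of the last criterion, while a very frequent candidate with a modest PLR can still pass through the product branch PLR(p) * tf(p) >= d.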
Further Purification • Some PLUs are useless or interfering • Discard stopping terms • Deal with cross-included terms • Reliability degree
Discriminative Term Selection • Here the term “discriminative” indicates a term's utility in distinguishing categories. • For example, the term 陳水扁 is useful for distinguishing the 政治 (politics) class from the 體育 (sports) class. • For a term t representing category g • discriminability W(t, g) can be defined as
INDEXING AND CLASSIFICATION • Index machine • Used for locating keywords in a text. • M = (S, I, g, f, s0, O) • For example, “半自動套裝遊程” (semi-automatic package tour)
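The slides do not define the components of M. One plausible reading, assumed here, is an Aho-Corasick-style keyword matcher: S is the state set, I the input alphabet, g a goto function, f a failure function, s0 the start state, and O the output function. The sketch below follows that reading; `build_machine` and `find_keywords` are hypothetical names, not the authors' implementation.

```python
from collections import deque
from typing import Dict, List, Tuple

Machine = Tuple[List[Dict[str, int]], List[int], List[List[str]]]


def build_machine(keywords: List[str]) -> Machine:
    """Build goto (g), failure (f) and output (O) tables; state 0 plays the role of s0."""
    goto: List[Dict[str, int]] = [{}]
    output: List[List[str]] = [[]]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fall back to s0
    while queue:  # breadth-first, so shallower states are finished first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also reports keywords that end at its failure state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_keywords(text: str, machine: Machine) -> List[Tuple[int, str]]:
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, output = machine
    state = 0
    hits: List[Tuple[int, str]] = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return hits
```

With an illustrative keyword list such as 半自動, 自動, 套裝, 遊程, a single left-to-right pass over “半自動套裝遊程” reports every keyword, including the overlapping pair 半自動 and 自動, without rescanning the text.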
INDEXING AND CLASSIFICATION • To improve performance • The vector space model (VSM) is used. • Each document is represented as a vector • Each component of the vector is a weighted indexing feature • Term weighting for training documents • There are K categories, with Nk documents in the kth category • Dk,j is the jth document in the kth category
INDEXING AND CLASSIFICATION • Term weighting for training documents • S(w) is a smooth 0-1 function that avoids the bias problem • α is a constant
INDEXING AND CLASSIFICATION • Term weighting for unclassified documents • The category of an unclassified document is not known, • so each unclassified document should be represented by multiple description vectors. • An unclassified document is represented as K vectors • Xk, k = 1…K
INDEXING AND CLASSIFICATION • Classification Function • Combine the vectors of each category into a mean vector • Classification function fGk(X; A) is
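The mean-vector step can be sketched as follows. The slides do not reproduce fGk(X; A), so the scoring below uses cosine similarity purely as a stand-in, which is an assumption and not the authors' classification function. The names `mean_vector`, `cosine`, and `classify` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Combine one category's training vectors into a single mean vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(doc_vectors: Dict[str, Sequence[float]],
             means: Dict[str, Sequence[float]]) -> str:
    """Pick the category k whose mean vector best matches the document's
    category-specific description vector X_k (stand-in scoring, see above)."""
    return max(means, key=lambda k: cosine(doc_vectors[k], means[k]))
```

Because an unclassified document carries one description vector per category, each X_k is compared only against the mean vector of the same category k, and the best-scoring category is returned.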
EXPERIMENTAL RESULT • Corpus • Min-Sheng Daily News (MSDN) • 44,675 text documents, consisting of over 35 million words • Documents from 1997 to April 1997 were used for training, and documents from 1999 to July 1999 were used for testing. • Performance Evaluation
EXPERIMENTAL RESULT • Baseline performance • Using only the words defined in the dictionary
EXPERIMENTAL RESULT • Parameter Testing • The number of representative terms is varied • The number of terms selected from each category is either constrained or unconstrained • Examine the effect of discriminability (Nor) on performance
Experimental Result • Experimental Results on Purification Process
Experimental Result • Combined Approach-unknown word-based
Experimental Result • Comparative Performance
Experimental Result • Consistency between Training and Testing Data
Conclusions • We have proposed two new concepts, meaningful term extraction and discriminative term selection. • PLUs improve the performance of text categorization. • The purification process reduces the dimensionality of the feature space.
Personal Opinion • Advantages • Takes both meaningful and discriminative terms into account. • The purification process saves time. • Terms can be extracted automatically and systematically. • Applications • ICD-9 code classification and so on. • May help with patient records that mix Chinese and English. • Limitations • The sparse-data problem still needs to be solved.