Advisor ： Dr. Hsu Presenter ： Yu Cheng Chen Author: YU-SHENG LAI AND CHUNG-HSIEN WU

Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology Advisor ：Dr. Hsu Presenter： Yu Cheng Chen Author: YU-SHENG LAI AND CHUNG-HSIEN WU ACM Transactions on Asian Language Information Processing, 2002, Pages 34-64

Outline • Motivation • Objective • Introduction • System Overview • Term Extraction and Selection • Discriminative Term Selection • Indexing And Classification • Experimental Result • Conclusions • Personal Opinion

Motivation • In text categorization, terms are extracted from documents and used to estimate the textual similarity between documents. • The extracted terms often determine system performance. • N-grams are typically employed for textual indexing. • They require comparatively more storage space. • An n-gram is not a meaningful linguistic unit. • They suffer from an inconsistency problem. • Unknown words are more domain-specific than traditional words. • Domain dependency

Objective • Propose a method for extracting meaningful and highly domain-specific unknown words from Chinese text documents.

Introduction • Two main methods for detecting unknown words • Statistical • Some are restricted to particular types of unknown words • Rule-based • Use a dictionary • Need part-of-speech information • Limited to unknown words of restricted length

System Overview • [Figure: system overview diagram. Recoverable labels: terms T1 新聞 (news) and T2 體育 (sports); n = 1~8; document j]

System Overview

Term Extraction and Selection • Phrase-like Unit (PLU) • A frequently occurring word sequence P in which, for each word wi in P, the preceding words w1w2…wi are always followed by the word sequence wi+1wi+2… • Such a P is probably an unknown word or phrase • For example, 陳水扁 • PLU-based likelihood ratio PLR(p) • Example figures: 陳水扁 250, 陳 1000, 水扁 200
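The candidate sequences behind a PLU need term frequencies first. The following is a minimal sketch of that counting step only, assuming the text is already segmented into words and using the n = 1~8 range from the system overview. The function name `count_word_ngrams` and the toy input are hypothetical, and the PLR formula itself is not reproduced here because it is not given in these slides.

```python
from collections import Counter

def count_word_ngrams(words, max_n=8):
    """Count every contiguous word sequence of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # range() is empty when the text is shorter than n words
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Toy, already-segmented input (hypothetical data, not from the corpus).
words = ["陳", "水扁", "總統", "陳", "水扁", "表示"]
tf = count_word_ngrams(words)
print(tf[("陳", "水扁")])
```

Sequences that recur, such as ("陳", "水扁") here, are the ones a likelihood-ratio test would then examine.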

Term Extraction and Selection • A word sequence p is considered an unknown word if • n > 1 • tf(p) >= c • PLR(p) >= 1 - ε or PLR(p) * tf(p) >= d
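The selection test above can be sketched as a single predicate. This reading treats the three bullets as conditions that must all hold, which is an assumption, and it takes n to be the length of p in words. The PLR value is passed in rather than computed, and all threshold values in the example are hypothetical.

```python
def is_unknown_word(n, tf, plr, c, d, eps):
    """Return True when p passes the slide's three tests.

    n   : length of p in words (assumed meaning of n)
    tf  : term frequency tf(p)
    plr : PLU-based likelihood ratio PLR(p), computed elsewhere
    c, d, eps : thresholds (values are not given in the slides)
    """
    return n > 1 and tf >= c and (plr >= 1 - eps or plr * tf >= d)

# Hypothetical thresholds, for illustration only.
print(is_unknown_word(n=2, tf=250, plr=0.95, c=5, d=100, eps=0.1))
```

The second alternative, PLR(p) * tf(p) >= d, lets a very frequent sequence pass even when its ratio falls below 1 - ε.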

Further Purification • Some PLUs are useless or interfering • Discard stopping terms • Deal with cross-included terms • Reliability degree

Discriminative Term Selection • Here the term “discriminative” indicates a term's utility in distinguishing categories. • A term such as 陳水扁 is used to distinguish between the 政治 (politics) and 體育 (sports) classes. • For a term t representing category g • discriminability W(t, g) can be defined as

INDEXING AND CLASSIFICATION • Index machine • Used for locating keywords in a text. • M = (S, I, g, f, s0, O) • For example, “半自動套裝遊程”
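The slides do not define the components of M. One plausible reading, and it is only an assumption, is an Aho-Corasick style automaton: S is the set of states, I the input alphabet, g a goto function, f a failure function, s0 the start state, and O the output function. The sketch below follows that reading; the function names and the keyword list are hypothetical, and only the example text comes from the slide.

```python
from collections import deque

def build_index_machine(keywords):
    """Build goto (g), failure (f) and output (O) tables; state 0 is s0."""
    goto, fail, out = [{}], [0], [[]]
    for kw in keywords:                      # insert each keyword into a trie
        s = 0
        for ch in kw:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(kw)
    queue = deque(goto[0].values())          # breadth-first failure links
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit keywords ending here
    return goto, fail, out

def locate_keywords(text, machine):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, out = machine
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for kw in out[s]:
            hits.append((i - len(kw) + 1, kw))
    return hits

# Hypothetical keyword list; the text is the slide's example.
machine = build_index_machine(["半自動", "套裝", "遊程", "套裝遊程"])
hits = locate_keywords("半自動套裝遊程", machine)
print(hits)
```

A single left-to-right pass reports overlapping keywords such as 套裝遊程 and 遊程, which is why this kind of machine suits indexing text without word boundaries.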

INDEXING AND CLASSIFICATION • To improve performance • The vector space model (VSM) is used. • Each document is represented as a vector • Each component of the vector is a weighted indexing feature • Term weighting for training documents • K categories, with Nk documents in the kth category • Dk,j is the jth document in the kth category

INDEXING AND CLASSIFICATION • Term weighting for training documents • S(w) is a smooth 0-1 function that avoids the bias problem • α is a constant

INDEXING AND CLASSIFICATION • Term weighting for unclassified documents • The category of an unclassified document is not known, • so each unclassified document should be represented as multiple description vectors. • An unclassified document is represented as K vectors • Xk, k=1…K

INDEXING AND CLASSIFICATION • Classification Function • Combine the vectors of each category into a mean vector • Classification function fGk(X; A) is
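The classification step can be sketched as follows. The actual form of fGk(X; A) is not given in these slides, so cosine similarity is used here purely as a stand-in, and the function names and toy vectors are hypothetical. The sketch keeps the two ideas the slides do state: each category's training vectors are combined into a mean vector, and an unclassified document carries one description vector Xk per category.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a category's training vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity (stand-in for the paper's classification function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(doc_vectors, means):
    """doc_vectors[k] is X_k; it is scored against category k's mean only."""
    scores = {k: cosine(doc_vectors[k], means[k]) for k in means}
    return max(scores, key=scores.get), scores

# Toy weighted vectors (hypothetical data).
training = {
    "政治": [[3, 0, 1], [2, 0, 0]],
    "體育": [[0, 4, 1], [0, 2, 1]],
}
means = {k: mean_vector(v) for k, v in training.items()}
doc = {"政治": [2, 0, 1], "體育": [1, 0, 1]}
label, scores = classify(doc, means)
print(label)
```

Scoring each Xk against its own category mean reflects the K-vector representation: the same document can look different under each category's term weighting.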

EXPERIMENTAL RESULT • Corpus • Min-Sheng Daily News (MSDN) • 44,675 text documents, consisting of over 35 million words • Documents from 1997 to April 1997 were used for training, and documents from 1999 to July 1999 for testing. • Performance Evaluation

EXPERIMENTAL RESULT

EXPERIMENTAL RESULT • Baseline performance • Using the words defined in dictionary

EXPERIMENTAL RESULT • Parameter Testing • The number of representative terms is variable • Whether or not to constrain the number of terms selected from each category • Examine the effect of discriminability (Nor) on performance

Experimental Result • Experimental Results on Purification Process

Experimental Result • Combined Approach-unknown word-based

Experimental Result • Comparative Performance

Experimental Result • Consistency between Training and Testing Data

Conclusions • The authors propose two new concepts: meaningful term extraction and discriminative term selection. • PLUs improve the performance of text categorization. • The purification process reduces the dimensionality of the feature space.

Personal Opinion • Advantages • Takes both meaningful and discriminative terms into account. • The purification process saves time. • Terms can be extracted automatically and systematically. • Application • ICD9 code classification and so on. • May help with patient records that mix Chinese and English. • Limitations • The sparse-data problem still needs to be solved.
