A comparative study of TF*IDF, LSI and multi-words for text classification
Presenter: Jian-Ren Chen
Authors: Wen Zhang, Taketoshi Yoshida, Xijin Tang
2011. ESWA
Outline
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments
Motivation
Although TF*IDF, LSI and multi-words were all proposed long ago, there is no comparative study of these indexing methods, and no results have been reported on their classification performance.
Objectives
• A comparative study of TF*IDF, LSI and multi-words for text classification.
  - information retrieval
  - text categorization
• Indexing term:
  - semantic quality
  - statistical quality
Methodology - TF*IDF
• $w_{i,j}$: the weight of term $i$ in document $j$
• $N$: the number of documents in the collection
• $tf_{i,j}$: the term frequency of term $i$ in document $j$
• $df_i$: the document frequency of term $i$ in the collection
[Figure: terms (keywords) of the document collection against documents]
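The weighting formula itself did not survive on this slide, so the sketch below assumes the common form $w_{i,j} = tf_{i,j} \times \log(N / df_i)$, built from the four symbols defined above. The natural logarithm, the absence of smoothing, and the function name `tfidf_weights` are illustrative assumptions, not necessarily what the paper uses.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Assumed weighting (not taken from the slide):
    w[i, j] = tf[i, j] * log(N / df[i]), with the natural logarithm.
    """
    n_docs = len(docs)  # N: number of documents in the collection

    # df[i]: number of documents that contain term i at least once
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # tf[i, j]: raw count of term i in document j
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy collection, already tokenized (hypothetical data)
docs = [
    ["text", "classification", "text"],
    ["text", "retrieval"],
    ["latent", "semantic", "indexing"],
]
weights = tfidf_weights(docs)
for j, w in enumerate(weights):
    print(j, w)
```

Under this form, a term that occurs in every document gets weight 0, because $\log(N / N) = 0$; this is why smoothed variants are often used in practice.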
Methodology - LSI
Given a term-document matrix $X = [x_1, x_2, \ldots, x_n]$ with each $x_i \in \mathbb{R}^m$, and supposing the rank of $X$ is $r$, LSI decomposes $X$ using SVD as follows:
1. $X = U \Sigma V^{T}$ (the SVD of $X$)
2. $X_k = U_k \Sigma_k V_k^{T}$
[Figure: terms (keywords) of the document collection against documents]
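The rank-$k$ step can be sketched with NumPy's `numpy.linalg.svd`. The function name `lsi_rank_k`, the toy matrix, and the choice $k = 2$ are assumptions made for illustration; the slide does not say how $k$ is chosen or which SVD implementation the authors used.

```python
import numpy as np


def lsi_rank_k(X, k):
    """Return the rank-k approximation X_k and k-dimensional document coordinates.

    X is a term-document matrix (rows: terms, columns: documents).
    """
    # Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # Keep only the k largest singular values and their singular vectors
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    X_k = U_k @ np.diag(s_k) @ Vt_k        # X_k = U_k Sigma_k V_k^T
    doc_coords = (np.diag(s_k) @ Vt_k).T   # one k-dimensional row per document
    return X_k, doc_coords


# Toy term-document matrix: 5 terms x 4 documents (hypothetical counts)
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 1.0, 0.0, 1.0],
])
X_k, doc_coords = lsi_rank_k(X, k=2)
print(X_k.round(3))
print(doc_coords.round(3))
```

The rows of `doc_coords` are the reduced document vectors that a classifier would consume in place of the original columns of $X$.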
Methodology - Multi-word
• A multi-word should occur at least twice in a document.
• The length of a multi-word should be between 2 and 6.
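The two thresholds can be illustrated with a simple n-gram counter. This is only a sketch of the frequency and length filters: it assumes length is measured in words, counts overlapping occurrences, and leaves out any linguistic filtering the paper's extraction procedure may apply. The name `candidate_multiwords` and the sample sentence are hypothetical.

```python
from collections import Counter


def candidate_multiwords(tokens, min_len=2, max_len=6, min_freq=2):
    """Return {phrase: count} for word n-grams that pass both thresholds.

    tokens is the token list of a single document. An n-gram is kept when it
    has between min_len and max_len words and occurs at least min_freq times.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of n words across the document
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_freq}


# Hypothetical one-document example
tokens = (
    "latent semantic indexing maps terms and "
    "latent semantic indexing maps documents"
).split()
found = candidate_multiwords(tokens)
for phrase, count in sorted(found.items()):
    print(count, phrase)
```

Note that nested candidates such as "latent semantic" and "latent semantic indexing" are both kept here; deciding which of them to retain is a separate design choice not covered by the slide.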
Conclusions
• LSI can produce better indexing in terms of discriminative power.
• LSI and multi-words have better semantic quality than TF*IDF, while TF*IDF has better statistical quality than the other two methods.
• The number of dimensions is still a decisive factor for indexing when different indexing methods are used for classification.
Comments
• Advantages
  - Compares TF*IDF, LSI and multi-words
• Disadvantage
  - Semantic quality and statistical quality are judged merely by intuition rather than theory
• Applications
  - text mining