Explore the impact of feature engineering on Chinese text categorization. Compare classification approaches like Naïve Bayes, SVM, Decision Tree, and more. Learn about data preparation, feature selection, and the role of feature engineering in improving classification accuracy.
Automatic Chinese Text Categorization: Feature Engineering and Comparison of Classification Approaches. By Yi-An Lin and Yu-Te Lin.
Motivation • Text categorization (TC) is extensively researched in English, but far less in Chinese. • How does feature engineering help in Chinese TC? • Should Chinese content be segmented into words? • Which machine learning method is best for TC: Naïve Bayes, SVM, Decision Tree, k-Nearest Neighbor, MaxEnt, or Language Model methods?
Outline • Data Preparation • Feature Selection • Feature Vector Encoding • Comparison of Classifiers • Feature Engineering • Comparison after Feature Engineering • Conclusion
Data Preparation • Tool: Yahoo News Crawler • Category • Entertainment • Politics • Business • Sports
Feature Selection • Score candidate terms with a selection statistic and keep the highest-ranked terms.
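The slide does not name the statistic used; a common choice for TC feature selection is χ² (chi-square), which measures how strongly a term's presence depends on a category. A minimal sketch, assuming per-term document counts from a 2×2 contingency table (the function name and inputs are illustrative, not from the original):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score for one (term, category) pair.

    n11: docs in the category containing the term
    n10: docs outside the category containing the term
    n01: docs in the category without the term
    n00: docs outside the category without the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator

# A term distributed independently of the category scores 0;
# a term concentrated in one category scores high.
```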
Feature Vector Encoding • Binary: whether the document contains a word. • Count: number of occurrences. • TF: term frequency (relative frequency of the word in the document). • TF-IDF: TF weighted by inverse document frequency.
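The four encodings above can be sketched side by side. This is a minimal illustration, not the authors' implementation; it assumes documents are already tokenized into terms (e.g., Chinese character unigrams) and uses the common idf = log(N / df) weighting:

```python
import math

def encode(docs, vocab):
    """Encode each doc (a list of terms) four ways: binary, count, TF, TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many docs each vocab term appears.
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    encodings = []
    for d in docs:
        count = {t: d.count(t) for t in vocab}
        binary = {t: 1 if count[t] > 0 else 0 for t in vocab}
        total = len(d) or 1
        tf = {t: count[t] / total for t in vocab}
        # TF-IDF: term frequency scaled by inverse document frequency.
        tfidf = {t: tf[t] * math.log(n_docs / df[t]) if df[t] else 0.0
                 for t in vocab}
        encodings.append({"binary": binary, "count": count,
                          "tf": tf, "tfidf": tfidf})
    return encodings
```

Note that a term occurring in every document gets idf = log(1) = 0, so TF-IDF automatically downweights uninformative terms that binary and count encodings treat the same as distinctive ones.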
Feature Engineering • Stop Terms: analogous to stop words in English; removed from the feature set. • Group Terms: merge terms sharing common substrings. • Key Terms: terms distinctive of a single category.
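Two of these ideas, stop-term removal and key-term extraction, can be sketched as simple filters. The thresholds and function names here are hypothetical choices for illustration, not taken from the slides:

```python
def filter_stop_terms(doc, stop_terms):
    """Drop high-frequency, low-information terms (the stop-term list)."""
    return [t for t in doc if t not in stop_terms]

def key_terms(cat_counts, other_counts, ratio=5.0, min_count=3):
    """Terms far more frequent in one category than in the rest.

    cat_counts:   term -> count within the target category
    other_counts: term -> count in all other categories
    ratio, min_count: illustrative thresholds for distinctiveness.
    """
    return {t for t, c in cat_counts.items()
            if c >= min_count and c / (other_counts.get(t, 0) + 1) >= ratio}
```

Group terms would additionally require merging terms that share common substrings, which depends on how substrings are enumerated and is omitted here.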
Comparison of Feature Engineering Methods • S: stop terms • G: group terms • K: key terms
Conclusion • The n-gram language model outperforms the other methods: • Language models by nature consider all features and avoid error-prone ones. • No restrictive independence assumption (unlike Naïve Bayes). • Better smoothing. • Feature engineering also helps reduce sparsity, but may introduce ambiguity. • Semantic understanding is a promising direction for future research.