Explore the impact of feature engineering on Chinese text categorization. Compare classification approaches like Naïve Bayes, SVM, Decision Tree, and more. Learn about data preparation, feature selection, and the role of feature engineering in improving classification accuracy.
Automatic Chinese Text Categorization: Feature Engineering and Comparison of Classification Approaches. By Yi-An Lin and Yu-Te Lin.
Motivation • Text categorization (TC) is extensively researched in English, but far less in Chinese. • How does feature engineering help in Chinese TC? • Should Chinese content be segmented into words? • Which machine learning method is best for TC: Naïve Bayes, SVM, Decision Tree, k-Nearest Neighbor, MaxEnt, or Language Model methods?
Outline • Data Preparation • Feature Selection • Feature Vector Encoding • Comparison of Classifiers • Feature Engineering • Comparison after Feature Engineering • Conclusion
Data Preparation • Tool: Yahoo News Crawler • Category • Entertainment • Politics • Business • Sports
Feature Selection • Score candidate terms with a selection statistic and keep the highest-ranked terms.
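The slide does not name the statistic used; a common choice for TC feature selection is χ² (chi-square), which measures how strongly a term's presence depends on a category. A minimal sketch, assuming per-term document counts from a 2×2 contingency table (the function name and inputs are illustrative, not from the original):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score for one (term, category) pair.

    n11: docs in the category containing the term
    n10: docs outside the category containing the term
    n01: docs in the category without the term
    n00: docs outside the category without the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator

# A term distributed independently of the category scores 0;
# a term concentrated in one category scores high.
```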
Feature Vector Encoding • Binary: whether the document contains a word. • Count: number of occurrences. • TF: term frequency (relative frequency of the word in the document). • TF-IDF: TF weighted by inverse document frequency.
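The four encodings above can be sketched side by side. This is a minimal illustration, not the authors' implementation; it assumes documents are already tokenized into terms (e.g., Chinese character unigrams) and uses the common idf = log(N / df) weighting:

```python
import math

def encode(docs, vocab):
    """Encode each doc (a list of terms) four ways: binary, count, TF, TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many docs each vocab term appears.
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    encodings = []
    for d in docs:
        count = {t: d.count(t) for t in vocab}
        binary = {t: 1 if count[t] > 0 else 0 for t in vocab}
        total = len(d) or 1
        tf = {t: count[t] / total for t in vocab}
        # TF-IDF: term frequency scaled by inverse document frequency.
        tfidf = {t: tf[t] * math.log(n_docs / df[t]) if df[t] else 0.0
                 for t in vocab}
        encodings.append({"binary": binary, "count": count,
                          "tf": tf, "tfidf": tfidf})
    return encodings
```

Note that a term occurring in every document gets idf = log(1) = 0, so TF-IDF automatically downweights uninformative terms that binary and count encodings treat the same as distinctive ones.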
Feature Engineering • Stop Terms: analogous to stop words in English; removed from the feature set. • Group Terms: merge terms sharing common substrings. • Key Terms: terms distinctive of a single category.
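Two of these ideas, stop-term removal and key-term extraction, can be sketched as simple filters. The thresholds and function names here are hypothetical choices for illustration, not taken from the slides:

```python
def filter_stop_terms(doc, stop_terms):
    """Drop high-frequency, low-information terms (the stop-term list)."""
    return [t for t in doc if t not in stop_terms]

def key_terms(cat_counts, other_counts, ratio=5.0, min_count=3):
    """Terms far more frequent in one category than in the rest.

    cat_counts:   term -> count within the target category
    other_counts: term -> count in all other categories
    ratio, min_count: illustrative thresholds for distinctiveness.
    """
    return {t for t, c in cat_counts.items()
            if c >= min_count and c / (other_counts.get(t, 0) + 1) >= ratio}
```

Group terms would additionally require merging terms that share common substrings, which depends on how substrings are enumerated and is omitted here.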
Comparison of Feature Engineering Methods • S: stop terms • G: group terms • K: key terms
Conclusion • The n-gram language model outperforms the other methods: • Language models by nature consider all features and avoid error-prone ones. • No restrictive independence assumption (unlike Naïve Bayes). • Better smoothing. • Feature engineering also helps reduce sparsity, but may introduce ambiguity. • Semantic understanding is a promising direction for future research.