
Advancing Chinese Text Categorization with Feature Engineering: A Comparative Study

Explore the impact of feature engineering on Chinese text categorization. Compare classification approaches like Naïve Bayes, SVM, Decision Tree, and more. Learn about data preparation, feature selection, and the role of feature engineering in improving classification accuracy.

kwelch




Presentation Transcript


  1. Automatic Chinese Text Categorization: Feature Engineering and Comparison of Classification Approaches • Yi-An Lin and Yu-Te Lin

  2. Motivation • Text categorization (TC) has been researched extensively for English but not for Chinese. • How does feature engineering help in Chinese? • Should Chinese content be segmented? • Which ML method is best for TC: Naïve Bayes, SVM, Decision Tree, k Nearest Neighbor, MaxEnt, or Language Model methods?

  3. Outline • Data Preparation • Feature Selection • Feature Vector Encoding • Comparison of Classifiers • Feature Engineering • Comparison after Feature Engineering • Conclusion

  4. Data Preparation • Tool: Yahoo News Crawler • Categories: Entertainment, Politics, Business, Sports

  5. Feature Selection • statistics:

  6. Top Features by

  7. Feature Vector Encoding • Binary: whether the document contains a word. • Count: number of occurrences of the word. • TF: ratio of the word's occurrences to all word occurrences in the document. • TF-IDF: TF weighted by inverse document frequency.
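
As a rough illustration of the four encodings listed on this slide, here is a minimal Python sketch. The function name `encode`, the sparse dict representation, and the specific IDF formula `log(N / df)` are assumptions made for the example; the slides do not say which TF-IDF variant or data structure the authors used. Documents are assumed to be already split into terms.

```python
import math
from collections import Counter


def encode(docs, scheme="tfidf"):
    """Encode tokenized documents as sparse {term: weight} dicts.

    `docs` is a list of token lists. The IDF used here is log(N / df),
    one common variant chosen for illustration only.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        if scheme == "binary":
            vec = {t: 1 for t in counts}
        elif scheme == "count":
            vec = dict(counts)
        elif scheme == "tf":
            vec = {t: c / total for t, c in counts.items()}
        elif scheme == "tfidf":
            vec = {t: (c / total) * math.log(n_docs / df[t])
                   for t, c in counts.items()}
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vectors


# Toy corpus (made-up tokens, not the authors' data).
docs = [["股市", "上漲", "股市"], ["球賽", "上漲"]]
binary_vecs = encode(docs, "binary")
count_vecs = encode(docs, "count")
tf_vecs = encode(docs, "tf")
tfidf_vecs = encode(docs, "tfidf")
```

With this IDF variant, a term that appears in every document (here "上漲") gets a TF-IDF weight of zero, which is the intended effect of down-weighting uninformative terms.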

  8. Comparison of different encodings

  9. Classifier Comparison Ⅰ

  10. Classifier Comparison Ⅱ

  11. Feature Engineering • Stop Terms: similar to stop words in English. • Group Terms: common substrings. • Key Terms: distinctive terms.
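
The slide gives only one-line definitions, so the following Python sketch is one possible reading of two of the three steps, not the authors' procedure. It assumes stop terms are simply dropped and that a group term replaces any token containing it as a substring. The function name `apply_feature_engineering` and both parameters are hypothetical; key terms are left out because the slide does not say how they are identified or used.

```python
def apply_feature_engineering(tokens, stop_terms=frozenset(), group_terms=()):
    """Filter and normalize a token list (illustrative interpretation only).

    - Tokens in `stop_terms` are removed.
    - A token containing one of `group_terms` as a substring is replaced
      by that group term, so related tokens share one feature.
    """
    result = []
    for token in tokens:
        if token in stop_terms:
            continue
        for group in group_terms:
            if group in token:
                token = group
                break
        result.append(token)
    return result


# Made-up example tokens.
tokens = ["的", "台北市", "台北縣", "選舉"]
engineered = apply_feature_engineering(
    tokens, stop_terms={"的"}, group_terms=("台北",)
)
```

Merging "台北市" and "台北縣" into a single "台北" feature reduces sparsity, but it also discards the distinction between the two, which is the kind of ambiguity the conclusion slide warns about.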

  12. Comparison of feature engineering methods (S: stop terms; G: group terms; K: key terms)

  13. Comparison after FE

  14. Conclusion • The N-gram model outperforms the other methods: • The nature of language models: they consider all features and avoid error-prone ones. • No restrictive independence assumption (e.g., as in NB). • Better smoothing. • Feature engineering also helps reduce sparsity but may cause ambiguity. • Semantic understanding could be the next thing to try in future research.
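
To make the language-model approach concrete, here is a minimal sketch of a character n-gram classifier: one n-gram model is trained per category, and a document is assigned to the category whose model gives it the highest log-probability. The class name `NgramLMClassifier`, the bigram default, add-one smoothing, and equal class priors are all assumptions for illustration; the slides do not specify the n-gram order or the smoothing method the authors used, and their smoothing is presumably more refined than add-one. Working on raw characters also means no word segmentation is needed.

```python
import math
from collections import Counter, defaultdict


class NgramLMClassifier:
    """Character n-gram language-model classifier with add-one smoothing."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # label -> n-gram counts
        self.context_counts = defaultdict(Counter)  # label -> context counts
        self.vocab = set()

    def _ngrams(self, text):
        # Pad the start so the first character also has a context.
        padded = "^" * (self.n - 1) + text
        return [padded[i:i + self.n] for i in range(len(padded) - self.n + 1)]

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.vocab.update(text)
            for gram in self._ngrams(text):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_prob(self, text, label):
        # Vocabulary size plus one slot for unseen characters.
        v = len(self.vocab) + 1
        total = 0.0
        for gram in self._ngrams(text):
            numerator = self.ngram_counts[label][gram] + 1
            denominator = self.context_counts[label][gram[:-1]] + v
            total += math.log(numerator / denominator)
        return total

    def predict(self, text):
        # Equal class priors are assumed, so only the likelihood is compared.
        return max(self.ngram_counts, key=lambda lab: self.log_prob(text, lab))


# Tiny made-up training set, not the authors' data.
train_texts = ["球隊贏得比賽", "球員受傷比賽", "股市上漲收盤", "股市下跌投資"]
train_labels = ["sports", "sports", "business", "business"]
clf = NgramLMClassifier(n=2).fit(train_texts, train_labels)
sports_guess = clf.predict("比賽球隊")
business_guess = clf.predict("股市投資")
```

Unlike Naïve Bayes over a bag of words, each character here is conditioned on its preceding context, which is the sense in which the model avoids a restrictive independence assumption.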
