Presentation Transcript


  1. Automatic Chinese Text Categorization Feature Engineering and Comparison of Classification Approaches Yi-An Lin and Yu-Te Lin

  2. Motivation • Text categorization (TC) is extensively researched in English but much less in Chinese. • How does feature engineering help in Chinese? • Should Chinese content be segmented into words? • Which machine learning method works best for TC: Naïve Bayes, SVM, Decision Tree, k-Nearest Neighbor, MaxEnt, or language-model methods?

  3. Outline • Data Preparation • Feature Selection • Feature Vector Encoding • Comparison of Classifiers • Feature Engineering • Comparison after Feature Engineering • Conclusion

  4. Data Preparation • Tool: Yahoo News Crawler • Categories: Entertainment, Politics, Business, Sports
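The slides name only the crawler and the four categories. A minimal loading sketch follows, assuming a per-category directory layout and using jieba (a common Chinese word-segmentation library) so the segmentation question can be tested empirically; both the layout and the segmenter are assumptions not stated in the deck.

```python
# Minimal sketch: assemble the crawled corpus into (tokens, label) pairs.
# The yahoo_news/<Category>/*.txt layout and the use of jieba are assumptions.
import pathlib
import jieba  # common Chinese word-segmentation library

CATEGORIES = ["Entertainment", "Politics", "Business", "Sports"]

def load_corpus(root="yahoo_news", segment=True):
    docs = []
    for cat in CATEGORIES:
        for path in pathlib.Path(root, cat).glob("*.txt"):
            text = path.read_text(encoding="utf-8")
            # segment=True tests word-level features; False keeps raw characters
            tokens = list(jieba.cut(text)) if segment else list(text)
            docs.append((tokens, cat))
    return docs
```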

  5. Feature Selection • χ² statistics:
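The statistic itself appears to have been an image lost in the transcript; assuming it is the standard χ² score over the term/category contingency table, $\chi^2(t,c) = \frac{N(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}$, a minimal sketch:

```python
# Minimal sketch of chi-square term scoring over (tokens, label) pairs,
# as produced by load_corpus above; the data format is an assumption.
def chi_square(term, category, docs):
    A = B = C = D = 0
    for tokens, label in docs:
        present, in_cat = term in tokens, label == category
        if present and in_cat:
            A += 1  # term present, in category
        elif present:
            B += 1  # term present, other categories
        elif in_cat:
            C += 1  # term absent, in category
        else:
            D += 1  # term absent, other categories
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0
```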

  6. Top Features by χ²

  7. Feature Vector Encoding • Binary: whether the document contains a word. • Count: number of occurrences of the word. • TF: the word’s frequency within the document. • TF-IDF: term frequency weighted by inverse document frequency.
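A minimal sketch of the four encodings over a toy tokenized corpus; the example documents are illustrative, not the authors’ data:

```python
import math

docs = [["股市", "上漲", "股市"], ["選舉", "政治", "政治"]]
vocab = sorted({w for d in docs for w in d})
df = {w: sum(w in d for d in docs) for w in vocab}  # document frequency

def encode(doc, scheme):
    vec = []
    for w in vocab:
        count = doc.count(w)
        if scheme == "binary":
            vec.append(1 if count else 0)
        elif scheme == "count":
            vec.append(count)
        elif scheme == "tf":
            vec.append(count / len(doc))
        else:  # tf-idf
            vec.append((count / len(doc)) * math.log(len(docs) / df[w]))
    return vec

print(encode(docs[0], "tfidf"))
```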

  8. Comparison of different encodings

  9. Classifier Comparison Ⅰ

  10. Classifier Comparison Ⅱ
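The comparisons above were shown as result tables. A minimal reproduction sketch using scikit-learn stand-ins for five of the compared methods (LogisticRegression playing the role of MaxEnt; the language-model approach is sketched under the Conclusion). The original toolkit, hyperparameters, and data are not given, and the toy corpus below is purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy segmented documents (whitespace-joined tokens); replace with real data.
texts = ["股市 上漲 投資", "球賽 冠軍 球員", "選舉 政策 立委", "電影 明星 演唱會"] * 3
labels = ["Business", "Sports", "Politics", "Entertainment"] * 3

X = TfidfVectorizer().fit_transform(texts)
classifiers = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "MaxEnt": LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    print(name, cross_val_score(clf, X, labels, cv=3).mean())
```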

  11. Feature Engineering • Stop Terms: uninformative high-frequency terms, similar to stop words in English. • Group Terms: merging terms that share common substrings. • Key Terms: terms distinctive of a particular category.
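The slides do not spell out the exact procedures, so the thresholds and the leading-character grouping heuristic below are assumptions; a minimal sketch of the three steps:

```python
def remove_stop_terms(vocab, doc_freq, n_docs, max_ratio=0.8):
    """S: drop terms that occur in nearly every document (uninformative)."""
    return [w for w in vocab if doc_freq[w] / n_docs <= max_ratio]

def group_terms(vocab):
    """G: merge terms sharing a leading character, a crude common-substring proxy."""
    groups = {}
    for w in vocab:
        groups.setdefault(w[:1], []).append(w)
    return groups

def key_terms(scores, top_k=1000):
    """K: keep the terms most distinctive of some category (e.g., by chi-square)."""
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```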

  12. Comparison of feature engineering methods (S: stop terms, G: group terms, K: key terms)

  13. Comparison after Feature Engineering

  14. Conclusion • The n-gram language model outperforms the other methods: • By nature, language models consider all features and avoid committing to error-prone selected ones. • They impose no restrictive independence assumption (as Naïve Bayes does). • They have better smoothing. • Feature engineering also helps reduce sparsity, but may introduce ambiguity. • Semantic understanding could be the next thing to try in future research.
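A minimal sketch of the winning approach: one character-bigram language model per category, classifying by highest likelihood. The n-gram order and the add-one smoothing are assumptions, since the slides only argue that language models smooth better:

```python
import math
from collections import Counter

class BigramLM:
    def __init__(self):
        self.bigrams, self.unigrams = Counter(), Counter()

    def train(self, texts):
        for t in texts:
            chars = ["<s>"] + list(t)
            self.unigrams.update(chars)
            self.bigrams.update(zip(chars, chars[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, text):
        chars = ["<s>"] + list(text)
        # add-one (Laplace) smoothed bigram log-likelihood
        return sum(
            math.log((self.bigrams[a, b] + 1) / (self.unigrams[a] + self.vocab_size))
            for a, b in zip(chars, chars[1:])
        )

def classify(text, models):
    """Pick the category whose language model assigns the highest likelihood."""
    return max(models, key=lambda c: models[c].log_prob(text))
```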
