Combining Unsupervised Feature Selection Strategy for Automatic Text Categorization
Ping-Tsun Chang
Intelligent Systems Laboratory, Computer Science and Information Engineering, National Taiwan University
Introduction
• In recent research
• The limits of using statistical or computational approaches for natural language understanding
• The development of machine learning techniques has almost reached its bound
• Natural language is infinite and nonlinear!
• Unsupervised Feature Selection
Background Knowledge: Text Categorization
• Problem Definition: Text categorization is the problem of assigning an initially unknown label to each of a large number of documents, based on a large amount of text data.
[Figure: processing pipeline with stages Sensing, Segmentation, Feature Extraction, Classification, Post-Processing, Decision]
Background Knowledge: Machine Learning
• Using computers to help us perform induction from large amounts of complex pattern data
• Instance-Based Learning
• K-Nearest Neighbors
• Neural Networks
• Support Vector Machine
• Bayesian Learning
Background Knowledge: Feature Selection
• Information Gain
• Mutual Information
• Chi-Square
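The formulas behind these three measures did not survive in the slide text. As a rough sketch, the commonly used textbook definitions can be computed from a 2x2 term/category contingency table. The function names and the table layout (a, b, c, d) are assumptions made for this illustration and are not taken from the slides.

```python
import math

# Contingency table for one term t and one category c (assumed layout):
#   a = documents in c that contain t
#   b = documents not in c that contain t
#   c = documents in c that do not contain t
#   d = documents not in c that do not contain t

def _entropy(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def chi_square(a, b, c, d):
    """Chi-square statistic for term/category independence."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def mutual_information(a, b, c, d):
    """Pointwise mutual information between term presence and the category."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log2(a * n / ((a + c) * (a + b)))

def information_gain(a, b, c, d):
    """Reduction in category entropy from knowing whether the term occurs."""
    n = a + b + c + d
    h_class = _entropy([a + c, b + d])
    h_given_term = ((a + b) / n) * _entropy([a, b]) + ((c + d) / n) * _entropy([c, d])
    return h_class - h_given_term
```

A term that occurs equally often inside and outside the category scores zero on all three measures, while a term that occurs only inside the category scores high on all of them.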
Bayesian Classifier
• Recent Research
• Naïve Bayes classifiers are competitive with other techniques in accuracy
• Fast: training takes a single pass over the data, and new documents are classified quickly
• ATHENA: EDBT 2000
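To illustrate the "single pass" point, here is a minimal sketch of a multinomial Naïve Bayes text classifier. It is not the ATHENA system or the author's implementation; the class name, the add-one (Laplace) smoothing, and the token-list input format are assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSketch:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per category
        self.term_counts = defaultdict(Counter)   # term frequencies per category
        self.vocab = set()

    def fit(self, labeled_docs):
        # Single pass: each (tokens, label) pair is read exactly once.
        for tokens, label in labeled_docs:
            self.doc_counts[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label, n_label in self.doc_counts.items():
            total_terms = sum(self.term_counts[label].values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_label / n_docs)
            for t in tokens:
                score += math.log((self.term_counts[label][t] + 1) / (total_terms + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Classification only needs the stored counts, so scoring a new document costs time proportional to its length times the number of categories.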
Machine Learning Approaches: kNN Classifier
[Figure: query document d, marked "?"]
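The figure for this slide is lost, so the following is only a sketch of how a kNN text classifier is commonly set up: documents are sparse term-weight vectors, neighbors are ranked by cosine similarity, and each of the k nearest neighbors votes for its category with its similarity as the weight. The function names, the dictionary-based vectors, and the default k=3 are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(query, training, k=3):
    """Return (best_category, per-category scores) for the query vector.

    training is a list of (vector, category) pairs.
    """
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim   # similarity-weighted vote
    return votes.most_common(1)[0][0], dict(votes)
```

The per-category scores are returned alongside the winning category because the gap between them is a natural input for any later confidence check.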
Machine Learning Approaches: Support Vector Machine
• Basic hypotheses: consistent hypotheses of the version space
• Project the original training data in space X to a higher-dimensional feature space F via a Mercer kernel operator K
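As a small worked example of the kernel idea, the sketch below checks that a degree-2 polynomial kernel on 2-dimensional inputs equals an ordinary dot product in a 3-dimensional feature space. The choice of kernel and the names `phi` and `poly2_kernel` are assumptions for illustration; the slides do not say which kernel was used.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map from X = R^2 into F = R^3 for the degree-2 kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever building phi(x)."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
# Both routes give the same value (121 for these inputs, up to rounding).
same = math.isclose(poly2_kernel(x, z), dot(phi(x), phi(z)))
```

The point of the kernel is that the SVM only ever needs K(x, z), so the higher-dimensional space F never has to be materialized.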
What is Certainty?
• Rule for kNN
• Rule for SVM
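The two rules themselves were not preserved in the slide text. Purely as a hypothetical illustration of what such rules could look like, the sketch below treats a kNN decision as uncertain when the top two category scores are close, and an SVM decision as uncertain when the decision value lies near the separating hyperplane. The function names, the margin-based criteria, and the default thresholds are all assumptions and should not be read as the author's rules.

```python
def knn_is_uncertain(class_scores, threshold=0.2):
    """Hypothetical rule: uncertain if the normalized gap between the best
    and second-best category scores is below the threshold."""
    ranked = sorted(class_scores.values(), reverse=True)
    total = sum(ranked)
    if not ranked or total == 0:
        return True   # no evidence at all counts as uncertain
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return (ranked[0] - runner_up) / total < threshold

def svm_is_uncertain(decision_value, threshold=0.5):
    """Hypothetical rule: uncertain if the signed distance-like SVM output
    is within the threshold of the decision boundary."""
    return abs(decision_value) < threshold
```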
Algorithm for Two-Stage Automatic Text Categorization
ALGORITHM Two-Stage-Text-Categorization (input: document d) returns category C
  Static:
    Trained classifier: Traditional-Classifier
    The feature set: F
    The new feature set from user feedback: Ui for related category Ci
  For new document d
    C ← Traditional-Classifier(d)
    If NOT satisfy the rule of uncertainty
      Return C
    Else
      For all categories Ci
        If d has a feature in F
          C ← Ci
          Return C
        End If
      Cj ← User-Input
      Uj ← Uj + User-Selected
      C ← Cj
    End If
    Return C
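The control flow of the algorithm above can be sketched in Python as follows. This is one reading of the pseudocode, not the author's code: the function and parameter names are hypothetical, the classifier, uncertainty rule, and user prompt are passed in as callables, and the per-category check is assumed to consult the user-feedback features Ui (the slide writes F at that step).

```python
def two_stage_categorize(tokens, base_classifier, is_uncertain, feedback, ask_user):
    """Two-stage categorization sketch.

    tokens          -- list of terms in the new document d
    base_classifier -- callable: tokens -> (category, evidence)
    is_uncertain    -- callable: evidence -> bool (the uncertainty rule)
    feedback        -- dict: category -> set of user-selected features (Ui)
    ask_user        -- callable: tokens -> (category, selected_features)
    """
    # Stage 1: trust the traditional classifier when it is confident.
    category, evidence = base_classifier(tokens)
    if not is_uncertain(evidence):
        return category

    # Stage 2a: reuse features collected from earlier user feedback.
    doc_terms = set(tokens)
    for cat, features in feedback.items():
        if doc_terms & features:
            return cat

    # Stage 2b: ask the user, and remember the features they select.
    chosen, selected = ask_user(tokens)
    feedback.setdefault(chosen, set()).update(selected)
    return chosen
```

Because `feedback` is updated in place, a later uncertain document that shares a user-selected feature is categorized without prompting the user again.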
Determining the Threshold of the Rule
Experiments