Learn about a novel unsupervised feature selection strategy for automatic text categorization, exploring machine learning techniques and innovative algorithms developed to enhance categorization accuracy. This research addresses the challenges of handling vast amounts of text data efficiently and effectively.
Combining Unsupervised Feature Selection Strategy for Automatic Text Categorization
Ping-Tsun Chang
Intelligent Systems Laboratory, Computer Science and Information Engineering, National Taiwan University
Introduction
• In recent research
• The limits of using statistical or computational approaches for natural language understanding
• The development of machine learning techniques has almost reached its limit
• Natural language is infinite and nonlinear!
• Unsupervised Feature Selection
Text Categorization: Background Knowledge
[Figure: pipeline diagram with boxes labeled Sensing, Classification, Segmentation, Post-Processing, Feature Extraction, Decision]
• Problem Definition: Text categorization is the problem of assigning a label to each of a large number of unlabeled documents, using a large amount of text data.
Background Knowledge: Machine Learning
• Instance-Based Learning
• K-Nearest Neighbors
• Neural Networks
• Support Vector Machine
• Using computers to help us perform induction over large amounts of complex pattern data
• Bayesian Learning
Background Knowledge: Feature Selection
• Information Gain
• Mutual Information
• Chi-Square
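The slide only names the three scores, so the following is a minimal sketch of how each could be computed for one term and one category. It assumes the common 2x2 contingency-table formulation (a = documents in the category that contain the term, b = documents outside the category that contain it, c = documents in the category without it, d = documents outside the category without it). The function names are illustrative and do not come from the slides.

```python
import math

def chi_square(a, b, c, d):
    """Chi-square statistic of a term/category 2x2 contingency table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def mutual_information(a, b, c, d):
    """Pointwise mutual information log(P(t, c) / (P(t) P(c))), natural log."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log(a * n / ((a + c) * (a + b)))

def entropy(counts):
    """Shannon entropy in bits of a list of counts (zero counts are skipped)."""
    total = sum(counts)
    return -sum(k / total * math.log2(k / total) for k in counts if k)

def information_gain(a, b, c, d):
    """Reduction in category entropy after observing term presence/absence."""
    n = a + b + c + d
    h_category = entropy([a + c, b + d])
    h_given_term = ((a + b) / n) * entropy([a, b]) + ((c + d) / n) * entropy([c, d])
    return h_category - h_given_term
```

A term that is independent of the category (for example a = b = c = d = 25) scores zero under all three measures, which is why low-scoring terms are the usual candidates for removal.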
Bayesian Classifier
• Recent research
• Naïve Bayes classifiers are competitive with other techniques in accuracy
• Fast: trained in a single pass, and new documents are classified quickly
• ATHENA: EDBT 2000
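To illustrate the "single pass" point, here is a small sketch of a multinomial Naïve Bayes text classifier. Multinomial counts and add-one (Laplace) smoothing are assumptions made for the example; the slides do not say which variant was used, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """One pass over (tokens, label) pairs; only counts are collected."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, term_counts, vocabulary

def classify(tokens, model):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    class_counts, term_counts, vocabulary = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_label in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior
        for t in tokens:
            if t in vocabulary:  # terms never seen in training are ignored
                score += math.log(
                    (term_counts[label][t] + 1) / (total_terms + len(vocabulary))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training touches each document once, and classification costs one pass over the document's tokens per category, which matches the speed claim on the slide.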
Machine Learning Approaches: kNN Classifier
[Figure: a new document d with unknown category (?)]
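The kNN idea on this slide (label a new document d by looking at its nearest training documents) can be sketched as below. Cosine similarity over raw term counts and a simple majority vote are assumptions for the example, since the slide does not specify the similarity measure or the voting scheme.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two term-count mappings."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0

def knn_classify(tokens, training, k=3):
    """Majority vote among the k training documents most similar to the query."""
    query = Counter(tokens)
    scored = sorted(
        ((cosine(query, Counter(doc)), label) for doc, label in training),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

Because nothing is learned ahead of time, all of the cost is paid at query time: every training document is compared with d.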
Machine Learning Approaches: Support Vector Machine
• Basic hypotheses: consistent hypotheses of the version space
• Project the original training data in space X to a higher-dimensional feature space F via a Mercer kernel operator K
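The projection from X to F never has to be computed explicitly, because a Mercer kernel returns the inner product in F directly. The sketch below checks this for one concrete case chosen for illustration, the degree-2 polynomial kernel K(x, z) = (x · z)^2 on two-dimensional inputs; the slides do not say which kernel was used.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated in the original space X."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
# The kernel value in X equals the inner product of the mapped points in F.
same = math.isclose(poly_kernel(x, z), dot(phi(x), phi(z)))
```

Here X has two dimensions and F has three; for text, where X already has one dimension per term, avoiding the explicit map matters far more.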
What is Certainty?
• Rule for kNN
• Rule for SVM
Algorithm for Two-Stage Automatic Text Categorization
ALGORITHM Two-Stage-Text-Categorization (input: document d) returns category C
  Static:
    Trained classifier: Traditional-Classifier
    The feature set: F
    The new feature set from user feedback: Ui for related category Ci
  For new document d
    C ← Traditional-Classifier(d)
    If NOT satisfy the rule of uncertainty
      Return C
    Else
      For all category Ci
        If d has the feature in F
          C ← Ci
          Return C
        End If
      Cj ← User-Input
      Uj ← Uj + User-Selected
      C ← Cj
    End If
    Return C
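The pseudocode above can be sketched in Python as follows. Several details are assumptions, because the slides do not define them: the "rule of uncertainty" is modeled as a confidence score falling below a threshold, the per-category feature test is made against the user-feedback sets Ui, and user interaction is a callback. All names (`two_stage_categorize`, `ask_user`, `threshold`) are hypothetical.

```python
def two_stage_categorize(doc_tokens, classifier, user_features, ask_user,
                         threshold=0.5):
    """Two-stage categorization sketch.

    classifier(doc_tokens) -> (label, confidence)   # assumed interface
    user_features: dict mapping category -> set of user-selected features
    ask_user(doc_tokens) -> (category, selected_features)
    """
    # Stage 1: trust the traditional classifier when it is certain enough.
    label, confidence = classifier(doc_tokens)
    if confidence >= threshold:
        return label

    # Stage 2a: reuse features that users selected for earlier documents.
    tokens = set(doc_tokens)
    for category, features in user_features.items():
        if features & tokens:
            return category

    # Stage 2b: fall back to the user and remember the selected features.
    category, selected = ask_user(doc_tokens)
    user_features.setdefault(category, set()).update(selected)
    return category
```

Each piece of feedback is stored per category, so a later uncertain document that shares a user-selected feature is resolved without asking again.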
Determining the Threshold of the Rule
Experiments