Semi-supervised Text Categorization by Active Search

Zenglin Xu1, Rong Jin2, Kaizhu Huang1, Michael R. Lyu1, and Irwin King1

1 Department of Computer Science and Engineering, The Chinese University of Hong Kong, {zlxu, kzhuang, lyu, king}@cse.cuhk.edu.hk
2 Department of Computer Science and Engineering, Michigan State University, rongjin@cse.msu.edu

1 Motivations
• Given only a small number of labeled documents, it is very challenging to build a reliable classifier.
• Unlabeled data are helpful in automated text categorization.
• How can unlabeled documents be obtained? We can collect them through search engines.
• Semi-supervised learning can take advantage of both the labeled and the unlabeled documents.

2 Contributions
• A general framework for semi-supervised text categorization that collects the unlabeled documents via Web search engines.
• A novel discriminative query generation method.
• The categorization framework can significantly improve the classification accuracy.

3 Framework & Model
The framework has three components:
• Query generation, which generates the textual queries for document retrieval.
• Document retrieval, which retrieves the Web documents through the Web search engine.
• Semi-supervised text categorization, which utilizes both the labeled documents and the retrieved unlabeled Web documents.
The model has two parts:
• 1. Query generation: generate a query for every labeled document (document: (x, y), Vi: vocabulary of the i-th document, w: word weights, ξ: margin error).
• 2. Text categorization models (D: labeled documents, U: retrieved unlabeled documents):
    • Auxiliary SVM (y* is the input)
    • Semi-supervised SVM (y* is an optimization variable)

4 Experiment results
• Data repositories: 20-newsgroup, Reuters-21578, Ohsumed
• Training data: 5 labeled documents in each category
• Each document generates one query
• Each query returns 100 unlabeled documents
• Auxi-SVM: Auxiliary SVM (optimization: QP)
• Semi-SVM: Semi-supervised SVM (optimization: CCCP)
• Search engine: Google
• Accuracy improvement over SVM:
    • Auxi-SVM: 26%
    • Semi-SVM: 34%

CIKM 2008, Napa Valley, California, October 26-30, 2008
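
The first two stages of the framework (query generation and document retrieval) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method from the poster: the margin-based optimization over word weights w is replaced here by a simple frequency-difference heuristic, the `search` callable is a hypothetical stand-in for a Web search engine client, and retrieved documents are assumed to inherit the label of the document that produced the query (one reading of "y* is the input" for the Auxiliary SVM). All function names are invented for the example.

```python
from collections import Counter
from typing import Callable, List, Tuple

Doc = Tuple[str, str]  # (text, label)


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer that keeps alphabetic tokens only."""
    return [t for t in text.lower().split() if t.isalpha()]


def generate_query(doc_index: int, labeled: List[Doc], k: int = 3) -> List[str]:
    """Pick k words from one labeled document that best separate its class.

    Heuristic stand-in (an assumption, not the poster's optimization):
    score(word) = fraction of same-class documents containing the word
                  minus fraction of other-class documents containing it.
    Only words from the document's own vocabulary are candidates.
    """
    text, label = labeled[doc_index]
    vocab = set(tokenize(text))
    same, other = Counter(), Counter()
    n_same = n_other = 0
    for other_text, other_label in labeled:
        words = set(tokenize(other_text))
        if other_label == label:
            same.update(words)
            n_same += 1
        else:
            other.update(words)
            n_other += 1

    def score(word: str) -> float:
        neg = other[word] / n_other if n_other else 0.0
        return same[word] / n_same - neg

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(vocab, key=lambda w: (-score(w), w))[:k]


def collect_unlabeled(
    labeled: List[Doc],
    search: Callable[[str], List[str]],  # hypothetical search client
    k: int = 3,
    per_query: int = 100,
) -> List[Doc]:
    """Issue one query per labeled document and gather the results.

    Each retrieved text is paired with the label of the document that
    generated the query (assumed pseudo-label y*).
    """
    retrieved: List[Doc] = []
    for i, (_, label) in enumerate(labeled):
        query = " ".join(generate_query(i, labeled, k))
        for text in search(query)[:per_query]:
            retrieved.append((text, label))
    return retrieved
```

The retrieved pool would then be combined with the labeled set D to train a classifier. In the Auxiliary SVM reading above the pseudo-labels are kept fixed, whereas a semi-supervised SVM would treat them as variables to optimize.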