Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization
Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong Xiong, Zhongzhi Shi (SDM 2010)
Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Work
• Conclusions
Introduction
• Many traditional learning techniques work well only under the assumption that the training and test data follow the same distribution. When the distributions differ, the learned classifier fails.
• Enterprise news classification example (classes: "Product Announcement", "Business Scandal", "Acquisition", ...):
  - Training (labeled), HP news: "Product announcement: HP's just-released LaserJet Pro P1100 printer and the LaserJet Pro M1130 and M1210 multifunction printers, price ... performance ..."
  - Test (unlabeled), Lenovo news: "Announcement for Lenovo ThinkPad/ThinkCentre: $150 off Lenovo K300 desktop using coupon code ..."; "Lenovo ThinkPad/ThinkCentre: $200 off Lenovo IdeaPad U450p laptop ... their performance"
• A classifier trained on the HP news fails on the Lenovo news because the two domains have different distributions.
Motivation (1)
• Example analysis: both the HP and the Lenovo news above belong to the same document class, "Product announcement".
• HP news expresses the product word concept with: LaserJet, printer, announcement, price, ...
• Lenovo news expresses the same word concept with: ThinkPad, ThinkCentre, announcement, price, ...
Motivation (2)
• Example analysis:
  - The words expressing the same word concept are domain-dependent.
  - The association between word concepts and document classes is domain-independent; e.g., in both domains the product word concept indicates the class "Product announcement".
• Can we model this observation for classification? We study how to model it for cross-domain classification, with:
  - domain-dependent word concepts, and
  - a domain-independent association between word concepts and document classes.
Motivation (3)
• Example analysis: the two domains share some common words: announcement, price, performance, ...
• HP news (class "Product announcement"), product word concept: LaserJet, printer, announcement, price, ...
• Lenovo news (same class), product word concept: ThinkPad, ThinkCentre, announcement, price, ...
• The shared words make it possible to group the domain-specific words into the same word concepts across domains.
Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Work
• Conclusions
Preliminary Knowledge
• Basic formula of matrix tri-factorization: X ≈ F S G^T, where the input X is the word-document co-occurrence matrix, F contains the word-cluster (word concept) memberships of the words, G contains the class memberships of the documents, and S encodes the association between word clusters and document classes.
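To make the roles of the three factors concrete, here is a minimal NumPy shape check of the tri-factorization; the dimensions and random values are arbitrary, chosen only for illustration:

```python
# Shape check for the tri-factorization X ~ F S G^T (illustration only).
import numpy as np

m, n, k, c = 5000, 300, 50, 2   # words, documents, word clusters, classes (arbitrary)
F = np.random.rand(m, k)        # word -> word-cluster memberships
S = np.random.rand(k, c)        # word-cluster <-> document-class associations
G = np.random.rand(n, c)        # document -> class memberships
X_approx = F @ S @ G.T          # reconstruction of the word-document matrix
assert X_approx.shape == (m, n)
```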
Problem Formulation (1)
• Input: labeled source domain Xs, unlabeled target domain Xt
• Matrix tri-factorization based classification framework, in two variants:
  - Two-step Optimization Framework (MTrick0)
  - Joint Optimization Framework (MTrick)
Problem Formulation (2)
• Sketch map of the two-step optimization:
  [Figure: first step, factorize the source domain Xs into Fs, Ss, Gs; second step, factorize the target domain Xt into Ft and Gt, carrying Ss over from the first step.]
Problem Formulation (3)
• The optimization problem in the source domain (first step); the goal is to obtain Fs, Gs and Ss:
  min over Fs, Ss, Gs >= 0 of ||Xs - Fs Ss Gs^T||^2 + beta * ||Gs - G0||^2,
  where G0 is used as the supervision information for this optimization.
• The optimization problem in the target domain (second step); the goal is to obtain Ft and Gt:
  min over Ft, Gt >= 0 of ||Xt - Ft Ss Gt^T||^2,
  where Ss is fixed to the solution obtained from the source domain.
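The slides do not spell out how G0 is built; one plausible encoding (an assumption for illustration, the paper may normalize or smooth it differently) is a one-hot matrix over the source labels:

```python
import numpy as np

def build_G0(labels, num_classes):
    """One-hot supervision matrix: row i marks the class of source document i (assumed encoding)."""
    G0 = np.zeros((len(labels), num_classes))
    G0[np.arange(len(labels)), labels] = 1.0
    return G0

# e.g., four source documents with labels rec = 0, sci = 1
G0 = build_G0(np.array([0, 1, 1, 0]), num_classes=2)
```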
Problem Formulation (4)
• Sketch map of the joint optimization:
  [Figure: the source domain Xs ≈ Fs S Gs^T and the target domain Xt ≈ Ft S Gt^T are factorized simultaneously; the shared association S carries the knowledge transfer between the two domains.]
Problem Formulation (5)
• The joint optimization problem over the source and target domains:
  min over Fs, Ft, Gs, Gt, S >= 0 of ||Xs - Fs S Gs^T||^2 + alpha * ||Xt - Ft S Gt^T||^2 + beta * ||Gs - G0||^2
• The association S is shared as the bridge to transfer knowledge across domains; G0 is the supervision information; alpha and beta are trade-off parameters.
Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Work
• Conclusions
Solution for Optimization
• An alternating iterative algorithm is developed: Fs, Ft, Gs, Gt and the shared S are updated in turn by multiplicative update formulas, each derived from the corresponding terms of the joint objective. These updates constitute the solution of the joint optimization problem (a sketch follows below).
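Below is a minimal sketch of such alternating multiplicative updates, derived from the Frobenius-norm objective assumed on the previous slides; it is not the authors' implementation, and the function name, the eps smoothing and the random initialization are illustrative choices:

```python
# Hedged sketch of alternating multiplicative updates for the joint objective
#   ||Xs - Fs S Gs^T||^2 + alpha ||Xt - Ft S Gt^T||^2 + beta ||Gs - G0||^2
# (assumed form; not the authors' released code). Xs, Xt are dense
# word-document arrays; G0 is the supervision matrix over source documents.
import numpy as np

def mtrick_sketch(Xs, Xt, G0, k, alpha=1.0, beta=1.5, iters=100, eps=1e-9):
    rng = np.random.default_rng(0)
    (ms, ns), (mt, nt), c = Xs.shape, Xt.shape, G0.shape[1]
    Fs, Ft = rng.random((ms, k)), rng.random((mt, k))  # domain-dependent word clusters
    Gs, Gt = rng.random((ns, c)), rng.random((nt, c))  # document-class memberships
    S = rng.random((k, c))                             # shared cluster-class association
    for _ in range(iters):
        Fs *= (Xs @ Gs @ S.T) / (Fs @ (S @ Gs.T @ Gs @ S.T) + eps)
        Ft *= (Xt @ Gt @ S.T) / (Ft @ (S @ Gt.T @ Gt @ S.T) + eps)
        # Gs is additionally pulled toward the supervision G0
        Gs *= (Xs.T @ Fs @ S + beta * G0) / (Gs @ (S.T @ Fs.T @ Fs @ S) + beta * Gs + eps)
        Gt *= (Xt.T @ Ft @ S) / (Gt @ (S.T @ Ft.T @ Ft @ S) + eps)
        # S pools evidence from both domains: the knowledge-transfer bridge
        S *= (Fs.T @ Xs @ Gs + alpha * Ft.T @ Xt @ Gt) / (
            (Fs.T @ Fs) @ S @ (Gs.T @ Gs) + alpha * (Ft.T @ Ft) @ S @ (Gt.T @ Gt) + eps)
    return Fs, Ft, Gs, Gt, S

# Target document j is then assigned the class with the largest entry in row j of Gt:
# y_pred = Gt.argmax(axis=1)
```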
Analysis of Algorithm Convergence
• Following the convergence-analysis methodology of [Lee et al., NIPS'01] and [Ding et al., KDD'06], the following theorem holds.
• Theorem (Convergence): after each round of applying the iterative update formulas, the objective function of the joint optimization does not increase; hence the algorithm converges monotonically.
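In practice the theorem can be sanity-checked numerically by evaluating the (assumed) joint objective after every round of updates and verifying that it never increases; a small helper in the notation of the sketch above:

```python
import numpy as np

def joint_objective(Xs, Xt, Fs, Ft, Gs, Gt, S, G0, alpha=1.0, beta=1.5):
    """Value of the assumed joint objective; should be non-increasing across update rounds."""
    return (np.linalg.norm(Xs - Fs @ S @ Gs.T) ** 2
            + alpha * np.linalg.norm(Xt - Ft @ S @ Gt.T) ** 2
            + beta * np.linalg.norm(Gs - G0) ** 2)
```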
Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Work
• Conclusions
Experimental Preparation (1)
• Construct classification tasks for rec vs. sci, where rec and sci denote the positive and negative classes, respectively.
• Source domain: one rec subcategory plus one sci subcategory, e.g., rec.autos + sci.med (4 x 4 cases).
• Target domain: one of the remaining rec subcategories plus one of the remaining sci subcategories, e.g., rec.motorcycles + sci.space (3 x 3 cases).
• In total, 144 (= 4 x 4 x 3 x 3) tasks can be constructed from this data set.
Experimental Preparation (2)
• Data sets:
  - 20 Newsgroups (three top categories selected: rec, sci, talk), giving two data sets for binary classification: rec vs. sci (144 tasks) and sci vs. talk (144 tasks)
  - Reuters-21578 (the problems constructed in [Gao et al., KDD'08])
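For orientation, one such rec vs. sci cross-domain task could be assembled with scikit-learn roughly as follows; the subcategory split matches the example above, while the vectorizer settings are illustrative assumptions:

```python
# Sketch of building one rec vs. sci cross-domain task from 20 Newsgroups.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

src_cats = ['rec.autos', 'sci.med']          # source: one rec + one sci subcategory
tgt_cats = ['rec.motorcycles', 'sci.space']  # target: different subcategories
src = fetch_20newsgroups(subset='all', categories=src_cats)
tgt = fetch_20newsgroups(subset='all', categories=tgt_cats)

vec = CountVectorizer(stop_words='english', min_df=3)
Xs = vec.fit_transform(src.data).T  # word-document matrix of the source domain
Xt = vec.transform(tgt.data).T      # target domain, restricted to the source vocabulary
ys = src.target                     # 0/1 labels (rec vs. sci), used to build G0
```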
Experimental Preparation (3)
• Compared algorithms:
  - Supervised learning: Logistic Regression (LG) [David et al., 00], Support Vector Machine (SVM) [Joachims, ICML'99]
  - Semi-supervised learning: TSVM [Joachims, ICML'99]
  - Cross-domain learning: CoCC [Dai et al., KDD'07], LWE [Gao et al., KDD'08]
  - Our methods: MTrick0 (two-step optimization framework), MTrick (joint optimization framework)
• Evaluation measure: classification accuracy
Experimental Results (1)
• Comparison among MTrick, MTrick0, CoCC, TSVM, SVM and LG on the data set rec vs. sci
• MTrick performs well even when the accuracy of LG is lower than 65%
Experimental Results (2)
• Comparison among MTrick, MTrick0, CoCC, TSVM, SVM and LG on the data set sci vs. talk
• As on rec vs. sci, MTrick again achieves the best results on this data set
Experimental Results (3)
• Performance comparison of MTrick, LWE, CoCC, TSVM, SVM and LG on Reuters-21578
• MTrick also performs very well on this data set
Experimental Results Summary
• The systematic experiments show that MTrick outperforms all the compared algorithms.
• In particular, MTrick performs well even when the accuracy of LG is low (< 65%), which indicates that MTrick still works when the transfer-learning problem is hard.
• The joint optimization performs better than the two-step optimization.
Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Work
• Conclusions
Related Work (1)
• Cross-domain learning: addresses the distribution mismatch between the training and test data.
  - Instance-weighting based approaches: boosting-based learning by Dai et al. [ICML'07]; an instance-weighting framework for NLP tasks by Jiang et al. [ACL'07]
  - Feature-selection based approaches: a two-phase feature-selection framework by Jiang et al. [CIKM'07]; a dimensionality-reduction approach by Pan et al. [AAAI'08], which finds a latent feature space that serves as the bridge knowledge between the source and target domains
  - The Co-Clustering based Classification method by Dai et al. [KDD'07]
Related Work (2)
• Nonnegative Matrix Factorization (NMF):
  - Weighted nonnegative matrix factorization (WNMF) by Guillamet et al. [PRL'03]
  - Incorporating word-space knowledge for document clustering by Li et al. [SIGIR'08]
  - Orthogonality-constrained NMF by Ding et al. [KDD'06]
  - Cross-domain collaborative filtering by Li et al. [IJCAI'09]
  - Transferring label information by sharing word clusters, proposed by Li et al. [SIGIR'09]; however, the word clusters are not exactly the same across domains due to the distribution difference
Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Work
• Conclusions
Conclusions
• Proposed a nonnegative matrix tri-factorization based classification framework (MTrick), which explicitly models:
  - the domain-dependent word concepts, and
  - the domain-independent association between word concepts and document classes
• Developed an alternating iterative algorithm to solve the optimization problem, and theoretically analyzed its convergence
• Experiments on real-world text data sets show the effectiveness of the proposed approach
Acknowledgement
Thank you! Q & A