Semi-supervised Learning with Weakly-Related Unlabeled Data: Towards Better Text Categorization Advisor: Hsin-Hsi Chen Reporter: Chi-Hsin Yu Date: 2009.09.24 From NIPS 2008
Outline • Introduction • Related Work • Review of SVM • SSLW (Semi-supervised Learning with Weakly-Related Unlabeled Data) • Experiments • Conclusion
Introduction • Semi-supervised Learning (SSL) • takes advantage of a large amount of unlabeled data to enhance classification accuracy • Cluster assumption • puts the decision boundary in low-density areas without crossing high-density regions • is only meaningful when the labeled and unlabeled data are closely related • If they are only weakly related, the labeled and unlabeled data can be well separated, so the assumption gives little help
Introduction (cont.) • This paper aims to • identify a new data representation (in feature space) • by constructing a new kernel function • Advantages • informative about the target class (category) • consistent with the feature-coherence patterns exhibited in the weakly related unlabeled data
Related Work • Two types of semi-supervised learning (SSL) • Transductive SSL • predicts labels only for the available unlabeled data • Inductive SSL • also learns a classifier that can predict labels for new data • SSLW belongs to the inductive type
SVM • Notations • L = {(x1, y1), …, (xl, yl)}: labeled documents • U = {(x_{l+1}, y_{l+1}), …, (xn, yn)}: unlabeled documents • Document-word matrix D = (d1, d2, …, dn), di ∈ N^V • V: the size of the vocabulary • di: word-frequency vector of document i • Word-document matrix G = (g1, g2, …, gV), gi = (g_{i,1}, g_{i,2}, …, g_{i,n}) • K = D^T D, K ∈ R^{n×n}: document pairwise similarity matrix • α ∘ y = (α1y1, α2y2, …, αnyn): element-wise product
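To make the notation concrete, the sketch below builds a tiny toy document-word matrix and computes the baseline kernel K = D^T D; the vocabulary size, document count, and word counts are invented purely for illustration.

```python
# Minimal sketch of the baseline notation with toy data (all values illustrative).
import numpy as np

V, n = 5, 4                            # vocabulary size, number of documents
rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(V, n))    # document-word matrix; column d_i is the
                                       # word-frequency vector of document i
K = D.T @ D                            # K = D^T D, n x n pairwise document similarity
G = D.T                                # word-document matrix; column g_i lists the
                                       # counts of word i across the n documents
print(K.shape)                         # (n, n)
```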
SSLW • The kernel K = D^T D is replaced by K = D^T R D • R ∈ R^{V×V}: word-correlation matrix • Two ways to construct the matrix R • Factorize G = U W, with W = (w1, w2, …, wV), where wi is the internal representation of the i-th word; then R = W^T W and T = U U^T • Use the top p right eigenvectors of G • Constraints in the resulting optimization: αi ≥ 0, ξ ≥ 0
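A hedged sketch of the modified kernel K = D^T R D is given below. It uses a rank-p truncated SVD of G as one concrete way to obtain a factorization G ≈ UW and then sets R = W^T W; the choice of p and the SVD-based factorization are assumptions for illustration, not necessarily the exact construction in the paper.

```python
# Sketch of an SSLW-style kernel K = D^T R D with a low-rank word-correlation
# matrix R = W^T W. The rank-p SVD used to factor G is an illustrative choice.
import numpy as np

def sslw_kernel(D, p=2):
    """D: V x n document-word matrix (columns are documents)."""
    G = D.T.astype(float)                      # n x V word-document matrix
    U_full, s, Vt = np.linalg.svd(G, full_matrices=False)
    U = U_full[:, :p] * s[:p]                  # n x p document-side factor
    W = Vt[:p, :]                              # p x V; column w_i is the internal
                                               # representation of the i-th word
    R = W.T @ W                                # V x V word-correlation matrix
    return D.T @ R @ D                         # n x n modified kernel

rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(5, 4))            # toy V=5, n=4 document-word matrix
K = sslw_kernel(D, p=2)
print(K.shape)                                 # (4, 4)
```

With K computed this way, a standard SVM that accepts a precomputed kernel (for example, sklearn.svm.SVC(kernel="precomputed")) can be trained on the labeled block of K and applied to the remaining documents.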
SSLW (cont.) • An efficient algorithm for SSLW
Experiments • Corpora • Reuters-21578 (9,400 docs) • WebKB (4,518 docs) • TREC AP88: an external information source for both datasets (1,000 randomly selected documents)
Evaluation Methodology • 4 positive + 4 negative samples drawn from each training set • AUR (area under the ROC curve) as the metric • AUR averaged over ten runs of each experiment
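A minimal sketch of this protocol is shown below, assuming a generic feature matrix X, binary labels y in {0, 1}, and a plain linear SVM as a stand-in classifier; these are placeholders rather than the paper's exact setup.

```python
# Hedged sketch of the evaluation: sample 4 positive + 4 negative training
# documents, score the rest, and average AUR over repeated random draws.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

def averaged_aur(X, y, n_runs=10, n_pos=4, n_neg=4, seed=0):
    """X: feature matrix, y: binary labels in {0, 1} (placeholder inputs)."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        pos = rng.choice(np.where(y == 1)[0], n_pos, replace=False)
        neg = rng.choice(np.where(y == 0)[0], n_neg, replace=False)
        train = np.concatenate([pos, neg])
        test = np.setdiff1d(np.arange(len(y)), train)
        clf = SVC(kernel="linear").fit(X[train], y[train])
        scores.append(roc_auc_score(y[test], clf.decision_function(X[test])))
    return float(np.mean(scores))              # mean AUR over the runs
```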
Conclusion • SSLW significantly improves both the accuracy and the reliability of text categorization • given a small training pool and additional unlabeled data that are only weakly related to the test bed