Co-training LING 572 Fei Xia 02/21/06
Overview • Proposed by Blum and Mitchell (1998) • Important work: • (Nigam and Ghani, 2000) • (Goldman and Zhou, 2000) • (Abney, 2002) • (Sarkar, 2002) • … • Used in document classification, parsing, etc.
Outline • Basic concept: (Blum and Mitchell, 1998) • Relation with other SSL algorithms: (Nigam and Ghani, 2000)
An example • Web-page classification: e.g., find homepages of faculty members. • Page text: words occurring on that page, e.g., “research interest”, “teaching”. • Hyperlink text: words occurring in hyperlinks that point to that page, e.g., “my advisor”.
Two views • Features can be split into two sets: X1 and X2 • The instance space: X = X1 × X2 • Each example: x = (x1, x2) • D: the distribution over X • C1: the set of target functions over X1. • C2: the set of target functions over X2.
Assumption #1: compatibility • The instance distribution D is compatible with the target function f = (f1, f2) if for any x = (x1, x2) with non-zero probability, f(x) = f1(x1) = f2(x2). • The compatibility of f with D: the fraction of the probability mass of D on which f1 and f2 agree. • Each set of features is sufficient for classification.
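In symbols (a sketch in standard notation; the formula itself was not preserved on the slide, so treat the exact form as an assumption):

```latex
% f = (f_1, f_2) is compatible with D if the two views never disagree on D's support:
\Pr_{(x_1, x_2) \sim D}\left[ f_1(x_1) \neq f_2(x_2) \right] = 0
% Degree of compatibility of f with D: the probability mass on which the two views agree
\mathrm{compat}(f, D) = \Pr_{(x_1, x_2) \sim D}\left[ f_1(x_1) = f_2(x_2) \right]
```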
Co-training algorithm (cont) • Why use U’, in addition to U? • Using U’ yields better results. • Possible explanation: it forces h1 and h2 to select examples that are more representative of the underlying distribution D that generates U. • Choosing p and n: the ratio p/n should match the ratio of positive to negative examples in D. • Choosing the number of iterations and the size of U’.
Intuition behind the co-training algorithm • h1 adds examples to the labeled set that h2 will be able to use for learning, and vice versa. • If the conditional independence assumption holds, then on average each added document will be as informative as a random document, and learning will progress.
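As a concrete reference point, here is a minimal sketch of the loop described on the last two slides, assuming bag-of-words count features for each view, binary 0/1 labels, and Naïve Bayes base classifiers (the learner used in the experiments below). The function name, and the shortcut of redrawing U’ each round instead of replenishing it, are illustrative simplifications rather than details from the original paper; the default parameter values are the ones reported on the experiment slide.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             p=1, n=3, k=30, pool_size=75, seed=0):
    """Co-training sketch: two views, binary 0/1 labels, Naive Bayes base learners."""
    rng = np.random.default_rng(seed)
    X1_lab, X2_lab, y_lab = list(X1_lab), list(X2_lab), list(y_lab)
    unlab = list(range(len(X1_unlab)))            # indices of still-unlabeled examples

    for _ in range(k):
        if not unlab:
            break
        h1 = MultinomialNB().fit(X1_lab, y_lab)   # classifier on view 1 (e.g., page words)
        h2 = MultinomialNB().fit(X2_lab, y_lab)   # classifier on view 2 (e.g., hyperlink words)

        # Draw a small pool U' from U instead of scoring all of U
        pool = rng.choice(unlab, size=min(pool_size, len(unlab)), replace=False)

        for h, X_view in ((h1, X1_unlab), (h2, X2_unlab)):
            probs = h.predict_proba([X_view[i] for i in pool])
            pos = pool[np.argsort(-probs[:, 1])[:p]]   # p most confident positives
            neg = pool[np.argsort(-probs[:, 0])[:n]]   # n most confident negatives
            # Each classifier's picks go into the shared labeled set,
            # so they become training data for the other classifier too.
            for i, label in [(i, 1) for i in pos] + [(i, 0) for i in neg]:
                if i in unlab:
                    X1_lab.append(X1_unlab[i])
                    X2_lab.append(X2_unlab[i])
                    y_lab.append(label)
                    unlab.remove(i)
    return h1, h2
```

Appending both classifiers' confident picks to the same labeled set is exactly how h1 ends up teaching h2 and vice versa.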
Experiments: setting • 1051 web pages from 4 CS depts, manually labeled into a number of categories, e.g., “course home page” • 263 pages (25%) as test data • The remaining 75% of pages: • Labeled data: 3 positive and 9 negative examples • Unlabeled data: the rest (776 pages) • Two views: • View #1 (page-based): words in the page • View #2 (hyperlink-based): words in the hyperlinks that point to the page • Learner: Naïve Bayes
Experiment: results • Parameter settings: p = 1, n = 3, # of iterations = 30, |U’| = 75
Questions • Can co-training algorithms be applied to datasets without natural feature divisions? • How sensitive are the co-training algorithms to the correctness of the assumptions? • What is the relation between co-training and other SSL methods (e.g., self-training)?
EM • Pool the features together. • Use the initial labeled data to get initial parameter estimates. • In each iteration, use all the data (labeled and unlabeled) to re-estimate the parameters. • Repeat until convergence.
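A minimal sketch of this loop, assuming dense bag-of-words count matrices and a Multinomial Naïve Bayes model, with a fixed number of rounds standing in for the convergence test; representing soft labels by weighted duplicate rows is one common implementation choice, not necessarily the one used in the papers cited here.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_nb(X_lab, y_lab, X_unlab, n_rounds=10):
    """Semi-supervised EM sketch: pool all features, probabilistically label unlabeled data."""
    clf = MultinomialNB().fit(X_lab, y_lab)        # initial estimates from labeled data only
    classes = clf.classes_
    for _ in range(n_rounds):
        # E-step: probabilistically label every unlabeled example
        probs = clf.predict_proba(X_unlab)         # shape (n_unlab, n_classes)
        # M-step: re-estimate parameters from all data; each unlabeled example
        # contributes one weighted copy per class (weight = class probability)
        X_all = np.vstack([X_lab, np.repeat(X_unlab, len(classes), axis=0)])
        y_all = np.concatenate([y_lab, np.tile(classes, X_unlab.shape[0])])
        w_all = np.concatenate([np.ones(len(y_lab)), probs.ravel()])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```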
Experimental results: WebKB course database • EM performs better than co-training. • Both are close to the supervised method when trained on more labeled data.
Another experiment: the News 2x2 dataset • A semi-artificial dataset in which the conditional independence assumption holds. • Co-training outperforms EM and the “oracle” result.
Co-training vs. EM • Co-training splits the features; EM does not. • Co-training uses the unlabeled data incrementally, adding a few confidently labeled examples at a time. • EM probabilistically labels all the unlabeled data at each round; it uses the unlabeled data iteratively rather than incrementally.
Co-EM: EM with feature split • Repeat until convergence: • Train the A-feature-set classifier using the labeled data and the unlabeled data with B’s labels. • Use classifier A to probabilistically label all the unlabeled data. • Train the B-feature-set classifier using the labeled data and the unlabeled data with A’s labels. • B re-labels the unlabeled data for use by A.
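A minimal sketch of co-EM under the same assumptions as the EM sketch above (dense count matrices, Naïve Bayes base learners, a fixed number of rounds in place of a convergence test); the function names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def nb_on_soft_labels(X_lab, y_lab, X_unlab, probs, classes):
    """Train NB on labeled data plus probabilistically labeled unlabeled data."""
    X_all = np.vstack([X_lab, np.repeat(X_unlab, len(classes), axis=0)])
    y_all = np.concatenate([y_lab, np.tile(classes, X_unlab.shape[0])])
    w_all = np.concatenate([np.ones(len(y_lab)), probs.ravel()])
    return MultinomialNB().fit(X_all, y_all, sample_weight=w_all)

def co_em(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, n_rounds=10):
    """Co-EM sketch: EM-style probabilistic labels, but each view trains on the other's labels."""
    clf_a = MultinomialNB().fit(X1_lab, y_lab)       # view-A classifier, from labeled data only
    classes = clf_a.classes_
    for _ in range(n_rounds):
        # A probabilistically labels all unlabeled data; B trains on A's labels
        probs_a = clf_a.predict_proba(X1_unlab)
        clf_b = nb_on_soft_labels(X2_lab, y_lab, X2_unlab, probs_a, classes)
        # B re-labels the unlabeled data; A trains on B's labels
        probs_b = clf_b.predict_proba(X2_unlab)
        clf_a = nb_on_soft_labels(X1_lab, y_lab, X1_unlab, probs_b, classes)
    return clf_a, clf_b
```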
Four SSL methods • Results on the News 2x2 dataset
Random feature split • Error rates with a natural vs. a random feature split: co-training 3.7% vs. 5.5%, co-EM 3.3% vs. 5.1%. • When the conditional independence assumption does not hold but there is sufficient redundancy among the features, co-training still works well.
Assumptions • Assumptions made by the underlying classifier (supervised learner): • Naïve Bayes: words occur independently of each other, given the class of the document. • Co-training uses the classifier to rank the unlabeled examples by confidence. • EM uses the classifier to assign probabilities to each unlabeled example. • Assumptions made by the SSL method: • Co-training: the conditional independence assumption. • EM: maximizing likelihood correlates with reducing classification errors.
Summary of (Nigam and Ghani, 2000) • Comparison of four SSL methods: self-training, co-training, EM, co-EM. • The performance of the SSL methods depends on how well the underlying assumptions are met. • Randomly splitting the features is not as good as a natural split, but it still works if there is sufficient redundancy among the features.
Variations of co-training • Goldman and Zhou (2000) use two learners of different types, but both take the whole feature set. • Zhou and Li (2005) use three learners: if two agree on an example, it is used to teach the third learner. • Balcan et al. (2005) relax the conditional independence assumption to a much weaker expansion condition.
An alternative? • L → L1, L → L2 • U → U1, U → U2 • Repeat: • Train h1 using L1 on feature set 1 • Train h2 using L2 on feature set 2 • Classify U2 with h1 and let U2’ be the subset with the most confident scores; L2 + U2’ → L2, U2 - U2’ → U2 • Classify U1 with h2 and let U1’ be the subset with the most confident scores; L1 + U1’ → L1, U1 - U1’ → U1
Yarowsky’s algorithm • One sense per discourse gives View #1: the ID of the document that the word occurs in. • One sense per collocation gives View #2: the local context of the word in the document. • Yarowsky’s algorithm is a special case of co-training (Blum & Mitchell, 1998). • Is this correct? No, according to (Abney, 2002).
Summary of co-training • The original paper: (Blum and Mitchell, 1998) • Two “independent” views: split the features into two sets. • Train a classifier on each view. • Each classifier labels data that can be used to train the other classifier. • Extensions: • Relax the conditional independence assumption. • Instead of using two views, use two or more classifiers trained on the whole feature set.
Summary of SSL • Goal: use both labeled and unlabeled data. • Many algorithms: EM, co-EM, self-training, co-training, … • Each algorithm is based on some assumptions. • SSL works well when the assumptions are satisfied.
Rule independence • H1 (H2) consists of rules that are functions of X1 (X2, resp) only.
EM: the data is generated according to some simple, known parametric model. • Ex: the positive examples are generated according to an n-dimensional Gaussian D+ centered around the point μ+.