Analysis of Bootstrapping Algorithms
Seminar of Machine Learning for Text Mining, UPC, 18/11/2004
Mihai Surdeanu
Goals • Introduce Steven Abney’s “Understanding the Yarowsky Algorithm” paper (Computational Linguistics 30(3), 2004) • What are the bootstrapping algorithms covered and their properties? • Will skip the theorem proofs • What do they mean in the context of document clustering and pattern acquisition? • How do they compare with other iterative refinement clustering algorithms and with Yangarber 2003?
Notations
WSD: x – word, j – word sense, f – word/context feature
Clustering: x – document, j – category/domain, f – document feature (word, pattern)
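To make this notation concrete for the clustering reading, here is a tiny Python sketch (the names `Example`, `LABELS` and the sample values are illustrative, not from the paper): each object x is a document represented by its feature set F_x, each label j is a category, and the evolving labeling maps documents to a label or to None when still unlabeled.

```python
from dataclasses import dataclass

# Labels j: categories/domains (illustrative values).
LABELS = ["sports", "finance"]

@dataclass(frozen=True)
class Example:
    """An object x: a document represented by its feature set F_x
    (words or extraction patterns in the clustering reading,
    context features in the WSD reading)."""
    doc_id: str
    features: frozenset

# The current labeling Y: doc_id -> label j, or None if still unlabeled.
labeling = {"d1": "sports", "d2": None}
```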
Generic Yarowsky Algorithm (Y-0) • Needs a base learner • Changes an example’s label only if the prediction score exceeds an arbitrary threshold • Does not change the labels of seeds • Nothing formal can be shown about Y-0.
Modified Algorithm (Y-1) • A labeled example cannot become unlabeled again • Fixed threshold
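A minimal sketch of this bootstrapping loop, assuming a toy bag-of-features representation; `train_base_learner` stands in for whichever decision-list learner is plugged in (DL-0, DL-EM, ...), the threshold value is illustrative, and the code follows the Y-1 behaviour, with comments marking where Y-0 differs.

```python
def yarowsky(examples, seeds, train_base_learner, threshold=0.6, max_iters=50):
    """examples: dict doc_id -> set of features
    seeds: dict doc_id -> label (never changed)
    train_base_learner: callable(examples, labeling) -> predict(features) -> (label, score)
    Returns the final labeling (doc_id -> label or None)."""
    labeling = {x: seeds.get(x) for x in examples}          # None = unlabeled
    for _ in range(max_iters):
        predict = train_base_learner(examples, labeling)     # retrain on current labels
        new_labeling, changed = {}, False
        for x, feats in examples.items():
            if x in seeds:                                   # seeds are never relabeled
                new_labeling[x] = seeds[x]
                continue
            label, score = predict(feats)
            if score > threshold:                            # threshold: arbitrary in Y-0, fixed in Y-1
                new_labeling[x] = label
            else:
                # Y-0 would allow the label to be dropped here; Y-1 keeps the old label,
                # so a labeled example can never become unlabeled again.
                new_labeling[x] = labeling[x]
            changed |= new_labeling[x] != labeling[x]
        labeling = new_labeling
        if not changed:
            break
    return labeling
```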
Properties of Y-1 • If the base learner reduces the divergence on the labeled (or all) examples, algorithm Y-1 decreases H (cross entropy – equation (6)) at each iteration until it reaches a critical point of H
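For reference, a plausible form of the objective being decreased (a sketch in the spirit of the paper, not a reproduction of its equation (6)): H sums, over the labeled examples Λ, the cross entropy between the current labeling distribution φ_x (a point mass on x’s label) and the classifier’s prediction π_x.

```latex
H \;=\; \sum_{x \in \Lambda} H\!\left(\phi_x \,\|\, \pi_x\right)
  \;=\; -\sum_{x \in \Lambda} \sum_{j} \phi_x(j)\,\log \pi_x(j)
```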
The Original Decision List Induction Algorithm (DL-0) • Precision smoothed with an arbitrary value • Pick the label given by the rule with the best score • The resulting scores do NOT form a probability distribution! • Nothing formal can be shown about DL-0.
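A minimal sketch of this kind of decision-list learner, under the usual reading of smoothed precision (the smoothing constant `eps` is an assumption, standing in for the slide’s “arbitrary value”): each (feature, label) rule is scored by smoothed precision, and prediction picks the single best-scoring rule among the features present, so the scores need not sum to one.

```python
from collections import defaultdict

def train_dl0(examples, labeling, labels, eps=0.1):
    """examples: doc_id -> set of features; labeling: doc_id -> label or None."""
    count_fj = defaultdict(float)   # joint counts of (feature, label)
    count_f = defaultdict(float)    # feature counts over labeled examples
    for x, feats in examples.items():
        j = labeling.get(x)
        if j is None:
            continue
        for f in feats:
            count_fj[(f, j)] += 1.0
            count_f[f] += 1.0

    def score(f, j):
        # Smoothed precision of rule f -> j; eps is the arbitrary smoothing value.
        return (count_fj[(f, j)] + eps) / (count_f[f] + eps * len(labels))

    def predict(feats):
        # DL-0: the single strongest rule wins; the scores are not a distribution.
        best = max(((score(f, j), j) for f in feats for j in labels),
                   default=(0.0, None))
        return best[1], best[0]

    return predict
```

With a small wrapper such as `lambda ex, lab: train_dl0(ex, lab, LABELS)`, this predictor slots directly into the bootstrapping loop sketched earlier.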
The EM-based Decision List Algorithm (DL-EM) • A mixture of the feature scores is used to compute the prediction for x (see above). Because each feature score is a probability distribution over labels, the prediction is also a probability distribution. • Whereas in DL-0 the prediction is given by the “strongest” feature, here the algorithm permits a block of “weaker” features to outweigh the strongest feature. • DL-EM does not construct a classifier from scratch (like DL-0), but rather builds upon the previous classifier (the old feature scores and the old predictions for x).
The EM-based Decision List Algorithm (DL-EM) • Probability that feature f was responsible for label j for object x • Normalization over all features
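In symbols, the quantities described on this slide can be written as the following sketch (θ^old are the previous classifier’s scores and F_x is the feature set of x; this is a reading of the slide, not the paper’s exact equations): the responsibility of feature f for label j on example x is its old score normalized over x’s features, and the prediction is a uniform mixture of the feature scores.

```latex
P(f \mid x, j) \;=\; \frac{\theta^{\text{old}}_{fj}}{\sum_{f' \in F_x} \theta^{\text{old}}_{f'j}},
\qquad
\pi_x(j) \;=\; \frac{1}{|F_x|} \sum_{f \in F_x} \theta_{fj}
```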
Algorithm DL-EM-Λ • What are the initial (iteration 0) parameters??? • A similar algorithm exists when the feature scores are computed over all examples, not just the labeled ones: DL-EM-V.
Properties of DL-EM-* • Y-1/DL-EM-Λ and Y-1/DL-EM-V decrease H at each iteration until they reach a critical point of H (a local minimum).
Algorithm DL-1-R • “Raw” precision • Mixture of feature scores
Algorithm DL-1-VS • Precision with variable smoothing for each feature • Mixture of feature scores
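A minimal sketch contrasting the two scoring rules on these two slides (all names are illustrative, and the choice of the variable smoothing amount is an assumption): DL-1-R uses raw precision over the labeled examples, DL-1-VS smooths each feature’s precision by a feature-dependent amount, and both predict with a mixture (average) of feature scores rather than the single strongest rule.

```python
from collections import defaultdict

def train_dl1(examples, labeling, labels, variant="R"):
    """variant='R': raw precision; variant='VS': variable smoothing per feature."""
    count_fj = defaultdict(float)
    count_f = defaultdict(float)
    n_unlabeled_with_f = defaultdict(float)
    for x, feats in examples.items():
        j = labeling.get(x)
        for f in feats:
            if j is None:
                n_unlabeled_with_f[f] += 1.0
            else:
                count_fj[(f, j)] += 1.0
                count_f[f] += 1.0

    def score(f, j):
        if variant == "R":
            # Raw precision over labeled examples containing f.
            return count_fj[(f, j)] / count_f[f] if count_f[f] else 1.0 / len(labels)
        # "VS": smoothing amount depends on the feature; here (as an assumption)
        # it grows with the number of unlabeled examples that contain f.
        eps = n_unlabeled_with_f[f] / len(labels) + 1e-9
        return (count_fj[(f, j)] + eps) / (count_f[f] + eps * len(labels))

    def predict(feats):
        # Mixture of feature scores: average over the features present.
        totals = {j: sum(score(f, j) for f in feats) / max(len(feats), 1) for j in labels}
        best = max(totals, key=totals.get)
        return best, totals[best]

    return predict
```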
Properties of DL-1-* • Y-1/DL-1-R minimizes K (an upper bound on H) over the labeled examples • Y-1/DL-1-VS minimizes K over all examples X
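As a rough reading (an assumption, not the paper’s exact definition), K can be seen as the bound obtained by pulling the logarithm inside the mixture π_x(j) = (1/|F_x|) Σ_{f∈F_x} θ_fj via Jensen’s inequality:

```latex
H \;\le\; K \;=\; \sum_{x}\sum_{j}\phi_x(j)\,\frac{1}{|F_x|}\sum_{f\in F_x}\log\frac{1}{\theta_{fj}}
```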
So far… • Y-0/DL-0 – the original Yarowsky algorithm. Cannot be shown to minimize H or K. • Y-1/DL-EM-Λ and Y-1/DL-EM-V minimize H • Y-1/DL-1-R and Y-1/DL-1-VS minimize K
Sequential Algorithms • All previous algorithms do “parallel” updates, in the sense that all the parameters θ_fj are recomputed at every iteration. • Sequential algorithms select one feature at each iteration: S(t+1) = S(t) ∪ {f_t} • Only the score of the selected feature and the scores of the documents containing it are recomputed. • More flexible – shown to converge for more base learners.
Algorithm YS • Choose a feature that: (1) is not a seed feature, (2) occurs in the labeled (training) data, (3) has a score that changed
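A minimal sketch of the sequential loop with these three selection criteria marked in comments; `score_feature` stands in for whichever base learner (YS-P, YS-R, YS-FS) is plugged in, and the majority-vote labeling rule is an assumption made for illustration.

```python
def ys_bootstrap(examples, seed_features, score_feature, max_iters=100):
    """examples: doc_id -> set of features
    seed_features: dict feature -> label (the initial rules in S)
    score_feature: callable(f, labeling) -> (label, score), the pluggable base learner."""
    selected = dict(seed_features)                 # S: feature -> label
    scores = {f: 1.0 for f in seed_features}

    def label_of(feats):
        # Label a document from the currently selected features (majority vote here).
        votes = [selected[f] for f in feats if f in selected]
        return max(set(votes), key=votes.count) if votes else None

    labeling = {x: label_of(feats) for x, feats in examples.items()}
    for _ in range(max_iters):
        best, best_score, best_label = None, 0.0, None
        for f in {f for feats in examples.values() for f in feats}:
            if f in seed_features:                 # (1) not a seed feature
                continue
            label, score = score_feature(f, labeling)
            if label is None:                      # (2) must occur with labeled data
                continue
            if f in selected and abs(score - scores.get(f, 0.0)) < 1e-12:
                continue                           # (3) its score must have changed
            if score > best_score:
                best, best_score, best_label = f, score, label
        if best is None:
            break
        selected[best], scores[best] = best_label, best_score
        # Recompute labels only for the documents containing the chosen feature.
        for x, feats in examples.items():
            if best in feats:
                labeling[x] = label_of(feats)
    return selected, labeling
```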
Base Learners for YS • Biased towards the feature that maximizes raw precision (“anti-smoothing”)
Properties of YS-* • YS-P and YS-R reduce K in every iteration. • YS-FS reduces K in every iteration in which a new feature is selected.
Yarowsky versus Co-training • Co-training attempts to maximize agreement on unlabeled data between classifiers trained on different “views” of the data. • The modified Yarowsky algorithms introduced in this paper reduce the entropy H (or its upper bound K), similarly to co-training. • Co-training assumes at least two independent views of the data, and is hence more restricted.
YS versus Yangarber (1) • The document label indicator is hard: it is set to 1 if the document is accepted for the category, else 0 • This is NOT a probability distribution • The feature (pattern) scores are then recomputed
YS versus Yangarber (2) • Yangarber does not require the computation of Y, as its goal is to learn the patterns (features) relevant for each label (category) • This is a plus for Yangarber: committing to Yx = ŷ is a VERY strong statement in document classification, since it classifies a document based on the limited information available at the current iteration • Y can be computed as a side effect when the algorithm completes; this is used as an indirect evaluation.
YS versus Yangarber (3) • The base learner for Yangarber generates scores that are NOT probability distributions, which makes the algorithm hard to analyze formally! • θ_fj = raw_precision(f, j) * log(number of documents containing f) • The raw-precision factor is similar to YS-R (?)
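In code, the scoring rule quoted above might look like the following sketch (function and variable names are illustrative): the first factor is the same raw precision used by YS-R, and the log document-frequency factor is what keeps the result from being a probability distribution.

```python
import math

def yangarber_score(f, j, examples, labeling):
    """Sketch of the slide's rule: raw_precision(f, j) * log(#documents containing f).
    examples: doc_id -> set of features; labeling: doc_id -> label or None."""
    docs_with_f = [x for x, feats in examples.items() if f in feats]
    labeled = [x for x in docs_with_f if labeling.get(x) is not None]
    if not labeled:
        return 0.0
    raw_precision = sum(1 for x in labeled if labeling[x] == j) / len(labeled)
    return raw_precision * math.log(len(docs_with_f))   # not normalized over labels
```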
Bootstrapping versus K-Means and EM • K-Means and bootstrapping “hard”-classify objects in each iteration: Yx = ŷ. EM (and Yangarber) compute Y only in the last iteration. • I think K-Means and EM converge more rapidly because they accumulate features faster than bootstrapping. • In K-Means, essentially all features are in use after the first iteration. • In YS (and Yangarber) only one (or a very small number) of the features is selected in each iteration.
Conclusions • Abney’s simple modifications of the Yarowsky bootstrapping algorithm can be formally shown to converge to a local minimum (like EM) • Judged against this work, Yangarber (and Riloff) are far from the formalization required to show that they converge • Is there a better algorithm for pattern learning?