Analysis of Bootstrapping Algorithms
Seminar of Machine Learning for Text Mining, UPC, 18/11/2004
Mihai Surdeanu
Goals • Introduce Steven Abney’s “Understanding the Yarowsky Algorithm” paper (Computational Linguistics 30(3), 2004) • What are the bootstrapping algorithms covered and their properties? • Will skip the theorem proofs • What do they mean in the context of document clustering and pattern acquisition? • How do they compare with other iterative refinement clustering algorithms and with Yangarber 2003?
Notations
WSD: x – word, j – word sense, f – word/context feature
Clustering: x – document, j – category/domain, f – document feature (word, pattern)
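To make this notation concrete for the clustering reading, here is a tiny Python sketch (the names `Example`, `LABELS` and the sample values are illustrative, not from the paper): each object x is a document represented by its feature set F_x, each label j is a category, and the evolving labeling maps documents to a label or to None when still unlabeled.

```python
from dataclasses import dataclass

# Labels j: categories/domains (illustrative values).
LABELS = ["sports", "finance"]

@dataclass(frozen=True)
class Example:
    """An object x: a document represented by its feature set F_x
    (words or extraction patterns in the clustering reading,
    context features in the WSD reading)."""
    doc_id: str
    features: frozenset

# The current labeling Y: doc_id -> label j, or None if still unlabeled.
labeling = {"d1": "sports", "d2": None}
```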
Generic Yarowsky Algorithm (Y-0) • Needs a base learner • Changes an example’s label only if the prediction score exceeds an arbitrary threshold • Does not change the labels of seeds • Nothing formal can be shown about Y-0.
Modified Algorithm (Y-1) • A labeled example cannot become unlabeled again • Fixed threshold
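A minimal sketch of this bootstrapping loop, assuming a toy bag-of-features representation; `train_base_learner` stands in for whichever decision-list learner is plugged in (DL-0, DL-EM, ...), the threshold value is illustrative, and the code follows the Y-1 behaviour, with comments marking where Y-0 differs.

```python
def yarowsky(examples, seeds, train_base_learner, threshold=0.6, max_iters=50):
    """examples: dict doc_id -> set of features
    seeds: dict doc_id -> label (never changed)
    train_base_learner: callable(examples, labeling) -> predict(features) -> (label, score)
    Returns the final labeling (doc_id -> label or None)."""
    labeling = {x: seeds.get(x) for x in examples}          # None = unlabeled
    for _ in range(max_iters):
        predict = train_base_learner(examples, labeling)     # retrain on current labels
        new_labeling, changed = {}, False
        for x, feats in examples.items():
            if x in seeds:                                   # seeds are never relabeled
                new_labeling[x] = seeds[x]
                continue
            label, score = predict(feats)
            if score > threshold:                            # threshold: arbitrary in Y-0, fixed in Y-1
                new_labeling[x] = label
            else:
                # Y-0 would allow the label to be dropped here; Y-1 keeps the old label,
                # so a labeled example can never become unlabeled again.
                new_labeling[x] = labeling[x]
            changed |= new_labeling[x] != labeling[x]
        labeling = new_labeling
        if not changed:
            break
    return labeling
```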
Properties of Y-1 • If the base learner reduces the divergence on the labeled (or all) examples, algorithm Y-1 decreases H (cross entropy – equation (6)) at each iteration until it reaches a critical point of H
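For reference, a plausible form of the objective being decreased (a sketch in the spirit of the paper, not a reproduction of its equation (6)): H sums, over the labeled examples Λ, the cross entropy between the current labeling distribution φ_x (a point mass on x’s label) and the classifier’s prediction π_x.

```latex
H \;=\; \sum_{x \in \Lambda} H\!\left(\phi_x \,\|\, \pi_x\right)
  \;=\; -\sum_{x \in \Lambda} \sum_{j} \phi_x(j)\,\log \pi_x(j)
```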
The Original Decision List Induction Algorithm (DL-0) • Precision smoothed with an arbitrary value • Pick the label given by the rule with the best score • The resulting scores do NOT form a probability distribution! • Nothing formal can be shown about DL-0.
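A minimal sketch of this kind of decision-list learner, under the usual reading of smoothed precision (the smoothing constant `eps` is an assumption, standing in for the slide’s “arbitrary value”): each (feature, label) rule is scored by smoothed precision, and prediction picks the single best-scoring rule among the features present, so the scores need not sum to one.

```python
from collections import defaultdict

def train_dl0(examples, labeling, labels, eps=0.1):
    """examples: doc_id -> set of features; labeling: doc_id -> label or None."""
    count_fj = defaultdict(float)   # joint counts of (feature, label)
    count_f = defaultdict(float)    # feature counts over labeled examples
    for x, feats in examples.items():
        j = labeling.get(x)
        if j is None:
            continue
        for f in feats:
            count_fj[(f, j)] += 1.0
            count_f[f] += 1.0

    def score(f, j):
        # Smoothed precision of rule f -> j; eps is the arbitrary smoothing value.
        return (count_fj[(f, j)] + eps) / (count_f[f] + eps * len(labels))

    def predict(feats):
        # DL-0: the single strongest rule wins; the scores are not a distribution.
        best = max(((score(f, j), j) for f in feats for j in labels),
                   default=(0.0, None))
        return best[1], best[0]

    return predict
```

With a small wrapper such as `lambda ex, lab: train_dl0(ex, lab, LABELS)`, this predictor slots directly into the bootstrapping loop sketched earlier.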
The EM-based Decision List Algorithm (DL-EM) • A mixture of the feature scores is used to compute the prediction for x (see above). Because each feature score is a probability distribution over labels, the prediction is also a probability distribution. • Whereas in DL-0 the prediction is given by the “strongest” feature, here the algorithm permits a block of “weaker” features to outweigh the strongest feature. • DL-EM does not construct a classifier from scratch (like DL-0), but rather builds upon the previous classifier (the old feature scores and the old predictions for x).
The EM-based Decision List Algorithm (DL-EM) • Probability that feature f was responsible for label j for object x • Normalization over all features
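In symbols, the quantities described on this slide can be written as the following sketch (θ^old are the previous classifier’s scores and F_x is the feature set of x; this is a reading of the slide, not the paper’s exact equations): the responsibility of feature f for label j on example x is its old score normalized over x’s features, and the prediction is a uniform mixture of the feature scores.

```latex
P(f \mid x, j) \;=\; \frac{\theta^{\text{old}}_{fj}}{\sum_{f' \in F_x} \theta^{\text{old}}_{f'j}},
\qquad
\pi_x(j) \;=\; \frac{1}{|F_x|} \sum_{f \in F_x} \theta_{fj}
```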
Algorithm DL-EM-Λ • What are the initial (iteration 0) parameters??? • A similar algorithm exists when the feature scores are computed over all examples, not just the labeled ones: DL-EM-V.
Properties of DL-EM-* • Y-1/DL-EM-Λ and Y-1/DL-EM-V decrease H at each iteration until they reach a critical point of H (a local minimum).
Algorithm DL-1-R • “Raw” precision • Mixture of feature scores
Algorithm DL-1-VS • Precision with variable smoothing for each feature • Mixture of feature scores
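A minimal sketch contrasting the two scoring rules on these two slides (all names are illustrative, and the choice of the variable smoothing amount is an assumption): DL-1-R uses raw precision over the labeled examples, DL-1-VS smooths each feature’s precision by a feature-dependent amount, and both predict with a mixture (average) of feature scores rather than the single strongest rule.

```python
from collections import defaultdict

def train_dl1(examples, labeling, labels, variant="R"):
    """variant='R': raw precision; variant='VS': variable smoothing per feature."""
    count_fj = defaultdict(float)
    count_f = defaultdict(float)
    n_unlabeled_with_f = defaultdict(float)
    for x, feats in examples.items():
        j = labeling.get(x)
        for f in feats:
            if j is None:
                n_unlabeled_with_f[f] += 1.0
            else:
                count_fj[(f, j)] += 1.0
                count_f[f] += 1.0

    def score(f, j):
        if variant == "R":
            # Raw precision over labeled examples containing f.
            return count_fj[(f, j)] / count_f[f] if count_f[f] else 1.0 / len(labels)
        # "VS": smoothing amount depends on the feature; here (as an assumption)
        # it grows with the number of unlabeled examples that contain f.
        eps = n_unlabeled_with_f[f] / len(labels) + 1e-9
        return (count_fj[(f, j)] + eps) / (count_f[f] + eps * len(labels))

    def predict(feats):
        # Mixture of feature scores: average over the features present.
        totals = {j: sum(score(f, j) for f in feats) / max(len(feats), 1) for j in labels}
        best = max(totals, key=totals.get)
        return best, totals[best]

    return predict
```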
Properties of DL-1-* • Y-1/DL-1-R minimizes K (an upper bound on H) over the labeled examples • Y-1/DL-1-VS minimizes K over all examples X
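As a rough reading (an assumption, not the paper’s exact definition), K can be seen as the bound obtained by pulling the logarithm inside the mixture π_x(j) = (1/|F_x|) Σ_{f∈F_x} θ_fj via Jensen’s inequality:

```latex
H \;\le\; K \;=\; \sum_{x}\sum_{j}\phi_x(j)\,\frac{1}{|F_x|}\sum_{f\in F_x}\log\frac{1}{\theta_{fj}}
```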
So far… • Y-0/DL-0 – the original Yarowsky algorithm. Cannot be shown to minimize H or K. • Y-1/DL-EM-Λ and Y-1/DL-EM-V minimize H • Y-1/DL-1-R and Y-1/DL-1-VS minimize K
Sequential Algorithms • All previous algorithms do “parallel” updates, in the sense that all the parameters θ_fj are recomputed at every iteration. • Sequential algorithms select one feature at each iteration: S(t+1) = S(t) ∪ {f_t} • Only the score of the selected feature and the scores of the documents containing it are recomputed. • More flexible – shown to converge for more base learners.
Algorithm YS • Choose a feature that: (1) is not a seed feature, (2) occurs in the labeled (training) data, (3) has a score that changed
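A minimal sketch of the sequential loop with these three selection criteria marked in comments; `score_feature` stands in for whichever base learner (YS-P, YS-R, YS-FS) is plugged in, and the majority-vote labeling rule is an assumption made for illustration.

```python
def ys_bootstrap(examples, seed_features, score_feature, max_iters=100):
    """examples: doc_id -> set of features
    seed_features: dict feature -> label (the initial rules in S)
    score_feature: callable(f, labeling) -> (label, score), the pluggable base learner."""
    selected = dict(seed_features)                 # S: feature -> label
    scores = {f: 1.0 for f in seed_features}

    def label_of(feats):
        # Label a document from the currently selected features (majority vote here).
        votes = [selected[f] for f in feats if f in selected]
        return max(set(votes), key=votes.count) if votes else None

    labeling = {x: label_of(feats) for x, feats in examples.items()}
    for _ in range(max_iters):
        best, best_score, best_label = None, 0.0, None
        for f in {f for feats in examples.values() for f in feats}:
            if f in seed_features:                 # (1) not a seed feature
                continue
            label, score = score_feature(f, labeling)
            if label is None:                      # (2) must occur with labeled data
                continue
            if f in selected and abs(score - scores.get(f, 0.0)) < 1e-12:
                continue                           # (3) its score must have changed
            if score > best_score:
                best, best_score, best_label = f, score, label
        if best is None:
            break
        selected[best], scores[best] = best_label, best_score
        # Recompute labels only for the documents containing the chosen feature.
        for x, feats in examples.items():
            if best in feats:
                labeling[x] = label_of(feats)
    return selected, labeling
```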
Base Learners for YS • Biased towards the feature that maximizes raw precision (“anti-smoothing”)
Properties of YS-* • YS-P and YS-R reduce K in every iteration. • YS-FS reduces K in every iteration in which a new feature is selected.
Yarowsky versus Co-training • Co-training attempts to maximize agreement on unlabeled data between classifiers trained on different “views” of the data. • The modified Yarowsky algorithms introduced in this paper reduce the entropy H (or its upper bound K), similarly to co-training. • Co-training assumes at least two independent views of the data, and is hence more restricted.
YS versus Yangarber (1) • The document label indicator is hard: it is set to 1 if the document is accepted for the category, else 0 • This is NOT a probability distribution • The feature (pattern) scores are then recomputed
YS versus Yangarber (2) • Yangarber does not require the computation of Y, as its goal is to learn the patterns (features) relevant for each label (category) • This is a plus for Yangarber: committing to Yx = ŷ is a VERY strong statement in document classification, since it classifies a document based on the limited information available at the current iteration • Y can be computed as a side effect when the algorithm completes; this is used as an indirect evaluation.
YS versus Yangarber (3) • The base learner for Yangarber generates scores that are NOT probability distributions, which makes the algorithm hard to analyze formally! • θ_fj = raw_precision(f, j) * log(number of documents containing f) • The raw-precision factor is similar to YS-R (?)
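In code, the scoring rule quoted above might look like the following sketch (function and variable names are illustrative): the first factor is the same raw precision used by YS-R, and the log document-frequency factor is what keeps the result from being a probability distribution.

```python
import math

def yangarber_score(f, j, examples, labeling):
    """Sketch of the slide's rule: raw_precision(f, j) * log(#documents containing f).
    examples: doc_id -> set of features; labeling: doc_id -> label or None."""
    docs_with_f = [x for x, feats in examples.items() if f in feats]
    labeled = [x for x in docs_with_f if labeling.get(x) is not None]
    if not labeled:
        return 0.0
    raw_precision = sum(1 for x in labeled if labeling[x] == j) / len(labeled)
    return raw_precision * math.log(len(docs_with_f))   # not normalized over labels
```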
Bootstrapping versus K-Means and EM • K-Means and bootstrapping “hard”-classify objects in each iteration: Yx = ŷ. EM (and Yangarber) compute Y only in the last iteration. • I think K-Means and EM converge more rapidly because they accumulate features faster than bootstrapping. • In K-Means, essentially all features are in use after the first iteration. • In YS (and Yangarber) only one (or a very small number) of the features is selected in each iteration.
Conclusions • Abney’s simple modifications of the Yarowsky bootstrapping algorithm can be formally shown to converge to a local minimum (like EM) • Judged against this work, Yangarber (and Riloff) are far from the formalization required to show that they converge • Is there a better algorithm for pattern learning?