Final review LING 572 Fei Xia 03/07/06
Misc • Parts 3 and 4 were due at 6am today. • Presentation: email me the slides by 6am on 3/9 • Final report: email me by 6am on 3/14. • Group meetings: 1:30-4:00pm on 3/16.
Outline • Main topics • Applying to NLP tasks • Tricks
Main topics • Supervised learning • Decision tree • Decision list • TBL • MaxEnt • Boosting • Semi-supervised learning • Self-training • Co-training • EM • Co-EM
Main topics (cont) • Unsupervised learning • The EM algorithm • The EM algorithm for PM models • Forward-backward • Inside-outside • IBM models for MT • Others • Two dynamic models: FSA and HMM • Re-sampling: bootstrap • System combination • Bagging
Main topics (cont) • Homework • Hw1: FSA and HMM • Hw2: DT, DL, CNF, DNF, and TBL • Hw3: Boosting • Project: • P1: Trigram (learn to use Carmel, relation between HMM and FSA) • P2: TBL • P3: MaxEnt • P4: Bagging, boosting, system combination, SSL
Classification and estimation problems • Given • x: input attributes • y: the goal • training data: a set of (x, y) pairs • Predict y given a new x: • y is a discrete variable → classification problem • y is a continuous variable → estimation problem
Five ML methods • Decision tree • Decision list • TBL • Boosting • MaxEnt
Decision tree • Modeling: tree representation • Training: top-down induction, greedy algorithm • Decoding: find the path from root to a leaf node, where the tests along the path are satisfied.
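The modeling/training/decoding split above can be made concrete with a small sketch. Below is a minimal, illustrative Python implementation of greedy top-down induction with information gain, plus root-to-leaf decoding; the toy attributes and data are made up for the example and do not come from the course materials.

```python
# Minimal sketch of top-down decision-tree induction with information gain
# (illustrative only; real learners like C4.5 add pruning, missing values, etc.).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, attr):
    labels = [y for _, y in data]
    n = len(data)
    remainder = 0.0
    for v in set(x[attr] for x, _ in data):
        subset = [y for x, y in data if x[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def grow(data, attrs):
    labels = [y for _, y in data]
    if len(set(labels)) == 1 or not attrs:            # pure node or no attributes left
        return Counter(labels).most_common(1)[0][0]   # leaf: majority label
    best = max(attrs, key=lambda a: info_gain(data, a))   # greedy choice of best attribute
    children = {}
    for v in set(x[best] for x, _ in data):
        subset = [(x, y) for x, y in data if x[best] == v]
        children[v] = grow(subset, [a for a in attrs if a != best])
    return (best, children)

def decode(tree, x):
    while isinstance(tree, tuple):        # follow the path from root to a leaf
        attr, children = tree
        tree = children[x[attr]]
    return tree

# toy example with hypothetical attributes
data = [({"cap": 1, "suffix": "ing"}, "V"), ({"cap": 0, "suffix": "ing"}, "V"),
        ({"cap": 1, "suffix": "ed"}, "V"), ({"cap": 1, "suffix": "s"}, "N"),
        ({"cap": 0, "suffix": "s"}, "N")]
tree = grow(data, ["cap", "suffix"])
print(decode(tree, {"cap": 0, "suffix": "ing"}))   # -> "V"
```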
Decision tree (cont) • Main algorithms: ID3, C4.5, CART • Strengths: • Ability to generate understandable rules • Ability to clearly indicate the best attributes • Weaknesses: • Data splitting • Trouble with non-rectangular regions • The instability of top-down induction → motivates bagging
Decision list • Modeling: a list of decision rules • Training: greedy, iterative algorithm • Decoding: find the 1st rule that applies • Each decision is based on a single piece of evidence, in contrast to MaxEnt, boosting, TBL
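A minimal sketch of this scheme follows: training greedily picks one rule at a time, and decoding returns the label of the first rule whose test matches. The purity-based scoring here is a hypothetical simplification (real DL learners use other quality measures and also consider coverage).

```python
# Minimal decision-list sketch: each rule tests a single (attribute, value)
# pair; decoding applies the first rule that matches (illustrative only).
from collections import Counter

def learn_dl(data):
    rules, remaining = [], list(data)
    while remaining:
        # candidate tests: every (attribute, value) seen in the remaining data
        candidates = {(a, v) for x, _ in remaining for a, v in x.items()}
        best, best_score, best_label = None, -1.0, None
        for a, v in candidates:
            covered = [y for x, y in remaining if x[a] == v]
            label, count = Counter(covered).most_common(1)[0]
            score = count / len(covered)     # greedy: pick the purest test
            if score > best_score:
                best, best_score, best_label = (a, v), score, label
        rules.append((best, best_label))
        remaining = [(x, y) for x, y in remaining if x[best[0]] != best[1]]
    # default rule at the end of the list
    rules.append((None, Counter(y for _, y in data).most_common(1)[0][0]))
    return rules

def decode_dl(rules, x):
    for test, label in rules:
        if test is None or x.get(test[0]) == test[1]:   # first rule that applies wins
            return label
```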
TBL • Modeling: a list of transformations (similar to decision rules) • Training: • Greedy, iterative algorithm • The concept of current state • Decoding: apply every transformation to the data
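The sketch below illustrates the "current state" idea: training repeatedly picks the transformation with the largest net error reduction on the current labeling, and decoding replays the learned transformations in order. The transformation template (rewrite a label when a trigger word matches) is a made-up simplification of real TBL templates.

```python
# Minimal TBL sketch (illustrative).  A transformation is a triple
# (trigger_word, from_label, to_label).
def apply_transform(labels, words, transform):
    trigger, old, new = transform
    return [new if w == trigger and y == old else y
            for w, y in zip(words, labels)]

def tbl_train(words, gold, initial_labels, candidate_transforms, min_gain=1):
    current = list(initial_labels)          # the "current state" of the data
    learned = []
    while True:
        def gain(t):                        # net reduction in errors if t is applied
            new = apply_transform(current, words, t)
            return sum(n == g for n, g in zip(new, gold)) - \
                   sum(c == g for c, g in zip(current, gold))
        best = max(candidate_transforms, key=gain)
        if gain(best) < min_gain:           # stop when no transformation helps enough
            break
        learned.append(best)
        current = apply_transform(current, words, best)
    return learned

def tbl_decode(words, initial_labels, learned):
    labels = list(initial_labels)
    for t in learned:                       # apply every transformation, in order
        labels = apply_transform(labels, words, t)
    return labels
```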
TBL (cont) • Strengths: • Minimizing error rate directly • Ability to handle non-classification problem • Dynamic problem: POS tagging • Non-classification problem: parsing • Weaknesses: • Transformations are hard to interpret as they interact with one another • Probabilistic TBL: TBL-DT
Boosting [Diagram: the original training sample and successively re-weighted samples are each fed to the base learner (ML), producing weak classifiers f1, f2, …, fT, which are combined into the final classifier f.]
Boosting (cont) • Modeling: combining a set of weak classifiers to produce a powerful committee. • Training: learn one classifier at each iteration • Decoding: use the weighted majority vote of the weak classifiers
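A compact AdaBoost sketch of this loop is shown below. The `base_learner` interface (takes examples, labels, and example weights; returns a classifier mapping x to {-1, +1}) is an assumption for the example, not part of the course code.

```python
# Minimal AdaBoost sketch for binary labels in {-1, +1} (illustrative only).
import math

def adaboost_train(X, y, base_learner, T):
    n = len(X)
    w = [1.0 / n] * n                      # uniform example weights
    ensemble = []
    for _ in range(T):
        h = base_learner(X, y, w)          # learn one weak classifier per iteration
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # re-weight: increase the weight of misclassified examples
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def adaboost_decode(ensemble, x):
    # weighted majority vote of the weak classifiers
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1
```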
Boosting (cont) • Strengths • It comes with a set of theoretical guarantees (e.g., bounds on training error and test error). • It only needs to find weak classifiers. • Weaknesses: • It is susceptible to noise. • The actual performance depends on the data and the base learner.
MaxEnt • The task: find p* s.t. p* = argmax_{p ∈ P} H(p), where P = {p : E_p f_j = E_{p̃} f_j, j = 1, …, k} • If p* exists, it has the exponential (log-linear) form p*(x) = (1/Z) exp(Σ_j λ_j f_j(x))
MaxEnt (cont) • If p* exists, then p* = argmax_{q ∈ Q} L(q), where Q is the family of models with the exponential form above and L(q) is the (log-)likelihood of the training data; i.e., the MaxEnt solution coincides with the maximum-likelihood solution within that family.
MaxEnt (cont) • Training: GIS, IIS • Feature selection: • Greedy algorithm • Select one (or more) at a time • In general, MaxEnt achieves good performance on many NLP tasks.
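To make the exponential model and GIS training concrete, here is a minimal sketch for a conditional classifier p(y|x) ∝ exp(Σ_j λ_j f_j(x, y)). It assumes binary features and that every (x, y) pair activates the same number of features C, which is what plain GIS requires (real implementations add a correction feature instead); the `feats(x, y)` interface is hypothetical.

```python
# Sketch of a conditional MaxEnt model with a plain GIS training loop
# (illustrative; assumes a constant number C of active binary features per (x, y)).
import math
from collections import defaultdict

def p_y_given_x(lambdas, feats, x, labels):
    scores = {y: math.exp(sum(lambdas[f] for f in feats(x, y))) for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def gis_train(data, feats, labels, iterations=50):
    C = len(feats(data[0][0], data[0][1]))        # assumed constant per (x, y)
    observed = defaultdict(float)                 # observed feature counts
    for x, y in data:
        for f in feats(x, y):
            observed[f] += 1.0
    lambdas = defaultdict(float)
    for _ in range(iterations):
        expected = defaultdict(float)             # model-expected feature counts
        for x, _ in data:
            dist = p_y_given_x(lambdas, feats, x, labels)
            for y, p in dist.items():
                for f in feats(x, y):
                    expected[f] += p
        for f in observed:                        # GIS update
            lambdas[f] += (1.0 / C) * math.log(observed[f] / expected[f])
    return lambdas
```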
Common issues • Objective function / Quality measure: • DT, DL: e.g., information gain • TBL, Boosting: minimize training errors • MaxEnt: maximize entropy while satisfying constraints
Common issues (cont) • Avoiding overfitting • Use development data • Two strategies: • stop early • post-pruning
Common issues (cont) • Missing attribute values: • Assume a “blank” value • Assign the most common value among all “similar” examples in the training data • (DL, DT): Assign a fraction of the example to each possible class. • Continuous-valued attributes • Choose thresholds by checking the training data
Common issues (cont) • Attributes with different costs • DT: Change the quality measure to include the costs • Continuous-valued goal attribute • DT, DL: each “leaf” node is marked with a real value or a linear function • TBL, MaxEnt, Boosting: ??
Semi-supervised learning • Each learning method makes some assumptions about the problem. • SSL works when those assumptions are satisfied. • SSL could degrade the performance when mistakes reinforce themselves.
SSL (cont) • We have covered four methods: self-training, co-training, EM, co-EM
Co-training • The original paper: (Blum and Mitchell, 1998) • Two “independent” views: split the features into two sets. • Train a classifier on each view. • Each classifier labels data that can be used to train the other classifier. • Extension: • Relax the conditional independence assumptions • Instead of using two views, use two or more classifiers trained on the whole feature set.
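The sketch below illustrates the basic co-training loop under those assumptions: two classifiers, one per view, each labeling its most confident unlabeled examples to grow the other's training set. The `train(examples)` and `predict(classifier, view) -> (label, confidence)` interfaces are hypothetical placeholders, not a specific library API.

```python
# Minimal co-training sketch (in the spirit of Blum & Mitchell 1998, simplified).
def cotrain(labeled, unlabeled, train, predict, rounds=10, k=5):
    # labeled: list of (view1, view2, label); unlabeled: list of (view1, view2)
    l1 = [(v1, y) for v1, _, y in labeled]
    l2 = [(v2, y) for _, v2, y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        c1, c2 = train(l1), train(l2)           # one classifier per view
        # each classifier picks the k unlabeled examples it is most confident about
        by_c1 = sorted(pool, key=lambda u: predict(c1, u[0])[1], reverse=True)[:k]
        by_c2 = sorted(pool, key=lambda u: predict(c2, u[1])[1], reverse=True)[:k]
        for v1, v2 in by_c1:                    # c1's labels feed c2's training set
            l2.append((v2, predict(c1, v1)[0]))
        for v1, v2 in by_c2:                    # c2's labels feed c1's training set
            l1.append((v1, predict(c2, v2)[0]))
        used = {id(u) for u in by_c1 + by_c2}
        pool = [u for u in pool if id(u) not in used]
    return train(l1), train(l2)
```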
Unsupervised learning • EM is a method of estimating parameters in the MLE framework. • It produces a sequence of parameter estimates that improves the likelihood of the training data.
The EM algorithm • Start with an initial estimate θ_0 • Repeat until convergence • E-step: calculate Q(θ | θ_t) = E_{z | x, θ_t}[ log P(x, z | θ) ] • M-step: find θ_{t+1} = argmax_θ Q(θ | θ_t)
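Here is a minimal EM sketch on a toy problem, a mixture of two biased coins, where the coin identity is the hidden variable. The model and data are made up for the illustration; the same E-step/M-step structure underlies the PM-model instances on the next slide.

```python
# Minimal EM sketch: mixture of two biased coins (illustrative only).
# Each observation is the number of heads in `flips` tosses of one coin,
# chosen at random; which coin was used is hidden.
import math

def binom(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def em_coins(counts, flips, iters=50):
    p = [0.4, 0.6]                         # initial estimates theta_0
    mix = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability that each observation came from each coin
        post = []
        for k in counts:
            joint = [mix[z] * binom(k, flips, p[z]) for z in (0, 1)]
            total = sum(joint)
            post.append([j / total for j in joint])
        # M-step: re-estimate parameters from expected counts
        for z in (0, 1):
            weight = sum(r[z] for r in post)
            heads = sum(r[z] * k for r, k in zip(post, counts))
            p[z] = heads / (weight * flips)
            mix[z] = weight / len(counts)
    return p, mix

print(em_coins([9, 8, 1, 2, 9, 1], flips=10))   # estimates separate the two coins
```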
The EM algorithm (cont) • The optimal solution for the M-step exists for many classes of problems. A number of well-known methods are special cases of EM. • The EM algorithm for PM models • Forward-backward algorithm • Inside-outside algorithm • …
FSA and HMM • Two types of HMMs: • State-emission and arc-emission HMMs • They are equivalent • We can convert an HMM into a WFSA • Modeling: Markov assumption • Training: • Supervised: counting • Unsupervised: forward-backward algorithm • Decoding: Viterbi algorithm
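Viterbi decoding is easy to show in a short sketch. The probabilities below are hypothetical toy values; a real tagger would work in log space, handle unknown words, and take its parameters from counting (supervised) or forward-backward (unsupervised).

```python
# Minimal Viterbi decoding sketch for an HMM (illustrative only).
def viterbi(obs, states, start_p, trans_p, emit_p):
    # delta[t][s]: probability of the best path ending in state s at time t
    delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda r: delta[t - 1][r] * trans_p[r][s])
            delta[t][s] = delta[t - 1][best_prev] * trans_p[best_prev][s] * emit_p[s][obs[t]]
            back[t][s] = best_prev
    # follow back-pointers from the best final state
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# toy example (hypothetical probabilities)
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"flies": 0.4, "like": 0.1, "flowers": 0.5},
          "V": {"flies": 0.5, "like": 0.4, "flowers": 0.1}}
print(viterbi(["flies", "like", "flowers"], states, start_p, trans_p, emit_p))
```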
Bootstrap [Diagram: B bootstrap samples are drawn from the original sample; the learner (ML) is trained on each, producing f1, f2, …, fB, which are combined into f.]
Bootstrap (cont) • A method of re-sampling: • One original sample → B bootstrap samples • It has a strong mathematical background. • It is a method for estimating standard errors, bias, and so on.
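A minimal sketch of the re-sampling idea, here used to estimate the standard error of a statistic; the accuracy scores are made-up example data.

```python
# Minimal bootstrap sketch: estimate the standard error of a statistic by
# resampling with replacement from the one original sample (illustrative).
import random
import statistics

def bootstrap_se(sample, statistic, B=1000, seed=0):
    rng = random.Random(seed)
    estimates = []
    for _ in range(B):
        # one bootstrap sample = n draws with replacement from the original sample
        resample = [rng.choice(sample) for _ in sample]
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)

data = [78.2, 81.5, 79.9, 83.1, 80.4, 77.8]       # made-up accuracy scores
print(bootstrap_se(data, statistics.mean))        # estimated standard error of the mean
```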
System combination [Diagram: different learners ML1, ML2, …, MLB each produce a classifier f1, f2, …, fB; their outputs are combined into f.]
System combination (cont) • Hybridization: combine substructures to produce a new one. • Voting • Naïve Bayes • Switching: choose one of the fi(x) • Similarity switching • Naïve Bayes
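The simplest combination scheme, unweighted majority voting over the individual classifiers' outputs, can be sketched in a few lines (illustrative; the course also covers weighted/Naïve Bayes variants and switching, which are not shown here).

```python
# Minimal sketch of combining classifier outputs by unweighted majority voting;
# ties are broken arbitrarily by Counter.most_common (illustrative only).
from collections import Counter

def vote(classifiers, x):
    predictions = [f(x) for f in classifiers]      # each f_i proposes a label
    return Counter(predictions).most_common(1)[0][0]

# toy usage with three hypothetical classifiers
f1 = lambda x: "N"
f2 = lambda x: "V"
f3 = lambda x: "N"
print(vote([f1, f2, f3], "flies"))                 # -> "N"
```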
Bagging [Diagram: bootstrap samples of the training data are fed to the same learner (ML), producing f1, f2, …, fB, which are combined into f.] • Bagging = bootstrap + system combination
Bagging (cont) • It is effective for unstable learning methods: • Decision tree • Regression tree • Neural network • It does not help stable learning methods • K-nearest neighbors
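Putting the two pieces together, a minimal bagging sketch: train the same learner on B bootstrap samples and decode by majority vote. The `learner(data)` interface (returns a classifier f(x) → label) is an assumption for the example.

```python
# Minimal bagging sketch = bootstrap sampling + majority-vote combination
# (illustrative only).
import random
from collections import Counter

def bagging_train(data, learner, B=25, seed=0):
    rng = random.Random(seed)
    classifiers = []
    for _ in range(B):
        boot = [rng.choice(data) for _ in data]    # bootstrap sample of the same size
        classifiers.append(learner(boot))
    return classifiers

def bagging_decode(classifiers, x):
    votes = [f(x) for f in classifiers]            # combine by majority vote
    return Counter(votes).most_common(1)[0][0]
```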
Relations • WFSA and HMM • DL, DT, TBL • EM, EM for PM
WFSA and HMM [Diagram: an HMM augmented with "Start" and "Finish" states] • Add a "Start" state and a transition from "Start" to each state in the HMM. • Add a "Finish" state and a transition from each state in the HMM to "Finish".
DL, DT, CNF, DNF, TBL [Diagram: relations among k-DL, k-CNF, k-DNF, k-DT, and k-TBL]
The EM algorithm [Diagram: the generalized EM algorithm contains the EM algorithm, which contains the EM algorithm for PM models (Forward-backward, Inside-Outside, the IBM models) as well as other instances such as Gaussian mixtures.]
Issues • Modeling: represent the problem as a formula and decompose the formula into a function of parameters • Training: estimate model parameters • Decoding: find the best answer given the parameters • Other issues: • Preprocessing • Postprocessing • Evaluation • …