Final review LING 572 Fei Xia 03/07/06
Misc • Parts 3 and 4 were due at 6am today. • Presentation: email me the slides by 6am on 3/9 • Final report: email me by 6am on 3/14. • Group meetings: 1:30-4:00pm on 3/16.
Outline • Main topics • Applying to NLP tasks • Tricks
Main topics • Supervised learning • Decision tree • Decision list • TBL • MaxEnt • Boosting • Semi-supervised learning • Self-training • Co-training • EM • Co-EM
Main topics (cont) • Unsupervised learning • The EM algorithm • The EM algorithm for PM models • Forward-backward • Inside-outside • IBM models for MT • Others • Two dynamic models: FSA and HMM • Re-sampling: bootstrap • System combination • Bagging
Main topics (cont) • Homework • Hw1: FSA and HMM • Hw2: DT, DL, CNF, DNF, and TBL • Hw3: Boosting • Project: • P1: Trigram (learn to use Carmel, relation between HMM and FSA) • P2: TBL • P3: MaxEnt • P4: Bagging, boosting, system combination, SSL
Classification and estimation problems • Given • x: input attributes • y: the goal • training data: a set of (x, y) pairs • Predict y given a new x: • y is a discrete variable → classification problem • y is a continuous variable → estimation problem
Five ML methods • Decision tree • Decision list • TBL • Boosting • MaxEnt
Decision tree • Modeling: tree representation • Training: top-down induction, greedy algorithm • Decoding: find the path from root to a leaf node, where the tests along the path are satisfied.
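The modeling/training/decoding split above can be made concrete with a small sketch. Below is a minimal, illustrative Python implementation of greedy top-down induction with information gain, plus root-to-leaf decoding; the toy attributes and data are made up for the example and do not come from the course materials.

```python
# Minimal sketch of top-down decision-tree induction with information gain
# (illustrative only; real learners like C4.5 add pruning, missing values, etc.).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, attr):
    labels = [y for _, y in data]
    n = len(data)
    remainder = 0.0
    for v in set(x[attr] for x, _ in data):
        subset = [y for x, y in data if x[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def grow(data, attrs):
    labels = [y for _, y in data]
    if len(set(labels)) == 1 or not attrs:            # pure node or no attributes left
        return Counter(labels).most_common(1)[0][0]   # leaf: majority label
    best = max(attrs, key=lambda a: info_gain(data, a))   # greedy choice of best attribute
    children = {}
    for v in set(x[best] for x, _ in data):
        subset = [(x, y) for x, y in data if x[best] == v]
        children[v] = grow(subset, [a for a in attrs if a != best])
    return (best, children)

def decode(tree, x):
    while isinstance(tree, tuple):        # follow the path from root to a leaf
        attr, children = tree
        tree = children[x[attr]]
    return tree

# toy example with hypothetical attributes
data = [({"cap": 1, "suffix": "ing"}, "V"), ({"cap": 0, "suffix": "ing"}, "V"),
        ({"cap": 1, "suffix": "ed"}, "V"), ({"cap": 1, "suffix": "s"}, "N"),
        ({"cap": 0, "suffix": "s"}, "N")]
tree = grow(data, ["cap", "suffix"])
print(decode(tree, {"cap": 0, "suffix": "ing"}))   # -> "V"
```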
Decision tree (cont) • Main algorithms: ID3, C4.5, CART • Strengths: • Ability to generate understandable rules • Ability to clearly indicate the best attributes • Weaknesses: • Data splitting • Trouble with non-rectangular regions • The instability of top-down induction → motivates bagging
Decision list • Modeling: a list of decision rules • Training: greedy, iterative algorithm • Decoding: find the 1st rule that applies • Each decision is based on a single piece of evidence, in contrast to MaxEnt, boosting, TBL
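A minimal sketch of this scheme follows: training greedily picks one rule at a time, and decoding returns the label of the first rule whose test matches. The purity-based scoring here is a hypothetical simplification (real DL learners use other quality measures and also consider coverage).

```python
# Minimal decision-list sketch: each rule tests a single (attribute, value)
# pair; decoding applies the first rule that matches (illustrative only).
from collections import Counter

def learn_dl(data):
    rules, remaining = [], list(data)
    while remaining:
        # candidate tests: every (attribute, value) seen in the remaining data
        candidates = {(a, v) for x, _ in remaining for a, v in x.items()}
        best, best_score, best_label = None, -1.0, None
        for a, v in candidates:
            covered = [y for x, y in remaining if x[a] == v]
            label, count = Counter(covered).most_common(1)[0]
            score = count / len(covered)     # greedy: pick the purest test
            if score > best_score:
                best, best_score, best_label = (a, v), score, label
        rules.append((best, best_label))
        remaining = [(x, y) for x, y in remaining if x[best[0]] != best[1]]
    # default rule at the end of the list
    rules.append((None, Counter(y for _, y in data).most_common(1)[0][0]))
    return rules

def decode_dl(rules, x):
    for test, label in rules:
        if test is None or x.get(test[0]) == test[1]:   # first rule that applies wins
            return label
```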
TBL • Modeling: a list of transformations (similar to decision rules) • Training: • Greedy, iterative algorithm • The concept of current state • Decoding: apply every transformation to the data
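The sketch below illustrates the "current state" idea: training repeatedly picks the transformation with the largest net error reduction on the current labeling, and decoding replays the learned transformations in order. The transformation template (rewrite a label when a trigger word matches) is a made-up simplification of real TBL templates.

```python
# Minimal TBL sketch (illustrative).  A transformation is a triple
# (trigger_word, from_label, to_label).
def apply_transform(labels, words, transform):
    trigger, old, new = transform
    return [new if w == trigger and y == old else y
            for w, y in zip(words, labels)]

def tbl_train(words, gold, initial_labels, candidate_transforms, min_gain=1):
    current = list(initial_labels)          # the "current state" of the data
    learned = []
    while True:
        def gain(t):                        # net reduction in errors if t is applied
            new = apply_transform(current, words, t)
            return sum(n == g for n, g in zip(new, gold)) - \
                   sum(c == g for c, g in zip(current, gold))
        best = max(candidate_transforms, key=gain)
        if gain(best) < min_gain:           # stop when no transformation helps enough
            break
        learned.append(best)
        current = apply_transform(current, words, best)
    return learned

def tbl_decode(words, initial_labels, learned):
    labels = list(initial_labels)
    for t in learned:                       # apply every transformation, in order
        labels = apply_transform(labels, words, t)
    return labels
```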
TBL (cont) • Strengths: • Minimizing error rate directly • Ability to handle non-classification problem • Dynamic problem: POS tagging • Non-classification problem: parsing • Weaknesses: • Transformations are hard to interpret as they interact with one another • Probabilistic TBL: TBL-DT
Boosting [Diagram: the original training sample and successively re-weighted samples are each fed to the base learner (ML), producing weak classifiers f1, f2, …, fT, which are combined into the final classifier f.]
Boosting (cont) • Modeling: combining a set of weak classifiers to produce a powerful committee. • Training: learn one classifier at each iteration • Decoding: use the weighted majority vote of the weak classifiers
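A compact AdaBoost sketch of this loop is shown below. The `base_learner` interface (takes examples, labels, and example weights; returns a classifier mapping x to {-1, +1}) is an assumption for the example, not part of the course code.

```python
# Minimal AdaBoost sketch for binary labels in {-1, +1} (illustrative only).
import math

def adaboost_train(X, y, base_learner, T):
    n = len(X)
    w = [1.0 / n] * n                      # uniform example weights
    ensemble = []
    for _ in range(T):
        h = base_learner(X, y, w)          # learn one weak classifier per iteration
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # re-weight: increase the weight of misclassified examples
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def adaboost_decode(ensemble, x):
    # weighted majority vote of the weak classifiers
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1
```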
Boosting (cont) • Strengths • It comes with a set of theoretical guarantees (e.g., bounds on training error and test error). • It only needs to find weak classifiers. • Weaknesses: • It is susceptible to noise. • The actual performance depends on the data and the base learner.
MaxEnt • The task: find p* s.t. p* = argmax_{p ∈ P} H(p), where P = {p : E_p f_j = E_{p̃} f_j, j = 1, …, k} • If p* exists, it has the exponential (log-linear) form p*(x) = (1/Z) exp(Σ_j λ_j f_j(x))
MaxEnt (cont) • If p* exists, then p* = argmax_{q ∈ Q} L(q), where Q is the family of models with the exponential form above and L(q) is the (log-)likelihood of the training data; i.e., the MaxEnt solution coincides with the maximum-likelihood solution within that family.
MaxEnt (cont) • Training: GIS, IIS • Feature selection: • Greedy algorithm • Select one (or more) at a time • In general, MaxEnt achieves good performance on many NLP tasks.
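To make the exponential model and GIS training concrete, here is a minimal sketch for a conditional classifier p(y|x) ∝ exp(Σ_j λ_j f_j(x, y)). It assumes binary features and that every (x, y) pair activates the same number of features C, which is what plain GIS requires (real implementations add a correction feature instead); the `feats(x, y)` interface is hypothetical.

```python
# Sketch of a conditional MaxEnt model with a plain GIS training loop
# (illustrative; assumes a constant number C of active binary features per (x, y)).
import math
from collections import defaultdict

def p_y_given_x(lambdas, feats, x, labels):
    scores = {y: math.exp(sum(lambdas[f] for f in feats(x, y))) for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def gis_train(data, feats, labels, iterations=50):
    C = len(feats(data[0][0], data[0][1]))        # assumed constant per (x, y)
    observed = defaultdict(float)                 # observed feature counts
    for x, y in data:
        for f in feats(x, y):
            observed[f] += 1.0
    lambdas = defaultdict(float)
    for _ in range(iterations):
        expected = defaultdict(float)             # model-expected feature counts
        for x, _ in data:
            dist = p_y_given_x(lambdas, feats, x, labels)
            for y, p in dist.items():
                for f in feats(x, y):
                    expected[f] += p
        for f in observed:                        # GIS update
            lambdas[f] += (1.0 / C) * math.log(observed[f] / expected[f])
    return lambdas
```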
Common issues • Objective function / Quality measure: • DT, DL: e.g., information gain • TBL, Boosting: minimize training errors • MaxEnt: maximize entropy while satisfying constraints
Common issues (cont) • Avoiding overfitting • Use development data • Two strategies: • stop early • post-pruning
Common issues (cont) • Missing attribute values: • Assume a “blank” value • Assign the most common value among all “similar” examples in the training data • (DL, DT): Assign a fraction of the example to each possible class. • Continuous-valued attributes • Choose thresholds by checking the training data
Common issues (cont) • Attributes with different costs • DT: Change the quality measure to include the costs • Continuous-valued goal attribute • DT, DL: each “leaf” node is marked with a real value or a linear function • TBL, MaxEnt, Boosting: ??
Semi-supervised learning • Each learning method makes some assumptions about the problem. • SSL works when those assumptions are satisfied. • SSL could degrade the performance when mistakes reinforce themselves.
SSL (cont) • We have covered four methods: self-training, co-training, EM, co-EM
Co-training • The original paper: (Blum and Mitchell, 1998) • Two “independent” views: split the features into two sets. • Train a classifier on each view. • Each classifier labels data that can be used to train the other classifier. • Extension: • Relax the conditional independence assumptions • Instead of using two views, use two or more classifiers trained on the whole feature set.
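The sketch below illustrates the basic co-training loop under those assumptions: two classifiers, one per view, each labeling its most confident unlabeled examples to grow the other's training set. The `train(examples)` and `predict(classifier, view) -> (label, confidence)` interfaces are hypothetical placeholders, not a specific library API.

```python
# Minimal co-training sketch (in the spirit of Blum & Mitchell 1998, simplified).
def cotrain(labeled, unlabeled, train, predict, rounds=10, k=5):
    # labeled: list of (view1, view2, label); unlabeled: list of (view1, view2)
    l1 = [(v1, y) for v1, _, y in labeled]
    l2 = [(v2, y) for _, v2, y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        c1, c2 = train(l1), train(l2)           # one classifier per view
        # each classifier picks the k unlabeled examples it is most confident about
        by_c1 = sorted(pool, key=lambda u: predict(c1, u[0])[1], reverse=True)[:k]
        by_c2 = sorted(pool, key=lambda u: predict(c2, u[1])[1], reverse=True)[:k]
        for v1, v2 in by_c1:                    # c1's labels feed c2's training set
            l2.append((v2, predict(c1, v1)[0]))
        for v1, v2 in by_c2:                    # c2's labels feed c1's training set
            l1.append((v1, predict(c2, v2)[0]))
        used = {id(u) for u in by_c1 + by_c2}
        pool = [u for u in pool if id(u) not in used]
    return train(l1), train(l2)
```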
Unsupervised learning • EM is a method of estimating parameters in the MLE framework. • It produces a sequence of parameter estimates that improves the likelihood of the training data.
The EM algorithm • Start with an initial estimate θ_0 • Repeat until convergence • E-step: calculate Q(θ | θ_t) = E_{z | x, θ_t}[ log P(x, z | θ) ] • M-step: find θ_{t+1} = argmax_θ Q(θ | θ_t)
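Here is a minimal EM sketch on a toy problem, a mixture of two biased coins, where the coin identity is the hidden variable. The model and data are made up for the illustration; the same E-step/M-step structure underlies the PM-model instances on the next slide.

```python
# Minimal EM sketch: mixture of two biased coins (illustrative only).
# Each observation is the number of heads in `flips` tosses of one coin,
# chosen at random; which coin was used is hidden.
import math

def binom(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def em_coins(counts, flips, iters=50):
    p = [0.4, 0.6]                         # initial estimates theta_0
    mix = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability that each observation came from each coin
        post = []
        for k in counts:
            joint = [mix[z] * binom(k, flips, p[z]) for z in (0, 1)]
            total = sum(joint)
            post.append([j / total for j in joint])
        # M-step: re-estimate parameters from expected counts
        for z in (0, 1):
            weight = sum(r[z] for r in post)
            heads = sum(r[z] * k for r, k in zip(post, counts))
            p[z] = heads / (weight * flips)
            mix[z] = weight / len(counts)
    return p, mix

print(em_coins([9, 8, 1, 2, 9, 1], flips=10))   # estimates separate the two coins
```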
The EM algorithm (cont) • The optimal solution for the M-step exists for many classes of problems. A number of well-known methods are special cases of EM. • The EM algorithm for PM models • Forward-backward algorithm • Inside-outside algorithm • …
FSA and HMM • Two types of HMMs: • State-emission and arc-emission HMMs • They are equivalent • We can convert an HMM into a WFSA • Modeling: Markov assumption • Training: • Supervised: counting • Unsupervised: forward-backward algorithm • Decoding: Viterbi algorithm
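Viterbi decoding is easy to show in a short sketch. The probabilities below are hypothetical toy values; a real tagger would work in log space, handle unknown words, and take its parameters from counting (supervised) or forward-backward (unsupervised).

```python
# Minimal Viterbi decoding sketch for an HMM (illustrative only).
def viterbi(obs, states, start_p, trans_p, emit_p):
    # delta[t][s]: probability of the best path ending in state s at time t
    delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda r: delta[t - 1][r] * trans_p[r][s])
            delta[t][s] = delta[t - 1][best_prev] * trans_p[best_prev][s] * emit_p[s][obs[t]]
            back[t][s] = best_prev
    # follow back-pointers from the best final state
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# toy example (hypothetical probabilities)
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"flies": 0.4, "like": 0.1, "flowers": 0.5},
          "V": {"flies": 0.5, "like": 0.4, "flowers": 0.1}}
print(viterbi(["flies", "like", "flowers"], states, start_p, trans_p, emit_p))
```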
Bootstrap [Diagram: B bootstrap samples are drawn from the original sample; the learner (ML) is trained on each, producing f1, f2, …, fB, which are combined into f.]
Bootstrap (cont) • A method of re-sampling: • One original sample → B bootstrap samples • It has a strong mathematical background. • It is a method for estimating standard errors, bias, and so on.
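A minimal sketch of the re-sampling idea, here used to estimate the standard error of a statistic; the accuracy scores are made-up example data.

```python
# Minimal bootstrap sketch: estimate the standard error of a statistic by
# resampling with replacement from the one original sample (illustrative).
import random
import statistics

def bootstrap_se(sample, statistic, B=1000, seed=0):
    rng = random.Random(seed)
    estimates = []
    for _ in range(B):
        # one bootstrap sample = n draws with replacement from the original sample
        resample = [rng.choice(sample) for _ in sample]
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)

data = [78.2, 81.5, 79.9, 83.1, 80.4, 77.8]       # made-up accuracy scores
print(bootstrap_se(data, statistics.mean))        # estimated standard error of the mean
```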
System combination [Diagram: different learners ML1, ML2, …, MLB each produce a classifier f1, f2, …, fB; their outputs are combined into f.]
System combination (cont) • Hybridization: combine substructures to produce a new one. • Voting • Naïve Bayes • Switching: choose one of the fi(x) • Similarity switching • Naïve Bayes
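The simplest combination scheme, unweighted majority voting over the individual classifiers' outputs, can be sketched in a few lines (illustrative; the course also covers weighted/Naïve Bayes variants and switching, which are not shown here).

```python
# Minimal sketch of combining classifier outputs by unweighted majority voting;
# ties are broken arbitrarily by Counter.most_common (illustrative only).
from collections import Counter

def vote(classifiers, x):
    predictions = [f(x) for f in classifiers]      # each f_i proposes a label
    return Counter(predictions).most_common(1)[0][0]

# toy usage with three hypothetical classifiers
f1 = lambda x: "N"
f2 = lambda x: "V"
f3 = lambda x: "N"
print(vote([f1, f2, f3], "flies"))                 # -> "N"
```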
Bagging [Diagram: bootstrap samples of the training data are fed to the same learner (ML), producing f1, f2, …, fB, which are combined into f.] • Bagging = bootstrap + system combination
Bagging (cont) • It is effective for unstable learning methods: • Decision tree • Regression tree • Neural network • It does not help stable learning methods • K-nearest neighbors
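Putting the two pieces together, a minimal bagging sketch: train the same learner on B bootstrap samples and decode by majority vote. The `learner(data)` interface (returns a classifier f(x) → label) is an assumption for the example.

```python
# Minimal bagging sketch = bootstrap sampling + majority-vote combination
# (illustrative only).
import random
from collections import Counter

def bagging_train(data, learner, B=25, seed=0):
    rng = random.Random(seed)
    classifiers = []
    for _ in range(B):
        boot = [rng.choice(data) for _ in data]    # bootstrap sample of the same size
        classifiers.append(learner(boot))
    return classifiers

def bagging_decode(classifiers, x):
    votes = [f(x) for f in classifiers]            # combine by majority vote
    return Counter(votes).most_common(1)[0][0]
```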
Relations • WFSA and HMM • DL, DT, TBL • EM, EM for PM
WFSA and HMM [Diagram: an HMM augmented with "Start" and "Finish" states] • Add a "Start" state and a transition from "Start" to each state in the HMM. • Add a "Finish" state and a transition from each state in the HMM to "Finish".
DL, DT, CNF, DNF, TBL [Diagram: relations among k-DL, k-CNF, k-DNF, k-DT, and k-TBL]
The EM algorithm [Diagram: the generalized EM algorithm contains the EM algorithm, which contains the EM algorithm for PM models (Forward-backward, Inside-Outside, the IBM models) as well as other instances such as Gaussian mixtures.]
Issues • Modeling: represent the problem as a formula and decompose the formula into a function of parameters • Training: estimate model parameters • Decoding: find the best answer given the parameters • Other issues: • Preprocessing • Postprocessing • Evaluation • …