Max-margin sequential learning methods William W. Cohen CALD
Announcements • Upcoming assignments: • Wed 3/3: project proposal due: • personnel + 1-2 pages • Spring break next week, no class • Will get feedback on project proposals by end of break • Write-ups for the “Distance Metrics for Text” week are due Wed 3/17, not the Monday after spring break
Collins’ paper • Notation: • label (y) is a “tag” t • observation (x) is a word w • a history h is a 4-tuple ⟨t_{i-1}, t_{i-2}, w[1:n], i⟩ • φ_s(h, t) is a feature of the pair (h, t)
Collins’ paper • Notation, cont’d: • Φ_s is the sum of φ_s over all positions i • α_s is the weight given to Φ_s
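To make the notation concrete, here is a minimal Python sketch, assuming simple binary indicator features and (for brevity) a history with a single previous tag; the feature names and the example sentence are illustrative, not from the paper.

```python
# Local feature phi_s(h, t): history h = (t_prev, words, i), proposed tag t.
def phi_s(s, h, t):
    t_prev, words, i = h
    if s == ("word=the", "tag=DT"):
        return 1.0 if words[i].lower() == "the" and t == "DT" else 0.0
    if s == ("prev=DT", "tag=NN"):
        return 1.0 if t_prev == "DT" and t == "NN" else 0.0
    return 0.0

# Global feature Phi_s: the sum of phi_s over all positions i.
def Phi_s(s, words, tags):
    return sum(phi_s(s, (tags[i - 1] if i > 0 else "*", words, i), tags[i])
               for i in range(len(words)))

# Score of a whole tag sequence: sum over features s of alpha_s * Phi_s.
def score(alpha, features, words, tags):
    return sum(alpha[s] * Phi_s(s, words, tags) for s in features)

features = [("word=the", "tag=DT"), ("prev=DT", "tag=NN")]
alpha = {features[0]: 1.0, features[1]: 0.5}
print(score(alpha, features, ["the", "dog"], ["DT", "NN"]))  # prints 1.5
```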
The theory • Claim 1: the algorithm is an instance of a perceptron variant for ranking (sketched below). • Claim 2: the arguments behind the mistake-bound classification results of Freund & Schapire (1999) extend immediately to this ranking task as well.
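For concreteness, a minimal sketch of the training loop, assuming a caller-supplied `viterbi_best` that returns the highest-scoring tag sequence under the current weights, and a `Phi` that maps (words, tags) to a sparse global feature vector:

```python
from collections import defaultdict

# Collins-style perceptron for tagging: on each mistake, promote the gold
# sequence's features and demote the predicted sequence's features.
def train_perceptron(data, Phi, viterbi_best, epochs=5):
    alpha = defaultdict(float)              # one weight alpha_s per feature s
    for _ in range(epochs):
        for words, gold_tags in data:
            pred_tags = viterbi_best(alpha, words)
            if pred_tags != gold_tags:
                for s, v in Phi(words, gold_tags).items():
                    alpha[s] += v           # promote the correct sequence
                for s, v in Phi(words, pred_tags).items():
                    alpha[s] -= v           # demote the mistaken prediction
    return alpha
```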
Results • Two experiments: • POS tagging, using Adwait Ratnaparkhi’s feature set • NP chunking (Start, Continue, Outside tags) • NER on a special AT&T dataset (reported in another paper)
More ideas • The dual version of a perceptron: w is built up by repeatedly adding examples, so w is a weighted sum of the examples x_1, ..., x_n • The inner product ⟨w, x⟩ can then be rewritten:
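Reconstructing the omitted equations in standard notation (here α_i counts how often example x_i was added to w):

```latex
w = \sum_{i=1}^{n} \alpha_i x_i
\qquad\Longrightarrow\qquad
\langle w, x \rangle = \sum_{i=1}^{n} \alpha_i \, \langle x_i, x \rangle
```

so the classifier can be evaluated using only inner products between examples.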
Dual version of the perceptron for ranking • Here the dual weights are α_{i,j}, where i ranges over examples and j over the correct/incorrect tag sequences proposed for example i
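A minimal sketch of this dual ranking perceptron, under loudly stated assumptions: the caller supplies a kernel `K(x1, y1, x2, y2) = ⟨Φ(x1, y1), Φ(x2, y2)⟩` over (sentence, tag-sequence) pairs and a `candidates` function enumerating tag sequences, and candidates are hashable (e.g. tuples):

```python
from collections import defaultdict

# Score of candidate y for input x, as a weighted sum of kernel evaluations:
# each recorded mistake (i, y_wrong) contributes K with the correct sequence
# minus K with the wrong one, weighted by its mistake count alpha[(i, y_wrong)].
def dual_score(alpha, train, K, x, y):
    return sum(a * (K(train[i][0], train[i][1], x, y)
                    - K(train[i][0], y_wrong, x, y))
               for (i, y_wrong), a in alpha.items())

def train_dual(train, K, candidates, epochs=5):
    alpha = defaultdict(float)              # alpha[(i, j)]: mistake counts
    for _ in range(epochs):
        for i, (x, y_gold) in enumerate(train):
            y_hat = max(candidates(x),
                        key=lambda y: dual_score(alpha, train, K, x, y))
            if y_hat != y_gold:
                alpha[(i, y_hat)] += 1.0    # record the mistaken candidate
    return alpha
```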
Altun et al paper • Starting point: the dual version of Collins’ perceptron algorithm • the final hypothesis is a weighted sum of inner products with a subset of the examples • this is a lot like an SVM, except that the perceptron algorithm is used to set the weights rather than quadratic optimization
SVM optimization • Notation: • y_i is the correct tag sequence for x_i • y is an incorrect tag sequence • F(x_i, y_i) are the features • Optimization problem: • find weights w that maximize the minimal margin, subject to ||w|| = 1, or • minimize ||w||² such that every margin is ≥ 1
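Spelled out, with the margin of a pair (x_i, y) taken to be w · (F(x_i, y_i) − F(x_i, y)) (a standard reconstruction; the paper’s exact formulation may differ in constants):

```latex
% (a) maximize the minimal margin under a norm constraint
\max_{w,\,\gamma} \; \gamma
\quad \text{s.t.} \quad \|w\| = 1, \;\;
w \cdot \bigl(F(x_i, y_i) - F(x_i, y)\bigr) \ge \gamma
\;\; \forall i,\ \forall y \ne y_i

% (b) equivalently, minimize the norm under unit margins
\min_{w} \; \|w\|^2
\quad \text{s.t.} \quad
w \cdot \bigl(F(x_i, y_i) - F(x_i, y)\bigr) \ge 1
\;\; \forall i,\ \forall y \ne y_i
```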
SVMs for ranking • Proposition: optimization problems (14) and (15) in the paper are equivalent.
SVMs for ranking • This is a binary classification problem, with (x_i, y_i) the positive example and each (x_i, y′) a negative example, except that the threshold θ_i varies per example. Why? Because we’re ranking.
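One way to spell out the per-example thresholds (a sketch of the standard reduction, not necessarily the paper’s exact constants):

```latex
w \cdot F(x_i, y_i) \ge \theta_i + 1
\qquad \text{and} \qquad
w \cdot F(x_i, y') \le \theta_i - 1 \quad \forall y' \ne y_i
```

Subtracting the second constraint from the first eliminates θ_i and recovers the ranking constraint w · (F(x_i, y_i) − F(x_i, y′)) ≥ 2, i.e. the margin form above up to rescaling.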
SVMs for ranking • Altun et al give the remaining details • As in perceptron learning, “negative” data is found by running Viterbi with the learned weights and looking for errors • Each mistake is a possible new support vector (see the sketch below) • Need to iterate over the data repeatedly • Could take exponential time before convergence if the support vectors are dense...
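A minimal sketch of that working-set loop, assuming two caller-supplied helpers: `viterbi_best(model, x)` returns the top-scoring tag sequence under the current model, and `solve_qp(train, support)` re-optimizes the dual weights over the current constraint set:

```python
# Working-set training: alternate between finding violated constraints
# (Viterbi mistakes) and re-solving the QP restricted to those constraints.
def train_svm_struct(train, viterbi_best, solve_qp, max_rounds=10):
    support = []                            # working set: (i, wrong_sequence)
    model = solve_qp(train, support)        # initial (trivial) model
    for _ in range(max_rounds):
        added = False
        for i, (x, y_gold) in enumerate(train):
            y_hat = viterbi_best(model, x)
            if y_hat != y_gold and (i, y_hat) not in support:
                support.append((i, y_hat))  # mistake => new candidate SV
                added = True
        if not added:                       # no violated constraints left
            break
        model = solve_qp(train, support)    # re-solve on the enlarged set
    return model
```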
Altun et al results • NER on 300 sentences from the CoNLL-2002 shared task • Spanish • Four entity types, nine labels (beginning-T, intermediate-T, other) • POS tagging on 300 sentences from the Penn TreeBank • 5-fold cross-validation, window of size 3, simple features