Max-margin sequential learning methods William W. Cohen CALD
Announcements • Upcoming assignments: • Wed 3/3: project proposal due: • personnel + 1-2 pages • Spring break next week, no class • Will get feedback on project proposals by end of break • Write-ups for the “Distance Metrics for Text” week are due Wed 3/17, not the Monday after spring break
Collins’ paper • Notation: • label (y) is a “tag” t • observation (x) is a word w • a history h is a 4-tuple ⟨t_{i-1}, t_{i-2}, w[1:n], i⟩ • φ_s(h, t) is a feature of the pair (h, t)
Collins’ paper • Notation, cont’d: • Φ_s is the sum of φ_s over all positions i • α_s is the weight given to Φ_s
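To make the notation concrete, here is a minimal Python sketch, assuming simple binary indicator features and (for brevity) a history with a single previous tag; the feature names and the example sentence are illustrative, not from the paper.

```python
# Local feature phi_s(h, t): history h = (t_prev, words, i), proposed tag t.
def phi_s(s, h, t):
    t_prev, words, i = h
    if s == ("word=the", "tag=DT"):
        return 1.0 if words[i].lower() == "the" and t == "DT" else 0.0
    if s == ("prev=DT", "tag=NN"):
        return 1.0 if t_prev == "DT" and t == "NN" else 0.0
    return 0.0

# Global feature Phi_s: the sum of phi_s over all positions i.
def Phi_s(s, words, tags):
    return sum(phi_s(s, (tags[i - 1] if i > 0 else "*", words, i), tags[i])
               for i in range(len(words)))

# Score of a whole tag sequence: sum over features s of alpha_s * Phi_s.
def score(alpha, features, words, tags):
    return sum(alpha[s] * Phi_s(s, words, tags) for s in features)

features = [("word=the", "tag=DT"), ("prev=DT", "tag=NN")]
alpha = {features[0]: 1.0, features[1]: 0.5}
print(score(alpha, features, ["the", "dog"], ["DT", "NN"]))  # prints 1.5
```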
The theory • Claim 1: the algorithm is an instance of a perceptron variant for ranking (sketched below). • Claim 2: the arguments behind the mistake-bound classification results of Freund & Schapire (1999) extend immediately to this ranking task as well.
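For concreteness, a minimal sketch of the training loop, assuming a caller-supplied `viterbi_best` that returns the highest-scoring tag sequence under the current weights, and a `Phi` that maps (words, tags) to a sparse global feature vector:

```python
from collections import defaultdict

# Collins-style perceptron for tagging: on each mistake, promote the gold
# sequence's features and demote the predicted sequence's features.
def train_perceptron(data, Phi, viterbi_best, epochs=5):
    alpha = defaultdict(float)              # one weight alpha_s per feature s
    for _ in range(epochs):
        for words, gold_tags in data:
            pred_tags = viterbi_best(alpha, words)
            if pred_tags != gold_tags:
                for s, v in Phi(words, gold_tags).items():
                    alpha[s] += v           # promote the correct sequence
                for s, v in Phi(words, pred_tags).items():
                    alpha[s] -= v           # demote the mistaken prediction
    return alpha
```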
Results • Two experiments: • POS tagging, using Adwait Ratnaparkhi’s feature set • NP chunking (Start, Continue, Outside tags) • NER on a special AT&T dataset (reported in another paper)
More ideas • The dual version of a perceptron: w is built up by repeatedly adding examples, so w is a weighted sum of the examples x_1, ..., x_n • The inner product ⟨w, x⟩ can then be rewritten:
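Reconstructing the omitted equations in standard notation (here α_i counts how often example x_i was added to w):

```latex
w = \sum_{i=1}^{n} \alpha_i x_i
\qquad\Longrightarrow\qquad
\langle w, x \rangle = \sum_{i=1}^{n} \alpha_i \, \langle x_i, x \rangle
```

so the classifier can be evaluated using only inner products between examples.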
Dual version of the perceptron for ranking • Here the dual weights are α_{i,j}, where i ranges over examples and j over the correct/incorrect tag sequences proposed for example i
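A minimal sketch of this dual ranking perceptron, under loudly stated assumptions: the caller supplies a kernel `K(x1, y1, x2, y2) = ⟨Φ(x1, y1), Φ(x2, y2)⟩` over (sentence, tag-sequence) pairs and a `candidates` function enumerating tag sequences, and candidates are hashable (e.g. tuples):

```python
from collections import defaultdict

# Score of candidate y for input x, as a weighted sum of kernel evaluations:
# each recorded mistake (i, y_wrong) contributes K with the correct sequence
# minus K with the wrong one, weighted by its mistake count alpha[(i, y_wrong)].
def dual_score(alpha, train, K, x, y):
    return sum(a * (K(train[i][0], train[i][1], x, y)
                    - K(train[i][0], y_wrong, x, y))
               for (i, y_wrong), a in alpha.items())

def train_dual(train, K, candidates, epochs=5):
    alpha = defaultdict(float)              # alpha[(i, j)]: mistake counts
    for _ in range(epochs):
        for i, (x, y_gold) in enumerate(train):
            y_hat = max(candidates(x),
                        key=lambda y: dual_score(alpha, train, K, x, y))
            if y_hat != y_gold:
                alpha[(i, y_hat)] += 1.0    # record the mistaken candidate
    return alpha
```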
Altun et al paper • Starting point: the dual version of Collins’ perceptron algorithm • the final hypothesis is a weighted sum of inner products with a subset of the examples • this is a lot like an SVM, except that the perceptron algorithm is used to set the weights rather than quadratic optimization
SVM optimization • Notation: • y_i is the correct tag sequence for x_i • y is an incorrect tag sequence • F(x_i, y_i) are the features • Optimization problem: • find weights w that maximize the minimal margin, subject to ||w|| = 1, or • minimize ||w||² such that every margin is ≥ 1
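Spelled out, with the margin of a pair (x_i, y) taken to be w · (F(x_i, y_i) − F(x_i, y)) (a standard reconstruction; the paper’s exact formulation may differ in constants):

```latex
% (a) maximize the minimal margin under a norm constraint
\max_{w,\,\gamma} \; \gamma
\quad \text{s.t.} \quad \|w\| = 1, \;\;
w \cdot \bigl(F(x_i, y_i) - F(x_i, y)\bigr) \ge \gamma
\;\; \forall i,\ \forall y \ne y_i

% (b) equivalently, minimize the norm under unit margins
\min_{w} \; \|w\|^2
\quad \text{s.t.} \quad
w \cdot \bigl(F(x_i, y_i) - F(x_i, y)\bigr) \ge 1
\;\; \forall i,\ \forall y \ne y_i
```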
SVMs for ranking • Proposition: optimization problems (14) and (15) in the paper are equivalent.
SVMs for ranking • This is a binary classification problem, with (x_i, y_i) the positive example and each (x_i, y′) a negative example, except that the threshold θ_i varies per example. Why? Because we’re ranking.
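One way to spell out the per-example thresholds (a sketch of the standard reduction, not necessarily the paper’s exact constants):

```latex
w \cdot F(x_i, y_i) \ge \theta_i + 1
\qquad \text{and} \qquad
w \cdot F(x_i, y') \le \theta_i - 1 \quad \forall y' \ne y_i
```

Subtracting the second constraint from the first eliminates θ_i and recovers the ranking constraint w · (F(x_i, y_i) − F(x_i, y′)) ≥ 2, i.e. the margin form above up to rescaling.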
SVMs for ranking • Altun et al give the remaining details • As in perceptron learning, “negative” data is found by running Viterbi with the learned weights and looking for errors • Each mistake is a possible new support vector (see the sketch below) • Need to iterate over the data repeatedly • Could take exponential time before convergence if the support vectors are dense...
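A minimal sketch of that working-set loop, assuming two caller-supplied helpers: `viterbi_best(model, x)` returns the top-scoring tag sequence under the current model, and `solve_qp(train, support)` re-optimizes the dual weights over the current constraint set:

```python
# Working-set training: alternate between finding violated constraints
# (Viterbi mistakes) and re-solving the QP restricted to those constraints.
def train_svm_struct(train, viterbi_best, solve_qp, max_rounds=10):
    support = []                            # working set: (i, wrong_sequence)
    model = solve_qp(train, support)        # initial (trivial) model
    for _ in range(max_rounds):
        added = False
        for i, (x, y_gold) in enumerate(train):
            y_hat = viterbi_best(model, x)
            if y_hat != y_gold and (i, y_hat) not in support:
                support.append((i, y_hat))  # mistake => new candidate SV
                added = True
        if not added:                       # no violated constraints left
            break
        model = solve_qp(train, support)    # re-solve on the enlarged set
    return model
```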
Altun et al results • NER on 300 sentences from the CoNLL-2002 shared task • Spanish • Four entity types, nine labels (beginning-T, intermediate-T, other) • POS tagging on 300 sentences from the Penn TreeBank • 5-fold cross-validation, window of size 3, simple features