Structured Perceptron Alice Lai and Shi Zhi
Presentation Outline • Introduction to Structured Perceptron • ILP-CRF Model • Averaged Perceptron • Latent Variable Perceptron
Motivation • An algorithm to learn weights for structured prediction • Alternative to POS tagging with MEMM and CRF (Collins 2002) • Convergence guarantees under certain conditions even for inseparable data • Generalizes to new examples and other sequence labeling problems
POS Tagging Example • Gold labels: the/D man/N saw/V the/D dog/N • Prediction: the/D man/N saw/N the/D dog/N • Parameter update: add 1 to the weight of each feature that fires in the gold sequence, subtract 1 from the weight of each feature that fires in the predicted sequence (features shared by both cancel out) • [Figure: tag lattice for "the man saw the dog"; each word can take tag D, N, A, or V]
MEMM Approach • Conditional model: probability of the current state given the previous state and the current observation • For the tagging problem, define local features for each tag in context • Features are often indicator functions • Learn the parameter vector α with Generalized Iterative Scaling or gradient descent
Global Features • Local features are defined only for a single label • Global features are defined for an observed sequence and a possible label sequence • Simple version: global features are local features summed over an observation-label sequence pair • Compared to the original perceptron algorithm, we now predict a vector of labels instead of a single label • Which of the possible incorrect label vectors do we use as the negative example in training?
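A minimal sketch of the "simple version" above, with feature templates of my own choosing (word/tag and tag-bigram indicators): the global feature vector Φ(x, y) is just the local indicator features summed over the sequence.

```python
# Global features Phi(x, y): local indicator features summed over the sequence.
# The word/tag and tag-bigram templates are illustrative, not from the slides.
from collections import Counter

def global_features(words, tags):
    feats = Counter()
    prev = "<s>"
    for word, tag in zip(words, tags):
        feats[("word", word, tag)] += 1    # local emission-style feature
        feats[("bigram", prev, tag)] += 1  # local transition-style feature
        prev = tag
    return feats

# The POS example from the earlier slide:
print(global_features("the man saw the dog".split(), ["D", "N", "V", "D", "N"]))
```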
Structured Perceptron Algorithm
Input: training examples (x_i, y_i), i = 1…n
Initialize parameter vector α = 0
For t = 1…max_iter:
  For i = 1…n:
    z_i = argmax over z ∈ GEN(x_i) of Φ(x_i, z) · α
    If z_i ≠ y_i then update: α = α + Φ(x_i, y_i) - Φ(x_i, z_i)
Output: parameter vector α
GEN(x_i) enumerates the possible label sequences for observed sequence x_i.
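A minimal sketch of this training loop, following Collins 2002 but with the argmax over GEN(x) done by brute-force enumeration over a tiny tag set (a real implementation would use Viterbi decoding); the feature templates repeat the sketch above.

```python
# Structured perceptron training: mistake-driven updates on global feature vectors.
from collections import Counter
from itertools import product

TAGS = ["D", "N", "A", "V"]

def global_features(words, tags):
    feats = Counter()
    prev = "<s>"
    for word, tag in zip(words, tags):
        feats[("word", word, tag)] += 1
        feats[("bigram", prev, tag)] += 1
        prev = tag
    return feats

def score(weights, feats):
    return sum(weights[f] * v for f, v in feats.items())

def predict(weights, words):
    # argmax over GEN(x): here, every tag sequence of the right length
    return list(max(product(TAGS, repeat=len(words)),
                    key=lambda tags: score(weights, global_features(words, list(tags)))))

def train(examples, max_iter=10):
    weights = Counter()
    for _ in range(max_iter):
        for words, gold in examples:
            pred = predict(weights, words)
            if pred != gold:                                    # mistake-driven update
                weights.update(global_features(words, gold))    # add gold features
                weights.subtract(global_features(words, pred))  # subtract predicted features
    return weights

weights = train([("the man saw the dog".split(), ["D", "N", "V", "D", "N"])])
```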
Properties • Convergence: the data is separable with margin δ > 0 if there is some vector U with ||U|| = 1 such that U · Φ(x_i, y_i) - U · Φ(x_i, z) ≥ δ for all i and all z ≠ y_i in GEN(x_i) • For data that is separable with margin δ, the number of mistakes made in training is bounded by R²/δ², where R is a constant such that ||Φ(x_i, y_i) - Φ(x_i, z)|| ≤ R for all i and z • Inseparable case: the number of mistakes is still bounded, in terms of how far the data is from being separable with some margin • Generalization: bounds on the error rate on new examples follow from the mistake bounds • Theorems and proofs from Collins 2002
Global vs. Local Learning • Global learning (IBT): constraints are used during training • Local learning (L+I): classifiers are trained without constraints, constraints are applied later to produce global output • Example: ILP-CRF model [Roth and Yih 2005]
Perceptron IBT • This is structured perceptron!
Input: training examples (x_i, y_i), i = 1…n
Initialize parameter vector α = 0
For t = 1…max_iter:
  For i = 1…n:
    z_i = argmax over z ∈ GEN(x_i) of F(x_i, z, α)
    If z_i ≠ y_i then update: α = α + Φ(x_i, y_i) - Φ(x_i, z_i)
Output: parameter vector α
GEN(x_i) enumerates the possible label sequences for observed sequence x_i, respecting the structural constraints. F is a scoring function, here F(x, z, α) = Φ(x, z) · α.
Perceptron I+L • Decomposition: the global scoring function decomposes into a sum of local, per-label scores • Prediction: each label is predicted independently by its local classifier, without constraints • If a local prediction is wrong, update that local classifier's parameters (see the sketch below) • Either learn the parameter vector for global features, or do inference (apply the constraints) only at evaluation time
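A minimal sketch of the local-learning half of L+I, assuming a per-token multiclass perceptron with feature templates of my own choosing; structural constraints would be applied only at inference time, e.g. via the ILP inference on the next slide.

```python
# Local learning (the "L" in L+I): each token's label is predicted and updated
# independently; no structural constraints are used during training.
from collections import Counter

TAGS = ["D", "N", "A", "V"]

def local_features(words, i, tag):
    # Per-token indicator features (illustrative template)
    prev_word = words[i - 1] if i else "<s>"
    return Counter({("word", words[i], tag): 1, ("prev-word", prev_word, tag): 1})

def train_local(examples, max_iter=10):
    weights = Counter()
    for _ in range(max_iter):
        for words, gold in examples:
            for i, gold_tag in enumerate(gold):
                pred_tag = max(TAGS, key=lambda t: sum(weights[f] * v
                               for f, v in local_features(words, i, t).items()))
                if pred_tag != gold_tag:                              # per-token update
                    weights.update(local_features(words, i, gold_tag))
                    weights.subtract(local_features(words, i, pred_tag))
    return weights

weights = train_local([("the man saw the dog".split(), ["D", "N", "V", "D", "N"])])
```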
ILP-CRF Introduction [Roth and Yih 2005] • [Figure: label lattice from source node s to sink node t over states A, B, C; decoding is a shortest path from s to t] • ILP-CRF model for Semantic Role Labeling as a sequence labeling problem • Viterbi inference for CRFs can include constraints, but cannot handle long-range or general constraints • Viterbi is a shortest path problem that can be solved with ILP • Use integer linear programming to express general constraints during inference • Allows incorporation of expressive constraints, including long-range constraints between distant tokens that cannot be handled by Viterbi
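A sketch of the shortest-path view as an integer linear program, in my own notation (the slide shows only the lattice picture): binary variables pick one labeled transition per position, and flow constraints force the chosen transitions to form a single path from s to t.

```latex
% Viterbi as an ILP (notation mine): x_{i,y,y'} = 1 iff label y is chosen at
% position i and y' at position i+1; s_i(y,y') is the model score of that edge.
\begin{align*}
\max_{x}\; & \sum_{i}\sum_{y,y'} s_i(y,y')\, x_{i,y,y'}
  && \text{(total score of the selected path)}\\
\text{s.t.}\; & \sum_{y,y'} x_{i,y,y'} = 1 \quad \forall i
  && \text{(exactly one transition per position)}\\
 & \sum_{y} x_{i-1,y,y'} = \sum_{y''} x_{i,y',y''} \quad \forall i,\, y'
  && \text{(flow conservation: the transitions form one path)}\\
 & x_{i,y,y'} \in \{0,1\}
  && \text{(integrality)}
\end{align*}
```

General or long-range constraints then become extra linear inequalities over the same variables, which is exactly what Viterbi cannot express.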
ILP-CRF Models • Sequential models: CRF trained with max log-likelihood; CRF trained with voted perceptron (both I+L and IBT) • Local training (L+I): perceptron, winnow, voted perceptron, voted winnow
ILP-CRF Results • [Results table comparing sequential models trained with inference (IBT) against locally trained models (L+I)]
ILP-CRF Conclusions • Local learning models perform poorly on their own, but performance improves dramatically when constraints are added at evaluation • Performance is then comparable to IBT methods • The best models for global and local training show comparable results • L+I vs. IBT: L+I requires fewer training examples, is more efficient, and outperforms IBT in most situations (unless the local problems are difficult to solve) [Punyakanok et al., IJCAI 2005]
Variations: Voted Perceptron • For iteration t = 1,…,T, for example i = 1,…,n: given the parameter vector from that point in training, get a label sequence for a test example by Viterbi decoding • Each stored parameter vector thus defines one tagging of the test sequence • The voted perceptron takes the most frequently occurring output in this set of taggings
Variations: Voted Perceptron • Averaged algorithm (Collins 2002): an approximation of the voted method that uses the averaged parameter vector instead of the final parameter vector • Performance: higher F-measure, lower error rate, and greater stability (less variance in its scores) • Variation: a modified averaged algorithm for the latent perceptron
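A minimal sketch of the averaged variant, with hypothetical feature_fn/predict_fn hooks standing in for the feature map and the Viterbi decoder (the efficient lazy-update trick used in practice is omitted).

```python
# Averaged perceptron: keep a running sum of the weight vector after every
# example and return the average, instead of the final weights.
from collections import Counter

def train_averaged(examples, feature_fn, predict_fn, max_iter=10):
    weights, total, count = Counter(), Counter(), 0
    for _ in range(max_iter):
        for x, gold in examples:
            pred = predict_fn(weights, x)
            if pred != gold:
                weights.update(feature_fn(x, gold))     # add gold features
                weights.subtract(feature_fn(x, pred))   # subtract predicted features
            total.update(weights)   # accumulate the current weights after each example
            count += 1
    return Counter({f: v / count for f, v in total.items()})  # averaged parameters
```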
Variations: Latent Structure Perceptron • Model definition: w is the parameter vector of the perceptron; Φ(x, h, y) is the feature encoding function mapping an input x, latent structure h, and label sequence y to a feature vector • In the NER task, x is the word sequence, y is the named-entity type sequence, and h is the hidden (latent) variable sequence • Features: unigram and bigram features over words, POS tags, and orthography (prefixes, upper/lower case) • Why latent variables? To capture latent dependencies (i.e., hidden sub-structure)
Variations: Latent Structure Perceptron • Purely latent structure perceptron (Connor's) • Training: structured perceptron with margin, where C is the margin and alpha is the learning rate • Variation: modified parameter-averaging method (Sun's): re-initialize the parameters with the averaged parameters every k iterations • Advantage: reduces overfitting of the latent perceptron
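A minimal sketch of the margin-based training step described above; the margin C and learning rate alpha come from the slide, while the argmax_fn and best_h_fn hooks (finding the best (h, y) pair and the best h for the gold y) are placeholders for the problem-specific decoders.

```python
# Latent structured perceptron with margin (sketch): update whenever the gold
# labeling does not beat the best competing (latent, label) pair by margin C.
from collections import Counter

def train_latent(examples, feature_fn, argmax_fn, best_h_fn, C=1.0, alpha=0.1, max_iter=10):
    w = Counter()
    for _ in range(max_iter):
        for x, y_gold in examples:
            h_pred, y_pred = argmax_fn(w, x)      # best (latent, label) pair under w
            h_gold = best_h_fn(w, x, y_gold)      # best latent structure for the gold label
            gold_feats = feature_fn(x, h_gold, y_gold)
            pred_feats = feature_fn(x, h_pred, y_pred)
            def dot(f):
                return sum(w[k] * v for k, v in f.items())
            if dot(gold_feats) < dot(pred_feats) + C:   # margin violation
                for k, v in gold_feats.items():
                    w[k] += alpha * v                   # move toward gold features
                for k, v in pred_feats.items():
                    w[k] -= alpha * v                   # move away from predicted features
    return w
```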
Variations: Latent Structure Perceptron • Disadvantage of the purely latent perceptron: h* is found and then forgotten for each x • Solution: Online Latent Classifier (Connor's) • Two classifiers: a latent classifier with parameter vector u, and a label classifier with parameter vector w
Variations: Latent Structure Perceptron • Online Latent Classifier Training (Connor's)
Variations: Latent Structure Perceptron • Experiments: Bio-NER with the purely latent perceptron • [Results table: training time and F-measure for high-order settings; cc = feature cut-off, Odr = order of dependency]
Variations: Latent Structure Perceptron • Experiments: Semantic Role Labeling with the argument/predicate structure as the latent structure • X: She likes yellow flowers (sentence) • Y: agent predicate ------ patient (roles) • H: exactly one predicate, at least one argument (latent structure) • Optimization for (h*, y*): search over all possible argument/predicate structures; for more complex data, other methods are needed • [Test-set results table not reproduced]
Summary • Structured perceptron definition and motivation • IBT vs. L+I • Variations of the structured perceptron References: • Discriminative Training for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, M. Collins, EMNLP 2002. • Latent Variable Perceptron Algorithm for Structured Classification, X. Sun, T. Matsuzaki, D. Okanohara, and J. Tsujii, IJCAI 2009. • Integer Linear Programming Inference for Conditional Random Fields, D. Roth and W. Yih, ICML 2005. • Learning and Inference over Constrained Output, V. Punyakanok, D. Roth, W. Yih, and D. Zimak, IJCAI 2005. • Online Latent Structure Training for Language Acquisition, M. Connor, C. Fisher, and D. Roth, IJCAI 2011.