Training dependency parsers by jointly optimizing multiple objectives
Keith Hall, Ryan McDonald, Jason Katz-Brown, Michael Ringgaard
Evaluation • Intrinsic • How well does system replicate gold annotations? • Precision/recall/F1, accuracy, BLEU, ROUGE, etc. • Extrinsic • How useful is system for some downstream task? • High performance on one doesn’t necessarily mean high performance on the other • Can be hard to evaluate extrinsically
Dependency Parsing • Given a sentence, label the dependencies • (example parse from nltk.org) • Output is useful for downstream tasks like machine translation • Also of interest to NLP researchers
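A minimal illustration of what parser output looks like as data, assuming arcs are (head, dependent, label) triples over token indices with 0 standing in for ROOT (the sentence and labels here are made up):

```python
# An assumed encoding of a dependency parse: (head, dependent, label)
# triples over 1-based token indices; head 0 stands in for ROOT.
sentence = ["She", "enjoys", "parsing"]
parse = [
    (0, 2, "root"),    # ROOT  -> enjoys
    (2, 1, "nsubj"),   # enjoys -> She
    (2, 3, "dobj"),    # enjoys -> parsing
]
# Downstream systems (e.g. an MT reordering component) consume these arcs.
```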
Overview of paper • Optimize parser for two metrics • Intrinsic evaluation • Downstream task (here a reranker in a machine translation system) • Algorithm to do this • Experiments
Perceptron Algorithm • Takes: set of labeled training examples; loss function • For each example, predicts output, updates model if the output is incorrect • Rewards features that fire in the gold-standard output • Penalizes those that fire in the predicted output
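A minimal Python sketch of this loop; the `predict` and `features` helper interfaces are assumptions for illustration, not from the paper:

```python
# Structured-perceptron sketch. `predict(x, w)` is assumed to return the
# highest-scoring output under weights w; `features(x, y)` the features
# that fire for input/output pair (x, y).
def perceptron_train(examples, predict, features, epochs=10):
    w = {}                                    # feature weights
    for _ in range(epochs):
        for x, y_gold in examples:
            y_pred = predict(x, w)            # model's current best output
            if y_pred != y_gold:              # update only on mistakes
                for f in features(x, y_gold):
                    w[f] = w.get(f, 0.0) + 1.0   # reward gold features
                for f in features(x, y_pred):
                    w[f] = w.get(f, 0.0) - 1.0   # penalize predicted features
    return w
```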
Augmented Loss Perceptron Algorithm • Similar to perceptron, except takes: multiple loss functions; multiple datasets (one for each loss function); scheduler to weight loss functions • Perceptron is an instance of ALP with one loss function, one dataset, and a trivial scheduler • Will look at ALP with 2 loss functions • Can use extrinsic evaluator as loss function
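A sketch of the ALP training loop under the same assumed interfaces; pairing each dataset with a loss and routing updates through a scheduler follows the slide, but the function signatures are illustrative:

```python
from itertools import cycle

# Augmented-loss perceptron sketch. datasets[i] pairs with losses[i];
# the scheduler decides which objective drives the update at each step.
def augmented_loss_train(datasets, losses, scheduler, update, steps=100_000):
    w = {}
    streams = [cycle(d) for d in datasets]   # endless stream per dataset
    for step in range(steps):
        i = scheduler(step)                  # e.g. weighted round-robin
        x, y = next(streams[i])
        w = update(w, x, y, losses[i])       # perceptron-style update w.r.t. loss i
    return w

# The plain perceptron is the special case: one dataset, one loss,
# and scheduler = lambda step: 0.
```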
Reranker loss function • Takes k-best output from parser • Assigns a cost to each parse • Takes the lowest-cost parse to be the “correct” parse • If the 1-best parse is lowest cost, do nothing • Otherwise update parameters toward the correct parse • The standard loss function is an instance of this in which the cost is always lowest for the 1-best
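A sketch of this update, assuming `kbest(x, w)` returns parses best-first under the current weights and `cost` is the extrinsic cost function (both names are assumptions):

```python
# Reranker-style perceptron update: treat the lowest-cost parse in the
# k-best list as "correct" and update toward it.
def rerank_update(w, x, kbest, cost, features):
    parses = kbest(x, w)                  # k-best list, best-first
    y_star = min(parses, key=cost)        # lowest-cost parse = "correct"
    y_hat = parses[0]                     # current 1-best
    if cost(y_hat) > cost(y_star):        # do nothing if 1-best is already lowest cost
        for f in features(x, y_star):
            w[f] = w.get(f, 0.0) + 1.0
        for f in features(x, y_hat):
            w[f] = w.get(f, 0.0) - 1.0
    return w
```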
Experiment 1 • English to Japanese MT system, specifically word reordering step • Given a parse, reorder the English sentence into Japanese word order • Transition-based and graph-based dependency parsers • 17,260 manually annotated word reorderings • 10,930 training, 6,338 test • These are cheaper to produce than dependency parses
Experiment 1 • 2nd loss function based on METEOR • Score = 1 – (#chunks – 1)/(#unigrams matched – 1) • Cost = 1 – score • Matched unigrams are those appearing in both reference and hypothesis • Chunks are sets of matched unigrams that are adjacent in both reference and hypothesis • Vary weights of primary and secondary loss
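The slide's formula as a function; the handling of the single-match edge case is an assumption, not from the paper:

```python
# METEOR-based reordering cost from the slide:
# score = 1 - (#chunks - 1) / (#unigrams matched - 1); cost = 1 - score.
def reorder_cost(num_chunks, num_matched):
    if num_matched <= 1:
        return 0.0                        # assumed: a single match is one chunk
    score = 1.0 - (num_chunks - 1) / (num_matched - 1)
    return 1.0 - score                    # = (#chunks - 1) / (#matched - 1)

# Perfect ordering (all matches in one chunk) gives cost 0;
# maximally fragmented matches give cost 1.
```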
Experiment 1 • As the ratio of extrinsic to intrinsic loss increases, performance on the reordering task improves • Transition-based parser
Experiment 2 • Semi-supervised adaptation: Penn Treebank (PTB) to Question Treebank (QTB) • PTB-trained parser performs poorly on QTB • QTB-trained parser does much better on QTB • Ask annotators a simple question about QTB sentences • What is the main verb? • ROOT usually attaches to the main verb • Use the answers plus PTB data to adapt to QTB
Experiment 2 • Augmented-loss data set: QTB data with ROOT attached to the main verb • No other labels on QTB data • Loss function: 0 if ROOT dependency correct, 1 otherwise • Secondary loss function looks at the k-best list, chooses the highest-ranked parse with the correct ROOT dependency
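A sketch of the 0/1 ROOT loss and the k-best oracle it induces, reusing the assumed (head, dependent) arc encoding from above:

```python
# 0/1 ROOT loss: 0 if ROOT attaches to the annotated main verb, 1 otherwise.
# A parse is assumed to be a list of (head, dependent) index pairs, head 0 = ROOT.
def root_loss(parse, main_verb):
    predicted = next(dep for head, dep in parse if head == 0)  # ROOT's dependent
    return 0 if predicted == main_verb else 1

def oracle_root_parse(kbest, main_verb):
    """Highest-ranked parse in the k-best list with the correct ROOT arc;
    falls back to the 1-best if none qualifies (an assumption)."""
    return next((p for p in kbest if root_loss(p, main_verb) == 0), kbest[0])
```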
Experiment 2 • Results for transition parser • Huge improvement with data that is very cheap to collect • Cheaper to have Turkers annotate main verbs than grad students manually parse sentences
Experiment 3 • Improving accuracy on labeled and unlabeled dependency parsing (all intrinsic) • Use labeled attachment score as primary loss function • Secondary loss function weights lengths of incorrect and correct arcs • One version uses labeled arcs, the other unlabeled • Idea is to have model account for arc length • Parsers tend to do poorly on long dependencies (McDonald and Nivre, 2007)
Experiment 3 • Weighted Arc Length Score (ALS) • Sum of lengths of all correct arcs divided by sum of lengths of all arcs • In the unlabeled version only head and dependent need to match • In the labeled version the arc label must match too
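A sketch of ALS under the same assumed (head, dependent, label) arc encoding, where an arc's length is the head–dependent index distance:

```python
# Weighted arc length score: length-weighted fraction of gold arcs recovered.
# Arcs are assumed (head, dependent, label) triples over token indices.
def arc_length_score(gold, pred, labeled=True):
    def key(arc):
        h, d, lbl = arc
        return (h, d, lbl) if labeled else (h, d)   # drop label if unlabeled
    correct = {key(a) for a in gold} & {key(a) for a in pred}
    total = sum(abs(h - d) for h, d, _ in gold)             # lengths of all gold arcs
    matched = sum(abs(h - d) for h, d in (c[:2] for c in correct))  # correct arcs only
    return matched / total if total else 1.0
```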
Experiment 3 • Results with transition parser • Small improvement, likely because ALS is closely correlated with LAS and UAS
Conclusions • Possible to train tools for particular downstream tasks • Might not want to use the same parses for MT as for information extraction • Can leverage cheap(er) data to improve task performance • Japanese translations/word orderings for MT • Main verb identification instead of full dependency parses for domain adaptation • Not necessarily easy to define the task or a good extrinsic evaluation metric • MT reduced to a word-reordering score • METEOR-based metric