Machine Translation: Minimum Error Rate Training
Stephan Vogel, Spring Semester 2011
Overview
• Optimization approaches
  • Simplex
  • MER
• Avoiding local minima
• Additional considerations
  • Tuning towards different metrics
  • Tuning on different development sets
Tuning the SMT System
• We use different models in the SMT system
• The models have simplifications and are trained on different amounts of data
• => The models have different levels of reliability and their scores have different ranges
• => Give a different weight to each model:
  Q = c1 Q1 + c2 Q2 + … + cn Qn
• Find optimal scaling factors (feature weights) c1 … cn
• Optimal means: highest score for the chosen evaluation metric M, i.e. find (c1, …, cn) such that M(argmin_e Q(e, f)) is high
• Metric M is our objective function (a toy sketch of the weighted combination follows below)
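As a toy illustration of the weighted combination above, the Python sketch below computes Q for a single hypothesis from its individual model scores; the feature names and numbers are invented for the example, and the scores are treated as costs (so the decoder takes the argmin, as in the formula above).

```python
# Toy sketch: combine individual model scores Qk into one decoder score Q
# using scaling factors (feature weights) ck. Names and numbers are made up.

feature_weights = {"lm": 1.0, "tm": 0.55, "word_count": 3.2}

def combined_score(model_scores, weights):
    """Q = c1*Q1 + c2*Q2 + ... + cn*Qn for a single hypothesis."""
    return sum(weights[name] * score for name, score in model_scores.items())

hypothesis = {"lm": 42.7, "tm": 18.3, "word_count": 12.0}  # costs, e.g. negative log-probs
print(combined_score(hypothesis, feature_weights))
```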
Problems
• The surface of the objective function is not nice
  • Not convex -> local minima (actually, many local minima)
  • Not differentiable -> gradient descent methods not readily applicable
• There may be dangerous areas ('boundary cliffs'), where a small change has a big effect
• Example:
  • Tune on a dev set with short reference translations
  • Optimization leads towards short translations
  • The new test set has long reference translations
  • The translations are now too short -> length penalty
Brute Force Approach – Manual Tuning
• Decode with different scaling factors
  • Get a feeling for the range of good values
  • Get a feeling for the importance of the models
    • The LM is typically most important
    • Sentence length (word count feature) balances the shortening effect of the LM
    • Word reordering is more or less effective depending on the language
• Narrow down the range in which the scaling factors are tested
• Essentially multi-linear optimization
• Works well for a small number of models
• Time consuming (CPU-wise) if decoding takes a long time
Automatic Tuning
• Many algorithms for finding (near) optimal solutions are available:
  • Simplex
  • Powell (line search)
  • MIRA (Margin Infused Relaxed Algorithm)
  • Specially designed minimum error rate training (Och, 2003)
  • Genetic algorithms
• Note: the models themselves are not improved, only their combination
• Note: some parameters change the performance of the decoder but are not part of Q:
  • Number of alternative translations
  • Beam size
  • Word reordering restrictions
Automatic Tuning on N-best Lists
• Optimization algorithms need many iterations – too expensive to run full translations
• => Use n-best lists
  • e.g. for each of 500 source sentences keep 1000 translations
• Changing the scaling factors results in a re-ranking of the n-best lists
• Evaluate the new 1-best translations (see the re-ranking sketch below)
• Apply any of the standard optimization techniques
• Advantage: much faster
• Can pre-calculate the counts (e.g. n-gram matches) for each translation to speed up evaluation
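A minimal re-ranking sketch along these lines, assuming each n-best entry stores its model scores (as costs) and a pre-computed sentence-level error count; the data layout and all numbers are illustrative, not taken from the slides.

```python
# Sketch: re-rank n-best lists under new feature weights and evaluate the
# resulting 1-best translations. Data layout and numbers are hypothetical.

def rerank_and_evaluate(nbest_lists, weights):
    """nbest_lists: one list of hypotheses per source sentence; each hypothesis
    is a dict with 'scores' (model name -> cost), plus pre-computed 'errors'
    (e.g. word errors) and 'ref_len'. Returns the corpus error rate."""
    total_errors, total_ref_len = 0, 0
    for hyps in nbest_lists:
        # model-best hypothesis = minimum combined cost under the current weights
        best = min(hyps, key=lambda h: sum(weights[m] * v
                                           for m, v in h["scores"].items()))
        total_errors += best["errors"]
        total_ref_len += best["ref_len"]
    return total_errors / total_ref_len

nbest = [  # two sentences, two hypotheses each
    [{"scores": {"lm": 10.0, "tm": 5.0}, "errors": 2, "ref_len": 8},
     {"scores": {"lm": 12.0, "tm": 3.0}, "errors": 1, "ref_len": 8}],
    [{"scores": {"lm": 7.0, "tm": 9.0}, "errors": 0, "ref_len": 6},
     {"scores": {"lm": 6.0, "tm": 11.0}, "errors": 3, "ref_len": 6}],
]
print(rerank_and_evaluate(nbest, {"lm": 1.0, "tm": 0.5}))
```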
Simplex (Nelder-Mead)
• Start with n+1 random configurations
• Get the 1-best translation for each configuration -> objective function value
• Sort the points xk according to the objective function: f(x1) < f(x2) < … < f(xn+1)
• Calculate x0 as the center of gravity (centroid) of x1 … xn
• Replace the worst point with a point reflected through the centroid:
  xr = x0 + r * (x0 – xn+1)
Demo
• Obviously, we need to change the size of the simplex to enforce convergence
• Also, we want to adjust the step size:
  • If the new point is the best point – increase the step size
  • If the new point is worse than x1 … xn – decrease the step size
Expansion and Contraction
• Reflection: calculate xr = x0 + r * (x0 – xn+1)
  If f(x1) <= f(xr) < f(xn), replace xn+1 with xr; next iteration
• Expansion: if the reflected point is better than the best, i.e. f(xr) < f(x1):
  Calculate xe = x0 + e * (x0 – xn+1)
  If f(xe) < f(xr), replace xn+1 with xe, else replace xn+1 with xr
  Next iteration
  Otherwise contract
• Contraction: the reflected point has f(xr) >= f(xn)
  Calculate xc = xn+1 + c * (x0 – xn+1)
  If f(xc) <= f(xn+1), replace xn+1 with xc, else shrink
• Shrinking: for all xk, k = 2 … n+1: xk = x1 + s * (xk – x1)
  Next iteration
(a code sketch of one full iteration follows below)
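The following Python sketch implements one Nelder-Mead iteration following exactly the cases above; the objective f (e.g. the corpus error of the re-ranked n-best lists) is passed in, and the default coefficients r = 1, e = 2, c = 0.5, s = 0.5 are the usual textbook choices, not values from the slides.

```python
import numpy as np

def nelder_mead_step(simplex, f, r=1.0, e=2.0, c=0.5, s=0.5):
    """One Nelder-Mead iteration. simplex: n+1 weight vectors; f: objective
    to minimize (e.g. corpus error of the re-ranked n-best lists)."""
    simplex = sorted((np.asarray(x, dtype=float) for x in simplex), key=f)
    x1, xn, worst = simplex[0], simplex[-2], simplex[-1]
    x0 = np.mean(simplex[:-1], axis=0)         # centroid of all points but the worst

    xr = x0 + r * (x0 - worst)                 # reflection
    if f(x1) <= f(xr) < f(xn):
        simplex[-1] = xr
    elif f(xr) < f(x1):                        # expansion
        xe = x0 + e * (x0 - worst)
        simplex[-1] = xe if f(xe) < f(xr) else xr
    else:                                      # contraction
        xc = worst + c * (x0 - worst)
        if f(xc) <= f(worst):
            simplex[-1] = xc
        else:                                  # shrink all points towards the best
            simplex = [x1] + [x1 + s * (xk - x1) for xk in simplex[1:]]
    return simplex
```

Repeating this step until the simplex stops changing (or the objective stops improving) gives the full optimizer.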
Changing the Simplex
[Diagram: the four simplex updates (reflection, expansion, contraction, shrinking)]
Powell Line Search
• Select directions in the search space, then:
  Loop until convergence
    Loop over directions d
      Perform line search for direction d until convergence
• Many variants:
  • Selection of directions
    • Easiest is to use the model scores
    • Or combine multiple scores
  • Step size in the line search
• MER (Och, 2003) is a line search along the models with a smart selection of steps
(a minimal coordinate line-search sketch follows below)
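A minimal sketch of this kind of search along the coordinate (one-model-at-a-time) directions, assuming an error(weights) function that re-ranks the n-best lists and returns the corpus error; the offset grid and the number of sweeps are arbitrary illustrative choices.

```python
import numpy as np

def coordinate_line_search(weights, error, steps=np.linspace(-2.0, 2.0, 41),
                           n_sweeps=5):
    """Cycle over the feature dimensions; in each dimension try a grid of
    offsets and keep the value with the lowest error."""
    weights = np.asarray(weights, dtype=float)
    best_err = error(weights)
    for _ in range(n_sweeps):                      # outer loop over sweeps
        for d in range(len(weights)):              # loop over directions
            base = weights[d]
            for step in steps:                     # line search in direction d
                trial = weights.copy()
                trial[d] = base + step
                err = error(trial)
                if err < best_err:
                    best_err, weights = err, trial
    return weights, best_err
```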
Minimum Error Training
• For each hypothesis we have the total model score Q = Σk ck Qk
• Select one model k and write Q = ck Qk + Σn≠k cn Qn = ck Qk + QRest
• Viewed as a function of ck, each hypothesis is a straight line: the individual model score Qk gives the slope, QRest the offset; each hypothesis also carries its metric score (e.g. WER = 8)
[Plot: total model score over ck for one hypothesis, with slope Qk and offset QRest]
Minimum Error Training
• Source sentence 1
• Depending on the scaling factor ck, different hypotheses are in the 1-best position
• Set ck so that the metric-best hypothesis is also the model-best one
[Plot: model scores of h11 (WER = 8), h12 (WER = 5), h13 (WER = 4) as lines over ck; the 1-best hypothesis changes from h12 to h13 to h11 as ck varies]
Minimum Error Training
• Select a minimum number of evaluation points:
  • Calculate the intersection points of the lines
  • Keep an intersection point only if the intersecting hypotheses are minimal (model-best) at that point
  • Choose the evaluation points between the intersection points
(a brute-force sketch of this line search follows below)
[Plot: the same three hypotheses; evaluation points lie between the intersection points at which the 1-best hypothesis changes]
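Under the conventions of these slides (model-best = minimum combined score, sentence errors added up over the corpus), a brute-force version of this exact line search along a single weight ck can be sketched as follows; a real MERT implementation keeps only the intersection points that lie on the lower envelope of the lines, as the slide describes, but the simpler version below shows the idea. Names and data layout are illustrative.

```python
import itertools
import numpy as np

def mert_line_search(sent_lines, eps=1e-9):
    """Exact line search along one feature weight ck.
    sent_lines: one list per sentence of (slope, offset, error) triples, one
    per hypothesis, where the model score of the hypothesis is
    slope * ck + offset (slope = Qk, offset = QRest) and 'error' is its
    sentence-level error count. Returns (best_ck, best_total_error)."""
    # 1) intersection points of the hypothesis lines (candidate boundaries)
    points = []
    for lines in sent_lines:
        for (a1, b1, _), (a2, b2, _) in itertools.combinations(lines, 2):
            if abs(a1 - a2) > eps:
                points.append((b2 - b1) / (a1 - a2))
    points = sorted(set(points)) or [0.0]
    # 2) evaluate one ck inside every interval between intersection points
    candidates = ([points[0] - 1.0]
                  + [(p + q) / 2.0 for p, q in zip(points, points[1:])]
                  + [points[-1] + 1.0])

    def total_error(ck):
        # 1-best hypothesis = the one with the minimum model score at this ck
        return sum(min(lines, key=lambda l: l[0] * ck + l[1])[2]
                   for lines in sent_lines)

    errors = [total_error(ck) for ck in candidates]
    best = int(np.argmin(errors))
    return candidates[best], errors[best]
```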
Minimum Error Training
• Source sentence 1, now with different error scores
• The optimization would find a different ck
• => Different metrics lead to different scaling factors
[Plot: the same hypotheses with changed error scores: h11 (WER = 8), h12 (WER = 2), h13 (WER = 4)]
Minimum Error Training
• Sentence 2
• The best ck lies in a different range
• No matter which ck, h22 would never be 1-best
[Plot: model scores of h21 (WER = 2), h22 (WER = 0), h23 (WER = 5) over ck; only h23 and h21 ever become 1-best]
Minimum Error Training
• Multiple sentences: sum the errors of the respective 1-best hypotheses over all sentences and pick ck where the total error is lowest
[Plot: the lines of sentences 1 and 2 combined; the summed error over the ck intervals takes the values 10, 7, 10, 9]
Iterate Decoding - Optimization
• The n-best list is a (very restricted) substitute for the search space
• With updated feature weights we may generate other (better) translations
• Some of the hypotheses now in the n-best list would have been pruned
• Iterate (a schematic of the loop follows below):
  • Re-translate with the new feature weights
  • Merge the new translations with the old translations (increases stability)
  • Run the optimizer over the larger n-best lists
  • Repeat until no new translations appear, the improvement is < epsilon, or simply k times (typically 5-10 iterations)
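A schematic of this outer loop; decode, merge_nbest and optimize_weights are passed in as stand-ins for the real decoder, the n-best merging step and any of the optimizers discussed above, so none of these names refer to an actual API.

```python
# Schematic outer loop of n-best-list based tuning. The three callables are
# placeholders: decode(dev_set, weights) -> new n-best lists,
# merge_nbest(old, new) -> True if the merged lists grew,
# optimize_weights(nbest, weights) -> (new_weights, dev_score).

def tune(dev_set, weights, decode, merge_nbest, optimize_weights,
         max_iterations=10, epsilon=1e-4):
    nbest = {}                                   # accumulated n-best lists
    previous_score = None
    for _ in range(max_iterations):
        new_nbest = decode(dev_set, weights)     # re-translate with current weights
        grew = merge_nbest(nbest, new_nbest)     # merging increases stability
        weights, score = optimize_weights(nbest, weights)
        if not grew:                             # no new translations were found
            break
        if previous_score is not None and abs(score - previous_score) < epsilon:
            break                                # improvement below threshold
        previous_score = score
    return weights
```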
Avoiding Local Minima
• The optimization can get stuck in a local minimum
• Remedies:
  • Fiddle around with the parameters of your optimization algorithm
  • Larger n-best lists -> more evaluation points
  • Combine with a simulated-annealing-type approach (Smith & Eisner, 2007)
  • Restart multiple times
Random Restarts
• Comparison Simplex vs. Powell (Alok, unpublished)
• Comparison Simplex vs. extended Simplex vs. MER (Bing Zhao, unpublished)
• Observations:
  • Alok: Simplex is 'jumpier' than Powell
  • Bing: Simplex better than MER
  • Both: you need many restarts (a minimal restart wrapper is sketched below)
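A minimal restart wrapper along these lines; the inner optimizer and the error function are passed in, and the number of restarts and the sampling range for the initial weights are arbitrary choices.

```python
import numpy as np

def optimize_with_restarts(error, dim, optimize, n_restarts=20, scale=2.0, seed=0):
    """Run the given optimizer from several random starting points and keep the
    best result. optimize(error, start) -> (weights, err); error(weights) is
    the corpus error of the re-ranked n-best lists."""
    rng = np.random.default_rng(seed)
    best_weights, best_err = None, float("inf")
    for _ in range(n_restarts):
        start = rng.uniform(-scale, scale, size=dim)   # random initial weights
        weights, err = optimize(error, start)
        if err < best_err:
            best_weights, best_err = weights, err
    return best_weights, best_err
```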
Optimizing NOT Towards References
• Ideally, we want the system output to be identical to the reference translations
• But there is no guarantee that the system can generate the reference translations (under realistic conditions)
  • E.g. we restrict the reordering window
  • We have unknown words
  • The reference translations may contain words unknown to the system
• Instead of forcing the decoder towards the reference translations, optimize towards the best translations the system can generate:
  • Find the hypotheses with the best metric score
  • Use those as pseudo references
  • Optimize towards the pseudo references
Optimizing Towards Different Metrics
• Automatic metrics have different characteristics
• Optimizing towards one metric does not mean that the other metric scores will also go up
• In particular, different metrics prefer shorter or longer translations; typically TER < BLEU < METEOR (< means 'shorter translation')
• Mauser et al. (2007) on the Ch-En NIST 2005 test set:
  • Reasonably well behaved
  • The resulting translation length differs by more than 15%
Generalization to Other Test Sets
• Optimize on one set, test on multiple other sets
• Again Mauser et al., Ch-En
• Shown is the behavior over the Simplex optimization iterations
• Nice, nearly parallel development of the metric scores
• However, we have also observed brittle behavior
  • Esp. when the ratio src_length / ref_length is very different between the dev and eval test sets
Large Weight = Important Feature?
• Assume we have cLM = 1.0, cTM = 0.55, cWC = 3.2
• Which feature is most important?
• Cannot say!
• We want to re-rank the n-best lists
• The feature weights scale the feature values so that they can compete
• Example (see the toy illustration below):
  • The variation in the LM and TM scores is larger than for WC
  • A large weight is needed for WC to make its small differences effective
• To find out whether a feature is important, remove it and look at the drop in the metric score
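A toy numeric illustration of this point (all numbers invented): multiplying a feature by a constant and dividing its weight by the same constant leaves every combined score, and therefore the re-ranking, unchanged, so the raw weight magnitude mainly reflects the scale of the feature, not its importance.

```python
# Toy example: weight size depends on the feature's scale, not on its importance.
hyps = [{"lm": 42.7, "wc": 12}, {"lm": 45.1, "wc": 14}, {"lm": 41.9, "wc": 11}]

def score(h, w):
    return w["lm"] * h["lm"] + w["wc"] * h["wc"]

w = {"lm": 1.0, "wc": 3.2}
# Rescale the word-count feature by 10 and shrink its weight by 10:
hyps10 = [{"lm": h["lm"], "wc": 10 * h["wc"]} for h in hyps]
w10 = {"lm": 1.0, "wc": 0.32}

print([score(h, w) for h in hyps])      # identical scores ...
print([score(h, w10) for h in hyps10])  # ... hence an identical ranking
```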
Open Issues
• Shouldn't all optimizers get the same results, if done right?
  • The models are the same; it's just about finding the right mix
  • If local minima can be avoided, similarly good optima should be found
• How to stay safe?
  • Avoid good optima close to 'cliffs'
  • Different configurations give very similar metric scores; pick the one that is more stable
• One hat fits all?
  • Why one set of feature weights?
  • How about different sets for:
    • Good/bad translations (tuning on the tail: mixed results so far)
    • Short/long sentences
    • Beginning/middle/end of sentence
    • ...
Summary
• Optimize the system by modifying the scaling factors (feature weights)
• Different optimization approaches can be used
  • Simplex and Powell are most common
  • MERT (Och) is similar to Powell, with pre-calculation of the grid points
• Many local optima; avoid getting stuck early
  • Most effective: many restarts
• Generalization
  • To unseen test data: mostly OK, but sometimes the selection of the dev set has a big impact (length penalty!)
  • To different metrics: reasonably stable (metrics are reasonably correlated in most cases)
• Still open questions => more research needed