Machine Translation: Decoder for Phrase-Based SMT
Stephan Vogel, Spring Semester 2011
Decoder
• Decoding issues (previous session)
  • Two-step decoding
    • Generation of the translation lattice
    • Best-path search with limited word reordering
• Specific issues
  • Recombination of hypotheses
  • Pruning
  • N-best list generation
  • Future cost estimation
Recombination of Hypotheses
• Recombination: of two hypotheses, keep only the better one if no future information can switch their current ranking
• Notice: this depends on the models
  • Model score depends on the current partial translation and the extension, e.g. the LM
  • Model score depends on global features known only at the sentence end, e.g. a sentence length model
• The models define equivalence classes for the hypotheses
• Expand only the best hypothesis in each equivalence class
Recombination of Hypotheses: Example
• n-gram LM
• Hypotheses
  H1: I would like to go
  H2: I would not like to go
  Assume as possible expansions: to the movies | to the cinema | and watch a film
• The LM score of the expansion is identical for H1 and H2 for bigram, trigram, and 4-gram LMs
  • E.g. the 3-gram LM score of expansion 1 is: -log p(to | to go) - log p(the | go to) - log p(movies | to the)
• Therefore: Cost(H1) < Cost(H2) => Cost(H1 + E) < Cost(H2 + E) for all possible expansions E (see the sketch below)
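A minimal sketch of why the ranking is preserved, using a toy constant-probability trigram lookup (the function names and probabilities are illustrative, not from the slides): the cost added by an expansion depends only on the last two words of the hypothesis, and H1 and H2 end in the same two words.

```python
import math

def trigram_cost(history, expansion, logprob):
    """Sum of -log p(w | two preceding words) over the expansion words.
    `logprob` is a placeholder lookup; a real LM would smooth / back off."""
    cost = 0.0
    context = tuple(history[-2:])        # only the last two words matter
    for w in expansion:
        cost += -logprob(context, w)
        context = (context[1], w)
    return cost

h1 = "I would like to go".split()
h2 = "I would not like to go".split()
expansion = "to the movies".split()
toy_lm = lambda ctx, w: math.log(0.1)    # made-up constant probability

# Both hypotheses end in "... to go", so the added LM cost is identical:
assert trigram_cost(h1, expansion, toy_lm) == trigram_cost(h2, expansion, toy_lm)
```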
Recombination of Hypotheses: Example 2
• Sentence length model p(I | J)
• Hypotheses
  H1: I would like to go
  H2: I would not like to go
  Assume as possible expansions: to the movies | to the cinema | and watch a film
• Length(H1) = 5, Length(H2) = 6
• For identical expansions the lengths will remain different
• Situation at the sentence end:
  • Possible that -log P(len(H1 + E) | J) > -log P(len(H2 + E) | J)
  • Then possible that TotalCost(H1 + E) > TotalCost(H2 + E)
  • I.e. re-ranking of the hypotheses
• Therefore: cannot recombine H2 into H1
Recombination: Keep ‘em Around
• Expand only the best hyp
• Store pointers to recombined hyps for n-best list generation
[Figure: best hypotheses hb carry pointers to the recombined hypotheses hr; axes: better score vs. increasing coverage]
Recombination of Hypotheses
• Typical features for recombination of partial hypotheses:
  • LM history
  • Positions of covered source words – some translations are more expensive
  • Number of generated words on the target side – for the sentence length model
• Often only the number of covered source words is considered, rather than the actual positions
  • Fits the typical organization of the decoder: hyps are stored according to the number of covered source words
  • Hyps are recombined which are not strictly comparable
  • Use the future cost estimate to lessen the impact
• Overall: trade-off between speed and ‘correctness’ of the search
  • Ideally: only compare (and recombine) hyps if all models used in the search see them as equivalent
  • Realistically: use fewer, coarser equivalence classes by ‘forgetting’ some of the models (they still add to the scores) – see the sketch below
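A minimal sketch of how this is often organized (record and function names are illustrative assumptions, not the decoder described in the slides): hypotheses are bucketed by an equivalence key built from the features above, and only the cheapest hypothesis per key is expanded, while the others are attached to it for later n-best recovery.

```python
from collections import namedtuple

# Illustrative hypothesis record: cost so far, coverage of source positions
# (a hashable tuple of booleans), LM history, target length, recombined losers.
Hyp = namedtuple("Hyp", "cost coverage lm_state target_len recombined")

def recombination_key(hyp, use_full_coverage=True):
    """Equivalence class under the models used during search."""
    cov = hyp.coverage if use_full_coverage else sum(hyp.coverage)
    return (cov, hyp.lm_state, hyp.target_len)

def recombine(hyps):
    """Keep only the best hypothesis per equivalence class; attach the
    recombined (worse) hypotheses to the survivor for n-best extraction."""
    best = {}
    for h in hyps:
        key = recombination_key(h)
        if key not in best:
            best[key] = h
        elif h.cost < best[key].cost:
            h.recombined.append(best[key])   # old winner becomes a loser
            best[key] = h
        else:
            best[key].recombined.append(h)
    return list(best.values())
```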
Pruning
• Even after recombination there are too many hyps
  • Remove bad hyps and keep only the best ones
• In recombination we compared hyps which are equivalent under the models
• Now we need to compare hyps which are not strictly equivalent under the models
  • We risk removing hyps which would have won the race in the long run
  • I.e. we introduce errors into the search
• Search errors vs. model errors
  • Model errors: our models give higher probability to the worse translation
  • Search errors: our decoder loses translations with higher probability
Pruning: Which Hyps to Compare?
• Which hyps are we comparing?
• How many should we keep?
[Figure: recombination groups equivalent hyps; pruning then discards the low-scoring ones]
Pruning: Which Hyps to Compare?
• A coarser equivalence relation => need to drop at least one of the models, or replace it by a simpler model, e.g.:
  • Recombination according to translated positions and LM state; pruning according to number of translated positions and LM state
  • Recombination according to number of translated positions and LM state; pruning according to number of translated positions OR LM state
  • Recombination with a 5-gram LM; pruning with a 3-gram LM
• Question: which is the more important feature?
  • Which leads to more search errors?
  • How much loss in translation quality?
  • Quality is more important than speed in most applications!
• Not one correct answer – depends on the other components of the system
• Ideally, the decoder allows for different recombination and pruning settings
How Many Hyps to Keep?
• Beam search: keep hyp h if Cost(h) < Cost(h_best) + const
[Figure: cost vs. number of translated words; the beam prunes bad hyps. If the models separate the alternatives a lot, few hyps are kept; if the models do not separate them, many hyps are kept]
Additive Beam
• Is an additive constant (in the log domain) the right thing to do?
• Hyps may spread more and more
[Figure: cost vs. number of translated words; as the hyps spread, fewer and fewer hyps fall inside the additive beam]
Multiplicative Beam
• Beam search: keep hyp h if Cost(h) < Cost(h_best) * const (both beam variants are sketched below)
[Figure: cost vs. number of translated words; the beam opens up and covers more hyps]
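A sketch of the two beam variants in the cost (negative log probability) domain; the threshold constants are illustrative:

```python
def beam_prune(hyps, additive_margin=None, multiplicative_factor=None):
    """Keep hypotheses whose cost is close enough to the best one.
    Costs are negative log probabilities, i.e. lower is better."""
    best = min(h.cost for h in hyps)
    if additive_margin is not None:          # Cost(h) < Cost(h_best) + const
        threshold = best + additive_margin
    else:                                    # Cost(h) < Cost(h_best) * const
        threshold = best * multiplicative_factor
    return [h for h in hyps if h.cost < threshold]
```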
Pruning and Optimization
• Each feature has a feature weight
• Optimization by adjusting the feature weights
• This can result in compressing or spreading the scores
• This actually happened in our first MERT implementation: higher and higher feature weights => hyps spreading further and further apart => fewer hyps inside the beam => lower and lower BLEU score
• Two-pronged repair:
  • Normalizing the feature weights
  • Not proper beam pruning, but restricting the number of hyps
How Many Hyps to Keep?
• Keep the n best hyps (see the sketch below)
• Does not use the information from the models to decide how many hyps to keep
[Figure: cost vs. number of translated words; a constant number of hyps is kept, the bad hyps are pruned]
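Histogram pruning keeps a fixed number of hypotheses regardless of how the scores are spread; a minimal sketch (in practice combined with the beam threshold above):

```python
import heapq

def histogram_prune(hyps, max_hyps=1000):
    """Keep at most `max_hyps` hypotheses with the lowest cost."""
    return heapq.nsmallest(max_hyps, hyps, key=lambda h: h.cost)
```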
Efficiency
• Two problems:
  • Sorting
  • Generating lots of hyps which are then pruned (what a waste of time)
• Can we avoid generating hyps which would most likely be pruned?
Efficiency
• Assumptions:
  • We want to generate hyps which cover n positions
  • All hyp sets Hk, k < n, are sorted according to total score
  • All phrase pairs (edges in the translation lattice) which can be used to expand a hyp h in Hk to cover n positions are sorted according to their score (weighted sum of the individual scores)
[Figure: sorted hypotheses h1…h5 and sorted phrase pairs p1…p4 are combined into a sorted list of new hypotheses (h1p2, h1p1, h2p3, h4p2, …); the combinations at the bottom of the list are pruned]
Naïve Way
• Naïve way:
  Foreach hyp h
    Foreach phrase pair p
      newhyp = h ⊗ p
      Cost(newhyp) = Cost(h) + Cost(p) + Cost_LM + Cost_DM + …
• This generates many hyps which will be pruned
Early Termination
• If Cost(newhyp) = Cost(h) + Cost(p), it would be easy:
  Besthyp = h1 ⊗ p1
  Loop  h = next hyp
    Loop  p = next phrase pair
      newhyp = h ⊗ p
      Cost(newhyp) = Cost(h) + Cost(p)
    Until Cost(newhyp) > Cost(besthyp) + const
  Until Cost(newhyp) > Cost(besthyp) + const
• That works for proper beam pruning, but would still generate too many hyps for the max-number-of-hyps strategy
• In addition, we have the LM and DM costs, etc.
‘Cube’ Pruning
• Always expand the currently best hyp (sketched below), until
  • no hyps are within the beam anymore, or
  • the max number of hyps is reached
[Figure: grid of combined costs over sorted hypotheses × sorted phrase pairs; cells are expanded in order of increasing cost]
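A sketch of the idea, simplified to one hypothesis set and one phrase list (function and attribute names are illustrative): both lists are kept sorted, a priority queue always pops the cheapest untried combination, and expansion stops once the beam or the hypothesis limit is reached. Because the full cost includes LM/DM terms that the queue key ignores, the pop order is only approximately best-first.

```python
import heapq

def cube_prune(hyps, phrases, combine, max_new=100, beam=5.0):
    """hyps, phrases: lists sorted by increasing cost.
    combine(h, p) builds the new hypothesis and returns it with its full cost."""
    heap = [(hyps[0].cost + phrases[0].cost, 0, 0)]
    seen = {(0, 0)}
    new_hyps, best_cost = [], float("inf")
    while heap and len(new_hyps) < max_new:
        approx, i, j = heapq.heappop(heap)
        if approx > best_cost + beam:            # frontier already outside beam
            break
        newhyp = combine(hyps[i], phrases[j])
        best_cost = min(best_cost, newhyp.cost)
        if newhyp.cost <= best_cost + beam:
            new_hyps.append(newhyp)
        for ni, nj in ((i + 1, j), (i, j + 1)):  # push the two neighbours
            if ni < len(hyps) and nj < len(phrases) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (hyps[ni].cost + phrases[nj].cost, ni, nj))
    return sorted(new_hyps, key=lambda h: h.cost)
```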
Effect of Recombination and Pruning
• Average number of expanded hypotheses and NIST scores for different recombination (R) and pruning (P) combinations and different beam sizes (= number of hyps)
• Test set: Arabic DevTest (203 sentences)
• Notation: c = number of translated words, C = coverage vector (i.e. actual positions), L = LM history
• NIST scores: higher is better
Number of Hypotheses versus NIST
• The language model state is required as a recombination feature
• More hypotheses – better quality
• Different ways to achieve similar translation quality
• CL vs. C: C generates more ‘useless’ hypotheses (the number of bad hyps grows faster than the number of good hyps)
N-Best List Generation
• Benefits:
  • Required for optimizing the model scaling factors
  • Rescoring with richer models
  • Down-stream processing
    • Translation with a pivot language: L1 -> L2 -> L3
    • Information extraction
    • …
• We have n-best translations at the sentence end
• But: hypotheses are recombined -> many good translations don’t reach the sentence end
• Recover those translations
Storing Multiple Backpointers
• When recombining hypotheses, store them with the best (i.e. surviving) hypothesis, but don’t expand them
[Figure: surviving hypotheses hb carry backpointers to the recombined hypotheses hr]
Calculating True Score
• Propagate the final score backwards:
  • For the best hypothesis we have the correct final score Qf(hb)
  • For a recombined hypothesis we know its current score Qc(hr) and the difference to the current score Qc(hb) of the best hypothesis
  • The final score of the recombined hypothesis is then: Q(hr) = Qf(hb) + (Qc(hr) - Qc(hb))  (see the sketch below)
• Use B = (Q, h, B’) to store sequences of hypotheses which make up a translation
  • Start with the n best final hypotheses
  • For each of the top n Bs, go to the predecessor hypothesis and to the recombined hypotheses of the predecessor hypothesis
  • Store the Bs according to coverage
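A one-line sketch of the propagation formula, assuming each hypothesis stores its current (partial) cost and the survivor also stores its final score (attribute names are assumptions):

```python
def final_score(recombined_hyp, surviving_hyp):
    """Q(h_r) = Q_f(h_b) + (Q_c(h_r) - Q_c(h_b)): shift the survivor's final
    score by the cost gap observed at recombination time."""
    return surviving_hyp.final_score + (recombined_hyp.cost - surviving_hyp.cost)
```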
Problem with N-Best Generation
• Duplicates when using phrases:
  US # companies # and # other # institutions
  US companies # and # other # institutions
  US # companies and # other # institutions
  US # companies # and other # institutions
  …
• Example run: 1000-best -> ~400 different strings on average; extreme case: only 10 different strings
• Possible solution: check uniqueness during backtracking, i.e. create and hash the partial translations (sketched below)
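A minimal sketch of that fix: while backtracking, hash the surface string of each candidate and skip candidates that only differ in their phrase segmentation.

```python
def unique_nbest(candidates, n):
    """candidates: iterable of (cost, target_words) in order of increasing cost.
    Drop entries whose surface string has already been emitted."""
    seen, out = set(), []
    for cost, words in candidates:
        surface = " ".join(words)            # ignore phrase boundaries
        if surface not in seen:
            seen.add(surface)
            out.append((cost, surface))
            if len(out) == n:
                break
    return out
```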
Rest-Cost Estimation
• In pruning we compare hyps which are not strictly equivalent under the models
• Risk: we prefer hypotheses which have covered the easy parts
• Remedy: estimate the remaining cost for each hypothesis; compare hypotheses based on ActualCost + FutureCost
• Want to know the minimum expected cost (similar to A* search)
  • Gives a bound for pruning
  • However, not possible with acceptable effort for all models
• Want to include as many models as possible:
  • Translation model costs, word count, phrase count
  • Language model costs
  • Distortion model costs
• Calculate the expected cost R(l, r) for each span (l, r)
Rest Cost for Translation Models
• Translation model, word count, and phrase count features are ‘local’ costs
  • They depend only on the current phrase pair
  • Strictly additive: R(l, m) + R(m, r) = R(l, r)
• Minimize over alternative translations
  • For each source span (l, r): initialize with the cost of the best translation
  • Combine adjacent spans and take the best combination (see the sketch below)
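A sketch of the span computation as a CKY-style dynamic program; `phrase_costs`, mapping a source span (l, r) to the cost of its cheapest translation option, is an assumed input:

```python
def tm_rest_costs(J, phrase_costs):
    """R[l][r] = cheapest 'local' cost (TM + word/phrase counts) for the source
    span (l, r), half-open over positions 0..J; spans without any phrase pair
    start at infinity and are filled by combining adjacent sub-spans."""
    INF = float("inf")
    R = [[INF] * (J + 1) for _ in range(J + 1)]
    for (l, r), cost in phrase_costs.items():    # best translation per span
        R[l][r] = min(R[l][r], cost)
    for length in range(2, J + 1):               # combine adjacent spans
        for l in range(0, J - length + 1):
            r = l + length
            for m in range(l + 1, r):
                R[l][r] = min(R[l][r], R[l][m] + R[m][r])
    return R
```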
Rest Cost for Language Models
• We do not have the history -> only an approximation
  • For each span (l, r) calculate the LM score without history
  • Combine the LM scores of adjacent spans
  • Notice: p(e1 … em) * p(em+1 … en) != p(e1 … en) beyond a 1-gram LM
• Alternative: fast monotone decoding with the TM-best translations
  • History is available
  • Then R(l, r) = R(1, r) – R(1, l)  (see the helper below)
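A tiny helper for the monotone-pass alternative, assuming `prefix_cost[r]` holds the LM cost R(0, r) of the TM-best translation of the first r source positions:

```python
def lm_rest_cost(prefix_cost, l, r):
    """LM rest cost of span (l, r) as a difference of prefix costs,
    i.e. R(l, r) = R(0, r) - R(0, l)."""
    return prefix_cost[r] - prefix_cost[l]
```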
Rest Cost for Distance-Based DM
• Distance-based DM: the rest cost depends on the coverage pattern
• Too many different coverage patterns; cannot pre-calculate
• Estimate by jumping to the first gap, then filling the gaps in sequence (sketched below)
• Moore & Quirk 2007: DM cost plus rest cost
  • S = previous phrase, S’ = gap-free initial segment, S’’ = current phrase
  • L(.) = length of phrase, D(.,.) = distance between phrases
  • S adjacent to S’’: d = 0
  • S left of S’: d = 2 L(S)
  • S’ subsequence of S’’: d = 2 (D(S, S’’) + L(S))
  • Otherwise: d = 2 (D(S, S’) + L(S))
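A sketch of the simple estimate from the first bullets (jump back to the first gap, then fill the remaining gaps left to right); it does not implement the refined Moore & Quirk case distinction, and the coverage representation is an assumption:

```python
def distortion_rest_cost(coverage, current_end):
    """coverage: booleans over source positions; current_end: position right
    after the last translated phrase. Estimated remaining jump distance when
    jumping to the first gap and then covering the gaps in sequence."""
    gaps, start = [], None
    for j, covered in enumerate(coverage):       # collect uncovered spans
        if not covered and start is None:
            start = j
        elif covered and start is not None:
            gaps.append((start, j))
            start = None
    if start is not None:
        gaps.append((start, len(coverage)))
    cost, pos = 0, current_end
    for l, r in gaps:                            # visit gaps left to right
        cost += abs(l - pos)                     # jump to the start of the gap
        pos = r                                  # translate it monotonically
    return cost
```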
Rest Cost for Lexicalized DM
• Lexicalized DM per phrase pair (f, e) = (f, t(f))
• DM(f, e) scores: in-mon, in-swap, in-dist, out-mon, out-swap, out-dist
• Treat as a local cost for each span (l, r)
• Minimize over alternative translations and the different orientations in-* and out-*
Effect of Rest-Cost Estimation
• From Richard Zens 2008
• We did not describe the ‘per position’ variant
• The LM rest cost is important, and the DM rest cost is important
Summary
• Different translation strategies – related to word reordering
• Two-level decoding strategy (one possible way to do it):
  • Generating the translation lattice: contains all word and phrase translations
  • Finding the best path
• Word reordering as an extension to the best-path search
  • Jump ahead in the lattice, fill in the gap later
  • Short reordering window: decoding time is exponential in the size of the window
• Recombination of hypotheses
  • If the models cannot re-rank hypotheses, keep only the best one
  • Depends on the models used
Summary
• Pruning of hypotheses
  • Beam pruning
  • Problem with too few hyps in the beam (e.g. when running MERT)
  • Keeping a maximum number of hyps
• Efficiency of the implementation
  • Try to avoid generating hyps which are pruned
  • Cube pruning
• N-best list generation
  • Needed for MERT
  • Spurious ambiguity