
Machine Translation Decoder for Phrase-Based SMT



Presentation Transcript


  1. Machine Translation Decoder for Phrase-Based SMT
  Stephan Vogel, Spring Semester 2011

  2. Decoder
  • Decoding issues
  • Two step decoding
  • Generation of translation lattice
  • Best path search
  • With limited word reordering
  • Specific Issues (Next Session)
  • Recombination of hypotheses
  • Pruning
  • N-best list generation
  • Future cost estimation

  3. Decoding Issues
  The decoder takes the source sentence and all available knowledge (translation model, distortion model, language model, etc.) and generates a target sentence
  • Many alternative translations are possible
  • Too many to explore them all -> pruning is necessary
  • Pruning leads to search errors
  • The decoder outputs the model-best translation
  • Ranking of hypotheses according to the model differs from ranking according to an external metric
  • Bad translations get better model scores than good translations -> model errors
  • Models see only limited context
  • Different hypotheses become identical under the model -> hypothesis recombination

  4. Decoding Issues
  • Languages have different word order
  • Modeled by distortion models
  • Exploring all possible reorderings is too expensive (essentially O(J!))
  • Need to restrict reordering -> different reordering strategies
  • Optimizing the system
  • We use a bunch of models (features) and need to optimize the scaling factors (feature weights)
  • Decoding is expensive
  • Optimize on n-best lists -> need to generate n-best lists

  5. Decoder: The Knowledge Sources
  • Translation models
  • Phrase translation table
  • Statistical lexicon and/or manual lexicon
  • Named entities
  • Translation information stored as transducers or extracted on the fly
  • Language model: standard n-gram LM
  • Distortion model: distance-based or lexicalized
  • Sentence length model
  • Typically simulated by a word-count feature
  • Other features
  • Phrase count
  • Number of untranslated words
  • …

  6. The Decoder: Two Level Approach
  • Build translation lattice
  • Run left to right over the test sentence
  • Search for matching phrases between source sentence and phrase table (and other translation tables)
  • For each translation, insert edges into the lattice
  • First-best search (or n-best search)
  • Run left to right over the lattice
  • Apply n-gram language model
  • Combine translation model scores and language model score
  • Recombine and prune hypotheses
  • At sentence end: add sentence length model score
  • Trace back best hypothesis (or n-best hypotheses)
  • Note: this two-step view is convenient for describing the decoder
  • An implementation can interleave both processes
  • The implementation choice can make a difference due to pruning

  7. Building Translation Lattice
  Sentence: ich komme morgen zu dir
  Reference: I will come to you tomorrow
  • Search in corpus for phrases and their translations
  • Insert edges into the lattice
  [Figure: translation lattice over source positions 0 … J, with edges such as "I", "come", "I will come", "tomorrow", "to you", "to your office"]

  8. Phrase Table in Hash Map
  • Store the phrase table in a hash map (source phrase as key)
  • For each n-gram in the source sentence, access the hash map:
      foreach j = 1 to J              // start position of phrase
        foreach l = 0 to lmax-1       // phrase length
          SourcePhrase = (wj … wj+l)
          TargetPhrases = Hashmap.Get( SourcePhrase )
          foreach TargetPhrase t in TargetPhrases
            create new edge e' = (j-1, j+l, t)    // add TM scores
  • Works fine for sentence input, but too expensive for lattices
  • Lattices from a speech recognizer
  • Paraphrases
  • Reordering as a preprocessing step
  • Hierarchical transducers
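As a concrete illustration of the hash-map lookup above, here is a minimal runnable Python sketch. The Edge tuple, the max_phrase_len limit, and the toy phrase entries are illustrative assumptions, not part of the STTK decoder.

    from collections import namedtuple

    # An edge spans source positions [start, end) and carries one translation
    # option together with its translation-model cost (negative log probability).
    Edge = namedtuple("Edge", ["start", "end", "target", "tm_cost"])

    def build_edges(source_words, phrase_table, max_phrase_len=4):
        """Insert one edge per matching phrase translation (hash-map variant)."""
        edges = []
        J = len(source_words)
        for j in range(J):                                  # start position of phrase
            for l in range(min(max_phrase_len, J - j)):     # phrase length - 1
                src = " ".join(source_words[j:j + l + 1])
                for target, cost in phrase_table.get(src, []):   # hash-map lookup
                    edges.append(Edge(j, j + l + 1, target, cost))
        return edges

    # Toy phrase table for the running example; costs are invented.
    phrase_table = {
        "ich": [("I", 0.9)],
        "ich komme": [("I will come", 0.7), ("I come", 1.1)],
        "morgen": [("tomorrow", 0.5)],
        "zu dir": [("to you", 0.8), ("to your office", 1.6)],
    }

    for e in build_edges("ich komme morgen zu dir".split(), phrase_table):
        print(e)

For plain sentence input this is all that is needed; the prefix-tree and transducer variants on the following slides address the cases (lattices, paraphrases, preprocessing reordering) where this per-n-gram lookup becomes too expensive.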

  9. Example: Paraphrase Lattice
  • Large: top-5 paraphrases
  • Pruned

  10. Phrase Table as Prefix Tree

  11. Phrase Table as Prefix Tree
  [Figure: prefix tree over the source phrase "ja , okay dann Montag bei mir", with target phrases such as "okay then", "okay", "on Monday", "then" attached to the matching nodes]
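To make the prefix-tree idea concrete: instead of hashing every source n-gram separately, the decoder walks the tree one source word at a time and collects translations whenever it passes a node that ends a known source phrase. The sketch below is only an illustration (class and field names are invented, and the example entries and costs are made up); it is not the transducer implementation behind these slides.

    class PhraseTrieNode:
        """One node of the prefix tree: children keyed by the next source word,
        plus the translations of the source phrase ending at this node."""
        def __init__(self):
            self.children = {}
            self.translations = []   # list of (target phrase, TM cost)

    def build_trie(phrase_pairs):
        root = PhraseTrieNode()
        for src, tgt, cost in phrase_pairs:
            node = root
            for word in src.split():
                node = node.children.setdefault(word, PhraseTrieNode())
            node.translations.append((tgt, cost))
        return root

    def match_from(root, words, start):
        """Walk the trie along words[start:], yielding (end, translations)
        for every prefix that is a known source phrase."""
        node = root
        for end in range(start, len(words)):
            node = node.children.get(words[end])
            if node is None:
                break
            if node.translations:
                yield end + 1, node.translations

    # Example in the spirit of the slide; translations and costs are invented.
    trie = build_trie([
        ("ja", "yes", 0.4),
        ("ja , okay", "okay then", 0.9),
        ("dann", "then", 0.5),
        ("Montag", "Monday", 0.3),
        ("bei mir", "at my place", 1.0),
    ])

    words = "ja , okay dann Montag bei mir".split()
    for start in range(len(words)):
        for end, translations in match_from(trie, words, start):
            print(words[start:end], "->", translations)

Because matching is incremental, the same walk can be resumed at every lattice node, which is what the hypothesis book-keeping on the next slides does.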

  12. Building the Translation Lattice
  • Book-keeping: hypothesis h = (n, n, s0, hprev, e)
    n – node
    s0 – initial state in transducer
    hprev – previous hypothesis
    e – edge
  • Convert the sentence into a lattice structure
  • At each node n, insert an 'empty' hypothesis h = (n, n, s0, hprev = nil, e = nil) as the starting point for the phrase search from this position
  • Note: the previous hypothesis and edge are only needed for hierarchical transducers, to be able to 'propagate' partial translations

  13. Algorithm for Building Translation Lattice
      foreach node n = 0 to J
        create empty hypothesis h0 = (n, n, s0, NIL, NIL)
        Hyps( n ) = Hyps( n ) + h0
        foreach incoming edge e in n
          w = WordAt( e )
          nprev = FromNode( e )
          foreach hypothesis hprev = (nstart, nprev, sprev, hx, ex) in Hyps( nprev )
            if transducer T has transition (sprev -> s' : w)
              if s' is an emitting state
                foreach translation t emitted in s'
                  create new edge e' = (nstart, n, t)    // add TM scores
              if s' is not a final state
                create new hypothesis h' = (nstart, n, s', hprev, e)
                Hyps( n ) = Hyps( n ) + h'

  14. Searching for Best Translation
  • We have constructed a graph
  • Directed
  • No cycles
  • Each edge carries a partial translation (with scores)
  • Now we need to find the best path
  • Adding additional information (DM, LM, …)
  • Allowing for some reordering

  15. Monotone Search
  • Hypotheses describe partial translations
  • Coverage information, translation, scores
  • Expand hypothesis over outgoing edges
  [Figure: lattice for "ich komme morgen zu dir" with hypotheses expanded left to right, e.g.
    h: c=0..3, t=I will come tomorrow
    h: c=0..4, t=I will come tomorrow to
    h: c=0..5, t=I will come tomorrow to your office]

  16. Reordering Strategies
  1. All permutations
     • Any reordering possible
     • Complexity of the traveling salesman problem -> only possible for very short sentences
  2. Small jumps ahead – filling the gaps pretty soon
     • Only local word reordering
     • Implemented in the STTK decoder
  3. Leaving a small number of gaps – fill in at any time
     • Allows for global but limited reordering
     • Similar decoding complexity – exponential in the number of gaps
  4. IBM-style reordering (described in an IBM patent)
     • Merging neighboring regions with swap – no gaps at all
     • Allows for global reordering
     • Complexity lower than strategy 1, but higher than strategies 2 and 3

  17. IBM Style Reordering
  • Example: first word translated last!
  [Figure: coverage built up region by region over the source positions, with gaps opened and partially filled]
  • Resulting reordering: 2 3 7 8 9 10 11 5 6 4 12 13 14 1

  18. Sliding Window Reordering
  • Local reordering within a sliding window of size 6
  [Figure: coverage vectors over successive expansion steps; gaps are opened, partially filled, and closed only inside the sliding window]

  19. Coverage Information
  • Need to know which source words have already been translated
  • Don't want to miss some words
  • Don't want to translate words twice
  • Can compare hypotheses which cover the same words
  • Use a coverage vector to store this information
  • For 'small jumps ahead': position of first gap plus short bit vector
  • For 'small number of gaps': array of positions of uncovered words
  • For 'merging neighboring regions': left and right position

  20. Limited Distance Word Reordering
  • Word and phrase reordering within a given window
  • From the first untranslated source word over the next k positions
  • Window length 1: monotone decoding
  • Restrict the total number of reorderings (typically 3 per 10 words)
  • Simple 'jump' model or lexicalized distortion model
  • Use a bit vector: 1001100… = words 1, 4, and 5 translated
  • For long sentences this means long bit vectors, but only limited reordering is allowed, therefore:
    Coverage = (first untranslated word, bit vector), i.e. 111100110… -> (4, 00110…)
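One common way to realize this compact coverage representation is an integer bitmask plus the index of the first uncovered position. The helpers below are a minimal sketch under that assumption; the window and phrase-length limits are illustrative parameters, not the decoder's actual settings.

    def first_gap(coverage, J):
        """Index of the first untranslated source position (J if all covered).
        Bit k of the mask corresponds to source word k (0-indexed)."""
        j = 0
        while j < J and (coverage >> j) & 1:
            j += 1
        return j

    def allowed_spans(coverage, J, window=4, max_len=4):
        """Source spans [start, end) that may be translated next: the start must lie
        within `window` positions of the first gap, and no position in the span may
        already be covered. Yields (start, end, new coverage)."""
        gap = first_gap(coverage, J)
        for start in range(gap, min(gap + window, J)):
            for end in range(start + 1, min(start + max_len, J) + 1):
                span_mask = ((1 << (end - start)) - 1) << start
                if coverage & span_mask == 0:          # no coverage collision
                    yield start, end, coverage | span_mask

    # Example: the first two source words are already translated, J = 5.
    for start, end, new_cov in allowed_spans(0b00011, 5, window=3, max_len=2):
        print(start, end, format(new_cov, "05b"))

Storing only (first gap, short bit vector) instead of the full mask is then a simple renormalization: drop the solid prefix of covered positions and remember its length.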

  21. Jumping ahead in the Lattice
  • Hypotheses describe a partial translation
  • Coverage information, translation, scores
  • Expand hypothesis over uncovered positions (within window)
  [Figure: lattice for "ich komme morgen zu dir" with non-monotone expansions, e.g.
    h: c=11000, t=I will come
    h: c=11011, t=I will come to your office
    h: c=11111, t=I will come to your office tomorrow]

  22. Hypothesis for Search
  • Organize the search according to the number of translated words c
  • It is expensive to expand the translation
  • Replace it by back-trace information
  • Generate the full translation only for the best (n-best) final translation
  • Book-keeping: hypothesis h = (Q, C, L, i, hprev, e)
    Q – total cost (we also keep cumulative costs for the individual models)
    C – coverage information: positions already translated
    L – language model state: e.g. last n-1 words for an n-gram LM
    i – number of target words
    hprev – pointer to the previous hypothesis
    e – edge traversed to expand hprev into h
  • hprev and e are the back-trace information, used to reconstruct the full translation
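In Python terms the book-keeping record could be a small dataclass like the one below; the field names follow the slide, but the class itself and the choice of a tuple for the LM state are illustrative assumptions (Edge refers to the lattice-edge tuple sketched earlier).

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class Hypothesis:
        Q: float                       # total cost so far (weighted sum of model costs)
        C: int                         # coverage bitmask over source positions
        L: Tuple[str, ...]             # LM state, e.g. the last n-1 target words
        i: int                         # number of target words produced
        hprev: Optional["Hypothesis"]  # back-pointer to the previous hypothesis
        e: Optional["Edge"]            # edge traversed to expand hprev into this hypothesis

        def recombination_key(self):
            # Hypotheses that agree on coverage and LM state behave identically under
            # the models for all future expansions and can be recombined (next session).
            return (self.C, self.L)

Keeping only (hprev, e) as back-trace information means a full translation string is built just once, for the finally selected hypothesis, instead of being copied on every expansion.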

  23. Algorithm for Applying Language Model
      for coverage c = 0 to J-1
        foreach h in Hyps( c )
          foreach node n within the reordering window
            foreach outgoing edge e in n
              if no coverage collision between h.C and C(e)
                TMScore = -log p( t | s )      // typically several scores
                DMScore = -log p( jump )       // or lexicalized DM score
                // other scores like word count, phrase count, etc.
                foreach target word tk in t
                  LMScore += -log p( tk | Lk-1 )
                  Lk = Lk-1 ⊕ tk
                endfor
                Q' = k1*TMScore + k2*LMScore + k3*DMScore + …
                h' = ( h.Q + Q', h.C ∪ C(e), L', h.i + |t|, h, e )
                Hyps( c' ) += h'               // c' = number of source words covered by h'
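A Python rendering of the loop body might look like the function below, reusing the Edge and Hypothesis sketches from above. The weight names, the crude distance-based jump cost, and the uniform stand-in LM are assumptions for illustration; a real decoder would also recombine and prune the new hypotheses in each coverage class.

    import math

    def expand(h, edge, weights, lm_cost):
        """Expand hypothesis h over a lattice edge, applying TM, DM and LM scores."""
        span_mask = ((1 << (edge.end - edge.start)) - 1) << edge.start
        if h.C & span_mask:                      # coverage collision: skip this edge
            return None

        tm = edge.tm_cost
        prev_end = h.e.end if h.e is not None else 0
        dm = abs(edge.start - prev_end)          # crude distance-based jump cost

        lm, state = 0.0, h.L
        for word in edge.target.split():         # apply the LM word by word
            lm += lm_cost(state, word)
            state = (state + (word,))[-1:]       # keep last n-1 words (bigram LM here)

        dQ = weights["tm"] * tm + weights["lm"] * lm + weights["dm"] * dm
        return Hypothesis(Q=h.Q + dQ,
                          C=h.C | span_mask,     # union of the two coverage sets
                          L=state,
                          i=h.i + len(edge.target.split()),
                          hprev=h, e=edge)

    # Toy stand-in LM so the sketch runs: every word gets the same cost.
    def uniform_lm_cost(state, word):
        return -math.log(0.1)

The resulting hypothesis is then filed into Hyps(c'), with c' computed as the number of set bits in the new coverage, e.g. bin(C).count("1").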

  24. Algorithm for Applying LM (cont.)
      // coverage is now J, i.e. sentence end reached
      foreach h in Hyps( J )
        SLScore = -log p( h.i | J )        // sentence length model
        LMScore = -log p( </s> | Lh )      // end-of-sentence LM score
        L' = Lh ⊕ </s>
        Q' = a*LMScore + b*SLScore
        h' = ( h.Q + Q', h.C, L', h.i, h, e )
        Hyps( J+1 ) += h'
      Sort Hyps( J+1 ) according to the total score Q
      Trace back over the sequence of (h, e) to construct the actual translation
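The final trace-back is just a short loop over the (hprev, e) back-pointers, again in terms of the Hypothesis and Edge sketches above:

    def backtrace(h):
        """Reconstruct the target sentence by following back-pointers from the best
        final hypothesis back to the initial empty hypothesis."""
        words = []
        while h is not None and h.e is not None:
            words.append(h.e.target)
            h = h.hprev
        return " ".join(reversed(words))

For an n-best list the same trace-back is run for the n best entries of Hyps(J+1).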

  25. Sentence Length Model
  • Different languages have different levels of 'wordiness'
  • A histogram of source sentence length vs. target sentence length shows that the distribution is rather flat -> p( J | I ) is not very helpful
  • Very simple sentence length model: the more, the better
  • i.e. give a bonus for each word (not a probabilistic model)
  • Balances the shortening effect of the LM
  • Can be applied immediately, as the absolute length is not important
  • However: this is insensitive to what is in the sentence
  • It optimizes the length of the translations for the entire test set, not for each sentence
  • Some sentences are made too long to compensate for sentences that are too short
