Search Applications: Machine Translation • Next time: Constraint Satisfaction • Reading for today: See “Machine Translation Paper” under links • Reading for next time: Chapter 5
Agenda • Introduction to machine translation • Statistical approaches • Use of parallel data • Alignment • What functions must be optimized? • Comparison of A* and greedy local search (hill climbing) algorithms for translation • How they work • Their performance
Approach to Statistical MT • Translate from past experience • Observe how words, phrases, and sentences are translated • Given new sentences in the source language, choose the most probable translation in the target language • Data: large corpus of parallel text • E.g., Canadian Parliamentary proceedings
Data • Example • Ce n’est pas clair. • It is not clear. • Quantity • 200 billion words (2004 MT evaluation) • Sources • Hansards: Canadian parliamentary proceedings • Hong Kong: official documents published in multiple languages • Newspapers published in multiple languages • Religious and literary works
Alignment – the first step • Which sentences or paragraphs in one language correspond to which sentences or paragraphs in the other language? (Or which words?) • Problems • Translators don’t produce word-for-word translations • Crossing alignments • Types of alignment • 1:1 (90% of the cases) • 1:2, 2:1 • 3:1, 1:3
Fertility: a word may be translated by more than one word • Notamment -> in particular (fertility 2) • Limonades -> soft drinks • Fertility 0: a word translated by zero words • Des ventes -> sales (“des” is untranslated) • Les boissons à base de cola -> cola drinks • Many-to-many: • Elles rencontrent toujours plus d’adeptes -> The growing popularity
Bead for sentence alignment • A group of sentences in one language that corresponds in content to some group of sentences in the other language • Either group can be empty • How much content has to overlap between sentences to count it as alignment? • An overlapping clause can be sufficient
Methods for alignment • Length based • Offset alignment • Word based • Anchors (e.g., cognates)
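For the length-based family, a toy dynamic-programming sketch in the spirit of Gale and Church; the squared-difference cost and the restriction to 1:1, 1:0, and 0:1 beads are simplifying assumptions (their method also scores 2:1 and 1:2 beads with a Gaussian cost model):

def length_align_cost(src_lens, tgt_lens, skip_cost=50.0):
    # src_lens / tgt_lens: character lengths of each sentence.
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:      # 1:1 bead: match by length similarity
                c = cost[i][j] + (src_lens[i] - tgt_lens[j]) ** 2 / 100.0
                cost[i + 1][j + 1] = min(cost[i + 1][j + 1], c)
            if i < n:                # 1:0 bead: source sentence unmatched
                cost[i + 1][j] = min(cost[i + 1][j], cost[i][j] + skip_cost)
            if j < m:                # 0:1 bead: target sentence unmatched
                cost[i][j + 1] = min(cost[i][j + 1], cost[i][j] + skip_cost)
    return cost[n][m]    # add backpointers to recover the bead sequence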
Word Based Alignment • Assume the first and last sentences of the texts align (anchors) • Then, until most sentences are aligned: • Form an envelope of candidate alignments from the Cartesian product of the sentence lists • Exclude alignments that cross anchors or are too distant • Choose pairs of words that tend to co-occur in the candidate alignments • Find pairs of source and target sentences that contain many possible lexical correspondences • The most reliable pairs augment the set of anchors
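A minimal sketch of this anchor-driven loop; the bilingual lexicon `lex`, the distance limit, and the overlap threshold are illustrative assumptions rather than details of any specific published method:

def lexical_overlap(src_sent, tgt_sent, lex):
    # Count source/target word pairs the lexicon says correspond.
    return sum(1 for s in src_sent for t in tgt_sent if (s, t) in lex)

def word_align(src_sents, tgt_sents, lex, max_offset=5, min_overlap=3):
    # Boundary anchors: first and last sentences are assumed to align.
    anchors = {0: 0, len(src_sents) - 1: len(tgt_sents) - 1}
    added = True
    while added:                             # repeat until no new anchors
        added = False
        for i, src in enumerate(src_sents):
            if i in anchors:
                continue
            for j, tgt in enumerate(tgt_sents):
                # Exclude pairs that cross an anchor or are too distant.
                if any((i - a) * (j - b) <= 0 for a, b in anchors.items()):
                    continue
                if abs(i - j) > max_offset:
                    continue
                # Reliable pairs (many lexical matches) become anchors.
                if lexical_overlap(src, tgt, lex) >= min_overlap:
                    anchors[i] = j
                    added = True
                    break
    return sorted(anchors.items())

Each pass promotes the most lexically similar sentence pairs to anchors, which in turn tightens the envelope for the next pass.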
The Noisy Channel Model for MT • [Diagram: a language model P(e) generates English e; passing e through the noisy channel (translation model P(f|e)) yields French f; the decoder recovers e’ = argmax_e P(e|f)]
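The decomposition behind the diagram follows from Bayes' rule: P(f) is fixed for the observed French sentence, so it drops out of the argmax and the decoder can maximize the product of the language and translation models:

\[
e' \;=\; \operatorname*{argmax}_{e} P(e \mid f)
   \;=\; \operatorname*{argmax}_{e} \frac{P(e)\,P(f \mid e)}{P(f)}
   \;=\; \operatorname*{argmax}_{e} P(e)\,P(f \mid e)
\]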
The problem • Language model constructed from a large corpus of English • Bigram model: probability of word pairs • Trigram model: probability of 3 words in a row • From these, compute sentence probability • Translation model can be derived from alignment • For any pair of English/French words, what is the probability that the pair is a translation? • Decoding is the problem: given an unseen French sentence, how do we determine the translation?
Language Model • Predict the next word given the previous words: P(w_n | w_1 … w_{n-1}) • Markov assumption: only the last few words affect the next word • Usual cases: bigram, trigram, 4-gram • Sue swallowed the large green … • Parameter estimation • Bigram: 20,000 × 19,000 = 400 million • Trigram: 20,000² × 19,000 = 8 trillion • 4-gram: 20,000³ × 19,000 = 1.6 × 10^17
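A minimal bigram model sketch with add-alpha smoothing; the corpus format, smoothing choice, and vocabulary size are illustrative assumptions:

from collections import Counter
from math import log

def train_bigram(corpus):
    # corpus: list of sentences, each a list of words
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        words = ["<s>"] + sent + ["</s>"]
        unigrams.update(words[:-1])                # count contexts
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams

def sentence_logprob(sent, unigrams, bigrams, alpha=0.1, vocab=20000):
    # Sum log P(w_n | w_{n-1}), smoothed so unseen pairs get some mass.
    words = ["<s>"] + sent + ["</s>"]
    lp = 0.0
    for prev, cur in zip(words[:-1], words[1:]):
        lp += log((bigrams[(prev, cur)] + alpha) /
                  (unigrams[prev] + alpha * vocab))
    return lp

The parameter counts on the slide are exactly why smoothing matters: most of the 8 trillion possible trigrams never occur in any training corpus.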
Translation Model • For a particular word alignment, multiply the m translation probabilities: • P(Jean aime Marie | John loves Mary) • P(Jean|John) × P(aime|loves) × P(Marie|Mary) • Then sum the probabilities over all alignments
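A sketch of that two-step computation in the style of IBM Model 1, where every French word may align to any English position (the NULL word is omitted for brevity, and `t` is a hypothetical word-translation table):

from itertools import product
from math import prod   # Python 3.8+

def p_f_given_e(french, english, t):
    # Sum over all word alignments of the product of per-word
    # translation probabilities; exponential in len(french),
    # so this brute-force form is for illustration only.
    total = 0.0
    for alignment in product(range(len(english)), repeat=len(french)):
        total += prod(t.get((f, english[a]), 1e-9)
                      for f, a in zip(french, alignment))
    return total

For the slide's example, one term of the sum is t[("Jean","John")] × t[("aime","loves")] × t[("Marie","Mary")].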
Decoding is NP-complete • When considering any word re-ordering • Swapped words • Words with fertility > 1 (insertions) • Words with fertility 0 (deletions) • Usual strategy: examine a subset of likely possibilities and choose from that • Search error: the decoder returns e’ but there exists some e s.t. P(e|f) > P(e’|f)
Example Decoding Errors • Search Error: Permettez que je donne un exemple à la chambre. / Let me give the House one example. / Let me give an example in the House. • Model Error: Vous avez besoin de toute l’aide disponible. / You need all the help you can get. / You need of the whole benefits available.
Search • Traditional decoding method: stack decoder • A* algorithm • Deeply explore each hypothesis • Fast greedy algorithm • Much faster than A* • How often does it fail? • Integer Programming Method • Transform to Traveling Salesman (see paper) • Very slow • Guaranteed to find the best choice
Large branching factors • Machine Translation • Input: sequence of n words, each with up to 200 possible target-word translations • Output: sequence of m words in the target language that has a high score under some goodness criterion • Search space: a 6-word French sentence has 10^300 distinct translations under the IBM M4 translation model [Soricut, Knight, Marcu, AMTA-2002]
Stack decoder: A* • Initialize the stack with an empty hypothesis • Loop: • Pop h, the best hypothesis, off the stack • If h is a complete sentence, output h and terminate • For each possible next word w, extend h by adding w and push the resulting hypothesis onto the stack
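A sketch of this loop in Python, with a priority queue playing the role of the stack; `extensions`, `score`, and `is_complete` are hypothetical callbacks, and a real decoder would add an admissible estimate of the remaining cost to the priority so that A* stays optimal:

import heapq

def stack_decode(source, extensions, score, is_complete, max_pops=100000):
    # extensions(hyp) yields candidate next target words;
    # score(hyp) returns a log-probability (higher is better).
    heap = [(0.0, [])]                       # (negated score, hypothesis)
    for _ in range(max_pops):
        if not heap:
            break
        neg, hyp = heapq.heappop(heap)       # best-first: pop best hypothesis
        if is_complete(hyp, source):
            return hyp                       # first completed goal is best
        for w in extensions(hyp):
            new = hyp + [w]
            heapq.heappush(heap, (-score(new), new))
    return None                              # search budget exhausted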
Complications • It’s not a simple left-to-right translation • Because we multiply probabilities as we add words, shorter hypotheses will always win • Use multiple stacks, one for each length • Given fertility possibilities, when we add a new target word for an input source word, how many do we add?
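A sketch of the multiple-stack remedy, keeping one stack per hypothesis length so that partial translations compete only with others of the same length; the beam width is an illustrative pruning choice, not a prescribed value:

import heapq

def multistack_decode(extensions, score, max_len, beam=100):
    stacks = [[] for _ in range(max_len + 1)]   # one stack per length
    stacks[0] = [(0.0, [])]                     # the empty hypothesis
    for length in range(max_len):
        # Expand only the `beam` best hypotheses of the current length.
        for neg, hyp in heapq.nsmallest(beam, stacks[length]):
            for w in extensions(hyp):
                new = hyp + [w]
                heapq.heappush(stacks[length + 1], (-score(new), new))
    if not stacks[max_len]:
        return None
    neg, best = min(stacks[max_len])            # smallest negated score = best
    return best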
Hill climbing

function HillClimbing(problem, initial-state, queuing-fn)
  node ← MakeNode(initial-state(problem));
  while T do
    next ← Best(SearchOperator-fn(node, cost-fn));
    if (IsBetter-fn(next, node)) then continue;
    else if (GoalTest(node)) then return node;
    else exit;
  end while
  return Failure;

MT (Germann et al., ACL-2001)

node ← targetGloss(sourceSentence);
while T do
  next ← Best(LocallyModifiedTranslationOf(node));
  if (IsBetter(next, node)) then continue;
  else print node; exit;
end while
Types of changes • Translate one or two words (j1 e1 j2 e2) • Translate and insert (j e1 e2) • Remove word of fertility 0 (i) • Swap segments (i1 i2 j1 j2) • Join words (i1 i2)
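A sketch of the greedy loop over these operators; `gloss` (the word-for-word initial translation), `neighbors` (which would enumerate the five change types above), and `score` are hypothetical stand-ins for the components of the Germann et al. decoder:

def greedy_decode(source, gloss, neighbors, score):
    # Hill climbing: start from a gloss and keep taking the best
    # local modification until none of them improves the score.
    current = gloss(source)
    while True:
        best = max(neighbors(current), key=score, default=None)
        if best is None or score(best) <= score(current):
            return current                  # local optimum reached
        current = best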
Example • Total of 77,421 possible translations attempted
How to search better? The generic search skeleton has four functions one can vary: • MakeNode(initial-state(problem)) • RemoveFront(Q) • SearchOperator-fn(node, cost-fn) • queuing-fn(problem, Q, (Next, Cost))
Example 1: Greedy Search

MakeNode(initial-state(problem))

Machine Translation (Marcu and Wong, EMNLP-2002)

node ← targetGloss(sourceSentence);
while T do
  next ← Best(LocallyModifiedTranslationOf(node));
  if (IsBetter(next, node)) then continue;
  else print node; exit;
end while
Model validation / Model stress-testing: climbing the wrong peak • Which sentence is more grammatical? 1. better bart than madonna , i say 2. i say better than bart madonna , • Can you make a sentence with these words? a and apparently as be could dissimilar firing identical neural really so things thought two
Language-model stress-testing • Input: bag of words • Output: best sequence according to a linear combination of: • an n-gram LM • a syntax-based LM (Collins, 1997)
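Concretely, each candidate word order e is ranked by a weighted sum of the two models' log scores; the weights and notation below are ours, not from the slide:

\[
\text{score}(e) \;=\; \lambda_{1}\,\log P_{\text{ngram}}(e)
               \;+\; \lambda_{2}\,\log P_{\text{SBLM}}(e)
\]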
Size: 3-7 words long • Best searched: 32.3: i say better than bart madonna , • Original word order: 41.6: better bart than madonna , i say

Size: 10-25 words long • Best searched: 51.6: and so could really be a neural apparently thought things as dissimilar firing two identical • Original word order: 64.3: could two things so apparently dissimilar as a thought and neural firing really be identical

SBLM*: trained on an additional 160k WSJ sentences.