The State of the Art in Phrase-Based Statistical Machine Translation (SMT) Roland Kuhn, George Foster, Nicola Ueffing February 2007. Tutorial Plan. A. Overview B. Details & research topics
Tutorial Plan A. Overview B. Details & research topics NOTE: the best overall reference for SMT hasn’t been published yet: Philipp Koehn’s « Statistical Machine Translation » (to be published by Cambridge University Press). Some of the material presented here is from a draft of that book.
Tutorial Plan
• Overview
  The MT Task & Approaches to it
  Examples of SMT output
  SMT Research: Culture, Evaluations, & Metrics
  SMT History: IBM Models
  Phrase-based SMT
  Phrase-Based Search
  Loglinear Model Combination
  Target Language Model P(T)
  Flaws of Phrase-based, Loglinear Systems
  PORTAGE: a Typical SMT System
The MT Task & Approaches to it • Core MT task: translate a sentence from a source language S to target language T • Conventional expert system approach: hire experts to write rules for translating S to T • Statistical approach: using a bilingual text corpus (lots of S sentences & their translations into T), train a statistical translation model that will map each new S sentence into a T sentence
The MT Task & Approaches to it
[Figure: expert system vs. statistical system, both translating S: Mais où sont les neiges d’antan?]
Expert System: experts write manually coded rules (If « … » then … If « … » then … Else …). Expert system output: T: But where are the snows of yesteryear?
Statistical System: bilingual parallel corpus (S, T) + machine learning yield statistical rules, e.g. P(but | mais) = 0.7, P(however | mais) = 0.3, P(where | où) = 1.0, … Statistical system output:
T1: But where are the snows of yesteryear? P = 0.41
T2: However, where are yesterday’s snows? P = 0.33
T3: Hey - where did the old snow go? P = 0.18
…
The MT Task & Approaches to it “Expert” vs. “Statistical” systems • Expert systems incorporate deep linguistic knowledge • They still yield top performance for well-studied language pairs in non-specialized domains • Computationally cheap (compared to statistical MT) BUT - • Brittle • Expensive to maintain (messy software engineering) • Expensive to port to new semantic domains or new language pairs • Typically yield only one T sentence for each S sentence
The MT Task & Approaches to it “Expert” vs. “Statistical” systems • More E-text, better algorithms, stronger machines → quality of SMT output approaching that of expert systems • Statistical approach has beaten expert systems in related areas - e.g., automatic speech recognition • SMT is robust (does well on frequent phenomena) • Easy to maintain • Easily ported to new semantic domain or new language pairs – IF training corpora available • For each S sentence, yields many T sentences (each with a probabilistic score) – useful for semi-supervised translation
The MT Task & Approaches to it Structure of Typical SMT System
[Figure: components of a typical SMT system: Preprocessor, Decoder, Reordering, Postprocessor, Phrase Translation Model, Target Language Model, Other Knowledge Sources.]
Offline training: the bilingual parallel corpus (S, T) is used to train the Phrase Translation Model; target text (plus optional extra LM training corpora) is used to train the Target Language Model.
Input: S: Mais où sont les neiges d’antan? After preprocessing: mais où sont les neiges d’ antan ?
Initial N-best hypotheses:
T1: however where are the snows #d’ antan# P = 0.22
T2: but where are the snows #d’ antan# P = 0.21
T3: but where did the #d’ antan# snow go P = 0.13
…
Final N-best hypotheses:
T1: But where are the snows of yesteryear? P = 0.41
T2: However, where are yesterday’s snows? P = 0.33
…
The MT Task & Approaches to it Commercial Systems • Systran, biggest MT company, uses expert systems; so do most MT companies. However, Systran has recently begun exploring possibility of adding a statistical component to their system. • Important exception: LanguageWeaver, new company based on SMT (closely linked to researchers at ISI, U. Southern California) • Google has superb SMT research team – but online, they still mainly use Systran (probably because of computational cost of online SMT). Seem to be gradually swapping in SMT systems for language pairs with lower traffic.
Examples of SMT output Chinese → English output:
REF: Hong Kong citizens jumped for joy when they knew Beijing's bid for 2008 Olympic games was successful.
PORTAGE Dec. 2004: The public see that Beijing's hosting of the Olympic Games in 2008 excited.
PORTAGE Nov. 2006: Hong Kong people see Beijing's successful bid for the 2008 Olympic Games, very happy.
REF: The U.S. delegation includes a China expert from Stanford University, two Senate foreign policy aides and a former State Department official who has negotiated with North Korea.
PORTAGE Dec. 2004: The United States delegation comprising members from the Stanford University, one of the Chinese experts, two of the Senate foreign policy as well as assistant who was responsible for dealing with Pyongyang authorities of the former State Department officials.
PORTAGE Nov. 2006: The US delegation included members from Stanford University and an expert on China, two Senate foreign policy, and one who is responsible for dealing with Pyongyang authorities, a former State Department officials.
REF: Kuwait foreign minister Mohammad Al Sabah and visiting Jordan foreign minister Muasher jointly presided the first meeting of the joint higher committee of the two countries on that day.
PORTAGE Dec. 2004: Kuwaiti Foreign Secretary Sabah on that day and visiting Jordan Foreign Secretary maasher co-chaired the section about the two countries mixed Committee at the inaugural meeting.
PORTAGE Nov. 2006: Kuwaiti Foreign Minister Sabah day and visiting Jordanian Foreign Minister of Malaysia, co-chaired by the two countries, the joint commission met for the first time.
REF: The Beagle 2 was scheduled to land on Mars on Christmas Day, but its signal is still difficult to pin down.
PORTAGE Dec. 2004: small dog meat, originally scheduled for Christmas landing Mars, but it is a signal remains elusive.
PORTAGE Nov. 2006: 2 small dog meat for Christmas landing on Mars, but it signals is still unpredictable.
Examples of SMT output And a silly English → German example from Google (Jan. 25, 2007): the hotel has a squash court das Hotel hat ein Kürbisgericht (think “zucchini tribunal”) * but this kind of error – perfect syntax, never-seen word combination – isn’t typical of a statistical system, so this was probably a rule-based system
SMT Research: Culture, Evaluations, & Metrics Culture • SMT research is very engineering-oriented; driven by performance in NIST & other evaluations (see later slides) → if a heuristic yields a big improvement in BLEU scores & a wonderful new theoretical approach doesn’t, expect the former to get much more attention than the latter • Advantages of SMT culture: open-minded to new ideas that can be tested quickly; researchers who count have working systems with reasonably well-written software (so they can participate in evaluations) • Disadvantages of SMT culture: closed-minded to ideas not tested in a working system → if you have a brilliant theory that doesn’t show a BLEU score improvement in a reasonable baseline system, don’t expect SMT researchers to read your paper!
SMT Research: Culture, Evaluations, & Metrics The NIST MT Evaluations
Since 2001, US National Institute of Standards & Technology (NIST) has been evaluating MT systems.
Participants include MIT, IBM, CMU, RWTH, Hong Kong UST, ATR, IRST, others … and NRC: NRC’s system is called PORTAGE (in NIST evaluation 2005 & 2006).
Main NIST language pairs: Chinese→English, Arabic→English.
Semantic domains: news stories & multigenre.
Training corpora released each fall, test corpus each spring; participants have 1 working week to submit target sentences.
NIST evaluates systems comparatively.
In 2005 http://www.nist.gov/speech/tests/mt/mt05eval_official_results_release_20050801_v3.html & 2006 http://www.nist.gov/speech/tests/mt/mt06eval_official_results.html statistical systems beat expert systems according to BLEU metric.
SMT Research: Culture, Evaluations, & Metrics Other MT Evaluations
WPT/WMT: usually organized each spring by Philipp Koehn & Christoph Monz – smaller training corpora than NIST, European language pairs. In 2006, evaluated on French <-> English, German <-> English, Spanish <-> English. http://www.statmt.org/wmt06/proceedings/
TC-STAR: evaluation for spoken language translation. In 2006, evaluated on Chinese->English (one direction only) and Spanish <-> English. http://www.elda.org/tcstar-workshop/2006eval.htm
IWSLT: evaluation for spoken language translation. In 2006, evaluated on Arabic->English, Chinese->English, Italian->English, Japanese->English. http://www.slt.atr.jp/IWSLT2006_whatsnew/index.html
SMT Research: Culture, Evaluations, & Metrics GALE Project
Huge DARPA-sponsored project: $50 million per year for 5 years. Three consortia: BBN-led « Agile », IBM-led « Rosetta », SRI-led « Nightingale ». NRC team is in MT working group of Nightingale.
[Figure: GALE processing components: automatic speech recognition (ASR), machine translation (MT), distillation, and an IR/database component; data shown: (Arabic or Chinese) speech, (Arabic or Chinese) transcriptions, (Arabic or Chinese) documents, English text.]
SMT Research: Culture, Evaluations, & Metrics What is BLEU?
Human evaluation of automatic translation quality is hard & expensive. BLEU metric (invented at IBM) compares MT output with human-generated reference translations via N-gram matches.
$\text{N-gram precision} = \dfrac{\#(\text{N-grams in MT output seen in ref.})}{\#(\text{N-grams in MT output})}$
Example (from P. Koehn):
REF = Israeli officials are responsible for airport security
Sys A = Israeli officials responsibility of airport safety
Sys B = airport security Israeli officials are responsible
[Figure annotations on the example: 1-gram match, 2-gram matches, 4-gram match.]
SMT Research: Culture, Evaluations, & Metrics What is BLEU?
REF = Israeli officials are responsible for airport security
Sys A = Israeli officials responsibility of airport safety
Sys B = airport security Israeli officials are responsible
Sys A: 1-gram precision = 3/6 (Israeli, officials, airport); 2-gram precision = 1/5 (Israeli officials); 3-gram precision = 0/4; 4-gram precision = 0/3.
Sys B: 1-gram precision = 6/6; 2-gram precision = 4/5; 3-gram precision = 2/4; 4-gram precision = 1/3.
BLEU-N multiplies together the N N-gram precisions – the higher the value, the better the translation. But a system could cheat by having very few words in MT output – so, brevity penalty.
SMT Research: Culture, Evaluations, & Metrics What is BLEU?
$\text{BLEU-}N = \text{brevity-penalty} \times \prod_{i=1}^{N} (\text{precision}_i)^{\lambda_i}$, where $\text{brevity-penalty} = \min(1, \text{output-length}/\text{ref-length})$.
Usually, we set N=4 and all $\lambda_i = 1$, so we have $\text{BLEU-4} = \min(1, \text{output-length}/\text{ref-length}) \times \prod_{i=1}^{4} \text{precision}_i$.
If any MT output has no N-grams matching ref., for some N=1, …, 4, BLEU-4 is zero. So, normally compute BLEU over whole test set of at least a hundred or so sentences.
Multiple references: if an N-gram has K occurrences in output, look for single ref. that has K or more copies of that N-gram. If find such a single ref., that N-gram has matched K times. If not, look for a ref. that has the highest # of copies (L) of that N-gram; use L in precision calculation. Ref-length = closest length.
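The single-reference case of the definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization, one reference, all exponents equal to 1, and the plain-product form given on the slide; the function names `ngrams`, `ngram_precision` and `bleu` are invented for this sketch and do not come from any toolkit.

```python
from collections import Counter
from fractions import Fraction

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(output, ref, n):
    """Clipped n-gram precision of `output` against a single reference."""
    out_counts = Counter(ngrams(output, n))
    ref_counts = Counter(ngrams(ref, n))
    # An n-gram seen K times in the output matches at most as often as it
    # occurs in the reference.
    matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    total = sum(out_counts.values())
    return Fraction(matched, total) if total else Fraction(0)

def bleu(output, ref, max_n=4):
    """Brevity penalty times the product of the 1..max_n-gram precisions."""
    score = min(1.0, len(output) / len(ref))
    for n in range(1, max_n + 1):
        score *= float(ngram_precision(output, ref, n))
    return score

ref = "Israeli officials are responsible for airport security".split()
sys_a = "Israeli officials responsibility of airport safety".split()
sys_b = "airport security Israeli officials are responsible".split()

for name, hyp in (("Sys A", sys_a), ("Sys B", sys_b)):
    precisions = [str(ngram_precision(hyp, ref, n)) for n in range(1, 5)]
    print(name, precisions, bleu(hyp, ref))
```

Sys A gets BLEU-4 = 0 on this single sentence because it has no matching 3-grams, which is the reason the metric is normally computed over a whole test set.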
SMT Research: Culture, Evaluations, & Metrics Does BLEU correlate with human judgment?
[Figure: axes labelled « Translator Identity » and « Quality score: 0 = terrible, 3 = excellent ».]
* BLEU kind of correlates with human judgment; works best with multiple references.
SMT Research: Culture, Evaluations, & Metrics Why BLEU Is Controversial
If system produces a brilliant translation that uses many N-grams not found in the references, it will receive a low score. Proponents of the expert system approach argue that BLEU is biased against this approach, & favours SMT.
Partial confirmation:
1. in NIST 2006 Arabic-to-English evaluation, AppTek hybrid system (rule-based + SMT system) did best according to human evaluators, but not according to BLEU.
2. in 2006 WMT evaluation, Systran was scored comparably to other systems for some European language pairs (e.g., French-English) by human evaluators, but had much lower in-domain BLEU scores (see graphs in http://www.statmt.org/wmt06/proceedings/pdf/WMT14.pdf).
SMT Research: Culture, Evaluations, & Metrics Other Automatic Metrics
SMT systems need an automatic metric for tuning (must try out thousands of variants). Automatic metrics compare MT output with human-generated reference translations. Rivals of BLEU:
* translation edit rate (TER) – how many edit ops to match references? http://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf
* METEOR – compares MT output with references in way that’s less dependent on word choice (via stemming, WordNet, etc.). Gaining credibility: correlates better than BLEU with human scores. However, METEOR only defined for translation into English. http://www.cs.cmu.edu/~alavie/METEOR/
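The "how many edit ops" idea behind TER can be illustrated with a word-level edit distance. The sketch below is a simplification and not the TER algorithm itself: it counts only insertions, deletions and substitutions, whereas TER also allows block shifts, and it normalizes by the length of a single reference. The function names are invented for the sketch; the sentences are the REF and Sys A from the BLEU slides.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def edit_rate(hyp, ref):
    """Edits needed to turn `hyp` into `ref`, per reference word."""
    return edit_distance(hyp, ref) / len(ref)

ref = "Israeli officials are responsible for airport security".split()
sys_a = "Israeli officials responsibility of airport safety".split()
print(edit_distance(sys_a, ref), edit_rate(sys_a, ref))
```

Lower is better here, the opposite of BLEU. Without shifts, output whose words are right but reordered is penalized heavily, which is one reason the real metric includes a shift operation.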
SMT Research: Culture, Evaluations, & Metrics Manual Metrics
Human evaluation of SMT preferable to automatic evaluation, but much slower & more expensive. Can’t use for system tuning.
Ask humans to rank systems by adequacy and fluency. Adequacy: does MT output convey same meaning as source? Fluency: does MT output look like normal target-language text? (Good syntax & idiom).
Metrics based on human postediting of MT output. E.g., HTER.
Metrics based on human understanding of MT output. Related to adequacy, but less subjective. E.g., Lincoln Labs metric: give English output of Arabic MT system to unilingual English analyst, then test him with standard « Defense Language Proficiency Test » (see Jones05).
SMT Research: Culture, Evaluations, & Metrics Who Uses Which Metric When?
Many groups use BLEU for automatic system tuning.
NIST, WPT/WMT, TC-STAR, & other evaluations often have BLEU as official metric, with some human reality checks.
Koehn & Monz WPT/WMT: participants do human fluency/adequacy evaluations - nice analyses!
Many « expert/rule-based MT » researchers hate BLEU (can become excuse not to evaluate system competitively).
In theory, manual metrics should be related to MT task: e.g., adequacy for browsing/gisting, Lincoln Labs metric for intelligence community, HTER if MT output will be post-edited. So why is HTER GALE’s official metric?
HTER = Human Translation Edit Rate: MT output hand-edited by humans; measure # of operations performed.
SMT History: IBM Models • In the late 1980s, members of IBM’s speech recognition group applied statistical learning techniques to bilingual corpora. These American researchers worked mainly with the Canadian Hansard – bilingual transcription of parliamentary proceedings. • These researchers quit IBM around 1991 for a hedge fund, Renaissance Technologies – they are now very rich! • Renewed interest in their work sparked the revival of research into statistical learning for MT that occurred from late 1990s onward. Newer « phrase-based » approach still partially relies on IBM models. • The IBM approach used Bayes’s Theorem to define the « Fundamental Equation » of MT (Brown et al. 1993)
SMT History: IBM Models Fundamental Equation of MT
The best-fit translation of a source-language (French) sentence S into a target-language (English) sentence T is:
$\hat{T} = \operatorname{argmax}_T \,[P(T) \cdot P(S|T)]$
where the argmax is the search task, $P(T)$ is the language model, and $P(S|T)$ is the word translation model.
Job of language model: ensure well-formed target-language T
Job of translation model: ensure T could have generated S
Search task: find T maximizing product P(T)*P(S|T)
SMT History: IBM Models • The IBM researchers defined five statistical translation models (numbered in order of complexity) • Each defines a mechanism for generation of text in one language (e.g., French or foreign = F) from another (e.g., English = E) • Most general many-to-many case is not covered by IBM models; in this forbidden case, a group of E words generates a group of F words, e.g. : The poor don’t have any money Les pauvres sont démunis
SMT History: IBM Models • The IBM models only allow one-to-many generation, e.g.:
And the program has been implemented
Ø Le programme a été mis en application
• IBM models 1 & 2 – all lengths for F sentence equally likely • Model 1 is « bag of words » - word order in F & E doesn’t matter • In model 2, chance that an E word generates given F word(s) depends on position • IBM models 3, 4, & 5 are fertility-based
SMT History: IBM Models
[Figure: generation of f1 f2 … fM from e1 e2 … eL under IBM models 1 and 2.]
IBM model 1: « bag of words » (draw with uniform probability)
IBM model 2: « position-dependent bag of words » (draw with position-dependent probability), with parameters P(1→1), …, P(1→M), P(2→1), …, P(2→M), …, P(L→1), …, P(L→M)
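To make the « bag of words » idea concrete, here is a minimal sketch of how Model 1 word translation probabilities t(f|e) can be estimated with EM. It is a textbook-style illustration rather than the training code of any particular system; the three-sentence corpus, the `NULL` token standing in for the empty word, and the function name `train_ibm1` are all assumptions made for the example.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """EM for IBM Model 1. `pairs` is a list of (e_tokens, f_tokens).

    Returns a dict-like table t with t[(f, e)] = t(f|e).
    """
    f_vocab = {f for _, fs in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of e generating f
        total = defaultdict(float)  # expected count of e generating anything
        for es, fs in pairs:
            es = ["NULL"] + es  # empty word can generate F words too
            for f in fs:
                # E-step: position is ignored, so the posterior over which
                # e generated f depends only on t(f|e).
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

corpus = [
    ("the house".split(), "la maison".split()),
    ("the flower".split(), "la fleur".split()),
    ("a flower".split(), "une fleur".split()),
]
t = train_ibm1(corpus)
print(t[("maison", "house")], t[("la", "house")])
```

Even on this tiny corpus, co-occurrence statistics pull t(maison|house) above t(la|house), because « la » is better explained by « the ». Model 2 would add a position-dependent alignment probability to the E-step.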
SMT History: IBM Models
Parameters: φ(e_i) = fertility of e_i = prob. that e_i will produce 0, 1, 2, … words in F; t(f|e_i) = probability that e_i can generate f; Π(j | i, k) = distortion prob. = prob. that kth word generated by e_i ends up in pos. j of F.
[Figure: IBM models 3 and 4. Each of e1 e2 … eL draws a fertility φ(e_i) (3, 0 and 1 in the example; fertility 0 yields Ø), generates F words via t, and a distortion model Π with parameters P(1→1), P(1→2), …, P(M→M) places them in f1 f2 … fM.]
IBM model 4 NOTE: phrases can be broken up, but with lower prob. than in model 3.
IBM model 5: cleaned-up version of model 4 (e.g., two F words can’t be given same position).
Phrase-based SMT • Four key ideas • phrase-based models (Och04, Koehn03, Marcu02) • dynamic programming search algorithms (Koehn04) • loglinear model combination (Och02) • error-driven learning (Och03)
Phrase-based SMT Phrase-based approach introduced around 1998 by Franz Josef Och & others (Ney, Wong, Marcu): many-words-to-many-words (improvement on IBM one-to-many).
Example: « cul de sac »: word-based translation = « ass of bag » (N. Am.), « arse of bag » (British); phrase-based translation = « dead end » (N. Am.), « blind alley » (British).
This knowledge is stored in a phrase table: collection of conditional probabilities of form P(S|T) = backward phrase table or P(T|S) = forward phrase table. Recall Bayes: $\hat{T} = \operatorname{argmax}_T \,[P(T) \cdot P(S|T)]$ → backward table essential, forward table used for heuristics.
Tables for French->English:
forward: P(T|S): p(bag|sac) = 0.5, p(hand bag|sac) = 0.2, …, p(ass|cul) = 0.5, p(dead end|cul de sac) = 0.85, …
backward: P(S|T): p(sac|bag) = 0.9, p(sacoche|bag) = 0.1, …, p(cul de sac|dead end) = 0.7, p(impasse|dead end) = 0.3, …
Phrase-based SMT Overall Phrase Pair Extraction Algorithm 1. Run a sentence aligner on a parallel bilingual corpus (won’t go over this) 2. Run word aligner (e.g., one based on IBM models) on each aligned sentence pair – see next slide. 3. From each aligned sentence pair, extract all phrase pairs with no external links - see two slides ahead.
Phrase-based SMT Symmetrized Word Alignment using IBM Models
Alignments produced by IBM models are asymmetrical: source words have at most one connection, but target words may have many connections. To improve quality, use symmetrization heuristic (Och00): 1. Perform two separate alignments, one in each translation direction. 2. Take intersection of links as starting point. 3. Add neighbouring links from union until all words are covered.
[Figure: word alignments between « I want to go home » and « Je veux aller chez moi » in both directions (S: I want to go home, T: Je veux aller chez moi; and S: Je veux aller chez moi, T: I want to go home), and their symmetrized combination.]
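The three steps can be sketched as set operations on alignment links. This is a simplified variant of the heuristic, assuming links are (source index, target index) pairs and that a neighbouring union link is added whenever it covers a word that is still unaligned; the two directional alignments below are invented for the « I want to go home » example and are not read off the slide's figure.

```python
def symmetrize(s2t, t2s):
    """Start from the intersection, then grow with neighbouring union links."""
    union = s2t | t2s
    alignment = set(s2t & t2s)
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1),
                  (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:
        added = False
        for i, j in sorted(alignment):
            for di, dj in neighbours:
                cand = (i + di, j + dj)
                if cand not in union or cand in alignment:
                    continue
                src_unaligned = all(a != cand[0] for a, _ in alignment)
                tgt_unaligned = all(b != cand[1] for _, b in alignment)
                # Only add links that cover a so-far unaligned word.
                if src_unaligned or tgt_unaligned:
                    alignment.add(cand)
                    added = True
    return alignment

# Indices: I=0 want=1 to=2 go=3 home=4 / Je=0 veux=1 aller=2 chez=3 moi=4
english_to_french = {(0, 0), (1, 1), (3, 2), (4, 3), (4, 4)}
french_to_english = {(0, 0), (1, 1), (2, 2), (3, 2), (4, 4)}
print(sorted(symmetrize(english_to_french, french_to_english)))
```

Here the intersection leaves « to » and « chez » unaligned; growing adds to–aller and home–chez, so the result links « to go » to « aller » and « home » to « chez moi ».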
Phrase-based SMT « Diag-And » phrase extraction Je l’ ai vu à la télévision I saw him on television Input: aligned sentence pair Output: set of consistent phrases Extract all phrase pairs with no external links, for example: Good pairs: (Je, I) (Je l’ ai vu, I saw him) (ai vu, saw) (l’ ai vu à la, saw him on) Bad pairs: (Je l’ ai vu, I saw) (l’ ai vu à, saw him on) (la télévision, television)
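The "no external links" criterion can be sketched as follows. The word alignment used here is an assumption: it is the one implied by the good and bad pairs listed above (Je–I, l’–him, ai and vu–saw, à and la–on, télévision–television), since the slide's figure is not available. The function name `extract_phrases` is illustrative, and the sketch does not extend phrases over unaligned words.

```python
def extract_phrases(src, tgt, links, max_len=7):
    """Return all phrase pairs consistent with the word alignment `links`."""
    pairs = set()
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            tgt_pos = [j for i, j in links if s1 <= i <= s2]
            if not tgt_pos:
                continue
            t1, t2 = min(tgt_pos), max(tgt_pos)
            # Consistent: no word inside the target span is linked to a
            # source word outside the source span (no external links).
            if all(s1 <= i <= s2 for i, j in links if t1 <= j <= t2):
                pairs.add((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs

src = "Je l' ai vu à la télévision".split()
tgt = "I saw him on television".split()
links = {(0, 0), (1, 2), (2, 1), (3, 1), (4, 3), (5, 3), (6, 4)}
for pair in sorted(extract_phrases(src, tgt, links)):
    print(pair)
```

With this alignment, (l' ai vu à, saw him on) is rejected because « on » is also linked to « la », which lies outside the source span.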
Phrase-Based Search Generative process: 1. Split source sentence into “phrases” (N-grams). 2. Translate each source phrase (one-to-one). 3. Permute target phrases to get final translation.
Much simpler and more intuitive than the IBM process, but the price of this is no provision for gaps, e.g., ne VERB pas. *** NOTE: XRCE’s Matrax does handle gaps.
[Figure: the three steps applied to « Je l’ ai vu à la télévision »: 1. segmentation of the source; 2. phrase translations « I », « him », « saw », « on television »; 3. permuted output « I saw him on television ».]
Phrase-Based Search
Segmentation: the source sentence s1 s2 s3 s4 s5 s6 s7 s8 s9 can be split into phrases in many ways. The phrase table 1. suggests possible segments 2. supplies phrase translation scores.
Order: target hypotheses grow left -> right, from source segments consumed in any order.
Backward Table P(S|T): p(s2 s3 | t8), p(s2 s3 | t5 t3), …, p(s3 s4 | t4 t9), …
[Figure: search over the source sentence. Picking s2 s3 first and applying a phrase translation gives target hypotheses « t8 | … » or « t5 t3 | … »; picking s3 s4 first gives « t4 t9 | … »; picking s5 s6 s7 next extends a hypothesis, e.g. « t8 | t6 t2 | … ».]
Language Model P(T): scores growing target hypotheses left -> right.
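The search just described can be sketched as a tiny exhaustive decoder: hypotheses record which source positions are covered, grow the target left to right, and are scored by phrase table and language model. Everything in this sketch is an illustrative assumption (the three-entry phrase table, the toy bigram language model, the probabilities, the function names); a real decoder adds pruning, hypothesis recombination, future-cost estimates and a distortion model.

```python
import math

def decode(src, phrase_table, lm_logprob, max_phrase_len=3):
    """Try every segmentation and order; return (best score, best target)."""
    n = len(src)
    best_score, best_target = -math.inf, None
    # Partial hypothesis: (covered source positions, target words, log score)
    stack = [(frozenset(), ("<s>",), 0.0)]
    while stack:
        covered, target, score = stack.pop()
        if len(covered) == n:
            if score > best_score:
                best_score, best_target = score, " ".join(target[1:])
            continue
        for i in range(n):
            for j in range(i + 1, min(i + max_phrase_len, n) + 1):
                if any(k in covered for k in range(i, j)):
                    continue  # segment overlaps already-translated words
                for tgt_phrase, tm_prob in phrase_table.get(tuple(src[i:j]), []):
                    words = tuple(tgt_phrase.split())
                    new_score = score + math.log(tm_prob) + lm_logprob(target, words)
                    stack.append((covered | frozenset(range(i, j)),
                                  target + words, new_score))
    return best_score, best_target

GOOD_BIGRAMS = {("<s>", "I"), ("I", "saw"), ("saw", "him")}

def toy_lm_logprob(history, words):
    """Hypothetical bigram LM: likely bigrams get 0.5, all others 0.01."""
    total, prev = 0.0, history[-1]
    for w in words:
        total += math.log(0.5 if (prev, w) in GOOD_BIGRAMS else 0.01)
        prev = w
    return total

phrase_table = {("Je",): [("I", 1.0)],
                ("l'",): [("him", 1.0)],
                ("ai", "vu"): [("saw", 1.0)]}
print(decode("Je l' ai vu".split(), phrase_table, toy_lm_logprob))
```

The phrase table alone cannot choose between « I him saw » and « I saw him »; the language model score on the growing target string is what selects the reordered output.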
Loglinear Model Combination Previous slides show basic system that ranks hypotheses by P(S|T)*P(T). Now let’s introduce an alignment/reordering variable A (aligns T & S phrases). We want
$\hat{T} = \operatorname{argmax}_T P(T|S) \approx \operatorname{argmax}_{T,A} P(T,A|S) = \operatorname{argmax}_{T,A} f_1(T,A,S)^{\lambda_1} \cdot f_2(T,A,S)^{\lambda_2} \cdots f_M(T,A,S)^{\lambda_M} = \operatorname{argmax}_{T,A} \exp\left(\sum_i \lambda_i \log f_i(T,A,S)\right)$.
The $f_i$ now typically include not only functions related to P(S|T) and language model P(T), but also to A (« distortion »), P(T|S), length(T), etc. The $\lambda_i$ serve as reliability weights. This change in score computation doesn’t fundamentally change the search algorithm.
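A small sketch of the scoring rule: each hypothesis carries feature values f_i, and its score is the weighted sum of their logs. The feature names, values and weights below are made up for illustration; with only the two features P(S|T) and P(T) and both weights set to 1, the ranking reduces to the basic P(S|T)*P(T) system.

```python
import math

def loglinear_score(features, weights):
    """sum_i lambda_i * log f_i for one hypothesis."""
    return sum(weights[name] * math.log(value) for name, value in features.items())

def best_hypothesis(hypotheses, weights):
    """argmax over hypotheses of the loglinear score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))

hypotheses = [
    {"text": "but where are the snows of yesteryear ?",
     "features": {"tm": 0.02, "lm": 0.001}},
    {"text": "however where are yesterday's snows ?",
     "features": {"tm": 0.03, "lm": 0.0004}},
]

# Equal weights: same ranking as the product P(S|T) * P(T).
print(best_hypothesis(hypotheses, {"tm": 1.0, "lm": 1.0})["text"])
# Down-weighting the language model can change the winner.
print(best_hypothesis(hypotheses, {"tm": 1.0, "lm": 0.2})["text"])
```

Because the weights can flip the ranking, they have to be tuned against an automatic metric on held-out data rather than set by hand.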
Loglinear Model Combination Advantages Very flexible! Anyone can devise dozens of features. • E.g., if lots of mismatched brackets in output, include feature function that outputs +1 if no mismatched brackets, -1 if have mismatched brackets. • So lots of new features being tried in somewhat haphazard way. • But systems steadily improving – outputs from NIST 2006 look much better than those from NIST 2002. SMT not good enough to replace human translators, but good enough for, e.g., most Web browsing. Using 1000 machines and massive quantities of data, Google got 45.4 BLEU for Arabic to English, 35.0 for Chinese to English – very high scores!
Loglinear Model Combination Typical Loglinear Components for SMT Decoding • Joint counts C(S,T) from phrase extraction yield estimates P(S|T) stored in “backward” phrase table and estimates P(T|S) stored in “forward” phrase table. These are typically relative frequency estimates (but we’ve looked at smoothed variants). • Distortion model D(T,A,S) assigns score to amount of phrase reordering incurred in going from S to hypothesis T. Can be based purely on displacement, or be lexicalized (identity of words in S & T is important). • Length model L(T,S) scores probability that hypothesis of length |T| generated from source of length |S|. • Language model P(T) gives probability of word sequence T in target language – see next few slides. NOTE:these are just for decoding – you can use lots more components for N-best/lattice reordering!
Target Language Model P(T) The Stupidest Thing Noam Chomsky Ever Said « It must be recognized that the notion of a ‘probability of a sentence’ is an entirely useless one, under any interpretation of this term ». Chomsky, 1969.
Target Language Model P(T) • Language model helps generate fluent output by 1. assigning higher probability to correct word order – e.g., $P_{LM}(\text{the house is small}) \gg P_{LM}(\text{small the is house})$ 2. assigning higher probability to correct word choices – e.g., $P_{LM}(\text{I am going home}) \gg P_{LM}(\text{I am going house})$ • Almost everyone in both SMT and ASR (automatic speech recognition) communities uses N-gram language models. Start with
$P(W) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1,w_2) \cdots P(w_i|w_1,\dots,w_{i-1}) \cdots P(w_m|w_1,\dots,w_{m-1})$,
then limit window to N words. E.g., for N=3, trigram LM:
$P(W) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1,w_2) \cdots P(w_i|w_{i-2},w_{i-1}) \cdots P(w_m|w_{m-2},w_{m-1})$.
Target Language Model P(T) • Estimation is done by relative frequency on large corpus: $P(w_i|w_{i-2},w_{i-1}) \approx f(w_i|w_{i-2},w_{i-1}) = C(w_{i-2},w_{i-1},w_i) / \sum_w C(w_{i-2},w_{i-1},w)$. E.g., in Europarl corpus, see 225 trigrams starting « the red … »: C(the red cross)=123, C(the red tape)=31, C(the red army)=9, C(the red card)=7, C(the red ,)=5 (and 50 other trigrams). So estimate P(cross | the red) = 123/225 = 0.547. • But need to reserve probability mass for unseen events - maybe never saw « the red planet » in Europarl, but don’t want to have estimate P(planet | the red) = 0. Also, want estimates whose variance isn’t too high. Smoothing techniques are used to solve both problems. E.g., could linearly smooth trigrams with bigrams & unigrams: $P(w_i|w_{i-2},w_{i-1}) \approx \lambda \cdot f(w_i|w_{i-2},w_{i-1}) + \mu \cdot f(w_i|w_{i-1}) + (1-\lambda-\mu) \cdot f(w_i)$; $0 < \lambda, \mu < 1$.
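The two estimates can be reproduced numerically. The trigram counts are the Europarl figures quoted above; the bigram and unigram relative frequencies for « planet » and the interpolation weights are hypothetical values chosen only to show that the smoothed estimate is no longer zero. The extra check that the two weights sum to less than 1 is an added assumption needed for the result to stay a probability.

```python
def relative_frequency(count, context_total):
    """f(w | context) = C(context, w) / sum over w' of C(context, w')."""
    return count / context_total

def interpolated_trigram(f_tri, f_bi, f_uni, lam, mu):
    """lam * f(w|u,v) + mu * f(w|v) + (1 - lam - mu) * f(w)."""
    if not (0 < lam < 1 and 0 < mu < 1 and lam + mu < 1):
        raise ValueError("need 0 < lam, mu < 1 and lam + mu < 1")
    return lam * f_tri + mu * f_bi + (1 - lam - mu) * f_uni

# Counts from the slide: 225 trigrams start with "the red".
p_cross = relative_frequency(123, 225)

# Unseen trigram "the red planet": f(planet | the red) = 0.
# f(planet | red) and f(planet) below are hypothetical values.
p_planet = interpolated_trigram(f_tri=0.0, f_bi=0.01, f_uni=0.0001,
                                lam=0.6, mu=0.3)
print(round(p_cross, 3), p_planet)
```

In practice the weights would themselves be estimated on held-out data rather than fixed by hand.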
Target Language Model P(T) Measuring Language Model Quality • Perplexity: metric that measures predictive power of an LM on new data as an average branching factor. E.g., model that says « any digit 0, …, 9 has equal probability of occurrence » will yield perplexity of 10.0 on digit sequence generated randomly from these 10 digits. • Perplexity of LM measured on corpus $W = (w_1 \dots w_N)$ is $\text{Perp}_{LM}(W) = \left(\prod_{i=1}^{N} P(w_i|LM)\right)^{-1/N}$ = 1/(geometric average per-word prob.). The better the LM is as a model for W, the less « surprised » it is by words of W → higher estimated prob. → lower entropy. Typical perplexities for well-trained English trigram LMs with lexica of about 25K words for various dictation domains: Perp(radiology)=20, Perp(emergency medicine)=60, Perp(journalism)=105, Perp(general English)=247.
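The digit example can be checked directly. A minimal sketch, assuming the per-word probabilities assigned by the LM are already available as a list; the computation is done in log space to avoid underflow on long corpora, and the second probability list is invented for contrast.

```python
import math

def perplexity(word_probs):
    """(product of P(w_i | LM)) ** (-1/N), computed via logs."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Uniform model over the ten digits: every digit gets probability 0.1,
# whatever the sequence is.
uniform_digit_probs = [0.1] * 50
print(perplexity(uniform_digit_probs))

# A model that is less "surprised" by the data gets a lower perplexity.
print(perplexity([0.5, 0.25, 0.5, 0.25]))
```

The first value is 10 up to floating-point rounding, matching the branching-factor reading of perplexity.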
Target Language Model P(T) • « A Bit of Progress in Language Modeling » (Goodman01) is good summary of state of the art in N-gram language modeling. • Consistently superior method: Kneser-Ney. Intuition: if « Francisco » & « eggplant » each seen $10^3$ times in our corpus of $10^6$ words, and neither « eggplant Francisco » nor « eggplant stew » seen, which should be higher, P(Francisco|eggplant) or P(stew|eggplant)? Interpolation answer: $P(w_i|w_{i-1}) \approx \lambda \cdot f(w_i|w_{i-1}) + (1-\lambda) \cdot f(w_i)$. So if « stew » is also seen $10^3$ times, $P(\text{Francisco}|\text{eggplant}) \approx \lambda \cdot 0 + (1-\lambda) \cdot 10^{-3} = P(\text{stew}|\text{eggplant})$. Kneser-Ney answer: no, « Francisco » only occurs after « San », but the 1,000 occurrences of « stew » are preceded by 100 different words. So when ($w_{i-1}$ $w_i$) has never been seen before, $w_i$ = « stew » is more probable than $w_i$ = « Francisco » → P(stew|eggplant) >> P(Francisco|eggplant).
Target Language Model P(T) • Kneser-Ney formula (for bigrams – easily extended to N-grams):
$P_{KN}(w_i | w_{i-1}) = \dfrac{\max[C(w_{i-1} w_i) - D,\, 0]}{C(w_{i-1})} + \lambda(w_{i-1}) \cdot \dfrac{\#\{v \mid C(v\, w_i) > 0\}}{\sum_w \#\{v \mid C(v\, w) > 0\}}$,
where D is a discount factor < 1, $\lambda(w_{i-1})$ is a normalization constant, $\#\{v \mid C(v\, w_i) > 0\}$ is the number of different words that precede $w_i$ in the training corpus, and $\sum_w \#\{v \mid C(v\, w) > 0\}$ is the number of different bigrams in the training corpus.
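The formula can be tried on a toy corpus that mirrors the Francisco/stew intuition. In this sketch the normalization constant is taken to be D times the number of distinct words following the context, divided by the context count, which is the usual choice for interpolated Kneser-Ney and makes the distribution sum to 1. The bigram list, the discount value 0.75 and the function names are assumptions for illustration.

```python
from collections import Counter

def kneser_ney(bigrams, discount=0.75):
    """Build P_KN(w | v) from a list of (v, w) bigram tokens."""
    bigram_counts = Counter(bigrams)
    context_count = Counter()  # C(v): how often v occurs as a context
    followers = Counter()      # number of distinct w seen after v
    preceders = Counter()      # number of distinct v seen before w
    for (v, w), c in bigram_counts.items():
        context_count[v] += c
        followers[v] += 1
        preceders[w] += 1
    n_bigram_types = len(bigram_counts)

    def prob(w, v):
        continuation = preceders[w] / n_bigram_types
        if context_count[v] == 0:
            return continuation  # unseen context: continuation prob only
        norm = discount * followers[v] / context_count[v]
        discounted = max(bigram_counts[(v, w)] - discount, 0) / context_count[v]
        return discounted + norm * continuation

    return prob

# "Francisco" and "stew" are equally frequent, but "Francisco" follows
# only "San" while "stew" follows four different words.
toy_bigrams = ([("San", "Francisco")] * 4
               + [("beef", "stew"), ("fish", "stew"),
                  ("lamb", "stew"), ("hot", "stew")]
               + [("eggplant", "salad")] * 2)
p = kneser_ney(toy_bigrams)
print(p("stew", "eggplant"), p("Francisco", "eggplant"))
```

Both bigrams are unseen after « eggplant », yet « stew » receives four times the probability of « Francisco » because it has four distinct left contexts against one.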
Flaws of Phrase-based, Loglinear Systems • Loglinear feature function combination is too flexible! Makes it easy not to think about theoretical properties of models. • The IBM models were true models: given arbitrary source sentence S and target sentence T, could estimate non-zero P(T|S). Phrase-based “models” are not models: in general, for T which is a good translation of S, they give P(T|S) = 0. They don’t guarantee existence of an alignment between T and S. Thus, the only translations T’ to which a phrase-based system is guaranteed to assign P(T’|S) > 0 are T’ output by same system. • This has practical consequences: in general, a phrase-based MT system can’t be used for analyzing pre-existing translations. This rules out many useful forms of assistance to human translators - e.g., spotting potential errors in translations based on regions of low P(T|S).
PORTAGE: A Typical SMT System
1. Sentence-align a big bilingual corpus
2. On each sentence pair, use IBM models to align words
3. Build phrase tables from word alignments via “diag-and” or similar heuristic (Koehn03). Backwards phrase table gives P(S|T) (& is implicit segmentation model).
4. Build language model (LM) for target language: estimates P(T), based on n-grams in T
5. P(S|T) and P(T) are sufficient for decoding, but one often adds other loglinear feature functions such as a distortion penalty
6. Use (Och03) method to find good weights λi for loglinear features
7. Optionally, include rescoring step (reordering of the N-best list): i.e., decoder outputs many hypotheses (via N-best list or lattice) which are rescored by larger set of feature functions
PORTAGE: A Typical SMT System Core Engine
[Figure: source sentence « mais où sont les neiges d’ antan ? » → Canoe decoder → N-best hypotheses → Rescorer → rescored N-best.]
« Small » set of information sources, for Canoe decoder: LM (at least one language model), TM (at least 1 phrase translation model), DM (at least one distortion model), NM (number-of-words model). Weights for « small » set: wLM*LM, wTM*TM, …, wNM*NM.
« Large » set of information sources, for Rescorer: adds feature functions A1, A2, A3 (any # of additional info. sources - for rescorer only). Weights for « large » set: kLM*LM, kTM*TM, …, kA3*A3.
N-best hypotheses:
H1: hey , where did the old snow go ? P = 0.41
H2: yet where are yesterday’s snows ? P = 0.33
H3: but where are the snows of yesteryear ? P = 0.18
…
Rescored N-best:
H1: but where are the snows of yesteryear ? P = 0.53
H2: however , where are yesterday’s snows ? P = 0.20
…
Training Core Components of PORTAGE
[Figure: training data flow.]
Preprocessing: raw parallel corpus (src-lang text, tgt-lang text) → src preproc., tgt. preproc. → sentence aligner → clean, aligned parallel corpus (src-lang text, tgt-lang text).
Model training: IBM training (models 1 & 2) and phrase pair extraction produce the phrase translation model (PT); the lang. model builder, fed with tgt-lang text and additional monolingual tgt-lang corpora, produces the language model (LM).
Weight optimization: decoder wt optimizer on dev1 corpus (src, tgt), using small set info only (PT, LM, other small set models: model3, …, modelK), yields small set wts w1, …, wK; rescorer wt optimizer on dev2 corpus (src, tgt), using large set info (extra models for large set: modelK+1, …, modelM), yields large set wts w1’, …, wM’.
Canoe Optimization of Weights (COW)
Purpose: find weights [w1, …, ws] on « small » set of information sources (N around 100)
Dev corpus for COW (D sentences):
D source-language sentences: S1: hé quoi ? S2: charmante élise , vous devenez mélancolique . …. SD: la fin .
D target-language ref. translations: T1: what’s this ? T2: charming élise , you’re becoming melancholy . …. TD: the end .
[Figure: COW loop. The Canoe decoder, using the « small » set of information sources I1, I2, …, IS, is run with the initial weights [w1i, w2i, …, wsi] on the first call and with new weights [w1r, w2r, …, wsr] from « rescore-train » on the 2nd & subsequent calls. It outputs a list of D N-best hyp., e.g. H1(S1): what’s up ? … HN(S1): are you OK ? H1(S2): cute élise , you’re bummed out . … HN(SD): all done . On 2nd & subsequent calls to rescore-train, the list is expanded to the union of old & new hypotheses (>N hyp. for each of S1, …, SD). Rescore_train starts Powell’s alg. from K random wt. vectors W1=[w11, w21, …, ws1], …, WK=[w1K, w2K, …, wsK], giving Ŵ1 … ŴK and from them Ŵ, with BLEU scoring based on the top hyp.]