510 likes | 590 Views
Open Problems in Statistical Machine Translation Roland Kuhn January 2008. Plan. A. Overview of Statistical Machine Translation (SMT) B. Open Problems
E N D
Open Problems in Statistical Machine Translation Roland Kuhn January 2008
Plan A. Overview of Statistical Machine Translation (SMT) B. Open Problems NOTE:best overall reference for SMT hasn’t been published yet – Philipp Koehn’s « Statistical Machine Translation » (to be published by Cambridge University Press). Some of the material presented here is from a draft of that book.
The MT Task & Approaches to it • Core MT task: translate a sentence from a source language S to target language T • Conventional expert system approach: hire experts to write rules for translating S to T • Statistical approach: using a bilingual text corpus (lots of Ssentences & their translations into T), train a statistical translation model that will map each new Ssentence into a T sentence. IMPORTANT: SMT models are trained on huge corpora – e.g., our Chinese-English system is trained on 10 million sentence pairs.
The MT Task & Approaches to it • We seem to be in a period of transition, during which SMT is replacing the expert system approach. This happened in automatic speech recognition in the 1980s … • Systran, biggest MT company, uses expert systems; so do most MT companies. However, Systran has recently begun exploring possibility of adding a statistical component to their system. We (NRC) created a very good « serial combination » Systran-NRC hybrid. • LanguageWeaver is new company based on SMT (closely linked to researchers at ISI, U. Southern California) • Google has superb SMT research team – but online, Google mainly used Systran until a few months ago (probably because of computational cost of online SMT). Recently, they swapped in SMT systems for most language pairs.
Bilingual parallel corpus S T Preprocessor Decoder S: Mais où sont les neiges d’antan? Postprocessor The MT Task & Approaches to it Structure of Typical SMT System Extra Target Corpora offline training Phrase Translation Model Target Language Model (optional extra LM training corpora) mais où sont les neiges d’ antan ? Other Knowledge Sources Final N-best hypotheses Initial N-best hypotheses T1: But where are the snows of yesteryear?P=0.41 T2: However, where are yesterday’s snows?P = 0.33 … T1: however where are the snows #d’ antan# P = 0.22 T2: but where are the snows #d’ antan# P = 0.21 T3: but where did the #d’ antan# snow goP = 0.13 … Reordering
Example of SMT output A silly English → German example from Google (Jan. 25, 2007): the hotel has a squash court das Hotel hat ein Kürbisgericht (think “zucchini tribunal”) * but this kind of error – perfect syntax, never-seen word combination – isn’t typical of a statistical system: this was a rule-based system Tried again Dec. 13, 2007: now translation is das Hotel verfügt über ein Squashplatz
SMT Research: Culture,Evaluations, & Metrics Culture: • SMT research is very engineering-oriented; driven by performance in NIST & other evaluations • Advantages of SMT culture: open-minded to new ideas that can be tested quickly; researchers who count have working systems with reasonably well-written software (so they can participate in evaluations) • Disadvantages of SMT culture: closed-minded to ideas not tested in a working system if you have a brilliant theory that doesn’t show a BLEU score improvement in a reasonable baseline system, don’t expect SMT researchers to read your paper! (*** this is slight exaggeration – besides BLEU, other automatic & manual metrics sometimes used)
Since 2001, US National Institute of Standards & Technology (NIST) has been evaluating MT systems Participants include MIT , IBM , CMU , RWTH , Hong Kong UST , ATR , IRST , others … and NRC: NRC’s system is called PORTAGE Main NIST language pairs: ChineseEnglish, ArabicEnglish In 2005 & 2006 statistical systems beat expert systems according to BLEU metric in NIST in most recent (2006) NIST MT evaluation, PORTAGE ranked 4th/24 & 6th/24 for the two large track Chinese→English tasks. Other evaluations: WMT, TC-STAR, IWSLT SMT Research: Culture,Evaluations, & Metrics MT Evaluations
What is BLEU? Human evaluation of automatic translation quality hard & expensive. BLEU metric compares MT output with human-generated reference translations via N-gram matches. BLEU correlates roughly with human judgment (the more reference translations, the better it correlates) SMT Research: Metrics
Why BLEU Is Controversial If system produces a brilliant translation that uses many N-grams not found in the references, it will receive a low score. Proponents of the expert system approach argue that BLEU is biased against this approach, & that it favours SMT Partial confirmation: 1. in NIST 2006 Arabic-to-English evaluation, AppTek hybrid system (rule-based + SMT system) did best according to human evaluators, but not according to BLEU. 2. in 2006 WMT evaluation Systran was scored comparably to other systems for some European language pairs (e.g., French-English) by human evaluators, but had much lower in-domain BLEU scores (see graphs in http://www.statmt.org/wmt06/proceedings/pdf/WMT14.pdf). SMT Research: Metrics
SMT History: IBM Models • In the late 1980s, members of IBM’s speech recognition group applied statistical learning techniques to bilingual corpora. These American researchers worked mainly with the Canadian Hansard – bilingual transcription of parliamentary proceedings. • These researchers quit IBM around 1991 for a hedge fund, Renaissance Technologies – they are now very rich! • Renewed interest in their work sparked the revival of research into statistical learning for MT that occurred from late 1990s onward. Newer « phrase-based » approach still partially relies on IBM models. • The IBM approach used Bayes’s Theorem to define the « Fundamental Equation » of MT (Brown et al. 1993)
^ T = argmaxT[P(T)*P(S|T)] search task language model word translation model SMT History: IBM Models Fundamental Equation of MT The best-fit translation of a source-language (French) sentence S into a target-language (English) sentence T is: Job of language model: ensure well-formed target-language T Job of translation model: ensure T could have generated S Search task: find T maximizing product P(T)*P(S|T)
SMT History: IBM Models • The IBM researchers defined five statistical translation models (numbered in order of complexity) • Each defines a mechanism for generation of text in one language (e.g., French or foreign = F) from another (e.g., English = E): P(F|E) • Most general many-to-many case is not covered by IBM models; in this forbidden case, a group of E words generates a group of F words, e.g. : The poor don’t have any money Les pauvres sont démunis
SMT History: IBM Models • The IBM models only allow zero or one connections for each source word; a target word has arbitrary # of connections, e.g. (F=source, E=target): And the program has been implemented Le programme a été mis en application • IBM Model 1 is « bag of words » - word order in F & E doesn’t • matter • In model 2, chance that an E word generates given F word(s) depends on position • IBM models 3, 4, & 5 are fertility-based; some E words (e.g., « implemented ») may be specially fertile
e1 e2 …. eL f1 f2 …. fM f1 f2 …. fM SMT History: IBM Models (draw with uniform probability) IBM model 1: « bag of words » P(L→M) IBM model 2: « position-dependent bag of words » P(1 →1) e1 e2 …. eL P(1→M) P(2 →1) …. (draw with position-dep. prob) P(2 →M) …. P(L→1) P(L→M)
Distortion model Π Distortion model Π P(1→1), P(1→2), …, P(M→M) f3 f2 f1 f3 f2 f1 e1 e2 …. eL e1 e2 …. eL f1 f2 …. fM f2 fM …. f1 SMT History: IBM Models Parameters: φ(ei) = fertility of ei = prob. will produce 0, 1, 2 … words in F; t(f|ei) = probability that ei can generate f; Π(j | i, k) = distortion prob. = prob. that kth word generated by eiends up in pos. jof F IBM model 4 IBM model 3 NOTE: phrases can be broken up,but with lower prob. than in model 3 φ(e1) φ(e1) 3 2 t fM t φ(e2) φ(e2) t 0 (phrase) 0 Ø Ø φ(eL) φ(eL) …. 1 1 t t fM IBM model 5:cleaned-up version of model 4 (e.g., two F words can’t be given same position)
Phrase-based SMT • Five key ideas • phrase-based models (Och04, Koehn03, • Marcu02) • N-gram language model (Jelinek90) • dynamic programming search algorithms • (Koehn04) • loglinear model combination (Och02) • error-driven learning (Och03)
Phrase-based SMT Phrase-based approach introduced around 1998 by Franz Josef Och & others (Ney, Wong, Marcu): many words to many words (improvement on IBM one-to-many) Example: « cul de sac » word-based translation = « ass of bag » (N. Am), « arse of bag » (British)phrase-based translation = « dead end » (N. Am.), « blind alley » (British) This knowledge is stored in a phrase table : collection of conditional probabilities of form P(S|T) = backward phrase table or P(T|S) = forward phrase table. Recall Bayes:T = argmaxT[P(T)*P(S|T)] backward table essential, forward table used for heuristics. Tables for French->English: ^ forward: P(T|S) p(bag|sac) = 0.5 p(hand bag|sac) = 0.2 … p(ass|cul) = 0.5 p(dead end|cul de sac) = 0.85 … backward: P(S|T) p(sac|bag) = 0.9 p(sacoche|bag) = 0.1 … p(cul de sac|dead end) = 0.7 p(impasse|dead end) = 0.3 …
Phrase-based SMT Overall Phrase Pair Extraction Algorithm 1. Run a sentence aligner on a parallel bilingual corpus (won’t go over this) 2. Run word aligner (e.g., one based on IBM models) on each aligned sentence pair – see next slide. 3. From each aligned sentence pair, extract all phrase pairs with no external links - see two slides ahead.
Phrase-based SMT Symmetrized Word Alignment using IBM Models Alignments produced by IBM models are asymmetrical: source words have at most one connection, but target words may have many connections. To improve quality, use symmetrization heuristic (Och00): 1. Perform two separate alignments, one in each different translation direction. 2. Take intersection of links as starting point. 3. Add neighbouring links from union until all words are covered. S: I want to go home T: Je veux aller chez moi I want to go home Je veux aller chez moi S: Je veux aller chez moi T: I want to go home
Phrase-based SMT « diag-and » phrase extraction Je l’ ai vu à la télévision I saw him on television Input: aligned sentence pair Output: set of consistent phrases Extract all phrase pairs with no external links, for example: Good pairs: (Je, I) (Je l’ ai vu, I saw him) (ai vu, saw) (l’ ai vu à la, saw him on) Bad pairs: (Je l’ ai vu, I saw) (l’ ai vu à, saw him on) (la télévision, television)
Target Language Model P(T) • Language model helps generate fluent output by 1. assigning higher probability to correct word order – e.g., PLM(the house is small) >> PLM(small the is house) 2. assigning higher probability to correct word choices – e.g., PLM(i am going home) >> PLM(I am going house) • Almost everyone in both SMT and ASR (automatic speech recognition) communities uses N-gram language models. Start with P(W) = P(w1)*P(w2|w1)*P(w3|w1,w2)*…*P(wi|w1,…,wi-1)*…*P(wm|w1,…,wm-1), then limit window to N words. E.g., for N=3, trigram LM: P(W) = P(w1)*P(w2|w1)*P(w3|w1,w2)*…*P(wi|wi-2,wi-1)*…*P(wm|wm-2,wm-1). • Language model estimated on large unilingual T-language corpus, often using freeware (SRILM or CMU toolkit) • « A Bit of Progress in Language Modeling » (Goodman01) is good summary of state of the art in N-gram language modeling.
Segmentation Source: s1 s2 s3 s4 s5 s6 s7 s8 s9 Source: s1 s2 s3 s4 s5 s6 s7 s8 s9 Source: s1 s2 s3 s4 s5 s6 s7 s8 s9 Phrase-Based Search: finding argmaxT[P(T)*P(S|T)] Order: Target hypotheses grow left->right, from source segments consumed in any order Backward Table Source: s1 s2 s3 s4 s5 s6 s7 s8 s9 P(S|T) p(s2 s3 | t8) p(s2 s3 | t5 t3) … p(s3 s4 | t4 t9) … (pick s2 s3 first) (pick s3 s4 first) (phrase transl) Tgt hyp: t5 t3| … Tgt hyp: t8| … phrase table: 1. suggests possible segments 2. supplies phrase translation scores (phrase transl) (pick s5 s6 s7) … Tgt hyp: t4 t9| … … (phrase transl) Language Model P(T) Tgt hyp: t8| t6 t2| … language model: scores growing target hypotheses left -> right …
Loglinear Model Combination Previous slides show basic system that ranks hypotheses by P(S|T)*P(T). Now let’s introduce an alignment/reordering variable A (aligns T & S phrases). We want T = argmaxTP(T|S) ≈ argmaxT ,AP(T, A|S) = argmaxT, Af1(T,A,S)λ1* f2(T,A,S)λ2 * … * fM(T,A,S)λM = argmaxT,Aexp (∑iλi log fi(T,A,S)). The finow typically include not only functions related to P(S|T) and language model P(T), but also to A « distortion », P(T|S), length(T), etc. The λi serve as reliability weights. This change in score computation doesn’t fundamentally change the search algorithm. ^
Loglinear Model Combination Advantages Very flexible! Anyone can devise dozens of features. • E.g., if lots of mismatched brackets in output, include feature function that outputs +1 if no mismatched brackets, -1 if have mismatched brackets. • So lots of new features being tried in somewhat haphazard way. • But systems steadily improving – outputs from NIST 2006 look much better than those from NIST 2002. SMT not good enough to replace human translators, but good enough for, e.g., most Web browsing. Using 1000 machines and massive quantities of data, Google got 45.4 BLEU for Arabic to English, 35.0 for Chinese to English – very high scores!
Flaws of Phrase-based, Loglinear Systems • Loglinear feature function combination is too flexible! Makes it easy not to think about theoretical properties of models. • The IBM models were true models: given arbitrary source sentence S and target sentence T, could estimate non-zero P(T|S). Phrase-based “models” are not models: in general, for T which is a good translation of S, they give P(T|S) = 0. They don’t guarantee existence of an alignment between T and S. Thus, the only translations T’ to which a phrase-based system is guaranteed to assign P(T’|S) > 0 are T’ output by same system. • This has practical consequences: in general, a phrase-based MT system can’t be used for analyzing pre-existing translations. This rules out many useful forms of assistance to human translators - e.g., spotting potential errors in translations based on regions of low P(T|S).
Plan B. Open Problems in SMT • Adaptation: how make SMT system work well in new domain? • Feature selection: can we devise more principled ways of selecting phrases for the phrase tables P(S|T) and P(T|S)? • Generalizing beyond phrases: how enable system to learn gappy & hierarchical patterns? E.g., “ne VERB pas” ↔ “not VERB”, “don’t VERB” • Can SMT systems be trained discriminatively? • Can kernel regression techniques be applied to SMT?
Adaptation Foster07: • For SMT to work well, there shouldn’t be mismatch between training and test material. • Adaptive approaches strive to reduce mismatch by modifying model parameters in response to information about test domain. • In mixture-based approaches, model is combination of component models; adaptation adjusts weights on components based on their fit with the test domain → “dynamic adaptation”. • We tried linear adaptation within each of two main components of SMT: “translation model” P(S|T) and “language model” P(T) P(T|S) ≈ P(S|T) * P(T)β * (other features) P(S|T)= a1*TM1 + a2*TM2 + … + anTMn P(T)=b1*LM1 + b2*LM2 + … + bkLMk bj’s shift dynamically ai’s shift dynamically
Adaptation • In our experiments, we trained 7 different TM’s and 7 different LM’s on Chinese-English corpora, and allowed linear weights on these to vary. • Want “distance” (really closeness) measure that gives best linear weights. Will use naively: if Ci is one of 7 training corpora, q is test text, then if metric m gives “distance” m(q,Ci) between q and Ci, the weight w(q,Ci) decoder uses on model trained on Ci when decoding q with metric m is w(q,Ci) = m(q,Ci) /i’m(q,Ci’) [also tried sigmoid function – no good!] • We tried 4 metrics for m(q,Ci) – see Foster07 for details: 1. tf/idf: cos(vC,vq) 2. LSA: cos(uC,uq) 3. Ngram LM perplexity: pC(q)-1/|q|4. Interpolation coeff est. via EM: i(q,Ci), where 1 … 7 = argmax[1 … 7]Πj [Σi i*pCi (qj)] , pCi is Language Model est. on Ci, and 1 + … + 7 =1.0
Adaptation Results: • EM interpolation works best • LM adaptation yields + 0.5 – 1.1 BLEU • TM adaptation helps on its own, but doesn’t add anything to LM adaptation Open questions: • How fine should granularity be? (7 corpora too coarse!) • Can we adapt TMs as well as LMs? • How can we incorporate LMs trained on unilingual target-language corpora? • Can we do better than linear interpolation? (E.g., Box-Cox generalizes linear & loglinear combination – could explore intermediate settings)
Feature Selection Johnson07: • “Diag-and” phrase pair extraction is voodoo – a collection of heuristics. Can we do better? • To begin with, let’s see if we can prune out some of the phrase pairs suggested by “diag-and”. Will test null hypothesis that a phrase pair is due to chance; will reject phrase pairs that could be accidental. • Phrase tables are huge – pruning them makes life easier • Consider two by two contingency table for phrase pair (“chat”, “cat”) from WMT06 French-English corpus (next slide)
Feature Selection H0: co-occurrences of (chat, cat) are result of random hypergeometric process constrained by marginal counts (45 occ. of chat, 26 occ. of cat) [Fisher’s exact test]. Then Pr(#(chat, cat) ≥14 | H0) = 2.636*10-53 = e-121.07 →neg-log-Pr(#(chat, cat)) = 121.07 NOTE: neg-log-Pr()=2.9957 is 5% sig. level, neg-log-Pr()=4.6052 is 1% sig. level.
Feature Selection • Results for English-French (WMT06): setting threshold neg-log-Pr()=25 yields +1 BLEU and eliminates 90% of the phrase table • Results for Chinese-English (NIST06): setting threshold neg-log-Pr()=16 yields +0.3 BLEU and eliminates 65% of the phrase table • But some weird phrases are still left in – e.g.,neg-log-Pr(#(chat, spade)) = 81.48: “appeler un chat un chat” = “call a spade a spade”; neg-log-Pr(#(chat, pig)) = 45.38: “chat dans un sac” = “pig in a poke”. Open questions: • How handle these compositional cases? • Instead of just using sig. testing to prune phrase tables generated by “diag-and”, can we be more ambitious and use sig. testing to find phrase pairs?
Generalizing beyond phrases How enable system to learn gappy & hierarchical patterns? E.g., “ne VERB pas” ↔ “not VERB”, “don’t VERB”. NOTE: there have been many unsuccessful attempts to get syntax into SMT – e.g., OchJHU04 describes summer workshop on syntax for SMT. End result: no improvement from syntax! Chiang05: • SMT system that learns hierarchical phrases from data • Formally, is synchronous context-free grammar • But - grammar is learned, not imposed by linguistic theory! • BLEU: 26.76→28.77 = 7.5% relative improvement on Chinese-English task
Generalizing beyond phrases Mandarin: Aozhou shi yu Bei Han you bangjiao de shaoshu guojia zhiyi English: Australia is with North Korea have diplomatic relations that few countries one of Three synchronous CFG rules learned from data: 1. X → [X1 zhiyi, one of X1] 2. X → [X1 de X2, the X2 that X1] 3. X → [yu X1 you X2, have X2 with X1] Aozhou shi yu Bei Han you bangjiao de shaoshu guojia zhiyi → (1) Aozhou shi one of yu Bei Han you bangjiao de shaoshu guojia → (2) Aozhou shi one of the shaoshu guojia that yu Bei Han you bangjiao → (3) Aozhou shi one of the shaoshu guojia that have bangjiao with Bei Han → (other rules) Australia is one of the few countries that have diplomatic relations with North Korea
Generalizing beyond phrases Learning rules from data: • First, Chiang generates phrase table via “diag-and” • Then, generalizes patterns seen by introducing variables. E.g., if saw (s1, t1), (s2, t2), (s1 zhiyi, one of t1), (s2 zhiyi, one of t2) then system generates rule X → [X1 zhiyi, one of X1]. • This overgenerates rules – in practice, Chiang imposes several heuristics to prune rule set. Open question: • Can one generate all and only useful rules in principled way – e.g., by using significance testing?
Discriminative training ^ SMT uses a model Pθ(t|s) to pick best translation t for source s: t = argmaxtPθ(t|s). In discriminative training, want to find parameters θ that yield best t. Problem: calculating t as a function of θ (SMT decoding) is expensive. Two solutions: 1. Optimize over N-best lists (but no guarantee θ learned on N-best lists is optimal) 2. Constrain decoding so calculation of t(θ) is cheaper (but no guarantee θ learned with constraints is optimal for normal decoding). ^ ^ ^ ^ ^
^ θ cumulative N-best Discriminative training ^ Och03 maxBLEU method for finding loglinear weights: Recall that SMT model is T = argmaxT,A exp (∑iλi log fi(T,A,S)). For each sentence in small « dev » corpus S, generate N-best list. Using Powell’s algorithm, find λithat maximize BLEU (or another automatic metric). Then rerun decoder to get new set of N-best lists, which is merged with old set. Iterate. Stop when set of lists stops growing. * This is self-correcting: bad value of θ adds bad hypotheses to set! References optimizer (Powell’s) S T S Decoder(θ) START
Discriminative training Example of running Och’s maxBLEU method:
Discriminative training But what about large-scale discriminative training? Want to train loglinear model with large feature set on big training corpus, not small “dev” (Och’s method only applied to about 10 weights). • Liang06: perceptron inputs about 1.5M features (global phrase table & LM, individual phrase pairs & Ngrams, etc.) Decoder generates N-best list, perceptron does Viterbi-style update based on N-best oracle. Processed 67K source sentences. • Tillmann06: similar – initialize by using forced alignments to generate maxBLEU translation for each source sentence, then merge with N-best lists. Margin-inspired loss function: punish hypotheses that score higher than N-best oracle. 35M features (individual phrase-pairs, word-pairs, N-grams, and distortion-related features). Processed 230K source sentences. → Both Liang06 & Tillman06 had to severely limit reordering and use narrow beam – their results no better than for conventionally trained systems.
Discriminative training Open question: • Can large-scale discriminative training be carried out for an SMT system with free phrase reordering and a wide beam?
Kernel regression techniques • Kernel methods for machine learning project data into a high-dimensional Euclidean space. Learning takes place in that space, provided that the learning algorithm only acts on inner products of data points in that space. • Wang07: for source language Χdefine its feature space HΧ as all its possible blended word N-grams*, and define Фx: Χ → HΧ , such that a sentence xεΧcan be expressed by its feature vector Фx(x) εHΧ . Given two source-language sentences x1 and x2, a high value of inner product < Фx(x1) , Фx(x2)> indicates that x1 and x2 are similar; a low value indicates they are dissimilar. Similarly, define feature space HY as all possible blended word N-grams for target language Y, and define ФY: Y→ HY . *For blended word N-grams seeLodhi02
Kernel regression techniques • Finally, assume there is a linear mapping W: HΧ →HY, such that WФx(x) =ФY(y), where y is the translation of x. We have many training examples of (x,y), so we can learn W. target sent. y source sent. x t1 t2 … tn Фx s1 s2 … sm Фy . . W Фy(y) Фx(x) HY HΧ
Kernel regression techniques ^ • For a given source sentence x, we want y minimizing the loss from WФx(x). Wang et al. tried two loss functions: least squares regression (LSR) and maximum margin regression (MMR). • In theory, they can’t evaluate the loss until they have a complete target language hypothesis y. In practice, they piggyback on standard phrase-based decoding by pretending that the loss between a partial y and the part of x it translates is representative of final result. This loss estimate is used to score partial hypotheses. • Thus, their system is a hybrid: phrase-based search (with constrained decoding), string kernel hypothesis scoring. They don’t use a language model. • With training on 12K French-English sentences, their LSR model obtains performance comparable to baseline decoder. Surprisingly good (no LM, no “future score”, constrained decoding)!
Kernel regression techniques • Achilles heel of their method: where m = size of training corpus, they need to do an O(m3)-time matrix inversion. Decoding is O(m)-time. That’s why they only used 12K training sentence pairs. • Current research: instead of using 12K randomly chosen sentences, search big parallel corpus for sentence pairs whose source half resembles sentences to be translated (our suggestion). Open questions: • Can string kernel approach to MT be scaled up? • Can a pure string kernel MT system be built?
The Last Word: PORTAGEshared • NRC licenses PORTAGEsharedsource code to Canadian universities for research & education purposes, for a nominal fee. The goal is to develop a PORTAGE research community. The source code may be modified, provided that modifications are granted back to NRC (so we can incorporate them in future releases). • Canadian companies can license PORTAGEshared for 1 year, for evaluation purposes only (no modifications), for a given province or territory → For details, contact Michel Mellinger: Michel.Mellinger@cnrc-nrc.gc.ca
References (1) Best overall reference Philipp Koehn, « Statistical Machine Translation », University of Edinburgh (textbook to appear 2008, Cambridge University Press). Papers(NOTE: short summary of key papers available from Kuhn/Foster) Brown93 Peter F. Brown, Stephen A. Della Pietra, Vincent Della J. Pietra, and Robert L. Mercer. The mathematics of Machine Translation: Parameter estimation. Computational Linguistics, 19(2):263-312, June 1993. Chiang05 David Chiang. A Hierarchical Phrase-Based Model for Statistical Machine Translation. Proceedings of the ACL, Ann Arbor, 2005. Foster07 George Foster and Roland Kuhn. Mixture-Model Adaptation for SMT. SMT Workshop, Prague, Czech Republic, June 23, 2007. Foster06 George Foster, Roland Kuhn, and Howard Johnson. Phrasetable Smoothing for Statistical Machine Translation. EMNLP 2006, Sydney, Australia, July 22-23, 2006. Germann01 Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast decoding and optimal decoding for machine translation. Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), Toulouse, July 2001.
References (2) Goodman01 Joshua Goodman. A Bit of Progress in Language Modeling (extended version). Microsoft Research Technical Report, Aug. 2001. Downloadable from research.microsoft.com/~joshuago/publications.htm Huang07 Liang Huang and David Chiang. Forest Rescoring: Faster Decoding with Integrated Language Models. Proceedings of the ACL, Prague, Czech Republic, June 2007. Jelinek90 Frederick Jelinek. Self-organized language modeling for speech recognition. In: A. Waibel and K.-F. Lee, Editors, Readings in Speech Recognition, Morgan Kaufmann, San Mateo, CA (1990), pp. 450-506. Johnson07 Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. Improving Translation Quality by Discarding Most of the Phrasetable. Proceedings of EMNLP, Prague, Czech Republic, June 2007. Jones05 Douglas Jones, Edward Gibson, et al. Measuring Human Readability of Machine Generated Text: Studies in Speech Recognition and Machine Translation. Proceedings of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, March 2005 (Special Session on Human Language Technology: Applications and Challenge of Speech Processing). Knight99 Kevin Knight. Decoding complexity in word-replacement translation models. Computational Linguistics, Squibs and Discussion, 25(4), 1999.
References (3) Koehn04 Philipp Koehn. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas, Georgetown University, Washington D.C., October 2004. Springer-Verlag. KoehnDec03 Philipp Koehn. PHARAOH - a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models (User Manual and Description). USC Information Sciences Institute, Dec. 2003. KoehnMay03 Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Eduard Hovy, editor, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), pp. 127-133, Edmonton, Alberta, Canada, May 2003. Liang06 Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. An end-to-end discriminative approach to machine translation. Proceedings of the ACL, Sydney, July 2006. Lodhi02 Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text Classification Using String Kernels. Journal of Machine Learning Research, V. 2, pp. 419-422, 2002.
References (4) Marcu02 Daniel Marcu and William Wong. A phrase-based, joint probability model for statistical machine translation. Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, PA, 2002. OchJHU04 Franz Josef Och, Daniel Gildea, et al. Final Report of the Johns Hopkins 2003 Summer Workshop on Syntax for Statistical Machine Translation (revised version). http://www.clsp.jhu.edu/ws03/groups/translate (JHU-syntax-for-SMT.pdf), Feb. 2004. Och04 Franz Och and Hermann Ney. The alignment template approach to statistical machine translation. Computational Linguistics, V. 30, pp. 417-449, 2004. Och03 Franz Josef Och. Minimum Error Rate Training for Statistical Machine Translation.Proceedings of 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, Japan, July 2003. Och02 Franz Josef Och and Hermann Ney. Discriminative training and maximum entropy models for statistical machine translation. Proceedings of 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002. Och01 Franz Josef Och, Nicola Ueffing, and Hermann Ney. An Efficient A* Search Algorithm for Statistical Machine Translation. Proc. Data-Driven Machine Translation Workshop, July 2001.