Lecture 8: Statistical Alignment & Machine Translation (Chapter 13 of Manning & Schutze)

Lecture 8: Statistical Alignment & Machine Translation(Chapter 13 of Manning & Schutze) Wen-Hsiang Lu (盧文祥) Department of Computer Science and Information Engineering, National Cheng Kung University 2014/05/12 (Slides from Berlin Chen, National Taiwan Normal University http://140.122.185.120/PastCourses/2004S-NaturalLanguageProcessing/NLP_main_2004S.htm) References: • Brown, P., Pietra, S. A. D., Pietra, V. D. J., Mercer, R. L. The Mathematics of Machine Translation, Computational Linguistics, 1993. • Knight, K. Automating Knowledge Acquisition for Machine Translation, AI Magazine,1997. • W. A. Gale and K. W. Church. A Program for Aligning Sentences in Bilingual Corpora, Computational Linguistics 1993.

Machine Translation (MT) • Definition • Automatic translation of text or speech from one language to another • Goal • Produce close to error-free output that reads fluently in the target language • Far from it ? • Current Status • Existing systems seem not good in quality • Babel Fish Translation (before 2004) • Google Translation • A mix of probabilistic and non-probabilistic components

Issues • Build high-quality semantic-based MT systems in circumscribed domains • Abandon automatic MT, build software to assist human translators instead • Post-edit the output of a buggy translation • Develop automatic knowledge acquisition techniques for improving general-purpose MT • Supervised or unsupervised learning

Interlingua (knowledge representation) knowledge-based translation English (semantic representation) French (semantic representation) semantic transfer English (syntactic parse) French (syntactic parse) syntactic transfer English Text (word string) French Text (word string) word-for-word Different Strategies for MT

Word for Word MT • Translate words one-by-one from one language to another • Problems 1. No one-to-one correspondence between words in different languages (lexical ambiguity) • Need to look at the context larger than individual word (→phrase or clause) 2. Languages have different word orders English French suit lawsuit(訴訟), set of garments (服裝) meanings

Syntactic Transfer MT • Parse the source text, then transfer the parse tree of the source text into a syntactic tree in the target language, and then generate the translation from this syntactic tree • Solve the problems of word ordering • Problems • Syntactic ambiguity • The target syntax will likely mirror that of the source text Adv V N German: Ich esse gern( I like to eat ) English: I eat readily/gladly

Text Alignment: Definition • Definition • Align paragraphs, sentences or words in one language to paragraphs, sentences or words in another languages • Thus can learn which words tend to be translated by which other words in another language • Is not part of MT process per se • But the obligatory first step for making use of multilingual text corpora • Applications • Bilingual lexicography • Machine translation • Multilingual information retrieval • … bilingual dictionaries, MT , parallel grammars …

Text Alignment: Sources and Granularities • Sources of parallel texts or bitexts • Parliamentary proceedings (Hansards) • Newspapers and magazines • Religious and literary works • Two levels of alignment • Gross large scale alignment • Learn which paragraphs or sentences correspond to which paragraphs or sentences in another language • Word alignment • Learn which words tend to be translated by which words in another language • The necessary step for acquiring a bilingual dictionary with less literal translation Orders of word or sentence might not be preserved.

Text Alignment: Example 1 2:2 alignment

Text Alignment: Example 2 2:2 alignment 1:1 alignment 1:1 alignment 2:1 alignment a bead/a sentence alignment Studies show that around 90% of alignments are 1:1 sentence alignment.

Sentence Alignment • Crossing dependencies are not allowed here • Word ordering is preserved ! • Related work

Sentence Alignment • Length-based • Lexical-guided • Offset-based

Sentence AlignmentLength-based method • Rationale: the short sentences will be translated as short sentences and long sentences as long sentences • Length is defined as the number of words or the number of characters • Approach 1 (Gale & Church 1993) • Assumptions • The paragraph structure was clearly marked in the corpus, confusions are checked by hand • Lengths of sentences measured in characters • Crossing dependences are not handled here • The order of sentences are not changed in the translation Union Bank of Switzerland (UBS) corpus : English, French, and German s1 s2 s3 s4 . . . sI t1 t2 t3 t4 . . . . tJ Ignore the rich information available in the text.

Sentence Alignment Length-based method source target Source s1 s2 s3 s4 . . . sI t1 t2 t3 t4 . . . . tJ B1 possible alignments: {1:1, 1:0, 0:1, 2:1,1:2, 2:2,…} B2 B3 Target probability independence between beads Bk a bead

Sentence AlignmentLength-based method • Dynamic Programming • The cost function (Distance Measure) • Sentence is the unit of alignment • Statistically modeling of character lengths Bayes’ Law is a normal distribution square difference of two paragraphs Ratio of texts in two languages The prob. distribution of standard normal distribution

Sentence AlignmentLength-based method • The priori probability Or P(αalign) Source si si-1 si-2 tj-2 tj-1 tj Target

L1 alignment 1 L1 alignment 2 s1 t1 cost(align(s1, t1)) + cost(align(s2, t2)) + cost(align(s3,Ø)) + cost(align(s4, t3)) cost(align(s1, s2, t1)) + cost(align(s3, t2)) + cost(align(s4, t3)) t1 s2 t2 s3 t2 s4 t3 t3 Sentence Alignment Length-based method • A simple example

Sentence Alignment Length-based method • 4% error rate was achieved • Problems: • Can not handle noisy and imperfect input • E.g., OCR output or file containing unknown markup conventions • Finding paragraph or sentence boundaries is difficult • Solution: just align text (position) offsets in two parallel texts (Church 1993) • Questionable for languages with few cognates (同源) or different writing systems • E.g., English ←→ Chinese eastern European languages ←→ Asian languages

Sentence Alignment Lexical method • Approach 1 (Kay and Röscheisen 1993) • First assume the first and last sentences of the text were align as the initial anchors • Form an envelope of possible alignments • Alignments excluded when sentencesacross anchors or their respective distance from an anchor differ greatly • Choose word pairs their distributions are similar in most of the sentences • Find pairs of source and target sentences which contain many possible lexical correspondences • The most reliable of pairs are used to induce a set of partial alignment (add to the list of anchors) Iterations

Word Alignment • The sentence/offset alignment can be extended to a word alignment • Some criteria are then used to select aligned word pairs to include them into the bilingual dictionary • Frequency of word correspondences • Association measures • ….

Statistical Machine Translation • The noisy channel model • Assumptions: • An English word can be aligned with multiple French words while each French word is aligned with at most one English word • Independence of the individual word-to-word translations Language Model Translation Model Decoder f: French e: English |e|=l |f|=m

Statistical Machine Translation • Three important components involved • Decoder • Language model • Give the probability p(e) • Translation model translation probability all possible alignments (the English word that a French word fj is aligned with) normalization constant

An Introduction to Statistical Machine Translation Dept. of CSIE, NCKU Yao-Sheng Chang Date: 2011.04.12

Introduction • Translations involve many cultural respects • We only consider the translation of individual sentence, just acceptable sentences. • Every sentence in one language is a possible translation of any sentence in the other • Assign (S,T) a probability, Pr(T|S), to be the probability that a translator will produce T in the target language when presented with S in the source language.

Statistical Machine Translation(SMT) • Noise channel problem

Fundamental of SMT • Given a string of French f, the job of our translation system is to find the string e that the speaker had in mind when he produced f. (Baye’s theorem) • Since denominator Pr(f) here is a constant, the best e is one which has the greatest probability.

Practical Challenges • Computation of translation model Pr(f|e) • Computation of language model Pr(e) • Decoding (i.e., search for e that maximize Pr(f|e)  Pr(e))

Alignment of case 1 (one-to-many)

Alignment of case 2 (many-to-one)

Alignment of case 3 (many-to-many)

e = e1e2…ei…el aj=i f = f1 f2… fj… fm fj = eaj = ei Formulation of Alignment(1) • Let e = e1le1e2…el and f = f1mf1f2…fm • An alignment between a pair of strings e and f use a mapping of every word ei to some word fj • In other words, an alignment a between e and f tells that the word ei, 1il is generated by the word fj, aj{1,…,l} • There are (l+1)mdifferent alignments between e and f. (Including Null – no mapping )

Translation Model • The alignment, a, can be represented by a series, a1m = ala2... am, of m values, each between 0 and lsuch that if the word in position j of the French string is connected to the word in position i of the English string, then aj= i , and if it is not connected to any English word, then aj= 0 (null). Pr(f,a|e,m)

 IBM Model I (1) Pr(f,a|e,m)

IBM Model I (2) • The alignment is determined by specifying the values of ajfor j from 1 to m, each of which can take any value from 0 to l. Pr(f|e) = ∑Pr(f,a|e)

Constrained Maximization • We wish to adjust the translation probabilities so as to maximize Pr(f|e ) subject to the constraints that for each e

Lagrange Multipliers (1) • Method of Lagrange multipliers（拉格朗乘數法）: Lagrange multipliers with one constraint • If there is a maximum or minimum subject to the constraint g(x,y) = 0, then it will occur at one of the critical numbers of the function F defined by is called the • f(x,y) is called the objective function（目標函數）. • g(x,y) is called the constrained equation（條件限制方程式）. • F(x, y, ) is called the Lagrange function（拉格朗函數）. •  is called the Lagrange multiplier（拉格朗乘數）.

Lagrange Multipliers (2) • Following standard practice for constrained maximization, we introduce Lagrange multipliers e, and seek an unconstrained extremum of the auxiliary function

Derivation (1) • The partial derivative of h with respect to t(f|e)is • where  is the Kronecker delta function, equal to one when both of its arguments are the same and equal to zero otherwise

Derivation (2) • We call the expected number of times that e connects to f in the translation (f|e) the count of f given e for (f|e) and denote it by c(f|e; f, e).By definition,

Derivation (3) • replacing e by ePr(f|e), then Equation (11) can be written very compactly as • In practice, our training data consists of a set of translations, (f(1) le(1)), (f(2) le(2)), ..., (f(s) le(s)), , so this equation becomes

Derivation (4) • For an expression that can be evaluated efficiently.

Derivation (5) • Thus, the number of operations necessary to calculate a count is proportional to l + m rather than to (l + 1)m as Equation (12)

Other References about MT

Chinese-English Sentence Alignment • 林語君、高照明,結合統計與語言訊息的混合式中英雙語句對應演算法,ROCLING XVI,2004 • 問題:中文句子的界線並不清楚，句點與逗點都有可能是句子的界線。 • 方法 • 群組句對應模組 • 2 Stages Iterative DP • 句對應參數 • 採用廣泛的字典翻譯詞以及Stop list • 中文翻譯詞的部分比對（Partial Match） • 句中重要標點符號序列相似度 • 共同數字詞、時間詞、原文詞之對應錨 • 群組句對應參數 • 句長相異度 • 句末標點符號要求相同

Bilingual Collocation Extraction Based onSyntactic and Statistical Analyses (Chien-Cheng Wu & Jason S. Chang, ROCLING’03)

Bilingual Collocation Extraction Based onSyntactic and Statistical Analyses (Chien-Cheng Wu & Jason S. Chang, ROCLING’03) • Preprocessing steps to calculate the following information: • Lists of preferred POS patterns of collocation in both languages. • Collocation candidates matching the preferred POS patterns. • N-gram statistics for both languages, N = 1, 2. • Log likelihood Ratio statistics for two consecutive words in both languages. • Log likelihood Ratio statistics for a pair of candidates of bilingual collocation across from one language to the other. • Content word alignment based on Competitive Linking Algorithm (Melamed 1997).

Bilingual Collocation Extraction Based onSyntactic and Statistical Analyses (Chien-Cheng Wu & Jason S. Chang, ROCLING’03)

A Probability Model to Improve Word AlignmentColin Cherry & Dekang Lin,University of Alberta

Lecture 8: Statistical Alignment & Machine Translation (Chapter 13 of Manning & Schutze)