Statistical Machine Translation

Gary Geunbae Lee, Intelligent Software Laboratory, Pohang University of Science & Technology

Contents • Part I: What Samsung requires to survey • Rule-based and statistical approaches • Alignment Model • Decoding Algorithms • Open Sources • Evaluation Methods • Using Syntax Info. in SMT • Part II: State-of-the-art technology • Part III: SMT/SLT in ISoft Lab

Recent Technology (technologies shown in red text will be introduced in this presentation) • Variants of phrase-based system • Factored Translation Model • Variants of other system • Variants of n-gram based translation • Variants of syntax based translation • Word alignment for SMT • Parameter optimization • Reordering • Others • Evaluation • Language Model • Domain adaptation • …

Factored translation Model • Concepts of factored model [Figure labels: Factors, Mapping lemmas, Mapping morphology, Generation]

Factored translation Model • Performance (higher order n-gram) • Europarl: 751,088 sentences, factor 7-gram LM • Europarl: 40,000 sentences, factor 7-gram LM • WSJ: 20,000 sentences, factor 7-gram LM

Factored translation Model • Performance (analysis and generation) • News Commentary: 52,185 sentences • Configurations: • Phrase-based model • Ignore surface form • Use surface form for statistically rich words; use analysis and generation model for statistically poor words

N-gram based machine translation • Concept • “The n-gram based SMT system uses a translation model based on bilingual n-grams” • Bi-lingual N-gram • Bi-lingual phrase pair • Consistent with word alignment • Minimal unit (monotone translation) • SMR : statistical machine reordering • “translates from original source language (S) to a reordered source language (S’), given target language(T)”

N-gram based machine translation • Bilingual pair example [Figure labels: Original training source corpus, Reordered training source corpus]

N-gram based machine translation • Performance • Europarl corpus: wmt07 shared task • Settings: 4-gram translation model, 5-gram target LM, 5-gram target class LM • Baseline is the Moses decoder with no factors, that is, a standard phrase-based model

Word alignment for SMT • Word packing • Concept: • Treat “several consecutive words as one word” • Condition: • There is evidence that “the words correspond to a single word on the opposite side” • Evidence: Co-occurrence • Example: • Where is the department store? • 그 백화점은 어디에 있습니까?
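
The word-packing idea can be sketched in Python. This is a minimal sketch under stated assumptions: the corpus format, the function names `packing_candidates` and `pack`, and the plain count threshold standing in for the co-occurrence evidence are all hypothetical and are not taken from the original system.

```python
from collections import Counter

def packing_candidates(aligned_corpus, min_count=2):
    """Collect (source word, consecutive target words) pairs seen in 1-n alignments.

    aligned_corpus: list of (src_words, tgt_words, links), where links is a
    list of (src_index, tgt_index) pairs. This format is an assumption.
    """
    counts = Counter()
    for src_words, tgt_words, links in aligned_corpus:
        by_src = {}
        for s, t in links:
            by_src.setdefault(s, []).append(t)
        for s, tgts in by_src.items():
            tgts = sorted(tgts)
            # Keep only 1-n links whose target positions are consecutive.
            if len(tgts) > 1 and tgts == list(range(tgts[0], tgts[-1] + 1)):
                counts[(src_words[s], tuple(tgt_words[t] for t in tgts))] += 1
    # A raw count threshold is a simplified stand-in for a co-occurrence test.
    return {pair for pair, count in counts.items() if count >= min_count}

def pack(words, phrase):
    """Replace each occurrence of the word sequence `phrase` by one packed token."""
    out, i, n = [], 0, len(phrase)
    while i < len(words):
        if tuple(words[i:i + n]) == phrase:
            out.append("_".join(phrase))
            i += n
        else:
            out.append(words[i])
            i += 1
    return out

corpus = [
    (["백화점", "어디"], ["where", "department", "store"], [(0, 1), (0, 2), (1, 0)]),
    (["그", "백화점"], ["the", "department", "store"], [(0, 0), (1, 1), (1, 2)]),
]
candidates = packing_candidates(corpus)
packed = pack("where is the department store".split(), ("department", "store"))
```

After packing, the aligner sees `department_store` as a single token, so the alignment with 백화점 becomes 1-1.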

Word alignment for SMT • Alignment procedure [Figure labels: Source text, Target text, Aligner, Alignment Result, Extract 1-n alignment, Candidates to pack, Prune-out, Manual Dictionary, Word lists to pack, Target Side Substitution, Modified Target text]

Word alignment for SMT • Performance • Phrase-based system trained with Model-4 • BTEC Corpus • n: # of iterations

Parameter Optimization • Parameters • Most state-of-the-art SMT systems use log-linear models • Parameters are the weights used in the log-linear model • Minimum error rate training (MERT) • Maximizes the BLEU score by tuning the parameters iteratively • Proposed by F. J. Och (2003) • Most research includes this process
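
To make the role of the weights concrete, here is a minimal Python sketch of log-linear scoring over a small n-best list. The feature names (`tm`, `lm`, `wp`), the feature values, and the weights are invented for illustration only.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values (the features are log-domain scores)."""
    return sum(weights[name] * value for name, value in features.items())

def best_hypothesis(nbest, weights):
    """Pick the candidate translation with the highest log-linear score."""
    return max(nbest, key=lambda hyp: loglinear_score(hyp["features"], weights))

# Hypothetical feature values for two candidate translations.
nbest = [
    {"text": "where is the bus stop", "features": {"tm": -2.0, "lm": -3.0, "wp": -5.0}},
    {"text": "bus stop where is", "features": {"tm": -1.5, "lm": -6.0, "wp": -4.0}},
]

with_lm = best_hypothesis(nbest, {"tm": 1.0, "lm": 1.0, "wp": 0.1})
without_lm = best_hypothesis(nbest, {"tm": 1.0, "lm": 0.0, "wp": 0.1})
```

Changing only the language model weight changes which candidate wins, which is why these weights need to be tuned.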

Parameter Optimization • MERT process [Figure labels: DevSet Src., Decoder, Translation Result, DevSet Ref., Evaluator, Evaluation Result, Parameter Update, New parameters] • Fact: The evaluation result directly affects parameter tuning • Assumption 1: More reliable evaluation can lead to better estimation • Assumption 2: More references make the evaluation more accurate • Goal: Automatic generation of additional references
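
The decode, evaluate, update loop can be sketched as follows. This is not Och's line search: it is a simple coordinate grid search over fixed n-best lists, and it uses sentence-level exact match against a single reference as a stand-in for BLEU. All names and numbers are hypothetical.

```python
def rerank(nbest, weights):
    """Return the text of the highest-scoring hypothesis under the given weights."""
    return max(nbest, key=lambda h: sum(weights[k] * v for k, v in h[1].items()))[0]

def accuracy(dev, weights):
    """Fraction of dev sentences whose top hypothesis equals the reference."""
    hits = sum(rerank(nbest, weights) == ref for nbest, ref in dev)
    return hits / len(dev)

def tune(dev, weights, grid, rounds=2):
    """Try each grid value for one weight at a time; keep a change only if it helps."""
    weights = dict(weights)
    best = accuracy(dev, weights)
    for _ in range(rounds):
        for name in list(weights):
            for value in grid:
                trial = dict(weights, **{name: value})
                score = accuracy(dev, trial)
                if score > best:
                    best, weights = score, trial
    return weights, best

# Each dev item: (n-best list of (text, features), reference translation).
dev = [
    ([("a b", {"tm": -1.0, "lm": -4.0}), ("b a", {"tm": -2.0, "lm": -1.0})], "a b"),
    ([("c d", {"tm": -1.0, "lm": -3.0}), ("d c", {"tm": -3.0, "lm": -2.0})], "c d"),
]
start = {"tm": 1.0, "lm": 1.0}
tuned, score = tune(dev, start, grid=[0.0, 0.5, 1.0, 2.0])
```

Because the update depends entirely on the evaluator's judgment, a more reliable evaluation (for example, with more references) changes which weights are selected.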

Parameter Optimization e1f2e2 e1f1e3 e1f2e4 e2f2e1 e2f2e4 • Paraphrasing • English to English problem with a pivot language • Two alignments • English to French • French to English • Example

Parameter Optimization • Paraphrasing results • Pivot language : French

Parameter Optimization • Performance • Decoder: Hiero • H: References generated by humans • P: References generated by paraphrasing • References are used for parameter training • NIST MT evaluation • Corpus: Ch-En

Latest Shared-task: wmt07 • Summary of participants

Latest Shared-task: wmt07

Latest Shared-task: wmt07 • Human evaluation • Adequacy, fluency, rank and constituent • Rate of top-ranked count [Chart labels: PBMT+RBMT, PBMT, NBMT, PBMT, HPBMT, PBMT, PBMT+RBMT, -, PBMT, Factored]

Latest Shared-task: wmt07 • Automatic evaluation • 11 metrics including METEOR, BLEU, TER, ... • Rate of top-ranked count [Chart labels: PBMT, NBMT, PBMT, HPBMT, Factored, SAMT, PBMT+SAMT, PBMT, PBMT, PBMT+RBMT, PBMT+RBMT, PBMT+RBMT]

Contents • Part I: What Samsung requires to survey • Rule-based and statistical approaches • Alignment Model • Decoding Algorithms • Open Sources • Evaluation Methods • Using Syntax Info. in SMT • Part II: State-of-the-art technology • Part III: SMT/SLT in ISoft Lab

SMT in ISoft. Lab. • For Korean-English • Pre-processing techniques • Compound Sentence Splitting • Class Word Substitution & Sub-translation

Differences between Korean and English

Spacing Unit Difference • Morpheme • Unit of meaning • Best spacing unit for an SMT system • Pseudo-morpheme • Morpheme, but some morphemes are not separated • Corresponds well to the acoustic signal • Spacing unit for ASR • Eojeol • Human-friendly spacing unit

Spacing Unit Difference • English • Words correspond well to morphemes • No need to change the spacing unit • Korean • Words (eojeol) differ from morphemes • Need to change the spacing unit • The spacing unit can be changed automatically by just applying a POS tagger.

Word Order Difference • English is an SVO language while Korean is an SOV language. • Long-distance distortions are observed frequently. • Example: 저 는 술 마시 는 것 을 그다지 좋 아 하 지 않 습니다 / I don't enjoy drinking very much

Difference in expression • Plural and singular • English: plural and singular forms are strictly distinguished. • Korean: plural nouns can be written in singular form. • Example: • He has 3 children./그는 아이가 세 명 있다. • Strictly … child = 아이, children = 아이들 • Honorific terms • English: not much distinction. • Korean: various levels of distinction. • Example: go / 간다, 가네, 가오, 갑니다, 가, 가요

Un-translatable words • Case Markers • English doesn’t have case markers. • No English words correspond to “은,는,이,가,을,를, …” • Articles • Korean doesn’t have articles. • Usually they are not translated into specific words. • Subjects • In a Korean sentence, the subject can be omitted. • In spoken language this phenomenon appears more frequently. • The subject of the English sentence then cannot be aligned to a Korean word.

The Techniques • Adding Part Of Speech information • Spacing Unit problem • Reordering word sequence • Word order problem • Deleting useless words • Un-translatable words • Differences in expression • Language modeling by parts • Appending dictionary

Adding POS information • Motivation • For Korean, the spacing unit of the ASR result (or human-written text) should be changed into the morpheme unit. • Korean morpheme analysis is usually accomplished by full POS tagging. • Some homographs can be distinguished by their POS tag. • How did we do it? • Just changed the training and test corpus.

Re-ordering word sequence • Motivation • Differences in word order • Previous research: M. Collins et al. “Clause restructuring for Statistical Machine Translation” • How did we do it? • Parsed the training corpus and analyzed the result • Manually generated reordering rules • Applied the rules to the training and test corpus

Deleting Useless Words • Motivation • Un-translatable words make word alignment worse. • Various endings caused by honorific expressions increase the vocabulary size, but they do not play an important role in translation. • These words are “useless” in translation. • How did we do it? • Applied a POS tagger • Using the POS tags, deleted the words with specific tags from the training and test corpus
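
Tag-based deletion can be sketched in a few lines of Python. The input format (word, tag pairs) and the tag names in `DROP_TAGS` are illustrative assumptions; the slides do not say which tags were actually deleted.

```python
# Hypothetical tags for case markers and auxiliary particles (assumed tag names).
DROP_TAGS = {"JKS", "JKO", "JX"}

def delete_tagged(tagged_tokens, drop_tags=DROP_TAGS):
    """Remove every (word, tag) pair whose tag is in drop_tags."""
    return [(word, tag) for word, tag in tagged_tokens if tag not in drop_tags]

# Assumed POS tagger output for a short Korean fragment.
tagged = [("저", "NP"), ("는", "JX"), ("술", "NNG"), ("을", "JKO"), ("마시", "VV")]
kept = delete_tagged(tagged)
```

The same filter would be applied to both the training and the test corpus, so that the model never sees the deleted tokens.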

Language Modeling by Parts • Motivation • Assume that translation does not change the category of a given sentence. • Sentence classification is possible. • A smaller language model has less ambiguity. • How did we do it? • By checking the end of each Korean sentence, classified the training and test corpus into 2 classes: interrogative and others. • Built a language model for each class of the training corpus • While decoding, classify the input sentence and select the appropriate language model
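
A minimal Python sketch of the classify-then-select step follows. The list of interrogative endings and the string placeholders standing in for language models are assumptions made for illustration, not the rules used in the original system.

```python
# Hypothetical sentence-final endings that mark a question.
INTERROGATIVE_ENDINGS = ("습니까", "나요")

def classify(sentence):
    """Label a Korean sentence as 'interrogative' or 'other' by looking at its end."""
    text = sentence.strip()
    if text.endswith("?"):
        return "interrogative"
    text = text.rstrip(".!")
    return "interrogative" if text.endswith(INTERROGATIVE_ENDINGS) else "other"

def split_by_class(corpus):
    """Partition a corpus so that one language model can be built per class."""
    parts = {"interrogative": [], "other": []}
    for sentence in corpus:
        parts[classify(sentence)].append(sentence)
    return parts

def select_model(models, sentence):
    """At decoding time, choose the language model that matches the input's class."""
    return models[classify(sentence)]

corpus = ["버스 정류장이 어디에 있나요?", "그 백화점은 어디에 있습니까", "저는 술을 좋아하지 않습니다."]
parts = split_by_class(corpus)
chosen = select_model({"interrogative": "lm_question", "other": "lm_other"}, corpus[0])
```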

Appending dictionary • Motivation • GIZA++ supports a dictionary, but only a word-to-word dictionary. • A phrase dictionary would be more helpful. • We expected each entry to add one more count to the correct alignment during GIZA++ training. • How did we do it? • The dictionary has word pairs and phrase pairs in the general domain. • Just appended the dictionary to the end of the corpus

Experiment • Corpus Statistics • Train : 41,566 sentences • Test : 4,619 sentences • Dictionary : about 160K entries

Experiment • Experimental Result

Compound sentence splitting • A problem of Korean–English translation • Longer sentences lead to longer-distance reordering • The simple distortion model in a standard phrase-based model prefers monotone decoding • Long sentence → reordering error → word salad • Solution • Make long sentences short • Split a long sentence into short sub-sentences

Concept of Transformation • Rewriting Rule • T1 → T2 • Triggering Environment • Sequence of words, tags … • A precondition for the rewriting rule • Example table: Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. Page 363

Transformation for Sentence Splitting • Triggering Environment • Morpheme sequence • POS tag sequence • Rewriting rule • connecting morpheme sequence (Tagged form) • Ending morpheme sequence (Tagged form) • Splitting position • Junction pattern

Extracting Rewrite Rule • Align the two versions with minimum edit distance and extract the differing parts as the rewriting rule. • 부었는데 → 부었어요. (rewriting rule)
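
The extraction step can be sketched with Python's standard `difflib`. Note that `SequenceMatcher` uses a longest-matching-block heuristic rather than a true minimum edit distance, so it is only a stand-in here. The token segmentation of the example is also a hypothetical simplification.

```python
from difflib import SequenceMatcher

def extract_rewrite_rules(original, split_version):
    """Align two token sequences and return the differing spans as (before, after) rules."""
    matcher = SequenceMatcher(None, original, split_version, autojunk=False)
    rules = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            rules.append((tuple(original[i1:i2]), tuple(split_version[j1:j2])))
    return rules

# Hypothetical segmentation of 부었는데 and 부었어요 into stem and ending.
rules = extract_rewrite_rules(["부었", "는데"], ["부었", "어요"])
```

The shared part 부었 is aligned as equal, and the remaining difference gives the rule 는데 → 어요.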

Expanding a triggering environment • Expanding algorithm • Mis-splitting • Splitting a sentence that is not split by the human • The splitting result is not the same as the human's • The algorithm gives an error-free transformation (on the examples) • Pseudocode: T := a transformation to expand; foreach example E: while T mis-splits E: expand T; end; end
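
The expansion loop can be sketched in Python under simplifying assumptions: the triggering environment is only a window of tokens ending at the connecting morpheme, it grows backward only (one token of left context at a time), and a mis-split means the trigger fires at a position the human did not split. The class and function names and the toy sentences are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Transformation:
    source: Tuple[str, ...]  # the sentence the rule was learned from
    position: int            # index of the connecting morpheme in source
    width: int = 1           # number of context tokens, ending at position

    @property
    def trigger(self):
        return self.source[self.position - self.width + 1 : self.position + 1]

def fires_at(t, tokens):
    """Positions in tokens where the trigger window ends."""
    n = len(t.trigger)
    return {i for i in range(n - 1, len(tokens)) if tuple(tokens[i - n + 1 : i + 1]) == t.trigger}

def mis_splits(t, example):
    tokens, human_splits = example
    return bool(fires_at(t, tokens) - human_splits)

def expand(t):
    """Add one more token of left context to the triggering environment."""
    if t.width > t.position:
        raise ValueError("no more context to add")
    return Transformation(t.source, t.position, t.width + 1)

def expand_until_clean(t, examples):
    for example in examples:
        while mis_splits(t, example):
            t = expand(t)
    return t

source = ("비", "가", "오", "는데", "우산", "이", "없", "어요")
examples = [
    (source, {3}),                                # the human split after 는데
    (("밥", "먹", "는데", "맛있", "어요"), set()),  # the human did not split
]
result = expand_until_clean(Transformation(source, 3), examples)
```

Widening the window only narrows where the trigger fires, so examples that were already handled without a mis-split stay that way.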

Initial Transformation • Re-writing Rule: Change A to an ending morpheme followed by a Junction. [Figure labels: Window 1, A, Morphemes, Window 2, POS tags, Boundary, Sub-sentence, Sub-sentence]

Expanded Transformation • Re-writing Rule: Change A to an ending morpheme followed by a Junction. [Figure labels: Window 1, A, Morphemes, Window 2, POS tags, Forward, Backward, Boundary, Sub-sentence, Sub-sentence]

Learning Algorithm • Original TBL (Used in Brill Tagger ) • Minimize Error rate • Training example is modified by training • TBL for Sentence Splitting • Maximize BLEU score • Training example is not modified by training

Applying Transformations • Find Available Transformations • Check Triggering environment • Apply rewriting rule • Connecting morpheme → ending morpheme • Split Sentence • Decode two sentences • Connect the sentences with Junction
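
The steps listed here can be sketched in Python. The function names, the placeholder tokens (`CONN`, `END`), the junction word, and the stub decoder are all hypothetical; a real system would call the SMT decoder where `decode` is used.

```python
def apply_transformation(tokens, connecting, ending):
    """Rewrite the connecting morphemes to an ending and split there.

    Returns (first, second) sub-sentences, or None if the trigger is absent
    or nothing follows it.
    """
    n = len(connecting)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == connecting and i + n < len(tokens):
            first = list(tokens[:i]) + list(ending)
            second = list(tokens[i + n:])
            return first, second
    return None

def translate_with_splitting(tokens, connecting, ending, junction, decode):
    """Decode the sub-sentences separately and join the outputs with the junction."""
    parts = apply_transformation(tokens, connecting, ending)
    if parts is None:
        return decode(tokens)
    first, second = parts
    return decode(first) + [junction] + decode(second)

def stub_decode(tokens):
    # Stand-in for a real decoder: lowercases tokens so the flow is visible.
    return [token.lower() for token in tokens]

output = translate_with_splitting(
    ["A", "B", "CONN", "C", "D"], ("CONN",), ("END",), "but", stub_decode
)
```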

Result • Experimental result

Class word substitution for SMT • Why do we use class words? • We can get richer statistics by using class words.

Class word substitution for SMT • Decoding [Figure labels: Input Sentence, NE Dictionary, Automata, NE Substituted Sentence, Translation Option, Decoding, NE re-Substitution, Output Sentence] • The decoder should be trained with a corpus containing class-word-substituted sentences. • The substituted class words compete against the original words while decoding. • We hope that the original words defeat the class words if the substitution was erroneous.
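
The substitution and re-substitution steps around the decoder can be sketched in Python. This sketch covers only the pre- and post-processing; it does not model the competition between class words and original words inside the decoder. The dictionary entry, the class label `$STATION`, and the pretend decoder output are hypothetical.

```python
def substitute(tokens, ne_dict):
    """Replace known named entities with their class word and remember the translations."""
    out, remembered = [], []
    for token in tokens:
        if token in ne_dict:
            class_word, translation = ne_dict[token]
            out.append(class_word)
            remembered.append((class_word, translation))
        else:
            out.append(token)
    return out, remembered

def resubstitute(translated_tokens, remembered):
    """Put the remembered translations back in place of class words in the output."""
    pending = list(remembered)
    out = []
    for token in translated_tokens:
        match = next((i for i, (cls, _) in enumerate(pending) if cls == token), None)
        if match is None:
            out.append(token)
        else:
            out.append(pending.pop(match)[1])
    return out

# Hypothetical named-entity dictionary: surface form -> (class word, translation).
ne_dict = {"서울역": ("$STATION", "Seoul Station")}
substituted, remembered = substitute(["서울역", "이", "어디", "입니까"], ne_dict)
# Pretend the decoder produced this output for the substituted sentence.
final = resubstitute(["where", "is", "$STATION"], remembered)
```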

Spoken language translation • Major components • Automatic Speech Recognition (ASR) • Machine Translation (MT) • Text-to-Speech (TTS) • Pipeline: Source Speech → ASR → Source Sentence → MT → Target Sentence → TTS → Target Speech • Example: 버스 정류장이 어디에 있나요? / Where is the bus stop?
