This article discusses the challenges and paradigms of machine translation for digital library content, with a focus on low-density languages and the use of active crowd-sourcing. It also explores the need for accurate OCR, cross-lingual search, and other requisite technologies.
5th ICUDL – November 2009
Machine Translation for Digital Library Content
Carnegie Mellon University, Jaime Carbonell et al.

Outline
• Background
• State of the Art in MT
• The General MT Challenge
• MT for Low-Density Languages
• Active Crowd-Sourcing for MT

Relevant MT Paradigms
1. Statistical/Example-Based MT
2. Context-Based MT
3. Learning Transfer Rules
4. Active Data Collection
The Multilingual Bookstack
English ↔ Arabic, Spanish, Chinese, Telugu, + multiple other languages, including rare and ancient ones
Multi-lingual Challenges Are Not New
The Great Library + Tower of Babel = ?
But the Technological Solutions Are New
Coping with Massive Multilinguality
• Multilingual Content
  • Scanning, indexing, interlingual metadata, …
  • Within scope of UDL
• Multilingual Access
  • Pre-translation upon acquisition?
    • Need to store content N times (N languages)
  • On-demand translation upon query?
    • Needs fast translation, cross-lingual retrieval
• Requisite Technology
  • Accurate OCR, cross-lingual search
  • Accurate machine translation, beyond common languages
Challenges for Machine Translation
• Ambiguity Resolution
  • Lexical, phrasal, structural
• Structural divergence
  • Reordering, vanishing/appearing words, …
• Inflectional morphology
  • Spanish has 40+ verb conjugations; Arabic has more.
• Training Data
  • Bilingual corpora, aligned corpora, annotated corpora
• Evaluation
  • Automated (BLEU, METEOR, TER) vs. HTER vs. …
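For the automated metrics above, here is a minimal sketch of computing corpus-level BLEU with NLTK; the tokenized hypothesis and reference are invented examples, not data from the talk:

```python
# A minimal sketch: corpus-level BLEU with NLTK (illustrative data only).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is scored against a list of reference translations.
references = [[["the", "soldier", "died", "on", "monday"]]]
hypotheses = [["a", "soldier", "died", "monday"]]

# Smoothing avoids zero scores when a higher-order n-gram never matches.
bleu = corpus_bleu(references, hypotheses,
                   smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.4f}")
```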
Context Needed to Resolve Ambiguity
Example: English → Japanese
• Power line – densen (電線)
• Subway line – chikatetsu (地下鉄)
• (Be) on line – onrain (オンライン)
• (Be) on the line – denwachuu (電話中)
• Line up – narabu (並ぶ)
• Line one's pockets – kanemochi ni naru (金持ちになる)
• Line one's jacket – uwagi o nijuu ni suru (上着を二重にする)
• Actor's line – serifu (セリフ)
• Get a line on – joho o eru (情報を得る)
Sometimes local context suffices (as above) – n-grams help – but sometimes not.
CONTEXT: More is Better
• Examples requiring longer-range context:
  • "The line for the new play extended for 3 blocks."
  • "The line for the new play was changed by the scriptwriter."
  • "The line for the new play got tangled with the other props."
  • "The line for the new play better protected the quarterback."
• Challenges:
  • Short n-grams (3–4 words) are insufficient
  • Requires more general semantics and domain specificity
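A toy sketch of why window size matters, using invented cue-word lists per sense of "line": a 3-word window sees no disambiguating cue in the second sentence above, while a sentence-wide window does.

```python
# A toy illustration (hypothetical cue words): pick a translation sense
# of "line" by counting nearby cue words within a context window.
CUES = {  # hypothetical cue words per sense
    "queue":  {"waited", "blocks", "extended", "tickets"},
    "serifu": {"scriptwriter", "actor", "rehearsal"},   # actor's line
    "rope":   {"tangled", "props", "knot"},
    "o-line": {"quarterback", "blocked", "sack"},       # offensive line
}

def disambiguate(tokens, pos, window):
    """Score each sense by cue words within +/-window of position pos."""
    context = set(tokens[max(0, pos - window): pos + window + 1])
    scores = {sense: len(cues & context) for sense, cues in CUES.items()}
    return max(scores, key=scores.get), scores

sent = "the line for the new play was changed by the scriptwriter".split()
# With a 3-word window around "line" no cue is visible; with the full
# sentence, "scriptwriter" selects the actor's-line sense.
print(disambiguate(sent, sent.index("line"), 3))
print(disambiguate(sent, sent.index("line"), 10))
```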
Low-Density Languages
• 6,900 languages in 2000 – Ethnologue: www.ethnologue.com/ethno_docs/distribution.asp?by=area
• 77 (1.2%) have over 10M speakers
  • 1st is Chinese, 5th is Bengali, 11th is Javanese
• 3,000 have over 10,000 speakers each
• 3,000 may survive past 2100
• 5X to 10X that number of dialects
• Number of languages in some interesting countries:
  • Pakistan: 77, India: 400
  • North Korea: 1, Indonesia: 700
Some (very) LD Languages in North America
• Anishinaabe (Ojibwe, Potawatomi, Odawa) – Great Lakes region
Additional Challenges for Rare and Ancient Languages
• Morpho-syntactic complexity is plentiful
  • Beyond inflection: verb incorporation, agglomeration, …
• Data is scarce
  • Little or no bilingual or annotated data
• Fluent computational linguists are scarce
  • Field linguists know LD languages best
• Standardization is scarce
  • Orthographic and dialectal variation, rapid evolution, …
Morpho-Syntactics & Multi-Morphemics
• Iñupiaq (North Slope Alaska, Lori Levin)
  • Tauqsiġñiaġviŋmuŋniaŋitchugut.
  • 'We won't go to the store.'
• Kalaallisut (Greenlandic, Per Langaard, p.c.)
  • Pittsburghimukarthussaqarnavianngilaq
  • Pittsburgh+PROP+Trim+SG+kar+tuq+ssaq+qar+naviar+nngit+v+IND+3SG
  • "It is not likely that anyone is going to Pittsburgh"
Type-Token Curve for Mapudungun
[Figure: type-token curve]
• 400,000+ speakers, mostly bilingual, mostly in Chile
• Varieties: Pewenche, Lafkenche, Nguluche, Huilliche
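A minimal sketch of how such a type-token curve is computed (the corpus path and step size are placeholders); morphologically rich languages like Mapudungun show a much steeper curve than English, since each lemma yields many surface forms:

```python
# A minimal sketch of a type-token curve: vocabulary size (types) as a
# function of corpus size (tokens).
def type_token_curve(tokens, step=1000):
    seen, curve = set(), []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))   # (tokens read, distinct types)
    return curve

# Usage (assumes a whitespace-tokenized corpus file):
# tokens = open("corpus.txt", encoding="utf-8").read().split()
# for n_tokens, n_types in type_token_curve(tokens):
#     print(n_tokens, n_types)
```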
Paradigms for Machine Translation
[Figure: the classic MT triangle from Source (e.g. Pashto) to Target (e.g. English). At the base: direct approaches (SMT, EBMT, CBMT, …); one level up: Syntactic Parsing → Transfer Rules → Text Generation; at the apex: Semantic Analysis → Interlingua → Sentence Planning.]
Which MT Paradigms are Best? Towards Filling the Table
• DARPA MT: Large S → Large T
  • Arabic → English; Chinese → English
An Evolutionary Tree of MT Paradigms
[Figure: tree of MT paradigms, 1950–2010: Decoding MT evolves into Statistical MT, which branches into Phrasal SMT and Stat Syntax MT; Analogy MT leads to Example-Based MT and Context-Based MT; Transfer MT leads to Stat Transfer MT and large-scale TMT variants; Interlingua MT forms its own branch.]
Parallel Text: Requiring Less is Better (Requiring None is Best)
• Challenge
  • There is just not enough parallel text to approach human-quality MT for major language pairs (we need ~100X to ~10,000X more)
  • Much parallel text is not on-point (not on domain)
  • LD languages or distant pairs have very little parallel text
• CBMT Approach [Abir, Carbonell, Sofizade, …]
  • Requires no parallel text and no transfer rules
  • Instead, CBMT needs:
    • A fully-inflected bilingual dictionary
    • A (very large) target-language-only corpus
    • A (modest) source-language-only corpus [optional, but preferred]
CBMT System Architecture
[Figure: a source-language n-gram segmenter (with parsers) feeds the n-gram builders (the translation model), which draw on indexed resources – a bilingual dictionary, target corpora, gazetteers, and optional source corpora – through the Flooder (non-parallel text method), Edge Locker, and TTR components. N-gram candidates and approved/stored n-gram pairs populate a cross-language n-gram cache database; the n-gram connector (overlap-based decoder) assembles the target-language output.]
Step 1: Source Sentence Chunking
• Segment the source sentence into overlapping n-grams via a sliding window (see the sketch below)
  • Typical n-gram length: 4 to 9 terms
  • Each term is a word or a known phrase
  • Any sentence length (for the BLEU test: average 27, shortest 8, longest 66 words)

S1 S2 S3 S4 S5 S6 S7 S8 S9
  S1 S2 S3 S4 S5
  S2 S3 S4 S5 S6
  S3 S4 S5 S6 S7
  S4 S5 S6 S7 S8
  S5 S6 S7 S8 S9
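A minimal sketch of the sliding-window segmentation (window length 5 here; the real system uses 4–9 terms and treats known phrases as single terms):

```python
# A minimal sketch of Step 1: segment a source sentence into overlapping
# n-grams with a fixed-width sliding window.
def chunk(terms, n=5):
    """Return all overlapping n-grams; short sentences yield themselves."""
    if len(terms) <= n:
        return [terms]
    return [terms[i:i + n] for i in range(len(terms) - n + 1)]

sent = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"]
for gram in chunk(sent):
    print(gram)   # [S1..S5], [S2..S6], ..., [S5..S9]
```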
Step 2: Dictionary Lookup
• Using the inflected bilingual dictionary, list all possible target translations for each source word or phrase

Source word-string: S2 S3 S4 S5 S6
Target word lists (the Flooding Set):
  S2 → T2-a T2-b T2-c T2-d
  S3 → T3-a T3-b T3-c
  S4 → T4-a T4-b T4-c T4-d T4-e
  S5 → T5-a
  S6 → T6-a T6-b T6-c
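A minimal sketch of building the Flooding Set from a bilingual dictionary; the entries are the slide's placeholder symbols, not real lexicon data:

```python
# A minimal sketch of Step 2: expand each source term into its candidate
# target translations via a bilingual dictionary (placeholder entries).
DICTIONARY = {
    "S2": ["T2-a", "T2-b", "T2-c", "T2-d"],
    "S3": ["T3-a", "T3-b", "T3-c"],
    "S4": ["T4-a", "T4-b", "T4-c", "T4-d", "T4-e"],
    "S5": ["T5-a"],
    "S6": ["T6-a", "T6-b", "T6-c"],
}

def flooding_set(source_ngram):
    """One group of target candidates per source term."""
    return [DICTIONARY.get(term, []) for term in source_ngram]

print(flooding_set(["S2", "S3", "S4", "S5", "S6"]))
```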
Step 3: Search Target Text
• Using the Flooding Set, search the target text for word-strings containing one word from each group
• Find the maximum number of words from the Flooding Set in a minimum-length word-string
• Words or phrases can be in any order
• Ignore function words in the initial step (T5 is a function word in this example)
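A minimal sketch of the flooding search as described above: rank target-corpus windows by how many candidate groups they cover and how short they are, ignoring function words in the first pass. This is a naive scan, not the indexed search a production system would use:

```python
# A minimal sketch of Step 3 ("flooding"): scan the target corpus for short
# word-strings that cover as many candidate groups as possible, in any order.
def flood(corpus, groups, max_window=10, function_words=frozenset()):
    """Rank windows by (groups covered, shorter is better), best first."""
    content = [set(g) - function_words for g in groups]
    content = [g for g in content if g]          # drop function-word groups
    vocab = set().union(*content)
    candidates = []
    for start, word in enumerate(corpus):
        if word not in vocab:
            continue                              # windows start on a match
        covered = set()
        for end in range(start, min(start + max_window, len(corpus))):
            for i, group in enumerate(content):
                if corpus[end] in group:
                    covered.add(i)
            if corpus[end] in vocab:              # windows also end on a match
                candidates.append((len(covered), -(end - start + 1),
                                   corpus[start:end + 1]))
    return sorted(candidates, key=lambda c: c[:2], reverse=True)

groups = [["T2-a", "T2-b", "T2-c", "T2-d"], ["T3-a", "T3-b", "T3-c"],
          ["T4-a", "T4-b", "T4-c", "T4-d", "T4-e"], ["T5-a"],
          ["T6-a", "T6-b", "T6-c"]]
corpus = "T(x) T3-c T2-b T4-e T5-a T6-a T(x)".split()
print(flood(corpus, groups, function_words={"T5-a"})[0])
# -> (4, -5, ['T3-c', 'T2-b', 'T4-e', 'T5-a', 'T6-a'])
```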
Step 3: Search Target Text (Example)
[Figure: flooding the target corpus with the Flooding Set yields candidate word-strings, e.g.:
  Target Candidate 1: T3-b T(x) T2-d T(x) T(x) T6-c
  Target Candidate 2: T4-a T6-b T(x) T2-c T3-a
  Target Candidate 3: T3-c T2-b T4-e T5-a T6-a
Function words (e.g. T5-a) are reintroduced after the initial match.]
Step 4: Score Word-String Candidates
• Scoring of candidates is based on:
  • Proximity (minimize extraneous words in the target n-gram → precision)
  • Number of word matches (maximize coverage → recall)
  • Regular words given more weight than function words
  • Combine results (e.g., optimize F1 or p-norm or …)

Candidate                      Proximity  Word matches  "Regular" words  Total
T3-b T(x) T2-d T(x) T(x) T6-c  3rd        3rd           3rd              3rd
T4-a T6-b T(x) T2-c T3-a       1st        2nd           1st              2nd
T3-c T2-b T4-e T5-a T6-a       1st        1st           1st              1st
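A minimal sketch of combining the criteria into one score, here as an F-measure over proximity (precision-like) and coverage (recall-like) with content words up-weighted; the weights are illustrative assumptions, not the system's actual values:

```python
# A minimal sketch of Step 4: blend coverage and proximity into one score.
def score(window, matched, n_groups, function_words, beta=1.0):
    """F-beta over a recall-like coverage term and a precision-like
    proximity term; function words get an (assumed) reduced weight."""
    weight = lambda w: 0.25 if w in function_words else 1.0
    recall = len(matched) / n_groups                       # groups covered
    precision = (sum(weight(w) for w in matched) /
                 sum(weight(w) for w in window))           # little filler
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```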
Step 5: Select Candidates Using Overlap (Propagate Context over the Entire Sentence)
[Figure: candidates for adjacent source word-strings are chained wherever one candidate's suffix matches the next candidate's prefix. For example:
  Alternative 1: T(x2) T4-a T6-b T(x3) T2-c + T4-a T6-b T(x3) T2-c T3-a + T6-b T(x3) T2-c T3-a T(x8) → T(x2) T4-a T6-b T(x3) T2-c T3-a T(x8)
  Alternative 2: T(x1) T3-c T2-b T4-e + T3-c T2-b T4-e T5-a T6-a + T2-b T4-e T5-a T6-a T(x8) → T(x1) T3-c T2-b T4-e T5-a T6-a T(x8)]
• The best translations are selected via maximal overlap
A (Simple) Real Example of Overlap
• Flooding → n-gram fidelity; Overlap → long-range fidelity
N-grams generated from Flooding:
  A United States soldier
  United States soldier died
  soldier died and two others
  died and two others were injured
  two others were injured Monday
N-grams connected via Overlap:
  A United States soldier died and two others were injured Monday
Systran, for comparison:
  A soldier of the wounded United States died and other two were east Monday
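A minimal sketch of the overlap decoding shown above: greedily chain each n-gram onto the growing translation wherever its prefix matches the current suffix. The real decoder searches over many scored candidates per position; this sketch assumes a single fixed sequence:

```python
# A minimal sketch of Step 5: connect n-grams by maximal suffix/prefix
# overlap, propagating context across the sentence (greedy version).
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def connect(ngrams, min_overlap=2):
    out = list(ngrams[0])
    for gram in ngrams[1:]:
        k = overlap(out, gram)
        if k >= min_overlap:
            out.extend(gram[k:])   # append only the non-overlapping tail
    return out

grams = [
    "A United States soldier".split(),
    "United States soldier died".split(),
    "soldier died and two others".split(),
    "died and two others were injured".split(),
    "two others were injured Monday".split(),
]
print(" ".join(connect(grams)))
# -> A United States soldier died and two others were injured Monday
```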
System Scores
BLEU scores (4 reference translations; human scoring range 0.7–0.85; Spanish systems scored on the same Spanish test set):
  CBMT Spanish (non-blind)          0.7533
  CBMT Spanish                      0.7447
  Google Spanish ('08 top language) 0.7189
  Systran Spanish                   0.5610
  SDL Spanish                       0.5551
  Google Arabic ('06 NIST)          0.5137
  Google Chinese ('06 NIST)         0.3859
Historical CBMT Scoring
[Figure: CBMT BLEU scores from April 2004 through 2008, on blind and non-blind test sets, with the human scoring range (0.7–0.85) marked. Scores climb from 0.3743 in April 2004 through the 0.56–0.67 range in 2005–2006 to the low 0.70s in 2007, reaching 0.7533 by 2008.]
An Example • Un soldado de Estados Unidos murió y otros dos resultaron heridos este lunes por el estallido de un artefacto explosivo improvisado en el centro de Bagdad, dijeron funcionarios militares estadounidenses • CBMT: A United States soldier died and two others were injured monday by the explosion of an improvised explosive device in the heart of Baghdad, American military officials said. • Systran: A soldier of the wounded United States died and other two were east Monday by the outbreak from an improvised explosive device in the center of Bagdad, said American military civil employees BTW: Google’s translation is identical to CBMT’s
Stat-Transfer MT: Research Goals (Lavie, Carbonell, Levin, Vogel & Students)
• Long-term research agenda (since 2000) focused on developing a unified framework for MT that addresses the core fundamental weaknesses of previous approaches:
  • Representation – explore richer formalisms that can capture complex divergences between languages
  • Ability to handle morphologically complex languages
  • Methods for automatically acquiring MT resources from available data and combining them with manual resources
  • Ability to address both rich- and poor-resource scenarios
• Main research funding sources: NSF (AVENUE and LETRAS projects) and DARPA (GALE)
Stat-XFER: List of Ingredients
• Framework: statistical search-based approach with syntactic translation transfer rules that can be acquired from data but also developed and extended by experts
• SMT-Phrasal Base: automatic word- and phrase-translation lexicon acquisition from parallel data
• Transfer-Rule Learning: apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages
• Elicitation: use bilingual native informants to produce a small high-quality word-aligned bilingual corpus of translated phrases and sentences
• Rule Refinement: refine the acquired rules via a process of interaction with bilingual informants
• XFER + Decoder:
  • XFER engine produces a lattice of possible transferred structures at all levels
  • Decoder searches and selects the best-scoring combination
Stat-XFER MT Approach
[Figure: the same MT triangle, from Source (e.g. Urdu) to Target (e.g. English). Statistical-XFER operates at the transfer level (Syntactic Parsing → Transfer Rules → Text Generation), above direct approaches (SMT, EBMT) and below the interlingua level (Semantic Analysis → Interlingua → Sentence Planning).]
AVENUE/LETRAS Architecture
[Figure: system architecture. An elicitation tool and elicitation corpus yield a word-aligned parallel corpus; morphology analysis and rule-learning modules produce learned transfer rules, which combine with handcrafted rules and lexical resources in the run-time transfer system; a rule-refinement module with a translation correction tool refines the rules interactively; at run time, input text flows through the transfer system and decoder to produce output text.]
Syntax-Driven Acquisition Process
Automatic process for extracting syntax-driven rules and lexicons from sentence-parallel data:
• Word-align the parallel corpus (GIZA++)
• Parse the sentences independently for both languages
• Tree-to-tree constituent alignment:
  • Run our new constituent aligner over the parsed sentence pairs
  • Enhance alignments with additional constituent projections
• Extract all aligned constituents from the parallel trees
• Extract all derived synchronous transfer rules from the constituent-aligned parallel trees
• Construct a "database" of all extracted parallel constituents and synchronous rules with their frequencies and model them statistically (assign them relative-likelihood probabilities)
PFA Node Alignment Algorithm Example
• Any constituent or sub-constituent is a candidate for alignment
• Triggered by word/phrase alignments
• Tree structures can be highly divergent
PFA Node Alignment Algorithm Example (cont.)
• The tree-to-tree aligner enforces equivalence constraints and optimizes over terminal alignment scores (words/phrases)
• Resulting aligned nodes are highlighted in the figure
• Transfer rules are partially lexicalized and read off the tree (a consistency sketch follows)
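A minimal sketch of the consistency idea behind node alignment: a source constituent and a target constituent may align when every word-alignment link touching one falls inside the other. Spans stand in for tree nodes and the alignment data is invented; the actual PFA algorithm additionally enforces equivalence constraints and optimizes terminal alignment scores:

```python
# A minimal sketch: alignment-consistent constituent (span) pairs.
def consistent(src_span, tgt_span, alignments):
    """True when every link touching either span lies inside both."""
    s_lo, s_hi = src_span
    t_lo, t_hi = tgt_span
    links = [(s, t) for s, t in alignments
             if s_lo <= s <= s_hi or t_lo <= t <= t_hi]
    return bool(links) and all(s_lo <= s <= s_hi and t_lo <= t <= t_hi
                               for s, t in links)

def align_nodes(src_spans, tgt_spans, alignments):
    """All node pairs (by word-index span) whose yields are consistent."""
    return [(ss, ts) for ss in src_spans for ts in tgt_spans
            if consistent(ss, ts, alignments)]

# Usage: spans are (start, end) word indices of parse constituents;
# the word alignment below is invented for illustration.
src = [(0, 1), (2, 4), (0, 4)]
tgt = [(0, 2), (3, 4), (0, 4)]
links = [(0, 3), (1, 4), (2, 0), (3, 1), (4, 2)]
print(align_nodes(src, tgt, links))
# -> [((0, 1), (3, 4)), ((2, 4), (0, 2)), ((0, 4), (0, 4))]
```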
Targeted Data Acquisition
• Elicitation corpus for structural diversity
  • "Field Linguist in a Box" (Levin et al)
  • Automated selection of next elicitations
• Active Learning
  • To select maximally informative sentences to translate
    • E.g., with max-frequency n-grams in a monolingual corpus
    • E.g., maximally different from the current bilingual corpus
  • To select hand alignments
• Use Amazon's Mechanical Turk or a Web game
  • Multi-source independent verification
• Elicit just bilingual dictionaries (for CBMT)
Active Learning for MT
[Figure: the active learner samples sentences S from a source-language corpus; an expert translator supplies translations to form pairs (S,T); the sampled parallel corpus feeds the trainer, which updates the MT system's model, closing the loop.]
Active Crowd Translation (ACT Framework)
[Figure: sentence selection draws sentences S from a source-language corpus; the crowd produces multiple translations S,T1 … S,Tn; translation selection picks the best candidate; the trainer updates the MT system's model.]
Active Learning Strategy: Diminishing-Density-Weighted Diversity Sampling
• Experiments:
  • Language pair: Spanish–English
  • Iterations: 20; batch size: 1,000 sentences each
  • Translation: Moses phrase-based SMT
  • Development set: 343 sentences; test set: 506 sentences
[Figure: learning curve of BLEU performance against training data volume (thousands of words).]
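A rough sketch of the sampling idea, under an assumed simple gain function (the actual scoring used in the experiments may differ): prefer sentences dense in n-grams that are frequent in the monolingual corpus but not yet covered by previously selected data, so repeated n-grams yield diminishing returns:

```python
# A rough sketch of density-weighted diversity sampling with diminishing
# returns; the gain function is illustrative, not the published formula.
from collections import Counter

def ngrams(tokens, n=3):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_batch(pool, mono_freq, covered, batch_size, n=3):
    """Greedily pick sentences rich in frequent, not-yet-covered n-grams."""
    batch = []
    for _ in range(batch_size):
        def gain(sent):
            grams = ngrams(sent, n)
            if not grams:
                return 0.0
            new = [g for g in grams if g not in covered]   # diversity
            return sum(mono_freq[g] for g in new) / len(grams)  # density
        best = max(pool, key=gain)
        pool.remove(best)
        covered.update(ngrams(best, n))   # diminishing returns
        batch.append(best)
    return batch

# Usage: pool = tokenized monolingual sentences; mono_freq = Counter of
# n-grams over the monolingual corpus; covered = n-grams already translated.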
Translation Selection from Crowd Data
• Translator reliability
• Translation reliability
  • Compare translations for agreement with each other
  • Flexible matching is the key
• Translation selection: maximal-agreement criterion (see the sketch below)
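A minimal sketch of the maximal-agreement criterion, using token-level Dice overlap as a stand-in for flexible matching; the example translations are invented, and translator-reliability weighting is omitted:

```python
# A minimal sketch: keep the crowd translation that agrees most with
# the other translations of the same source sentence.
def similarity(a, b):
    """Token-level Dice overlap as a stand-in for flexible matching."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 2 * len(ta & tb) / (len(ta) + len(tb)) if ta or tb else 0.0

def select_translation(translations):
    def agreement(t):
        others = [u for u in translations if u is not t]
        return sum(similarity(t, u) for u in others)
    return max(translations, key=agreement)

crowd = ["a soldier died on monday",
         "a soldier died monday",
         "one soldier is dead monday"]
print(select_translation(crowd))   # -> "a soldier died monday"
```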
Which MT Paradigms are Best? Some Tentative Answers
• Unidirectional: LD → E (intelligence)
• Bi-directional: LD ↔ E (theater ops)
• LD-A ↔ LD-B
Test of Time: 3,000 years ago, none of today's languages existed. What about 3,000 years from now?